-
Notifications
You must be signed in to change notification settings - Fork 238
Add CUDA version compatibility check #1412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
/ok to test 7ce325c |
7ce325c to
1962e35
Compare
|
/ok to test 1962e35 |
|
8a0d248 to
73e5a43
Compare
|
/ok to test 73e5a43 |
73e5a43 to
ddd92bd
Compare
|
/ok to test ddd92bd |
rparolin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should issue a warning to the user if we can't fetch the driver version.
Add warn_if_cuda_major_version_mismatch() to cuda-bindings that warns when cuda-bindings was compiled for a newer CUDA major version than the installed driver supports. Called by cuda.core on first Device access.
# Conflicts: # cuda_core/pixi.lock
|
/ok to test 62dfcca |
Import warn_if_cuda_major_version_mismatch locally in Device.__new__ after cuInit, using try/except/else pattern instead of module-level import with lambda fallback.
07bf4a4 to
b2083ed
Compare
|
/ok to test b2083ed |
Extract Device.__new__ logic into cdef helper functions: - Device_ensure_cuda_initialized(): cuInit + version check - Device_resolve_device_id(): resolve None to current device or 0 - Device_ensure_tls_devices(): create thread-local singletons Reduces Device.__new__ from ~60 lines to ~12 lines. Helpers placed after Device class following memory module pattern.
a92ed3b to
fdb3a7e
Compare
|
/ok to test fdb3a7e |
|
|
||
| def setup_method(self): | ||
| """Reset the version compatibility check flag before each test.""" | ||
| _version_check._major_version_compatibility_checked = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is effectively going to reset if the version was checked already and after the unit tests executes the check could potentially execute again. Shouldn't we cache the current state and before setting it to false for the unit tests and then restore to its previous value on unit test teardown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone with a dodgy build might get multiple warnings during testing. Not sure it's worth fixing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should either mock it or restore whatever the original value was during teardown, I agree with Rob.
|
/ok to test acadb31 |
Integrate RAII handles architecture from main while preserving the version compatibility check feature. Resolved conflict in _device.pyx by keeping the refactored helper functions and updating field names to match RAII object structure.
acadb31 to
3a5c210
Compare
|
/ok to test 3a5c210 |
|
/ok to test 424a113 |
|
/ok to test 6071609 |
leofang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for my late review.
| global _major_version_compatibility_checked | ||
| if _major_version_compatibility_checked: | ||
| return | ||
| _major_version_compatibility_checked = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- need a lock + double check pattern
- move
_major_version_compatibility_checked = Trueto the code later because we could raise below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Added a lock
- This was intentional. If the code attempting to perform the check fails for some reason, we would never want to retry it.
|
|
||
| - ``CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM`` : When set to 1, the default stream is the per-thread default stream. When set to 0, the default stream is the legacy default stream. This defaults to 0, for the legacy default stream. See `Stream Synchronization Behavior <https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html>`_ for an explanation of the legacy and per-thread default streams. | ||
|
|
||
| - ``CUDA_PYTHON_DISABLE_MAJOR_VERSION_WARNING`` : When set to 1, suppresses warnings about CUDA major version mismatches between ``cuda-bindings`` and the installed driver. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I am not sure if this env var is really useful, because the added API is entirely opt-in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version check can be triggered indirectly from the end user's POV. For example, by using cuda-core. Without this, each library that calls the version-check function would need its own opt-out, and users might need to set them all. To me it makes sense to place the opt-out alongside the implementation and there is little/no cost.
|
|
||
| def setup_method(self): | ||
| """Reset the version compatibility check flag before each test.""" | ||
| _version_check._major_version_compatibility_checked = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should either mock it or restore whatever the original value was during teardown, I agree with Rob.
cuda_core/cuda/core/_device.pyx
Outdated
| return GraphBuilder._init(stream=self.create_stream(), is_stream_owner=True) | ||
|
|
||
|
|
||
| cdef inline void Device_ensure_cuda_initialized() except *: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We prefer cdef inline int Device_ensure_cuda_initialized() except? -1: to avoid an unconditional exception check (which Cython warns at build time).
| try: | ||
| from cuda.bindings.utils import warn_if_cuda_major_version_mismatch | ||
| except ImportError: | ||
| pass | ||
| else: | ||
| warn_if_cuda_major_version_mismatch() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to move imports like this to the top so as to avoid the little overheads in a hot loop. Please benchmark the constructor to see the perf diff before/after this PR. As per @mdboom such imports, despite they are cached, have a noticeable perf overhead (on the order that we care about).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the general advice. Note that in this case, the block is guarded by _is_cuInit so the import is only attempted once. I decided to bury this import because 1) it would otherwise add clutter to the top-of-file import list, which human readers parse relatively often, 2) an import failure is ignored, so it can't hurt anything, and 3) there is no performance impact, since the import is only attempted once.
|
/ok to test 96532b2 |
96532b2 to
973af9b
Compare
|
/ok to test 973af9b |
973af9b to
27732f2
Compare
Replace setup_method/teardown_method with a pytest fixture that uses monkeypatch to properly save and restore the original value of _major_version_compatibility_checked after each test. Minor change to Cython cdef inline helper function signature.
27732f2 to
7f23b08
Compare
|
/ok to test 7f23b08 |
Summary
Adds a version compatibility check that warns users when cuda-bindings was compiled against a newer CUDA major version than the installed driver supports.
Changes
cuda-bindings
check_cuda_version_compatibility()function incuda/bindings/utils/_version_check.pyCUDA_VERSIONvs runtimecuDriverGetVersion()cuda.bindings.utilstests/test_version_check.pycuda-core
Device.__new__callscheck_cuda_version_compatibility()aftercuInitsucceedscuda.bindings.utilsRationale
When cuda-bindings is built against CUDA 13 headers but the user's driver only supports CUDA 12, many features will silently fail or behave unexpectedly. This check provides early, clear feedback:
Design
cuda.bindings.utilssince it checks cuda-bindings' compile-time versionDevicefirst triggers CUDA initializationCUDA_PYTHON_DISABLE_VERSION_CHECK=1to disableFuture Work
We could not find a suitable place to invoke the version check automatically within cuda-bindings itself (e.g., hooking into
cuInit), so the check is currently triggered by cuda-core. This may be revisited in the future.Test Coverage
7 tests in cuda-bindings covering:
Related Work