Closed
Description
I am currently trying to test CUDA-aware MPI on this machine, using Nvidia GH200s.
I first tried the alltoall_test_cuda.jl test recommended here, which works fine, but when I move on to testing multiple GPUs with alltoall_test_cuda_multigpu.jl, I run into the following issue:
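For context, the stacktrace below points at line 9 of the script, which selects a GPU from the node-local rank. My understanding of the relevant part (a paraphrased sketch, not the verbatim test script) is roughly:

```julia
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# split the communicator by node to get a node-local rank
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_loc = MPI.Comm_rank(comm_l)

# select one GPU per node-local rank; this is the call that raises
# "invalid device ordinal" when rank_loc exceeds the number of
# devices visible to that task
CUDA.device!(rank_loc)
```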
rank=0 rank_loc=0 (gpu_id=CuDevice(0)), size=4, dst=1, src=3
ERROR: LoadError: CUDA error: invalid device ordinal (code 101, ERROR_INVALID_DEVICE)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/libcuda.jl:30
 [2] check
   @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/libcuda.jl:37 [inlined]
 [3] cuDeviceGet
   @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/GPUToolbox/JLBB1/src/ccalls.jl:33 [inlined]
 [4] CuDevice(ordinal::Int64)
   @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/devices.jl:17
 [5] device!
   @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/state.jl:324 [inlined]
 [6] device!(dev::Int64)
   @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/state.jl:324
 [7] top-level scope
   @ ~/alltoall_test_cuda_multigpu.jl:9
in expression starting at /cluster/home/aklocker/alltoall_test_cuda_multigpu.jl:9
(tasks 1-3 each print this same error and stacktrace; the three interleaved copies are collapsed into one here)
srun: error: gpu-1-1: tasks 1-3: Exited with exit code 1
srun: Terminating StepId=58810.0
[2025-12-11T09:52:48.778] error: *** STEP 58810.0 ON gpu-1-1 CANCELLED AT 2025-12-11T09:52:48 DUE TO TASK FAILURE ***
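In case it helps with diagnosis, I also put together a small per-rank check (my own sketch, not part of the test suite) to see how many devices each task actually sees, since Slurm's GPU binding can restrict CUDA_VISIBLE_DEVICES per task:

```julia
# diagnostic sketch: report per-rank GPU visibility under srun
using MPI, CUDA

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)

# if Slurm binds one GPU per task, each rank sees only device 0
visible = get(ENV, "CUDA_VISIBLE_DEVICES", "<unset>")
println("rank=$rank CUDA_VISIBLE_DEVICES=$visible ndevices=$(length(CUDA.devices()))")
```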
My system info:
julia> CUDA.versioninfo()
CUDA toolchain:
- runtime 12.6, local installation
- driver 565.57.1 for 13.0
- compiler 12.9
CUDA libraries:
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 12.6.0)
- NVML: 12.0.0+565.57.1
Julia packages:
- CUDA: 5.9.5
- CUDA_Driver_jll: 13.0.2+0
- CUDA_Compiler_jll: 0.3.0+0
- CUDA_Runtime_jll: 0.19.2+0
- CUDA_Runtime_Discovery: 1.0.0
Toolchain:
- Julia: 1.10.10
- LLVM: 15.0.7
Environment:
- JULIA_CUDA_MEMORY_POOL: none
- JULIA_CUDA_USE_BINARYBUILDER: false
Preferences:
- CUDA_Runtime_jll.local: true
2 devices:
0: NVIDIA GH200 120GB (sm_90, 94.997 GiB / 95.577 GiB available)
1: NVIDIA GH200 120GB (sm_90, 94.997 GiB / 95.577 GiB available)
and
julia> MPI.versioninfo()
MPIPreferences:
binary: system
abi: MPICH
libmpi: libmpi_cray.so
mpiexec: ["srun", "-C", "gpu"]
Package versions
MPI.jl: 0.20.23
MPIPreferences.jl: 0.1.11
Library information:
libmpi: libmpi_cray.so
libmpi dlpath: /opt/cray/pe/lib64/libmpi_cray.so
MPI version: 3.1.0
Library version:
MPI VERSION : CRAY MPICH version 8.1.32.110 (ANL base 3.4a2)
MPI BUILD INFO : Thu Feb 06 22:42 2025 (git hash f9c5634-dirty)
MPI launcher: srun
MPI launcher path: /usr/bin/srun
Could anyone here point me in the right direction as to what is going wrong? I'm relatively new to Julia and CUDA, so any help would be much appreciated!