issues with test of CUDA-aware MPI support #924

@aklocker42

Description

I am currently trying to test CUDA-aware MPI on this machine, using Nvidia GH200s.

I first tried the alltoall_test_cuda.jl test recommended here, which works fine. But when I move on to testing multiple GPUs with alltoall_test_cuda_multigpu.jl, I run into the following error:

rank=0 rank_loc=0 (gpu_id=CuDevice(0)), size=4, dst=1, src=3
ERROR: LoadError: CUDA error: invalid device ordinal (code 101, ERROR_INVALID_DEVICE)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/libcuda.jl:30
 [2] check
   @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/libcuda.jl:37 [inlined]
 [3] cuDeviceGet
   @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/GPUToolbox/JLBB1/src/ccalls.jl:33 [inlined]
 [4] CuDevice(ordinal::Int64)
   @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/devices.jl:17
 [5] device!
   @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/state.jl:324 [inlined]
 [6] device!(dev::Int64)
   @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/state.jl:324
 [7] top-level scope
   @ ~/alltoall_test_cuda_multigpu.jl:9
in expression starting at /cluster/home/aklocker/alltoall_test_cuda_multigpu.jl:9

(The same error and stacktrace are printed by each of the three failing ranks; I have deduplicated the interleaved output here.)
srun: error: gpu-1-1: tasks 1-3: Exited with exit code 1
srun: Terminating StepId=58810.0
[2025-12-11T09:52:48.778] error: *** STEP 58810.0 ON gpu-1-1 CANCELLED AT 2025-12-11T09:52:48 DUE TO TASK FAILURE ***
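For context, the stacktrace points at a `device!(::Int64)` call on line 9 of the test script, i.e. at the step where each rank picks a GPU from its node-local rank. A minimal, defensive sketch of that pattern (my own variant, not the exact test script; `pick_device` is a hypothetical helper name) that wraps the requested ordinal around the number of devices the process can actually see, so it cannot ask for a nonexistent device:

```julia
using MPI, CUDA

# Pure mapping: node-local rank -> visible device ordinal (0-based).
# Wrapping with mod avoids "invalid device ordinal" when there are more
# local ranks than visible devices.
pick_device(rank_loc::Int, ndev::Int) = mod(rank_loc, ndev)

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Ranks sharing a node (shared-memory domain) get consecutive local ranks.
comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_loc = MPI.Comm_rank(comm_loc)

ndev = length(CUDA.devices())  # counts only devices visible to this process
CUDA.device!(pick_device(rank_loc, ndev))
println("rank=$rank rank_loc=$rank_loc using ", CUDA.device())
```

With 4 ranks on a node where each process sees only 2 devices (as in the versioninfo below), the plain `device!(rank_loc)` pattern would request ordinals 2 and 3 and fail exactly like the log above, while the wrapped version maps them back onto devices 0 and 1.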

My system info:

julia> CUDA.versioninfo()
CUDA toolchain: 
- runtime 12.6, local installation
- driver 565.57.1 for 13.0
- compiler 12.9

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 12.6.0)
- NVML: 12.0.0+565.57.1

Julia packages: 
- CUDA: 5.9.5
- CUDA_Driver_jll: 13.0.2+0
- CUDA_Compiler_jll: 0.3.0+0
- CUDA_Runtime_jll: 0.19.2+0
- CUDA_Runtime_Discovery: 1.0.0

Toolchain:
- Julia: 1.10.10
- LLVM: 15.0.7

Environment:
- JULIA_CUDA_MEMORY_POOL: none
- JULIA_CUDA_USE_BINARYBUILDER: false

Preferences:
- CUDA_Runtime_jll.local: true

2 devices:
  0: NVIDIA GH200 120GB (sm_90, 94.997 GiB / 95.577 GiB available)
  1: NVIDIA GH200 120GB (sm_90, 94.997 GiB / 95.577 GiB available)

and

julia> MPI.versioninfo()
MPIPreferences:
  binary:  system
  abi:     MPICH
  libmpi:  libmpi_cray.so
  mpiexec: ["srun", "-C", "gpu"]

Package versions
  MPI.jl:             0.20.23
  MPIPreferences.jl:  0.1.11

Library information:
  libmpi:  libmpi_cray.so
  libmpi dlpath:  /opt/cray/pe/lib64/libmpi_cray.so
  MPI version:  3.1.0
  Library version:  
    MPI VERSION    : CRAY MPICH version 8.1.32.110 (ANL base 3.4a2)
    MPI BUILD INFO : Thu Feb 06 22:42 2025 (git hash f9c5634-dirty)
    
  MPI launcher: srun
  MPI launcher path: /usr/bin/srun
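One thing I have not yet ruled out: on SLURM systems the launcher can export a per-task CUDA_VISIBLE_DEVICES, so an individual rank may legally see fewer devices than the node has, even though `CUDA.versioninfo()` in an interactive session shows both GH200s. A small diagnostic sketch I could run with the same `srun -C gpu` line as the failing test (my own snippet, not from the test suite):

```julia
using MPI, CUDA

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)

# What did the launcher hand this process?
vis = get(ENV, "CUDA_VISIBLE_DEVICES", "unset")
println("rank=$rank sees $(length(CUDA.devices())) device(s), ",
        "CUDA_VISIBLE_DEVICES=$vis")

MPI.Finalize()
```

If each rank reports fewer visible devices than its node-local rank expects, that would explain the invalid-ordinal error above.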

Could anyone point me in the right direction as to what is going wrong here? I'm relatively new to Julia and CUDA, so any help would be much appreciated!
