From 61e0edebadfcfdec5e8c96fac36de664c6e65e40 Mon Sep 17 00:00:00 2001 From: Ludovic Raess Date: Wed, 17 Dec 2025 10:21:35 +0100 Subject: [PATCH 1/4] Improve GPU-aware section --- docs/src/usage.md | 43 +++++++++++++++++++++++++++++-------------- 1 file changed, 29 insertions(+), 14 deletions(-) diff --git a/docs/src/usage.md b/docs/src/usage.md index c57eae1af..2835a1c38 100644 --- a/docs/src/usage.md +++ b/docs/src/usage.md @@ -74,33 +74,48 @@ with: $ mpiexecjl --project=/path/to/project -n 20 julia script.jl ``` -## CUDA-aware MPI support +## GPU-aware MPI support -If your MPI implementation has been compiled with CUDA support, then `CUDA.CuArray`s (from the -[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package) can be passed directly as -send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). +If your MPI implementation has been compiled with CUDA or ROCm support, then `CUDA.CuArray`s (from +[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as +send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). GPU-aware MPI requires to in most cases to use a [system provided MPI installation](configuration.md#using_system_mpi). -Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2) -should confirm your MPI implementation to have the CUDA support enabled. Moreover, successfully running the -[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm +### CUDA + +Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2) +should confirm your MPI implementation to have the CUDA support enabled. Moreover, successfully running the +[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm your CUDA-aware MPI implementation to use multiple Nvidia GPUs (one GPU per rank). If using OpenMPI, the status of CUDA support can be checked via the [`MPI.has_cuda()`](@ref) function. -## ROCm-aware MPI support - -If your MPI implementation has been compiled with ROCm support (AMDGPU), then `AMDGPU.ROCArray`s (from the -[AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) package) can be passed directly as send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). +### ROCm -Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c) -should confirm your MPI implementation to have the ROCm support (AMDGPU) enabled. Moreover, successfully running the -[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm +Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c) +should confirm your MPI implementation to have the ROCm support (AMDGPU) enabled. Moreover, successfully running the +[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank). 
 If using OpenMPI, the status of ROCm support can be checked via the [`MPI.has_rocm()`](@ref) function.
+
+> [!NOTE]
+> On Cray machines, you may need to ensure the following preloads to be set in the preferences:
+> ```
+> preloads = ["libmpi_gtl_hsa.so"]
+> preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
+> ```
+
+> [!NOTE]
+> In a configuration with multiple GPUs per node, mapping GPU ID to node local MPI rank can be achieved either (1) on the application side using node-local communicator (`MPI.COMM_TYPE_SHARED`) or (2) on the system side setting device visibility accordingly.
+> If using (1), one can use the node-local rank `rank_loc` to select the GPU device:
+> ```
+> comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+> rank_loc = MPI.Comm_rank(comm_loc)
+> ```
+> If using (2), one can use the default device but make sure to handle device visibility in the scheduler; for SLURM on Cray systems, this can be mostly achieved using `--gpus-per-task=1`.
+
 ## Writing MPI tests

 It is recommended to use the `mpiexec()` wrapper when writing your package tests in `runtests.jl`:

From b8312ea6cc751f58be2fa5f9f1c9396cd160ff66 Mon Sep 17 00:00:00 2001
From: Ludovic Raess
Date: Wed, 17 Dec 2025 19:56:05 +0100
Subject: [PATCH 2/4] Add suggestions

---
 docs/src/usage.md | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/docs/src/usage.md b/docs/src/usage.md
index 2835a1c38..dca16be84 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -78,7 +78,7 @@ $ mpiexecjl --project=/path/to/project -n 20 julia script.jl

 If your MPI implementation has been compiled with CUDA or ROCm support, then `CUDA.CuArray`s (from
 [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as
-send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). GPU-aware MPI requires to in most cases to use a [system provided MPI installation](configuration.md#using_system_mpi).
+send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). In most cases, GPU-aware MPI requires using a [system-provided MPI installation](@ref using_system_mpi).

 ### CUDA

@@ -100,21 +100,21 @@ your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
 If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.

-> [!NOTE]
-> On Cray machines, you may need to ensure the following preloads to be set in the preferences:
-> ```
-> preloads = ["libmpi_gtl_hsa.so"]
-> preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
-> ```
-
-> [!NOTE]
-> In a configuration with multiple GPUs per node, mapping GPU ID to node local MPI rank can be achieved either (1) on the application side using node-local communicator (`MPI.COMM_TYPE_SHARED`) or (2) on the system side setting device visibility accordingly.
-> If using (1), one can use the node-local rank `rank_loc` to select the GPU device:
-> ```
-> comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
-> rank_loc = MPI.Comm_rank(comm_loc)
-> ```
-> If using (2), one can use the default device but make sure to handle device visibility in the scheduler; for SLURM on Cray systems, this can be mostly achieved using `--gpus-per-task=1`.
+!!! note "Preloads"
+    On Cray machines, you may need to ensure the following preloads are set in the preferences:
+    ```
+    preloads = ["libmpi_gtl_hsa.so"]
+    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
+    ```
+
+!!! note "Multiple GPUs per node"
+    In a configuration with multiple GPUs per node, mapping GPU IDs to node-local MPI ranks can be achieved either (1) on the application side, using a node-local communicator (`MPI.COMM_TYPE_SHARED`), or (2) on the system side, by setting device visibility accordingly.
+    For (1), one can use the node-local rank `rank_loc` to select the GPU device:
+    ```
+    comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+    rank_loc = MPI.Comm_rank(comm_loc)
+    ```
+    For (2), one can use the default device but make sure to handle device visibility via the scheduler or via the `CUDA_VISIBLE_DEVICES` / `ROCR_VISIBLE_DEVICES` environment variables.

 ## Writing MPI tests

From a01708036656ecdf767b0c22afa48e9006e3a76f Mon Sep 17 00:00:00 2001
From: Ludovic Raess
Date: Wed, 17 Dec 2025 21:05:50 +0100
Subject: [PATCH 3/4] Update

---
 docs/src/usage.md | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/docs/src/usage.md b/docs/src/usage.md
index dca16be84..1150ee9ea 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -80,6 +80,13 @@ If your MPI implementation has been compiled with CUDA or ROCm support, then `CU
 [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as
 send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). In most cases, GPU-aware MPI requires using a [system-provided MPI installation](@ref using_system_mpi).

+!!! note "Preloads"
+    On Cray machines, you may need to ensure the following preloads are set in the preferences:
+    ```
+    preloads = ["libmpi_gtl_hsa.so"]
+    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
+    ```
+
 ### CUDA

@@ -100,21 +107,15 @@ your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
 If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.

-!!! note "Preloads"
-    On Cray machines, you may need to ensure the following preloads are set in the preferences:
-    ```
-    preloads = ["libmpi_gtl_hsa.so"]
-    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
-    ```
+### Multiple GPUs per node

-!!! note "Multiple GPUs per node"
-    In a configuration with multiple GPUs per node, mapping GPU IDs to node-local MPI ranks can be achieved either (1) on the application side, using a node-local communicator (`MPI.COMM_TYPE_SHARED`), or (2) on the system side, by setting device visibility accordingly.
-    For (1), one can use the node-local rank `rank_loc` to select the GPU device:
-    ```
-    comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
-    rank_loc = MPI.Comm_rank(comm_loc)
-    ```
-    For (2), one can use the default device but make sure to handle device visibility via the scheduler or via the `CUDA_VISIBLE_DEVICES` / `ROCR_VISIBLE_DEVICES` environment variables.
+In a configuration with multiple GPUs per node, mapping GPU IDs to node-local MPI ranks can be achieved either (1) on the application side, using a node-local communicator (`MPI.COMM_TYPE_SHARED`), or (2) on the system side, by setting device visibility accordingly.
+For (1), one can use the node-local rank `rank_loc` to select the GPU device:
+```
+comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_loc = MPI.Comm_rank(comm_loc)
+```
+For (2), one can use the default device but make sure to handle device visibility via the scheduler or via the `CUDA_VISIBLE_DEVICES` / `ROCR_VISIBLE_DEVICES` environment variables.

 ## Writing MPI tests

From d64e598ee05ff9ddbf3da6c48d9cefad77e69cf9 Mon Sep 17 00:00:00 2001
From: Ludovic Raess
Date: Wed, 17 Dec 2025 21:25:00 +0100
Subject: [PATCH 4/4] Add examples

---
 docs/examples/alltoall_test_cuda.jl          | 27 ++++++++++++++
 docs/examples/alltoall_test_cuda_multigpu.jl | 38 ++++++++++++++++++++
 docs/examples/alltoall_test_rocm.jl          | 27 ++++++++++++++
 docs/examples/alltoall_test_rocm_multigpu.jl | 38 ++++++++++++++++++++
 docs/src/usage.md                            |  8 ++---
 5 files changed, 134 insertions(+), 4 deletions(-)
 create mode 100644 docs/examples/alltoall_test_cuda.jl
 create mode 100644 docs/examples/alltoall_test_cuda_multigpu.jl
 create mode 100644 docs/examples/alltoall_test_rocm.jl
 create mode 100644 docs/examples/alltoall_test_rocm_multigpu.jl

diff --git a/docs/examples/alltoall_test_cuda.jl b/docs/examples/alltoall_test_cuda.jl
new file mode 100644
index 000000000..05011985c
--- /dev/null
+++ b/docs/examples/alltoall_test_cuda.jl
@@ -0,0 +1,27 @@
+# This example demonstrates that your MPI implementation has CUDA support enabled.
+
+using MPI
+using CUDA
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+size = MPI.Comm_size(comm)
+dst = mod(rank+1, size)
+src = mod(rank-1, size)
+println("rank=$rank, size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = CuArray{Float64}(undef, N)
+recv_mesg = CuArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+CUDA.synchronize()
+
+println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
diff --git a/docs/examples/alltoall_test_cuda_multigpu.jl b/docs/examples/alltoall_test_cuda_multigpu.jl
new file mode 100644
index 000000000..cc4838153
--- /dev/null
+++ b/docs/examples/alltoall_test_cuda_multigpu.jl
@@ -0,0 +1,38 @@
+# This example demonstrates that your CUDA-aware MPI implementation can use multiple Nvidia GPUs (one GPU per rank).
+
+using MPI
+using CUDA
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+# select device (specifically relevant if >1 GPU per node)
+# using node-local communicator to retrieve node-local rank
+comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_l = MPI.Comm_rank(comm_l)
+
+# select device
+gpu_id = CUDA.device!(rank_l)
+# using the default device if the scheduler exposes a different GPU per rank (e.g. SLURM `--gpus-per-task=1`)
+# gpu_id = CUDA.device!(0)
+
+size = MPI.Comm_size(comm)
+dst = mod(rank+1, size)
+src = mod(rank-1, size)
+println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id), size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = CuArray{Float64}(undef, N)
+recv_mesg = CuArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+CUDA.synchronize()
+
+rank==0 && println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank_l: $recv_mesg")
+rank==0 && println("done.")
diff --git a/docs/examples/alltoall_test_rocm.jl b/docs/examples/alltoall_test_rocm.jl
new file mode 100644
index 000000000..e8be85b34
--- /dev/null
+++ b/docs/examples/alltoall_test_rocm.jl
@@ -0,0 +1,27 @@
+# This example demonstrates that your MPI implementation has ROCm support enabled.
+
+using MPI
+using AMDGPU
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+size = MPI.Comm_size(comm)
+dst = mod(rank+1, size)
+src = mod(rank-1, size)
+println("rank=$rank, size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = ROCArray{Float64}(undef, N)
+recv_mesg = ROCArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+AMDGPU.synchronize()
+
+println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
diff --git a/docs/examples/alltoall_test_rocm_multigpu.jl b/docs/examples/alltoall_test_rocm_multigpu.jl
new file mode 100644
index 000000000..c26348261
--- /dev/null
+++ b/docs/examples/alltoall_test_rocm_multigpu.jl
@@ -0,0 +1,38 @@
+# This example demonstrates that your ROCm-aware MPI implementation can use multiple AMD GPUs (one GPU per rank).
+
+using MPI
+using AMDGPU
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+# select device (specifically relevant if >1 GPU per node)
+# using node-local communicator to retrieve node-local rank
+comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_l = MPI.Comm_rank(comm_l)
+
+# select device
+device = AMDGPU.device_id!(rank_l+1)
+# using the default device if the scheduler exposes a different GPU per rank (e.g. SLURM `--gpus-per-task=1`)
+# device = AMDGPU.device_id!(1)
+gpu_id = AMDGPU.device_id(AMDGPU.device())
+
+size = MPI.Comm_size(comm)
+dst = mod(rank+1, size)
+src = mod(rank-1, size)
+println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id - $device), size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = ROCArray{Float64}(undef, N)
+recv_mesg = ROCArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+AMDGPU.synchronize()
+
+rank==0 && println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
diff --git a/docs/src/usage.md b/docs/src/usage.md
index 1150ee9ea..969cb6d3d 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -89,9 +89,9 @@ send and receive buffers for point-to-point and collective operations (they may

 ### CUDA

-Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2)
+Successfully running the [alltoall\_test\_cuda.jl](../examples/alltoall_test_cuda.jl)
 should confirm your MPI implementation to have the CUDA support enabled. 
Moreover, successfully running the -[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm +[alltoall\_test\_cuda\_multigpu.jl](../examples/alltoall_test_cuda_multigpu.jl) should confirm your CUDA-aware MPI implementation to use multiple Nvidia GPUs (one GPU per rank). If using OpenMPI, the status of CUDA support can be checked via the @@ -99,9 +99,9 @@ If using OpenMPI, the status of CUDA support can be checked via the ### ROCm -Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c) +Successfully running the [alltoall\_test\_rocm.jl](../examples/alltoall_test_rocm.jl) should confirm your MPI implementation to have the ROCm support (AMDGPU) enabled. Moreover, successfully running the -[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm +[alltoall\_test\_rocm\_multigpu.jl](../examples/alltoall_test_rocm_multigpu.jl) should confirm your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank). If using OpenMPI, the status of ROCm support can be checked via the
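Building on the `MPI.has_cuda()` / `MPI.has_rocm()` checks referenced above (they are only meaningful with Open MPI), a small guard at the top of a GPU test script can turn a missing GPU-aware build into a readable error. This is an illustrative sketch, not part of the patch series above:

```julia
using MPI
MPI.Init()
# With Open MPI, fail early with a clear message instead of crashing later
# inside MPI.Sendrecv! on device buffers if the build is not GPU-aware.
MPI.has_cuda() || error("this Open MPI build does not report CUDA (GPU-aware) support")
```

The ROCm analogue would use `MPI.has_rocm()` instead.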
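The `preloads` and `preloads_env_switch` entries shown in the "Preloads" note above are MPIPreferences.jl preferences and normally live in a `LocalPreferences.toml` file (typically written by `MPIPreferences.use_system_binary()`). The sketch below shows where they would sit; the `binary`, `abi`, `libmpi`, and `mpiexec` values are assumptions for a GPU-aware Cray MPICH installation and must be adapted to the actual system.

```toml
[MPIPreferences]
binary = "system"
abi = "MPICH"
libmpi = "libmpi_cray"   # hypothetical library name, use the system's MPI library
mpiexec = "srun"         # hypothetical launcher
# preload entries from the "Preloads" note above (HSA/ROCm GTL library for Cray MPICH)
preloads = ["libmpi_gtl_hsa.so"]
preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
```

With MPI.jl configured this way, the examples added in PATCH 4/4 can be launched with the `mpiexecjl` wrapper shown earlier, e.g. `mpiexecjl --project=/path/to/project -n 4 julia alltoall_test_rocm_multigpu.jl` (the rank count here is arbitrary).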