From 61e0edebadfcfdec5e8c96fac36de664c6e65e40 Mon Sep 17 00:00:00 2001 From: Ludovic Raess Date: Wed, 17 Dec 2025 10:21:35 +0100 Subject: [PATCH 1/4] Improve GPU-aware section --- docs/src/usage.md | 43 +++++++++++++++++++++++++++++-------------- 1 file changed, 29 insertions(+), 14 deletions(-) diff --git a/docs/src/usage.md b/docs/src/usage.md index c57eae1af..2835a1c38 100644 --- a/docs/src/usage.md +++ b/docs/src/usage.md @@ -74,33 +74,48 @@ with: $ mpiexecjl --project=/path/to/project -n 20 julia script.jl ``` -## CUDA-aware MPI support +## GPU-aware MPI support -If your MPI implementation has been compiled with CUDA support, then `CUDA.CuArray`s (from the -[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package) can be passed directly as -send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). +If your MPI implementation has been compiled with CUDA or ROCm support, then `CUDA.CuArray`s (from +[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as +send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). GPU-aware MPI requires to in most cases to use a [system provided MPI installation](configuration.md#using_system_mpi). -Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2) -should confirm your MPI implementation to have the CUDA support enabled. Moreover, successfully running the -[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm +### CUDA + +Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2) +should confirm your MPI implementation to have the CUDA support enabled. Moreover, successfully running the +[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm your CUDA-aware MPI implementation to use multiple Nvidia GPUs (one GPU per rank). If using OpenMPI, the status of CUDA support can be checked via the [`MPI.has_cuda()`](@ref) function. -## ROCm-aware MPI support - -If your MPI implementation has been compiled with ROCm support (AMDGPU), then `AMDGPU.ROCArray`s (from the -[AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) package) can be passed directly as send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). +### ROCm -Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c) -should confirm your MPI implementation to have the ROCm support (AMDGPU) enabled. Moreover, successfully running the -[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm +Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c) +should confirm your MPI implementation to have the ROCm support (AMDGPU) enabled. Moreover, successfully running the +[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank). 
 If using OpenMPI, the status of ROCm support can be checked via the [`MPI.has_rocm()`](@ref) function.
+
+> [!NOTE]
+> On Cray machines, you may need to ensure the following preloads to be set in the preferences:
+> ```
+> preloads = ["libmpi_gtl_hsa.so"]
+> preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
+> ```
+
+> [!NOTE]
+> In a configuration with multiple GPUs per node, mapping GPU ID to node local MPI rank can be achieved either (1) on the application side using node-local communicator (`MPI.COMM_TYPE_SHARED`) or (2) on the system side setting device visibility accordingly.
+> If using (1), one can use the node-local rank `rank_loc` to select the GPU device:
+> ```
+> comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+> rank_loc = MPI.Comm_rank(comm_loc)
+> ```
+> If using (2), one can use the default device but make sure to handle device visibility in the scheduler; for SLURM on Cray systems, this can be mostly achieved using `--gpus-per-task=1`.
+
 ## Writing MPI tests

 It is recommended to use the `mpiexec()` wrapper when writing your package tests in `runtests.jl`:

From b8312ea6cc751f58be2fa5f9f1c9396cd160ff66 Mon Sep 17 00:00:00 2001
From: Ludovic Raess
Date: Wed, 17 Dec 2025 19:56:05 +0100
Subject: [PATCH 2/4] Add suggestions

---
 docs/src/usage.md | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/docs/src/usage.md b/docs/src/usage.md
index 2835a1c38..dca16be84 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -78,7 +78,7 @@ $ mpiexecjl --project=/path/to/project -n 20 julia script.jl

 If your MPI implementation has been compiled with CUDA or ROCm support, then `CUDA.CuArray`s (from
 [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as
-send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). GPU-aware MPI requires to in most cases to use a [system provided MPI installation](configuration.md#using_system_mpi).
+send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). In most cases, GPU-aware MPI requires using a [system-provided MPI installation](@ref using_system_mpi).

 ### CUDA

@@ -100,21 +100,21 @@ your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
 If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.

-> [!NOTE]
-> On Cray machines, you may need to ensure the following preloads to be set in the preferences:
-> ```
-> preloads = ["libmpi_gtl_hsa.so"]
-> preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
-> ```
-
-> [!NOTE]
-> In a configuration with multiple GPUs per node, mapping GPU ID to node local MPI rank can be achieved either (1) on the application side using node-local communicator (`MPI.COMM_TYPE_SHARED`) or (2) on the system side setting device visibility accordingly.
-> If using (1), one can use the node-local rank `rank_loc` to select the GPU device:
-> ```
-> comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
-> rank_loc = MPI.Comm_rank(comm_loc)
-> ```
-> If using (2), one can use the default device but make sure to handle device visibility in the scheduler; for SLURM on Cray systems, this can be mostly achieved using `--gpus-per-task=1`.
+!!! note "Preloads"
+    On Cray machines, you may need to ensure the following preloads are set in the preferences:
+    ```
+    preloads = ["libmpi_gtl_hsa.so"]
+    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
+    ```
+
+!!! note "Multiple GPUs per node"
+    In a configuration with multiple GPUs per node, mapping GPU IDs to node-local MPI ranks can be achieved either (1) on the application side, using a node-local communicator (`MPI.COMM_TYPE_SHARED`), or (2) on the system side, by setting device visibility accordingly.
+    For (1), one can use the node-local rank `rank_loc` to select the GPU device:
+    ```
+    comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+    rank_loc = MPI.Comm_rank(comm_loc)
+    ```
+    For (2), one can use the default device but make sure to handle device visibility via the scheduler or via the `CUDA_VISIBLE_DEVICES` / `ROCR_VISIBLE_DEVICES` environment variables.

 ## Writing MPI tests

From a01708036656ecdf767b0c22afa48e9006e3a76f Mon Sep 17 00:00:00 2001
From: Ludovic Raess
Date: Wed, 17 Dec 2025 21:05:50 +0100
Subject: [PATCH 3/4] Update

---
 docs/src/usage.md | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/docs/src/usage.md b/docs/src/usage.md
index dca16be84..1150ee9ea 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -80,6 +80,13 @@ If your MPI implementation has been compiled with CUDA or ROCm support, then `CU
 [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as
 send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). In most cases, GPU-aware MPI requires using a [system-provided MPI installation](@ref using_system_mpi).

+!!! note "Preloads"
+    On Cray machines, you may need to ensure the following preloads are set in the preferences:
+    ```
+    preloads = ["libmpi_gtl_hsa.so"]
+    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
+    ```
+
 ### CUDA

@@ -100,21 +107,15 @@ your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
 If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.

-!!! note "Preloads"
-    On Cray machines, you may need to ensure the following preloads are set in the preferences:
-    ```
-    preloads = ["libmpi_gtl_hsa.so"]
-    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
-    ```
+### Multiple GPUs per node

-!!! note "Multiple GPUs per node"
-    In a configuration with multiple GPUs per node, mapping GPU IDs to node-local MPI ranks can be achieved either (1) on the application side, using a node-local communicator (`MPI.COMM_TYPE_SHARED`), or (2) on the system side, by setting device visibility accordingly.
-    For (1), one can use the node-local rank `rank_loc` to select the GPU device:
-    ```
-    comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
-    rank_loc = MPI.Comm_rank(comm_loc)
-    ```
-    For (2), one can use the default device but make sure to handle device visibility via the scheduler or via the `CUDA_VISIBLE_DEVICES` / `ROCR_VISIBLE_DEVICES` environment variables.
+In a configuration with multiple GPUs per node, mapping GPU IDs to node-local MPI ranks can be achieved either (1) on the application side, using a node-local communicator (`MPI.COMM_TYPE_SHARED`), or (2) on the system side, by setting device visibility accordingly.
+For (1), one can use the node-local rank `rank_loc` to select the GPU device:
+```
+comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_loc = MPI.Comm_rank(comm_loc)
+```
+For (2), one can use the default device but make sure to handle device visibility via the scheduler or via the `CUDA_VISIBLE_DEVICES` / `ROCR_VISIBLE_DEVICES` environment variables.

 ## Writing MPI tests

From d64e598ee05ff9ddbf3da6c48d9cefad77e69cf9 Mon Sep 17 00:00:00 2001
From: Ludovic Raess
Date: Wed, 17 Dec 2025 21:25:00 +0100
Subject: [PATCH 4/4] Add examples

---
 docs/examples/alltoall_test_cuda.jl          | 27 ++++++++++++++
 docs/examples/alltoall_test_cuda_multigpu.jl | 38 ++++++++++++++++++++
 docs/examples/alltoall_test_rocm.jl          | 27 ++++++++++++++
 docs/examples/alltoall_test_rocm_multigpu.jl | 38 ++++++++++++++++++++
 docs/src/usage.md                            |  8 ++---
 5 files changed, 134 insertions(+), 4 deletions(-)
 create mode 100644 docs/examples/alltoall_test_cuda.jl
 create mode 100644 docs/examples/alltoall_test_cuda_multigpu.jl
 create mode 100644 docs/examples/alltoall_test_rocm.jl
 create mode 100644 docs/examples/alltoall_test_rocm_multigpu.jl

diff --git a/docs/examples/alltoall_test_cuda.jl b/docs/examples/alltoall_test_cuda.jl
new file mode 100644
index 000000000..05011985c
--- /dev/null
+++ b/docs/examples/alltoall_test_cuda.jl
@@ -0,0 +1,27 @@
+# This example demonstrates that your MPI implementation has CUDA support enabled.
+
+using MPI
+using CUDA
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+size = MPI.Comm_size(comm)
+dst = mod(rank+1, size)
+src = mod(rank-1, size)
+println("rank=$rank, size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = CuArray{Float64}(undef, N)
+recv_mesg = CuArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+CUDA.synchronize()
+
+println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
diff --git a/docs/examples/alltoall_test_cuda_multigpu.jl b/docs/examples/alltoall_test_cuda_multigpu.jl
new file mode 100644
index 000000000..cc4838153
--- /dev/null
+++ b/docs/examples/alltoall_test_cuda_multigpu.jl
@@ -0,0 +1,38 @@
+# This example demonstrates that your CUDA-aware MPI implementation can use multiple Nvidia GPUs (one GPU per rank).
+
+using MPI
+using CUDA
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+# select device (specifically relevant if >1 GPU per node)
+# using node-local communicator to retrieve node-local rank
+comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_l = MPI.Comm_rank(comm_l)
+
+# select device
+gpu_id = CUDA.device!(rank_l)
+# using the default device if the scheduler exposes a different GPU per rank (e.g. SLURM `--gpus-per-task=1`)
+# gpu_id = CUDA.device!(0)
+
+size = MPI.Comm_size(comm)
+dst = mod(rank+1, size)
+src = mod(rank-1, size)
+println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id), size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = CuArray{Float64}(undef, N)
+recv_mesg = CuArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+CUDA.synchronize()
+
+rank==0 && println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank_l: $recv_mesg")
+rank==0 && println("done.")
diff --git a/docs/examples/alltoall_test_rocm.jl b/docs/examples/alltoall_test_rocm.jl
new file mode 100644
index 000000000..e8be85b34
--- /dev/null
+++ b/docs/examples/alltoall_test_rocm.jl
@@ -0,0 +1,27 @@
+# This example demonstrates that your MPI implementation has ROCm support enabled.
+
+using MPI
+using AMDGPU
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+size = MPI.Comm_size(comm)
+dst = mod(rank+1, size)
+src = mod(rank-1, size)
+println("rank=$rank, size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = ROCArray{Float64}(undef, N)
+recv_mesg = ROCArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+AMDGPU.synchronize()
+
+println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
diff --git a/docs/examples/alltoall_test_rocm_multigpu.jl b/docs/examples/alltoall_test_rocm_multigpu.jl
new file mode 100644
index 000000000..c26348261
--- /dev/null
+++ b/docs/examples/alltoall_test_rocm_multigpu.jl
@@ -0,0 +1,38 @@
+# This example demonstrates that your ROCm-aware MPI implementation can use multiple AMD GPUs (one GPU per rank).
+
+using MPI
+using AMDGPU
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+# select device (specifically relevant if >1 GPU per node)
+# using node-local communicator to retrieve node-local rank
+comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_l = MPI.Comm_rank(comm_l)
+
+# select device
+device = AMDGPU.device_id!(rank_l+1)
+# using the default device if the scheduler exposes a different GPU per rank (e.g. SLURM `--gpus-per-task=1`)
+# device = AMDGPU.device_id!(1)
+gpu_id = AMDGPU.device_id(AMDGPU.device())
+
+size = MPI.Comm_size(comm)
+dst = mod(rank+1, size)
+src = mod(rank-1, size)
+println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id - $device), size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = ROCArray{Float64}(undef, N)
+recv_mesg = ROCArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+AMDGPU.synchronize()
+
+rank==0 && println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
diff --git a/docs/src/usage.md b/docs/src/usage.md
index 1150ee9ea..969cb6d3d 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -89,9 +89,9 @@ send and receive buffers for point-to-point and collective operations (they may

 ### CUDA

-Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2)
+Successfully running the [alltoall\_test\_cuda.jl](../examples/alltoall_test_cuda.jl)
 should confirm your MPI implementation to have the CUDA support enabled. 
Moreover, successfully running the -[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm +[alltoall\_test\_cuda\_multigpu.jl](../examples/alltoall_test_cuda_multigpu.jl) should confirm your CUDA-aware MPI implementation to use multiple Nvidia GPUs (one GPU per rank). If using OpenMPI, the status of CUDA support can be checked via the @@ -99,9 +99,9 @@ If using OpenMPI, the status of CUDA support can be checked via the ### ROCm -Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c) +Successfully running the [alltoall\_test\_rocm.jl](../examples/alltoall_test_rocm.jl) should confirm your MPI implementation to have the ROCm support (AMDGPU) enabled. Moreover, successfully running the -[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm +[alltoall\_test\_rocm\_multigpu.jl](../examples/alltoall_test_rocm_multigpu.jl) should confirm your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank). If using OpenMPI, the status of ROCm support can be checked via the
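Building on the `MPI.has_cuda()` / `MPI.has_rocm()` checks referenced above (they are only meaningful with Open MPI), a small guard at the top of a GPU test script can turn a missing GPU-aware build into a readable error. This is an illustrative sketch, not part of the patch series above:

```julia
using MPI
MPI.Init()
# With Open MPI, fail early with a clear message instead of crashing later
# inside MPI.Sendrecv! on device buffers if the build is not GPU-aware.
MPI.has_cuda() || error("this Open MPI build does not report CUDA (GPU-aware) support")
```

The ROCm analogue would use `MPI.has_rocm()` instead.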
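The `preloads` and `preloads_env_switch` entries shown in the "Preloads" note above are MPIPreferences.jl preferences and normally live in a `LocalPreferences.toml` file (typically written by `MPIPreferences.use_system_binary()`). The sketch below shows where they would sit; the `binary`, `abi`, `libmpi`, and `mpiexec` values are assumptions for a GPU-aware Cray MPICH installation and must be adapted to the actual system.

```toml
[MPIPreferences]
binary = "system"
abi = "MPICH"
libmpi = "libmpi_cray"   # hypothetical library name, use the system's MPI library
mpiexec = "srun"         # hypothetical launcher
# preload entries from the "Preloads" note above (HSA/ROCm GTL library for Cray MPICH)
preloads = ["libmpi_gtl_hsa.so"]
preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
```

With MPI.jl configured this way, the examples added in PATCH 4/4 can be launched with the `mpiexecjl` wrapper shown earlier, e.g. `mpiexecjl --project=/path/to/project -n 4 julia alltoall_test_rocm_multigpu.jl` (the rank count here is arbitrary).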