27 changes: 27 additions & 0 deletions docs/examples/alltoall_test_cuda.jl
@@ -0,0 +1,27 @@
# This example demonstrates that your MPI implementation has CUDA support enabled.

using MPI
using CUDA

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")

N = 4

send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)

fill!(send_mesg, Float64(rank))
CUDA.synchronize()

println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
rank==0 && println("done.")
38 changes: 38 additions & 0 deletions docs/examples/alltoall_test_cuda_multigpu.jl
@@ -0,0 +1,38 @@
# This example demonstrates that your CUDA-aware MPI implementation can use multiple Nvidia GPUs (one GPU per rank).

using MPI
using CUDA

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# select device (specifically relevant if >1 GPU per node)
# using node-local communicator to retrieve node-local rank
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)

# select device
gpu_id = CUDA.device!(rank_l)
# using default device if the scheduler exposes different GPU per rank (e.g. SLURM `--gpus-per-task=1`)
# gpu_id = CUDA.device!(0)

size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id), size=$size, dst=$dst, src=$src")

N = 4

send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)

fill!(send_mesg, Float64(rank))
CUDA.synchronize()

rank==0 && println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank_l: $recv_mesg")
rank==0 && println("done.")
27 changes: 27 additions & 0 deletions docs/examples/alltoall_test_rocm.jl
@@ -0,0 +1,27 @@
# This example demonstrates that your MPI implementation has ROCm support enabled.

using MPI
using AMDGPU

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")

N = 4

send_mesg = ROCArray{Float64}(undef, N)
recv_mesg = ROCArray{Float64}(undef, N)

fill!(send_mesg, Float64(rank))
AMDGPU.synchronize()

println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
rank==0 && println("done.")
38 changes: 38 additions & 0 deletions docs/examples/alltoall_test_rocm_multigpu.jl
@@ -0,0 +1,38 @@
# This example demonstrates that your ROCm-aware MPI implementation can use multiple AMD GPUs (one GPU per rank).

using MPI
using AMDGPU

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# select device (specifically relevant if >1 GPU per node)
# using node-local communicator to retrieve node-local rank
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)

# select device
device = AMDGPU.device_id!(rank_l+1)
# using default device if the scheduler exposes different GPU per rank (e.g. SLURM `--gpus-per-task=1`)
# device = AMDGPU.device_id!(1)
gpu_id = AMDGPU.device_id(AMDGPU.device())

size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id - $device), size=$size, dst=$dst, src=$src")

N = 4

send_mesg = ROCArray{Float64}(undef, N)
recv_mesg = ROCArray{Float64}(undef, N)

fill!(send_mesg, Float64(rank))
AMDGPU.synchronize()

rank==0 && println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
rank==0 && println("done.")
44 changes: 30 additions & 14 deletions docs/src/usage.md
@@ -74,33 +74,49 @@ with:
$ mpiexecjl --project=/path/to/project -n 20 julia script.jl
```

## CUDA-aware MPI support
## GPU-aware MPI support

If your MPI implementation has been compiled with CUDA support, then `CUDA.CuArray`s (from the
[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package) can be passed directly as
send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported).
If your MPI implementation has been compiled with CUDA or ROCm support, then `CUDA.CuArray`s (from
[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as
send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). In most cases, GPU-aware MPI requires a [system-provided MPI installation](@ref using_system_mpi).
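
For instance, a device buffer can be reduced in place across ranks without staging through host memory. The following is a minimal sketch, assuming a CUDA-aware MPI build; the `AMDGPU.ROCArray` case is analogous:

```julia
# Minimal sketch: in-place Allreduce on a GPU buffer (assumes CUDA-aware MPI).
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD

buf = CUDA.ones(Float64, 4)   # device buffer, never copied to the host
CUDA.synchronize()
MPI.Allreduce!(buf, +, comm)  # device array passed directly to MPI
```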

Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2)
should confirm your MPI implementation to have the CUDA support enabled. Moreover, successfully running the
[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm
!!! note "Preloads"
    On Cray machines, you may need to ensure that the following preloads are set in the preferences:
    ```
    preloads = ["libmpi_gtl_hsa.so"]
    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
    ```

### CUDA

Successfully running [alltoall\_test\_cuda.jl](../examples/alltoall_test_cuda.jl)
should confirm that your MPI implementation has CUDA support enabled. Moreover, successfully running
[alltoall\_test\_cuda\_multigpu.jl](../examples/alltoall_test_cuda_multigpu.jl) should confirm that
your CUDA-aware MPI implementation can use multiple Nvidia GPUs (one GPU per rank).

If using OpenMPI, the status of CUDA support can be checked via the
[`MPI.has_cuda()`](@ref) function.
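
For instance, the check can be made explicit at startup (a minimal sketch, assuming an Open MPI build where this query is meaningful):

```julia
# Abort early if the MPI library does not report CUDA support (Open MPI).
using MPI

MPI.Init()
MPI.has_cuda() || error("this MPI implementation does not report CUDA support")
```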

## ROCm-aware MPI support
### ROCm

If your MPI implementation has been compiled with ROCm support (AMDGPU), then `AMDGPU.ROCArray`s (from the
[AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) package) can be passed directly as send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported).

Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c)
should confirm your MPI implementation to have the ROCm support (AMDGPU) enabled. Moreover, successfully running the
[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm
Successfully running [alltoall\_test\_rocm.jl](../examples/alltoall_test_rocm.jl)
should confirm that your MPI implementation has ROCm (AMDGPU) support enabled. Moreover, successfully running
[alltoall\_test\_rocm\_multigpu.jl](../examples/alltoall_test_rocm_multigpu.jl) should confirm that
your ROCm-aware MPI implementation can use multiple AMD GPUs (one GPU per rank).

If using OpenMPI, the status of ROCm support can be checked via the
[`MPI.has_rocm()`](@ref) function.

### Multiple GPUs per node

In a configuration with multiple GPUs per node, the mapping of GPUs to node-local MPI ranks can be handled either (1) on the application side, using a node-local communicator (`MPI.COMM_TYPE_SHARED`), or (2) on the system side, by setting device visibility accordingly.
For (1), a node-local communicator provides the rank `rank_loc` within the node:
```
comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_loc = MPI.Comm_rank(comm_loc)
```
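
This node-local rank can then be used to bind each rank to one GPU, as done in the multi-GPU examples above (a minimal sketch; with AMDGPU.jl, `AMDGPU.device_id!(rank_loc + 1)` plays the same role):

```julia
# Bind this MPI rank to one GPU on the node (CUDA.jl shown; AMDGPU.jl is analogous).
CUDA.device!(rank_loc)
```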
For (2), one can use the default device, but make sure to handle device visibility via the scheduler or the `CUDA_VISIBLE_DEVICES`/`ROCR_VISIBLE_DEVICES` environment variables.

## Writing MPI tests

It is recommended to use the `mpiexec()` wrapper when writing your package tests in `runtests.jl`: