
Conversation

@samnordmann
Collaborator

No description provided.

@samnordmann
Collaborator Author

!test

@samnordmann changed the title from "add kernel based a2av and cuda backend for d/c" to "Add kernel based alltoallv and cuda backend for MoE dispatch and combine" on Jan 22, 2026
@greptile-apps
Contributor

greptile-apps bot commented Jan 22, 2026

Greptile Summary

This PR adds a kernel-based alltoallv implementation and CUDA backend support for MoE dispatch/combine operations. It introduces a custom CUDA kernel (alltoallv.cu), compiled at runtime via NVRTC, that scatters data across ranks through remote memory pointers. The CUDA backend path allocates symmetric memory buffers and exchanges IPC handles to enable direct peer-to-peer transfers, providing an alternative to the existing NCCL backend.
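
Since the summary highlights runtime compilation via NVRTC, a minimal sketch of that flow is shown below for orientation: compile the kernel source to PTX, load it with the driver API, and look up the kernel symbol. This is an illustration, not the PR's code; error handling is abbreviated, the compile options and function names are assumptions, and a current CUDA context is assumed to exist.

```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <stdexcept>
#include <string>
#include <vector>

// Minimal NVRTC sketch: compile a kernel source string to PTX, load the PTX
// as a module, and return a handle to the named kernel. Assumes cuInit() has
// been called and a CUDA context is current (the runtime API usually provides
// one). Names and options here are illustrative, not the PR's actual code.
CUfunction compileKernelSketch(const char* source, const char* kernel_name) {
  nvrtcProgram prog;
  if (nvrtcCreateProgram(&prog, source, "alltoallv.cu", 0, nullptr, nullptr) !=
      NVRTC_SUCCESS) {
    throw std::runtime_error("nvrtcCreateProgram failed");
  }
  const char* opts[] = {"--std=c++17"};
  if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS) {
    size_t log_size = 0;
    nvrtcGetProgramLogSize(prog, &log_size);
    std::string log(log_size, '\0');
    nvrtcGetProgramLog(prog, log.data());
    nvrtcDestroyProgram(&prog);
    throw std::runtime_error("NVRTC compilation failed:\n" + log);
  }

  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  CUmodule module;
  CUfunction func;
  cuModuleLoadData(&module, ptx.data());            // JIT-load the generated PTX
  cuModuleGetFunction(&func, module, kernel_name);  // look up the kernel symbol
  return func;
}
```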

Key changes:

  • New alltoallv_kernel performs a byte-level scatter from the local send buffer into remote recv buffers across all ranks (a sketch follows this list)
  • prepareAlltoallvMetadata exchanges send counts via TCPStore and computes offsets/metadata for alltoallv
  • doMoEDispatch and doMoECombine now support both NCCL and CUDA backends with symmetric tensor allocation
  • Tests parameterized to validate both backend implementations
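
To make the first bullet concrete, below is a minimal sketch of what a byte-level scatter kernel of this kind can look like. The parameter names and grid layout are assumptions for illustration, not the actual alltoallv.cu kernel from this PR.

```cuda
#include <cstdint>

// Illustrative byte-level scatter: each rank copies its per-destination chunk
// from the local send buffer directly into every peer's recv buffer through
// IPC-mapped pointers. recv_offsets[dst] is the byte offset at which this
// rank's chunk starts inside peer dst's recv buffer.
__global__ void alltoallvScatterSketch(
    const uint8_t* __restrict__ send,          // local contiguous send buffer
    uint8_t* const* __restrict__ recv_ptrs,    // one recv-buffer pointer per rank
    const int64_t* __restrict__ send_offsets,  // chunk start per destination in `send`
    const int64_t* __restrict__ recv_offsets,  // where our chunk lands on each peer
    const int64_t* __restrict__ send_counts,   // bytes destined for each rank
    int64_t world_size) {
  // blockIdx.y selects the destination rank; x-dimension threads stride over bytes.
  for (int64_t dst = blockIdx.y; dst < world_size; dst += gridDim.y) {
    const uint8_t* src = send + send_offsets[dst];
    uint8_t* out = recv_ptrs[dst] + recv_offsets[dst];
    for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < send_counts[dst];
         i += static_cast<int64_t>(gridDim.x) * blockDim.x) {
      out[i] = src[i];
    }
  }
}
```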

Issues found:

  • The test checks availability of the hardcoded kNccl backend instead of the parameterized backend variable

Confidence Score: 4/5

  • Safe to merge after fixing the backend check bug in the test
  • Score reflects a solid implementation with proper error handling, NVRTC compilation safeguards, and comprehensive testing; one point is deducted for the logic bug in the test's backend check, which could cause tests to skip when they should run
  • Pay close attention to tests/cpp/test_multidevice_dispatch_combine.cpp: fix the backend availability check before merging

Important Files Changed

| Filename | Overview |
|---|---|
| csrc/multidevice/cuda_p2p.cpp | Adds a kernel-based alltoallv with NVRTC compilation, metadata preparation, and barrier synchronization for the CUDA backend |
| csrc/multidevice/dispatch_combine.cpp | Integrates the CUDA backend for MoE dispatch/combine using the kernel-based alltoallv, allocating symmetric buffers and handling metadata exchange |
| tests/cpp/test_multidevice_dispatch_combine.cpp | Parameterizes the test over the NCCL and CUDA backends, but contains a backend availability check bug |
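
The metadata preparation in csrc/multidevice/cuda_p2p.cpp boils down to an all-gather of per-destination send counts followed by prefix sums. The sketch below shows one way to do this through the c10d::Store set/get interface; the struct layout, key names, and function signature are assumptions, not the PR's actual prepareAlltoallvMetadata API.

```cpp
#include <torch/csrc/distributed/c10d/Store.hpp>

#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Illustrative counterpart of the metadata step: all ranks publish how many
// bytes they send to each destination, read back the full [src][dst] count
// matrix, and derive the offsets the scatter kernel needs.
struct AlltoallvMetadataSketch {
  std::vector<int64_t> send_counts;          // bytes this rank sends to each peer
  std::vector<int64_t> recv_counts;          // bytes this rank receives from each peer
  std::vector<int64_t> send_offsets;         // chunk start per destination in the send buffer
  std::vector<int64_t> remote_recv_offsets;  // where this rank's chunk lands on each peer
};

AlltoallvMetadataSketch prepareMetadataSketch(
    c10d::Store& store,
    const int64_t my_rank,
    const int64_t world_size,
    const std::vector<int64_t>& send_counts) {
  // Publish this rank's counts under a rank-specific key (key naming here is
  // hypothetical; the real code may also version keys per collective).
  std::vector<uint8_t> blob(send_counts.size() * sizeof(int64_t));
  std::memcpy(blob.data(), send_counts.data(), blob.size());
  store.set("a2av_counts_" + std::to_string(my_rank), blob);

  // get() blocks until each peer has published its row of the count matrix.
  std::vector<std::vector<int64_t>> counts(world_size);
  for (int64_t src = 0; src < world_size; ++src) {
    std::vector<uint8_t> peer_blob = store.get("a2av_counts_" + std::to_string(src));
    counts[src].resize(world_size);
    std::memcpy(counts[src].data(), peer_blob.data(), peer_blob.size());
  }

  AlltoallvMetadataSketch md;
  md.send_counts = send_counts;
  md.recv_counts.resize(world_size);
  md.send_offsets.assign(world_size, 0);
  md.remote_recv_offsets.assign(world_size, 0);
  for (int64_t dst = 0; dst < world_size; ++dst) {
    md.recv_counts[dst] = counts[dst][my_rank];
    if (dst > 0) {
      md.send_offsets[dst] = md.send_offsets[dst - 1] + send_counts[dst - 1];
    }
    // Our chunk lands on peer `dst` after the chunks of all lower-ranked sources.
    for (int64_t src = 0; src < my_rank; ++src) {
      md.remote_recv_offsets[dst] += counts[src][dst];
    }
  }
  return md;
}
```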

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant Dispatch as doMoEDispatch
    participant Combine as doMoECombine
    participant Metadata as prepareAlltoallvMetadata
    participant Kernel as launchAlltoallvKernel
    participant SymTensor as SymmetricTensor
    participant Barrier as alltoallvBarrier
    
    App->>Dispatch: x, topk_idx, topk_weights, is_token_in_rank
    Note over Dispatch: Sort tokens by destination rank
    Dispatch->>Dispatch: Reorder x, topk_idx, topk_weights by rank
    
    alt Backend == NCCL
        Dispatch->>Dispatch: Use ProcessGroup alltoall_base
    else Backend == CUDA
        Dispatch->>Metadata: prepareAlltoallvMetadata(n_tokens_to_rank)
        Note over Metadata: Exchange counts via TCPStore<br/>Compute offsets and metadata
        Metadata-->>Dispatch: AlltoallvMetadata
        
        Dispatch->>SymTensor: Allocate symmetric send/recv buffers
        SymTensor-->>Dispatch: Send/recv tensor handles
        
        Dispatch->>SymTensor: setupRemoteHandles() for each buffer
        Note over SymTensor: Exchange IPC handles across ranks
        
        loop For each payload (x, topk_idx, topk_weights, src_idx, src_rank)
            Dispatch->>Kernel: launchAlltoallvKernel(send, recv_ptrs, metadata)
            Note over Kernel: CUDA kernel copies data<br/>from send to recv buffers<br/>across all ranks
        end
        
        Dispatch->>Barrier: alltoallvBarrier()
        Note over Barrier: Synchronize all ranks
    end
    
    Note over Dispatch: Reorder by expert ID locally
    Dispatch-->>App: recv_x, recv_topk_idx, recv_topk_weights, recv_src_idx, recv_src_rank
    
    Note over App: ... Expert computation happens here ...
    
    App->>Combine: x, topk_weights, src_idx, src_rank, n_tokens_to/from_rank
    Note over Combine: Sort by source rank
    
    alt Backend == NCCL
        Combine->>Combine: Use ProcessGroup alltoall_base
    else Backend == CUDA
        Combine->>Metadata: prepareAlltoallvMetadata(n_tokens_from_rank)
        Metadata-->>Combine: AlltoallvMetadata
        
        Combine->>SymTensor: Allocate symmetric buffers
        Combine->>SymTensor: setupRemoteHandles()
        
        loop For each payload (x, topk_weights, src_idx)
            Combine->>Kernel: launchAlltoallvKernel(send, recv_ptrs, metadata)
        end
        
        Combine->>Barrier: alltoallvBarrier()
    end
    
    Note over Combine: Scatter by original token index
    Combine-->>App: combined_x, combined_topk_weights
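
The setupRemoteHandles() step in the diagram is where direct peer addressing is established. A minimal sketch of the underlying CUDA IPC calls is shown below; the helper names are hypothetical, and the transport that carries the serialized handle between ranks (for example, the same store used for the counts) is omitted.

```cpp
#include <cuda_runtime.h>

#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Export: serialize the IPC handle of a locally allocated device buffer so it
// can be shipped to peer ranks.
std::vector<uint8_t> exportIpcHandleSketch(void* local_dev_ptr) {
  cudaIpcMemHandle_t handle{};
  if (cudaIpcGetMemHandle(&handle, local_dev_ptr) != cudaSuccess) {
    throw std::runtime_error("cudaIpcGetMemHandle failed");
  }
  std::vector<uint8_t> blob(sizeof(handle));
  std::memcpy(blob.data(), &handle, sizeof(handle));
  return blob;
}

// Import: map a peer's buffer into this process; the returned pointer is what
// the scatter kernel would receive as one entry of recv_ptrs.
void* importIpcHandleSketch(const std::vector<uint8_t>& blob) {
  cudaIpcMemHandle_t handle{};
  std::memcpy(&handle, blob.data(), sizeof(handle));
  void* remote_ptr = nullptr;
  if (cudaIpcOpenMemHandle(&remote_ptr, handle, cudaIpcMemLazyEnablePeerAccess) !=
      cudaSuccess) {
    throw std::runtime_error("cudaIpcOpenMemHandle failed");
  }
  return remote_ptr;
}
```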

@greptile-apps greptile-apps bot left a comment

8 files reviewed, 1 comment

Comment on lines +36 to +38
if (!communicator_->isBackendAvailable(CommunicatorBackend::kNccl)) {
  GTEST_SKIP() << "Backend " << backend << " not available.";
}

logic: checks the wrong backend constant - should check the backend parameter, not the hardcoded kNccl

Suggested change
-if (!communicator_->isBackendAvailable(CommunicatorBackend::kNccl)) {
+if (!communicator_->isBackendAvailable(backend)) {
   GTEST_SKIP() << "Backend " << backend << " not available.";
 }
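
As a usage note, the sketch below shows how a value-parameterized GoogleTest suite typically gates on the backend under test, which is why the hardcoded kNccl check defeats the parameterization. The fixture, enum, and communicator here are hypothetical stand-ins for illustration, not nvFuser's real types.

```cpp
#include <gtest/gtest.h>

// Hypothetical stand-ins for the real fixture pieces (nvFuser's Communicator
// and CommunicatorBackend); only the gating logic is the point here.
enum class CommunicatorBackend { kNccl, kCuda };

struct FakeCommunicator {
  bool isBackendAvailable(CommunicatorBackend) const { return true; }
};

class DispatchCombineBackendSketch
    : public ::testing::TestWithParam<CommunicatorBackend> {
 protected:
  FakeCommunicator comm_;
  FakeCommunicator* communicator_ = &comm_;
};

TEST_P(DispatchCombineBackendSketch, GatesOnParameterizedBackend) {
  const CommunicatorBackend backend = GetParam();
  // The crux of the suggested change: query availability of the backend under
  // test rather than a hardcoded CommunicatorBackend::kNccl.
  if (!communicator_->isBackendAvailable(backend)) {
    GTEST_SKIP() << "Backend not available.";
  }
  // ... exercise doMoEDispatch / doMoECombine with `backend` ...
}

INSTANTIATE_TEST_SUITE_P(
    Backends,
    DispatchCombineBackendSketch,
    ::testing::Values(CommunicatorBackend::kNccl, CommunicatorBackend::kCuda));
```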
