
Conversation

@samnordmann
Collaborator

No description provided.

@samnordmann
Collaborator Author

!test

@samnordmann changed the title from "add kernel based a2av and cuda backend for d/c" to "Add kernel based alltoallv and cuda backend for MoE dispatch and combine" on Jan 22, 2026
@greptile-apps
Contributor

greptile-apps bot commented Jan 22, 2026

Greptile Summary

This PR adds a kernel-based alltoallv implementation and CUDA backend support for MoE dispatch/combine operations. It introduces a custom CUDA kernel (alltoallv.cu), compiled at runtime via NVRTC, that scatters data across ranks through remote memory pointers. The CUDA backend path allocates symmetric memory buffers and exchanges IPC handles to enable direct peer-to-peer transfers, providing an alternative to the existing NCCL backend.
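
Since the summary highlights runtime compilation via NVRTC, a minimal sketch of that flow is shown below for orientation: compile the kernel source to PTX, load it with the driver API, and look up the kernel symbol. This is an illustration, not the PR's code; error handling is abbreviated, the compile options and function names are assumptions, and a current CUDA context is assumed to exist.

```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <stdexcept>
#include <string>
#include <vector>

// Minimal NVRTC sketch: compile a kernel source string to PTX, load the PTX
// as a module, and return a handle to the named kernel. Assumes cuInit() has
// been called and a CUDA context is current (the runtime API usually provides
// one). Names and options here are illustrative, not the PR's actual code.
CUfunction compileKernelSketch(const char* source, const char* kernel_name) {
  nvrtcProgram prog;
  if (nvrtcCreateProgram(&prog, source, "alltoallv.cu", 0, nullptr, nullptr) !=
      NVRTC_SUCCESS) {
    throw std::runtime_error("nvrtcCreateProgram failed");
  }
  const char* opts[] = {"--std=c++17"};
  if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS) {
    size_t log_size = 0;
    nvrtcGetProgramLogSize(prog, &log_size);
    std::string log(log_size, '\0');
    nvrtcGetProgramLog(prog, log.data());
    nvrtcDestroyProgram(&prog);
    throw std::runtime_error("NVRTC compilation failed:\n" + log);
  }

  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  CUmodule module;
  CUfunction func;
  cuModuleLoadData(&module, ptx.data());            // JIT-load the generated PTX
  cuModuleGetFunction(&func, module, kernel_name);  // look up the kernel symbol
  return func;
}
```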

Key changes:

  • New alltoallv_kernel performs a byte-level scatter from the local send buffer into remote recv buffers across all ranks (a sketch follows this list)
  • prepareAlltoallvMetadata exchanges send counts via TCPStore and computes offsets/metadata for alltoallv
  • doMoEDispatch and doMoECombine now support both NCCL and CUDA backends with symmetric tensor allocation
  • Tests parameterized to validate both backend implementations
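
To make the first bullet concrete, below is a minimal sketch of what a byte-level scatter kernel of this kind can look like. The parameter names and grid layout are assumptions for illustration, not the actual alltoallv.cu kernel from this PR.

```cuda
#include <cstdint>

// Illustrative byte-level scatter: each rank copies its per-destination chunk
// from the local send buffer directly into every peer's recv buffer through
// IPC-mapped pointers. recv_offsets[dst] is the byte offset at which this
// rank's chunk starts inside peer dst's recv buffer.
__global__ void alltoallvScatterSketch(
    const uint8_t* __restrict__ send,          // local contiguous send buffer
    uint8_t* const* __restrict__ recv_ptrs,    // one recv-buffer pointer per rank
    const int64_t* __restrict__ send_offsets,  // chunk start per destination in `send`
    const int64_t* __restrict__ recv_offsets,  // where our chunk lands on each peer
    const int64_t* __restrict__ send_counts,   // bytes destined for each rank
    int64_t world_size) {
  // blockIdx.y selects the destination rank; x-dimension threads stride over bytes.
  for (int64_t dst = blockIdx.y; dst < world_size; dst += gridDim.y) {
    const uint8_t* src = send + send_offsets[dst];
    uint8_t* out = recv_ptrs[dst] + recv_offsets[dst];
    for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < send_counts[dst];
         i += static_cast<int64_t>(gridDim.x) * blockDim.x) {
      out[i] = src[i];
    }
  }
}
```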

Issues found:

  • The test checks availability of the hardcoded kNccl backend instead of the parameterized backend variable

Confidence Score: 4/5

  • Safe to merge after fixing the backend check bug in the test
  • Score reflects a solid implementation with proper error handling, NVRTC compilation safeguards, and comprehensive testing; one point is deducted for the logic bug in the test's backend check, which could cause tests to skip when they should run
  • Pay close attention to tests/cpp/test_multidevice_dispatch_combine.cpp: fix the backend availability check before merging

Important Files Changed

| Filename | Overview |
|---|---|
| csrc/multidevice/cuda_p2p.cpp | Adds a kernel-based alltoallv with NVRTC compilation, metadata preparation, and barrier synchronization for the CUDA backend |
| csrc/multidevice/dispatch_combine.cpp | Integrates the CUDA backend for MoE dispatch/combine using the kernel-based alltoallv, allocating symmetric buffers and handling metadata exchange |
| tests/cpp/test_multidevice_dispatch_combine.cpp | Parameterizes the test over the NCCL and CUDA backends, but contains a backend availability check bug |
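
The metadata preparation in csrc/multidevice/cuda_p2p.cpp boils down to an all-gather of per-destination send counts followed by prefix sums. The sketch below shows one way to do this through the c10d::Store set/get interface; the struct layout, key names, and function signature are assumptions, not the PR's actual prepareAlltoallvMetadata API.

```cpp
#include <torch/csrc/distributed/c10d/Store.hpp>

#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Illustrative counterpart of the metadata step: all ranks publish how many
// bytes they send to each destination, read back the full [src][dst] count
// matrix, and derive the offsets the scatter kernel needs.
struct AlltoallvMetadataSketch {
  std::vector<int64_t> send_counts;          // bytes this rank sends to each peer
  std::vector<int64_t> recv_counts;          // bytes this rank receives from each peer
  std::vector<int64_t> send_offsets;         // chunk start per destination in the send buffer
  std::vector<int64_t> remote_recv_offsets;  // where this rank's chunk lands on each peer
};

AlltoallvMetadataSketch prepareMetadataSketch(
    c10d::Store& store,
    const int64_t my_rank,
    const int64_t world_size,
    const std::vector<int64_t>& send_counts) {
  // Publish this rank's counts under a rank-specific key (key naming here is
  // hypothetical; the real code may also version keys per collective).
  std::vector<uint8_t> blob(send_counts.size() * sizeof(int64_t));
  std::memcpy(blob.data(), send_counts.data(), blob.size());
  store.set("a2av_counts_" + std::to_string(my_rank), blob);

  // get() blocks until each peer has published its row of the count matrix.
  std::vector<std::vector<int64_t>> counts(world_size);
  for (int64_t src = 0; src < world_size; ++src) {
    std::vector<uint8_t> peer_blob = store.get("a2av_counts_" + std::to_string(src));
    counts[src].resize(world_size);
    std::memcpy(counts[src].data(), peer_blob.data(), peer_blob.size());
  }

  AlltoallvMetadataSketch md;
  md.send_counts = send_counts;
  md.recv_counts.resize(world_size);
  md.send_offsets.assign(world_size, 0);
  md.remote_recv_offsets.assign(world_size, 0);
  for (int64_t dst = 0; dst < world_size; ++dst) {
    md.recv_counts[dst] = counts[dst][my_rank];
    if (dst > 0) {
      md.send_offsets[dst] = md.send_offsets[dst - 1] + send_counts[dst - 1];
    }
    // Our chunk lands on peer `dst` after the chunks of all lower-ranked sources.
    for (int64_t src = 0; src < my_rank; ++src) {
      md.remote_recv_offsets[dst] += counts[src][dst];
    }
  }
  return md;
}
```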

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant Dispatch as doMoEDispatch
    participant Combine as doMoECombine
    participant Metadata as prepareAlltoallvMetadata
    participant Kernel as launchAlltoallvKernel
    participant SymTensor as SymmetricTensor
    participant Barrier as alltoallvBarrier
    
    App->>Dispatch: x, topk_idx, topk_weights, is_token_in_rank
    Note over Dispatch: Sort tokens by destination rank
    Dispatch->>Dispatch: Reorder x, topk_idx, topk_weights by rank
    
    alt Backend == NCCL
        Dispatch->>Dispatch: Use ProcessGroup alltoall_base
    else Backend == CUDA
        Dispatch->>Metadata: prepareAlltoallvMetadata(n_tokens_to_rank)
        Note over Metadata: Exchange counts via TCPStore<br/>Compute offsets and metadata
        Metadata-->>Dispatch: AlltoallvMetadata
        
        Dispatch->>SymTensor: Allocate symmetric send/recv buffers
        SymTensor-->>Dispatch: Send/recv tensor handles
        
        Dispatch->>SymTensor: setupRemoteHandles() for each buffer
        Note over SymTensor: Exchange IPC handles across ranks
        
        loop For each payload (x, topk_idx, topk_weights, src_idx, src_rank)
            Dispatch->>Kernel: launchAlltoallvKernel(send, recv_ptrs, metadata)
            Note over Kernel: CUDA kernel copies data<br/>from send to recv buffers<br/>across all ranks
        end
        
        Dispatch->>Barrier: alltoallvBarrier()
        Note over Barrier: Synchronize all ranks
    end
    
    Note over Dispatch: Reorder by expert ID locally
    Dispatch-->>App: recv_x, recv_topk_idx, recv_topk_weights, recv_src_idx, recv_src_rank
    
    Note over App: ... Expert computation happens here ...
    
    App->>Combine: x, topk_weights, src_idx, src_rank, n_tokens_to/from_rank
    Note over Combine: Sort by source rank
    
    alt Backend == NCCL
        Combine->>Combine: Use ProcessGroup alltoall_base
    else Backend == CUDA
        Combine->>Metadata: prepareAlltoallvMetadata(n_tokens_from_rank)
        Metadata-->>Combine: AlltoallvMetadata
        
        Combine->>SymTensor: Allocate symmetric buffers
        Combine->>SymTensor: setupRemoteHandles()
        
        loop For each payload (x, topk_weights, src_idx)
            Combine->>Kernel: launchAlltoallvKernel(send, recv_ptrs, metadata)
        end
        
        Combine->>Barrier: alltoallvBarrier()
    end
    
    Note over Combine: Scatter by original token index
    Combine-->>App: combined_x, combined_topk_weights
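
The setupRemoteHandles() step in the diagram is where direct peer addressing is established. A minimal sketch of the underlying CUDA IPC calls is shown below; the helper names are hypothetical, and the transport that carries the serialized handle between ranks (for example, the same store used for the counts) is omitted.

```cpp
#include <cuda_runtime.h>

#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Export: serialize the IPC handle of a locally allocated device buffer so it
// can be shipped to peer ranks.
std::vector<uint8_t> exportIpcHandleSketch(void* local_dev_ptr) {
  cudaIpcMemHandle_t handle{};
  if (cudaIpcGetMemHandle(&handle, local_dev_ptr) != cudaSuccess) {
    throw std::runtime_error("cudaIpcGetMemHandle failed");
  }
  std::vector<uint8_t> blob(sizeof(handle));
  std::memcpy(blob.data(), &handle, sizeof(handle));
  return blob;
}

// Import: map a peer's buffer into this process; the returned pointer is what
// the scatter kernel would receive as one entry of recv_ptrs.
void* importIpcHandleSketch(const std::vector<uint8_t>& blob) {
  cudaIpcMemHandle_t handle{};
  std::memcpy(&handle, blob.data(), sizeof(handle));
  void* remote_ptr = nullptr;
  if (cudaIpcOpenMemHandle(&remote_ptr, handle, cudaIpcMemLazyEnablePeerAccess) !=
      cudaSuccess) {
    throw std::runtime_error("cudaIpcOpenMemHandle failed");
  }
  return remote_ptr;
}
```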

@greptile-apps greptile-apps bot left a comment

8 files reviewed, 1 comment

Comment on lines +36 to +38
if (!communicator_->isBackendAvailable(CommunicatorBackend::kNccl)) {
  GTEST_SKIP() << "Backend " << backend << " not available.";
}

logic: checks the wrong backend constant - should check the backend parameter, not the hardcoded kNccl

Suggested change
-if (!communicator_->isBackendAvailable(CommunicatorBackend::kNccl)) {
+if (!communicator_->isBackendAvailable(backend)) {
   GTEST_SKIP() << "Backend " << backend << " not available.";
 }
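
As a usage note, the sketch below shows how a value-parameterized GoogleTest suite typically gates on the backend under test, which is why the hardcoded kNccl check defeats the parameterization. The fixture, enum, and communicator here are hypothetical stand-ins for illustration, not nvFuser's real types.

```cpp
#include <gtest/gtest.h>

// Hypothetical stand-ins for the real fixture pieces (nvFuser's Communicator
// and CommunicatorBackend); only the gating logic is the point here.
enum class CommunicatorBackend { kNccl, kCuda };

struct FakeCommunicator {
  bool isBackendAvailable(CommunicatorBackend) const { return true; }
};

class DispatchCombineBackendSketch
    : public ::testing::TestWithParam<CommunicatorBackend> {
 protected:
  FakeCommunicator comm_;
  FakeCommunicator* communicator_ = &comm_;
};

TEST_P(DispatchCombineBackendSketch, GatesOnParameterizedBackend) {
  const CommunicatorBackend backend = GetParam();
  // The crux of the suggested change: query availability of the backend under
  // test rather than a hardcoded CommunicatorBackend::kNccl.
  if (!communicator_->isBackendAvailable(backend)) {
    GTEST_SKIP() << "Backend not available.";
  }
  // ... exercise doMoEDispatch / doMoECombine with `backend` ...
}

INSTANTIATE_TEST_SUITE_P(
    Backends,
    DispatchCombineBackendSketch,
    ::testing::Values(CommunicatorBackend::kNccl, CommunicatorBackend::kCuda));
```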
