@wujingyue
Collaborator

Otherwise I couldn't even locate the Expr in a printed fusion.

@wujingyue
Collaborator Author

!test

@github-actions

github-actions bot commented Jan 30, 2026

Description

  • Enhanced error messages in broadcast validation by including TensorView string representations

  • Replaced generic mesh rank validation messages with detailed debugging information

  • Added input_tv->toString() and output_tv->toString() to error messages for better debugging

Changes walkthrough

Relevant files
Enhancement
lower_to_communication.cpp
Enhanced error messages with TensorView debugging

csrc/host_ir/lower_to_communication.cpp

  • Modified NVF_ERROR_EQ calls for sender_mesh and receiver_mesh rank
    validation
  • Replaced generic error messages with detailed TensorView string
    representations
  • Added toString() calls to input_tv and output_tv for debugging
    purposes
  • +2/-10

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Error Message Improvement

    The PR successfully replaces generic error messages with more informative ones that include TensorView string representations. This change should help developers debug broadcast operations by providing tensor view details in error messages. The modification is safe and maintains the same error checking logic while improving debuggability.

    NVF_ERROR_EQ(sender_mesh.rank(), 1, "sender: ", input_tv->toString());
    NVF_ERROR_EQ(receiver_mesh.rank(), 1, "receiver: ", output_tv->toString());

    Test failures (partial, pipeline still running)

    • (Medium, 1) Thunder nvFuser scalar mismatch in nanogpt autograd test (test_networks)

      Test Name A100 Source
      thunder.tests.test_networks.test_nanogpt_complete_autograd_nvfuser_cuda_thunder.dtypes.float32

    @greptile-apps
    Contributor

    greptile-apps bot commented Jan 30, 2026

    Greptile Overview

    Greptile Summary

    This PR adjusts the error messages in lowerToBroadcast to print TensorView strings when validating that sender/receiver device meshes are 1D, making it easier to locate the corresponding TV/Expr when inspecting a printed fusion.

    The functional behavior of broadcast lowering remains the same; only the diagnostics emitted on rank mismatch are changed.

    Confidence Score: 4/5

    • This PR is likely safe to merge; it only changes diagnostic strings in an error path.
    • Change is localized to lowerToBroadcast and does not alter the lowering logic; main concern is a potential loss of useful debug context by no longer printing the offending DeviceMesh values.
    • csrc/host_ir/lower_to_communication.cpp

    Important Files Changed

    Filename: csrc/host_ir/lower_to_communication.cpp
    Overview: Tweaks Broadcast rank-check error messages to print TensorView strings instead of the device meshes; behavior unchanged but debug context may be less informative.

    Sequence Diagram

    sequenceDiagram
      participant Caller as convertSingleOpToCommunication
      participant Info as getCommunicationInfo
      participant Bcast as lowerToBroadcast
      participant Mesh as DeviceMesh
    
      Caller->>Info: compute CommunicationInfo(type)
      Info-->>Caller: type = Broadcast
      Caller->>Bcast: lowerToBroadcast(input_tv, output_tv)
      Bcast->>Mesh: sender_mesh = input_tv.getDeviceMesh()
      Bcast->>Mesh: receiver_mesh = output_tv.getDeviceMesh()
      Bcast->>Bcast: NVF_ERROR_EQ(sender_mesh.rank(), 1, ...)
      Bcast->>Bcast: NVF_ERROR_EQ(receiver_mesh.rank(), 1, ...)
      Bcast->>Mesh: root = sender_mesh.at(0)
      Bcast->>Bcast: team = receiver_mesh.vector()
      Bcast-->>Caller: comms.push_back(Communication::Broadcast)
    

    @greptile-apps greptile-apps bot left a comment

    1 file reviewed, 1 comment

    Comment on lines +151 to +152
    NVF_ERROR_EQ(sender_mesh.rank(), 1, "sender: ", input_tv->toString());
    NVF_ERROR_EQ(receiver_mesh.rank(), 1, "receiver: ", output_tv->toString());

    [P2] Error message regression: the previous message included the actual DeviceMesh ("Given ", sender_mesh/receiver_mesh). The new message prints TensorView::toString() but no longer prints the offending mesh, which can make rank mismatches harder to debug: you can no longer tell whether rank() is wrong because the mesh itself is unexpected or because a different TV than expected reached this check.

    Consider keeping the original context and appending the TV string, e.g. include sender_mesh/receiver_mesh in the message as well.
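    The suggestion above could look roughly like the sketch below. It is a self-contained illustration, not nvFuser code: `MockMesh`, `MockTensorView`, `checkEq`, `validateSender`, and `failureMessage` are hypothetical stand-ins (for `DeviceMesh`, `TensorView`, `NVF_ERROR_EQ`, and the rank check in `lowerToBroadcast`), and the exact message layout is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a device mesh; NOT nvFuser's DeviceMesh.
struct MockMesh {
  std::vector<int64_t> shape;  // e.g. {2, 4} for a 2D mesh
  int64_t rank() const { return static_cast<int64_t>(shape.size()); }
};

// Lets a MockMesh be streamed into an error message.
std::ostream& operator<<(std::ostream& os, const MockMesh& mesh) {
  os << "DeviceMesh{shape=[";
  for (std::size_t i = 0; i < mesh.shape.size(); ++i) {
    os << (i > 0 ? ", " : "") << mesh.shape[i];
  }
  return os << "]}";
}

// Hypothetical stand-in for a TensorView; only toString() is modeled.
struct MockTensorView {
  std::string name;
  std::string toString() const { return name; }
};

// Hypothetical equality check that streams all context arguments into the
// exception message (assumed behavior, for illustration only). Needs C++17.
template <typename... Args>
void checkEq(int64_t actual, int64_t expected, const Args&... context) {
  if (actual == expected) {
    return;
  }
  std::ostringstream oss;
  oss << "Expected " << expected << ", got " << actual << ". ";
  (oss << ... << context);
  throw std::runtime_error(oss.str());
}

// Reports BOTH the TensorView string and the offending mesh on failure.
void validateSender(const MockTensorView& input_tv, const MockMesh& sender_mesh) {
  checkEq(sender_mesh.rank(), 1,
          "sender: ", input_tv.toString(), ", given mesh: ", sender_mesh);
}

// Returns the failure message, or an empty string if the check passes.
std::string failureMessage(const MockTensorView& tv, const MockMesh& mesh) {
  try {
    validateSender(tv, mesh);
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "";
}

// Small self-check: a 2D mesh fails and the message carries both pieces.
void demo() {
  const std::string msg =
      failureMessage(MockTensorView{"T0_g_float"}, MockMesh{{2, 4}});
  assert(msg.find("sender: T0_g_float") != std::string::npos);
  assert(msg.find("given mesh: DeviceMesh{shape=[2, 4]}") != std::string::npos);
}
```

    With both pieces in the message, a reader can find the TV in a printed fusion and also see which mesh tripped the rank check. In the actual PR this would presumably amount to passing the mesh as an extra argument to the existing NVF_ERROR_EQ calls, since the previous message already streamed it.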
