Skip to content

Conversation

@liqiangxl
Copy link
Collaborator

To solve issue #5883 for better performance

@liqiangxl
Copy link
Collaborator Author

!test --diff

@github-actions
Copy link

github-actions bot commented Jan 27, 2026

Review updated until commit cbeb32d

Description

  • Add domain restriction logic to transpose scheduler to prevent incorrect usage with broadcast tensors

  • Check if allocation domains are fully mapped to detect broadcast-induced groupings

  • Force pointwise scheduler for broadcast cases instead of transpose scheduler

  • Update tests to verify correct scheduler selection for broadcast scenarios

Changes walkthrough

Relevant files
Bug fix
domain_map.cpp
Add broadcast detection to transpose scheduler domain mapping

csrc/scheduler/tools/domain_map.cpp

  • Added C++20 ranges include for range-based operations
  • Enhanced hasAtLeastTwoValidGroups function with broadcast detection
    logic
  • Added validation to check if allocation domains are fully mapped
    (indicates broadcast)
  • Return false for broadcast cases to force pointwise scheduler usage
  • Added error checking for assumption validation
  • +40/-1   
    Tests
    test_transpose.cpp
    Update transpose tests to verify correct scheduler selection

    tests/cpp/test_transpose.cpp

  • Added type.h include for data type utilities
  • Modified FusionScheduleBroadcastOnly test to use FusionExecutorCache
  • Updated test to verify PointWise scheduler is selected for broadcast
    cases
  • Modified FusionScheduleTransposeMissingDim test with similar scheduler
    verification
  • Added NoTransposeMaverick17B test case for specific broadcast scenario
    validation
  • +81/-13 

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review
    Logic validation

    The new logic correctly identifies broadcast cases by checking if all allocation domains are mapped. The validation that all_mapped implies any_bcast provides a good safety check. The approach is sound and addresses the performance issue mentioned in the PR description.

    const bool all_mapped = std::ranges::equal(
        ref1_loop, ref2_loop, [&](IterDomain* id1, IterDomain* id2) {
          return ca_map.areMapped(id1, id2, IdMappingMode::PERMISSIVE);
        });
    if (all_mapped) {
      // Not required, just to validate the assumption that all_mapped implies
      // any_bcast
      const bool any_bcast =
          std::ranges::any_of(
              ref1_loop, [](IterDomain* id) { return id->isBroadcast(); }) ||
          std::ranges::any_of(
              ref2_loop, [](IterDomain* id) { return id->isBroadcast(); });
      NVF_ERROR(
          any_bcast,
          "all_mapped implies any_bcast, ca_map:\n",
          ca_map.toString());
      return false;

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    @liqiangxl liqiangxl marked this pull request as ready for review January 29, 2026 16:14
    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Jan 29, 2026

    Greptile Overview

    Greptile Summary

    Adds logic to TransposeDomainMap::hasAtLeastTwoValidGroups to distinguish between tensor groups caused by actual permutation versus those caused by broadcasting. When all allocation domains between two reference tensors are mapped element-wise (using PERMISSIVE mapping mode), the code now recognizes this as a broadcast pattern and returns false to prevent the transpose scheduler from being selected. Instead, the pointwise scheduler is used, which handles broadcast operations more efficiently through unrolling and caching.

    Key changes:

    • Added #include <ranges> for C++20 range algorithms
    • Checks if allocation domains of both references are fully mapped using std::ranges::equal
    • Returns false when all domains are mapped (indicating broadcast pattern)
    • Includes validation assertion to ensure broadcast dimensions exist when domains are fully mapped
    • Updates three existing tests to verify pointwise scheduler selection for broadcast cases
    • Adds new test NoTransposeMaverick17B demonstrating the fix for issue Should use pointwise scheudler but transpose scheduler is used #5883

    Confidence Score: 4/5

    • This PR is safe to merge with low risk - it refines scheduler selection logic and includes comprehensive test coverage
    • Score reflects a well-reasoned change with good documentation and tests, but includes a strict assertion (NVF_ERROR) that could potentially fail in unexpected edge cases if the assumption doesn't hold universally across all fusion patterns
    • csrc/scheduler/tools/domain_map.cpp - verify the NVF_ERROR assumption holds for all broadcast patterns in production workloads

    Important Files Changed

    Filename Overview
    csrc/scheduler/tools/domain_map.cpp Adds broadcast detection to prevent transpose scheduler selection when groups are caused by broadcasting rather than true permutation
    tests/cpp/test_transpose.cpp Updates three tests to verify pointwise scheduler is selected for broadcast-only cases instead of transpose scheduler

    Sequence Diagram

    sequenceDiagram
        participant Scheduler as Transpose Scheduler
        participant TDM as TransposeDomainMap
        participant CAMap as ComputeAtMap
        
        Scheduler->>TDM: hasAtLeastTwoValidGroups(fusion)
        TDM->>TDM: groupInputsOutputsByInnerDim()
        alt Less than 2 groups
            TDM-->>Scheduler: false (reject transpose)
        end
        
        TDM->>TDM: findReferenceFor(group1)
        TDM->>TDM: findReferenceFor(group2)
        alt Either reference is null
            TDM-->>Scheduler: false (reject transpose)
        end
        
        TDM->>TDM: getMappedAllocDimIn(ref1, innermost2)
        alt mapped_id is null
            TDM-->>Scheduler: false (reject transpose)
        end
        
        Note over TDM: NEW CHECK ADDED IN THIS PR
        TDM->>TDM: Get allocation domains for ref1 and ref2
        TDM->>CAMap: Check if all domains are mapped (PERMISSIVE)
        
        alt All domains mapped
            TDM->>TDM: Check for broadcast dimensions
            alt Has broadcast
                Note over TDM: Groups caused by broadcast,<br/>not transpose
                TDM-->>Scheduler: false (use pointwise instead)
            else No broadcast
                TDM->>TDM: ERROR - all_mapped implies broadcast
            end
        else Some domains not mapped
            Note over TDM: Groups caused by permutation<br/>(real transpose case)
            TDM-->>Scheduler: true (use transpose scheduler)
        end
    
    Loading

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    2 files reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    @liqiangxl liqiangxl requested review from naoyam and rdspring1 January 29, 2026 18:03
    return false;
    }

    // For grouping caused by permutation, the corresponding loop domains should
    Copy link
    Collaborator

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Allocation domains instead of loop domains?

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    yes, revised.

    }

    // For grouping caused by permutation, the corresponding loop domains should
    // not be all mapped to each other. If they are, it means the two groups are
    Copy link
    Collaborator

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Can you add an example pattern here as a comment?

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    added.

      // For example, in TransposeTest.NoTransposeMaverick17B, two inputs
      // are tv0[i0, i1] and tv1[i2, b3] where i0/i2 and i1/b3 are mapped to each
      // other. However, tv0 and tv1 are in two different groups because of the
      // broadcast. In this case, we should use the pointwise scheduler instead of
      // the transpose scheduler.
    

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    2 files reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants