
Conversation

@naoyam (Collaborator) commented Jan 31, 2026

Gather allows the non-gathered dimensions of the index tensor to be smaller than the corresponding input dimensions, which shrinks those output dimensions. This complicates indexing and is not yet supported by TensorIndexer. Note that takeAlongAxis, which is a restricted form of gather, is supported.

One way to support the general case would be to decompose it into takeAlongAxis and slice ops. For now, this PR disables codegen of gather and delegates it to ExprEval.
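The decomposition idea can be sketched outside nvFuser. The NumPy snippet below is an illustrative analogue, not nvFuser code: it shows that a non-exact gather, where non-gathered index dimensions are smaller than the input's, is equivalent to slicing the input down to the index extents and then applying take-along-axis.

```python
import numpy as np

def non_exact_gather(x, dim, idx):
    # Reference semantics (as in torch.gather): out.shape == idx.shape,
    # out[pos] = x[pos with pos[dim] replaced by idx[pos]]
    out = np.empty(idx.shape, dtype=x.dtype)
    for pos in np.ndindex(*idx.shape):
        src = list(pos)
        src[dim] = idx[pos]
        out[pos] = x[tuple(src)]
    return out

def decomposed(x, dim, idx):
    # Slice non-gathered dims of x down to the index extents, then the
    # remaining gather is exactly a take-along-axis.
    slices = tuple(
        slice(None) if d == dim else slice(0, idx.shape[d])
        for d in range(x.ndim)
    )
    return np.take_along_axis(x[slices], idx, axis=dim)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 6, 7))
# Non-gathered dims 0 and 2 are smaller than the input's (3 < 5, 2 < 7)
idx = rng.integers(0, 6, size=(3, 4, 2))
assert np.array_equal(non_exact_gather(x, 1, idx), decomposed(x, 1, idx))
```

Whether nvFuser would express the slice before or after the takeAlongAxis is a scheduling question this sketch does not address.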

@naoyam (Collaborator, Author) commented Jan 31, 2026

!test

@github-actions bot commented Jan 31, 2026

Description

  • Dropped codegen support for non-exact gather operations

  • Added validation in TensorIndexer to ensure fusion support

  • Updated scheduler to reject fusions with non-exact gather ops

  • Modified tests to use ExprEval scheduler or disabled non-exact gather tests

Changes walkthrough

Relevant files

Enhancement

csrc/id_model/indexing.cpp: Add validation check for fusion support (+2/-0)

  • Added NVF_ERROR validation check in the TensorIndexer constructor
  • Ensures the fusion is supported before building the loop index map

csrc/scheduler/expr_eval_sched.cpp: Block GatherOp from ExprEval scheduling (+1/-0)

  • Added GatherOp to the list of operations unsupported by ExprEvalScheduler
  • Prevents gather operations from using the expression evaluation scheduler

csrc/scheduler/registry.cpp: Add non-exact gather validation in scheduler (+10/-0)

  • Added a check for non-exact gather operations in scheduler validation
  • Rejects fusions with non-exact gather ops with a specific error message
  • Maintains support for exact gather operations

Tests

tests/cpp/test_gather.cpp: Update and disable gather-related tests (+5/-3)

  • Updated test segmentation validation to use the ExprEval scheduler
  • Disabled two tests for non-exact gather operations
  • Added explanatory comments about dropped codegen support

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Gather validation logic

The check for non-exact gather operations uses the gather->exactSizes() method. Verify that this method correctly identifies every case where a gather operation has non-gathered index dimensions smaller than the input's, as described in the PR description.

    if (std::ranges::any_of(
            ir_utils::getOpsOfType<GatherOp>(fusion),
            [](GatherOp* gather) { return !gather->exactSizes(); })) {
      scheduler_debug_utils::canScheduleRejectReason(
          scheduler_type, "Non-exact gather ops");
      return false;
    }
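The registry check above hinges on what exactSizes() returns. Assuming, per the PR description, that a gather is "exact" precisely when every non-gathered index extent matches the input extent (the takeAlongAxis case), the predicate can be mirrored in Python. The helper name and shapes here are illustrative only, not nvFuser's actual implementation.

```python
def exact_sizes(input_shape, dim, index_shape):
    # Assumption from the PR description: "exact" means every non-gathered
    # dimension of the index matches the corresponding input dimension.
    assert len(input_shape) == len(index_shape)
    return all(
        d == dim or input_shape[d] == index_shape[d]
        for d in range(len(input_shape))
    )

# takeAlongAxis-style gather keeps non-gathered extents: exact
assert exact_sizes((8, 16), 1, (8, 4))
# A shrunken non-gathered dimension makes it non-exact: rejected by the
# scheduler check above and delegated to ExprEval
assert not exact_sizes((8, 16), 1, (5, 4))
```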
Test coverage for disabled gather tests

Two tests are disabled (DISABLED_GatherIterGoupedReduction and DISABLED_SameTvUsedAsLookupAndIndex). Ensure these tests are properly documented and that alternative tests or validation mechanisms verify the functionality still works through ExprEval delegation.

    TEST_F(GatherTest, DISABLED_GatherIterGoupedReduction) {
      const int max_dim_size = 128;
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto options_i = at::TensorOptions().dtype(at::kLong).device(at::kCUDA, 0);
    
      int rank = 3;
      int dim = 2;
    
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      TensorView* tv1 = makeContigTensor(rank);
      TensorView* tv_idx = makeContigTensor(rank, DataType::Int);
      fusion.addInput(tv1);
      fusion.addInput(tv_idx);
      auto tv_gather = gather(tv1, dim, tv_idx);
      auto tv_sum = sum(tv_gather, {0}, false);
      fusion.addOutput(tv_sum);
    
      // simply gather all elements
      auto input_dims =
          std::vector<int64_t>({max_dim_size, max_dim_size, max_dim_size});
      auto index_dims = input_dims;
      std::vector<int64_t> input2_dims(rank - 1, 0);
      for (int idim = 0; idim < rank - 1; ++idim) {
        input2_dims[idim] = index_dims[idim + 1];
      }
    
      at::Tensor t0 = at::randn(input_dims, options);
      at::Tensor idx = at::randint(0, input_dims[dim], index_dims, options_i);
    
      auto reduction_scheduler =
          SchedulerEntry::makeSchedulerInstance(SchedulerType::Reduction);
      SchedulerRuntimeInfo runtime_info(&fusion, {t0, idx});
      auto heuristic_params =
          reduction_scheduler->computeHeuristics(&fusion, runtime_info);
      auto rparams = heuristic_params->as<ReductionParams>();
    
      // Enforce vectorization so we can group them
      const int vect_factor = 2;
      rparams->vectorize_iter_dom = true;
      rparams->unroll_factor_iter_dom = vect_factor;
      // Enforce grid reduction, which requires a determined BIDy
      // If the heuristic does not have a BIDy, bind it to 2
      rparams->cross_grid_inner_reduction = true;
      rparams->split_grid_dim_inner_reduction = true;
      rparams->grid_dim_inner_reduction = ParallelType::BIDy;
      if (!rparams->lparams.hasDim(ParallelType::BIDy)) {
        rparams->lparams.bind(2L, ParallelType::BIDy);
      }
    
      reduction_scheduler->schedule(&fusion, rparams);
    
      // lowering & check iteration grouped reductions
      GpuLower gpulw(&fusion);
      gpulw.run();
      NVF_CHECK(
          gpulw.kernel()->summary().has_iter_grouped_reductions,
          "There must be iter domain grouped reductions.");
      NVF_CHECK(
          gpulw.kernel()->summary().num_grouped_iterations == vect_factor,
          "Expected ",
          vect_factor,
          " grouped iterations, found ",
          gpulw.kernel()->summary().num_grouped_iterations);
    
      KernelExecutor ke;
      auto lparams = rparams->lparams;
      ke.compile(&fusion, {t0, idx}, lparams);
      auto cg_outputs = ke.run({t0, idx}, {}, lparams);
    
      auto t_gather = at::gather(t0, dim, idx);
      testValidate(
          &fusion,
          cg_outputs,
          {t0, idx},
          {t_gather.sum(0)},
          __LINE__,
          __FILE__,
          "",
          lparams);
    }
    
    // Codegen support of non-exact gather dropped
    TEST_F(GatherTest, DISABLED_SameTvUsedAsLookupAndIndex) {

Test failures

• (Medium, 6) NVFuser internal assert (TensorIndexer isSupported) on A100, GB200, and H100:
  • PersistentBufferTest.BufferGatherLookupTv
  • ReductionTest.CrossEntropyGatherPattern
• (Medium, 2) Thunder nvFuser nanoGPT autograd returns a zero scalar on CUDA (A100, GB200):
  • thunder.tests.test_networks.test_nanogpt_complete_autograd_nvfuser_cuda_thunder.dtypes.float32
