[AIROCMLIR-488] Fix functional issues from new quick tuned perf configs for attention #2237

Open

justinrosner wants to merge 6 commits into develop from 488-functional-fix

Conversation

justinrosner (Contributor) commented Feb 11, 2026

Motivation

This PR fixes two issues first uncovered by #2203, which introduced new quick-tune lists for Attention. The new perfConfigs exposed failures when running some attention unit tests in the MIGraphX CI.

Implements: https://amd-hub.atlassian.net/browse/AIROCMLIR-488

Technical Details

First Issue (BlockwiseGemmToThreadwise)

The root cause is that when blockSize is not evenly divisible by the non-reduction dimension, the thread mapping creates LDS aliasing into the next row's data.

The fix ensures that rthreads evenly divides rDimSize by choosing the largest rthreads <= maxRThreads that is a divisor of rDimSize, as sketched below.
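A minimal sketch of that divisor search, assuming a freestanding helper (the name largestDividingRThreads and its signature are illustrative, not the in-tree code):

```cpp
#include <algorithm>
#include <cstdint>

// Largest rthreads <= maxRThreads that evenly divides rDimSize, so each
// row of the non-reduction dimension maps to a whole number of threads
// and no thread's LDS slice aliases the next row's data.
// Hypothetical helper; BlockwiseGemmToThreadwise.cpp may structure this
// differently.
int64_t largestDividingRThreads(int64_t maxRThreads, int64_t rDimSize) {
  for (int64_t rthreads = std::min(maxRThreads, rDimSize); rthreads > 1;
       --rthreads)
    if (rDimSize % rthreads == 0)
      return rthreads;
  return 1; // one thread per row always divides evenly
}
```

For example, with rDimSize = 10 and maxRThreads = 4, the search rejects 4 and 3 and settles on rthreads = 2, which divides the row evenly instead of aliasing into the next row.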

Second Issue (RockPipeline)

In the pipelined loop, stages from different iterations run in the same time slot. When two stages share a register buffer (a RAW dependency), the reader must run first so it consumes the old value before the writer overwrites it. The problem was that the old code performed pairwise swaps to put each reader before its writer, e.g., swap(2,3) then swap(3,4); since both swaps touch stage 3, two swaps on overlapping indices cannot achieve a three-element rotation, and one pair always ends up in the wrong order, as the toy example below shows. See test_rock_pipeline for a detailed example.
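A toy illustration of the failure mode, using the stage numbers from the example above (the real pass reorders pipeline stage lists, not integers):

```cpp
#include <algorithm>
#include <array>
#include <cstdio>

// Swap the slots holding stages a and b, mimicking the old pairwise fix-up.
void swapStages(std::array<int, 3> &order, int a, int b) {
  std::iter_swap(std::find(order.begin(), order.end(), a),
                 std::find(order.begin(), order.end(), b));
}

int main() {
  std::array<int, 3> order = {2, 3, 4}; // program order: S2, S3, S4
  swapStages(order, 2, 3); // reader S3 before writer S2 -> {3, 2, 4}
  swapStages(order, 3, 4); // reader S4 before writer S3 -> {4, 2, 3}
  // Result is S4, S2, S3: the second swap moved S3 behind S2 again, so
  // one RAW pair stays violated; the required rotation is S4, S3, S2.
  for (int s : order)
    std::printf("S%d ", s);
  std::printf("\n");
}
```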

The fix replaces the pairwise-swap loop with a topological sort that finds an ordering satisfying all of the RAW constraints.
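A minimal sketch of that sort, assuming an edge (u, v) means "stage u must run before stage v" (the function and container names are illustrative; RockPipeline.cpp works on its own stage structures):

```cpp
#include <queue>
#include <utility>
#include <vector>

// Kahn's algorithm: repeatedly emit a stage with no unsatisfied
// predecessors, then release its successors. Edge (u, v) encodes
// "u before v", e.g. a reader before the writer that clobbers its
// register buffer. Stages are indexed 0 .. numStages-1.
std::vector<int> topoSortStages(int numStages,
                                const std::vector<std::pair<int, int>> &edges) {
  std::vector<std::vector<int>> succ(numStages);
  std::vector<int> indegree(numStages, 0);
  for (auto [u, v] : edges) {
    succ[u].push_back(v);
    ++indegree[v];
  }
  std::queue<int> ready;
  for (int s = 0; s < numStages; ++s)
    if (indegree[s] == 0)
      ready.push(s);
  std::vector<int> order;
  while (!ready.empty()) {
    int s = ready.front();
    ready.pop();
    order.push_back(s);
    for (int t : succ[s])
      if (--indegree[t] == 0)
        ready.push(t);
  }
  return order; // fewer than numStages entries would signal a cycle
}
```

For the three-stage example above, edges {(4,3), (3,2)} order the stages as S4 before S3 before S2, exactly the rotation the pairwise swaps could not produce.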

Test Plan

  • Nightly CI
  • Rerun the failing MIGraphX unit tests on gfx950 and gfx90a

Test Result

  • Nightly CI
  • gfx950 MIGraphX Unit Tests
  • gfx90a MIGraphX Unit Tests

Submission Checklist

justinrosner changed the title from "Fix functional issues from new quick tuned perf configs for attention" to "[AIROCMLIR-488] Fix functional issues from new quick tuned perf configs for attention" on Feb 11, 2026

Copilot AI left a comment

Pull request overview

This pull request fixes two functional issues that were uncovered by PR #2203, which introduced new quick-tune performance configurations for Attention operations. The issues manifested during MIGraphX CI testing on gfx950 and gfx90a architectures.

Changes:

  • Fixed LDS aliasing bug in BlockwiseGemmToThreadwise when block size is not evenly divisible by the non-reduction dimension
  • Fixed incorrect stage ordering in RockPipeline by replacing pairwise swaps with topological sort for handling chained RAW dependencies
  • Added regression tests for both fixes and architecture-specific attention tests

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Summary per file:

  • mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp: Adds logic to find the largest rthreads value that evenly divides rDimSize, preventing LDS aliasing issues
  • mlir/lib/Dialect/Rock/Transforms/RockPipeline.cpp: Replaces simple pairwise stage swaps with Kahn's topological sort algorithm to correctly handle multi-stage RAW dependency chains
  • mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir: Adds test case verifying the rthreads fix with blockSize=10 and 3x10 tensor shape
  • mlir/test/Dialect/Rock/test_rock_pipeline.mlir: Adds test case for three-way stage rotation with chained RAW dependencies
  • mlir/test/fusion/pr-e2e/attention/gfx950/mixr-attention-padded-nperblock.mlir: Regression test for gfx950-specific perf config that triggered the issues
  • mlir/test/fusion/pr-e2e/attention/gfx950/lit.local.cfg: Configuration to run gfx950 tests only on gfx950 hardware
  • mlir/test/fusion/pr-e2e/attention/gfx90a/mixr-attention-padded-nperblock.mlir: Regression test for gfx90a-specific perf config that triggered the issues
  • mlir/test/fusion/pr-e2e/attention/gfx90a/lit.local.cfg: Configuration to run gfx90a tests only on gfx90a hardware


%regD = memref.view %rawRegD[%c0][] : memref<16xi8, #gpu.address_space<private>> to memref<16xi8, #gpu.address_space<private>>

// CHECK: scf.for
// Three-way rotation: S4 before S3 before S32
Copilot AI Feb 11, 2026

Typo in comment: "S32" should be "S2" to correctly describe the three-way rotation (S4 before S3 before S2).

Suggested change
// Three-way rotation: S4 before S3 before S32
// Three-way rotation: S4 before S3 before S2

bdevorem (Contributor) commented:

I have confirmed this fixes the regular attention and flash decoding failures I've been seeing as well, on the migraphx side. ty!!

