[AIROCMLIR-488] Fix functional issues from new quick tuned perf configs for attention by justinrosner · Pull Request #2237 · ROCm/rocMLIR

justinrosner · 2026-02-11T14:52:10Z

Motivation

This PR fixes two issues that were first uncovered with #2203. This PR introduced new quick tune lists for Attention, and these new perfConfigs revealed issues when running some attention unit tests in the MIGraphX CI.

Implements: https://amd-hub.atlassian.net/browse/AIROCMLIR-488

Technical Details

First Issue (BlockwiseGemmToThreadwise)

The root cause is that when blockSize is not evenly divisible by the non-reduction dimension, the thread mapping creates LDS aliasing into the next row's data.

This fix ensures that rthreads evenly divides rDimSize by finding the largest rthreads <= maxRThreads that is a divisor of rDim.

Second Issue (RockPipeline)

In the pipelined loop, stages from different iterations run in the same time slot. When two stages share a register buffer (a RAW dependency), we need the reader to run first so it consumes the old value before the writer overwrites it. The problem with this was that the old code was doing pairwise swaps to put each reader before it's writer, e.g., swap(2,3) then swap(3,4), bit since both swaps touch index 3, two swaps on overlapping indices cannot achieve a 3-element rotation. You would always end up with one pair in the wrong order. See test_rock_pipeline for a detailed example of this.

The solution to this is to replace the pairwise swap loop with a topological sort to find the ordering that satisfies all of the RAW constraints.

Test Plan

Nightly CI
Rerun the failing MIGraphX unit tests on gfx950 and gfx90a

Test Result

Nightly CI
gfx950 MIGraphX Unit Tests
gfx90a MIGraphX Unit Tests

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This pull request fixes two functional issues that were uncovered by PR #2203, which introduced new quick-tune performance configurations for Attention operations. The issues manifested during MIGraphX CI testing on gfx950 and gfx90a architectures.

Changes:

Fixed LDS aliasing bug in BlockwiseGemmToThreadwise when block size is not evenly divisible by the non-reduction dimension
Fixed incorrect stage ordering in RockPipeline by replacing pairwise swaps with topological sort for handling chained RAW dependencies
Added regression tests for both fixes and architecture-specific attention tests

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp	Adds logic to find the largest rthreads value that evenly divides rDimSize, preventing LDS aliasing issues
mlir/lib/Dialect/Rock/Transforms/RockPipeline.cpp	Replaces simple pairwise stage swaps with Kahn's topological sort algorithm to correctly handle multi-stage RAW dependency chains
mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir	Adds test case verifying the rthreads fix with blockSize=10 and 3x10 tensor shape
mlir/test/Dialect/Rock/test_rock_pipeline.mlir	Adds test case for three-way stage rotation with chained RAW dependencies
mlir/test/fusion/pr-e2e/attention/gfx950/mixr-attention-padded-nperblock.mlir	Regression test for gfx950-specific perf config that triggered the issues
mlir/test/fusion/pr-e2e/attention/gfx950/lit.local.cfg	Configuration to run gfx950 tests only on gfx950 hardware
mlir/test/fusion/pr-e2e/attention/gfx90a/mixr-attention-padded-nperblock.mlir	Regression test for gfx90a-specific perf config that triggered the issues
mlir/test/fusion/pr-e2e/attention/gfx90a/lit.local.cfg	Configuration to run gfx90a tests only on gfx90a hardware

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-11T14:59:55Z

mlir/test/Dialect/Rock/test_rock_pipeline.mlir

+    %regD = memref.view %rawRegD[%c0][] : memref<16xi8, #gpu.address_space<private>> to memref<16xi8, #gpu.address_space<private>>
+
+    // CHECK: scf.for
+    //   Three-way rotation: S4 before S3 before S32


Typo in comment: "S32" should be "S2" to correctly describe the three-way rotation (S4 before S3 before S2).

Suggested change

// Three-way rotation: S4 before S3 before S32

// Three-way rotation: S4 before S3 before S2

bdevorem · 2026-02-11T19:23:17Z

I have confirmed this fixes the regular attention and flash decoding failures I've been seeing as well, on the migraphx side. ty!!

justinrosner added 5 commits February 10, 2026 19:51

Add fix for rthreads calculation

a471996

Fix bug in pipeline reordering

e3edb0a

Add E2E tests

142e085

Add LIT test for BlockwiseGemmToThreadwise changes

1960626

Add LIT test for RockPipeline

ba13dd8

justinrosner requested review from Copilot, dhernandez0, pabloantoniom and umangyadav February 11, 2026 14:52

justinrosner requested a review from causten as a code owner February 11, 2026 14:52

justinrosner changed the title ~~Fix functional issues from new quick tuned perf configs for attention~~ [AIROCMLIR-488] Fix functional issues from new quick tuned perf configs for attention Feb 11, 2026

Copilot started reviewing on behalf of justinrosner February 11, 2026 14:54 View session

Copilot AI reviewed Feb 11, 2026

View reviewed changes

Merge branch 'develop' into 488-functional-fix

1a5f13d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AIROCMLIR-488] Fix functional issues from new quick tuned perf configs for attention#2237

[AIROCMLIR-488] Fix functional issues from new quick tuned perf configs for attention#2237
justinrosner wants to merge 6 commits intodevelopfrom
488-functional-fix

justinrosner commented Feb 11, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 11, 2026

Uh oh!

bdevorem commented Feb 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	// Three-way rotation: S4 before S3 before S32
	// Three-way rotation: S4 before S3 before S2

Conversation

justinrosner commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Technical Details

First Issue (BlockwiseGemmToThreadwise)

Second Issue (RockPipeline)

Test Plan

Test Result

Submission Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

bdevorem commented Feb 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

justinrosner commented Feb 11, 2026 •

edited

Loading