[AIROCMLIR-488] Fix functional issues from new quick tuned perf configs for attention#2237
[AIROCMLIR-488] Fix functional issues from new quick tuned perf configs for attention#2237justinrosner wants to merge 6 commits intodevelopfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request fixes two functional issues that were uncovered by PR #2203, which introduced new quick-tune performance configurations for Attention operations. The issues manifested during MIGraphX CI testing on gfx950 and gfx90a architectures.
Changes:
- Fixed LDS aliasing bug in BlockwiseGemmToThreadwise when block size is not evenly divisible by the non-reduction dimension
- Fixed incorrect stage ordering in RockPipeline by replacing pairwise swaps with topological sort for handling chained RAW dependencies
- Added regression tests for both fixes and architecture-specific attention tests
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp | Adds logic to find the largest rthreads value that evenly divides rDimSize, preventing LDS aliasing issues |
| mlir/lib/Dialect/Rock/Transforms/RockPipeline.cpp | Replaces simple pairwise stage swaps with Kahn's topological sort algorithm to correctly handle multi-stage RAW dependency chains |
| mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir | Adds test case verifying the rthreads fix with blockSize=10 and 3x10 tensor shape |
| mlir/test/Dialect/Rock/test_rock_pipeline.mlir | Adds test case for three-way stage rotation with chained RAW dependencies |
| mlir/test/fusion/pr-e2e/attention/gfx950/mixr-attention-padded-nperblock.mlir | Regression test for gfx950-specific perf config that triggered the issues |
| mlir/test/fusion/pr-e2e/attention/gfx950/lit.local.cfg | Configuration to run gfx950 tests only on gfx950 hardware |
| mlir/test/fusion/pr-e2e/attention/gfx90a/mixr-attention-padded-nperblock.mlir | Regression test for gfx90a-specific perf config that triggered the issues |
| mlir/test/fusion/pr-e2e/attention/gfx90a/lit.local.cfg | Configuration to run gfx90a tests only on gfx90a hardware |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| %regD = memref.view %rawRegD[%c0][] : memref<16xi8, #gpu.address_space<private>> to memref<16xi8, #gpu.address_space<private>> | ||
|
|
||
| // CHECK: scf.for | ||
| // Three-way rotation: S4 before S3 before S32 |
There was a problem hiding this comment.
Typo in comment: "S32" should be "S2" to correctly describe the three-way rotation (S4 before S3 before S2).
| // Three-way rotation: S4 before S3 before S32 | |
| // Three-way rotation: S4 before S3 before S2 |
|
I have confirmed this fixes the regular attention and flash decoding failures I've been seeing as well, on the migraphx side. ty!! |
Motivation
This PR fixes two issues that were first uncovered with #2203. This PR introduced new quick tune lists for Attention, and these new perfConfigs revealed issues when running some attention unit tests in the MIGraphX CI.
Implements: https://amd-hub.atlassian.net/browse/AIROCMLIR-488
Technical Details
First Issue (BlockwiseGemmToThreadwise)
The root cause is that when
blockSizeis not evenly divisible by the non-reduction dimension, the thread mapping creates LDS aliasing into the next row's data.This fix ensures that
rthreadsevenly dividesrDimSizeby finding the largestrthreads <= maxRThreadsthat is a divisor ofrDim.Second Issue (RockPipeline)
In the pipelined loop, stages from different iterations run in the same time slot. When two stages share a register buffer (a RAW dependency), we need the reader to run first so it consumes the old value before the writer overwrites it. The problem with this was that the old code was doing pairwise swaps to put each reader before it's writer, e.g., swap(2,3) then swap(3,4), bit since both swaps touch index 3, two swaps on overlapping indices cannot achieve a 3-element rotation. You would always end up with one pair in the wrong order. See
test_rock_pipelinefor a detailed example of this.The solution to this is to replace the pairwise swap loop with a topological sort to find the ordering that satisfies all of the RAW constraints.
Test Plan
Test Result
Submission Checklist