[WIP - NOT READY FOR REVIEW] Paged Attention: rocmlir-gen changes #2222
Open
justinrosner wants to merge 3 commits into 42-paged-attention-rocmlir from
Conversation
Pull request overview
This pull request adds paged attention support to rocmlir-gen, a code generation tool for MLIR-based ROCm kernels. Paged attention is an optimization technique that allows attention mechanisms to work with non-contiguous memory pages, improving memory efficiency for large language models.
Changes:
- Adds command-line options (`--paged-attention`, `--page-size`, `--num-pages`) to enable and configure paged attention mode
- Modifies attention kernel generation to use page tables (arrays of i64 pointers) instead of direct K/V tensor inputs
- Implements GPU kernel logic with `rock.deref` operations to dereference page tables and transform paged data into attention-compatible shapes
- Adds a CPU validation path that reconstructs regular K/V tensors from the paged cache for correctness verification (see the sketch after this list)
- Includes comprehensive test coverage with an MLIR test file and e2e test configurations
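To make the paged addressing concrete, here is a minimal CPU-side sketch of how a logical token index maps through a page table and how a contiguous K/V tensor can be rebuilt from the paged cache for validation. All names, the float element type, and the row-major page layout are illustrative assumptions, not rocmlir-gen's actual data structures:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical paged K/V cache: logically contiguous tokens live in
// non-contiguous physical pages, reached through a page table.
struct PagedCache {
  std::vector<const float *> pageTable; // physical base address per logical page
  int64_t pageSize;                     // tokens per page
  int64_t headDim;                      // elements per token row
};

// Split a logical token index into (page, offset) and dereference the
// page table to find that token's K or V row.
const float *tokenRow(const PagedCache &c, int64_t token) {
  int64_t page = token / c.pageSize;
  int64_t offset = token % c.pageSize;
  return c.pageTable[page] + offset * c.headDim;
}

// Validation-style reconstruction: gather the paged cache back into one
// contiguous tensor so results can be checked against a regular kernel.
std::vector<float> reconstructContiguous(const PagedCache &c, int64_t seqLen) {
  std::vector<float> out(static_cast<size_t>(seqLen * c.headDim));
  for (int64_t t = 0; t < seqLen; ++t) {
    const float *row = tokenRow(c, t);
    std::copy(row, row + c.headDim, out.begin() + t * c.headDim);
  }
  return out;
}
```

For example, with `pageSize = 16`, token 37 resolves to page 2, offset 5; the GPU path performs the same page-table dereference in-kernel via `rock.deref` views instead of copying.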
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| mlir/tools/rocmlir-gen/rocmlir-gen.cpp | Core implementation: adds paged attention command-line options, validation logic, GPU kernel generation with page table dereferencing and transforms, CPU validation with cache buffer management and shuffling, and host harness logic for page table population |
| mlir/test/rocmlir-gen/paged-attention-kernel.mlir | Comprehensive test file verifying paged attention kernel signature, rock.deref operations, transforms, and validation function with both single-head and GQA configurations |
| mlir/test/e2e/PrAttentionSchedule.toml | Adds e2e test case for paged attention with schedule version 2 |
| mlir/test/e2e/PrAttentionI8.toml | Adds e2e test case for paged attention with int8 quantization |
| mlir/test/e2e/PrAttentionF32.toml | Adds e2e test case for paged attention with f32 data type |
| mlir/test/e2e/PrAttentionF16.toml | Adds e2e test case for paged attention with f16 data type |
| mlir/test/e2e/PrAttentionDirectToLDS.toml | Adds e2e test case for paged attention with direct-to-LDS optimization |
| mlir/test/e2e/PrAttentionBF16.toml | Adds e2e test case for paged attention with bf16 data type |
| mlir/test/e2e/AttentionSchedule.toml | Adds e2e test case for paged attention with standard schedule |
| mlir/test/e2e/AttentionNonPowerOfTwoTileSize.toml | Adds e2e test case for paged attention with non-power-of-two tile sizes |
Motivation
This PR adds end-to-end testing infrastructure for paged attention in rocmlir-gen, enabling generation of both GPU kernels and CPU validation functions that properly handle paged K/V caches with shuffled page table addressing.
Implements: https://amd-hub.atlassian.net/browse/AIROCMLIR-439
Technical Details
New command line options:
- `--paged-attention`: enables paged attention mode
- `--page-size`: configures the size of each K/V cache page
- `--num-pages`: configures the number of pages in the K/V cache
Example Usage:
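What an invocation might look like; only `--paged-attention`, `--page-size`, and `--num-pages` are confirmed by this PR (with illustrative values), and the trailing placeholder stands in for whatever attention shape and type flags the tool already accepts:

```sh
# Hypothetical invocation sketch: the three paged-attention flags come from
# this PR; replace the placeholder with the usual rocmlir-gen attention options.
rocmlir-gen --paged-attention --page-size=16 --num-pages=64 <usual attention flags...>
```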
Key Changes:
- `rock.deref` ops to create virtual views of paged K/V data
- `rock.attention` with `keyAddresses`/`valueAddresses` pointing to deref outputs

Test Plan
Test Result
Submission Checklist