Better vector interleaves on x86 #8925
The previous comment reported a time that seemed to have regressed. It was not 8.2ms on main; it was more like 11ms.
Also notable: LLVM is happy to undo all this shuffle factorization work, fuse the shuffles back together, and just make a big mess. So a new mechanism in this PR is optimization_fence, which abuses LLVM's arithmetic fence intrinsic to prevent fusion of shuffle instructions (it's supposed to be used to prevent reassociation of floating point ops).
alexreinking
left a comment
A couple comment nits, but otherwise, everything read very clearly. Did you write most of these comments yourself, or did Claude?
I wrote them all myself. And re-reading it all myself I found a bunch of stuff I didn't like and made more changes. Maybe it existed in the original python too but at this point I've rewritten enough of the code that I don't think this counts as coauthored by claude anymore.
Using the strategy in CodeGen_LLVM for big vector interleaves (repeated 2-way interleaves), LLVM generates pretty poor code on x86. This is because x86 has no two-way vector interleave instruction until avx-512, and that instruction requires a runtime shuffle table, which uses up a register. The instructions x86 does have that take immediates are weird and hard to think about. It's important to stick to instructions that take immediates because interleaves often happen in high-register-pressure contexts (e.g. block transposes). This PR redoes vector interleaving for power-of-two blocks on x86 to use only unpckl and shufi/vperm2/vinsert instructions. The algorithm is somewhat complex and requires reasoning about permutations of the bits of the indices of each element. Hopefully it is understandable given the jumbo comment. I first got it working in Python and Claude correctly translated that to C++ for me, after which I made extensive rewrites.
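For readers who want a concrete picture of "repeated 2-way interleaves" and of the index-bit view, here is a minimal scalar sketch. It is not the code in this PR or in CodeGen_LLVM; the function names are made up for illustration, and the pairing order (vector i with vector i + half) is an assumption about one reasonable way to do it. It also shows that interleaving 2^k vectors of 2^m lanes is a rotation of the bits of each element's flat index.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Reference N-way interleave: out[j * n + i] = vecs[i][j].
std::vector<int> interleave_reference(const std::vector<std::vector<int>> &vecs) {
    const size_t n = vecs.size(), len = vecs[0].size();
    std::vector<int> out(n * len);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < len; j++) {
            out[j * n + i] = vecs[i][j];
        }
    }
    return out;
}

// 2-way interleave of two equal-length vectors: a0 b0 a1 b1 ...
std::vector<int> interleave2(const std::vector<int> &a, const std::vector<int> &b) {
    std::vector<int> out;
    out.reserve(a.size() * 2);
    for (size_t j = 0; j < a.size(); j++) {
        out.push_back(a[j]);
        out.push_back(b[j]);
    }
    return out;
}

// Hypothetical helper: interleave a power-of-two number of vectors using
// only 2-way interleaves. Each round pairs vector i with vector i + half.
std::vector<int> interleave_by_pairs(std::vector<std::vector<int>> vecs) {
    while (vecs.size() > 1) {
        const size_t half = vecs.size() / 2;
        std::vector<std::vector<int>> next;
        for (size_t i = 0; i < half; i++) {
            next.push_back(interleave2(vecs[i], vecs[i + half]));
        }
        vecs = std::move(next);
    }
    return vecs[0];
}

// Index-bit view: with 2^k vectors of 2^m lanes, the element at flat input
// index p = i * 2^m + j ends up at output index j * 2^k + i, which is p
// rotated left by k bits within a (k + m)-bit field.
size_t interleaved_position(size_t p, unsigned k, unsigned m) {
    const size_t mask = (size_t(1) << (k + m)) - 1;
    return ((p << k) | (p >> m)) & mask;
}
```

Calling `interleave_by_pairs` on four 2-lane vectors gives the same result as `interleave_reference`, and `interleaved_position(p, 2, 1)` gives the output slot of each input element. The PR's actual algorithm decomposes that bit permutation into steps that map onto immediate-operand x86 shuffles, which this sketch does not attempt.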
On my machine, this makes block transposes significantly faster and shorter in terms of code size and avoids some of the pathological cases on main. E.g. a 16x16 transpose of uint16s on avx2 on main is 621 instructions total, taking 419 cycles. I'd paste it but it's just a huge mess of various instructions. In this PR it's 134 instructions and 64 cycles:
This changes what block sizes are best used for transposing. Here are the best block sizes for each type before and after this change:
AVX512:
AVX2:
A good rule of thumb seems to be that you now want to use 512-byte blocks on avx2, and 1024-byte blocks on avx512.