Investigating the NVIDIA Ampere GPU Instruction Set Binary Encoding
this repo contains some scattered notes and code about my findings
- hello_cuda.c is for getting ptx from CUDA code
- hello.c is for running ptx code
- mice.c for now is just a hexdump and SASS viewer with vim motion controls (hjkl - also ctrl+hjkl to go faster)
- ptx_gen.py generates a bunch of different ptx to assemble into SASS to analyze
- generated.cubin is the cubin resulting from running ptx_gen and ptxas
Each of the above (except generated.cubin) has shell commands to compile/run/inspect it at the top of the file.
I eventually want to make mice.c aka Machine Instruction Code Editor into a code editor that works directly on machine code (x64, sass, arm64, arm32, riscv64). It would make various visuals directly from machine code and allow for editing in either of the visualizations with controls that match the visual. We'll see if that ever comes to fruition.
the stuff below is also on the Handmade Network project blog, in chronological order
- website with SM90a ISA
- SASS control code viewer
- SASS control code explanation
- Turing/Ampere assembler
- Volta architecture pdf
- Ampere architecture pdf
- Ampere architecture talk
- above link pdf (no signin req)
- PTX ISA
- pdf with Volta encoding info
- DocumentSASS (nvdisasm strings)
- the Martins of nvidia forums
- Casey's Cuda/Tensor Core video
the default control bits output by ptxas on some random ptx I made seem to be 0x003fde, i.e. the top 24 bits of the 128-bit instruction (why I think this is the default is in the table below):
| field | #bits | val | why I think this is default |
| --- | --- | --- | --- |
| reuse | 4 | 0 | can still reuse regs, we just don't give the processor the hint that we ARE reusing them |
| wait barrier mask | 6 | 3 | probably could also be 0, since the barriers for all previous instructions are 7 (aka unset) |
| read barrier index | 3 | 7 | unset/not making a barrier |
| write barrier index | 3 | 7 | unset/not making a barrier |
| yield flag | 1 | 0 | don't yield (in combination with a high stall this seems to kinda say "take as long as it takes") |
| stall cycles | 4 | F | take as long as it takes |
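
To make the table concrete, here is a minimal sketch that pulls those fields out of the high 64-bit word of the first IADD3 shown further down (0x003fde0007f1e0ff). The bit positions for the barrier fields come from the DocumentSASS listing later in this file; splitting the `opex` bits into stall (105-108), yield (109) and reuse (122-125) is an assumption based on the field widths in the table, and `decode_control` is just a name made up for this example. Note that the fields start at bit 105 rather than on a nibble boundary, which is why a stall of 0xF shows up as the `...de` tail of 0x003fde.

```python
# Bit positions follow the DocumentSASS listing; the stall/yield/reuse split
# of "opex" is an assumption based on the field widths in the table above.
CONTROL_FIELDS = {  # name: (lowest bit, width) within the 128-bit instruction
    "stall": (105, 4),
    "yield": (109, 1),
    "write_barrier": (110, 3),
    "read_barrier": (113, 3),
    "wait_mask": (116, 6),
    "reuse": (122, 4),
}


def decode_control(hi_word: int) -> dict:
    """Decode the control fields from the high 64-bit word (bits 64..127)."""
    insn = hi_word << 64  # the low word carries no control bits
    return {
        name: (insn >> low) & ((1 << width) - 1)
        for name, (low, width) in CONTROL_FIELDS.items()
    }


if __name__ == "__main__":
    for name, value in decode_control(0x003FDE0007F1E0FF).items():
        print(f"{name:14s} {value:#x}")
```

For that word the sketch gives reuse 0, wait mask 3, read and write barrier 7, yield 0 and stall 0xF, matching the table.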
I've got most of the bits of the IADD3 instruction figured out, but there are still a couple of things I don't know and can't find an answer to.
here's some output from cuobjdump (reformatted to take less horizontal space)
IADD3 R4, P0, R4, R4, RZ ; // 0x0000000404047210
// 0x003fde0007f1e0ff
IADD3.X R5, P3, P6, R5, R5, RZ, P0, P5 ?PM3; // 0x0000000505057210
// 0x003fdec00066a4ff
// I edited the bits of the IADD3.X instruction to see how the assembly changes
// I don't know if the ?PM3 would actually do anything or appear in normal code
// also I don't know if the second predicate input/output are actually used
// a more detailed breakdown of the above is below in the commented dump from DocumentSASS
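
The kind of bit edit described in the comments above can be sketched like this. It is a minimal example, not the code in mice.c: `join`, `split` and `set_field` are hypothetical helper names, it assumes cuobjdump prints the low 64-bit word first and the high word second, and it takes the `pm_pred` position (bits 103-102) from the DocumentSASS listing further down.

```python
MASK64 = (1 << 64) - 1


def join(lo: int, hi: int) -> int:
    """Combine the two 64-bit words printed by cuobjdump into one 128-bit int."""
    return (hi << 64) | lo


def split(insn: int) -> tuple:
    """Split a 128-bit instruction back into (low word, high word)."""
    return insn & MASK64, insn >> 64


def set_field(insn: int, high_bit: int, low_bit: int, value: int) -> int:
    """Overwrite bits high_bit..low_bit (inclusive) of insn with value."""
    width = high_bit - low_bit + 1
    if value >> width:
        raise ValueError(f"{value:#x} does not fit in {width} bits")
    mask = ((1 << width) - 1) << low_bit
    return (insn & ~mask) | (value << low_bit)


if __name__ == "__main__":
    # The edited IADD3.X from the dump above, with pm_pred cleared and then set to 3.
    insn = join(0x0000000505057210, 0x003FDEC00066A4FF)
    cleared = set_field(insn, 103, 102, 0)
    restored = set_field(cleared, 103, 102, 3)
    for word in split(cleared) + split(restored):
        print(f"0x{word:016x}")
```

Clearing the field turns the high word into 0x003fde000066a4ff, and setting it back to 3 reproduces the 0x003fdec00066a4ff that cuobjdump prints with `?PM3`.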
here's a helpful graphic to reference for the following from this pdf: 
sm_86_instructions.txt generated by DocumentSASS with some comments added by me:
OPCODES
IADD3int_pipe = 0b1000010000; // dunno what int_pipe means
IADD3 = 0b1000010000; // this is the 0x210 in Opcode saying it's IADD3
ENCODING
!iadd3_noimm__RRR_RRR_unused; // dunno what this is
// for the following, the first number is the number of bits
// the rest of the numbers are pairs of bit indices
// for an example, see the comments on the Opcode field below
// also note that these bit indices differ from the above graphic, but if you flip the top and bottom labels in the graphic then they match up
BITS_3_14_12_Pg = Pg; // Predicate Guard - if the predicate at the index specified by this field is false, don't run this instruction (7 is PT, a hardcoded True predicate - similar to how 255 is RZ, a hardcoded 0 register; indices 0-6 are the ordinary predicates P0-P6, e.g. the P0 carry out in the cuobjdump output above)
BITS_1_15_15_Pg_not = Pg@not; // negate the predicate specified by Predicate Guard
BITS_13_91_91_11_0_opcode=Opcode; // 13 total bits - 12 are at 0-11, a 13th is bit 91
BITS_1_74_74_Sc_absolute=0; // this bit determines if it's an IADD3.X (aka extended IADD3?) which means use the carry in predicate(s)
BITS_8_23_16_Rd=Rd; // destination register
BITS_3_83_81_Pu=Pu; // index of a predicate to write the carry out to
BITS_3_86_84_cop=Pv; // index of another predicate to write the carry out to? - still need to verify this theory
BITS_8_31_24_Ra=Ra; // first source register
BITS_1_72_72_e=Ra@negate; // negate the first source register if this bit is 1, also if Sc_absolute is 1 then this turns into a bitwise NOT instead of negate for whatever reason (the same is true for the other source register negate bits)
BITS_8_39_32_Rb=Rb; // second source register
BITS_1_63_63_Sc_negate=Rb@negate; // negate second source register if this bit is 1
BITS_8_71_64_Rc=Rc; // third source register (I'm guessing the reason it's called IADD3 is because there's 3 source registers)
BITS_1_75_75_Sc_negate=Rc@negate; // negate third source register if this bit is 1
BITS_3_89_87_Pp =* 7; // index of a predicate to read a carry in from
BITS_1_90_90_input_reg_sz_32_dist =*1; // negate the Pp predicate if this bit is 1
BITS_3_79_77_Pq =* 7; // index of another predicate to read a carry in from? - still need to verify this theory
BITS_1_80_80_ftz =*1; // negate the Pq predicate if this bit is 1
BITS_6_121_116_req_bit_set=req_bit_set; // barrier mask
BITS_3_115_113_src_rel_sb=*7; // read barrier
BITS_3_112_110_dst_wr_sb=*7; // write barrier
BITS_2_103_102_pm_pred=pm_pred; // don't know what this is - setting it to a value other than 0 causes cuobjdump to put a ?PM[value] at the end of the assembly - would like to find out what this actually means
BITS_8_124_122_109_105_opex=TABLES_opex_3(batch_t,usched_info,reuse_src_a,reuse_src_b,reuse_src_c); // this seems to merge the reuse, yield and stall bits and uses a lookup table for something
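
One plausible reason the negate bits turn into a bitwise NOT under `.X` (an assumption, not something the listing confirms) is multi-word subtraction. Since `-b = ~b + 1`, a 64-bit `a - b` can be done as a low-half add of `~b_lo + 1` that produces a carry, followed by a high-half add of `~b_hi` plus that carry. The sketch below models this with plain Python integers; `sub64_via_two_adds` is a hypothetical name and this is an arithmetic model only, not a claim about what the hardware does internally.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sub64_via_two_adds(a: int, b: int) -> int:
    """Compute (a - b) mod 2**64 from 32-bit halves, IADD3 / IADD3.X style."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    # Low half, modelled on "IADD3 with Rb negated": a_lo + (~b_lo + 1).
    lo_sum = a_lo + (~b_lo & MASK32) + 1
    carry = lo_sum >> 32  # plays the role of the carry-out predicate

    # High half, modelled on "IADD3.X with Rb inverted": a_hi + ~b_hi + carry.
    hi_sum = a_hi + (~b_hi & MASK32) + carry

    return ((hi_sum & MASK32) << 32) | (lo_sum & MASK32)


if __name__ == "__main__":
    for a, b in [(5, 3), (1 << 32, 1), (0, 1), (123, 0)]:
        result = sub64_via_two_adds(a, b)
        assert result == (a - b) & MASK64
        print(f"{a:#x} - {b:#x} = {result:#x}")
```

If the high half used a true negate instead of a NOT, the extra `+ 1` would be applied twice, once there and once through the carry, so the inverted form is the one that composes correctly with a carry in.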
also here's a more in depth explanation of what the control bits mean
My main goal for this project was to understand my GPU's machine code (SASS) better and I accomplished that. I did want to have more of a functioning disassembler and editor for the SASS code - right now I'm just disassembling the IADD3 instruction. I'll still be working on that viewer/editor so if anyone's interested in that stay tuned. Here's a screenshot of the current state of the viewer/editor:
The numbers next to the IADD3 instructions are in order:
- destination register
- 1st source register (optionally negated/inverted)
- 2nd source register (optionally negated/inverted)
- 3rd source register (optionally negated/inverted)
- first carry out predicate
- first carry in predicate (optionally negated)
- second carry out predicate
- second carry in predicate (optionally negated)
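
The list above can be reproduced from raw instruction words with a small decoder. This is a sketch under assumptions, not the code in mice.c: `bits`, `decode_iadd3` and `viewer_row` are hypothetical names, the bit ranges are copied from the DocumentSASS listing, and treating Pu/Pp as the "first" carry out/in and Pv/Pq as the "second" is my reading of the list.

```python
def bits(insn: int, *ranges) -> int:
    """Concatenate (high, low) bit ranges; the first range is most significant.

    This mirrors names like BITS_13_91_91_11_0, where one field is split
    across several bit ranges of the 128-bit instruction.
    """
    value = 0
    for high, low in ranges:
        width = high - low + 1
        value = (value << width) | ((insn >> low) & ((1 << width) - 1))
    return value


def decode_iadd3(lo: int, hi: int) -> dict:
    """Decode the IADD3 fields from the two 64-bit words cuobjdump prints."""
    insn = (hi << 64) | lo
    return {
        "opcode": bits(insn, (91, 91), (11, 0)),
        "x": bits(insn, (74, 74)),       # 1 for IADD3.X
        "rd": bits(insn, (23, 16)),
        "ra": bits(insn, (31, 24)),
        "rb": bits(insn, (39, 32)),
        "rc": bits(insn, (71, 64)),
        "pu": bits(insn, (83, 81)),      # carry out
        "pv": bits(insn, (86, 84)),      # possibly a second carry out
        "pp": bits(insn, (89, 87)),      # carry in
        "pp_not": bits(insn, (90, 90)),
        "pq": bits(insn, (79, 77)),      # possibly a second carry in
        "pq_not": bits(insn, (80, 80)),
    }


def viewer_row(fields: dict) -> list:
    """Order the fields like the list above (assumed mapping, see lead-in)."""
    keys = ("rd", "ra", "rb", "rc", "pu", "pp", "pv", "pq")
    return [fields[key] for key in keys]


if __name__ == "__main__":
    plain = decode_iadd3(0x0000000404047210, 0x003FDE0007F1E0FF)
    extended = decode_iadd3(0x0000000505057210, 0x003FDEC00066A4FF)
    print(hex(plain["opcode"]), viewer_row(plain))        # 0x210 [4, 4, 4, 255, 0, 7, 7, 7]
    print(hex(extended["opcode"]), viewer_row(extended))  # 0x210 [5, 5, 5, 255, 3, 0, 6, 5]
```

Both rows agree with the cuobjdump text for the two sample instructions: register 255 is RZ, predicate 7 is PT, and the IADD3.X row carries P3, P0, P6 and P5.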
