NoxNode/AmpItUp

AmpItUp

Investigating the NVIDIA Ampere GPU Instruction Set Binary Encoding

This repo contains some scattered notes and code about my findings:

  • hello_cuda.c is for getting ptx from CUDA code
  • hello.c is for running ptx code
  • mice.c for now is just a hexdump and SASS viewer with vim motion controls (hjkl - also ctrl+hjkl to go faster)
  • ptx_gen.py generates a bunch of different ptx to assemble into SASS to analyze
  • generated.cubin is the cubin resulting from running ptx_gen and ptxas

Each of the above (except generated.cubin) has shell commands to compile/run/inspect at the top of the file.

I eventually want to make mice.c aka Machine Instruction Code Editor into a code editor that works directly on machine code (x64, sass, arm64, arm32, riscv64). It would make various visuals directly from machine code and allow for editing in either of the visualizations with controls that match the visual. We'll see if that ever comes to fruition.

The sections below are also posted on the Handmade Network Project blog in chronological order

link dump

Some guesses on what the default control bits are and why

The default control bits output by ptxas on some random PTX I made seem to be 0x003fde. The table below lists each field's value and why I think it is the default:

| field               | #bits | val | why I think this is default |
| ------------------- | ----- | --- | --------------------------- |
| reuse               |   4   |  0  | can still reuse regs, we just don't give the hint to the processor that we ARE reusing it |
| wait barrier mask   |   6   |  3  | probably could also be 0 since the barriers for all prev instructions are 7 aka unset |
| read barrier index  |   3   |  7  | unset/not making a barrier |
| write barrier index |   3   |  7  | unset/not making a barrier |
| yield flag          |   1   |  0  | don't yield (in combination with a high stall this seems to kinda say "take as long as it takes") |
| stall cycles        |   4   |  F  | take as long as it takes |
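As a rough sketch of how the table maps onto the raw bits, here's a small Python decoder. It assumes the 21 control bits sit at bits 105-125 of the 128-bit instruction, packed in the table's order from reuse (highest) down to stall cycles (lowest), which lines up with the barrier field positions in the DocumentSASS dump further down. The names `CONTROL_FIELDS` and `decode_control` are made up for this example.

```python
# Field order and widths copied from the table, most significant field first.
# ASSUMPTION: these 21 bits occupy bits 105..125 of the 128-bit instruction.
CONTROL_FIELDS = [
    ("reuse", 4),
    ("wait_barrier_mask", 6),
    ("read_barrier_index", 3),
    ("write_barrier_index", 3),
    ("yield_flag", 1),
    ("stall_cycles", 4),
]


def decode_control(high_word):
    """Split the control bits out of the high 64-bit word of an instruction."""
    # Bit 105 of the instruction is bit 41 of the high word.
    bits = (high_word >> 41) & ((1 << 21) - 1)
    fields = {}
    shift = 21
    for name, width in CONTROL_FIELDS:
        shift -= width
        fields[name] = (bits >> shift) & ((1 << width) - 1)
    return fields


if __name__ == "__main__":
    # High word of the plain IADD3 from the cuobjdump output in this README.
    for name, value in decode_control(0x003FDE0007F1E0FF).items():
        print(f"{name:20} {value:#x}")
```

Feeding it the high word of the plain IADD3 should give back the same values as the val column, which is a quick sanity check on the assumed layout.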

IADD3 Binary Encoding Breakdown

I've got most of the bits of the IADD3 instruction figured out, but there are still a couple of things I don't know and can't find an answer to.

here's some output from cuobjdump (reformatted to take less horizontal space)

IADD3 R4, P0, R4, R4, RZ ;                    // 0x0000000404047210
                                              // 0x003fde0007f1e0ff
IADD3.X R5, P3, P6, R5, R5, RZ, P0, P5  ?PM3; // 0x0000000505057210
                                              // 0x003fdec00066a4ff
// I edited the bits of the IADD3.X instruction to see how the assembly changes
// I don't know if the ?PM3 would actually do anything or appear in normal code
// also I don't know if the second predicate input/output are actually used
// a more detailed breakdown of the above is below in the commented dump from DocumentSASS
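The "edit some bits and see what changes" approach can also be run in the other direction: XOR two encodings and list which bit positions differ. Below is a minimal Python sketch of that, using the two instructions from the cuobjdump output. It assumes the first hex word cuobjdump prints is the low 64 bits and the second is the high 64 bits (that fits the 0x210 opcode landing in bits 0-11). `to_inst` and `changed_bits` are names invented for this example.

```python
def to_inst(low_word, high_word):
    """Combine the two 64-bit words printed by cuobjdump into one 128-bit int.

    ASSUMPTION: the first printed word is the low half, the second the high half.
    """
    return (high_word << 64) | low_word


def changed_bits(a, b):
    """Return the bit indices (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(128) if (diff >> i) & 1]


# Encodings copied from the cuobjdump output above.
IADD3 = to_inst(0x0000000404047210, 0x003FDE0007F1E0FF)
IADD3_X = to_inst(0x0000000505057210, 0x003FDEC00066A4FF)

if __name__ == "__main__":
    print(changed_bits(IADD3, IADD3_X))
```

Each index it prints can then be looked up against the bit ranges in the DocumentSASS dump to see which field was touched, for example bit 74 for the .X flag and bits 102-103 for the ?PM value.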

here's a helpful graphic to reference for the following from this pdf: ampere_encoding.png

here's a dump from sm_86_instructions.txt generated by DocumentSASS with some comments added by me:

OPCODES
        IADD3int_pipe =  0b1000010000; // dunno what int_pipe means
        IADD3 =  0b1000010000; // this is the 0x210 in Opcode saying it's IADD3
ENCODING
!iadd3_noimm__RRR_RRR_unused; // dunno what this is
// for the following, the first number is the number of bits
// the rest of the numbers are pairs of bit indices
// for an example, see the comments on the Opcode field below
// also note that these bit indices differ from the above graphic, but if you flip the top and bottom labels in the graphic then they match up
BITS_3_14_12_Pg = Pg;             // Predicate Guard - if the predicate at the index specified by this field is false, don't run this instruction (7 is the hardcoded True predicate PT - similar to how 255 is the hardcoded 0 register RZ; indices 0-6 are ordinary predicates, e.g. P0 is written as the carry out in the IADD3 example above)
BITS_1_15_15_Pg_not = Pg@not;     // negate the predicate specified by Predicate Guard
BITS_13_91_91_11_0_opcode=Opcode; // 13 total bits - 12 are at 0-11, a 13th is bit 91
BITS_1_74_74_Sc_absolute=0;       // this bit determines if it's an IADD3.X (aka extended IADD3?) which means use the carry in predicate(s)
BITS_8_23_16_Rd=Rd;               // destination register
BITS_3_83_81_Pu=Pu;               // index of a predicate to write the carry out to
BITS_3_86_84_cop=Pv;              // index of another predicate to write the carry out to? - still need to verify this theory
BITS_8_31_24_Ra=Ra;               // first source register
BITS_1_72_72_e=Ra@negate;         // negate the first source register if this bit is 1, also if Sc_absolute is 1 then this turns into a bitwise NOT instead of negate for whatever reason (the same is true for the other source register negate bits)
BITS_8_39_32_Rb=Rb;               // second source register
BITS_1_63_63_Sc_negate=Rb@negate; // negate second source register if this bit is 1
BITS_8_71_64_Rc=Rc;               // third source register (I'm guessing the reason it's called IADD3 is because there's 3 source registers)
BITS_1_75_75_Sc_negate=Rc@negate; // negate third source register if this bit is 1
BITS_3_89_87_Pp =* 7;             // index of a predicate to read a carry in from
BITS_1_90_90_input_reg_sz_32_dist =*1; // negate the Pp predicate if this bit is 1
BITS_3_79_77_Pq =* 7;             // index of another predicate to read a carry in from? - still need to verify this theory
BITS_1_80_80_ftz =*1;             // negate the Pq predicate if this bit is 1
BITS_6_121_116_req_bit_set=req_bit_set; // barrier mask
BITS_3_115_113_src_rel_sb=*7;           // read barrier
BITS_3_112_110_dst_wr_sb=*7;            // write barrier
BITS_2_103_102_pm_pred=pm_pred;         // don't know what this is - setting it to a value other than 0 causes cuobjdump to put a ?PM[value] at the end of the assembly - would like to find out what this actually means
BITS_8_124_122_109_105_opex=TABLES_opex_3(batch_t,usched_info,reuse_src_a,reuse_src_b,reuse_src_c); // this seems to merge the reuse, yield and stall bits and uses a lookup table for something
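
The `BITS_<count>_<hi>_<lo>...` naming rule described in the comments can be turned into a tiny generic field extractor. This is only a sketch: `parse_bits_name` and `extract_field` are hypothetical helpers, and it assumes that when a field has more than one bit range, the ranges are listed most significant first (so for the opcode, bit 91 ends up above bits 11-0).

```python
def parse_bits_name(name):
    """Parse e.g. 'BITS_13_91_91_11_0_opcode' into [(91, 91), (11, 0)]."""
    numbers = []
    for token in name.split("_")[1:]:  # skip the leading "BITS"
        if not token.isdigit():
            break  # the rest is the field's label, e.g. "opcode"
        numbers.append(int(token))
    total, rest = numbers[0], numbers[1:]
    ranges = list(zip(rest[0::2], rest[1::2]))
    if sum(hi - lo + 1 for hi, lo in ranges) != total:
        raise ValueError(f"bit count mismatch in {name}")
    return ranges


def extract_field(inst, name):
    """Pull the field described by a BITS_ name out of a 128-bit instruction.

    ASSUMPTION: ranges are concatenated most significant first.
    """
    value = 0
    for hi, lo in parse_bits_name(name):
        width = hi - lo + 1
        value = (value << width) | ((inst >> lo) & ((1 << width) - 1))
    return value


# The edited IADD3.X from the cuobjdump output (low word first, then high word).
IADD3_X = (0x003FDEC00066A4FF << 64) | 0x0000000505057210

if __name__ == "__main__":
    for field in ("BITS_13_91_91_11_0_opcode", "BITS_8_23_16_Rd",
                  "BITS_3_83_81_Pu", "BITS_3_86_84_cop",
                  "BITS_3_89_87_Pp", "BITS_3_79_77_Pq",
                  "BITS_2_103_102_pm_pred"):
        print(field, hex(extract_field(IADD3_X, field)))
```

Running it on the edited IADD3.X should line up with the disassembly: Rd is 5, the predicate fields give P3, P6, P0 and P5, and pm_pred is 3.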

also here's a more in depth explanation of what the control bits mean

Wrap Up / Conclusion

My main goal for this project was to understand my GPU's machine code (SASS) better and I accomplished that. I did want to have more of a functioning disassembler and editor for the SASS code - right now I'm just disassembling the IADD3 instruction. I'll still be working on that viewer/editor so if anyone's interested in that stay tuned. Here's a screenshot of the current state of the viewer/editor: Screenshot 2025-06-15 143418.png

The numbers next to the IADD3 instructions are in order:

  • destination register
  • 1st source register (optionally negated/inverted)
  • 2nd source register (optionally negated/inverted)
  • 3rd source register (optionally negated/inverted)
  • first carry out predicate
  • first carry in predicate (optionally negated)
  • second carry out predicate
  • second carry in predicate (optionally negated)
