8-bit accumulator A

8-bit indices X and Y

8-bit processor status P

8-bit stack pointer S

16-bit program counter PC

The ATARI 410 Program Recorder enjoyed not fully reading your programs, which you discovered upon careening off the end.

5 DIM A$(1) 10 A$="A"

10 A = 65

Resource 1.0/1.1 SM 1.2/1.3 SM 2.x SM 3.0 SMX Max warp size 32 32 32 32 Max warps 24 32 48 64 Max blocks 8 8 8 16 Max threads

(derived) 768 1024 1536 2048 32-bit registers 8192 16384 32768 65536 Max registers/thread at full occupancy

(derived) 10 16 21 32

#include <cuda.h> #include <assert.h> #include <stdint.h> #include <stdlib.h> static __global__ void medadic(void){ } static __global__ void monadic(uint32_t a0){ } static __global__ void dyadic(uint32_t a0,uint32_t a1){ } int main(void){ assert(cuInit(0) == CUDA_SUCCESS); medadic<<<1,1>>>(); monadic<<<1,1>>>(1); dyadic<<<1,1>>>(1,2); assert(cudaThreadSynchronize() == cudaSuccess); return EXIT_SUCCESS; }

nvcc c.cu -Xptxas -O3 -arch compute_20 -code sm_21 -lcuda -lcudart

-code

sm_21

-O3

[skynet](0) $ cuobjdump -sass a.out Fatbin elf code: ================ arch = sm_21 code version = [1,4] producer = cuda host = linux compile_size = 64bit identifier = c.cu code for sm_21 Function : _Z7medadicv /*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100]; /*0008*/ /*0x00001de780000000*/ EXIT; ............................ Function : _Z6dyadicjj /*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100]; /*0008*/ /*0x00001de780000000*/ EXIT; ............................ Function : _Z7monadicj /*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100]; /*0008*/ /*0x00001de780000000*/ EXIT; ............................ [skynet](0) $

medadic

dyadic

-abi=no

ptxas

ptxas

static __global__ void monadic(uint32_t *a0){ *a0 = 0; } static __global__ void dyadic(const uint32_t *a0,uint32_t *a1){ *a1 = *a0; } int main(void){ uint32_t *out; assert(cuInit(0) == CUDA_SUCCESS); assert(cudaMalloc(&out,sizeof(uint32_t) * 3) == cudaSuccess); medadic<<<1,1>>>(); monadic<<<1,1>>>(out); dyadic<<<1,1>>>(out + 1,out + 2); assert(cudaThreadSynchronize() == cudaSuccess); return EXIT_SUCCESS; }

code for sm_21 Function : _Z7medadicv /*0000*/ /*0x00001de780000000*/ EXIT; ............................ Function : _Z6dyadicPjS_ /*0000*/ /*0x80001de428004000*/ MOV R0, c [0x0] [0x20]; /*0008*/ /*0x90005de428004000*/ MOV R1, c [0x0] [0x24]; /*0010*/ /*0xa0009de428004000*/ MOV R2, c [0x0] [0x28]; /*0018*/ /*0x00001c858c000000*/ LDU.E R0, [R0]; /*0020*/ /*0xb000dde428004000*/ MOV R3, c [0x0] [0x2c]; /*0028*/ /*0x00201c8594000000*/ ST.E [R2], R0; /*0030*/ /*0x00001de780000000*/ EXIT; .............................. Function : _Z7monadicPj /*0000*/ /*0x80001de428004000*/ MOV R0, c [0x0] [0x20]; /*0008*/ /*0x90005de428004000*/ MOV R1, c [0x0] [0x24]; /*0010*/ /*0x000fdc8594000000*/ ST.E [R0], RZ; /*0018*/ /*0x00001de780000000*/ EXIT; .............................

nvcc

-O3

monadic()

cudaMemset()

.param

ptxas

dyadic

-m32

nvcc

ptxas info : Compiling entry function '_Z7monadicPj' for 'sm_21' ptxas info : Used 1 registers, 36 bytes cmem[0] ptxas info : Compiling entry function '_Z6dyadicPjS_' for 'sm_21' ptxas info : Used 2 registers, 40 bytes cmem[0] ptxas info : Compiling entry function '_Z7medadicv' for 'sm_21' ptxas info : Used 0 registers, 32 bytes cmem[0]

-maxrregcount=2

ptxas

ptxas info : For profile sm_21 adjusting per thread register count of 2 to lower bound of 16

-maxrregcount=0

ptxas

static __global__ void monadic(uint32_t *a0){ *a0 = *a0 + 1u; } static __global__ void dyadic(const uint32_t *a0,uint32_t *a1){ uint32_t c; for(c = 0 ; c < *a0 ; ++c){ *a1 += *a0; } }

monadic()

dyadic()

Function : _Z6dyadicPjS_ /*0000*/ /*0x80019de428004000*/ MOV R6, c [0x0] [0x20]; /*0008*/ /*0x9001dde428004000*/ MOV R7, c [0x0] [0x24]; /*0010*/ /*0xa0011de428004000*/ MOV R4, c [0x0] [0x28]; /*0018*/ /*0x00601c858c000000*/ LDU.E R0, [R6]; /*0020*/ /*0xb0015de428004000*/ MOV R5, c [0x0] [0x2c]; /*0028*/ /*0xfc01dc23190e0000*/ ISETP.EQ.AND P0, pt, R0, RZ, pt; /*0030*/ /*0x000001e780000000*/ @P0 EXIT; /*0038*/ /*0x00405c858c000000*/ LDU.E R1, [R4]; /*0040*/ /*0xfc009de428000000*/ MOV R2, RZ; /*0048*/ /*0x00001de440000000*/ NOP CC.T; /*0050*/ /*0x00105c0348000000*/ IADD R1, R1, R0; /*0058*/ /*0x04209c034800c000*/ IADD R2, R2, 0x1; /*0060*/ /*0x00405c8594000000*/ ST.E [R4], R1; /*0068*/ /*0x00601c8584000000*/ LD.E R0, [R6]; /*0070*/ /*0x0021dc03188e0000*/ ISETP.LT.U32.AND P0, pt, R2, R0, pt; /*0078*/ /*0x400001e74003ffff*/ @P0 BRA 0x50; /*0080*/ /*0x00001de780000000*/ EXIT; .............................. Function : _Z7monadicPj /*0000*/ /*0x80009de428004000*/ MOV R2, c [0x0] [0x20]; /*0008*/ /*0x9000dde428004000*/ MOV R3, c [0x0] [0x24]; /*0010*/ /*0x00201c858c000000*/ LDU.E R0, [R2]; /*0018*/ /*0x04001c034800c000*/ IADD R0, R0, 0x1; /*0020*/ /*0x00201c8594000000*/ ST.E [R2], R0; /*0028*/ /*0x00001de780000000*/ EXIT; .............................

ptxas

LDU

R0 , our addend (*a0),

, our addend (*a0), trusty R1 , our accumulator/augend (*a1),

, our accumulator/augend (*a1), R2 : clearly our counter variable (c),

: clearly our counter variable (c), R3 : ...appears to be missing,

: ...appears to be missing, R4, R5 : address of augend (a1)

: address of augend (a1) R6, R7: address of addend (a0)

monadic

ptxas

restrict

dyadic()s

Function : _Z6dyadicPjS_ /*0000*/ /*0x80001de428004000*/ MOV R0, c [0x0] [0x20]; /*0008*/ /*0x90005de428004000*/ MOV R1, c [0x0] [0x24]; /*0010*/ /*0xa0011de428004000*/ MOV R4, c [0x0] [0x28]; /*0018*/ /*0x00009c858c000000*/ LDU.E R2, [R0]; /*0020*/ /*0xb0015de428004000*/ MOV R5, c [0x0] [0x2c]; /*0028*/ /*0xfc21dc23190e0000*/ ISETP.EQ.AND P0, pt, R2, RZ, pt; /*0030*/ /*0x000001e780000000*/ @P0 EXIT; /*0038*/ /*0x00401c858c000000*/ LDU.E R0, [R4]; /*0040*/ /*0xfc005de428000000*/ MOV R1, RZ; /*0048*/ /*0x00001de440000000*/ NOP CC.T; /*0050*/ /*0x04105c034800c000*/ IADD R1, R1, 0x1; /*0058*/ /*0x08001c0348000000*/ IADD R0, R0, R2; /*0060*/ /*0x0811dc03188e0000*/ ISETP.LT.U32.AND P0, pt, R1, R2, pt; /*0068*/ /*0x800001e74003ffff*/ @P0 BRA 0x50; /*0070*/ /*0x00401c8594000000*/ ST.E [R4], R0; /*0078*/ /*0x00001de780000000*/ EXIT; ..............................

__restrict__

ptxas

#pragma unroll

#pragma unroll 2

Function : _Z6dyadicPjS_ /*0000*/ /*0x80001de428004000*/ MOV R0, c [0x0] [0x20]; /*0008*/ /*0x90005de428004000*/ MOV R1, c [0x0] [0x24]; /*0010*/ /*0xa0011de428004000*/ MOV R4, c [0x0] [0x28]; /*0018*/ /*0x00001c858c000000*/ LDU.E R0, [R0]; /*0020*/ /*0xb0015de428004000*/ MOV R5, c [0x0] [0x2c]; /*0028*/ /*0xfc01dc23190e0000*/ ISETP.EQ.AND P0, pt, R0, RZ, pt; /*0030*/ /*0xa00081e740000000*/ @P0 BRA.U 0x60; /*0038*/ /*0x004060858c000000*/ @!P0 LDU.E R1, [R4]; /*0040*/ /*0x0400a003081ec000*/ @!P0 IMNMX.U32 R2, R0, 0x1, !pt; /*0048*/ /*0x080060a320020000*/ @!P0 IMAD R1, R0, R2, R1; /*0050*/ /*0x0040608594000000*/ @!P0 ST.E [R4], R1; /*0058*/ /*0x00001de780000000*/ EXIT; /*0060*/ /*0x00001de780000000*/ EXIT; ..............................

IMAD

*a1 += *a0 + *a0

*a1 = *a0 + *a0 + *a0

ISCADD 0x1

IMAD 0x3

static __global__ void triadic(uint32_t * __restrict__ a0, uint32_t * __restrict__ a1,uint32_t * __restrict__ a2){ *a1 = *a0 + *a1 + *a2; }

Function : _Z7triadicPjS_S_ /*0000*/ /*0x80019de428004000*/ MOV R6, c [0x0] [0x20]; /*0008*/ /*0x9001dde428004000*/ MOV R7, c [0x0] [0x24]; /*0010*/ /*0xa0011de428004000*/ MOV R4, c [0x0] [0x28]; /*0018*/ /*0xb0015de428004000*/ MOV R5, c [0x0] [0x2c]; /*0020*/ /*0xc0021de428004000*/ MOV R8, c [0x0] [0x30]; /*0028*/ /*0xd0025de428004000*/ MOV R9, c [0x0] [0x34]; /*0030*/ /*0x00605c858c000000*/ LDU.E R1, [R6]; /*0038*/ /*0x00409c858c000000*/ LDU.E R2, [R4]; /*0040*/ /*0x00801c858c000000*/ LDU.E R0, [R8]; /*0048*/ /*0x04205c0348000000*/ IADD R1, R2, R1; /*0050*/ /*0x00101c0348000000*/ IADD R0, R1, R0; /*0058*/ /*0x00401c8594000000*/ ST.E [R4], R0; /*0060*/ /*0x00001de780000000*/ EXIT; .................................

VADD.U32.U32.ACC

asm ( "vadd.u32.u32.u32.add %0,%0,%1,%2;" : "+r"(*a1) : "r"(*a0), "r"(*a2) );

Where do all the free registers go? When I took up programming—a wee lad of six in that merry summer of 1987—my ATARI 400's MOSTech 6502 offered 6 user-accessible registers (there were 22 “registers” with which to drive sound, player-missile graphics and I/O via CTIA+ANTIC, but you couldn't perform computation with them):You'd generally compose 2–3 values within A's 8 bits and forget load/store (beyond whatever large-scale phasing your main loop might exhibit), instead operating directly on the first 256 words. Among the (at least) ten addressing modes of the 6502 wasaddressing, requiring only a byte to specify the word and thus saving a secondary instruction fetch for the remainder of the address. Back in the 1.02MHz days, your CPU clock wasn't some egregious multiple of your DRAM/MMU clock, and thus this wasn't as bad as it sounds—in a way, the zero page was a user-managed cache. Unfortunately, your OS (or) crapped all over a good chunk of page zero, and thus you had to write assembly (and then chain-load—BASIC and assemblyprograms, warbled differently, when played outside the ATARI 410 Program (cassette) Recorder) to meaningfully use it. Of course, if you were writing BASIC, none of this was visible to you anyway—unless you timed access to a“array” vs access to a regular ol'(no, I did not perform this experiment as a six year old). Anyway: very few registers. That's the upshot.Fast forward to the age of, and we're doing a bit better, register-wise:The X86 world is significantly more complicated (as X86 tends to be). The scandalously low number of general purpose registers is partially offset by the past decade's complicated, dataflow-driven cores: use of register renaming helps facilitate out-of-order execution across a much wider hardware register file, and thus the small number of architectural registers doesn't restrict ILP. This doesn't help programmers, though, but the advanced caches standard on X86 cores at least absorb much of the delay due to register spills.Let's take the following simple CUDA program:and compile it thusly:We need theargument toin order to generate machine-specificrather than just, because PTX doesn't tell us anything about register usage (or altogether very much else, except how to emit some kinda half-assed pseudo-SSA). I usedbecause I've got a GTX 460 in this machine; adjust to taste. We passtoto ensure we're not getting non-production crap left in there.Take a look at this withWell, it can't be a parameter, since it's showing up inand there's only one of them in. Nope, this is part of the PTX ABI, a rather unfortunate affair all told which we won't explore further here. Know that withpassed to, this MOV—and the register it consumes—are eliminated, and you thus lose your stack.is smart enough to eliminate most dead code (code which has no effect on the world outside the function), so in order to get some data we'll need a more complicated program:and get the rather more interesting results:We've learned thatdoesn't appear to perform whole-program analysis: even withsupplied to it (not shown here), it doesn't optimize away the constant functionto aof size equivalent to the product of block and grid geometries (here, a trivially constant 1). Thestate space has been mapped into the first constant area at 0x20.will happily blow away dead registers used by parameters, as seen in. Why are we using two registers per parameter? Because our machine model is 64-bit, of course. Addingto ourcommand line loses a register per parameter:Note that trying to pass, say,toelicits whatever a computer does when it's being silently contemptuous and the following:disables the register ceiling, which just goes to show, once again, that nothing we learned fromis true (I had assumedwould cycle through all possible permutations of instruction schedules, faster and faster, until my Dell IPS U3011 blew up and the GTX 460 spoke aloud, “A strange program. The only optimal choice is: not to compile. How about a nice game of Nethack?” Alas!)Let's step up the game:is pretty much what we'd expect., though, is suffering fromNote thatinserted an optimization for the case where *a0 is 0, exiting early. This makes sense since all loads are from the same place (hence use of, Load Data Uniform). If threads were loading from different locations, this optimization wouldbe a win due to divergence (only if all threads in a warp loaded 0 would you win). What is all this garbage, though? 8 registers? Count 'em:Most likely, “64-bit” registers have some kind of R(2*n) alignment requirement, hence R3's use as padding (look at). Placing these in R0..R3 would have eliminated the need for this padding, but perhaps with so few registers being used,doesn't care. Maybe my hypothesis is just wrong, or even a canard. Collecting evidence for or against it is left as an exercise for the reader. Why aren't R6 and R7 reusable?I can't say for sure, but observe what happens when, using the CUDA __restrict__ extension keyword (why this was necessary is beyond my comprehension, ashas been present since ANSI/ISO C99) to amendcontract, specifying that the two parameters cannot overlap:We drop to 6 total registers, with a1 being freely blown away. Interesting, especially as the CUDA Programming Guide claims that use of“may increase register pressure” (this is true due to other possibilities, but it seems just as likely to reduce register pressure).It's also interesting to note that, despite the -O3 to, partial loop unrolling has not been performed, nor has the entire loop been optimized away into a single multiplication followed by an addition modulo 2. Adding a barechanges nothing, but(or indeed any other integer argument to unroll) results in:Ahhh, here's our...but it seems odd that we needed specify a value to unroll, though it didn't matter what that value was so long as it was greater than 1! Explicit sequences such asandare converted into sensibleandinstructions, respectively. But what about this tart medicine?...drumroll, please...Holy snapping duck fucks, what is this hippie horseshit? Why no condensation of the IADDs into a single VADD, eliminating the need to blow away R1? Well, it's not as if R1's actually alive following 0x0048, and there's no point in saving someone if they're already dead. If we throw a loop around this, we get two IADDs into R0. It'd be interesting to analyze the data dependence/throughput characteristics of VADD vs two IADDs—perhaps one requires less parallelism to hide the dependency delays than the other— but it's 0733 EDT, and I have other stuff to do. Hack on, cool kids!P.S.: Wegeneratevia inline PTX assembly: