Aussie AI

Chapter 7. Timing CUDA C++ Programs

  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler

CUDA C++ Timers

You can embed timing code into your CUDA C++ applications. Usually, you do this in the host code, rather than the device code, because you’re trying to time the overall performance of your parallelization optimizations, rather than just one GPU thread.

There are various C++ functions and libraries for tracking and recording timing information. Here are some of them:

  • clock standard C++ function (in <time.h>)
  • CUDA event timers
  • time standard C++ function (in <time.h>)

The clock function

The C++ clock function records “clock ticks” at a microsecond level of granularity (CLOCKS_PER_SEC is 1,000,000 on Linux). The basic method of use is to wrap your code between two calls to the clock function:

    #include <time.h>  // clock() 
       ...
    clock_t before = clock();  // clock ticks
    // Launch the kernels here...
    cudaDeviceSynchronize();  // Wait for kernel completion
    clock_t after = clock();  // clock ticks

    double secs = (after - before) / (double) CLOCKS_PER_SEC;
    double usecs = secs * 1000000.0;   // Microseconds
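
The time function from the same header is much coarser, reporting wall-clock time in whole seconds, so it is really only useful for timing long-running programs rather than individual kernels. Here's a minimal sketch of the same before-and-after pattern:

    #include <time.h>  // time(), difftime()
       ...
    time_t before = time(NULL);   // Wall-clock time, whole seconds
    // Launch the kernels here...
    cudaDeviceSynchronize();      // Wait for kernel completion
    time_t after = time(NULL);
    double secs = difftime(after, before);  // Elapsed seconds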

CUDA event timers

CUDA C++ has a builtin set of “event timers” that record timing events; the elapsed time between two events is reported in milliseconds (as a float). The basic method of use is also a “before-and-after” wrapper around the code.

    // Set up the timer objects
    cudaEvent_t event_before;
    cudaEventCreate(&event_before);
    cudaEvent_t event_after;
    cudaEventCreate(&event_after);

    // Wrap the kernel
    cudaEventRecord(event_before);
    // Launch the kernels here...
    cudaDeviceSynchronize();  // Wait for kernel completion
    cudaEventRecord(event_after);

    cudaEventSynchronize(event_after);  // Wait for the "after" event
    float event_ms;  // milliseconds
    cudaEventElapsedTime(&event_ms, event_before, event_after);
    float event_secs = event_ms / 1000.0f;

Note that no #include is needed for CUDA event timers, because they are part of the CUDA runtime API; the nvcc compiler automatically includes the cuda_runtime.h header when compiling “.cu” files.
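
When you're finished with the timing, the two event objects can be released with cudaEventDestroy:

    cudaEventDestroy(event_before);  // Release the timer events
    cudaEventDestroy(event_after);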

How Is This CUDA Kernel Even Fast?

Looking at a basic CUDA kernel, it’s hard to believe it could be fast. Here’s an example that just clears a vector in an element-by-element parallel kernel:

    __global__ void aussie_clear_vector_kernel_basic(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n)
            v[id] = 0.0;  // Clear vector element
    }

The grand unified theory of CUDA states that this kernel code is fast because the work is split across many threads, grouped into blocks, which the GPU executes as warps of 32 threads. The idea is that we get lots of threads all running in lock-step parallel, each clearing one vector element, so the whole vector gets cleared in one big parallel step.
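
For reference, here is a sketch of a typical host-side launch for this kernel; the block size of 256 is just an illustrative choice, and device_v is assumed to be a device pointer to n floats:

    int threads_per_block = 256;  // Illustrative; any multiple of 32 is typical
    int blocks = (n + threads_per_block - 1) / threads_per_block;  // Round up
    aussie_clear_vector_kernel_basic<<< blocks, threads_per_block >>>(device_v, n);
    cudaDeviceSynchronize();  // Wait for the kernel to finish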

But there are a lot of statements! Let’s count them out:

  • Kernel function argument setup: v and n
  • Builtin CUDA object setup: blockIdx, threadIdx, blockDim.
  • Structure field access: .x (thrice)
  • Multiply: blockIdx.x * blockDim.x
  • Add: + threadIdx.x
  • Assignment: id =
  • Less-than: id < n
  • Conditional branching: if
  • Array index: v[id]
  • Assignment (floating point): = 0.0

I’m up to 10 distinct operations now, but every time I look at this code, I see another one. How in the name of Edsger Dijkstra is that even close to fast?

And what about the multiplication operation? Every AI developer runs screaming at the sight of an asterisk, and there’s a whole stream of research papers on “zero multiplication” AI models. And yet, here we are wasting an entire multiplication operation just to compute an index.

What gives?

Questions. This analysis of all the operations triggers a bunch of questions about optimizing this CUDA C++ code sequence:

  • Would the kernel be 10% faster if I chose N as an exact multiple of the block size, so I could remove the safety “if (id < n)” test?
  • Would it be faster if I make N a const or a #define rather than passing it in?
  • Or should I use template specialization so that N is constant?
  • If I removed the “id” temporary variable, would it be faster?
  • Should I replace blockDim.x with a constant? Or some kind of pre-computed global variable?
  • Is that multiplication operation expensive? i.e., blockIdx.x * blockDim.x

Short answer: no, to all.
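
For example, the template specialization idea from the questions above would look something like this sketch (a hypothetical variant with N baked in at compile time; the kernel name is made up for illustration):

    template <int N>
    __global__ void aussie_clear_vector_kernel_templated(float* v)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < N)        // N is a compile-time constant here
            v[id] = 0.0f;
    }
       ...
    // Launch with the vector size as a template argument
    aussie_clear_vector_kernel_templated<2097152><<< blocks, threads_per_block >>>(device_v);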

The longer answer is that the cost of global memory access dwarfs everything else. For example, on a V100 GPU, it costs 400-800 GPU clock cycles to access global memory. Hence, the cost of writing a value to the global array, v[id], is hundreds of GPU clock cycles, whereas all of the rest of the operations take one cycle each, or at most single-digit cycles.

Furthermore, the compiler is auto-optimizing a lot of these things. For example, the id temporary variable would be trivial for the compiler to auto-remove, so it’ll be doing a register operation anyway.

Even the multiplication is probably not a real multiplication. The expression “blockIdx.x * blockDim.x + threadIdx.x” is such a standard idiom in CUDA C++ that the compiler will surely auto-optimize it to some much faster builtin operations. We can look at the PTX assembly statements to see what’s really going on.

And one final point: yes, there is a CUDA runtime function called cudaMemset, which would clear my vector in a much faster way. There’s also memset in standard C++ code, although that only works on host memory.
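
For example, assuming device_v is the device pointer to the n-element vector, a single cudaMemset call clears the whole thing (all-zero bytes is 0.0f for IEEE 754 floats):

    cudaMemset(device_v, 0, n * sizeof(float));  // Zero all n floats on the device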

Timing The Kernels

I decided to try timing some simple kernels while rearranging some of these internal operations. The findings were basically — spoiler alert! — that the main cost was the global memory accesses, but that there was also a lot of overhead in launching lots of blocks. However, some of the results were not clear-cut.

The timing is for four kernels:

  • Basic kernel (all operations)
  • No assignment (all operations except the global memory assignment)
  • Do nothing (no operations)
  • Two assignments (two global memory assignments)

Here’s the timing code:

    #include <time.h>  // clock() 
       ...
    clock_t before = clock();  // clock ticks
    for (int i = 0; i < repeats; i++) {  // Repeats for timing....
        kernelfnptr<<< blocks, threads_per_block >>> (device_v, n); 
        cudaDeviceSynchronize();  // Wait till the whole kernel is done...
    }
    clock_t after = clock();  // clock ticks
    double secs = (after - before) / (double) CLOCKS_PER_SEC;
    double usecs = secs * 1000000 / (double)repeats;
    fprintf(stderr, "TIMING: %s: %d iterations of N=%d blocks=%d/%d took %3.2f seconds, us/kernel=%3.2f\n", 
           name_of_kernel, (int)repeats, (int)n, blocks, threads_per_block, secs, usecs);

Note that the timing code has to call cudaDeviceSynchronize, because kernel launches are asynchronous. Without this explicit synchronization, we’d just be timing kernel launch time, rather than kernel execution time. Note that this does not measure the time cost of any other overheads before and after kernel launch, such as memory allocation, memory copies, and memory de-allocations.
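
For completeness, those untimed overheads are the usual surrounding runtime calls; here is a rough sketch of what sits outside the timed loop (host_v is an assumed host-side buffer):

    float* device_v = NULL;
    cudaMalloc((void**)&device_v, n * sizeof(float));  // Device allocation (not timed)
    cudaMemcpy(device_v, host_v, n * sizeof(float), cudaMemcpyHostToDevice);  // Copy in (not timed)
    // ... the timed kernel launch loop goes here ...
    cudaMemcpy(host_v, device_v, n * sizeof(float), cudaMemcpyDeviceToHost);  // Copy out (not timed)
    cudaFree(device_v);  // De-allocation (not timed)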

Here’s the basic kernel code:

    __global__ void aussie_clear_vector_kernel_basic(float* v, int n)
    {
        // Kernel runs on GPU -- safely!
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) { // Safety
                v[id] = 0.0;  // Clear one vector element..
        }
    }

Here’s the change to the kernel without any assignment:

        // Commented out // v[id] = 0.0;
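
In other words, the “no assignment” kernel is just the basic kernel with that line commented out (the function name below matches the mangled name that appears in the PTX later):

    __global__ void aussie_clear_vector_kernel_no_assignment(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) { // Safety
            // Commented out // v[id] = 0.0;
        }
    }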

And here’s the “two assignment” kernel with a few tricks to try to stop the compiler optimizing too much:

    __global__ void aussie_clear_vector_kernel_two_assign(float* v, int n)
    {
        volatile int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) { // Safety
                v[id] = 0.0;  // Clear one vector element..
                v[id + ((unsigned)n>>30)] = 0.0;  // And again..
        }
    }

You can probably guess the “do nothing” kernel code:

    __global__ void aussie_clear_vector_kernel_do_nothing(float* v, int n)
    {
    }

Timing Results

I used these settings:

   Repeats = 100,000 
   N=2,097,152 
   blocks=2048
   threads-per-block=1024

Note that 2048 blocks of 1024 threads is exactly 2,097,152 threads, one per vector element. The results varied significantly with block size and vector size. Generally, I had to make the block size at least 512 threads and use an N of at least a million vector elements before significant divergence in timing arose. With a block size of 32 threads and a small N, there was almost no difference between any of the four kernels.

Here are the final results for the above settings:

TIMING: Basic: 100000 iterations of N=2097152 blocks=2048/1024 took 4.30 seconds, us/kernel=42.96
TIMING: No Assign: 100000 iterations of N=2097152 blocks=2048/1024 took 1.62 seconds, us/kernel=16.23
TIMING: Do nothing: 100000 iterations of N=2097152 blocks=2048/1024 took 1.62 seconds, us/kernel=16.19
TIMING: Two assign: 100000 iterations of N=2097152 blocks=2048/1024 took 4.24 seconds, us/kernel=42.43

If we look at these results, here are some conclusions:

  • The assignment to global memory was 62% of the cost ((42.96 - 16.23) / 42.96 ≈ 0.62), with 38% overhead cost.
  • Two assignments to global memory were the same (!) as one assignment (was one optimized away?).
  • The non-assignment code in the kernel, such as index calculations, had almost zero cost (because the “no assignment” and “do nothing” versions cost almost exactly the same).
  • The 38% overhead must be from something other than kernel instructions or global memory assignments (i.e., overhead from launching blocks and threads).

PTX Assembly Analysis

To confirm that the optimizer was not removing too many of the statements, I had a look at the PTX assembly. The command to “keep” these intermediate files using nvcc on Linux was:

    nvcc -keep -keep-dir=. aussie-time-vector-kernels1.cu

Here are the files that this created, as listed by the command “ls -la” in Google Colab:

-rw-r--r-- 1 root root     976 Sep 28 10:38 a_dlink.fatbin
-rw-r--r-- 1 root root    3256 Sep 28 10:38 a_dlink.fatbin.c
-rw-r--r-- 1 root root    2936 Sep 28 10:38 a_dlink.o
-rw-r--r-- 1 root root      32 Sep 28 10:38 a_dlink.reg.c
-rw-r--r-- 1 root root     896 Sep 28 10:38 a_dlink.sm_52.cubin
-rwxr-xr-x 1 root root  976576 Sep 28 10:38 a.out*
-rw-r--r-- 1 root root 1249860 Sep 28 10:38 aussie-time-vector-kernels1.cpp1.ii
-rw-r--r-- 1 root root 1146101 Sep 28 10:38 aussie-time-vector-kernels1.cpp4.ii
-rw-r--r-- 1 root root    7767 Sep 28 10:33 aussie-time-vector-kernels1.cu
-rw-r--r-- 1 root root     386 Sep 28 10:38 aussie-time-vector-kernels1.cudafe1.c
-rw-r--r-- 1 root root 1066055 Sep 28 10:38 aussie-time-vector-kernels1.cudafe1.cpp
-rw-r--r-- 1 root root   17165 Sep 28 10:38 aussie-time-vector-kernels1.cudafe1.gpu
-rw-r--r-- 1 root root    4080 Sep 28 10:38 aussie-time-vector-kernels1.cudafe1.stub.c
-rw-r--r-- 1 root root    7168 Sep 28 10:38 aussie-time-vector-kernels1.fatbin
-rw-r--r-- 1 root root   19862 Sep 28 10:38 aussie-time-vector-kernels1.fatbin.c
-rw-r--r-- 1 root root      52 Sep 28 10:38 aussie-time-vector-kernels1.module_id
-rw-r--r-- 1 root root   23312 Sep 28 10:38 aussie-time-vector-kernels1.o
-rw-r--r-- 1 root root    2693 Sep 28 10:38 aussie-time-vector-kernels1.ptx
-rw-r--r-- 1 root root    6248 Sep 28 10:38 aussie-time-vector-kernels1.sm_52.cubin
drwxr-xr-x 4 root root    4096 Sep 25 18:24 .config/
drwxr-xr-x 1 root root    4096 Sep 25 18:24 sample_data/

To understand what is going on here, you need to know the nvcc compiler does two separate things (and does a good job of hiding this, unless you use the “-keep” option):

  • Device code is compiled to PTX assembly (the “.ptx” file), which is then assembled into binary formats.
  • Host code is source-to-source compiled to standard C++ (e.g., the “.ii” and “.c” files), which is then sent to the native C++ compiler (e.g., g++ on Linux).
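
If you only want the PTX, nvcc can also emit it directly with the -ptx option, which performs device compilation only and writes just the “.ptx” file:

    nvcc -ptx aussie-time-vector-kernels1.cu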

All of the PTX versions of the kernels are in the one “.ptx” file. Here’s the PTX for the “do nothing” kernel:

    // .globl        _Z37aussie_clear_vector_kernel_do_nothingPfi
    .visible .entry _Z37aussie_clear_vector_kernel_do_nothingPfi(
        .param .u64 _Z37aussie_clear_vector_kernel_do_nothingPfi_param_0,
        .param .u32 _Z37aussie_clear_vector_kernel_do_nothingPfi_param_1
    )
    {
        ret;
    }

Here’s the longer PTX for the “basic” kernel, compiled without any extra optimization options:

    // .globl        _Z32aussie_clear_vector_kernel_basicPfi
    .visible .entry _Z32aussie_clear_vector_kernel_basicPfi(
        .param .u64 _Z32aussie_clear_vector_kernel_basicPfi_param_0,
        .param .u32 _Z32aussie_clear_vector_kernel_basicPfi_param_1
    )
    {
        .reg .pred      %p<2>;
        .reg .b32       %r<7>;
        .reg .b64       %rd<5>;

        ld.param.u64    %rd1, [_Z32aussie_clear_vector_kernel_basicPfi_param_0];
        ld.param.u32    %r2, [_Z32aussie_clear_vector_kernel_basicPfi_param_1];
        mov.u32         %r3, %ctaid.x;
        mov.u32         %r4, %ntid.x;
        mov.u32         %r5, %tid.x;
        mad.lo.s32      %r1, %r3, %r4, %r5;
        setp.ge.s32     %p1, %r1, %r2;
        @%p1 bra        $L__BB0_2;

        cvta.to.global.u64         %rd2, %rd1;
        mul.wide.s32    %rd3, %r1, 4;
        add.s64         %rd4, %rd2, %rd3;
        mov.u32         %r6, 0;
        st.global.u32   [%rd4], %r6;
    $L__BB0_2:
        ret;
    }

Note that the main global memory assignment is the st.global “store global” instruction. Let’s annotate some of this PTX gobbledegook to clarify the executable instructions in these assembly statements:

        // Load parameters: RD1=void*v, R2 = int n
        ld.param.u64         %rd1, [_Z32aussie_clear_vector_kernel_basicPfi_param_0];
        ld.param.u32         %r2, [_Z32aussie_clear_vector_kernel_basicPfi_param_1];

        // C++: int id = blockIdx.x* blockDim.x + threadIdx.x;
        mov.u32         %r3, %ctaid.x;  // R3 = blockIdx.x
        mov.u32         %r4, %ntid.x;   // R4 = blockDim.x
        mov.u32         %r5, %tid.x;    // R5 = threadIdx.x
        mad.lo.s32         %r1, %r3, %r4, %r5; // R1=(R3*R4)+R5 (multiply-add)

        // C++: if (id < n) -- The PTX does: !(id >= n)
        setp.ge.s32         %p1, %r1, %r2;  // greater-equal: R1 (id) >= R2 (n)
        @%p1 bra         $L__BB0_2;      // Branch if true

        // C++: v[id] = 0.0; (global assignment)
        cvta.to.global.u64         %rd2, %rd1; // RD2=RD1 (v)
        mul.wide.s32         %rd3, %r1, 4;       // RD3 = R1 * 4 (id * 4 bytes)
        add.s64         %rd4, %rd2, %rd3;   // RD4 = &v[id] = RD2 (v) + RD3 (id*4)

        mov.u32         %r6, 0;        // R6 = 0
        st.global.u32         [%rd4], %r6;   // *RD4 = R6 (Store global: v[id] = 0)
$L__BB0_2:    // Branch label for if
        ret;  // Return

Here is the “no assignment” version in PTX:

    // .globl        _Z40aussie_clear_vector_kernel_no_assignmentPfi
    .visible .entry _Z40aussie_clear_vector_kernel_no_assignmentPfi(
        .param .u64 _Z40aussie_clear_vector_kernel_no_assignmentPfi_param_0,
        .param .u32 _Z40aussie_clear_vector_kernel_no_assignmentPfi_param_1
    )
    {
        .reg .b32    %r<7>;
        .reg .b64    %rd<3>;

        mov.u32      %r1, %ctaid.x;
        mov.u32      %r2, %ntid.x;
        mov.u32      %r3, %tid.x;
        mad.lo.s32   %r4, %r1, %r2, %r3;
        mov.u32      %r6, %r4;
        mov.u32      %r5, %r6;
        ret;
    }

Note that it does indeed have no assignment, since the st.global “store global” instruction is gone. However, the if statement is also gone, with no greater-equal test, no branch instruction and no branch label, because the compiler noticed that the body of the if was now empty. However, not everything has been optimized away, and it’s still doing more arithmetic instructions in registers than the “do nothing” kernel.

Finally, here is the PTX of the “two assignment” version:

    // .globl        _Z37aussie_clear_vector_kernel_two_assignPfi
    .visible .entry _Z37aussie_clear_vector_kernel_two_assignPfi(
        .param .u64 _Z37aussie_clear_vector_kernel_two_assignPfi_param_0,
        .param .u32 _Z37aussie_clear_vector_kernel_two_assignPfi_param_1
    )
    {
        .reg .pred     %p<2>;
        .reg .b32      %r<13>;
        .reg .b64      %rd<9>;

        ld.param.u64   %rd2, [_Z37aussie_clear_vector_kernel_two_assignPfi_param_0];
        ld.param.u32   %r1, [_Z37aussie_clear_vector_kernel_two_assignPfi_param_1];
        mov.u32        %r2, %ntid.x;
        mov.u32        %r3, %ctaid.x;
        mov.u32        %r4, %tid.x;
        mad.lo.s32     %r5, %r3, %r2, %r4;
        mov.u32        %r12, %r5;
        mov.u32        %r6, %r12;
        setp.ge.s32    %p1, %r6, %r1;
        @%p1 bra       $L__BB1_2;

        cvta.to.global.u64    %rd4, %rd2;
        mov.u32        %r7, %r12;
        mul.wide.s32   %rd5, %r7, 4;
        add.s64        %rd6, %rd4, %rd5;
        mov.u32        %r8, 0;       // R8 = 0
        st.global.u32  [%rd6], %r8;  // *RD6 = R8 (Store global: v[id] = 0)

        // C++: v[id + ((unsigned)n>>30)] = 0.0;
        shr.u32        %r9, %r1, 30;   // Shift-right: R9 = n>>30
        mov.u32        %r10, %r12;     // R10 = R12 (id)
        add.s32        %r11, %r10, %r9; // R11 = R10 (id) + R9 (n>>30)
        mul.wide.u32   %rd7, %r11, 4;  // RD7 = R11 * 4 bytes
        add.s64        %rd8, %rd4, %rd7; // RD8 = &v[id + (n>>30)] = RD4 (v) + RD7
        st.global.u32  [%rd8], %r8;    // *RD8 = R8 (Store global: v[id + (n>>30)] = 0)
    $L__BB1_2:
        ret;
    }

Note that there are two “st.global” instructions, so the kernel really is issuing the second global memory assignment. Hence, we can conclude that, no, the second assignment was not optimized away at the instruction level, so it must have been absorbed at some lower hardware level. One likely explanation: with N=2,097,152, the shift (unsigned)n>>30 evaluates to 0 at run-time, so both stores write to exactly the same address, and the second store can be serviced by the same memory transaction. More generally, the hardware pipeline for global memory stores may have noticed that the second one was redundant, or handled them both in parallel.

 
