Aussie AI

Chapter 11. Basic CUDA C++ Bugs

  • Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
  • by David Spuler

Common Bugs in CUDA C++

Some of the main classes of bugs include:

  • General C/C++ types of bugs (many!)
  • Synchronization errors
  • Memory errors and other resource problems

Notably, an empirical study of CUDA bugs found that “general” C++ errors are the most common cause of GPU kernel coding errors, more so than misuse of the CUDA-specific features (Wu et al., 2019). C++ is a difficult language to get right!

Dual programming mode errors in the control of device versus host code include:

  • Kernel invocation of function not defined as __global__ or __device__ (basic mistake)
  • Device kernel code calling a host-only function.
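
As a minimal sketch of the second mistake (the function names here are hypothetical), a kernel can call a __device__ function, but calling an ordinary host-only function from device code is a compile-time error:

    __device__ float device_square(float x) { return x * x; }  // Callable from device code

    float host_square(float x) { return x * x; }  // Host-only (no __device__ qualifier)

    __global__ void square_kernel(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) {
            v[id] = device_square(v[id]);    // OK: __device__ function
            // v[id] = host_square(v[id]);   // ERROR: host-only function called from device code
        }
    }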

Memory access errors. Memory-related errors can include:

  • Mixing malloc/free with cudaMalloc/cudaFree addresses
  • Uninitialized device global memory
  • Address alignment errors
  • Array bounds access errors in device memory — often arising from incorrect computations involving thread index and dimensions.

Synchronization errors. Parallel programming can have a variety of errors in the synchronization of multiple threads. Some examples include:

  • Host code not waiting for asynchronous device kernel to complete (see the sketch after this list)
  • Race conditions
  • Barrier-related issues
  • Early device call reset
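
As a sketch of the first point (using the kernel and variable names from the examples later in this chapter), kernel launches are asynchronous, so host code that uses the results too early needs an explicit synchronization:

    mykernel<<<blocks, threads_per_block>>>(v, n);   // Asynchronous launch
    // BUG if we read the results here: the kernel may still be running.
    cudaError_t err = cudaDeviceSynchronize();       // Wait for the kernel to finish
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
    }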

Novice Kernel Launch Mistakes

There are many common CUDA idioms, and they exist for a reason. Break the pattern, and you might trigger a bug or a slug. Problems include:

  • Not enough threads
  • Not enough blocks
  • Not enough threads per block
  • Too many blocks (each with threads)
  • Vector underflows (silently)
  • Vector buffer overflows (gasp!)

All these combinations are enough to make your head spin! Let’s examine some of these problems.

Run multiple threads. This is the most common beginner issue when you’re creating your first CUDA program. Using this style is not really a bug, but more of a slug:

    mykernel<<<1,1>>>(v,n);  // SEQUENTIAL!

It’s not parallel programming if you’re only running one thread!

Kernel launch syntax. Here’s another simple bug in a kernel launch:

    mykernel<<1,32>>(v,n);  

We’ve fixed the number of threads to be 32, so that’s no longer the problem. To help you out, here’s the compiler error from nvcc:

    error: expression must have integral or unscoped enum type
          mykernel<<1,32>>(v,n);
          ^

Well, that’s not much help. It’s hard to see the bug. The actual error is that “<<” should be “<<<” (three less-than characters). Unfortunately, the compilation error is not very clear, because C++ parses “<<” as the bitwise left shift operator, which works on integers, and that’s why it wants an “integral” or “enum” value.
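
With the third chevron restored, the launch compiles and runs one block of 32 threads:

    mykernel<<<1,32>>>(v,n);  // CORRECT (triple chevron)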

Wrong Block and Thread Computations

The first part of launching a GPU kernel is to work out how many blocks and threads we need. This is called calculating the “grid dimensions,” and there are many ways to go wrong:

  • Threads-per-block should be a multiple of 32
  • Too few blocks
  • Too many blocks

Let’s examine these issues in turn.

Threads-per-Block Multiple of 32

The number of threads per block (aka the “block size”) should be a multiple of the warp size, which is 32 threads. Hence, it can be as low as 32, but block sizes of 256 or 512 are commonly recommended in real-world kernels. The maximum permitted by CUDA is 1024 threads per block.

This is not good:

    float v[54];
    ...
    int blocks = 2;
    mykernel<<<blocks, 27>>>(v,n);  // BAD

This might actually work, but it’s very inefficient, and offends the sensibilities of any experienced CUDA programmer, for reasons discussed below.

But first, note that this is worse:

    float v[54];
    ...
    int blocks = 54;
    mykernel<<<blocks, 1>>>(v,n);  // BAD

If the threads per block is not a multiple of 32, there will be leftover threads in a warp that aren’t properly utilized (or might be doing the wrong thing). The reason is that CUDA schedules threads in “warps” that contain exactly 32 threads. With 27 threads per block, there were 5 wasted threads in each warp, and with 1 thread per block, there were 31 wasted threads. So, instead, you want something more like this:

    float v[64];
    int n = 64;
    ...
    int blocks = 2;
    mykernel<<<blocks,32>>>(v,n);  // BETTER

Or you can do this:

    float v[64];
    int n = 64;
    ...
    int blocks = 1;
    mykernel<<<blocks,64>>>(v,n);  // BETTER

CUDA can only schedule threads a full warp (32 threads) at a time. If you try to schedule fewer than that, the remaining threads in the warp either still run, which is a bug or a slug, or sit idle and are unavailable for other work, which is a slug.
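
If you’d rather check the warp size and block limits at runtime than trust the folklore, the device properties report them. Here’s a minimal sketch using the standard cudaGetDeviceProperties call:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // Properties of device 0
    printf("Warp size: %d\n", prop.warpSize);                        // Typically 32
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);  // Typically 1024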

Too Few Blocks

You don’t always know the incoming size N of your vector data structure (well, actually, you often do in AI engines, because they have static dimensions, but anyway). Let’s try to generalize our computation of how many blocks to launch, each with a fixed number of threads. There are a few basic points:

  • Each block has the same number of threads (i.e., the threads-per-block)
  • You can’t run half a block (all its threads will run, even if you don’t need that many).

In our block calculations, we have a Goldilocks problem: it’s easy to get too many or too few, when we need it to be exactly right.

Let’s try to generalize our computations:

    int n = 33;
    float *v = (float*)malloc(n * sizeof(float));
    ...
    int threads_per_block = 32;   // or 64 or 256...
    int blocks = n / threads_per_block;  // BUGGY!
    mykernel<<<blocks,threads_per_block>>>(v,n);

Does this work? No, because the “/” operator performs integer division, so “33/32” truncates to 1, and we can’t launch a fraction of a block. The result is:

    threads_per_block == 32
    blocks == 1

We’re only running 32 threads, but our vector has 33 elements. Hence, if each kernel is processing one vector element, we’ve missed one of the extra elements. This is a code logic bug. It won’t crash CUDA, and nobody’s going to whisper in your ear that the end of your vector didn’t get processed.

Now, maybe the kernel code has a loop to handle the extra vector elements, which would fix the code logic bug, but it’s still a slug. But I’m jumping ahead by mentioning a loop, when your kernel probably doesn’t even have an if statement yet. It would be better just to use better block size computations.
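
For the record, that kind of loop is usually written as a “grid-stride loop.” Here’s a hedged sketch of what it might look like for a vector-clearing kernel (the kernel name is hypothetical, and it uses the blockIdx and blockDim builtins explained later in this chapter):

    __global__ void clear_vector_gridstride(float* v, int n)
    {
        int stride = blockDim.x * gridDim.x;  // Total number of threads in the whole grid
        for (int id = blockIdx.x * blockDim.x + threadIdx.x; id < n; id += stride) {
            v[id] = 0.0f;  // Each thread clears every stride-th element
        }
    }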

Too Many Blocks

The previous section had too few blocks, so let’s try to fix it. Here’s a computation that adds an extra block:

    int threads_per_block = 32;  
    int blocks = (n + threads_per_block) / threads_per_block;  // BUGGY!

This is actually the same as:

    int threads_per_block = 32;  
    int blocks = (n / threads_per_block) + 1;  // BUGGY!

This seems to work better because “(33+32)/32” or “(33/32)+1” both evaluate to 2 blocks, which is what we want. However, we want the computation to work for all values of n, and it’s actually wrong if n==32 because it needlessly uses 2 blocks when 1 block of 32 threads is enough. We have a whole extra block doing nothing, or maybe crashing stuff. You’d think the GPU would be thrilled if you threw it an extra block, but no, not so much.

The trick to get it working for all values is to use a common CUDA idiom:

    int blocks = (n + threads_per_block - 1) / threads_per_block;  // CORRECT

The subtraction of 1 in the numerator fixes everything using the power of mathematics. For n==32, we get 1 and for n==33 there are correctly 2 blocks. It also works correctly for n==63 or n==64, although I feel like I should add a unit test.
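
Speaking of a unit test, a quick host-side sanity check of the idiom might look something like this sketch, using the standard assert macro:

    #include <assert.h>

    void test_block_counts()
    {
        const int threads_per_block = 32;
        for (int n = 1; n <= 100; n++) {
            int blocks = (n + threads_per_block - 1) / threads_per_block;
            assert(blocks * threads_per_block >= n);       // Enough threads to cover the vector
            assert((blocks - 1) * threads_per_block < n);  // No completely wasted block
        }
    }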

Odd Vector Sizes. All of this fuss about the number 32 being special because it’s the warp size might make you realize something: all of your data should be in arrays (or matrices, or tensors) whose dimensions are multiples of 32. It’s much harder to keep everything tidy if N is a prime number or some other oddity.

That’s why a lot of the AI engines have parameters and “hidden dimensions” and “context window sizes” that are a multiple of 32 (e.g., 4096 or 32k or whatever). You won’t see this in ChatGPT’s source code:

    float mytensor[27][1531][17];   // Weird

Better would be:

    float mytensor[4096][256][32768];  // The power of 32!

Now you know: it’s all CUDA’s fault. Which is more true than we like to admit, since all of this AI frenzy might not have happened without CUDA’s blazing speed.

Wrong Kernel Index Calculations

Okay, so now that our computations of blocks and threads are sorted, let’s actually show a very simple kernel. All good kernels start with the special “__global__” keyword, which means that the function runs on the GPU (and can be launched from host code). Here’s a kernel that clears a vector to zero, which is a great piece of work if we pretend we don’t know about functions like memset or cudaMemset.

Anyway, here’s the kernel code for the GPU to run lots of times in parallel:

    __global__ void aussie_clear_vector_kernel_buggy1(float* v, int n)
    {
        int id = threadIdx.x;  // BUG!
        v[id] = 0.0;  // Clear one vector element..
    }

And the host code that runs on the CPU, only once, and launches the kernel off into the GPU, looks like this:

    int n = 33;
    float *v = (float*)malloc(n * sizeof(float));
    ...
    int threads_per_block = 32;
    int blocks = ( n + threads_per_block - 1) / threads_per_block;
    aussie_clear_vector_kernel_buggy1<<< blocks, threads_per_block >>> (v, n); 

I’ve left out a lot of the details on the CPU side before and after the fancy <<<...>>> syntax (it’s called “triple chevron” syntax if you really want to know). The host code also has to allocate memory twice (once on the host and once on the device), set up the vectors, and copy them from the CPU to the GPU, and back again, as sketched below. But for now, let’s just look at the blocks and threads.
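
For reference, the omitted host-side plumbing typically looks something like the following sketch (the host_v and dev_v names are illustrative, and error checking is omitted). Note that the kernel really needs a device pointer from cudaMalloc, not the host pointer from malloc, which is a pitfall discussed later in this chapter:

    int n = 33;
    size_t bytes = n * sizeof(float);
    float *host_v = (float*)malloc(bytes);     // Host copy of the vector
    float *dev_v = NULL;
    cudaMalloc((void**)&dev_v, bytes);         // Device copy of the vector
    cudaMemcpy(dev_v, host_v, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU (the usual pattern)

    int threads_per_block = 32;
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    aussie_clear_vector_kernel_buggy1<<<blocks, threads_per_block>>>(dev_v, n);

    cudaDeviceSynchronize();                   // Wait for the kernel to finish
    cudaMemcpy(host_v, dev_v, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaFree(dev_v);
    free(host_v);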

Is there a bug? Umm, well, there is the word “BUG” in the comments, and also in the function name, so maybe I gave it away a little bit. But why is it a bug? First, let’s look at how many blocks and threads-per-block we’ve got:

    threads_per_block == 32
    blocks == 2

We’ve now got 32*2=64 total threads running, each setting one element in a vector of size 33. Aha! Obviously, it’s a buffer overflow inside the kernel when we access v[id]!

Nice try, but no cigar. Actually, the problem is this statement:

    int id = threadIdx.x;  // BUG!

What is threadIdx.x when it’s at home (on a GPU)? Well, the variable “threadIdx” stands for “thread index” and the “.x” part means the offset in the x-dimension (i.e., 1D vectors). There’s also a “.y” for 2D (matrices) and “.z” for 3D (tensors), but we’re just using a vector, so the “.x” part is correct.

The problem is that the “thread index” means the offset of a thread within its current block of threads, not across all the threads in all the blocks. We have set the threads-per-block to 32, so each block launches 32 threads. Hence, the offset of any thread within its own block has the range 0..31 only.

By launching 2 blocks of size 32 threads each, we get this sequence. The first block launches 32 threads, each having a “threadIdx.x” value of 0..31, so they correctly set all the first 32 vector elements to zero (in parallel). Our second block launches another 32 threads at the same time, also in parallel to our first block, but even though they’re in the second block, their “threadIdx.x” value only relates to their own block, so they also have offset values of 0..31. Hence, we get another 32 threads running the kernel function, setting exactly the same values in the vector.

Overall, the outcome is a logic bug in the vector clearing function. The vector elements v[0]..v[31] all get cleared correctly (twice!), but the 33rd element v[32] is not touched. This CUDA kernel will only work correctly if n<=threads_per_block.

Array Bounds Violations

Like the Six Million Dollar Man, we can rebuild the kernel. We need our first block to produce index values 0..31 and our second block to produce 32..63. But how do we know which block we’re in?

The answer is another CUDA builtin variable called “blockIdx” which stands for “block index” and has value 0 for the first block, 1 for the next, and so on. It also has fields x, y, and z, but we only care about the “.x” values for a vector. By the way, the CUDA type is named “dim3” for these three-dimensional builtin object variables, but you don’t need to declare them yourself.
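
Although you never declare blockIdx or threadIdx yourself, you can pass dim3 values explicitly in a kernel launch if you like. A minimal sketch:

    dim3 grid(blocks);               // Same as <<<blocks, ...>>>; .y and .z default to 1
    dim3 block(threads_per_block);
    mykernel<<<grid, block>>>(v, n);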

Using the block index allows us to compute index offsets beyond the threads-per-block, such as:

    int id = blockIdx.x * 32 + threadIdx.x;  // BETTER

This is workable, but inelegant, and we could define a global constant for the threads-per-block:

    const int threads_per_block = 32;
    int id = blockIdx.x * threads_per_block + threadIdx.x;  // PRETTIER

This looks like good C++ code, but the CUDA idiom is actually to use another builtin variable called “blockDim” which stands for “block dimension” and in this case that means threads-per-block (in the x direction). Hence, the fully correct CUDA code, which is somewhat uglier, looks like:

    int id = blockIdx.x * blockDim.x + threadIdx.x;  // CORRECT

The full GPU kernel function looks like this:

    __global__ void aussie_clear_vector_kernel_buggy2(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;  // CORRECT
        v[id] = 0.0;  // Clear one vector element.. OKAY?
    }

Now you can run your corrected kernel, and it will clear all the elements of your vector, rather than only the first 32. But if you’re running it in production, you might want to get your mop and bucket ready to clean up the mess, because this kernel is going to segfault all over the place.

Remember that array bounds violation you were worried about before? Now it’s happening for real.

The blockIdx.x value is 0..1, blockDim.x is 32 (because that was our threads per block), and threadIdx.x has range 0..31. If you crank through all the combinations (sequentially, since you don’t have a GPU in your head), you’ll find that the id array index variable has range 0..63, across threads that all run in parallel.

Which isn’t great for computing v[id] on a vector of size 33. You get 31 array bounds violations, in 31 different threads, all at the same time.

The correction is another common CUDA kernel idiom: the safety if statement. Yes, you can use control flow statements like if statements and loops inside the kernel. The non-crashing version of the kernel looks like this:

    __global__ void aussie_clear_vector_kernel_safe(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;  // CORRECT
        if (id < n)
            v[id] = 0.0;  // Clear one vector element.. Safely!
    }

For some reason, most of the CUDA examples write it as above, without curly braces, because adding braces would slow it down or something. I’d personally prefer this minor tweak to the C++ styling:

    if (id < n) {
        v[id] = 0.0;
    }

Finally, it works! This safety test stops any of the 31 threads that would have overflowed the vector from writing to bad memory, and they simply do nothing. The other 33 threads correctly clear all the elements of the vector, 32 of them in the first block, and 1 in the second block.

This code won’t crash, but it’s somewhat sluggy for two reasons:

  • Redundant threads — the extra 31 threads run but do nothing.
  • Warp divergence — some threads take different paths at the if statement.

The redundant threads don’t actually slow anything down, since they run harmlessly in parallel alongside the useful threads. However, they waste electricity and occupy GPU execution resources that could otherwise be doing other computations.

Warp divergence, or “branch divergence,” refers to the fact that GPUs want all the threads in a warp to follow exactly the same instruction path in parallel. Adding an if statement or a loop can hurt performance, because when some threads take one branch and others take the other, the warp effectively has to execute both paths, which slows down the whole process.

Mixing Host and Device Pointers

CUDA has two main classes of pointers to allocated memory: host pointers and device pointers. Host pointers are your regular pointers from malloc or new on the CPU, whereas device pointers are allocated on the GPU by functions such as cudaMalloc.

The two cannot mix!

Here’s a common bug:

    float *vdest = NULL;
    cudaError_t err = cudaMalloc((void**)&vdest, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for %d bytes\n", (int)(n * sizeof(float)));
        return;
    }
    vdest[0] = 0.0;  // Oops!

What’s wrong with that? Isn’t it good style to initialize newly allocated memory, especially since cudaMalloc does not initialize the memory block for you?

Unfortunately, this is a pointer mix-up. The cudaMalloc function allocated a block of memory on the device, and I’ve just tried to initialize it on the CPU. Accessing the device vector in host code will literally crash the program.

There are two main ways to correct this:

    (a) use cudaMemset to initialize the device pointer, called from your host code on the CPU, or

    (b) allocate a host memory block with the normal C++ malloc function, initialize that newly allocated host vector (like the assignment used above), and then copy the host vector up to the GPU device using cudaMemcpy.

Both of these methods are host code that you run on the CPU. The second one is rather a mouthful, but it’s actually the normal sequence of events for more advanced CUDA kernel functions.
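
Here’s a sketch of both corrections, reusing the vdest and n names from the example above (the vhost name is illustrative, and error checking is omitted):

    // (a) Initialize the device memory block directly from host code:
    cudaMemset(vdest, 0, n * sizeof(float));   // All-zero bytes are 0.0f floats

    // (b) Or build a host vector, initialize it, and copy it up to the device:
    float *vhost = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) vhost[i] = 0.0f;               // Host-side initialization
    cudaMemcpy(vdest, vhost, n * sizeof(float), cudaMemcpyHostToDevice);
    free(vhost);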

References

  1. Mingyuan Wu, Husheng Zhou, Lingming Zhang, Cong Liu, Yuqun Zhang, 29 May 2019 (v3), Characterizing and Detecting CUDA Program Bugs, https://arxiv.org/abs/1905.01833.

 
