
Chapter 12. Advanced CUDA Bugs

  • Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
  • by David Spuler

Advanced Bugs in CUDA C++

Mastery of CUDA means no bugs, right? Alas, no, it just means a better class of bugs. Here are some of the things that might go wrong:

  • Scaling beyond the GPU’s maximum thread count.
  • Exceeding shared memory size with over-large blocks.
  • Race conditions and other synchronization errors.
  • Using up too many registers and spilling into local memory.
  • Kernel calls overflowing the GPU device’s stack (esp. if using alloca).
  • Shared memory access synchronization errors across threads.
  • Cross-compiling for a different architecture than the one used for execution.

Some specific errors in CUDA runtime API usage:

  • The cudaSetDevice return code needs to be checked, even at startup on a single-GPU machine (it can fail sometimes).
  • Calling host-only functions from device code.
  • Kernel printf output lost when its buffer overflows because the buffer is never flushed.
  • Warp shuffle when the target thread is inactive is undefined behavior (the returned value is undefined).
  • Using __assume, __builtin_assume, or __builtin_unreachable, when it’s actually false (oops!).
  • Unsupported obscure printf formats (it’s not exactly the same as the standard library).
  • Kernel printf buffered output missing at program termination (needs a synchronizing call, such as cudaDeviceReset or cudaDeviceSynchronize).
  • Block-wide operations (e.g., __syncthreads) executed without all threads in the block participating.
  • Trying to extend CUDA namespaces is undefined (e.g., “cuda::” or “nv::”).
  • Not checking CUDA error return codes after every CUDA runtime call, and after every kernel launch with cudaGetLastError. Skipping these checks can miss errors entirely, or at least give a misleading appearance of where the error is occurring (a checking macro is sketched below).
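
One way to make the checking habitual is a small macro. Here is a minimal sketch of one (the macro name and the abort-on-error policy are my own choices, not anything mandated by the CUDA runtime):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Hypothetical checking macro: wraps any CUDA runtime call.
    #define CUDA_CHECK(call) do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n", \
                (int)err_, cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

    // Usage: wrap every runtime call, and check every kernel launch:
    //   CUDA_CHECK(cudaMalloc((void**)&p, sz));
    //   my_kernel<<<blocks, threads>>>(p, n);
    //   CUDA_CHECK(cudaGetLastError());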

CUDA-specific arithmetic problems include:

  • Kernel-called math functions are silent on errors (e.g., do not set errno or emit floating-point exceptions).
  • Integer division by zero does not cause an exception in device code (and likewise the integer remainder operator); see the sketch after this list.
  • Integer overflow does not cause an exception (which is also normal for standard C++!)
  • Floating-point to integer conversion overflow yields INT_MAX (or the equivalent limit for the target type) in device code.
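
As a quick illustration, a kernel like the following sketch runs to completion without any trap (the kernel name is hypothetical, and the division results are undefined values rather than errors):

    // Sketch: device math is silent about error conditions.
    // The 'zero' parameter stops the compiler folding the division away.
    __global__ void aussie_arithmetic_oddities_kernel(int* out, int zero)
    {
        out[0] = 7 / zero;       // Division by zero: no exception, undefined result
        out[1] = 7 % zero;       // Same for the remainder operator
        out[2] = (int)3.0e15f;   // Out-of-range conversion clamps to INT_MAX
    }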

Memory access errors include:

  • Mixing malloc/free and cudaMalloc/cudaFree.
  • Modifying __constant__ address data.
  • Accessing __device__ addresses in host code.
  • Accessing one thread’s local memory from a different thread.
  • Shared memory address (“__shared__”) accessed in host code (probably a crash).
  • Shared memory access outside the block that defined it on the device (i.e., thread block scope).
  • Tensor fragment mismatches.
  • Passing a local (stack) address as a kernel launch parameter (pointers should come from cudaMalloc, device-side new, or __device__ addresses).
  • Streams or events created by host code accessed in device code.
  • Event accessed outside the kernel block that created it (e.g., another block, in host code or a child grid).
  • Streams created in device code accessed outside that block.
  • Virtual alias synchronization problems with cuMemMap.
  • cudaFreeAsync synchronization issues with its allocation (cudaMallocAsync) or other usage of the address.
  • Synchronization issues with cudaMalloc (non-async version) and cudaFreeAsync.
  • Accessing __constant__ addresses in host code (likely a segfault or other crash); see the sketch after this list.
  • Dereferencing the address from cudaGetSymbolAddress in host code (it can only be dereferenced in device code).
  • Mismatched virtual function call in host versus device code (i.e., object created in one, virtual function called in the other).
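
On the __constant__ and symbol items, here is a minimal sketch of the wrong and right way to set a __constant__ variable from the host (the names are my own, and error checking is omitted for brevity):

    #include <cuda_runtime.h>

    __constant__ float g_scale;   // Lives in device constant memory

    void set_scale_from_host(float value)
    {
        // WRONG: g_scale = value;   // Host code must not touch a device address
        cudaMemcpyToSymbol(g_scale, &value, sizeof(value));  // Correct
    }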

Portability and compatibility issues include:

  • Generally, using features of a compute capability level that the GPU does not support is undefined (see the sketch after this list).
  • Arithmetic operations of the GPU may differ from x86 in areas undefined by the IEEE 754 standard.
  • Unified Memory usage on a device lacking full support is undefined (compatibility issue).
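
A common defensive pattern for the compute capability issue (a sketch with my own function name, using only standard CUDA runtime calls) is to query the device at startup and fall back or bail out if it’s too old:

    #include <cuda_runtime.h>

    // Sketch: refuse to run code paths the device can't support.
    bool device_is_ampere_or_newer(int device_id)
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess)
            return false;
        return prop.major >= 8;   // Compute capability 8.x or newer
    }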

And don’t forget the slugs:

  • Low occupancy rates on SMs.
  • Poor load balancing across cores and SMs.
  • Memory transfer costs.
  • Non-coalesced memory access patterns (see the sketch after this list).
  • Redundant barriers (unnecessary synchronization).
  • Shared memory bank conflicts.
  • Kernel output with printf (useful for tracing, not production).
  • Register spills.
  • Poor cache management.
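
As one quick illustration of the memory access items, here is a sketch (with hypothetical kernel names) contrasting a coalesced and a non-coalesced access pattern; both are correct, but the second is much slower:

    // Coalesced: adjacent threads touch adjacent floats (fast).
    __global__ void copy_coalesced(float* dst, const float* src, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) dst[id] = src[id];
    }

    // Non-coalesced: adjacent threads touch floats far apart (slow).
    __global__ void copy_strided(float* dst, const float* src, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) {
            int j = (id * 32) % n;   // Scatter the accesses across memory
            dst[j] = src[j];
        }
    }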

There are also some plain old bugs that are fairly common in AI algorithms:

  • Tensor shape errors.
  • Mixing up row and column offsets in square matrices (e.g., image data).

And since CUDA C++ is really just C++ with a layer on top, there’s still the whole boatload of possible low-level C++ coding mistakes. But don’t fret too much, because soon you’ll just be writing the comments and your AI copilot will write all the C++ statements.

Python Brain Mode

What’s wrong with this CUDA C++ kernel code for GPU vector addition?

    __global__ void aussie_add_vector_kernel_buggy_python(float* v1, float *v2, float *destv3, int n)
    {
        // BUGGY: Add vectors, but C++ ain't Python!! 
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n)
            float ftmp1 = v1[id];  // Put into registers
            float ftmp2 = v2[id];
            destv3[id] = ftmp1 + ftmp2;
    }

C++ is not Python! This is what happens when a programmer is forced to learn Python, and then has to context-switch back to a real programming language like CUDA C++. Whereas Python uses indentation for semantics, the C++ compiler uses whitespace only for tokenization and otherwise discards it. Without braces around its body, the above if statement is actually doing this control flow in C++:

    if (id < n) {
        float ftmp1 = v1[id];
    }
    float ftmp2 = v2[id];
    destv3[id] = ftmp1 + ftmp2;

Hence, the safety test is not actually safe. In fact, it won’t even compile, because ftmp1 is declared inside the if statement’s body, and its limited scope means it can’t be used in the addition at the end. The corrected code simply adds curly braces:

    if (id < n) {
        float ftmp1 = v1[id];
        float ftmp2 = v2[id];
        destv3[id] = ftmp1 + ftmp2;
    }

The other way to fix it is to remove the temporaries so that it’s only one statement:

    if (id < n) 
        destv3[id] = v1[id] + v2[id];

Incidentally, I’m not convinced that using temporary variables in this way to force GPU register usage is really an optimization. The nvcc compiler probably puts them into registers anyway.

Confusing Host and Device Pointers

When there’s a lot of different pointers to vectors and matrices floating around, it’s easy to get confused. What’s the bug in this code to clear a vector? Note that this is host code and various error checking has been removed for clarity.

    // Set up the host vector
    float* v = (float*)malloc(n * sizeof(float));  // Dynamic array for vector

    // Set up the self-test data...
    aussie_set_vector(v, n, 3.0);  // Set all elements to non-zero (for testing)

    // Set up the device vector...
    float* device_v = NULL;
    int sz = n * sizeof(float);
    cudaMalloc((void**)&device_v, sz);

    // Copy to device vector
    cudaMemcpy(device_v, v, sz, cudaMemcpyHostToDevice);

    // Kernel launch
    int threads_per_block = 32;
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    aussie_clear_vector_kernel_buggy1 <<< blocks, threads_per_block >>> (v, n); 

    // Copy GPU data back to the CPU host vector....
    cudaMemcpy(v, device_v, sz, cudaMemcpyDeviceToHost);

    // Cleanup allocated memory
    cudaFree(device_v);   // Free the device vector
    free(v);   // Free the host vector

It’s hard to see, but the kernel launch won’t do what it’s asked; instead, it will fail with a CUDA error code. Hopefully the host code is checking for this, though not in the code fragment above! If a kernel fails in a GPU forest, and there’s no error check to hear it, did it really fail?

Anyway, this doesn’t crash, but if you call cudaGetLastError anywhere, you’ll see that this gets CUDA error 700, which is cudaErrorIllegalAddress. The error first appears after the kernel launch, so it’s happening in the device code.

If you’re stumped, one way to find out the cause of error 700 would be to run it with the compute-sanitizer tool, which finds memory access errors. It’s part of the CUDA Toolkit and is free to use. On Linux or a Google Colab virtual version of Linux, the command would be:

    compute-sanitizer a.out

The error report would be something like this:

========= Invalid __global__ write of size 4 bytes
=========     at 0x40 in aussie_clear_vector_kernel_buggy1(float *, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x583fefd3e950 is out of bounds
=========     and is 37,289,833,797,296 bytes before the nearest allocation at 0x7a2a27200000 of size 131,072 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time

I particularly enjoy the fact that the faulty address is 37 quadzillion bytes away from where it should be. And that is actually quite a useful hint as to the cause.

To narrow our focus, here’s the kernel launch with the mistake:

    aussie_clear_vector_kernel_buggy1 <<< blocks, threads_per_block >>> (v, n); 

Your first thought is probably that something’s wrong with the blocks and threads calculations. But actually, the culprit is simply v, which is a host vector allocated by malloc on the CPU, not a device vector allocated by cudaMalloc. The GPU cannot access host memory unless it has been specially allocated as pinned/mapped or managed (unified) memory. So, we’ve passed the GPU kernel a host vector address that device code isn’t allowed to touch!

The fix is to use the correct vector device_v, which is a device vector:

    aussie_clear_vector_kernel_buggy1 <<< blocks, threads_per_block >>> (device_v /*FIX!*/ , n); 

Copy-Paste Bugs for cudaMemcpy

It seems a little ironic that humans make copy-paste bugs when using the cudaMemcpy API. However, it’s almost always used twice, before and after a kernel launch, and there are many ways to go wrong.

Firstly, here are the two main ways to write them correctly:

    cudaMemcpy(device_v, v, sz, cudaMemcpyHostToDevice);  // Before (CPU-to-GPU)
    // ... launch the kernel here
    cudaMemcpy(v, device_v, sz, cudaMemcpyDeviceToHost);  // After (GPU-to-CPU)

There are many ways to call it incorrectly. In fact, I coded up some tests of cudaMemcpy to see which mistakes return a CUDA error status, and which are silent errors. Here are the ways to go wrong:

    cudaMemcpy(device_v, v, sz, cudaMemcpyDeviceToHost);  // device-to-host to a device pointer        
    cudaMemcpy(v, v, sz, cudaMemcpyHostToDevice);  // host-to-device to a host pointer
    cudaMemcpy(device_v, device_v, sz, cudaMemcpyHostToDevice);  // host-to-device from a device pointer
    cudaMemcpy(device_v, device_v, sz, cudaMemcpyDeviceToHost);  // device-to-host to a device pointer
    cudaMemcpy(v, NULL /*device_v*/, sz, cudaMemcpyDeviceToHost); // device-to-host from NULL pointer
    cudaMemcpy(NULL /*v*/, device_v, sz, cudaMemcpyDeviceToHost);  // device-to-host to NULL pointer
    cudaMemcpy(v, device_v, n /*sz*/, cudaMemcpyDeviceToHost); // device-to-host too few bytes
    cudaMemcpy(v, device_v, 0 /*sz*/, cudaMemcpyDeviceToHost); // device-to-host zero bytes
    cudaMemcpy(device_v, v, n /*sz*/, cudaMemcpyHostToDevice); // host-to-device too few bytes
    cudaMemcpy(device_v, v, 0 /*sz*/, cudaMemcpyHostToDevice); // host-to-device zero bytes
    cudaMemcpy(device_v, v, (int)cudaMemcpyHostToDevice /*sz*/, (cudaMemcpyKind)sz /*cudaMemcpyHostToDevice*/); // reverse sz and mode params
    cudaMemcpy(v, device_v, sz, cudaMemcpyHostToHost); // host-to-host from device pointer
    cudaMemcpy(device_v, v, sz, cudaMemcpyHostToHost); // host-to-host to device pointer
    cudaMemcpy(v, v, sz, cudaMemcpyHostToHost);  // host-to-host same host pointer
    cudaMemcpy(device_v, v, sz, cudaMemcpyDeviceToDevice);  // device-to-device from host pointer
    cudaMemcpy(v, device_v, sz, cudaMemcpyDeviceToDevice);  // device-to-device to host pointer
    cudaMemcpy(device_v, device_v, sz, cudaMemcpyDeviceToDevice); // device-to-device same device pointer
    cudaMemcpy(device_v, device_v, sz*2, cudaMemcpyDeviceToDevice); // device-to-device too many bytes
    cudaMemcpy(v, v, sz*2, cudaMemcpyHostToHost); // host-to-host too many bytes
    cudaMemcpy(device_v, v, sz*2, cudaMemcpyHostToDevice);  // host-to-device too many bytes
    cudaMemcpy(v, device_v, sz*2, cudaMemcpyDeviceToHost);  // device-to-host too many bytes

And here is the output of my test program (abridged over multiple runs):

    AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-host to a device pointer
    AUSSIE CUDA ERROR: val=1 (invalid argument) - host-to-device to a host pointer
    NO error detected - host-to-device from a device pointer
    NO error detected - device-to-host to a device pointer
    AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-host from NULL pointer
    AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-host to NULL pointer
    NO error detected - device-to-host too few bytes
    NO error detected - device-to-host zero bytes
    NO error detected - host-to-device too few bytes
    NO error detected - host-to-device zero bytes
    AUSSIE CUDA ERROR: val=21 (invalid copy direction for memcpy) - reverse sz and mode params
    NO error detected - host-to-host from device pointer
    NO error detected - host-to-host to device pointer
    NO error detected - host-to-host same host pointer
    AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-device from host pointer
    AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-device to host pointer
    NO error detected - device-to-device same device pointer
    AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-device too many bytes
    NO error detected - host-to-host too many bytes
    AUSSIE CUDA ERROR: val=1 (invalid argument) - host-to-device too many bytes
    AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-host too many bytes

Some of the silent errors may be detected by compute-sanitizer when it runs, but I ran this test, and it didn’t seem to find any more of the “too many bytes” overflows, except for the ones that emitted a CUDA error code. Note that the compute-sanitizer tool also helpfully reports any of the CUDA error return statuses that get triggered by the CUDA runtime API, along with any other memory address failures detected.
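
One defensive option (a sketch of my own, not part of the test harness above) is to ask the CUDA runtime what kind of pointer you actually have before copying, via cudaPointerGetAttributes:

    #include <cuda_runtime.h>

    // Sketch: classify a pointer before trusting a copy direction.
    // Note: plain malloc'd host memory reports cudaMemoryTypeUnregistered
    // on recent CUDA versions (older versions returned an error instead).
    bool aussie_is_device_pointer(const void* ptr)
    {
        cudaPointerAttributes attr;
        if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess)
            return false;   // Pointer unknown to the CUDA runtime
        return attr.type == cudaMemoryTypeDevice
            || attr.type == cudaMemoryTypeManaged;
    }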

Silent Kernel Launch Failures

This is another oddity of the CUDA Runtime API. If the kernel launch fails in a synchronous way, such as from too many threads per block or another grid dimension error, the error code gets lost. Rather surprisingly, this lost error code occurs in a very common CUDA idiom:

  • Kernel launch with <<<...>>> syntax.
  • cudaDeviceSynchronize immediately thereafter.

Here’s an example of code that will show this silent kernel launch failure:

    int BADTHREADS = 32 + 1024;  // More than 1024 is illegal
    int blocks = (n + BADTHREADS - 1) / BADTHREADS;

    // Launch failing kernel
    aussie_clear_vector_kernel_basic <<< blocks, BADTHREADS >>> (device_v, n); 

    errval = cudaDeviceSynchronize();
    if (errval != cudaSuccess) {
        // CUDA error...
        AUSSIE_CUDA_ERROR(errval, "cudaDeviceSynchronize failure after GPU kernel");
    }

In this case, cudaDeviceSynchronize will incorrectly return cudaSuccess. And since kernel launches cannot return an error code, the failure code is lost. The kernel will never run any threads, and also never again report any error code. It’s a silent kernel launch failure, and our poor clueless CPU probably thinks that the kernel is running.

It’s not just cudaDeviceSynchronize that fails to detect the error code, but also other calls with implicit synchronization. For example, if we use cudaMemcpy just after the kernel launch, it too will return cudaSuccess after such a failed kernel launch.

The solution, and apparently the only way to detect this kernel launch error, is to call cudaGetLastError or cudaPeekAtLastError before any call to cudaDeviceSynchronize.

In any case, an example of the code that works is:

    // Launch failing kernel
    aussie_clear_vector_kernel_basic <<< blocks, BADTHREADS >>> (device_v, n); 

    errval = cudaGetLastError();   // Correct!
    if (errval != cudaSuccess) {
        // CUDA error...
        AUSSIE_CUDA_ERROR(errval, "cudaGetLastError failure after GPU kernel");
    }

    errval = cudaDeviceSynchronize();
    // etc.

Hence, we probably should update our preferred CUDA idiom to:

  • Kernel launch with <<<...>>> syntax.
  • Call cudaGetLastError or cudaPeekAtLastError immediately thereafter.
  • Optionally call cudaDeviceSynchronize.

And one final oddity: if we call cudaPeekAtLastError just after a failed kernel launch, the error code is returned correctly, but is somehow missed by subsequent calls such as cudaMemcpy. It’s like it gets cleared, even though it was only supposed to be a “peek”!

Device Thread Limits

You have to feel sorry for the poor little hapless GPU chips. For a while they get to run mission-critical AI queries, like suggesting recipes using ingredients that start with ‘P’ and other important stuff. But then they overheat a little, and get sent to the remote camps to do Bitcoin mining.

There are several ways that you can exceed a GPU’s limits:

  • More than 1024 threads per block.
  • Too many blocks.

The limit of 1024 threads per block is a hard limit, and you really should add an assertion before every kernel launch to ensure it (a sketch follows). The problem with launching too many blocks, and thus too many threads, on a GPU is more insidious.
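
Here is a minimal sketch of such a check (the helper name is mine, and it queries the device’s real limits rather than hard-coding 1024):

    #include <cassert>
    #include <cuda_runtime.h>

    // Sketch: validate launch dimensions against the device's limits.
    void aussie_check_launch_dims(int blocks, int threads_per_block)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // Assumes device 0
        assert(threads_per_block >= 1 && threads_per_block <= prop.maxThreadsPerBlock);
        assert(blocks >= 1 && blocks <= prop.maxGridSize[0]);
    }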

Launching Far Too Many Blocks. Just when you thought it was safe to go back into the water, here’s the news: this apparently safe and very simple kernel is actually broken:

    __global__ void aussie_clear_vector_kernel_basic(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n)
            v[id] = 0.0;  // Clear vector element.. Safely!
    }

This is not production-grade CUDA C++ code. The kernel fails if n is too large, because the launch always creates exactly one thread per element. Hence, a very large value of n means a very large number of blocks.

This can blow the little GPU’s mind. If you try to work on a vector with a billion elements, that’s a billion threads, or roughly 31 million warps of 32. Even the amazing NVIDIA GPUs have their limits: a V100, for example, can only have about 163,840 threads resident at once (80 SMs at 2,048 threads each), which is a lot less than a billion.

Actually, I underestimated CUDA! When I ran this code in a test just now, it worked just fine! CUDA just schedules lots and lots of blocks, and takes care of it. The program takes a few seconds to run, so I guess it’s not doing all those blocks in parallel, but there’s no crashing or CUDA error codes. Even the unit tests passed. Amazing.

It does eventually fail, but I had to use N=1<<30, which is a huge value. And the error is not even in the block scheduling: we get to the level of integer overflow, and this messes with the CUDA memory addressing scheme. Here is what compute-sanitizer finds:

========= Invalid __global__ write of size 4 bytes
=========     at 0x80 in aussie_clear_vector_kernel_basic(float *, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x0 is out of bounds
=========     and is 140,236,623,642,624 bytes before the nearest allocation at 0x7f8b62230000 of size 65,536 bytes

It actually reports errors about addresses 0x0, 0x4, 0x8, and so on. This looks like some kind of integer overflow in the memory address logic. I guess it finally stretched CUDA’s memory model to breaking point!

No, that’s wrong! After further investigation, I’ve discovered that it’s my bug, not CUDA’s. I mean, address 0x0 is the null pointer, not some random address, so I should have twigged earlier.

I added a lot of assertions and error return checks to the caller code, and here’s what I found:

  • The device vector passed as a kernel parameter was null.
  • The pointer returned by cudaMalloc was null.

Here’s the culprit:

    int sz = n * sizeof(float);

It turns out that if n==1<<30 and sizeof(float)==4, then n * sizeof(float) is 1<<32, which becomes zero when truncated into a 32-bit int. Hence, the size parameter passed to cudaMalloc was zero, and it was failing. My code was effectively doing:

   int sz = 0;
   cudaMalloc((void**)&device_v, sz);  // sz==0

There were no runtime warnings or CUDA error codes. If I’d declared the variable n as const, I would have got a compiler warning about integer constants overflowing, but at runtime there are no checks for integer overflow.
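
Incidentally, the narrow fix for this particular overflow (my suggestion, assuming the usual 64-bit size_t) is to keep the byte count in a size_t and to check the allocation result:

    size_t sz = (size_t)n * sizeof(float);   // 64-bit math; no truncation into an int
    float* device_v = NULL;
    cudaError_t errval = cudaMalloc((void**)&device_v, sz);
    assert(errval == cudaSuccess && device_v != NULL);   // Catch failures early (needs <cassert>)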

Anyway, this mistake has some important points:

  • Humble pie for me!
  • cudaMalloc does not return a CUDA error code if its size is zero.
  • But cudaMalloc does return a null address, and I wasn’t checking for that (i.e., “allocation failure”).
  • cudaMalloc will act oddly if the size computation goes negative, because its parameter type is size_t, which is unsigned, so a negative value silently converts to a very large positive number.
  • Device pointer operations don’t trap on a null dereference on the GPU, nor on an invalid address built from a null pointer, such as 0x4 from v[1], or 0x8 from v[2].
  • I should have safety-checked or asserted that “device_v != NULL” before passing it as a kernel launch parameter.

Solution: Use fewer blocks. The solution is obviously to modify the kernels so that each thread processes more than one vector element, so that we can use fewer blocks overall (and don’t need to overflow any integers!). If you’re only a little over the limit, you can manually modify the kernel so that each thread sets two vector elements with two assignments in sequence.

For the more general case, you actually need to add a loop to the kernel function, and do the calculations to work out how many elements each thread needs to process, which is basically n divided by the total number of threads launched.
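
Here is a minimal sketch (with a hypothetical kernel name) of the classic grid-stride pattern: each thread starts at its global index and strides forward by the total thread count until it runs off the end of the vector:

    // Sketch: grid-stride loop, so a modest number of blocks can
    // clear a vector of any length, without one thread per element.
    __global__ void aussie_clear_vector_kernel_strided(float* v, int n)
    {
        int stride = blockDim.x * gridDim.x;   // Total threads in the whole grid
        for (int id = blockIdx.x * blockDim.x + threadIdx.x; id < n; id += stride)
            v[id] = 0.0f;   // Each thread clears every stride-th element
    }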

There’s nothing wrong with a kernel like this, and it just looks like an ordinary C++ loop, but we’re getting a bit too advanced now. We’ll defer the fuller discussion of how to do loops in kernels, because there are scary things like “grid-stride loop” tuning and “coalesced data accesses” to think about.

 
