Aussie AI
Chapter 12. Advanced CUDA Bugs
Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
by David Spuler
Advanced Bugs in CUDA C++
Mastery of CUDA means no bugs, right? Alas, no, it just means a better class of bugs. Here are some of the things that might go wrong:
- Scaling beyond the GPU’s maximum thread count.
- Exceeding shared memory size with over-large blocks.
- Race conditions and other synchronization errors.
- Using up too many registers and spilling into local memory.
- Kernel calls overflowing the GPU device’s stack (esp. if using alloca).
- Shared memory access synchronization errors across threads.
- Cross-compiling for a different architecture than the one used at execution.
Some specific errors in CUDA runtime API usage:
- The cudaSetDevice return error code needs to be checked, even at startup on a single-GPU machine (it can fail sometimes).
- Calling host-only functions from device code.
- Kernel printf output buffer overflow due to the buffer not being flushed.
- Warp shuffle when the target thread is inactive is undefined behavior (the returned value is undefined).
- Using __assume, __builtin_assume, or __builtin_unreachable when the assumption is actually false (oops!).
- Unsupported obscure printf formats (it’s not exactly the same as the standard library).
- Kernel printf buffered output missing at program termination (needs a synchronizing call, such as cudaDeviceReset or cudaDeviceSynchronize).
- Thread group blocks must have all threads participating.
- Trying to extend CUDA namespaces is undefined (e.g., “cuda::” or “nv::”).
- Not checking for CUDA error return codes after every CUDA runtime call, and after every kernel launch with cudaGetLastError, which can miss errors or at least give a misleading appearance of where the error is occurring.
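A minimal error-checking wrapper makes that last point easy to follow in practice. Here is a hedged sketch (the macro name and message format are my own invention, not from any library):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking macro: wrap every CUDA runtime call.
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n", \
                (int)err_, cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaSetDevice(0));
//   my_kernel<<<blocks, threads>>>(args);
//   CUDA_CHECK(cudaGetLastError()); // catch synchronous launch errors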
CUDA-specific arithmetic problems include:
- Kernel-called math functions are silent on errors (e.g., they do not set errno or emit floating-point exceptions).
- Integer division by zero does not cause an exception in device code (nor does the integer remainder operator).
- Integer overflow does not cause an exception (which is also normal for standard C++!).
- Floating-point to integer conversion overflow yields INT_MAX (or the equivalent constant for other types) in device code.
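To see a couple of these silent behaviors in action, here is a small illustrative test kernel (my own sketch, not the book’s code):

#include <climits>
#include <cstdio>
#include <cuda_runtime.h>

// Operations that might trap on a CPU are silent in device code.
__global__ void arithmetic_quirks_kernel(int divisor)
{
    int x = 100 / divisor;  // No exception in device code, even if divisor == 0
    float big = 1e20f;      // Far too large for an int
    int y = (int)big;       // Overflowing conversion saturates (INT_MAX)
    printf("x=%d y=%d (INT_MAX=%d)\n", x, y, INT_MAX);
}

int main()
{
    arithmetic_quirks_kernel<<<1, 1>>>(0);  // Divide by zero: no trap
    cudaDeviceSynchronize();                // Also flushes kernel printf output
    return 0;
}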
Memory access errors include:
- Mixing malloc/free and cudaMalloc/cudaFree.
- Modifying __constant__ address data.
- Accessing __device__ addresses in host code.
- Local memory access outside that thread.
- Shared memory address (“__shared__”) accessed in host code (probably a crash).
- Shared memory access outside the block that defined it on the device (i.e., thread block scope).
- Tensor fragment mismatches.
- Kernel invocation parameter sends a local address (it should be from cudaMalloc, new, or __device__ addresses).
- Streams or events created by host code accessed in device code.
- Event accessed outside the kernel block that created it (e.g., another block, in host code, or a child grid).
- Streams created in device code accessed outside that block.
- Virtual alias synchronization problems with cuMemMap.
- cudaFreeAsync synchronization issues with its allocation (cudaMallocAsync) or other usage of the address.
- Synchronization issues with cudaMalloc (non-async version) and cudaFreeAsync.
- Accessing __constant__ addresses in host code (likely segfault or other crash).
- Address from cudaGetSymbolAddress can only be used in host code.
- Mismatched virtual function call in host versus device code (i.e., object created in one, virtual function called in the other).
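As an example of the right way to touch __constant__ data from the host, here is an illustrative sketch of mine (not the book’s code) using the runtime’s symbol-copy API rather than a direct access:

#include <cuda_runtime.h>

__constant__ float c_scale;  // Device-side constant memory

void set_scale_from_host(float value)
{
    // Correct: copy via cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(c_scale, &value, sizeof(value));
    // Wrong: writing c_scale directly from host code is undefined (likely a crash).
}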
Portability and compatibility issues include:
- Generally, use of compute capability level features where not supported by the GPU is undefined.
- Arithmetic operations of the GPU may differ from x86 in areas undefined by the IEEE 754 standard.
- Unified Memory usage on a device lacking full support is undefined (compatibility issue).
And don’t forget the slugs:
- Low occupancy rates on SMs.
- Poor load balancing across cores and SMs.
- Memory transfer costs.
- Non-coalesced memory access patterns.
- Redundant barriers (unnecessary synchronization).
- Shared memory bank conflicts.
- Kernel output with printf (useful for tracing, not production).
- Register spills.
- Poor cache management.
There are also some plain old bugs that are fairly common in AI algorithms:
- Tensor shape errors.
- Mixing up the offsets in square matrices (e.g., image data).
And since CUDA C++ is really just C++ plus a layer on top, there’s still the whole boatload of possible low-level C++ coding mistakes. But don’t fret too much, because soon you’ll just be writing the comments and your AI copilot will write all the C++ statements.
Python Brain Mode
What’s wrong with this CUDA C++ kernel code for GPU vector addition?
__global__ void aussie_add_vector_kernel_buggy_python(float* v1, float* v2, float* destv3, int n)
{
    // BUGGY: Add vectors, but C++ ain't Python!!
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)
        float ftmp1 = v1[id]; // Put into registers
        float ftmp2 = v2[id];
        destv3[id] = ftmp1 + ftmp2;
}
C++ is not Python!
This is what happens when a programmer is forced to learn Python,
and then has to context switch to a real programming language like CUDA C++.
Whereas Python uses indentation for semantics, the C++ compiler uses whitespace
only for tokenization, and otherwise discards it entirely.
The above if statement without braces around its body
is actually trying to do this control flow in C++:
if (id < n) {
    float ftmp1 = v1[id];
}
float ftmp2 = v2[id];
destv3[id] = ftmp1 + ftmp2;
Hence, the safety test is not actually safe.
In fact, it won’t even compile, because ftmp1 is declared inside the if statement
branch, and has limited scope, so it can’t be used in the addition operation at the end.
The corrected code is simply to add curly braces:
if (id < n) {
    float ftmp1 = v1[id];
    float ftmp2 = v2[id];
    destv3[id] = ftmp1 + ftmp2;
}
The other way to fix it is to remove the temporaries so that it’s only one statement:
if (id < n)
    destv3[id] = v1[id] + v2[id];
Incidentally, I’m not convinced that using temporary variables in this way
to force GPU register usage is really an optimization.
The nvcc compiler probably puts them into registers anyway.
Confusing Host and Device Pointers
When there are a lot of different pointers to vectors and matrices floating around, it’s easy to get confused. What’s the bug in this code to clear a vector? Note that this is host code, and various error checking has been removed for clarity.
// Set up the host vector
float* v = (float*)malloc(n * sizeof(float)); // Dynamic array for vector
// Set up the self-test data...
aussie_set_vector(v, n, 3.0); // Set all elements to non-zero (for testing)
// Set up the device vector...
float* device_v = NULL;
int sz = n * sizeof(float);
cudaMalloc((void**)&device_v, sz);
// Copy to device vector
cudaMemcpy(device_v, v, sz, cudaMemcpyHostToDevice);
// Kernel launch
int threads_per_block = 32;
int blocks = (n + threads_per_block - 1) / threads_per_block;
aussie_clear_vector_kernel_buggy1 <<< blocks, threads_per_block >>> (v, n);
// Copy GPU data back to the CPU host vector....
cudaMemcpy(v, device_v, sz, cudaMemcpyDeviceToHost);
// Cleanup allocated memory
cudaFree(device_v); // Free the device vector
free(v); // Free the host vector
It’s hard to see, but the kernel launch won’t do what it’s asked; instead, it fails with a CUDA error code. Hopefully the host code is checking for that, although not in the code fragment above! If a kernel fails in a GPU forest, and there’s no error check to hear it fail?
Anyway, this doesn’t crash,
but if you call cudaGetLastError anywhere, you’ll see that this gets CUDA error 700, which is
cudaErrorIllegalAddress.
The error first appears after the kernel launch, so it’s happening in the device code.
If you’re stumped, one way to find out the cause of error 700
would be to run it with the compute-sanitizer tool, which finds memory access errors.
It’s part of the CUDA Toolkit and is free to use.
On Linux or a Google Colab virtual version of Linux, the command would be:
compute-sanitizer a.out
The error report would be something like this:
========= Invalid __global__ write of size 4 bytes
=========     at 0x40 in aussie_clear_vector_kernel_buggy1(float *, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x583fefd3e950 is out of bounds
=========     and is 37,289,833,797,296 bytes before the nearest allocation at 0x7a2a27200000 of size 131,072 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
I particularly enjoy the fact that the faulty address is 37 quadzillion bytes away from where it should be. And that is actually quite a useful hint as to the cause.
To narrow our focus, here’s the kernel launch with the mistake:
aussie_clear_vector_kernel_buggy1 <<< blocks, threads_per_block >>> (v, n);
Your first thought is probably that something’s wrong with the blocks and threads calculations.
But actually, the culprit is simply v,
which is a host vector allocated by malloc on the CPU, not a device vector allocated by cudaMalloc.
The GPU cannot access ordinary host memory unless it has been specially allocated for such access (e.g., pinned or managed/unified memory).
So, we’ve passed the GPU kernel a host vector address that device code isn’t allowed to touch!
The fix is to use the correct vector device_v, which is a device vector:
aussie_clear_vector_kernel_buggy1 <<< blocks, threads_per_block >>> (device_v /*FIX!*/ , n);
Copy-Paste Bugs for cudaMemcpy
It seems a little ironic that humans make copy-paste bugs when using the cudaMemcpy API.
However, it’s almost always used twice, before and after a kernel launch,
and there are many ways to go wrong.
Firstly, here are the two main ways to write them correctly:
cudaMemcpy(device_v, v, sz, cudaMemcpyHostToDevice); // Before (CPU-to-GPU)
// ... launch the kernel here
cudaMemcpy(v, device_v, sz, cudaMemcpyDeviceToHost); // After (GPU-to-CPU)
Below are the many ways to call it incorrectly.
In fact, I coded up some tests of cudaMemcpy to see which ones return a CUDA error status, and which are silent errors.
Here are the ways to go wrong:
cudaMemcpy(device_v, v, sz, cudaMemcpyDeviceToHost); // device-to-host to a device pointer
cudaMemcpy(v, v, sz, cudaMemcpyHostToDevice); // host-to-device to a host pointer
cudaMemcpy(device_v, device_v, sz, cudaMemcpyHostToDevice); // host-to-device from a device pointer
cudaMemcpy(device_v, device_v, sz, cudaMemcpyDeviceToHost); // device-to-host to a device pointer
cudaMemcpy(v, NULL /*device_v*/, sz, cudaMemcpyDeviceToHost); // device-to-host from NULL pointer
cudaMemcpy(NULL /*v*/, device_v, sz, cudaMemcpyDeviceToHost); // device-to-host to NULL pointer
cudaMemcpy(v, device_v, n /*sz*/, cudaMemcpyDeviceToHost); // device-to-host too few bytes
cudaMemcpy(v, device_v, 0 /*sz*/, cudaMemcpyDeviceToHost); // device-to-host zero bytes
cudaMemcpy(device_v, v, n /*sz*/, cudaMemcpyHostToDevice); // host-to-device too few bytes
cudaMemcpy(device_v, v, 0 /*sz*/, cudaMemcpyHostToDevice); // host-to-device zero bytes
cudaMemcpy(device_v, v, (int)cudaMemcpyHostToDevice /*sz*/, (cudaMemcpyKind)sz /*cudaMemcpyHostToDevice*/); // reverse sz and mode params
cudaMemcpy(v, device_v, sz, cudaMemcpyHostToHost); // host-to-host from device pointer
cudaMemcpy(device_v, v, sz, cudaMemcpyHostToHost); // host-to-host to device pointer
cudaMemcpy(v, v, sz, cudaMemcpyHostToHost); // host-to-host same host pointer
cudaMemcpy(device_v, v, sz, cudaMemcpyDeviceToDevice); // device-to-device from host pointer
cudaMemcpy(v, device_v, sz, cudaMemcpyDeviceToDevice); // device-to-device to host pointer
cudaMemcpy(device_v, device_v, sz, cudaMemcpyDeviceToDevice); // device-to-device same device pointer
cudaMemcpy(device_v, device_v, sz*2, cudaMemcpyDeviceToDevice); // device-to-device too many bytes
cudaMemcpy(v, v, sz*2, cudaMemcpyHostToHost); // host-to-host too many bytes
cudaMemcpy(device_v, v, sz*2, cudaMemcpyHostToDevice); // host-to-device too many bytes
cudaMemcpy(v, device_v, sz*2, cudaMemcpyDeviceToHost); // device-to-host too many bytes
And here is the output of my test program (abridged over multiple runs):
AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-host to a device pointer
AUSSIE CUDA ERROR: val=1 (invalid argument) - host-to-device to a host pointer
NO error detected - host-to-device from a device pointer
NO error detected - device-to-host to a device pointer
AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-host from NULL pointer
AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-host to NULL pointer
NO error detected - device-to-host too few bytes
NO error detected - device-to-host zero bytes
NO error detected - host-to-device too few bytes
NO error detected - host-to-device zero bytes
AUSSIE CUDA ERROR: val=21 (invalid copy direction for memcpy) - reverse sz and mode params
NO error detected - host-to-host from device pointer
NO error detected - host-to-host to device pointer
NO error detected - host-to-host same host pointer
AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-device from host pointer
AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-device to host pointer
NO error detected - device-to-device same device pointer
AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-device too many bytes
NO error detected - host-to-host too many bytes
AUSSIE CUDA ERROR: val=1 (invalid argument) - host-to-device too many bytes
AUSSIE CUDA ERROR: val=1 (invalid argument) - device-to-host too many bytes
Some of the silent errors might be detected by compute-sanitizer,
but when I ran it over this test, it didn’t seem to find any more of the “too many bytes” overflows,
except for the ones that had already emitted a CUDA error code.
Note that the compute-sanitizer tool also helpfully
reports any of the CUDA error return statuses that get triggered
by the CUDA runtime API, along with any other memory address failures detected.
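One defensive option is to wrap cudaMemcpy so that the pointer kinds are sanity-checked before copying. Here is an illustrative sketch of mine (the wrapper and helper names are invented; cudaPointerGetAttributes is a real runtime API, though its exact behavior for plain malloc pointers has varied across CUDA versions):

#include <cassert>
#include <cuda_runtime.h>

// Returns true if the runtime reports this pointer as device memory.
static bool is_device_ptr(const void* p)
{
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) {
        cudaGetLastError();  // Clear the error for unregistered host pointers (older CUDA)
        return false;
    }
    return attr.type == cudaMemoryTypeDevice;
}

cudaError_t checked_memcpy(void* dst, const void* src, size_t sz, cudaMemcpyKind kind)
{
    assert(sz > 0);  // Zero-byte copies succeed silently but are usually a bug
    if (kind == cudaMemcpyHostToDevice)
        assert(is_device_ptr(dst) && !is_device_ptr(src));
    else if (kind == cudaMemcpyDeviceToHost)
        assert(!is_device_ptr(dst) && is_device_ptr(src));
    return cudaMemcpy(dst, src, sz, kind);
}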
Silent Kernel Launch Failures
This is another weird oddity about the CUDA Runtime API. If the kernel launch fails in a synchronous way, such as from too many threads-per-block or other grid dimension error, the error code gets lost. Rather surprisingly, this lost error code occurs in a very common CUDA idiom:
- Kernel launch with <<<...>>> syntax.
- cudaDeviceSynchronize immediately thereafter.
Here’s an example of code that will show this silent kernel launch failure:
int BADTHREADS = 32 + 1024; // More than 1024 is illegal
int blocks = (n + BADTHREADS - 1) / BADTHREADS;
// Launch failing kernel
aussie_clear_vector_kernel_basic <<< blocks, BADTHREADS >>> (device_v, n);
errval = cudaDeviceSynchronize();
if (errval != cudaSuccess) {
    // CUDA error...
    AUSSIE_CUDA_ERROR(errval, "cudaDeviceSynchronize failure after GPU kernel");
}
In this case, cudaDeviceSynchronize will incorrectly return cudaSuccess.
And since kernel launches cannot return an error code, the failure code is lost.
The kernel will never run any threads, and also never again report any error code.
It’s a silent kernel launch failure, and our poor clueless CPU probably thinks that the kernel is running.
It’s not just cudaDeviceSynchronize that fails to detect the error code,
but also other calls with implicit synchronization.
For example, if we use cudaMemcpy just after the kernel launch,
it too will return cudaSuccess after such a failed kernel launch.
The solution, and apparently the only way to detect this kernel launch error,
is to call cudaGetLastError or cudaPeekAtLastError
before any call to cudaDeviceSynchronize.
In any case, an example of the code that works is:
// Launch failing kernel
aussie_clear_vector_kernel_basic <<< blocks, BADTHREADS >>> (device_v, n);
errval = cudaGetLastError(); // Correct!
if (errval != cudaSuccess) {
    // CUDA error...
    AUSSIE_CUDA_ERROR(errval, "cudaGetLastError failure after GPU kernel");
}
errval = cudaDeviceSynchronize();
// etc.
Hence, we probably should update our preferred CUDA idiom to:
- Kernel launch with <<<...>>> syntax.
- Call cudaGetLastError or cudaPeekAtLastError immediately thereafter.
- Optionally call cudaDeviceSynchronize.
And one final oddity: if we call cudaPeekAtLastError just after a failed kernel launch,
the error code is returned correctly, but is somehow
missed by subsequent calls such as cudaMemcpy.
It’s like it gets cleared, even though it was only supposed to be a “peek”!
Device Thread Limits
You have to feel sorry for the poor little hapless GPU chips. For a while they get to run mission-critical AI queries, like suggesting recipes using ingredients that start with ‘P’ and other important stuff. But then they overheat a little, and get sent to the remote camps to do Bitcoin mining.
There are several ways that you can exceed a GPU’s limits:
- More than 1024 threads per block.
- Too many blocks.
The block size limit of 1024 threads is a hard limit, and you really should add an assertion before every kernel launch to ensure it (see the sketch below). The problem with launching too many blocks, and thus too many threads, on a GPU is more insidious.
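Here is an illustrative pre-launch check of mine (the function name is invented; cudaDeviceGetAttribute and cudaDevAttrMaxThreadsPerBlock are real), which queries the device’s actual limit rather than hard-coding 1024:

#include <cassert>
#include <cuda_runtime.h>

// Assert that a proposed block size is within the device's hard limit.
void check_launch_config(int threads_per_block)
{
    int max_threads = 0;
    cudaDeviceGetAttribute(&max_threads, cudaDevAttrMaxThreadsPerBlock, 0 /*device 0*/);
    assert(threads_per_block > 0 && threads_per_block <= max_threads);
}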
Launching Far Too Many Blocks. Just when you thought it was safe to go back into the water, here’s the news: this apparently safe and very simple kernel is actually broken:
__global__ void aussie_clear_vector_kernel_buggy1(float* v, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)
        v[id] = 0.0; // Clear vector element... Safely!
}
This is not production-grade CUDA C++ code.
The kernel fails if N is too large,
because the launch always creates exactly one thread per element.
Hence, a very large value of N means a large number of blocks.
This can blow the little GPU’s mind. If you try to work on a vector with a billion elements, that’s a billion divided by 32, or over 31 million warps. Even the amazing NVIDIA GPUs have their limits. For example, a V100 tops out at 16,384 threads, which is a lot less than a billion.
Actually, I underestimated CUDA! When I ran this code in a test just now, it worked just fine! CUDA just schedules lots and lots of blocks, and takes care of it. The program takes a few seconds to run, so I guess it’s not doing all those blocks in parallel, but there’s no crashing or CUDA error codes. Even the unit tests passed. Amazing.
It does eventually fail, but I had to use N=1<<30, which is a huge value.
And the error is not even in the block scheduling, it’s that we get to the level of integer overflow,
and this messes with the CUDA memory addressing scheme.
Here is what compute-sanitizer finds:
========= Invalid __global__ write of size 4 bytes
=========     at 0x80 in aussie_clear_vector_kernel_basic(float *, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x0 is out of bounds
=========     and is 140,236,623,642,624 bytes before the nearest allocation at 0x7f8b62230000 of size 65,536 bytes
It actually reports errors about addresses 0x0, 0x4, 0x8, and so on.
This looks like some kind of integer overflow in the memory address logic.
I guess it finally stretched CUDA’s memory model to breaking point!
No, that’s wrong! After further investigation, I’ve discovered that it’s my bug, not CUDA’s.
I mean, address 0x0 is the null pointer, not some random address,
so I should have twigged earlier.
I added a lot of assertions and error return checks to the caller code, and here’s what I found:
- The device vector passed as a kernel parameter was null.
- The cudaMalloc vector was null.
Here’s the culprit:
int sz = n * sizeof(float);
It turns out that if n==1<<30 and sizeof(float)==4, then n * sizeof(float) is 1<<32,
which wraps around to zero in a 32-bit int.
Hence, the size parameter passed to cudaMalloc was zero, and it was failing.
My code was effectively doing:
int sz = 0;
cudaMalloc((void**)&device_v, sz); // sz==0
There were no runtime warnings or CUDA error codes.
If I’d declared the variable n as const,
I would have got a compiler warning about integer constants overflowing,
but at runtime there are no checks for integer overflow.
Anyway, this mistake has some important points:
- Humble pie for me!
- cudaMalloc does not return a CUDA error code if its size is zero.
- But cudaMalloc does return a null address, and I wasn’t checking for that (i.e., “allocation failure”).
- cudaMalloc will act oddly if the size overflows to a negative integer, because its parameter type is size_t, which is typically an unsigned type, so a negative integer value will silently convert to a very large positive unsigned integer.
- Device pointer operations don’t crash from a null dereference on the GPU, nor for an invalid address of v[id] that is based on a null pointer, such as 0x4 from v[1], 0x8 from v[2], etc.
- I should have safety-checked or asserted that “device_v != NULL” before passing it as a kernel launch parameter (see the sketch after this list).
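Putting those lessons together, here is an illustrative defensive allocation sketch (my own code, not the book’s):

#include <cassert>
#include <cuda_runtime.h>

// Allocate a device vector of n floats, checking everything that went wrong above.
float* alloc_device_vector(int n)
{
    float* device_v = NULL;
    size_t sz = (size_t)n * sizeof(float); // size_t math: no 32-bit wraparound
    assert(n > 0 && sz != 0);
    cudaError_t err = cudaMalloc((void**)&device_v, sz);
    assert(err == cudaSuccess);
    assert(device_v != NULL); // Catch the null before any kernel launch
    return device_v;
}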
Solution: Use fewer blocks. The solution is obviously to modify the kernels so that they process more than one vector element, so that we can use fewer blocks overall (and don’t need to overflow any integers!). If you’re only a little too high, you can maybe manually modify the kernel so that each thread sets 2 vector elements by doing two assignments in sequence.
For the more general case, you actually need to add a loop
into the kernel function,
and do the calculations to work out how many elements
each thread needs to process,
which is basically N divided by the
total number of threads launched.
There’s nothing wrong with a kernel like this, and it just looks like an ordinary C++ loop, but we’re getting a bit too advanced now. We’ll defer the discussion of how to do loops in kernels, because there are scary things like “grid-stride loops” and “coalesced data accesses” to think about.
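As a quick preview, here is a minimal sketch of the standard grid-stride loop pattern (illustrative code of mine, not the book’s):

// Each thread clears multiple elements, striding by the total thread count,
// so the launch can use a fixed, modest number of blocks regardless of n.
__global__ void clear_vector_grid_stride(float* v, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x; // Total threads in the grid
    for (size_t id = (size_t)blockIdx.x * blockDim.x + threadIdx.x; id < n; id += stride)
        v[id] = 0.0f;
}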