Aussie AI
Chapter 10. GPU Kernel Debugging
Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
by David Spuler
Kernel Debugging Techniques
The kernels running on the GPU are the most important C++ code you’ll ever write, but also the most difficult to debug. When you’re focused on getting the most speed out of the silicon, it’s far too easy to introduce an insidious bug.
We’ve already examined some of the main techniques that look at the kernel from the “outside” in the host code:
- Error checking CUDA Runtime API calls in the host code.
- Managing sticky errors to the extent possible.
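As a reminder of the first point, here is a minimal sketch of the kind of error-checking wrapper meant here; the CUDA_CHECK macro name and its exit-on-failure policy are my own choices for illustration, not a fixed convention:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Hypothetical wrapper macro: check the result of any CUDA Runtime call.
    #define CUDA_CHECK(call) \
        do { \
            cudaError_t err_ = (call); \
            if (err_ != cudaSuccess) { \
                fprintf(stderr, "CUDA error %s at %s:%d: %s\n", \
                        cudaGetErrorName(err_), __FILE__, __LINE__, \
                        cudaGetErrorString(err_)); \
                exit(EXIT_FAILURE); \
            } \
        } while (0)

    int main() {
        float* d_data = nullptr;
        CUDA_CHECK(cudaMalloc((void**)&d_data, 1024 * sizeof(float)));  // checked call
        CUDA_CHECK(cudaFree(d_data));
        return 0;
    }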
Tools should also be near the top of the list. Some of the NVIDIA debugging tools are amazing at resolving device issues, and should be among your first thoughts:
- cuda-gdb for interactive debugging.
- compute-sanitizer and all four of its sub-tools.
There are also two important built-in methods that work inside the kernel C++ code:
- printf (but not fprintf)
- assert
Technically, these are part of the CUDA Runtime and are defined without an explicitly included header file, but sometimes you may need to include <stdio.h> (or <iostream>) or <assert.h>.
Do not underestimate what you can achieve with just these two methods! The main strategies are:
- Add more tracing with printf and use cudaDeviceSynchronize to ensure you see the output messages (but beware the dreaded kernel buffer overflow!).
- Add lots of assert calls peppered throughout (afterwards, you can remove them or leave them, at your discretion).
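Here is a minimal sketch of both strategies in one toy kernel (the kernel and variable names are invented for illustration); note the cudaDeviceSynchronize call on the host side, which flushes the device printf buffer and surfaces any failed assert:

    #include <cstdio>
    #include <cassert>
    #include <cuda_runtime.h>

    // Toy kernel with device-side printf tracing and an assertion.
    __global__ void traced_kernel(const float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        assert(data != nullptr);   // device-side assert
        if (i == 0 && n > 0) {
            printf("thread 0 sees data[0] = %f\n", data[0]);  // device-side printf
        }
    }

    int main() {
        const int n = 256;
        float* d_data = nullptr;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));
        traced_kernel<<<1, n>>>(d_data, n);
        cudaDeviceSynchronize();   // flush printf output; report any assert failure
        cudaFree(d_data);
        return 0;
    }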
Some additional techniques can be helpful in finding the cause of a failure:
- Serializing kernel launches (e.g., set the environment variable CUDA_LAUNCH_BLOCKING to 1, use the command “set cuda launch_blocking on” inside cuda-gdb, or do it manually by adding temporary calls to cudaDeviceSynchronize after kernel launches).
- Launching a single thread, or a single warp, if your kernel uses grid-stride loops or equivalent (a sketch follows this list).
- Adding cudaDeviceSynchronize calls to serialize, and also to flush the kernel output buffers (printf and assert).
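As a rough sketch of the single-thread idea, a grid-stride kernel still computes the full result when launched with one block of one thread, which makes stepping through it in cuda-gdb far less confusing (the kernel and variable names here are invented for the example):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Grid-stride kernel: produces correct results with any launch geometry,
    // so it can be debugged with a single thread.
    __global__ void scale_kernel(float* data, int n, float factor) {
        int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
            data[i] *= factor;
        }
    }

    int main() {
        const int n = 1000;
        float* d_data = nullptr;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        scale_kernel<<<1, 1>>>(d_data, n, 2.0f);   // debugging launch: one block, one thread

        // Manual serialization: wait for the kernel and surface any error.
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            printf("Kernel failed: %s\n", cudaGetErrorString(err));
        }
        cudaFree(d_data);
        return 0;
    }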
Triggering Bugs Earlier
A lot of kernel bugs can be found using the techniques already mentioned, but these approaches can be limited in some less common situations:
- Intermittent bugs — hard-to-reproduce failures.
- Silent bugs — how would you even know?
You can’t really find a bug with cuda-gdb or
the compute-sanitizer memory checker if you can’t reproduce the failure.
On the other hand, an intermittent failure might be a race condition or other synchronization error,
so you probably should run racecheck and synccheck.
Silent bugs are even worse, because you don’t know they exist. I mean, they’re not really a problem, because nobody’s logged a ticket to fix them, but you just know it’ll happen in production at the biggest customer site in the middle of Shark Week.
How do you shake out more bugs? Here are some thoughts:
- Set the CUDA_ENABLE_COREDUMP_ON_EXCEPTION environment variable (because otherwise the program won’t dump core on various GPU errors, and can quietly continue).
- Add more assertions on arithmetic operations in device code (e.g., more tests for floating-point NaN or negative zero).
- Auto-wrap CUDA API calls in host code to ensure error checking for all calls.
- Add fast self-checks for simple kernel launch mistakes (e.g., nthreads%32==0 and nthreads<=1024).
- Check for arithmetic overflow or underflow, a very silent bug for both integers and floating-point (e.g., check unsigned integers aren’t so high they’d be negative if converted to int).
- Be aware that index safety tests in kernels actually hide bugs (use printf messages or assertions instead, assuming you’re managing sizes to avoid extra wasted threads).
- Build a unit-testing kernel to check __device__ utility functions brick-by-brick (a minimal sketch follows this list).
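For that last point, here is a minimal sketch of what a brick-by-brick unit-testing kernel might look like; the safe_relu helper is a made-up example of a __device__ utility function, and the test also shows an arithmetic assertion against NaN:

    #include <cassert>
    #include <cuda_runtime.h>

    // Hypothetical device utility function to be tested in isolation.
    __device__ float safe_relu(float x) {
        return x > 0.0f ? x : 0.0f;
    }

    // Unit-testing kernel: exercises the utility function with asserts,
    // including a check that the result is never NaN.
    __global__ void test_safe_relu() {
        assert(safe_relu(1.5f) == 1.5f);
        assert(safe_relu(-2.0f) == 0.0f);
        assert(!isnan(safe_relu(0.0f)));
    }

    int main() {
        test_safe_relu<<<1, 1>>>();
        cudaDeviceSynchronize();   // a failed device assert is reported here
        return 0;
    }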
There are also some changes to the host code that can help detect some types of kernel bugs:
- Add self-testing code with more complex sanity checks for kernel launches.
- Consider debug wrapper functions with extra self-testing.
- Add more function parameter validation (a debug wrapper combining these checks is sketched below).
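A minimal sketch of such a debug wrapper is shown below; the wrapper name, the kernel, and the specific checks (full warps, at most 1024 threads per block) are illustrative choices, not a standard API:

    #include <cassert>
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = 0.0f;
    }

    // Hypothetical debug wrapper: validates parameters and launch geometry
    // before launching, then checks for launch errors afterwards.
    void launch_my_kernel_debug(float* d_data, int n, int blocks, int threads) {
        assert(d_data != nullptr);              // parameter validation
        assert(n > 0);
        assert(threads > 0 && threads <= 1024); // max threads per block
        assert(threads % 32 == 0);              // full warps only
        my_kernel<<<blocks, threads>>>(d_data, n);
        cudaError_t err = cudaGetLastError();   // catch synchronous launch failures
        if (err != cudaSuccess) {
            fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));
        }
    }

    int main() {
        const int n = 1024;
        float* d_data = nullptr;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        launch_my_kernel_debug(d_data, n, 4, 256);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }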
With all of these things, any extra runtime testing code requires a shipping policy choice: remove it for production, leave it in for production, only leave it in for beta customers, leave in only the fast checks, and so on.
If you’re still struggling with an unsolvable bug, here are a few “hail Mary” passes into the endzone:
- Add a call to cudaGetLastError or cudaPeekAtLastError immediately after kernel launches, before any call to cudaDeviceSynchronize or other implicit synchronizations (otherwise, synchronous kernel launch failures, such as requesting more than 1024 threads per block or too much shared memory, may be silent; admittedly, you should have noticed in a debugger that the kernel isn’t running).
- Run valgrind on CUDA C++ executables; it’s probably not any better than compute-sanitizer, but there may be a few rare things it can find.
- Review the latest code changes; it’s often just a basic mistake hidden by “code blindness” (e.g., check your “.x” and “.y” attributes).
- Watch for mixed-up indices of square matrices, a silent, nasty bug in your algorithm that’s hard to detect with most debugging approaches.
- Add a lot of calls to synchronization primitives like __syncthreads or __syncwarp (this may help prove it’s a synchronization error, but probably won’t help you find it).
- Add a cudaMemset call after every cudaMalloc (and variants) to see if initializing the memory fixes it (admittedly, tools should find this anyway); a combined sketch of this and the cudaGetLastError check follows this list.
- Similarly, try memset after malloc or new, or change to calloc (note that there’s no cudaCalloc!).
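As a rough sketch of those two points combined, the snippet below zeroes device memory right after allocation and checks for a silent synchronous launch failure (the oversized block is deliberate, and the kernel is invented for the example):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void zero_kernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = 0.0f;
    }

    int main() {
        const int n = 4096;
        float* d_data = nullptr;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));  // temporary: rule out uninitialized memory

        // Deliberately bad launch: 2048 threads per block exceeds the 1024 limit.
        zero_kernel<<<1, 2048>>>(d_data, n);

        // Check immediately, before any synchronization, or the failure stays silent.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
        }
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }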
Some other practical housekeeping tips can help with detecting new bugs as soon as they enter the source code, and with planning ahead for future failures:
- Examine compiler warnings and have a “warning-free build” policy.
- Have a separate “make lint” build path with lots more warnings enabled.
- Keep track of random number generator seeds for reproducibility.
- Add some portability sanity checks, such as: static_assert(sizeof(float)==4);
I guarantee that last one will save your bacon one day!
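Here are a few more compile-time checks in the same spirit (a sketch only; the specific assumptions, such as a 64-bit build, should be adjusted to whatever your kernels actually rely on):

    #include <climits>
    #include <cstddef>

    // Portability sanity checks: fail at compile time if the build
    // environment differs from what the kernel arithmetic assumes.
    static_assert(sizeof(float) == 4, "float must be 32 bits");
    static_assert(sizeof(int) == 4, "int must be 32 bits");
    static_assert(sizeof(std::size_t) == 8, "expecting a 64-bit build");
    static_assert(CHAR_BIT == 8, "bytes must be 8 bits");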
De-Slugging Kernels
Has your code just slowed down, and you don’t know why? Well, the first thing to do is run one of the various CUDA profiling tools.
Some ideas for slugs in your code include:
- You left all of those self-testing code blocks in the source when you were trying to fix a bug!
- Logic around cudaSetDevice is broken, and the code is now reduced to only running on one GPU.
- Launching too few blocks, so each thread is doing a lot of work.
- The “-O” or “--dopt” optimization flag was removed or changed in the Makefile.
- Too much synchronization with cudaDeviceSynchronize or __syncthreads or whatever.
- Environment variable CUDA_LAUNCH_BLOCKING is enabled.
- Addition of conditional control flow paths caused serious branch divergence in threads.
- Your grid-stride loop has “i++” instead of “i+=stride” and every thread is computing every element (endless redundant computation); a corrected loop is sketched after this list.
- The build process lost the “-DNDEBUG” flag and all your assertions are live again.
- You’re running on a non-NVIDIA GPU for some strange reason.
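For the grid-stride slug mentioned above, here is what the corrected loop looks like in a minimal sketch (the add_one kernel is invented for illustration):

    #include <cuda_runtime.h>

    // Correct grid-stride loop: each thread starts at its own global index
    // and advances by the total number of threads in the grid.
    __global__ void add_one(float* data, int n) {
        int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
            data[i] += 1.0f;   // "i += stride", not "i++"
        }
    }

    int main() {
        const int n = 1 << 20;
        float* d_data = nullptr;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));
        add_one<<<256, 256>>>(d_data, n);   // many blocks; no redundant work
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }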
Some general areas of sluggish execution include:
- Non-coalesced memory access patterns are slower.
- Thread divergence (warp or branch divergence).
- Implied synchronization in various CUDA Runtime APIs (on host).
- Non-aligned memory accesses are slower (aim for 128-byte alignment).
- Shared memory contention (“bank conflicts”).
- Nested kernels can balloon runtime cost.
- cudaMemcpy with non-pinned host memory (causes paging); a pinned-memory sketch follows this list.
- Register spills (and “register pressure”).
- Instruction locality issues (instruction cache misses).
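On the pinned-memory point, here is a minimal sketch of allocating page-locked host memory with cudaMallocHost, so that cudaMemcpy avoids the extra cost of pageable host buffers (the sizes and names are arbitrary):

    #include <cuda_runtime.h>

    int main() {
        const int n = 1 << 20;
        float* h_pinned = nullptr;
        float* d_data = nullptr;

        cudaMallocHost((void**)&h_pinned, n * sizeof(float));  // pinned (page-locked) host memory
        cudaMalloc((void**)&d_data, n * sizeof(float));

        for (int i = 0; i < n; ++i) h_pinned[i] = 0.0f;

        // Copies from pinned memory are faster, and can overlap with compute
        // when used with cudaMemcpyAsync and streams.
        cudaMemcpy(d_data, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);

        cudaFree(d_data);
        cudaFreeHost(h_pinned);
        return 0;
    }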
That’s enough for here, but CUDA C++ optimization can be the next book.