Aussie AI
Chapter 10. GPU Kernel Debugging
Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
by David Spuler
Kernel Debugging Techniques
The kernels running on the GPU are the most important C++ code you’ll ever write, but also the most difficult to debug. When you’re focused on getting the most speed out of the silicon, it’s far too easy to introduce an insidious bug.
We’ve already examined some of the main techniques that look at the kernel from the “outside” in the host code:
- Error checking CUDA Runtime API calls in the host code.
- Managing sticky errors to the extent possible.
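As a reminder of the first point, here is a minimal sketch of the kind of error-checking wrapper meant here; the CUDA_CHECK macro name and its exit-on-failure policy are my own choices for illustration, not a fixed convention:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Hypothetical wrapper macro: check the result of any CUDA Runtime call.
    #define CUDA_CHECK(call) \
        do { \
            cudaError_t err_ = (call); \
            if (err_ != cudaSuccess) { \
                fprintf(stderr, "CUDA error %s at %s:%d: %s\n", \
                        cudaGetErrorName(err_), __FILE__, __LINE__, \
                        cudaGetErrorString(err_)); \
                exit(EXIT_FAILURE); \
            } \
        } while (0)

    int main() {
        float* d_data = nullptr;
        CUDA_CHECK(cudaMalloc((void**)&d_data, 1024 * sizeof(float)));  // checked call
        CUDA_CHECK(cudaFree(d_data));
        return 0;
    }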
Tools should also be near the top of the list. Some of the NVIDIA debugging tools are amazing at resolving device issues, and should be among your first thoughts:
- cuda-gdb for interactive debugging.
- compute-sanitizer and all four of its sub-tools.
There are also two important built-in methods that work inside the kernel C++ code:
- printf (but not fprintf)
- assert
Technically, these are part of the CUDA Runtime and are defined without an explicitly included header file, but sometimes you may need to include <stdio.h> (or <iostream>) or <assert.h>.
Do not underestimate what you can achieve with just these two methods! The main strategies are:
- Add more tracing with printf and use cudaDeviceSynchronize to ensure you see the output messages (but beware the dreaded kernel buffer overflow!).
- Add lots of assert calls peppered throughout (afterwards, you can remove them or leave them, at your discretion).
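Here is a minimal sketch of both strategies in one toy kernel (the kernel and variable names are invented for illustration); note the cudaDeviceSynchronize call on the host side, which flushes the device printf buffer and surfaces any failed assert:

    #include <cstdio>
    #include <cassert>
    #include <cuda_runtime.h>

    // Toy kernel with device-side printf tracing and an assertion.
    __global__ void traced_kernel(const float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        assert(data != nullptr);   // device-side assert
        if (i == 0 && n > 0) {
            printf("thread 0 sees data[0] = %f\n", data[0]);  // device-side printf
        }
    }

    int main() {
        const int n = 256;
        float* d_data = nullptr;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));
        traced_kernel<<<1, n>>>(d_data, n);
        cudaDeviceSynchronize();   // flush printf output; report any assert failure
        cudaFree(d_data);
        return 0;
    }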
Some additional techniques can be helpful in finding the cause of a failure:
- Serializing kernel launches (e.g., set the environment variable CUDA_LAUNCH_BLOCKING to 1, use the command “set cuda launch_blocking on” inside cuda-gdb, or do it manually by adding temporary calls to cudaDeviceSynchronize after kernel launches).
- Launching a single thread, or a single warp, if your kernel uses grid-stride loops or equivalent (a sketch follows this list).
- Adding cudaDeviceSynchronize calls to serialize, and also to flush the kernel output buffers (printf and assert).
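As a rough sketch of the single-thread idea, a grid-stride kernel still computes the full result when launched with one block of one thread, which makes stepping through it in cuda-gdb far less confusing (the kernel and variable names here are invented for the example):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Grid-stride kernel: produces correct results with any launch geometry,
    // so it can be debugged with a single thread.
    __global__ void scale_kernel(float* data, int n, float factor) {
        int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
            data[i] *= factor;
        }
    }

    int main() {
        const int n = 1000;
        float* d_data = nullptr;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        scale_kernel<<<1, 1>>>(d_data, n, 2.0f);   // debugging launch: one block, one thread

        // Manual serialization: wait for the kernel and surface any error.
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            printf("Kernel failed: %s\n", cudaGetErrorString(err));
        }
        cudaFree(d_data);
        return 0;
    }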
Triggering Bugs Earlier
A lot of kernel bugs can be found using the techniques already mentioned, but these approaches can be limited in some less common situations:
- Intermittent bugs — hard-to-reproduce failures.
- Silent bugs — how would you even know?
You can’t really find a bug with cuda-gdb or
the compute-sanitizer memory checker if you can’t reproduce the failure.
On the other hand, an intermittent failure might be a race condition or other synchronization error,
so you probably should run racecheck and synccheck.
Silent bugs are even worse, because you don’t know they exist. I mean, they’re not really a problem, because nobody’s logged a ticket to fix them, but you just know it’ll happen in production at the biggest customer site in the middle of Shark Week.
How do you shake out more bugs? Here are some thoughts:
- Set the CUDA_ENABLE_COREDUMP_ON_EXCEPTION environment variable (because otherwise the program won’t dump core on various GPU errors, and can quietly continue).
- Add more assertions on arithmetic operations in device code (e.g., more tests for floating-point NaN or negative zero).
- Auto-wrap CUDA API calls in host code to ensure error checking for all calls.
- Add fast self-checks for simple kernel launch mistakes (e.g., nthreads%32==0 and nthreads<=1024).
- Check for arithmetic overflow or underflow, a very silent bug for both integers and floating-point (e.g., check unsigned integers aren’t so high they’d be negative if converted to int).
- Be aware that index safety tests in kernels actually hide bugs (use printf messages or assertions instead, assuming you’re managing sizes to avoid extra wasted threads).
- Build a unit-testing kernel to check __device__ utility functions brick-by-brick (a minimal sketch follows this list).
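For that last point, here is a minimal sketch of what a brick-by-brick unit-testing kernel might look like; the safe_relu helper is a made-up example of a __device__ utility function, and the test also shows an arithmetic assertion against NaN:

    #include <cassert>
    #include <cuda_runtime.h>

    // Hypothetical device utility function to be tested in isolation.
    __device__ float safe_relu(float x) {
        return x > 0.0f ? x : 0.0f;
    }

    // Unit-testing kernel: exercises the utility function with asserts,
    // including a check that the result is never NaN.
    __global__ void test_safe_relu() {
        assert(safe_relu(1.5f) == 1.5f);
        assert(safe_relu(-2.0f) == 0.0f);
        assert(!isnan(safe_relu(0.0f)));
    }

    int main() {
        test_safe_relu<<<1, 1>>>();
        cudaDeviceSynchronize();   // a failed device assert is reported here
        return 0;
    }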
There are also some changes to the host code that can help detect some types of kernel bugs:
- Add self-testing code with more complex sanity checks for kernel launches.
- Consider debug wrapper functions with extra self-testing.
- Add more function parameter validation (a debug wrapper combining these checks is sketched below).
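A minimal sketch of such a debug wrapper is shown below; the wrapper name, the kernel, and the specific checks (full warps, at most 1024 threads per block) are illustrative choices, not a standard API:

    #include <cassert>
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = 0.0f;
    }

    // Hypothetical debug wrapper: validates parameters and launch geometry
    // before launching, then checks for launch errors afterwards.
    void launch_my_kernel_debug(float* d_data, int n, int blocks, int threads) {
        assert(d_data != nullptr);              // parameter validation
        assert(n > 0);
        assert(threads > 0 && threads <= 1024); // max threads per block
        assert(threads % 32 == 0);              // full warps only
        my_kernel<<<blocks, threads>>>(d_data, n);
        cudaError_t err = cudaGetLastError();   // catch synchronous launch failures
        if (err != cudaSuccess) {
            fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));
        }
    }

    int main() {
        const int n = 1024;
        float* d_data = nullptr;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        launch_my_kernel_debug(d_data, n, 4, 256);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }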
With all of these things, any extra runtime testing code requires a shipping policy choice: remove it for production, leave it in for production, only leave it in for beta customers, leave in only the fast checks, and so on.
If you’re still struggling with an unsolvable bug, here are a few “hail Mary” passes into the endzone:
- Add a call to cudaGetLastError or cudaPeekAtLastError immediately after kernel launches, before any call to cudaDeviceSynchronize or other implicit synchronizations (otherwise, synchronous kernel launch failures, such as requesting more than 1024 threads per block or too much shared memory, may be silent; admittedly, you should have noticed in a debugger that the kernel isn’t running).
- Run valgrind on CUDA C++ executables; it’s probably not any better than compute-sanitizer, but there may be a few rare things it can find.
- Review the latest code changes; it’s often just a basic mistake hidden by “code blindness” (e.g., check your “.x” and “.y” attributes).
- Watch for mixed-up indices of square matrices, a silent, nasty bug in your algorithm that’s hard to detect with most debugging approaches.
- Add a lot of calls to synchronization primitives like __syncthreads or __syncwarp (this may help prove it’s a synchronization error, but probably won’t help you find it).
- Add a cudaMemset call after every cudaMalloc (and variants) to see if initializing the memory fixes it (admittedly, tools should find this anyway); a combined sketch of this and the cudaGetLastError check follows this list.
- Similarly, try memset after malloc or new, or change to calloc (note that there’s no cudaCalloc!).
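As a rough sketch of those two points combined, the snippet below zeroes device memory right after allocation and checks for a silent synchronous launch failure (the oversized block is deliberate, and the kernel is invented for the example):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void zero_kernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = 0.0f;
    }

    int main() {
        const int n = 4096;
        float* d_data = nullptr;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));  // temporary: rule out uninitialized memory

        // Deliberately bad launch: 2048 threads per block exceeds the 1024 limit.
        zero_kernel<<<1, 2048>>>(d_data, n);

        // Check immediately, before any synchronization, or the failure stays silent.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
        }
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }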
Some other practical housekeeping tips can help with detecting new bugs as soon as they enter the source code, and with planning ahead for future failures:
- Examine compiler warnings and have a “warning-free build” policy.
- Have a separate “make lint” build path with lots more warnings enabled.
- Keep track of random number generator seeds for reproducibility.
- Add some portability sanity checks, such as: static_assert(sizeof(float)==4);
I guarantee that last one will save your bacon one day!
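Here are a few more compile-time checks in the same spirit (a sketch only; the specific assumptions, such as a 64-bit build, should be adjusted to whatever your kernels actually rely on):

    #include <climits>
    #include <cstddef>

    // Portability sanity checks: fail at compile time if the build
    // environment differs from what the kernel arithmetic assumes.
    static_assert(sizeof(float) == 4, "float must be 32 bits");
    static_assert(sizeof(int) == 4, "int must be 32 bits");
    static_assert(sizeof(std::size_t) == 8, "expecting a 64-bit build");
    static_assert(CHAR_BIT == 8, "bytes must be 8 bits");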
De-Slugging Kernels
Has your code just slowed down, and you don’t know why? Well, the first thing to do is run one of the various CUDA profiling tools.
Some ideas for slugs in your code include:
- You left all of those self-testing code blocks in the source when you were trying to fix a bug!
- Logic around cudaSetDevice is broken, and the code is now reduced to only running on one GPU.
- Launching too few blocks, so each thread is doing a lot of work.
- The “-O” or “--dopt” optimization flag was removed or changed in the Makefile.
- Too much synchronization with cudaDeviceSynchronize or __syncthreads or whatever.
- Environment variable CUDA_LAUNCH_BLOCKING is enabled.
- Addition of conditional control flow paths caused serious branch divergence in threads.
- Your grid-stride loop has “i++” instead of “i+=stride” and every thread is computing every element (endless redundant computation); a corrected loop is sketched after this list.
- The build process lost the “-DNDEBUG” flag and all your assertions are live again.
- You’re running on a non-NVIDIA GPU for some strange reason.
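For the grid-stride slug mentioned above, here is what the corrected loop looks like in a minimal sketch (the add_one kernel is invented for illustration):

    #include <cuda_runtime.h>

    // Correct grid-stride loop: each thread starts at its own global index
    // and advances by the total number of threads in the grid.
    __global__ void add_one(float* data, int n) {
        int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
            data[i] += 1.0f;   // "i += stride", not "i++"
        }
    }

    int main() {
        const int n = 1 << 20;
        float* d_data = nullptr;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));
        add_one<<<256, 256>>>(d_data, n);   // many blocks; no redundant work
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }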
Some general areas of sluggish execution include:
- Non-coalesced memory access patterns are slower.
- Thread divergence (warp or branch divergence).
- Implied synchronization in various CUDA Runtime APIs (on host).
- Non-aligned memory accesses are slower (aim for 128-byte alignment).
- Shared memory contention (“bank conflicts”).
- Nested kernels can balloon runtime cost.
- cudaMemcpy with non-pinned host memory (causes paging); a pinned-memory sketch follows this list.
- Register spills (and “register pressure”).
- Instruction locality issues (instruction cache misses).
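On the pinned-memory point, here is a minimal sketch of allocating page-locked host memory with cudaMallocHost, so that cudaMemcpy avoids the extra cost of pageable host buffers (the sizes and names are arbitrary):

    #include <cuda_runtime.h>

    int main() {
        const int n = 1 << 20;
        float* h_pinned = nullptr;
        float* d_data = nullptr;

        cudaMallocHost((void**)&h_pinned, n * sizeof(float));  // pinned (page-locked) host memory
        cudaMalloc((void**)&d_data, n * sizeof(float));

        for (int i = 0; i < n; ++i) h_pinned[i] = 0.0f;

        // Copies from pinned memory are faster, and can overlap with compute
        // when used with cudaMemcpyAsync and streams.
        cudaMemcpy(d_data, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);

        cudaFree(d_data);
        cudaFreeHost(h_pinned);
        return 0;
    }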
That’s enough for here, but CUDA C++ optimization can be the next book.