Aussie AI

Chapter 6. CUDA Debugging Strategies

  • Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
  • by David Spuler

General CUDA Debugging Techniques

A lot of the work in debugging CUDA programs is nothing special: it’s just C++ mistakes. Most coding errors are ordinary, boring mistakes that every C++ programmer is prone to. These can occur in the host code and the device code, although problems in the host code are more common.

On the other hand, if you get a bug in a CUDA C++ kernel, it’s usually a nasty one. And there are a variety of ways to go wrong in handling pointers and addresses in the kernel, from basic beginner mistakes to traps that can catch the experienced CUDA practitioner.

The best way to catch a bug is to try to make it happen early. We want the program to crash in the lab, not out in production. In this regard, some of the best practices are about auto-detecting the failures in your code, rather than waiting for them to actually cause a crash:

  • Check every CUDA API return code (even the harmless functions that can “never” fail).
  • Use macro wrappers to help handle errors (see the sketch after this list).
  • Add debug wrapper functions and enable them while testing.
  • Run compute-sanitizer on your code regularly.
  • Thrash the code in many ways in the nightly builds.
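
For example, a typical macro wrapper looks something like this (a minimal sketch; the macro name CUDA_CHECK and the exit-on-failure policy are illustrative choices, not part of the CUDA API):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call) \
        do { \
            cudaError_t err_ = (call); \
            if (err_ != cudaSuccess) { \
                fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err_)); \
                exit(EXIT_FAILURE); \
            } \
        } while (0)

    // Usage: wrap every CUDA API call, even the "harmless" ones.
    // CUDA_CHECK(cudaMalloc(&d_ptr, n * sizeof(float)));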

If you mess up, and a bug happens in the production backend of your AI training run, I suggest this: blame the data scientists. Surely, the problem was in the training data, not in my perfect CUDA C++ code. And if that doesn’t work, well, obviously the GPU was overheating.

Very Difficult Bugs. Some bugs are like roaches and keep coming out of the woodwork. General strategies for solving a tricky bug include:

  • Can you reproduce it? That’s the key.
  • Write a unit test that triggers it (if you can).
  • Try to cut down the input to the smallest case that triggers the fault.
  • Gather as much information about the context as possible (e.g., if it’s a user-reported error).

Your debugging approach should include:

  • Run compute-sanitizer to check for CUDA memory glitches.
  • Run the other compute-sanitizer tools (it has four modes: memcheck, racecheck, initcheck, and synccheck; see the commands after this list).
  • Think about what code you just changed recently (or was just committed to the repo by someone else!).
  • Memory-related failures often cause weird errors nowhere near the cause.
  • Review the debug trace output carefully (i.e., it may be that some other part of the code failed much earlier).
  • Step through the code in cuda-gdb about ten more times.
  • Run a static analysis (“linter”) tool on the code.
  • Run an AI copilot debugger tool. I hear they’re terrific.
  • Refactor a large module into smaller functions that are more easily unit-tested (often you accidentally fix the bug!).
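
For the compute-sanitizer tools, the mode is selected with the --tool flag, with memcheck being the default (./myapp is a placeholder for your own executable):

    compute-sanitizer --tool memcheck ./myapp    # out-of-bounds and misaligned accesses
    compute-sanitizer --tool racecheck ./myapp   # shared memory data races
    compute-sanitizer --tool initcheck ./myapp   # reads of uninitialized device memory
    compute-sanitizer --tool synccheck ./myapp   # invalid synchronization usage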

If you really get stuck, you could try talking to another human (gasp!). Show your code to someone else and they’ll find the bug in three seconds.

Serializing Kernel Launches

Both beginner and advanced CUDA C++ programmers can make debugging easier via serialization of thread execution. Some basic strategies are:

  • Serialize the kernel launches.
  • Launch only one thread (i.e., kernel<<<1,1>>>).

Serialized kernel launches. The advantage of a serialized kernel launch is that only one kernel is running at a time. This doesn’t mean that the threads are running sequentially within that one kernel, but at least you don’t have two things happening on the GPU at once. This is a great help in localizing the cause of any CUDA Runtime errors, which can come in asynchronously from any active kernel.

Serializing kernel launches is possible in several ways. One simple way is to manually add a call to cudaDeviceSynchronize immediately after every kernel launch. Since beginner programmers often already do this in their code, it’s not going to add much benefit to a debugging session in the learning lab.
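
In code, the debug pattern is simply this (a sketch; myKernel, its arguments, and the launch configuration are placeholders):

    myKernel<<<numBlocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();  // Debug only: wait for this kernel to finish
                              // before the host code does anything else.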

CUDA kernel launches are usually asynchronous, but you can make every launch synchronous, or “blocking,” with a single setting: set the environment variable “CUDA_LAUNCH_BLOCKING” to 1.
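
For example, in a Bourne-style shell, before running your program (./myapp is a placeholder):

    export CUDA_LAUNCH_BLOCKING=1
    ./myapp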

cuda-gdb serialized kernels. There are additional options when debugging your code in the cuda-gdb symbolic debugger, via flags you can set within an interactive debugging session. To serialize all kernel launches so they are “blocking,” enable the “launch_blocking” setting:

    set cuda launch_blocking on

Another useful cuda-gdb option is to set an auto-breakpoint on every kernel launch in your program with the “break_on_launch” setting. The command is:

    set cuda break_on_launch application

Single-thread kernels. Launching only a single thread with kernel<<<1,1>>> is also not really for beginners. If your kernel is very simple, such as a basic vector addition with one “+” operation on a single element, then your program simply won’t work anymore: running only one thread means your kernel will only process one vector element.

The advice to launch a single thread is more relevant to advanced kernels that use a grid-stride loop. A kernel like that computes its stride from blockDim.x and gridDim.x, which will both now be 1, so its loop still iterates over every element of the vector, just in a single thread. But, again, using a grid-stride loop is not for beginners.
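
For reference, here is a vector-addition kernel with a grid-stride loop (a sketch): launched as <<<1,1>>>, the stride becomes 1 and the single thread walks the entire vector.

    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int stride = blockDim.x * gridDim.x;  // stride is 1 for a <<<1,1>>> launch
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            c[i] = a[i] + b[i];  // one thread can still cover every element
    }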

Localizing the Error

One of the basic techniques in debugging for large and complex CUDA programs involves localizing the error. The problem arises because of these factors:

  • CUDA kernel launches are asynchronous, so the host code keeps running.
  • The CUDA Runtime API does not report GPU kernel errors immediately.
  • Multiple kernels may be running in parallel.

The result of asynchronous kernel launching is a weird sequence, whereby a GPU error report can come back to the host code at any time. This assumes that the CPU kept going after launching the kernel, rather than blocking on a cudaDeviceSynchronize call.

Here’s an example sequence of events:

  • Host code launches a kernel (triple chevrons and all that).
  • The CPU code keeps moving ahead (because the launch is non-blocking).
  • The device code for that kernel starts running on the GPU.
  • Stuff happens (on both CPU and GPU).
  • An error occurs in the GPU kernel (for some reason).
  • The GPU cannot interrupt the host code on the CPU.
  • Instead, the GPU buffers the error code.
  • The next call to the CUDA Runtime library on the host will return this error code.

Hence, there are some weird problems:

  • The error code might cause a failure in some CUDA APIs that you think should never fail (e.g., cudaSetDevice, cudaGetDeviceProperties, or whatever).
  • The error code might appear to be from setting up a new kernel (e.g., cudaMemcpy fails), but in fact, it’s an error from the prior kernel launch.
  • The error code might occur after a second kernel is launched, and you might think it’s from the second kernel, when it’s actually from the first kernel.
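
A way to avoid this mis-attribution while debugging is to check for errors at both points around each launch (a sketch; kernelA and its arguments are placeholders):

    kernelA<<<numBlocks, threadsPerBlock>>>(d_data, n);
    cudaError_t err = cudaGetLastError();  // catches launch-configuration errors
    if (err != cudaSuccess)
        fprintf(stderr, "kernelA launch failed: %s\n", cudaGetErrorString(err));
    err = cudaDeviceSynchronize();  // blocks, so errors from inside kernelA
    if (err != cudaSuccess)         // cannot leak into later API calls
        fprintf(stderr, "kernelA execution failed: %s\n", cudaGetErrorString(err));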

Random Number Seeds

Neural network code often uses random numbers to improve accuracy via a stochastic algorithm. For example, top-k decoding uses randomness for creativity and to prevent the repetitive looping that can occur with greedy decoding. And you might use randomness to generate input tests when you’re trying to thrash the model with random prompt strings.

But that’s not good for debugging! We don’t want randomness when we’re trying to reproduce a bug!

Hence, we want it to be random for users, but not when we’re debugging. Random numbers need a “seed” to get started, so we can just save and re-use the seed for a debugging session. This idea can be applied to the old-style rand/srand functions or to the newer <random> library engines like std::mt19937 (short for “Mersenne Twister”).

Seeding the random number generator in old-style C++ is done via the “srand” function. The longstanding way to initialize the random number generator, so it’s truly random, is to use the current time:

    srand(time(NULL));

Note that seeding with a guessable value is a security risk. Hence, it’s safer to use some additional arithmetic on the time return value.
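
For example, on POSIX systems you might mix in the process ID, so that two runs started in the same second still get different seeds (a sketch; note that rand is not cryptographically secure regardless of how it is seeded):

    // Requires <cstdlib>, <ctime>, and <unistd.h> (POSIX getpid).
    srand((unsigned)(time(NULL) ^ ((unsigned)getpid() << 8)));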

After seeding, the “rand” function can be used to get a hard-to-predict sequence of pseudo-random numbers. The generator is simple and efficient. A generalized plan is to have a debugging or regression testing mode where the seed is fixed:

    if (g_aussie_debug_srand_seed != 0) {
        // Debugging mode
        srand(g_aussie_debug_srand_seed);   // Non-random randomness!
    }
    else {  // Normal run
        srand(time(NULL));
    }

The test harness has to set the global debug variable “g_aussie_debug_srand_seed” whenever it’s needed for a regression test. For example, either it’s manually hard-coded into a testing function, or it could be set via a command-line argument to your test harness executable, so the program can be scripted to run with a known seed.

This helps, but if we have a bug in production, we won’t know the seed number. So, the better code also prints out (or logs) the seed number, in case you need it later to reproduce a bug that occurred live:

    if (g_aussie_debug_srand_seed != 0) {
        srand(g_aussie_debug_srand_seed);   // Debug mode
    }
    else {  // Normal run
        long int iseed = (long)time(NULL);
        fprintf(stderr, "INFO: Random number seed: %ld 0x%lx\n", iseed, iseed);
        srand(iseed);
    }

An extension would be to also print out the seed in error context information on assertion failures, self-test errors, or other internal errors.

There’s one practical problem with this for reproducibility: what if the bug occurs after a thousand queries? If there have been innumerable calls to our random number generator, there’s not really a way to reproduce the current situation. One simple fix is to instantiate a new random number generator for every query, which really isn’t very expensive.
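
Here’s a sketch of that per-query idea using std::mt19937 (the function name and logging format are illustrative; the debug-seed global is re-used from the code above):

    #include <cstdio>
    #include <ctime>
    #include <random>

    extern long g_aussie_debug_srand_seed;  // from the debug-mode code above

    void process_query()  // hypothetical per-query entry point
    {
        long seed = (g_aussie_debug_srand_seed != 0)
            ? g_aussie_debug_srand_seed : (long)time(NULL);
        fprintf(stderr, "INFO: Query RNG seed: %ld\n", seed);  // log for replay
        std::mt19937 rng((std::mt19937::result_type)seed);  // fresh generator per query
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        float r = dist(rng);  // draw all randomness for this query from rng
        (void)r;  // ... use r (and rng) in the stochastic parts of the query ...
    }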

Making the Correction

An important part of the debugging phase that is often neglected is actually making the correction. You’ve found the cause of the failure, but how do you fix it? It is imperative that you actually understand what caused the error before fixing it; don’t be satisfied when a correction works and you don’t know why.

Here are some thoughts on the best practices for the “fixing” part of debugging:

  • Test it one last time.
  • Add a unit test or regression test.
  • Re-run the entire unit test or regression test suite.
  • Update status logs, bug databases, change logs, etc.
  • Update documentation (if applicable).

Another common pitfall is to make the correction and then not test whether it actually fixed the problem. Furthermore, making a correction will often uncover (or introduce!) another new bug. Hence, not only should you test for this bug, but it’s a very good idea to use extensive regression tests after making an apparently successful correction.

Level Up Your Post-Debugging Routine. Assuming you can fix it, think about the next level of professionalism to avoid having a repetition of similar problems. Consider doing followups such as:

  • Add a unit test or regression test to re-check that problematic input every build.
  • Write it up and close the incident in the bug tracking database like a Goody Two-Shoes.
  • Add safety input validation tests so that a similar failure is tolerated (and logged).
  • Add a self-check in a C++ debug wrapper function to check for it next time at runtime.
  • Is there a tool that would have found it? Or even a grep script? Can you run it automatically? Every build?

As with all applications, there’s another level needed to get the code out the door into production. Some of the issues for fully production-ready CUDA C++ code include:

  • Validate function parameters (don’t trust the caller or the user).
  • Check return codes of all CUDA primitives.
  • Handle memory allocation failure (e.g., graceful shutdown).
  • Kernels should correctly scale for large values (e.g., vector dimensions).
  • Choose block/thread sizes for best occupancy.
  • Don’t exceed GPU device specifications.
  • Add unique error message codes for supportability.

Let’s not forget that maybe a little testing is required. High-quality coding requires all manner of joyous programmer tasks: write unit tests, compile warning-free, run static analysis checks, add assertions and debug tracing, run compute-sanitizer, write useful commit summaries (rather than “I forget”), don’t cuss in the bug tracking record, update the doc, comment your code, and be good to your mother.

 
