Aussie AI
Chapter 7. CUDA Debugging Tools
Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
by David Spuler
CUDA Tools Overview
Compiler and IDE tools for programming CUDA include:
- NVIDIA C++ Compiler (NVCC) — the nvcc command-line compiler.
- Nsight Eclipse Edition — integration with the Eclipse IDE.
- Nsight Visual Studio Edition (VSE) — CUDA’s Microsoft Visual Studio integration.
- Nsight Visual Studio Code Edition (VSCE) — Visual Studio Code integration.
Debugging tools include:
- cuda-gdb — command-line debugging on Linux (very similar to gdb).
- Compute Sanitizer — command-line debugging tool with four sub-tools: memcheck for memory debugging, racecheck for race conditions, synccheck for synchronization checking, initcheck for initialization checks.
- cuda-memcheck — discontinued tool, replaced by compute-sanitizer and its default memcheck tool.
Optimization and performance profiling tools include:
- NVIDIA Visual Profiler — performance profiling with a GUI interface.
- Nsight Systems — system profiling and tracing.
- Nsight Compute — performance profiling for CUDA kernels.
- Nsight Graphics — specialized profiling for graphics applications.
- Nsight Deep Learning Designer — profiler focused on AI/ML applications.
- nvprof — command-line profiler (now deprecated).
In-code debugging libraries and CUDA C++ code-related tools that you might need include:
- Debug wrapper library for CUDA C++
- Emulation test library
- Linter for CUDA C++
These are active projects for us at Aussie AI (see https://www.aussieai.com/cuda/projects).
There are also some advanced APIs and SDKs available from NVIDIA if you want to get ambitious and do some very deep integrations into the CUDA tools:
- Compute Sanitizer API — create a new “tool” for compute-sanitizer.
- CUDA Debugger API — cuda-gdb API integration on Linux.
- NVIDIA Tools Extension SDK (NVTX) — tool integration API.
- CUDA Profiling Tools Interface (CUPTI) — profiling and tracing integration API.
- Nsight Aftermath SDK — postmortem crash debugging.
- Nsight Perf SDK — performance profiling for graphics applications.
- Nsight Tools JupyterLab Extension — extension for profiling of Python applications.
The remainder of this chapter focuses on the debugging tools and their capabilities.
Command-Line Debugging Tools
The main command-line debugging tools to use are:
- cuda-gdb (interactive debugger on Linux or Windows WSL)
- compute-sanitizer (including sub-tools: memcheck, initcheck, racecheck, synccheck)
Profiling tools you can use on the command-line include:
- ncu (Nsight Compute CLI)
- nvprof — useful, but it’s deprecated, and will be riding off into the sunset.
- gprof — the standard Linux profiler is useful for host code.
Compute Sanitizer
Compute Sanitizer is an NVIDIA tool with four sub-tools that detect different classes of problems. It also supports multiple platforms: Linux and Windows.
The most frequent usage of the compute-sanitizer command-line tool is likely to be memory debugging.
If you have an error in a CUDA C++ application on Linux, and you can reproduce it,
then just re-run the application with the Compute Sanitizer tool:
compute-sanitizer a.out
This is equivalent to:
compute-sanitizer --tool memcheck a.out
You can also supply command-line arguments to your application after the executable name, or additional options to Compute Sanitizer before the executable name.
The default tool for compute-sanitizer is “memcheck” for memory fault detection,
and you don’t need a specific option to run it.
This mode is very similar in usage to valgrind on Linux,
which does similar memory checking functions on standard CPU C++ applications,
but compute-sanitizer knows more about GPU memory problems.
There also used to be a cuda-memcheck tool, which compute-sanitizer has now superseded.
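For illustration, here is a minimal sketch (my own hypothetical code, not from the toolkit) of the classic off-by-one kernel bug that memcheck detects:

__global__ void clear_vector(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    v[i] = 0.0f;  // Bug: no bounds check, so surplus threads write past the end
}

// Launched with more threads than elements, e.g.:
//    clear_vector<<<4, 256>>>(d_v, 1000);  // threads 1000..1023 write out of bounds

Running this under compute-sanitizer should produce “Invalid __global__ write” reports for the out-of-bounds threads.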
Compute Sanitizer does not require any re-compilation, and can run on your program
just after it has crashed.
However, its reports can be clearer to read
if there is debug symbol information available in the
executable, such as compilation with “-g” (host) or “-G” (device)
debug information options to nvcc.
Hence, it may be advisable to maintain these debug-enabled versions
of executables in your build systems,
as they are useful for both cuda-gdb and compute-sanitizer.
The errors reported by the memory checking tool include:
- Memory access problems (including device-side)
- Errors with malloc and free
- Double free
- Memory leaks (especially with the “--leak-check full” option)
However, it is not limited to memory errors, and also finds:
- CUDA runtime errors
- Hardware exceptions
There are various additional options that you can turn on for additional error checking.
Abnormal program termination
One of the things about compute-sanitizer that can be tricky is that it doesn’t
fully detect the cause of actual crashes of your application.
Instead, sometimes you only get a report like this:
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors
The last line is misleading, as there weren’t zero errors,
but the second-last line is more useful: “Target application returned an error”.
This often means that your program crashed in the host code.
cuda-gdb batch mode.
You can detect this host program crash better in cuda-gdb, as it will trap the signals, so just run an interactive debugging session.
Alternatively, if you have a simple reproducible case,
you can automate this with batch mode,
where the command to run is like this:
cuda-gdb --batch --command=cuda-gdb-test.txt a.out
The batch input file is a set of cuda-gdb commands:
run
where
exit
Here’s an example output (abridged):
Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
0x00007ffff7cdfa4e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#0 0x00007ffff7cdfa4e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x000055555555fdb5 in aussie_cudaMalloc(void**, int) ()
#2 0x0000555555562ea9 in aussie_run_clear_vector_kernel_generic(int) ()
#3 0x00005555555633aa in main ()
A debugging session is active.
Inferior 1 [process 5143] will be killed.
Quit anyway? (y or n) [answered Y; input not from terminal]
There are various other useful things that can be automated using batch cuda-gdb and its script commands.
For example, you can use it as a trace mechanism that prints out the stack trace
at every call to a certain function.
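For example, a batch command file like the following (a sketch using standard gdb breakpoint command lists, here tracing the aussie_cudaMalloc function from the stack trace above) prints a stack trace at every call:

break aussie_cudaMalloc
commands
silent
where
continue
end
run
exit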
racecheck
The racecheck tool is a sub-tool of Compute Sanitizer for detecting “race conditions” or “data races.”
The command to run the tool is:
compute-sanitizer --tool racecheck a.out
This tool detects problems in thread accesses to shared memory, but won’t help with any race conditions involving global memory accesses. It works by detecting “hazards,” which means conditions that indicate the potential for a race condition occurring. This is more effective than trying to detect actual race conditions, as they are transient and often non-reproducible. The types of hazards found include:
- Write-write
- Write-read
- Read-write
Obviously, the accesses must be to the same memory location for a hazard to exist. Also, note that a “read-read” sequence is never going to be a hazard for a race condition, but is just normal parallelism!
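For example, here is a hypothetical sketch of a shared memory write-write hazard of the kind racecheck reports:

__global__ void block_counter(int *out)
{
    __shared__ int total;
    total = 0;              // Write-write hazard: every thread in the block writes here
    __syncthreads();
    atomicAdd(&total, 1);   // Atomic accesses are safe
    __syncthreads();
    if (threadIdx.x == 0) *out = total;
}

The usual fix is to guard the initializing write with “if (threadIdx.x == 0)” so that only one thread writes before the barrier.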
synccheck
The synccheck tool is a sub-tool of Compute Sanitizer that detects synchronization issues.
It looks for “hazards” that indicate thread synchronization problems in GPU kernels.
Here is the execution command:
compute-sanitizer --tool synccheck a.out
This tool focuses on synchronization issues in device code, such as with the APIs:
- __syncthreads()
- __syncwarp()
Some of the errors found by synccheck include:
- Invalid arguments
- Thread divergence (warp-level or block-level)
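For example, a barrier inside a divergent branch is the classic block-level divergence problem; here is a hypothetical sketch:

__global__ void divergent_sync(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {          // Divergent when n is not a multiple of the block size
        data[i] *= 2;
        __syncthreads();  // Bug: barrier is not reached by all threads in the block
    }
}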
initcheck
The initcheck tool is a sub-tool of Compute Sanitizer
for detecting initialization issues
in device accesses to device global memory.
Execution is performed by:
compute-sanitizer --tool initcheck a.out
Note that this only examines global memory,
so it won’t find other types of uninitialized memory accesses in kernels.
The default memcheck tool in compute-sanitizer can find other similar problems.
The main issues found by initcheck are:
- Accesses to uninitialized device global memory.
- Unused device global memory (never accessed).
Fixing an uninitialized memory access issue found by initcheck usually means initializing your device memory properly, such as with:
- Device-side array initialization code
- cudaMemset
- cudaMemcpy
If initcheck reports on unused device global memory, this may indicate some sort
of algorithm error, whereby some of the global memory is not being used.
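For example, this host-side sketch (hypothetical code with error checking omitted; read_vector_kernel is an assumed kernel name) allocates device global memory but never initializes it before a kernel reads it:

float *d_v = nullptr;
cudaMalloc(&d_v, 1000 * sizeof(float));      // Allocated, but contents uninitialized
// Missing: cudaMemset(d_v, 0, 1000 * sizeof(float));
read_vector_kernel<<<4, 256>>>(d_v, 1000);   // Kernel reads uninitialized global memory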
cuda-gdb
I’m a big fan of gdb for debugging standard C++ on Linux, and cuda-gdb is even better.
The platform support for cuda-gdb is primarily Linux,
but also Windows WSL2, and macOS in a “host only” mode for remote debugging.
The cuda-gdb tool is a source-code modification of the open source gdb code
to add NVIDIA GPU support,
and you can actually find the cuda-gdb source code on GitHub at https://github.com/NVIDIA/cuda-gdb/.
Hence, cuda-gdb has most of the gdb features, and NVIDIA tries
to keep up with new features.
The basic commands from gdb are all supported:
- r or run — run the code (with optional arguments), or restart if already running.
- c or continue — continue running (after stopping at a breakpoint).
- s or step — step through statements (also just Enter to repeat).
- where — stack trace (also aliased to “bt” for backtrace).
- list — source code listing.
- p or print — print a variable or expression.
- up — move up the call stack to an outer frame.
- n or next — step to the next statement, stepping over function calls.
Some of the CUDA-specific commands inside cuda-gdb include:
info cuda
To see the list, run the help command:
help info cuda
Examples of some of the many CUDA debugging sub-commands include:
info cuda devices
info cuda sms
info cuda warps
info cuda lanes
info cuda blocks
info cuda threads
Pre-Breakpointing Trick
One advanced tip
for using cuda-gdb
is to define a function called “breakpoint” in your C++ application.
Here’s an example:
void breakpoint()
{
volatile int x = 0;
x = 0; // Set breakpoint here
}
It looks like a silly function, but it serves one useful purpose.
The idea is that
when you start an interactive debugging
session in cuda-gdb, or automatically in your “.cuda-gdbinit” resource file, you can set a breakpoint there:
b breakpoint
Why do that?
The reason is that you also add calls to your “breakpoint” function at relevant points in
various places where failures can occur:
- CUDA error check macros
- Assertion macros
- Debug wrapper function failure detection
- Unit test failures
Hence, if any of those bad things happen while you’re running interactively in the debugger,
you’re immediately stopped at exactly that point.
If you’re not running in the debugger, this is a very fast function (though admittedly, it
can’t be inline!), so it doesn’t slow things down much.
You can even consider leaving this in production code, since the breakpoint function
is only called in rare situations
where a serious failure has already occurred, in which case execution speed is not a priority.
This technique is particularly useful because you don’t have to go back and figure out how to reproduce the failure, which can be difficult for some types of intermittent failures from race conditions or other synchronization problems. Instead, it’s already been pre-breakpointed for you, with the cursor blinking at you, politely asking you to debug it right now, or maybe after lunch.
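As a concrete sketch, here is one way to wire the breakpoint function into a CUDA error check macro (the macro name and message format are my own, not from the book’s codebase):

#include <cstdio>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                cudaGetErrorString(err_), __FILE__, __LINE__); \
            breakpoint();  /* the pre-set breakpoint fires here in cuda-gdb */ \
        } \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_v, n * sizeof(float)));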
Postmortem Debugging
Postmortem debugging involves trying to debug a program crash,
such as a “core dump” on Linux.
In this situation, you should have a “core” file that you can load
into cuda-gdb.
The command to use is:
cuda-gdb a.out core
Unfortunately, not all errors in a CUDA application will trigger
a core dump, so you might have nothing to debug if it doesn’t.
One way to ensure that you get a core file is
to set the environment variable CUDA_ENABLE_COREDUMP_ON_EXCEPTION,
which will cause the CUDA Runtime to dump a core file on various additional failures.
Sometimes in large environments, it’s hard to know where a core file came from.
An advanced feature of this environment variable is that you can format
the filename to be more useful than just “core.”
You can specify the format to include things like the time/date
and the name of the executable.
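For example, a shell setup along these lines should work (the CUDA_COREDUMP_FILE variable and its “%h”/“%p”/“%t” specifiers are described in the cuda-gdb documentation; check the documentation for your CUDA version):

export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
export CUDA_COREDUMP_FILE="core_%h_%p_%t"   # hostname, process ID, timestamp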
Programmatic C++ core dumps. If you want your CUDA C++ to take control of its own core dumps (e.g., on exceptions, assertion failures, etc.), there are various options:
- You can always fork-and-abort on Linux.
- Maybe putenv("CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1") might work?
- Surely you can write some code to crash!
On the other hand, maybe you’re only thinking about core dumps because you want to save debug context information. Doing this might obviate the need for a core dump:
- Use std::stacktrace (the C++23 facility) or another backtrace library.
- Print error context information (e.g., the user’s query).
- Print platform details.
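For example, here is a minimal sketch using the C++23 standard facility (assuming a compiler and standard library with <stacktrace> support):

#include <iostream>
#include <stacktrace>   // C++23

void report_failure_context(const char *msg)
{
    std::cerr << "FAILURE: " << msg << "\n";
    std::cerr << std::stacktrace::current() << "\n";   // Print the current call stack
}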
Customer core dumps.
One of the supportability issues with postmortem debugging
is that you want your customers
to be able to submit a core file that they have triggered
on your CUDA-based application.
These are usually large files, so there are logistical issues to overcome with uploads.
Another issue is that in order to run cuda-gdb on a core file,
the developer needs
to have exactly the right executable that created the core dump.
Hence, your build and release management needs to maintain available
copies of all executable files in versions shipped to customers or in beta testing
(or to internal customers for in-house applications).
And there needs to be a command-line option whereby the phone support staff can
instruct customers to report the exact version and build number of the executable they are using.
It’s easy to lose track!
Valgrind for CUDA
Can you use the Linux Valgrind tool to detect memory errors in CUDA C++ programs? More specifically, this refers to the Memcheck tool that is part of Valgrind.
In short, you can run Valgrind on your host code,
but you’re probably better off using compute-sanitizer for CUDA C++ programs.
Valgrind will run the whole application, host and device code together, but it only instruments the host side, where it can be used to find errors.
The CUDA Toolkit used to include a tool called “cuda-memcheck” but it has since been
deprecated and removed in favor of compute-sanitizer.
Interestingly, there’s an old research paper [Baumann and Gracia, 2013]
on using Valgrind with CUDA, called the “CudaGrind” tool.
There’s even a Github repo for this tool, although it hasn’t been edited in 9 years,
so I’m not sure it’s still valid.
In any case, the basic Linux version of Valgrind Memcheck is still very well supported. The method to use Valgrind for Linux on a CUDA application is simply to run the executable:
valgrind a.out
If Valgrind is not installed in your Linux environment, you’ll need to do something like this:
apt install valgrind
The start of the Valgrind output is like this:
==1143== Memcheck, a memory error detector
==1143== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1143== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1143== Command: ./a.out
As it executes your program, the output from your program will be interleaved with error reports from Valgrind. Hopefully, there won’t be any!
The end of the Valgrind execution gives you a nice summary of memory leaks and errors.
==1143== HEAP SUMMARY:
==1143== in use at exit: 12,710,766 bytes in 10,810 blocks
==1143== total heap usage: 15,851 allocs, 5,041 frees, 47,396,077 bytes allocated
==1143==
==1143== LEAK SUMMARY:
==1143== definitely lost: 0 bytes in 0 blocks
==1143== indirectly lost: 0 bytes in 0 blocks
==1143== possibly lost: 30,965 bytes in 199 blocks
==1143== still reachable: 12,679,801 bytes in 10,611 blocks
==1143== suppressed: 0 bytes in 0 blocks
==1143== Rerun with --leak-check=full to see details of leaked memory
==1143==
==1143== For lists of detected and suppressed errors, rerun with: -s
==1143== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Running CUDA programs in Valgrind is obviously slower because of the instrumentation,
but this is also true of similar tools like compute-sanitizer.
There is also a problem with “ioctl” where you get errors like these from Valgrind:
==1143== Warning: noted but unhandled ioctl 0x30000001 with no size/direction hints.
==1143== This could cause spurious value errors to appear.
==1143== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
These warnings are probably indicative of Valgrind having problems understanding the CUDA primitives
related to GPU kernel code.
Valgrind is fine for checking your host code for C++ memory usage errors,
but lacking for device code checking. Hence, compute-sanitizer is preferred overall.
Warning-Free Build
Don’t ignore compiler warnings! A very good goal for C++ software quality is to get to a warning-free compile. You should think of compiler warnings as doing “static analysis” of your code. To maximize this idea, turn on more warning options, since the warnings are rarely wrong in modern compilers, although some are about harmless things.
Harmless doesn’t mean unimportant. And anyway, the so-called “harmless” warnings aren’t actually harmless, because if there are too many of them in the compilation output, then the bad bugs won’t get seen.
Hence, make the effort to fix the minor issues in C++ code that cause warnings.
For example, fix the “unused variable” warnings or “mixing float and double” type warnings,
even though they’re rarely a real bug.
And yet, sometimes they are!
This is why it’s powerful
to have a warning-free compile.
Tracking compilation warnings. One way to take warning-free compilation to the next level is to actually store and analyze the compiler output. It’s like log file analysis in DevOps, only it’s not for systems management, but for debugging. On Linux, I typically use this idea:
make build |& tee makebuild.txt
Here’s an actual example from a Makefile in an Aussie AI project on Linux:
build:
-@make build2 |& tee makebuild.txt
-@echo 'See output in makebuild.txt'
The Makefile uses the “-” and “@” command prefixes: “@” means the command itself is not echoed to the output, and “-” means make doesn’t stop if one of the steps triggers an error.
When the build has finished, then we have a text file “makebuild.txt” which can be viewed for warning messages.
To go further, I usually use grep to remove
some of the common informational messages, to leave only warning messages.
Typically, my Linux command looks like:
make warnings
Here’s an example of the “warnings” target in a Makefile for one of my Aussie AI projects:
warnings:
-@cat makebuild.txt | grep -v '^r -' \
| grep -v '^g++ ' | grep -v '^Compiling' \
| grep -v '^Making' | grep -v '^ar ' \
| grep -v '^make\[' | grep -v '^ranlib' \
| grep -v '^INFO:' | grep -v 'Regressions failed: 0' \
| grep -v 'Assertions failed: 0' | grep -v SUCCESS \
|more
Note that this uses grep to remove the informational messages from g++, ar, ranlib, and make.
And it also removes the unit testing success messages if all tests pass (but not if they fail!).
The idea is to show only the bad stuff
because log outputs with too many lines get boring far too quickly
and then nobody’s watching.
One annoying thing about using grep with make is that you get these kind of error messages:
make: [annoying] Error 1 (ignored)
Here’s a way to fix them in a Makefile on Linux:
-@grep tmpnam *.cu *.cpp || true
The “true” command is a shell command that never fails.
Note that this line uses the double-pipe “||” shell logical-or operator,
so it only runs “true” if grep fails.
But don’t accidentally use a single “|” pipe operator, which would actually be a silent bug!
This idiom makes the line calling grep always return a zero (success) status, and then make is silent.
Finally, your warning-free tracking method should ideally be part of your “nightly builds” that do more extensive analysis than the basic CI/CD acceptance testing. You should email those warnings to the whole team, at about 2am ideally, because C++ programmers don’t deserve any sleep.
Linters for CUDA C++
Linters, or “static analyzers,” are tools that examine your source code for errors or stylistic concerns. General advice in regard to using linters for CUDA C++ programming is:
- Use compiler warnings as free linting.
- Use a separate linter build sequence.
- Have two linter paths (one for bugs, one for style).
- Use multiple compilers and linters for extra coverage.
- Automate linting into the nightly build.
Note that we have an active project for a CUDA C++ linter. Find more information about Aussie Lint at https://www.aussieai.com/cuda/projects.
Using gcc as a linter.
If you want more warnings, and who doesn’t, you can enable them in gcc on Linux.
You can either do this in your main build by enabling more compiler warnings,
or use a separate build path (e.g., choose an inspiring name like: “make lint”)
so that the main build is not inundated with new warnings.
The way to do this is via the “--compiler-options” command-line option to nvcc,
which specifies pass-through options for the underlying C++ compiler.
By default, this compiler is gcc on Linux and cl.exe on Windows.
Since nvcc uses source-to-source compilation for the host code,
these options will be running on most of your CUDA C++ host code,
except for the parts that gcc wouldn’t understand
(e.g., the <<<...>>> kernel launch syntax
will be modified before being passed through).
An example command with extra linting power would be:
nvcc --compiler-options="-Wall" aussie-test-crashes.cu
Some useful gcc warning flags include:
- -Wall — “all” warnings (well, actually, some).
- -Wextra — the “extra” warnings not enabled by “-Wall”.
- -Wpedantic — yet more of the fun ones.
Hence, a longer command is:
nvcc --compiler-options="-Wall -Wpedantic -Wextra" aussie-test-crashes.cu
You know, I really cannot say that I am a fan of endlessly scrolling warnings
from the “pedantic” mode.
Maybe turn that one off, and pick-and-choose individual flags from the “pedantic” list.
For example, I have used “-Wpointer-arith” in projects.
Linting device code
Device code is directly compiled by nvcc,
rather than via source-to-source compilation, so the device code won’t get linted this way.
We could try to bypass nvcc and use gcc directly, such as this:
gcc -I/usr/local/cuda/include -Wall myfile.cu
But there are at least two problems: (a) the file suffix “.cu” needs to be changed to “.cpp” or similar, and (b) code sequences like “<<<” and “__global__” won’t be understood by gcc.
Hence, you need to have a separate linting build sequence that renames files.
You may also need source code changes to wrap kernel launch syntax with #if statements,
or alternatively, use some fancy sed replacement tricks.
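For example, the #if wrapping of a kernel launch might look like this sketch (LINTING is a hypothetical macro defined only in the lint build path, and the kernel name is illustrative):

#ifdef LINTING
    clear_vector_kernel(d_v, n);                       // Plain call that gcc can parse
#else
    clear_vector_kernel<<<blocks, threads>>>(d_v, n);  // Real CUDA launch syntax
#endif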
Code that is both host and device code will presumably go through both paths, and thus will be linted. This inspires another idea of a linting strategy to get device code covered, too, which is:
- Use a separate linting path.
- Use nvcc as the compiler, but with the gcc compiler warnings on.
- Mark all the device C++ code as also “__host__” code.
But we don’t want to change this in our main code base
that is processed by nvcc.
Hence, this requires tricks like a macro that’s only enabled in linting mode,
or a sed trick to add __host__ wherever there’s a __device__ specifier.
This starts getting into murky territory,
and I’m going to say it’s probably not worth the effort,
but maybe it has some value.
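For completeness, one possible macro trick (my sketch, not a standard idiom) looks like this:

#ifdef LINTING
#define DEVICE_FN __host__ __device__   // Lint build: the host compiler pass also sees the code
#else
#define DEVICE_FN __device__            // Normal build: device-only
#endif

DEVICE_FN float squared(float x) { return x * x; }   // Now linted via the host path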
Fixing Linter Warnings
Here’s some advice about fixing the code to address linter concerns:
- Aim for a warning-free compilation of bug-level messages.
- Don’t overdo code changes to fix any stylistic complaints.
Fix the bugs found by warnings (obviously), but as far as the stylistic type warnings are concerned, be picky. I say, aim for code quality and resilience, not code aesthetic perfection.
Warning-free linting. As with the main build, if you’re not fixing the less severe linter warnings, turn them off, or have two separate build sequences for the main anti-bug linting versus stylistic linting. You want any newly found serious problems to be visible, not lost in a stream of a hundred other spurious warnings. Hence, high quality code requires achieving a warning-free linting status for the main warnings.
On the other hand, you don’t want programmers doing too much “busy work” fixing minor coding style warnings with little practical impact on code reliability. Hence, you might find that your policy of “warning-free linting” needs to suppress some of the pickier warnings. And that’ll be a fun meeting to have.
References
- GNU, Sep 2024 (accessed), 3.8 Options to Request or Suppress Warnings (GCC warning options), https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
- Thomas M. Baumann, Jose Gracia, 3 Oct 2013, Cudagrind: A Valgrind Extension for CUDA, https://arxiv.org/abs/1310.0901, https://github.com/dpc-grindland/Cudagrind (Valgrind Memcheck for CUDA C++, but over 10 years old)