Aussie AI
Chapter 7. CUDA Debugging Tools
Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
by David Spuler
CUDA Tools Overview
Compiler and IDE tools for programming CUDA include:
- NVIDIA C++ Compiler (NVCC) — the nvcc command-line compiler.
- Nsight Eclipse Edition — integration with the Eclipse IDE.
- Nsight Visual Studio Edition (VSE) — CUDA’s Microsoft Visual Studio integration.
- Nsight Visual Studio Code Edition (VSCE) — Visual Studio Code integration.
Debugging tools include:
- cuda-gdb — command-line debugging on Linux (very similar to gdb).
- Compute Sanitizer — command-line debugging tool with four sub-tools: memcheck for memory debugging, racecheck for race conditions, synccheck for synchronization checking, initcheck for initialization checks.
- cuda-memcheck — discontinued tool, replaced by compute-sanitizer and its default memcheck tool.
Optimization and performance profiling tools include:
- NVIDIA Visual Profiler — performance profiling with a GUI interface.
- Nsight Systems — system profiling and tracing.
- Nsight Compute — performance profiling for CUDA kernels.
- Nsight Graphics — specialized profiling for graphics applications.
- Nsight Deep Learning Designer — profiler focused on AI/ML applications.
- nvprof — command-line profiler (now deprecated).
In-code debugging libraries and CUDA C++ code-related tools that you might need include:
- Debug wrapper library for CUDA C++
- Emulation test library
- Linter for CUDA C++
These are active projects for us at Aussie AI (see https://www.aussieai.com/cuda/projects).
There are also some advanced APIs and SDKs available from NVIDIA if you want to get ambitious and do some very deep integrations into the CUDA tools:
- Compute Sanitizer API — create a new “tool” for compute-sanitizer.
- CUDA Debugger API — cuda-gdb API integration on Linux.
- NVIDIA Tools Extension SDK (NVTX) — tool integration API.
- CUDA Profiling Tools Interface (CUPTI) — profiling and tracing integration API.
- Nsight Aftermath SDK — postmortem crash debugging.
- Nsight Perf SDK — performance profiling for graphics applications.
- Nsight Tools JupyterLab Extension — extension for profiling of Python applications.
The remainder of this chapter focuses on the debugging tools and their capabilities.
Command-Line Debugging Tools
The main command-line debugging tools to use are:
- cuda-gdb (interactive debugger on Linux or Windows WSL)
- compute-sanitizer (including sub-tools: memcheck, initcheck, racecheck, synccheck)
Profiling tools you can use on the command-line include:
- ncu (Nsight Compute CLI)
- nvprof — useful, but it’s deprecated, and will be riding off into the sunset.
- gprof — the standard Linux profiler is useful for host code.
Compute Sanitizer
Compute Sanitizer is an NVIDIA tool with four sub-tools that detect different classes of problems. It also supports multiple platforms: Linux and Windows.
The most frequent usage of the compute-sanitizer command-line tool is likely to be memory debugging.
If you have an error in a CUDA C++ application on Linux, and you can reproduce it,
then just re-run the application with the Compute Sanitizer tool:
compute-sanitizer a.out
This is equivalent to:
compute-sanitizer --tool memcheck a.out
You can also supply command-line arguments to your application after the executable name, or additional options to Compute Sanitizer before the executable name.
The default tool for compute-sanitizer is “memcheck” for memory fault detection,
and you don’t need a specific option to run it.
This mode is very similar in usage to valgrind on Linux,
which does similar memory checking functions on standard CPU C++ applications,
but compute-sanitizer knows more about GPU memory problems.
There also used to be a cuda-memcheck tool, which compute-sanitizer has now superseded.
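For illustration, here is a minimal sketch (my own hypothetical code, not from the toolkit) of the classic off-by-one kernel bug that memcheck detects:

__global__ void clear_vector(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    v[i] = 0.0f;  // Bug: no bounds check, so surplus threads write past the end
}

// Launched with more threads than elements, e.g.:
//    clear_vector<<<4, 256>>>(d_v, 1000);  // threads 1000..1023 write out of bounds

Running this under compute-sanitizer should produce “Invalid __global__ write” reports for the out-of-bounds threads.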
Compute Sanitizer does not require any re-compilation, and can run on your program
just after it has crashed.
However, its reports can be clearer to read
if there is debug symbol information available in the
executable, such as compilation with “-g” (host) or “-G” (device)
debug information options to nvcc.
Hence, it may be advisable to maintain these debug-enabled versions
of executables in your build systems,
as they are useful for both cuda-gdb and compute-sanitizer.
The errors reported by the memory checking tool include:
- Memory access problems (including device-side)
- Errors with malloc and free
- Double free
- Memory leaks (especially with the “--leak-check full” option)
However, it is not limited to memory errors, and also finds:
- CUDA runtime errors
- Hardware exceptions
There are various additional options that you can turn on for additional error checking.
Abnormal program termination
One of the things about compute-sanitizer that can be tricky is that it doesn’t
fully detect the cause of actual crashes of your application.
Instead, sometimes you only get a report like this:
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors
The last line is misleading, as there weren’t zero errors,
but the second-last line is more useful: “Target application returned an error”.
This often means that your program crashed in the host code.
cuda-gdb batch mode.
You can detect this host program crash better in cuda-gdb, as it will trap the signals, so just run an interactive debugging session.
Alternatively, if you have a simple reproducible case,
you can automate this with batch mode,
where the command to run is like this:
cuda-gdb --batch --command=cuda-gdb-test.txt a.out
The batch input file is a set of cuda-gdb commands:
run
where
exit
Here’s an example output (abridged):
Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
0x00007ffff7cdfa4e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#0 0x00007ffff7cdfa4e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x000055555555fdb5 in aussie_cudaMalloc(void**, int) ()
#2 0x0000555555562ea9 in aussie_run_clear_vector_kernel_generic(int) ()
#3 0x00005555555633aa in main ()
A debugging session is active.
Inferior 1 [process 5143] will be killed.
Quit anyway? (y or n) [answered Y; input not from terminal]
There are various other useful things that can be automated using batch cuda-gdb and its script commands.
For example, you can use it as a trace mechanism that prints out the stack trace
at every call to a certain function.
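For example, a batch command file like the following (a sketch using standard gdb breakpoint command lists, here tracing the aussie_cudaMalloc function from the stack trace above) prints a stack trace at every call:

break aussie_cudaMalloc
commands
silent
where
continue
end
run
exit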
racecheck
The racecheck tool is a sub-tool of Compute Sanitizer for detecting “race conditions” or “data races.”
The command to run the tool is:
compute-sanitizer --tool racecheck a.out
This tool detects problems in thread accesses to shared memory, but won’t help with any race conditions involving global memory accesses. It works by detecting “hazards,” which means conditions that indicate the potential for a race condition occurring. This is more effective than trying to detect actual race conditions, as they are transient and often non-reproducible. The types of hazards found include:
- Write-write
- Write-read
- Read-write
Obviously, the accesses must be to the same memory location for a hazard to exist. Also, note that a “read-read” sequence is never going to be a hazard for a race condition, but is just normal parallelism!
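For example, here is a hypothetical sketch of a shared memory write-write hazard of the kind racecheck reports:

__global__ void block_counter(int *out)
{
    __shared__ int total;
    total = 0;              // Write-write hazard: every thread in the block writes here
    __syncthreads();
    atomicAdd(&total, 1);   // Atomic accesses are safe
    __syncthreads();
    if (threadIdx.x == 0) *out = total;
}

The usual fix is to guard the initializing write with “if (threadIdx.x == 0)” so that only one thread writes before the barrier.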
synccheck
The synccheck tool is a sub-tool of Compute Sanitizer that detects synchronization issues.
It looks for “hazards” that indicate thread synchronization problems in GPU kernels.
Here is the execution command:
compute-sanitizer --tool synccheck a.out
This tool focuses on synchronization issues in device code, such as with the APIs:
- __syncthreads()
- __syncwarp()
Some of the errors found by synccheck include:
- Invalid arguments
- Thread divergence (warp-level or block-level)
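For example, a barrier inside a divergent branch is the classic block-level divergence problem; here is a hypothetical sketch:

__global__ void divergent_sync(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {          // Divergent when n is not a multiple of the block size
        data[i] *= 2;
        __syncthreads();  // Bug: barrier is not reached by all threads in the block
    }
}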
initcheck
The initcheck tool is a sub-tool of Compute Sanitizer
for detecting initialization issues
in device accesses to device global memory.
Execution is performed by:
compute-sanitizer --tool initcheck a.out
Note that this only examines global memory,
so it won’t find other types of uninitialized memory accesses in kernels.
The default memcheck tool in compute-sanitizer can find other similar problems.
The main issues found by initcheck are:
- Accesses to uninitialized device global memory.
- Unused device global memory (never accessed).
Fixing an uninitialized memory access issue found by initcheck usually means initializing your device memory properly, such as with:
- Device-side array initialization code
- cudaMemset
- cudaMemcpy
If initcheck reports on unused device global memory, this may indicate some sort
of algorithm error, whereby some of the global memory is not being used.
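For example, this host-side sketch (hypothetical code with error checking omitted; read_vector_kernel is an assumed kernel name) allocates device global memory but never initializes it before a kernel reads it:

float *d_v = nullptr;
cudaMalloc(&d_v, 1000 * sizeof(float));      // Allocated, but contents uninitialized
// Missing: cudaMemset(d_v, 0, 1000 * sizeof(float));
read_vector_kernel<<<4, 256>>>(d_v, 1000);   // Kernel reads uninitialized global memory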
cuda-gdb
I’m a big fan of gdb for debugging standard C++ on Linux, and cuda-gdb is even better.
The platform support for cuda-gdb is primarily Linux,
but also Windows WSL2, and macOS in a “host only” mode for remote debugging.
The cuda-gdb tool is a source-code modification of the open source gdb code
to add NVIDIA GPU support,
and you can actually find the cuda-gdb source code on GitHub at https://github.com/NVIDIA/cuda-gdb/.
Hence, cuda-gdb has most of the gdb features, and NVIDIA tries
to keep up with new features.
The basic commands from gdb are all supported:
- r or run — run the code (with optional arguments), or restart if already running.
- c or continue — continue running (after stopping at a breakpoint).
- s or step — step through statements (also just Enter to repeat).
- where — stack trace (also aliased to “bt” for backtrace).
- list — source code listing.
- p or print — print a variable or expression.
- up — move up the call stack to an outer frame.
- n or next — step to the next statement, stepping over function calls.
Some of the CUDA-specific commands inside cuda-gdb include:
info cuda
To see the list, run the help command:
help info cuda
Examples of some of the many CUDA debugging sub-commands include:
info cuda devices
info cuda sms
info cuda warps
info cuda lanes
info cuda blocks
info cuda threads
Pre-Breakpointing Trick
One advanced tip
for using cuda-gdb
is to define a function called “breakpoint” in your C++ application.
Here’s an example:
void breakpoint()
{
volatile int x = 0;
x = 0; // Set breakpoint here
}
It looks like a silly function, but it serves one useful purpose.
The idea is that
when you start an interactive debugging
session in cuda-gdb, or automatically in your “.cuda-gdbinit” resource file, you can set a breakpoint there:
b breakpoint
Why do that?
The reason is that you also add calls to your “breakpoint” function at relevant points in
various places where failures can occur:
- CUDA error check macros
- Assertion macros
- Debug wrapper function failure detection
- Unit test failures
Hence, if any of those bad things happen while you’re running interactively in the debugger,
you’re immediately stopped at exactly that point.
If you’re not running in the debugger, this is a very fast function (though admittedly, it
can’t be inline!), so it doesn’t slow things down much.
You can even consider leaving this in production code, since the breakpoint function
is only called in rare situations
where a serious failure has already occurred, in which case execution speed is not a priority.
This technique is particularly useful because you don’t have to go back and figure out how to reproduce the failure, which can be difficult for some types of intermittent failures from race conditions or other synchronization problems. Instead, it’s already been pre-breakpointed for you, with the cursor blinking at you, politely asking you to debug it right now, or maybe after lunch.
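As a concrete sketch, here is one way to wire the breakpoint function into a CUDA error check macro (the macro name and message format are my own, not from the book’s codebase):

#include <cstdio>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                cudaGetErrorString(err_), __FILE__, __LINE__); \
            breakpoint();  /* the pre-set breakpoint fires here in cuda-gdb */ \
        } \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_v, n * sizeof(float)));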
Postmortem Debugging
Postmortem debugging involves trying to debug a program crash,
such as a “core dump” on Linux.
In this situation, you should have a “core” file that you can load
into cuda-gdb.
The command to use is:
cuda-gdb a.out core
Unfortunately, not all errors in a CUDA application will trigger
a core dump, so you might have nothing to debug if it doesn’t.
One way to ensure that you get a core file is
to set the environment variable CUDA_ENABLE_COREDUMP_ON_EXCEPTION,
which will cause the CUDA Runtime to dump a core file on various additional failures.
Sometimes in large environments, it’s hard to know where a core file came from.
An advanced feature of this environment variable is that you can format
the filename to be more useful than just “core.”
You can specify the format to include things like the time/date
and the name of the executable.
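For example, a shell setup along these lines should work (the CUDA_COREDUMP_FILE variable and its “%h”/“%p”/“%t” specifiers are described in the cuda-gdb documentation; check the documentation for your CUDA version):

export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
export CUDA_COREDUMP_FILE="core_%h_%p_%t"   # hostname, process ID, timestamp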
Programmatic C++ core dumps. If you want your CUDA C++ to take control of its own core dumps (e.g., on exceptions, assertion failures, etc.), there are various options:
- You can always fork-and-abort on Linux.
- Maybe putenv("CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1") might work?
- Surely you can write some code to crash!
On the other hand, maybe you’re only thinking about core dumps because you want to save debug context information. Doing this might obviate the need for a core dump:
- Use std::stacktrace (the C++23 facility) or another backtrace library.
- Print error context information (e.g., the user’s query).
- Print platform details.
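For example, here is a minimal sketch using the C++23 standard facility (assuming a compiler and standard library with <stacktrace> support):

#include <iostream>
#include <stacktrace>   // C++23

void report_failure_context(const char *msg)
{
    std::cerr << "FAILURE: " << msg << "\n";
    std::cerr << std::stacktrace::current() << "\n";   // Print the current call stack
}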
Customer core dumps.
One of the supportability issues with postmortem debugging
is that you want your customers
to be able to submit a core file that they have triggered
on your CUDA-based application.
These are usually large files, so there are logistical issues to overcome with uploads.
Another issue is that in order to run cuda-gdb on a core file,
the developer needs
to have exactly the right executable that created the core dump.
Hence, your build and release management needs to maintain available
copies of all executable files in versions shipped to customers or in beta testing
(or to internal customers for in-house applications).
And there needs to be a command-line option whereby the phone support staff can
instruct customers to report the exact version and build number of the executable they are using.
It’s easy to lose track!
Valgrind for CUDA
Can you use the Linux Valgrind tool to detect memory errors in CUDA C++ programs? More specifically, this refers to the Memcheck tool that is part of Valgrind.
In short, you can run Valgrind on your host code,
but you’re probably better off using compute-sanitizer for CUDA C++ programs.
Valgrind will run the whole application, host and device code together, but it only instruments the host side, where it can be used to find errors.
The CUDA Toolkit used to include a tool called “cuda-memcheck” but it has since been
deprecated and removed in favor of compute-sanitizer.
Interestingly, there’s an old research paper [Baumann and Gracia, 2013]
on using Valgrind with CUDA, called the “CudaGrind” tool.
There’s even a Github repo for this tool, although it hasn’t been edited in 9 years,
so I’m not sure it’s still valid.
In any case, the basic Linux version of Valgrind Memcheck is still very well supported. The method to use Valgrind for Linux on a CUDA application is simply to run the executable:
valgrind a.out
If Valgrind is not installed in your Linux environment, you’ll need to do something like this:
apt install valgrind
The start of the Valgrind output is like this:
==1143== Memcheck, a memory error detector
==1143== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1143== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1143== Command: ./a.out
As it executes your program, the output from your program will be interleaved with error reports from Valgrind. Hopefully, there won’t be any!
The end of the Valgrind execution gives you a nice summary of memory leaks and errors.
==1143== HEAP SUMMARY:
==1143== in use at exit: 12,710,766 bytes in 10,810 blocks
==1143== total heap usage: 15,851 allocs, 5,041 frees, 47,396,077 bytes allocated
==1143==
==1143== LEAK SUMMARY:
==1143== definitely lost: 0 bytes in 0 blocks
==1143== indirectly lost: 0 bytes in 0 blocks
==1143== possibly lost: 30,965 bytes in 199 blocks
==1143== still reachable: 12,679,801 bytes in 10,611 blocks
==1143== suppressed: 0 bytes in 0 blocks
==1143== Rerun with --leak-check=full to see details of leaked memory
==1143==
==1143== For lists of detected and suppressed errors, rerun with: -s
==1143== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Running CUDA programs in Valgrind is obviously slower because of the instrumentation,
but this is also true of similar tools like compute-sanitizer.
There is also a problem with “ioctl” where you get errors like these from Valgrind:
==1143== Warning: noted but unhandled ioctl 0x30000001 with no size/direction hints.
==1143== This could cause spurious value errors to appear.
==1143== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
These warnings are probably indicative of Valgrind having problems understanding the CUDA primitives
related to GPU kernel code.
Valgrind is fine for checking your host code for C++ memory usage errors,
but lacking for device code checking. Hence, compute-sanitizer is preferred overall.
Warning-Free Build
Don’t ignore compiler warnings! A very good goal for C++ software quality is to get to a warning-free compile. You should think of compiler warnings as doing “static analysis” of your code. To maximize this idea, turn on more warning options, since the warnings are rarely wrong in modern compilers, although some are about harmless things.
Harmless doesn’t mean unimportant. And anyway, the so-called “harmless” warnings aren’t actually harmless, because if there are too many of them in the compilation output, then the bad bugs won’t get seen.
Hence, make the effort to fix the minor issues in C++ code that cause warnings.
For example, fix the “unused variable” warnings or “mixing float and double” type warnings,
even though they’re rarely a real bug.
And yet, sometimes they are!
This is why it’s powerful
to have a warning-free compile.
Tracking compilation warnings. One way to take warning-free compilation to the next level is to actually store and analyze the compiler output. It’s like log file analysis in DevOps, only it’s not for systems management, but for debugging. On Linux, I typically use this idea:
make build |& tee makebuild.txt
Here’s an actual example from a Makefile in an Aussie AI project on Linux:
build:
-@make build2 |& tee makebuild.txt
-@echo 'See output in makebuild.txt'
The Makefile uses the “-” and “@” command prefixes: “@” means the command itself is not echoed to the output, and “-” means make doesn’t stop if one of the steps triggers an error.
When the build has finished, then we have a text file “makebuild.txt” which can be viewed for warning messages.
To go further, I usually use grep to remove
some of the common informational messages, to leave only warning messages.
Typically, my Linux command looks like:
make warnings
Here’s an example of the “warnings” target in a Makefile for one of my Aussie AI projects:
warnings:
-@cat makebuild.txt | grep -v '^r -' \
| grep -v '^g++ ' | grep -v '^Compiling' \
| grep -v '^Making' | grep -v '^ar ' \
| grep -v '^make\[' | grep -v '^ranlib' \
| grep -v '^INFO:' | grep -v 'Regressions failed: 0' \
| grep -v 'Assertions failed: 0' | grep -v SUCCESS \
|more
Note that this uses grep to remove the informational messages from g++, ar, ranlib, and make.
And it also removes the unit testing success messages if all tests pass (but not if they fail!).
The idea is to show only the bad stuff
because log outputs with too many lines get boring far too quickly
and then nobody’s watching.
One annoying thing about using grep with make is that you get these kind of error messages:
make: [annoying] Error 1 (ignored)
Here’s a way to fix them in a Makefile on Linux:
-@grep tmpnam *.cu *.cpp || true
The “true” command is a shell command that never fails.
Note that this line uses the double-pipe “||” shell logical-or operator,
so it only runs “true” if grep fails.
But don’t accidentally use a single “|” pipe operator, which would actually be a silent bug!
This idiom makes the line calling grep always return a zero (success) status, and then make is silent.
Finally, your warning-free tracking method should ideally be part of your “nightly builds” that do more extensive analysis than the basic CI/CD acceptance testing. You should email those warnings to the whole team, at about 2am ideally, because C++ programmers don’t deserve any sleep.
Linters for CUDA C++
Linters, or “static analyzers,” are tools that examine your source code for errors or stylistic concerns. General advice in regard to using linters for CUDA C++ programming is:
- Use compiler warnings as free linting.
- Use a separate linter build sequence.
- Have two linter paths (one for bugs, one for style).
- Use multiple compilers and linters for extra coverage.
- Automate linting into the nightly build.
Note that we have an active project for a CUDA C++ linter. Find more information about Aussie Lint at https://www.aussieai.com/cuda/projects.
Using gcc as a linter.
If you want more warnings, and who doesn’t, you can enable them in gcc on Linux.
You can either do this in your main build by enabling more compiler warnings,
or use a separate build path (e.g., choose an inspiring name like: “make lint”)
so that the main build is not inundated with new warnings.
The way to do this is via the “--compiler-options” command-line option to nvcc,
which specifies pass-through options for the underlying C++ compiler.
By default, this compiler is gcc on Linux and cl.exe on Windows.
Since nvcc uses source-to-source compilation for the host code,
these options will be running on most of your CUDA C++ host code,
except for the parts that gcc wouldn’t understand
(e.g., the <<<...>>> kernel launch syntax
will be modified before being passed through).
An example command with extra linting power would be:
nvcc --compiler-options="-Wall" aussie-test-crashes.cu
Some useful gcc warning flags include:
- -Wall — “all” warnings (well, actually, some).
- -Wextra — the “extra” warnings not enabled by “-Wall”.
- -Wpedantic — yet more of the fun ones.
Hence, a longer command is:
nvcc --compiler-options="-Wall -Wpedantic -Wextra" aussie-test-crashes.cu
You know, I really cannot say that I am a fan of endlessly scrolling warnings
from the “pedantic” mode.
Maybe turn that one off, and pick-and-choose individual flags from the “pedantic” list.
For example, I have used “-Wpointer-arith” in projects.
Linting device code
Device code is directly compiled by nvcc,
rather than via source-to-source compilation, so the device code won’t get linted this way.
We could try to bypass nvcc and use gcc directly, such as this:
gcc -I/usr/local/cuda/include -Wall myfile.cu
But there are at least two problems: (a) the file suffix “.cu” needs to be changed to “.cpp” or similar, and (b) code sequences like “<<<” and “__global__” won’t be understood by gcc.
Hence, you need to have a separate linting build sequence that renames files.
You may also need source code changes to wrap kernel launch syntax with #if statements,
or alternatively, use some fancy sed replacement tricks.
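For example, the #if wrapping of a kernel launch might look like this sketch (LINTING is a hypothetical macro defined only in the lint build path, and the kernel name is illustrative):

#ifdef LINTING
    clear_vector_kernel(d_v, n);                       // Plain call that gcc can parse
#else
    clear_vector_kernel<<<blocks, threads>>>(d_v, n);  // Real CUDA launch syntax
#endif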
Code that is both host and device code will presumably go through both paths, and thus will be linted. This inspires another idea of a linting strategy to get device code covered, too, which is:
- Use a separate linting path.
- Use nvcc as the compiler, but with the gcc compiler warnings on.
- Mark all the device C++ code as also “__host__” code.
But we don’t want to change this in our main code base
that is processed by nvcc.
Hence, this requires tricks like a macro that’s only enabled in linting mode,
or a sed trick to add __host__ wherever there’s a __device__ specifier.
This starts getting into murky territory,
and I’m going to say it’s probably not worth the effort,
but maybe it has some value.
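For completeness, one possible macro trick (my sketch, not a standard idiom) looks like this:

#ifdef LINTING
#define DEVICE_FN __host__ __device__   // Lint build: the host compiler pass also sees the code
#else
#define DEVICE_FN __device__            // Normal build: device-only
#endif

DEVICE_FN float squared(float x) { return x * x; }   // Now linted via the host path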
Fixing Linter Warnings
Here’s some advice about fixing the code to address linter concerns:
- Aim for a warning-free compilation of bug-level messages.
- Don’t overdo code changes to fix any stylistic complaints.
Fix the bugs found by warnings (obviously), but as far as the stylistic type warnings are concerned, be picky. I say, aim for code quality and resilience, not code aesthetic perfection.
Warning-free linting. As with the main build, if you’re not fixing the less severe linter warnings, turn them off, or have two separate build sequences for the main anti-bug linting versus stylistic linting. You want any newly found serious problems to be visible, not lost in a stream of a hundred other spurious warnings. Hence, high quality code requires achieving a warning-free linting status for the main warnings.
On the other hand, you don’t want programmers doing too much “busy work” fixing minor coding style warnings with little practical impact on code reliability. Hence, you might find that your policy of “warning-free linting” needs to suppress some of the pickier warnings. And that’ll be a fun meeting to have.
References
- GNU, Sep 2024 (accessed), 3.8 Options to Request or Suppress Warnings (GCC warning options), https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
- Thomas M. Baumann, Jose Gracia, 3 Oct 2013, Cudagrind: A Valgrind Extension for CUDA, https://arxiv.org/abs/1310.0901, https://github.com/dpc-grindland/Cudagrind (Valgrind Memcheck for CUDA C++, but over 10 years old)