Aussie AI
Chapter 8. CUDA Error Checking
-
Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
-
by David Spuler
CUDA Error Checking
Everyone’s always known that it’s good programming style to check all error return codes.
It’s extra work, but everyone does it anyway, because it’s so important.
I’ve been coding for years, and I’ve never seen a printf or fopen without an if statement immediately after it.
CUDA puts all your good intentions to a much stronger test, because literally every CUDA API function can fail, at random times, with any possible error code. One of the issues is that a CUDA kernel can fail asynchronously, and then report its error at the next available opportunity, with timing unrelated to the host code on the CPU.
The solution is that after every CUDA function call, you add an if statement.
Here’s an example for manually checking cudaMemcpy calls:
errval = cudaMemcpy(device_v, v, sz, cudaMemcpyHostToDevice);
if (errval != cudaSuccess) {
    // CUDA error...
    AUSSIE_CUDA_ERROR(errval, "cudaMemcpy host-to-device failure");
}
But then your fingers eventually get tired from too many keystrokes, and you start making copy-paste errors, too. Maybe you should try an AI copilot? Alternatively, let’s define a macro for that.
The main styles I’ve seen for CUDA error checking macros are either:
- Wrap every CUDA function call in a macro, or
- Test after CUDA function calls.
There are pros and cons of each approach, but they both suffer from a major limitation: manual code changes. I have suggestions of two ways to automate it:
- Recursive preprocessor macro intercepts.
- Macro intercepts to debug wrapper functions.
These are all explained in detail further below.
CUDA Error Check Macros
The use of a CUDA error check macro works as a replacement for manual error checking. Here’s an example usage of a macro:
AUSSIE_CUDA_ERRORCHECK(cudaDeviceSynchronize() );
Note that you can also wrap the assignment of the error code to a variable for further analysis.
cudaError_t errval = cudaSuccess;
// ...
AUSSIE_CUDA_ERRORCHECK( errval = cudaDeviceSynchronize() );
Here’s one version of what the macro can look like:
#define AUSSIE_CUDA_ERRORCHECK(codeexpression) \
    do { \
        cudaError_t err = codeexpression ; \
        if (err != cudaSuccess) { \
            fprintf(stderr, \
                "CUDA ERROR: %d (%s) in %s at %s:%d\n", \
                (int)err, cudaGetErrorString(err), \
                __func__, __FILE__, __LINE__); \
        } \
    } while(0)
This macro definition uses the common C++ do..while(0) idiom to make the macro behave like a single statement.
This style avoids the semicolon problems that would arise if you used curly braces alone,
so don't add a semicolon at the end of the macro definition, or all that work is in vain.
It also avoids a serious "dangling-else" bug that occurs if the macro body is a bare if statement.
But don't forget the parentheses for the function call inside the macro:
AUSSIE_CUDA_ERRORCHECK( cudaDeviceSynchronize ); // Wrong!
AUSSIE_CUDA_ERRORCHECK( errval = cudaDeviceSynchronize ); // Also wrong!
Here’s another more elegant method of doing the macros in combination
with an inline function in a header file:
#define AUSSIE_CUDA_ERRORCHECK2(codeexpression) \
aussie_cuda_check_function((codeexpression), \
__func__, __FILE__, __LINE__)
inline void aussie_cuda_check_function(cudaError_t err,
    const char *func, const char *fname, int lnum)
{
    if (err != cudaSuccess) {
        fprintf(stderr,
            "CUDA ERROR: %d (%s) in %s at %s:%d\n",
            (int)err, cudaGetErrorString(err),
            func /*__func__*/,
            fname /*__FILE__*/,
            lnum /*__LINE__*/ );
    }
}
Checking After CUDA Calls
The alternative method is to check after the CUDA API calls.
The macro includes a call to cudaGetLastError or cudaPeekAtLastError.
The use of the macro looks like:
AUSSIE_CUDA_CHECKAFTER(); // calls cudaGetLastError
Here is one way to define it:
#define AUSSIE_CUDA_CHECKAFTER() \
    do { \
        cudaError_t err = cudaGetLastError(); \
        if (err != cudaSuccess) { \
            fprintf(stderr, \
                "CUDA ERROR: %d (%s) in %s at %s:%d\n", \
                (int)err, cudaGetErrorString(err), __func__, __FILE__, __LINE__); \
        } \
    } while(0)
An equivalent and perhaps clearer method is to use one of these instead after a kernel launch:
AUSSIE_CUDA_ERRORCHECK2( cudaPeekAtLastError() );
AUSSIE_CUDA_ERRORCHECK2( cudaGetLastError() );
Kernel launch special case. I’ve got some better ideas for the error check macros, but there’s one situation where you definitely must use the “check after” style: kernel launches.
The <<<...>>> kernel invocation syntax does not return a status code,
so there’s nothing to check.
Also, there's an obscure situation
whereby synchronous kernel launch errors (e.g., a threads-per-block size greater than 1024) can be missed,
unless they are checked immediately using either cudaGetLastError or cudaPeekAtLastError.
Hence, the best solution for kernel launch error detection, at least in non-production mode, is something like this:
mykernel <<< blocks, threads >>> (v, n);
AUSSIE_CUDA_ERRORCHECK2( cudaPeekAtLastError() );
Alternatively, maybe one day in the distant future this will work:
err = mykernel <<< blocks, threads >>> (v, n); // FAILS!
AUSSIE_CUDA_ERRORCHECK2(err);
Recursive Macro Error Checks
C++ allows macros to use their own names, in the sense that a macro name is not expanded again inside its own expansion. So it's not truly "recursive": you get a once-only expansion, not an infinite one. This behavior has been part of C and C++ since the languages were standardized, so you can rely upon it. For example, these would be harmless:
#define cudaMemset(a,b,c) cudaMemset(a,b,c)
#define cudaMemcpy(a,b,c,d) cudaMemcpy(a,b,c,d)
The idea is to automatically add the error check macros:
#define cudaMemset(a,b,c) AUSSIE_CUDA_ERRORCHECK(cudaMemset(a,b,c))
#define cudaMemcpy(a,b,c,d) AUSSIE_CUDA_ERRORCHECK(cudaMemcpy(a,b,c,d))
But that doesn’t quite work, when used with this type of call:
errval = cudaMemcpy(....);
The do...while(0) trick expands out to give a compilation syntax error:
errval = do { ... // etc.
Similarly, the version with a combined macro and inline function also gets a
different type of compilation error:
errval = aussie_cuda_check_function(....)
The problem is that the return type of the inline function is void.
Hence, we’d need to go back and fix any code that uses the return value of cudaMemcpy or cudaMemset,
which would be a good job for a coding copilot, if only I didn’t have so many trust issues.
Instead, we can just fix the return type to be cudaError_t and use a pass-through of the error code:
#define AUSSIE_CUDA_ERRORCHECK3(codeexpression) \
aussie_cuda_check_function2((codeexpression), __func__, __FILE__, __LINE__)
inline cudaError_t aussie_cuda_check_function2(
    cudaError_t err,
    const char *func, const char *fname, int lnum)
{
    if (err != cudaSuccess) {
        fprintf(stderr,
            "CUDA ERROR: %d (%s) in %s at %s:%d\n",
            (int)err, cudaGetErrorString(err),
            func /*__func__*/,
            fname /*__FILE__*/,
            lnum /*__LINE__*/ );
    }
    return err; // pass through!
}
And we really should add a ridiculous number of round brackets around the macro parameters,
and also use #undef
for total macro safety:
#undef cudaMemset // safety
#undef cudaMemcpy
#define cudaMemset(a,b,c) (AUSSIE_CUDA_ERRORCHECK3(cudaMemset((a),(b),(c))))
#define cudaMemcpy(a,b,c,d) (AUSSIE_CUDA_ERRORCHECK3(cudaMemcpy((a),(b),(c),(d))))
Voila!
Now we have a set of macros that automatically add CUDA return-code error checking around
all calls to cudaMemcpy and cudaMemset,
and they work whether or not the return value is used at the call site.
To use them properly,
we just need to #include a header file near the top of every CUDA C++ source file.
But it has to be included
after any CUDA Toolkit header files like "cuda_runtime.h", because those system
header files contain prototype declarations of functions like cudaMemcpy that our tricky macros would break.
Now we only have to add similar recursive macros for all 1,657 of the CUDA Runtime API functions. No, relax, I’m just kidding. The number is 50 according to this command:
grep cudaError_t /usr/local/cuda/include/cuda_runtime.h | wc -l
Yeah, I could probably have used “grep -c” but I just don’t want to, and you can’t make me.
Macro Intercepted Debug Wrapper Functions
Is there any way you can level up? We’ve already auto-added the error checking macros around all the CUDA Runtime API calls. Can we do better?
Of course, we can!
One extension is to build debug wrapper function versions for the main API calls. These functions can then perform more extensive error self-checking than is performed within the CUDA Runtime.
#undef cudaMemcpy
cudaError_t aussie_cudaMemcpy_wrapper(void *destp, const void *srcp, size_t sz, enum cudaMemcpyKind mode)
{
    cudaError_t err = AUSSIE_CUDA_ERRORCHECK3(cudaMemcpy(destp,srcp,sz,mode));
    return err;
}
#define cudaMemcpy(a,b,c,d) aussie_cudaMemcpy_wrapper(a,b,c,d) // Intercept!
Note that the #undef is really important here,
and must be before the wrapper function body.
If we’re not careful, our wrapper function can wrap itself, and become infinitely recursive.
The above example doesn’t do any extra error checking, other than what we’ve already put
into the CUDA error checking macro (i.e., AUSSIE_CUDA_ERRORCHECK3),
which checks for the cudaSuccess return code.
However, we could add extra self-checking code for common cudaMemcpy errors
(often copy-paste mistakes):
- Destination or source pointer is null.
- Destination or source pointer is in the wrong address space (host vs. device).
- Destination pointer equals source pointer.
The CUDA Runtime already finds a lot of those errors, and compute-sanitizer would find even more.
However, we could go further with our analysis.
For example, some more extensive error checks could include:
- Pointer allocated by cudaMalloc but never copied by cudaMemcpy.
- Pointer allocated by cudaMalloc only partially copied by cudaMemcpy.
- cudaMemcpy size argument is zero or negative (after conversion to size_t).
- cudaMemset arguments appear to be in reverse order.
The possible error checks from this type of API interception are discussed further in the full section on debug wrapper functions.
Limitations of Macro Interception
Two of these methods rely on preprocessor macro interception to auto-wrap the calls with debug checks. Unfortunately, macro interception isn’t a perfect solution, and some of the problems that macros may have include:
- No way to auto-intercept the <<<...>>> kernel launch syntax.
- Problematic for device code (e.g., CUDA Dynamic Parallelism, fprintf not available).
- Interception of new and delete operators is only possible at link-time, and even that trick won't work for device code.
- Namespace-qualified calls fail: e.g., cuda::cudaMemcpy(...) or std::malloc(...).
- Use of CUDA API names as function pointers won't work.
- Non-standard calling syntax: e.g., parentheses around the function name.
Much better than macro interception would be a way to link against a debug version of the CUDA Runtime library. If only it were open source! Many more complex error checks are possible than are currently performed, and a debug runtime would significantly shorten the time to detect many types of coding errors.
Alternatively, tools such as compute-sanitizer could link with a
debug runtime library version that contained a more extensive
set of checks.
Maybe this is feasible to do via the callback methods
in the Compute Sanitizer API, which is an official
part of the CUDA Toolkit.
Reporting and Handling CUDA Errors
What should an error checking macro do on failures? Some of the many options include:
- Print an error message.
- Print the error code number and its name with cudaGetErrorString.
- Give source code context information.
- Exit the program (or not?).
That’s not the full list, and some more advanced ideas for production-grade error handling include:
- Throw an exception and hope someone's listening.
- Full stack trace (e.g., std::stacktrace in C++23).
- Report a full error context for supportability in the wild.
- Log information to a file, not just to stderr.
- Try to recover if it's a non-sticky error.
- Gracefully crash if it's a sticky error.
- Try to determine whether it's a current failure or one from a prior kernel launch.
- Call a debug breakpoint function to help with interactive debugging.
- Abort the program to generate a "core" file.
A key aspect of reporting the error context is identifying the CUDA C++ statement that triggered the issue. The basics of error context are these macros:
- __func__
- __FILE__
- __LINE__
I don’t know why one is lower case and two are upper case, but it’s called international standardization. That’s an example of what makes C++ programming so fun.
However, I have to say that I think these source code context
macros are on their way out.
Once reporting a full stack trace with C++23's std::stacktrace is widespread,
why would we need those macros?
Also gone would be lots of preprocessor macro tricks that only exist
in order to report the source code context.
Instead, use an inline function and std::stacktrace.
More advanced error context that can help with supportability includes things like:
- Date and time of error.
- User query that triggered the failure.
- Random number seed (for reproducibility of AI errors).
- Full stack trace (if available)
I feel like there should be an LLM for this. Maybe I’ll go look on Hugging Face.
Limitations of CUDA Error Checking
Some problem areas include:
- No return code for kernel launches.
- Sticky errors.
- Some odd idiosyncrasies in the CUDA Runtime API (are they bugs?)
- Not all types of bugs are raised as runtime errors.
- Limited possibilities in device code
But I have to say that the really major limitation is this:
Remembering to add it every time!
I’ve given a few suggestions for auto-fixing that issue above, but they’re far from perfect. Maybe the CUDA Runtime API needs a callback mechanism, or some other method whereby programmers can ensure that they never miss an error return.
Error checking in device code. It used to be that you only needed this error checking in the host code, because none of the runtime APIs worked in device code. And then someone at NVIDIA decided to change that with nested kernel launches, and then someone else in marketing decided to give it a trendy name and a three-letter acronym: CUDA Dynamic Parallelism (CDP).
The downside is that the C++ features are
much more limited in device code.
For example, you can’t easily use global variables
or write to files.
You can’t even use fprintf, for heaven’s sake,
so you can only print to stdout.
Which is great news, because it means you get to have the fun of defining an error checking macro all over again!