Aussie AI

Chapter 15. CUDA Debug Wrapper Functions

  • Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
  • by David Spuler

Why Debug Wrapper Functions?

The idea of debug wrapper functions is to fill a small gap in the self-checking available in the CUDA ecosystem. There are two types of self-testing that happen when you run CUDA C++ programs:

  • Error checks in the CUDA Runtime API (i.e., when an API doesn’t return cudaSuccess in host code).
  • compute-sanitizer detection of numerous run-time errors.

Both of these methods are highly capable and will catch a lot of bugs. To optimize your use of these capabilities in debugging, you should:

  • Test all error return codes (e.g., a fancy macro method), and
  • Run compute-sanitizer on lots of unit tests and regression tests in your CI/CD approval process, or, when that gets too slow, at least in the nightly builds.

But this is not perfection! There are two main reasons that some bugs will be missed:

  • CUDA Runtime doesn’t detect all the bugs (because the main aim is speed!).
  • You have to remember to run compute-sanitizer on your code.

Okay, so I’m joking about “remembering” to run the debug tests, because you’ve probably got them running automatically in your build. But there are some real cases where the application won’t ever be run in debug mode:

  • Many internal failures trigger no visible symptoms for users (silent failures).
  • Customers cannot run compute-sanitizer on their premises (unless you ask nicely).
  • Your website “customers” also cannot run it on the website backends.
  • Some applications are too costly to re-run just to debug an obscure error (I’m looking at you, AI training).

Hence, in the first case, there are bugs missed in total silence, never to be fixed. And in the latter cases, there’s a complex level of indirection between the failure occurring and the CUDA C++ programmer trying to reproduce it in the test lab. It’s much easier if your application self-diagnoses the error!

Fast Debug Wrapper Code

But it’s too slow, I hear you say. Running the code with compute-sanitizer is much slower than without. We can’t ship an executable where the kernels have so much debug instrumentation that they’re running that much slower.

You’re not wrong, and it’s the age-old quandary about whether to ship testing code. Fortunately, there are a few solutions:

  • Use fast self-testing tricks like magic numbers in memory.
  • Have a command-line flag or config option that turns debug tests on and off at runtime.
  • Have “fast” and “debug” versions of your executable (e.g., ship both to beta customers).

At the very least, you could have a lot of your internal CUDA C++ code development and QA testing done on the debug wrapper version that self-detects and reports internal errors.
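
The runtime on/off switch mentioned above can be as simple as a global flag that every wrapper consults before running its self-checks. Here’s a minimal sketch under that assumption (the names g_aussie_debug_enabled, aussie_debug_set, and memcpy_wrapper are hypothetical, not from a real library):

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical global switch; real code might set it from argv or a config file.
static bool g_aussie_debug_enabled = false;

void aussie_debug_set(bool on) { g_aussie_debug_enabled = on; }

// Example wrapper: the self-checks cost almost nothing when the flag is off.
void* memcpy_wrapper(void* dest, const void* src, size_t n)
{
    if (g_aussie_debug_enabled) {
        if (dest == NULL || src == NULL) {
            fprintf(stderr, "memcpy_wrapper: null pointer\n");
            return dest;  // skip the real call rather than crash
        }
        if (n == 0) {
            fprintf(stderr, "memcpy_wrapper: zero size (reordered params?)\n");
        }
    }
    return memcpy(dest, src, n);  // the real call, checked or not
}
```

The same executable can then ship with the checks disabled by default, while support staff can flip the flag in the field to get self-diagnosis from a customer’s failing run.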

Like ogres (and onions, as Shrek reminds us), debug wrappers have layers. You can build very fast or very slow self-checking code into your debug wrappers. These self-tests can be as simple as parameter null tests or as complex as detecting memory stomp overwrites with your own custom code. In approximate order of time cost, here are some ideas:

  • Parameter basic validation (e.g., null pointer tests).
  • Address type validation (e.g., via cudaPointerGetAttributes).
  • Magic values added to the initial bytes of uninitialized and freed memory blocks.
  • Magic values stored in every byte of these blocks.
  • Tracking 1 or 2 (or 3) of the most recently allocated/freed addresses.
  • Hash tables to track addresses of every allocated or freed memory block.
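
As a sketch of the magic-value idea, here’s a pair of allocation wrappers that hide a magic word in front of each block, flip it on free, and quarantine freed blocks so that a later double free can be reported safely. All the names and magic constants here are my own hypothetical inventions, not a real library API:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstdint>

static const uint32_t AUSSIE_MAGIC_ALLOC = 0xA110CA7Eu;  // marks a live block
static const uint32_t AUSSIE_MAGIC_FREED = 0xDEADBEEFu;  // marks a freed block
static int g_magic_errors = 0;  // count of failed self-checks

// Allocate n bytes plus a hidden header word holding the magic value.
void* aussie_debug_malloc(size_t n)
{
    uint32_t* p = (uint32_t*)malloc(n + sizeof(uint32_t));
    if (p == NULL) return NULL;
    *p = AUSSIE_MAGIC_ALLOC;
    return p + 1;  // caller sees only the bytes after the header
}

// Check the header word before releasing the block.
void aussie_debug_free(void* v)
{
    if (v == NULL) return;
    uint32_t* p = (uint32_t*)v - 1;
    if (*p == AUSSIE_MAGIC_FREED) {
        ++g_magic_errors;
        fprintf(stderr, "double free detected\n");
        return;
    }
    if (*p != AUSSIE_MAGIC_ALLOC) {
        ++g_magic_errors;
        fprintf(stderr, "header overwritten (memory stomp?)\n");
        return;
    }
    *p = AUSSIE_MAGIC_FREED;
    // Quarantine: keep the block so the freed marker stays readable;
    // a real library would recycle quarantined blocks later.
}
```

The quarantine is the key design choice: if the block were handed straight back to malloc, reading its header on a second free would itself be undefined behavior.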

I’ve actually done all of the above for a debug library in standard C++, which I’m now working on updating for CUDA C++. Make sure you check the Aussie AI website to see when it gets released.

CUDA C++ Runtime Wrapper Functions

You can use macros to intercept various CUDA Runtime API calls. For example, here’s a simple interception of cudaMalloc:

    // intercept cudaMalloc
    #undef cudaMalloc
    #define cudaMalloc aussie_cudaMalloc
    cudaError_t aussie_cudaMalloc(void** addr_of_v, int sz);

Once intercepted, the wrapper code can perform simple validation tests of the various parameters. Here’s a simple wrapper for the cudaMalloc function in a debug library for CUDA C++ that I’m working on:

    cudaError_t aussie_cudaMalloc_simple(void** addr_of_v, int sz)
    {
        // Debug wrapper version: cudaMalloc() 
        AUSSIE_DEBUGLIB_TRACE("cudaMalloc called");
        AUSSIE_DEBUG_PRINTF("%s: == ENTRY cudaMalloc === addr=%p, sz=%d\n", 
             __func__, addr_of_v, sz);

        g_aussie_cuda_malloc_count++;
        if (!addr_of_v) {  // null pointer...
            AUSSIE_CHECK(addr_of_v != NULL, "AUS006", "cudaMalloc null address");
            return cudaErrorInvalidValue;
        }
        AUSSIE_CHECK(sz != 0, "AUS007", "cudaMalloc size is zero");
        AUSSIE_CHECK(sz >= 0, "AUS008", "cudaMalloc size is negative");

        // Call the real cudaMalloc (this source file must not have the
        // intercept macro active, or you must #undef cudaMalloc first)
        void *new_v = NULL;
        cudaError_t err = cudaMalloc(&new_v, sz);
        if (err != cudaSuccess) {
                AUSSIE_ERROR("AUS200", "ERROR: cudaMalloc error return");
                // Try to keep going...
        } 
        *addr_of_v = new_v;   // Store it for return to caller
        AUSSIE_CHECK(new_v != NULL, "AUS009", "cudaMalloc allocation failure");
        return err;
    }

This actually has four levels of tests:

  • Validation of called parameter values.
  • Detection of CUDA runtime errors (pass-through).
  • Detection of memory allocation failure.
  • Built-in debug tracing macros that can be enabled.

A more advanced version could also attempt to check pointer addresses are valid and have not been previously freed, and a variety of other memory errors. Coming soon!
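
One of those more advanced checks, tracking the most recently freed addresses, can be sketched on the host side with a small ring buffer. A cudaFree wrapper would record each pointer it releases, and later wrappers would check new pointers against the list. These helper names are hypothetical:

```cpp
#include <cstdio>

// Ring buffer of the last few freed pointers (hypothetical helpers).
static const int AUSSIE_NFREED = 4;
static void* g_recently_freed[AUSSIE_NFREED] = { NULL };
static int g_freed_next = 0;
static int g_freed_errors = 0;

// Record a pointer at free time (e.g., from a cudaFree wrapper).
void aussie_note_freed(void* v)
{
    g_recently_freed[g_freed_next] = v;
    g_freed_next = (g_freed_next + 1) % AUSSIE_NFREED;
}

// Check a pointer before using it (e.g., from a cudaMemcpy wrapper).
bool aussie_check_not_freed(void* v, const char* what)
{
    for (int i = 0; i < AUSSIE_NFREED; i++) {
        if (v != NULL && g_recently_freed[i] == v) {
            ++g_freed_errors;
            fprintf(stderr, "%s: pointer was recently freed\n", what);
            return false;
        }
    }
    return true;
}
```

A hash table of every freed address catches more, but this fixed-size version costs only a few comparisons per call.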

Standard C++ Debug Wrapper Functions

In addition to wrapping the CUDA Runtime API calls, it can be helpful during debugging to wrap some standard C++ library function calls with your own versions, so as to add additional parameter validation and self-checking code. Some of the functions which you might consider wrapping include:

  • malloc
  • calloc
  • memset
  • memcpy
  • memcmp

If you’re doing string operations in your code, you might consider wrapping these:

  • strdup
  • strcmp
  • strcpy
  • sprintf

Note that you can wrap the C++ “new” and “delete” operators at the linker level by defining your own versions, but not as macro intercepts. You can also intercept the “new[]” and “delete[]” array allocation versions at link-time.

There are different approaches to consider when wrapping library functions, which we examine using memset as an example:

  • Leave “memset” calls in your code (auto-intercepts)
  • Use “memset_wrapper” in your code instead (manual intercepts)

Macro auto-intercepts: You might want to leave your code unchanged, still calling memset directly. To leave “memset” in your code, but have it automatically call “memset_wrapper”, you can use a macro intercept in a header file:

    #undef memset // ensure no prior definition
    #define memset memset_wrapper  // Intercept

Note that you can also use preprocessor macros to add context information to the debug wrapper functions. For example, you could add extra parameters to “memset_wrapper” such as:

    #define memset(x,y,z)  memset_wrapper((x),(y),(z),__FILE__,__LINE__,__func__)

Note that in the above version, the macro parameters must be parenthesized in the expansion, to guard against operator precedence surprises, such as a low-precedence C++ comma operator occurring in a passed-in expression. Also note that these context macros (e.g., __FILE__) aren’t necessary if you have a C++ stack trace library, such as std::stacktrace, on your platform.
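
A memset_wrapper matching the context-macro version above might look like this. The signature is my assumption, simply mirroring the extra macro arguments:

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical wrapper matching the macro:
//   #define memset(x,y,z) memset_wrapper((x),(y),(z),__FILE__,__LINE__,__func__)
// This source file must be compiled without that intercept macro active,
// so that the ::memset call below reaches the real function.
void* memset_wrapper(void* dest, int val, size_t sz,
                     const char* file, int line, const char* func)
{
    if (dest == NULL) {
        fprintf(stderr, "%s:%d (%s): memset null pointer\n", file, line, func);
        return NULL;
    }
    if (sz == 0) {
        fprintf(stderr, "%s:%d (%s): memset zero size\n", file, line, func);
        return dest;
    }
    return ::memset(dest, val, sz);  // call the real memset
}
```

Any diagnostic it prints now pinpoints the offending call site by file, line, and function.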

Variadic preprocessor macros: Note also that there is varargs support in C++ #define macros. If you want to track variable-argument functions like sprintf, printf, or fprintf, or other C++ overloaded functions, you can use “...” and “__VA_ARGS__” in preprocessor macros as follows.

    #define sprintf(buf,...)  sprintf_wrapper((buf),__FILE__,__LINE__,__func__, __VA_ARGS__ )
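
Behind that macro you’d need a variadic wrapper function that forwards to vsprintf. Here’s a hypothetical sketch; note that sprintf’s first argument is the destination buffer, so that’s what the macro’s first parameter captures, and the format string arrives via the variable arguments:

```cpp
#include <cstdarg>
#include <cstdio>

// Hypothetical wrapper: the extra context parameters come from the macro.
int sprintf_wrapper(char* buf, const char* file, int line, const char* func,
                    const char* fmt, ...)
{
    if (buf == NULL || fmt == NULL) {
        fprintf(stderr, "%s:%d (%s): sprintf null argument\n", file, line, func);
        return -1;
    }
    va_list args;
    va_start(args, fmt);
    int n = vsprintf(buf, fmt, args);  // forward to the real formatting engine
    va_end(args);
    return n;
}
```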

Manual Wrapping: Alternatively, you might want to individually change the calls to memset to call memset_wrapper, without hiding it behind a macro. If you’d rather control at each call site whether or not the wrapper is used, then you can use both forms in the program, wrapped and non-wrapped. Or if you want them all changed, but want the intercept to be less hidden (e.g., later during code maintenance), then you might consider adding a helpful reminder instead:

    #undef memset
    #define memset dont_use_memset_please

This trick will give you a compilation error at every call to memset that hasn’t been changed to memset_wrapper.

Example: memset Wrapper Self-Checks

Here’s an example of what you can do in a wrapper function called “memset_wrapper” from one of the Aussie AI projects:

    void *memset_wrapper(void *dest, int val, int sz)  // Wrap memset
    {
        if (dest == NULL) {
                aussie_assert2(dest != NULL, "memset null dest");
                return NULL;
        }
        if (sz < 0) {
                // Why we have "int sz" not "size_t sz" above
                aussie_assert2(sz >= 0, "memset size negative");
                return dest;  // fail
        }
        if (sz == 0) {
                aussie_assert2(sz != 0, "memset zero size (reorder params?)");
                return dest;
        }
        if (sz <= (int)sizeof(void*)) {
                // Suspiciously small size (cast avoids signed/unsigned mismatch)
                aussie_assert2(sz > (int)sizeof(void*), "memset with sizeof array parameter?");
                // Allow it, keep going
        }
        if (val >= 256) {
                aussie_assert2(val < 256, "memset value not char");
                return dest; // fail
        }
        void* sret = ::memset(dest, val, sz);  // Call real one!
        return sret;
    }
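
To see the zero-size check in action, here’s a compact, self-contained demonstration with a minimal stand-in for aussie_assert2 (the real library’s version does more) and a trimmed wrapper containing just that one check:

```cpp
#include <cstdio>
#include <cstring>

static int g_check_failures = 0;  // count of failed self-checks

// Minimal stand-in for aussie_assert2(): report and keep going.
static void aussie_assert2(bool cond, const char* mesg)
{
    if (!cond) {
        ++g_check_failures;
        fprintf(stderr, "CHECK FAILED: %s\n", mesg);
    }
}

// Trimmed wrapper: only the zero-size check is shown here.
void* memset_wrapper(void* dest, int val, int sz)
{
    aussie_assert2(sz != 0, "memset zero size (reorder params?)");
    if (sz <= 0) return dest;
    return ::memset(dest, val, (size_t)sz);
}
```

A call like memset_wrapper(buf, sizeof(buf), 0), with the value and size arguments accidentally swapped, now reports itself instead of silently doing nothing.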

It’s a judgement call whether or not to leave the debug wrappers in place: a trade-off of speed versus safety. Do you prefer sprinting to make your flight, or arriving two hours early? Here’s one way to remove the wrapper functions completely with the preprocessor, if you’ve been manually changing them to the wrapper names:

    #if DEBUG
        // Debug mode, leave wrappers..
    #else // Production (remove them all)
        #define memset_wrapper memset
        //... others
    #endif

Compile-Time Self-Testing Macro Wrappers

Here’s an idea for combining the runtime debug wrapper function idea with some additional compile-time tests using static_assert.

    #define memset_wrapper(addr,ch,n) ( \
        static_assert(n != 0), \
        static_assert(ch == 0), \
        memset_wrapper((addr),(ch),(n),__FILE__,__LINE__,__func__))

The idea is interesting, but it doesn’t really work, for two reasons. First, static_assert is a declaration, not an expression, so it cannot legally appear inside a parenthesized comma expression at all. Second, even if it could, not all calls to the memset wrapper will have constant arguments for the character or the number of bytes, and the static_assert conditions would fail to compile in those cases. You could use standard assertions instead, but this adds runtime cost. Note that it’s a self-referential macro, but C++ guarantees that a macro name is not re-expanded inside its own expansion (i.e., there’s no infinite recursion of preprocessor macros).
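
The runtime-assertion alternative can still be zero-cost in production, because assert compiles away when NDEBUG is defined. Here’s a sketch; memset_checked is a hypothetical name, and unlike static_assert, assert expands to an expression, so it can legally sit inside the comma expression:

```cpp
#include <cassert>
#include <cstring>

// Runtime version of the compile-time idea: the checks vanish under NDEBUG,
// so release builds pay nothing.
#define memset_checked(addr, ch, n) \
    ( assert((n) != 0 && "memset size is zero"), \
      assert((ch) >= 0 && (ch) < 256 && "memset value not a char"), \
      memset((addr), (ch), (n)) )
```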

Generalized Self-Testing Debug Wrappers

The technique of debug wrappers can be extended to offer a variety of self-testing and debug capabilities. The types of messages that can be emitted by debug wrappers include:

  • Input parameter validation failures (e.g., non-null)
  • Failure returns (e.g., allocation failures)
  • Common error usages
  • Informational tracing messages
  • Statistical tracking (e.g., call counts)
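
The statistical-tracking idea is easy to bolt onto any wrapper: bump a per-function counter on every call, and dump the table at program exit to spot hot or suspicious call patterns. A hypothetical sketch:

```cpp
#include <cstring>
#include <string>
#include <unordered_map>

// Hypothetical statistics layer: every wrapper bumps a per-function call count.
static std::unordered_map<std::string, long> g_call_counts;

void aussie_count_call(const char* fname) { ++g_call_counts[fname]; }

// Example: a memset wrapper that records usage statistics.
void* memset_wrapper(void* dest, int val, size_t sz)
{
    aussie_count_call("memset");
    return ::memset(dest, val, sz);
}

// Query helper (e.g., for dumping all counts at exit).
long aussie_get_count(const char* fname)
{
    auto it = g_call_counts.find(fname);
    return (it == g_call_counts.end()) ? 0 : it->second;
}
```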

Personally, I’ve built some quite extensive debug wrapping layers over the years. It always surprises me that this is still beneficial, because it would be easier if it were all done by the compiler vendors’ standard libraries. The level of built-in debugging checks has been increasing significantly (e.g., in GCC), but I still find value in adding my own wrappers.

There are several major areas where you can really self-check for a lot of problems with runtime debug wrappers:

  • File operations
  • Memory allocation
  • String operations

These are left as an exercise for the reader!

Link-Time Interception: new and delete

Macro interception works for CUDA APIs like cudaMalloc, and for standard C++ functions like malloc and free, but you can’t macro-intercept the new and delete operators, because they don’t use function-like syntax. Fortunately, you can use link-time interception of these operators instead, simply by defining your own versions.

Note that defining class-level versions of the new and delete operators is a well-known optimization, but this isn’t what we’re doing here. Instead, this link-time interception requires defining four operators at global scope:

  • new
  • new[]
  • delete
  • delete[]

Note that you cannot use the real new and delete inside these link-time wrappers. They would get intercepted again, and you’d have infinite stack recursion. However, you can call malloc and free instead, assuming they aren’t also macro-intercepted. Here are the simplest versions:

    void * operator new(size_t n)
    {
        return malloc(n);        
    }

    void* operator new[](size_t n)
    {
        return malloc(n);        
    }

    void operator delete(void* v)
    {
        free(v);
    }

    void operator delete[](void* v)
    {
        free(v);
    }

This method of link-time interception has been an officially sanctioned standard C++ language feature since the 1990s (these operators are “replaceable functions”). Be careful, though, that the return types and parameter types are precise, using size_t and void*, as you cannot use int or char*. Strictly speaking, a replacement operator new should also throw std::bad_alloc on allocation failure rather than return null, which the minimal sketch above glosses over. Also, declaring these functions as inline gets a compilation warning, and is presumably ignored by the nvcc compiler, since this technique requires link-time interception.

This code runs fine with nvcc compilation, but the above example is not much of a debugging wrapper, more like just a “wrapper,” because it does no error checking. However, when I started adding more self-tests, I triggered warnings about “calling a __host__ function from a __host__ __device__ function.” It seems that nvcc compiles these functions as both host and device code, which makes sense.

Unfortunately, when I tried to work around this by declaring the operators as host-only versions using the __host__ specifier, it triggered compilation errors. Declaring two versions, one with __host__ and the other with __device__, also didn’t work (nor did __global__). Maybe there’s a workaround possible by putting these operator definitions into standard C++ code that’s only processed by gcc, not by nvcc, and then link it in.

In the absence of a solution for now, this means that we’re limited to using the subset of C++ that can run on the device inside these link-time interceptions. Hence, there are significant problems trying to generalize this into a useful debugging wrapper library, because any use of host-specific aspects such as global variables triggers compilation errors.
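
The separate-translation-unit workaround mentioned above might look like this: a plain C++ file compiled only by the host compiler (e.g., gcc/g++) and linked with the nvcc-produced objects, so that host-only features like global counters are safe to use. This is a sketch under that assumption, not a tested nvcc build recipe:

```cpp
// host_new_wrappers.cpp -- compile with the host C++ compiler only,
// then link with the nvcc-compiled object files.
#include <cstdio>
#include <cstdlib>
#include <new>

static long g_new_count = 0;  // host-only global: safe here, outside nvcc's view

void* operator new(std::size_t n)
{
    ++g_new_count;
    void* v = malloc(n ? n : 1);  // malloc(0) may return NULL; avoid that
    if (v == NULL) throw std::bad_alloc();  // standard contract for operator new
    return v;
}

void operator delete(void* v) noexcept
{
    free(v);  // matches the malloc in operator new
}

// By default, the unreplaced operator new[] forwards to operator new,
// and sized/array deletes forward to this operator delete.

long aussie_host_new_count() { return g_new_count; }  // hypothetical query helper
```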

Here’s an example of some basic possible checks, with printf output:

    #define AUSSIE_ERROR(mesg, ...) \
        ( printf((mesg) __VA_OPT__(,) __VA_ARGS__ ) )

    void * operator new(size_t n)
    {
        if (n == 0) {
            AUSSIE_ERROR("new operator size is zero\n");
        }
        void *v = malloc(n);        
        if (v == NULL) {
            AUSSIE_ERROR("new operator: allocation failure\n");
        }        
        return v;
    }

Note that you can’t use __FILE__ or __LINE__ as these are link-time intercepts, not macros. Maybe you could use std::stacktrace instead, but I have my doubts.

References

Note that Aussie AI has an active project for a CUDA C++ debug wrapper library with support for intercepting a wide range of CUDA C++ functions. Find more information at https://www.aussieai.com/cuda/projects.

 
