Aussie AI

Chapter 16. CUDA Debug Tracing

Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"

by David Spuler

Debug Tracing Messages

Ah, yes, worship the mighty printf!

A common debugging method is adding debug trace output statements to a program to print out important information at various points in the program. Judicious use of these printf statements can be highly effective in localizing the cause of an error, but this method can also lead to huge volumes of not particularly useful information. One desirable feature of this method is that the output statements can be selectively enabled at either compile-time or run-time.

Debug tracing messages are informational messages that you only enable during debugging. These are useful to software developers to track where the program is executing, and what data it is processing. The simplest version of this idea looks like:

    #if DEBUG
    printf("DEBUG: I am here!\n");
    #endif

A better solution is to code some BYO debug tracing macros. Here’s a macro version:

   #define aussie_debug(str)  ( printf("DEBUG: %s\n", (str)) )
   ... 
   aussie_debug<("I am here!");

Device output limits. The output from printf on the GPU is limited to a buffer size. Firstly, this means that GPU trace output may not appear immediately. Secondly, it means that some output can get lost.

If the CPU does not synchronize often enough, or the GPU emits far too much output, then tracing messages will overflow the circular buffer and overwrite the first output. Hence, the earlier trace messages will be lost forever. Consider carefully the volume of output needed when debug tracing, and also add more frequent host calls to synchronize with the device.

You can also check or change the size of the circular buffer by calling cudaDeviceGetLimit or cudaDeviceSetLimit using the property cudaLimitPrintfFifoSize. The GPU and CPU clear the output buffer in some behind-the-scenes magic whenever they synchronize. Remember that the output is only stored in a buffer on the GPU, and is actually coming to the screen from the CPU!

Output to stderr. Note that we could use fprintf to stderr if we were sure it wasn’t needed to run on the device, which only supports printf. And here’s the C++ stream version, which also won’t work in device code:

   #define aussie_debug(str)  ( std::cerr << str << std::endl )
   ... 
   aussie_debug("DEBUG: I am here!");

In order to only show these when debug mode is enabled in the code, our header file looks like this:

   #if DEBUG
       #define aussie_debug(str)  ( std::cerr << str << std::endl )
   #else
       #define aussie_debug(str) // nothing
   #endif

Missing Semicolon Bug: Professional programmers prefer to use “0” rather than emptiness to remove the debug code when removing it from the production version. It is also good to typecast it to “void” type so it cannot accidentally be used as the number “0” in expressions. Hence, we get this improved version for removing a debug macro:

    #define aussie_debug(str) ((void)0)  // better!

It’s not just a stylistic preference. The reason is that the “nothing” version can introduce an insidious bug if you forget a semicolon after the debug trace call in an if statement:

   if (something) aussie_debug("Hello world") // Missing semicolon
   x++;

If the “nothing” macro expansion is used, then the missing semicolon leads to this code:

   if (something) // nothing
   x++;

Can you see why it’s a bug? Instead, if the expansion is “((void)0)” then this missing semicolon typo will get a compilation error.

Variable-Argument Debug Macros

A neater solution is to use varargs preprocessor macros with the special tokens “...” and “__VA_ARGS__”, which are standard in C and C++ (since 1999):

    #define aussie_debug(fmt,...)  printf((fmt), __VA_ARGS__ )
    ...
    aussie_debug("DEBUG: I am here!\n");

That’s not especially helpful, so we can add more context:

    // Version with file/line/function context
    #define aussie_debug(fmt,...)  \
        ( printf("DEBUG [%s:%d:%s]: ", \
            __FILE__, __LINE__, __func__ ), \
        printf((fmt), __VA_ARGS__ ))
   ...
   aussie_debug("I am here!\n");

This will report the source code filename, line number, and function name. Note the use of the comma operator between the two printf statements (whereas a semicolon would be a macro bug). Also required are parentheses around the whole thing, and around each use of the “fmt” parameter.

Here’s a final example that also detects if you forgot a newline in your format string (how kind!):

    // Version with newline optional
    #define aussie_debug(fmt,...)  \
        (printf("DEBUG [%s:%d:%s]: ", \
          __FILE__, __LINE__, __func__ ), \
        printf((fmt), __VA_ARGS__ ), \
        (strchr((fmt), '\n') != NULL \
                || printf("\n")))
   ...
   aussie_debug("I am here!");  // Newline optional

Dynamic Debug Tracing Flag

Instead of using “#if DEBUG”, it can be desirable to have the debug tracing dynamically controlled at runtime. This allows you to turn it on and off without a rebuild, such as via a command-line argument or inside a cuda-gdb session. And you can decide whether or not you want to ship it to production with the tracing available to be used. Your phone support staff would like to have an action to offer customers rather than “turn it off and on.”

This idea of dynamic control of tracing can be controlled by a single Boolean flag:

   extern bool g_aussie_debug_enabled;

We can add some macros to control it:

    #define aussie_debug_off()  ( g_aussie_debug_enabled = false )
    #define aussie_debug_on()  ( g_aussie_debug_enabled = true )

And then the basic debug tracing macros simply need to check it:

    #define aussie_dbg(fmt,...)  ( g_aussie_debug_enabled && \
        printf((fmt), __VA_ARGS__ ))

So, this adds some runtime cost of testing a global flag every time this line of code is executed.

Here’s the version with file, line, and function context:

    #define aussie_dbg(fmt,...)  \
        ( g_aussie_debug_enabled && \
        ( printf("DEBUG [%s:%d:%s]: ", \
                __FILE__, __LINE__, __func__ ), \
        printf((fmt), __VA_ARGS__ )))

And here’s the courtesy newline-optional version:

    #define aussie_dbg(fmt,...)  \
        ( g_aussie_debug_enabled && \
        (printf("DEBUG [%s:%d:%s]: ", \
          __FILE__, __LINE__, __func__ ), \
        printf((fmt), __VA_ARGS__ ), \
        (strchr((fmt), '\n') != NULL \
            || printf("\n"))))

Device Code Dynamic Debugging

That all sounds great, except when you realize that device code can’t just create a global flag. Accesses to the “g_aussie_debug_enabled” global variable inside a kernel are a compile error.

To use this idea in device code, you would have to do this:

    __device__ bool g_aussie_debug_enabled;

Whenever this variable is accessed by device code in a debug macro, it triggers a global memory access, which is a very expensive access. An alternative would be to use __constant__ to have the value in the “constant cache,” which should be faster, but it’s still slower than kernel local variables.

Furthermore, the mechanics of enabling or disabling this debug flag’s value on the device based on a command-line argument in the host code are quite difficult. The host code can’t just set the global variable on the device.

The performance cost of either __device__ or __constant__ may not be worth the value from the extra tracing flexibility. Another way that’s fast, but requires code changes, is to pass a debug flag around as a parameter to kernel functions.

Alternatively, the simpler debug trace methods with #if can be used. Similarly, you could use these basic C++ constant styles:

    #define g_aussie_debug_enabled true
    const bool g_aussie_debug_enabled = true;

However, these methods now require a re-compile to change, so we haven’t achieved the “dynamic debug tracing” that we wanted!

Multi-Statement Debug Trace Macro

An alternative method of using debugging statements is to use a special macro that allows any arbitrary statements. For example, debugging output statements can be written as:

    DBG( printf("DEBUG: Entered function print_list\n"); )

Or using C++ iostream output style:

    DBG( std::cerr << "DEBUG: Entered function print_list\n"; )

This allows use of multiple statements of debugging, with self-testing code coded as:

    DBG( count++; )
    DBG( if (count != count_elements(table)) { )
    DBG(     aussie_internal_error("ERROR: Element count wrong"); )
    DBG( } )

But it’s actually easier to add multiple lines of code or a whole block in many cases. An alternative use of DBG with multiple statements is valid, provided that the enclosed statements do not include any comma tokens (unless they are nested inside matching brackets). The presence of a comma would separate the tokens into two or more macro arguments for the preprocessor, and the DBG macro above requires only one parameter:

    DBG(
        count++; 
        if (count != count_elements(table)) { // self-test
            aussie_internal_error("ERROR: Element count wrong"); // error
        }
    )

The multi-statement DBG macro is declared in a header file as:

    #if DEBUG
    #define DBG(token_list) token_list  // Risky
    #else
    #define DBG(token_list) // nothing
    #endif

The above version of DBG is actually non-optimal for the macro error reasons already examined. A safer idea is to add surrounding braces and the “do-while(0)” trick to the DBG macro:

    #if DEBUG
    #define DBG(token_list) do { token_list } while(0)   // Safer
    #else
    #define DBG(token_list)   ((void)0)
    #endif

Note that this now requires a semicolon after every expansion of the DBG macro, whereas the earlier definition did not:

   DBG( std::cerr << "Value of i is " << i << "\n"; );

Whenever debugging is enabled, the statements inside the DBG argument are activated, but when debugging is disabled they disappear completely. Thus, this method offers a very simple method of removing debugging code from the production version of a program, if you like that kind of thing.

This DBG macro may be considered poor style since it does not mimic any usual syntax. However, it is a neat and general method of introducing debugging statements, and is not limited to output statements.

Yet another alternative style is to declare the DBG macro so that it follows this statement block structure:

    DBG {
        // debug statements
    }

Refer to the implementation of a block “SELFTEST” macro in the prior chapter for details on how to do this.

Multiple Levels of Debug Tracing

Once you’ve used these debug methods for a while, you start to see that you get too much output. For a while, you’re just commenting and uncommenting calls to the debug routines. A more sustainable solution in a large project is to add numeric levels of tracing, where a higher number gets more verbose.

To make this work well, we declare both a Boolean overall flag and a numeric level:

    extern bool g_aussie_debug_enabled;
    extern int g_aussie_debug_level;

As for running this in device code, the same provisos about global memory access on the GPU apply, except doubly so. This method is probably more likely to be considered for host code, and general application code running on the CPU.

Here’s the macros to enable and disable the basic level:

    #define aussie_debug_off()  ( \
        g_aussie_debug_enabled = false, \
        g_aussie_debug_level = 0)

    #define aussie_debug_on()  ( \
        g_aussie_debug_enabled = true, \
        g_aussie_debug_level = 1 )

And here’s the new macro that sets a numeric level of debug tracing (higher number means more verbose):

    #define aussie_debug_set_level(lvl)  ( \
        g_aussie_debug_enabled = (((lvl) != 0)), \
        g_aussie_debug_level = (lvl) )

Here’s what a basic debug macro looks like:

    #define aussie_dbglevel(lvl,fmt,...)  ( \
        g_aussie_debug_enabled && \
        (lvl) <= g_aussie_debug_level && \
        printf((fmt), __VA_ARGS__ ))
    ...
    aussie_dbglevel(1, "Hello world");
    aussie_dbglevel(2, "More details");

Now we see the reason for having two global variables. In non-debug mode, the only cost is a single Boolean flag test, rather than a more costly integer “<” operation.

And for convenience we might add multiple macro name versions for different levels:

    #define aussie_dbglevel1(fmt) (aussie_debuglevel(1, (fmt)))
    #define aussie_dbglevel2(fmt) (aussie_debuglevel(2, (fmt)))
    ...
    aussie_dbglevel1("Hello world");
    aussie_dbglevel2("More details");

Device debug levels. As with the simpler debug flag earlier, controlling an integer setting for a dynamic debug level is difficult in device code. Options include:

    __device__ int g_aussie_debug_level = 3;
    __constant__ int g_aussie_debug_level = 3;
    const int g_aussie_debug_level = 3;
    #define g_aussie_debug_level 3

All of the above options are either relatively inefficient, or require a re-compile anyway.

A workable solution is passing a debug level as a parameter to all kernel launches from the host, and between device function calls within the device code. The host code can get the debug level (e.g., from a command-line argument), and pass its debug setting to the device via kernel launches. This is a relatively efficient way to achieve a dynamic level of debug tracing. In this way, both you and your customers could run your application with different tracing levels, but without needing a different binary for each debug level.

Very volatile. Note that if you are altering debug tracing levels inside a symbolic debugger (e.g., cuda-gdb) or IDE debugger, you might want to consider declaring the global level variables with the “volatile” qualifier. This applies in this situation because their values can be changed (by you!) in a dynamic way that the optimizer cannot predict. On the other hand, you can skip this, as this issue won’t affect production usage, and only rarely impacts your own interactive debugging usage.

BYO debug printf: All of the above examples are quite fast in execution, but heavy in space usage. They will be adding a fair amount of executable code for each “aussie_debug” statement. I’m not sure that I really should care that much about the code size, but anyway, we could fix it easily by declaring our own variable-argument debug printf-like function.

Advanced Debug Tracing

The above ideas are far from being the end of the options for debug tracing. The finesses to using debug tracing messages include:

Environment variable to enable debug messages.
Command-line argument to enable them (and set the level).
Configuration settings (e.g., changeable inside the GUI, or in a config file).
Add unit tests running in trace mode (because sometimes debug tracing crashes!).
Extend to multiple sets or named classes of debug messages, not just numeric levels, so you can trace different aspects of execution dynamically.

Supportability Tip: Think about customers and debug tracing messages: are there times when you want users to enable them? Usually, the answer is yes. Whenever a user has submitted an error report, you’d like the user to submit a run of the program with tracing enabled to help with reproducibility. Hence, consider what you want to tell customers about enabling tracing (if anything). Similarly, debug tracing messages could be useful to phone support staff in various ways to diagnose or resolve customer problems. Consider how a phone support person might help a customer to enable these messages.

• Online: Table of Contents

• PDF: Free PDF book download

• Buy: CUDA C++ Debugging: Safer GPU Kernel Programming