Aussie AI

Chapter 13. Self-Testing Code

  • Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
  • by David Spuler

What is Self-Testing Code?

Instead of doing work yourself, get a machine to do it. Who would have ever thought of that?

Getting your code to test itself means you can go get a cup of coffee and still be billing hours. The basic techniques include:

  • Unit tests
  • Regression tests
  • Error checking
  • Assertions
  • Self-testing code blocks
  • Debug wrapper functions

The simplest of these techniques is unit tests, which aim to build quality brick-by-brick from the bottom of the code hierarchy. At the larger end, you can run full regression test suites or add substantial self-testing code blocks.
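
For example, a unit test is usually just a small function with hard-coded inputs and assertions on the outputs. Here's a minimal sketch, assuming the aussie_vector_sum helper that appears later in this chapter (the test function name is hypothetical):

    #include <assert.h>

    void aussie_unit_test_vector_sum()
    {
        float v[4] = { 1.0f, 2.0f, 3.0f, -6.0f };
        assert(aussie_vector_sum(v, 4) == 0.0f);   // Sums exactly to zero
        float v2[3] = { 1.0f, 1.0f, 1.0f };
        assert(aussie_vector_sum(v2, 3) == 3.0f);  // Simple positive case
    }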

Self-Testing Code Block

Sometimes an assertion, unit test, or debug tracing printout is too small to check everything. Then you have to write a bigger chunk of self-testing code. The traditional way to do this is to wrap the block in a preprocessor conditional:

    #if DEBUG
       ... // block of test code
    #endif

Another reason to distinguish self-testing code blocks from assertions is that you've probably decided to leave the simpler assertions enabled in production code. A simple test like this is probably fine for production:

    assert(ptr != NULL);  // Fast

But an assertion that does a larger amount of arithmetic is probably not something you want in production:

    assert(aussie_vector_sum(v, n) == 0.0);  // Slow

So, you probably want to have macros and preprocessor settings for both production and debug-only assertions and self-testing code blocks. The simple way looks like this:

    #if DEBUG
    assert(aussie_vector_sum(v, n) == 0.0);
    #endif

Or you could have your own debug-only assertion macro that is skipped in production mode:

    assert_debug(aussie_vector_sum(v, n) == 0.0);

The definition of “assert_debug” then looks like this in the header file:

    #if DEBUG
    #define assert_debug(cond) assert(cond)  // Debug mode
    #else 
    #define assert_debug(cond)  // nothing in production
    #endif

This makes the “assert_debug” macro behave as a normal assertion in debug mode, but the whole expression disappears entirely in a production build. The above example assumes a separate set of build flags for a production build.
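
The DEBUG symbol itself usually comes from the build settings (e.g., a -DDEBUG=1 compiler option for debug builds). As a defensive sketch, the header can give it a default so that a build with no flag at all falls back to production mode:

    #ifndef DEBUG
    #define DEBUG 0   // Default to production if the build didn't define it
    #endif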

Self-test Code Block Macro

An alternative formulation is a block-style macro for installing self-testing code, rather than a function-like macro:

    SELFTEST {
        // block of debug or self-test statements
    }

The definition of the SELFTEST macro looks like:

    #if DEBUG
    #define SELFTEST // nothing (enables!)
    #else
    #define SELFTEST if(1) {} else // disabled
    #endif

This method relies on the C++ optimizer to clean up the non-debug version: since “if(1)” is always true, the else clause is unreachable, so the optimizer removes the block of self-testing code that can never execute.

Note also that SELFTEST is not function-like, so we don’t have the “forgotten semicolon” risk when SELFTEST expands to nothing. In fact, the empty expansion is the case where the SELFTEST code is enabled, which is the opposite of that earlier problem. Furthermore, the “do-while(0)” trick cannot be used with this block-style syntax.
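
Here's a sketch of what a concrete SELFTEST block might look like inside a function, again assuming the aussie_vector_sum helper:

    SELFTEST {
        // Debug builds only: check that the vector invariant still holds
        float sum = aussie_vector_sum(v, n);
        if (sum != 0.0f) {
            printf("SELFTEST FAILED: %s: vector sum %f != 0\n", __func__, sum);
        }
    }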

Self-Test Block Macro with Debug Flags

A compile-time on/off decision about self-testing code is not the most flexible method. The block version of SELFTEST can also have levels or debug flag areas. One natural extension is to implement a “flags” idiom for areas, allowing configuration of which areas of self-testing code are executed for a particular run (e.g., a decoding algorithm flag, a normalization flag, a MatMul flag, etc.). One Boolean flag is set for each debugging area, and it controls whether the self-testing code in that module is enabled.

A macro definition of SELFTEST(flagarea) can be hooked into the run-time configuration library for debugging output. In this way, it has both a compile-out setting (DEBUG==0) and dynamic runtime “areas” for self-testing code. Here’s the definition of the self-testing code areas:

    enum selftest_areas {
        SELFTEST_NORMALIZATION,
        SELFTEST_MATMUL,
        SELFTEST_SOFTMAX,
        // ... more
    };

A use of the SELFTEST method with areas looks like:

    SELFTEST(SELFTEST_NORMALIZATION) {
        // ... self-test code
    }

The SELFTEST macro definition with area flags looks like:

    
    extern bool g_aussie_debug_enabled;  // Global override
    extern bool DEBUG_FLAGS[100];     // Area flags

    #if DEBUG
        #define SELFTEST(flagarea) \
            if(g_aussie_debug_enabled == 0 || DEBUG_FLAGS[flagarea] == 0) \
            { /* do nothing */ } else
    #else
    #define SELFTEST(flagarea) if(1) {} else // disabled completely
    #endif

This uses a “debug flags” array idea, similar to that used for the debugging output commands, rather than a single “level” of debugging. Naturally, a better implementation would allow separation of the areas for debug trace output and self-testing code, with two different sets of levels/flags, but this is left as an extension for the reader.
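
The global flag variables themselves, plus a small helper to toggle one area at run-time, might look like the following sketch (the helper name is hypothetical):

    bool g_aussie_debug_enabled = true;   // Global on/off override for all self-tests
    bool DEBUG_FLAGS[100] = { false };    // Per-area flags, all off by default

    // Hypothetical helper to enable or disable one self-test area at run-time
    void aussie_selftest_set_area(enum selftest_areas area, bool enabled)
    {
        DEBUG_FLAGS[area] = enabled;
    }

    // Example: enable only the normalization self-tests for this run
    // aussie_selftest_set_area(SELFTEST_NORMALIZATION, true);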

Debug Stacktrace

There are various situations where it can be useful to have a programmatic method for reporting the “stack trace” or “backtrace” of the function call stack in C++. Some examples include:

  • Your assertion macro can report the full stack trace on failure.
  • Self-testing code similarly can report the location.
  • Debug wrapper functions too.
  • Writing your own memory allocation tracker library.

C++ gained standard stack trace capabilities with C++23, via the “std::stacktrace” facility. For example, you can print the current stack like this:

    std::cout << "Stacktrace: " << std::stacktrace::current() << std::endl;

The C++23 stacktrace library is already supported by GCC, and early support in MSVC is available via the “/std:c++latest” compiler flag. There are also two longstanding implementations of stack trace capabilities: the glibc backtrace functions and Boost.Stacktrace. The C++23 standardized version is based on the Boost library.
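
For example, a debug assertion macro that dumps the call stack on failure might look like this sketch (C++23 only; the macro name is hypothetical, and GCC may need an extra stacktrace support library linked, depending on the version):

    #include <cstdlib>
    #include <iostream>
    #include <stacktrace>   // C++23

    // Hypothetical assertion that prints the full stack trace on failure
    #define assert_trace(cond) \
        do { \
            if (!(cond)) { \
                std::cerr << "ASSERT FAILED: " << #cond << "\n" \
                          << std::stacktrace::current() << std::endl; \
                std::abort(); \
            } \
        } while (0)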

Unified Address Self-Testing

Pointers are doubly complicated in CUDA C++, because they can be host or device pointers, not to mention shared or constant memory. Adding some self-testing code can be beneficial to quality, and you can use the cudaPointerGetAttributes function to query information about any address.

Here’s an example utility function:

    bool aussie_is_device_pointer(void *ptr)
    {
        cudaPointerAttributes attrib;
        cudaError_t err = cudaPointerGetAttributes(&attrib, ptr);
        if ( err != cudaSuccess) {
            printf("ERROR: %s: cudaPointerGetAttributes failed: %p\n", __func__, ptr);
            return false;
        }
        if (attrib.type /*memoryType*/ == cudaMemoryTypeDevice) {
            return true;  // Device pointer
        }
        else if (attrib.type /*memoryType*/ == cudaMemoryTypeHost) {
            return false;   // Host pointer
        }
        printf("ERROR: %s: cudaPointerGetAttributes neither device nor host: %p\n", __func__, ptr);
        return false;
    }

A few points about this idea:

  • This runs in host code.
  • Host pointer address detection is not as simple (discussed below).
  • The documentation says the structure field name is “memoryType” but I had to use “type” instead (after scouring the header file).

Host pointer issues. Defining an aussie_is_host_pointer function would seem to be just a matter of reversing the two return values for cudaMemoryTypeDevice and cudaMemoryTypeHost, but that doesn’t work as well as you might think. The cudaMemoryTypeHost type only applies to host pointers known to Unified Addressing (e.g., pinned host memory allocated with cudaMallocHost), so it fails for ordinary malloc or new pointers on the host. These basic addresses get a different “type” setting, cudaMemoryTypeUnregistered, with value 0, whereas cudaMemoryTypeHost is 1 and cudaMemoryTypeDevice is 2.
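
One workaround, shown here as a sketch of one possible policy, is to treat both cudaMemoryTypeHost and cudaMemoryTypeUnregistered as host pointers:

    bool aussie_is_host_pointer(void *ptr)
    {
        cudaPointerAttributes attrib;
        cudaError_t err = cudaPointerGetAttributes(&attrib, ptr);
        if (err != cudaSuccess) {
            printf("ERROR: %s: cudaPointerGetAttributes failed: %p\n", __func__, ptr);
            return false;
        }
        if (attrib.type == cudaMemoryTypeHost)
            return true;   // Pinned/registered host memory
        if (attrib.type == cudaMemoryTypeUnregistered)
            return true;   // Ordinary malloc/new host memory
        return false;      // Device or managed memory
    }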

Various extensions of this idea are possible. For example, you can also get the device details if it is a device pointer, and other details for host pointers. Unfortunately, I’m not aware of any way to get more detailed information, such as whether it’s a cudaMalloc block, and its allocated byte size. But I can dream.
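
As a usage sketch, the device-pointer check can be combined with the earlier assert_debug macro to guard a kernel launch (the kernel and variable names here are hypothetical):

    float *d_v = NULL;
    cudaMalloc(&d_v, n * sizeof(float));          // Device allocation
    assert_debug(aussie_is_device_pointer(d_v));  // Should pass for a device pointer
    my_kernel<<<blocks, threads>>>(d_v, n);       // Hypothetical kernel launch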

Kernel Launch Self-Testing

Calculations of grid dimensions such as block counts and block sizes can be error-prone. One idea is to add some self-testing code to auto-validate the calculations. This may be particularly beneficial for novice CUDA acolytes, but less so for experienced programmers.

Here’s an example of the types of self-tests that are possible for a one-dimensional vector kernel:

    #define AUSSIE_ERROR(mesg) printf("ERROR: %s: %s\n", __func__, (mesg))

    bool aussie_check_kernel_dimensions_1D(int blocks, int threads, int n)
    {
        if (n == 0) {
            AUSSIE_ERROR("N is zero");
            return false;  // fail
        }
        if (n < 0) {
            AUSSIE_ERROR("N is negative");
            return false;  // fail
        }
        if (blocks == 0) {
            AUSSIE_ERROR("Zero block count");
            return false;  // fail
        }
        if (blocks < 0) {
            AUSSIE_ERROR("Negative block count");
            return false;  // fail
        }        
        if (threads == 0) {
            AUSSIE_ERROR("Zero thread count");
            return false;  // fail
        }
        if (threads < 0) {
            AUSSIE_ERROR("Negative thread count");
            return false;  // fail
        }        
        if (blocks == 1 && threads == 1) {
            AUSSIE_ERROR("WARN: Sequential execution: blocks=1, threads=1");
            // It's allowed (for beginners), drop down...
        }
        if (threads > 1024) {
            AUSSIE_ERROR("Thread count more than 1024 maximum");
            return false;  // fail
        }
        if (threads % 32 != 0) {
            AUSSIE_ERROR("WARN: Thread count is not a multiple of 32 (warp size)");
            // Allow for beginners ... drop down to keep going
        }

        // Note: Some total thread count checks assume 1 op per thread (i.e., no loops)
        // ... so this is really mainly for educational use in checking of simple kernels.

        int num = blocks * threads;
        if (num == n) {
            // Perfection...
            return true;
        }
        if (num < n) {
            // NOTE: Error in simple kernel, but could be valid grid-stride loop usage...
            AUSSIE_ERROR("WARN: Thread total is lower than n (not enough threads or grid-stride loop)");
            if (n % threads != 0) {
                AUSSIE_ERROR("WARN: Grid-stride loop kernel would have wasted iterations");
            }
            return true;  // allow it
        }
        if (num > n) {
            AUSSIE_ERROR("WARN: Thread total more than n (extra wasted threads)");
            return true;  // allow it
        }
        return true;  // No serious errors found...
    }
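
As a usage sketch, the check can run just before the kernel launch, alongside the usual rounding-up block count calculation (the kernel and variable names are hypothetical):

    int threads = 256;                          // A multiple of the warp size (32)
    int blocks = (n + threads - 1) / threads;   // Round up so blocks*threads >= n
    aussie_check_kernel_dimensions_1D(blocks, threads, n);  // Self-test the launch math
    my_vector_kernel<<<blocks, threads>>>(d_v, n);          // Hypothetical kernel launch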

Note that this self-checking idea can be extended to a lot of CUDA Runtime C++ calls. This is discussed in detail in the debug wrapper function chapter.

 
