Aussie AI

Chapter 9. CUDA Sticky Errors

  • Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
  • by David Spuler

What are Sticky Errors?

There is a subset of the error codes that are called “sticky,” which is another way of saying “really bad.” They’re called “sticky” because they get stuck in the CUDA error state and cannot be cleared (not even by cudaGetLastError). Even worse, they stop any further CUDA API calls from working, so you can’t launch another kernel, and functions like cudaMemcpy just fail immediately.

The list of sticky errors on the GPU is short, but obviously includes the worst kinds of error:

  • Invalid address — cudaErrorIllegalAddress 700.
  • Illegal instruction — cudaErrorIllegalInstruction 715.
  • Kernel time-out — triggers a kernel launch error cudaErrorLaunchFailure 719, but later.
  • Misaligned access — cudaErrorMisalignedAddress 716.
  • Null dereference — cudaErrorLaunchFailure (later).

Note that some of these sticky errors will return cudaErrorLaunchFailure, but not as synchronous errors at the time of the kernel launch. They are triggered later asynchronously by kernel code execution and can thus occur much later, at any point between kernel launch and the first synchronization after the failure. The name of the error code is somewhat misleading!

Note that if you’re wondering where “segmentation fault” or “core dumped” are on this list, that’s “CPU thinking” and you really need to get your head into the GPU world. The problematic coding errors that trigger a segfault in C++ will, when run on a GPU, either cause one of the above sticky errors with a cudaErrorLaunchFailure error code, or in some cases won’t trigger a CUDA runtime error at all in device code (though they can still be found by compute-sanitizer).

There’s no easy recovery from CUDA sticky errors, and your GPU won’t accept any further work from the program. The rest of your host code will keep running on the CPU, but anything related to the GPU will just fail with an error code from CUDA sent back to the host.

The good news is that you can intercept all these error codes with your glorious “CUDA error check” macro, but the bad news is there’s no way to fix it. They’re called “sticky” errors because you’re stuck!
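If you don’t already have such a macro, a minimal sketch might look like this (the CUDA_CHECK name and the file/line reporting are my own choices here, not prescribed by the CUDA runtime):

```cpp
// Minimal CUDA error-check macro: report the error with its location,
// then bail out. Adapt the abort policy to your own application.
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n", \
                (int)err_, cudaGetErrorName(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMemcpy(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice));
```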

Detecting Sticky Errors

How do you detect a sticky error? Unfortunately, you cannot just check the numeric code value for a few specific enum values, because some codes can be both sticky and non-sticky. For example, cudaErrorLaunchFailure could be a synchronous launch failure from too many threads-per-block (non-sticky) or an asynchronous error from a null pointer dereference in a kernel (sticky).

Since cudaGetLastError returns the current error, but also clears the error flag for the next call, you might think this would work:

    bool aussie_is_sticky_error_FAILS(bool warn)  // Buggy!
    {
        cudaError_t err = cudaGetLastError(); // Clear prior error
        err = cudaGetLastError();  // Twice
        if (err != cudaSuccess) { // Sticky error?
            if (warn) fprintf(stderr, "CUDA STICKY ERROR: %d %s\n", 
                        (int)err, cudaGetErrorName(err));
            return true;
        }
        return false;  // not sticky
    }

Actually, no, this will never report a sticky error. The function cudaGetLastError will return cudaSuccess the second time, even if you’re in a sticky error state. It seems that sticky errors are not sticky for this CUDA Runtime API function.

Unfortunately, the only reliable way that I know to detect a sticky error is to issue a dummy active call to the CUDA runtime, such as cudaMemcpy or cudaMalloc:

    bool aussie_is_sticky_error(bool warn)
    {
        // Do a dummy cudaMalloc to see if it's a sticky error...
        void *vdummy = NULL;
        cudaError_t err = cudaGetLastError();  // Clear prior error
        err = cudaMalloc(&vdummy, 1);  // Sticky error?
        if (err != cudaSuccess) {
            if (warn) fprintf(stderr, "CUDA STICKY ERROR: %d %s\n",
                        (int)err, cudaGetErrorName(err));
            return true;  // Yes, sticky..
        }
        cudaFree(vdummy);  // Don't leak the dummy allocation
        return false;  // not sticky
    }

It would probably be better to use a dummy cudaMemcpy call instead of the above, which needlessly fragments device memory with tiny allocations.
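For example, here’s a sketch of a variant using a tiny host-to-host cudaMemcpy, which still makes a round trip through the CUDA runtime without touching device memory (I haven’t verified this on every platform, so treat it as an untested assumption that the host-to-host path surfaces the sticky state):

```cpp
bool aussie_is_sticky_error_memcpy(bool warn)
{
    // Dummy 1-byte host-to-host copy: goes through the CUDA runtime,
    // but allocates nothing on the device.
    char src = 0, dst = 0;
    cudaError_t err = cudaGetLastError();  // Clear prior error
    err = cudaMemcpy(&dst, &src, 1, cudaMemcpyHostToHost);  // Sticky error?
    if (err != cudaSuccess) {
        if (warn) fprintf(stderr, "CUDA STICKY ERROR: %d %s\n",
                    (int)err, cudaGetErrorName(err));
        return true;  // Yes, sticky
    }
    return false;  // not sticky
}
```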

Actually, here’s a cleaner way suggested on the NVIDIA Forums: call cudaDeviceSynchronize twice, although this has the drawback that it forces synchronization, which would undo any gains from overlapping data transfers with computation.

    bool aussie_is_sticky_error_SYNCHRONIZED(bool warn)
    {
        // Call cudaDeviceSynchronize twice to test stickiness
        cudaError_t err = cudaDeviceSynchronize();  // First call
        if (err != cudaSuccess) {
            err = cudaGetLastError();  // Clear it
            err = cudaDeviceSynchronize();  // Second call...
            if (err != cudaSuccess) {  // Sticky error?
                if (warn) fprintf(stderr, "CUDA STICKY ERROR: %d %s\n", (int)err, cudaGetErrorName(err));
                return true;  // Yes, sticky..
            }
        }
        return false;  // not sticky
    }

What Causes Sticky Errors?

Which kernel device errors cause sticky errors? I wrote some dummy kernels to trigger crashing code.

    __global__ void null_deref_local(float *f, int n)
    {
        int *ptr = NULL;
        *ptr = 1;
    }

    __global__ void array_underflow_write(float *f, int n)
    {
        f[-1] = 0.0;
    }

    __global__ void array_overflow_write(float *f, int n)
    {
        f[n + 1] = 0.0;
    }

    __global__ void array_underflow_read(float *f, int n)
    {
        volatile int x = f[-1];
        x = x;
    }

    __global__ void array_overflow_read(float *f, int n)
    {
        volatile int x = f[n + 1];
        x = x;
    }

    __global__ void cudamalloc_uninit_read(float *f, int n)
    {
        volatile int x = f[3];
        x = x;
    }

    __global__ void do_nothing_kernel(float *f, int n)
    {
        volatile int x = 0;
        x = x;
    }

The idea with “volatile” is to prevent the CUDA compiler from optimizing my bad code away. Maybe I’m giving it too much credit, because it didn’t even remove my blatantly obvious null dereference.

Additionally, I used a test harness to launch the kernels:

    fnptr<<<1,1>>>(dest_v, n);   // Launch kernel

    // Check for CUDA synchronous launch errors
    err = cudaPeekAtLastError();  // Any error?
    if (err != cudaSuccess) {
        // ... etc... (did not occur)
    }
    err = cudaDeviceSynchronize();  // Wait for completion
    if (err == cudaSuccess) {
        fprintf(stderr, "%s: no error\n", name);
    }
    else {
        if (aussie_is_sticky_error(false)) {
          fprintf(stderr, "%s: CUDA sticky error: %d %s\n", name, (int)err, cudaGetErrorName(err));
        }
        else {
          fprintf(stderr, "%s: CUDA non-sticky error: %d %s\n", name, (int)err, cudaGetErrorName(err));
        }
    }

Here are my results, summarized from multiple invocations (because of the sticky one!):

    Null dereference local: CUDA sticky error: 719 cudaErrorLaunchFailure
    Array_underflow write: no error
    Array underflow read: no error
    cudaMalloc uninitialized read: no error
    Array_overflow write: no error
    Array_overflow read: no error

Just between you and me, I did initially have a “cudaMalloc uninitialized write” and it didn’t fail. But then I removed it.

Compute Sanitizer did a lot better at finding problems than the basic CUDA runtime. For example, here’s the output from the “array underflow write” kernel:

    ========= Invalid __global__ write of size 4 bytes
    =========     at 0x20 in array_underflow_write(float *, int)
    =========     by thread (0,0,0) in block (0,0,0)
    =========     Address 0x7b5cf32001fc is out of bounds
    =========     and is 4 bytes before the nearest allocation at 0x7b5cf3200200 of size 400 bytes
    =========     Saved host backtrace up to driver entry point at kernel launch time

Overall, these results constitute a non-peer reviewed, statistical sample of one, based on one GPU on one particular platform at one time of year in one country. Totally guaranteed results!

Sticky Error Recovery

The main way to “recover” from this situation is to shut down the whole application, such as by calling exit or abort. You can print a nice, helpful message to your users for good supportability, but that’s about all you can do. Your host code looks like this:

    if (aussie_is_sticky_error(false)) {
        fprintf(stderr, "I apologize profusely for my very existence!\n");
        abort();
    }

If you aspire to being a perfect programmer, you can close all your open files to flush the buffers, and free all your allocated memory, but the operating system will do that anyway.

In the category of facts you didn’t want to know, this situation is analogous to the old-school Unix crashes in CPU code that cause core dumps, such as segmentation faults and illegal instructions. You can try to register a signal handler function to intercept the SIGSEGV or SIGILL signals, and then try to return from your signal handler, because you want to keep going. Unfortunately, this fails because the CPU will just re-raise the same signal, so it spins, and you can’t recover. Just like GPU sticky errors, the best you can do is register a signal handler that prints a grovelling message and then aborts.
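The CPU-side version of this grovelling-then-abort strategy can be sketched as follows, assuming a POSIX platform (the function names are my own, and note that fprintf is not strictly async-signal-safe, though it’s a common pragmatic choice for a last-gasp message):

```cpp
#include <csignal>
#include <cstdio>
#include <cstdlib>

// Last-words handler: print a message, then exit for real.
// Returning from the handler would just re-raise the same signal.
extern "C" void aussie_crash_handler(int sig)
{
    // Note: fprintf is not async-signal-safe, but we're exiting anyway.
    fprintf(stderr, "Fatal signal %d: sorry, cannot recover\n", sig);
    _Exit(EXIT_FAILURE);
}

// Install handlers for the classic core-dump signals; false on failure.
bool aussie_install_crash_handlers(void)
{
    if (signal(SIGSEGV, aussie_crash_handler) == SIG_ERR) return false;
    if (signal(SIGILL, aussie_crash_handler) == SIG_ERR) return false;
    return true;
}
```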

Multi-Process Fix for Sticky Errors

This is in the category of fixes that don’t really fix it: There is a way to try a more extensive recovery of an application that fails with a CUDA sticky error. The solution is: abort your process, and try again.

This idea works if:

  • Parent process is the main controlling application on the CPU.
  • Parent process launches a sub-process (e.g., fork and exec).
  • Child sub-process launches the CUDA kernel.

If the child process detects a sticky error in a CUDA error code, it can report failure to the parent and then shut itself down. The parent can then detect the child’s failure status and try again by launching a new child process to re-do this entire kernel.

The advantage of this method is that a single failed kernel doesn’t kill your entire application, which can be resilient to a transient failure on the GPU. The downside is that it has to re-do the entire kernel, with no partial results available.

The extra work you have to do for this includes:

  • Parent: Fiddly file descriptor work in fork-exec sequences (like it’s 1990 all over again).
  • Child: Detect sticky errors in your CUDA error check macros (making them even more spectacular).
  • Child: Report success or failure status back to the parent process from the child process.
  • Parent: Check the child sub-process return status and re-try (but not infinitely).
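The parent-side retry loop can be sketched like this, assuming a POSIX fork/wait setup (run_kernel_child is a hypothetical stand-in for the real child-side CUDA work; here it just simulates a failing kernel via its exit status):

```cpp
#include <cstdio>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical child-side work: would launch the CUDA kernel and check
// for sticky errors. Simulated here as always failing.
static int run_kernel_child(void)
{
    // ... cudaMalloc, kernel launch, sticky-error check would go here ...
    return 1;  // Non-zero exit status = kernel failed (simulated)
}

// Parent-side retry loop: fork a child per attempt, retry on failure.
bool run_kernel_with_retries(int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        pid_t pid = fork();
        if (pid < 0) return false;                // fork itself failed
        if (pid == 0) _exit(run_kernel_child());  // Child: do the GPU work
        int status = 0;
        if (waitpid(pid, &status, 0) < 0) return false;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return true;                          // Child succeeded
        fprintf(stderr, "Attempt %d failed, retrying...\n", attempt);
    }
    return false;  // All attempts failed (but not infinitely!)
}
```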

It’s a certain kind of fun.

 

Online: Table of Contents

PDF: Free PDF book download

Buy: CUDA C++ Debugging: Safer GPU Kernel Programming

The new CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging