Aussie AI
Chapter 9. CUDA Sticky Errors
Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
by David Spuler
What are Sticky Errors?
There is a subset of the error codes that are called “sticky,”
which is another way of saying “really bad.”
They’re called “sticky” because they get stuck in the CUDA error code, and cannot be cleared
(not even by cudaGetLastError).
Even worse, they stop any other CUDA API from working, so you can’t launch another kernel,
and functions like cudaMemcpy just fail immediately.
The list of sticky errors on the GPU is short, but obviously includes the worst kinds of error:
- Invalid address — cudaErrorIllegalAddress (700).
- Illegal instruction — cudaErrorIllegalInstruction (715).
- Kernel time-out — triggers a kernel launch error, cudaErrorLaunchFailure (719), but later.
- Misaligned access — cudaErrorMisalignedAddress (716).
- Null dereference — cudaErrorLaunchFailure (later).
Note that some of these sticky errors will return cudaErrorLaunchFailure, but not
as synchronous errors at
the time of the kernel launch.
They are triggered later asynchronously by kernel code execution and can thus occur much later,
at any point between kernel launch and the first synchronization after the failure.
The name of the error code is somewhat misleading!
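To make the timing concrete, here's a minimal sketch of how this typically plays out (the kernel and buffer names here are placeholders of my own):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void bad_null_deref(float *f)
{
    int *ptr = NULL;  // Deliberate null dereference in device code
    *ptr = 1;
}

void demo_async_error_reporting()
{
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, 100 * sizeof(float));

    bad_null_deref<<<1, 1>>>(d_buf);

    // Synchronous check right after the launch: usually reports success...
    cudaError_t err = cudaGetLastError();
    fprintf(stderr, "after launch: %s\n", cudaGetErrorName(err));  // typically cudaSuccess

    // ...the device-side fault only surfaces at the next synchronization point.
    err = cudaDeviceSynchronize();
    fprintf(stderr, "after sync: %s\n", cudaGetErrorName(err));    // e.g. cudaErrorLaunchFailure (719)
}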
Note that if you're wondering where "segmentation fault" or "core dumped" are on this list, that's "CPU thinking," and you really need to get your head into the GPU world.
The problematic coding errors that trigger a segfault in C++ will, when run on a GPU, either cause one of the above sticky errors with a cudaErrorLaunchFailure error code, or in some cases won't trigger a CUDA runtime error at all in device code
(although they can still be found by compute-sanitizer).
There’s no easy recovery from CUDA sticky errors, and your GPU won’t accept any further work from the program. The rest of your host code will keep running on the CPU, but anything related to the GPU will just fail with an error code from CUDA sent back to the host.
The good news is that you can intercept all these error codes with your glorious “CUDA error check” macro, but the bad news is there’s no way to fix it. They’re called “sticky” errors because you’re stuck!
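For reference, a typical "CUDA error check" macro looks something like the following minimal sketch (the macro name and message format here are my own; adapt them to your project's conventions):

#include <cstdio>
#include <cuda_runtime.h>

// Wrap any CUDA runtime call and report a non-success return code.
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "CUDA ERROR %s:%d: %d %s\n", \
                __FILE__, __LINE__, (int)err_, cudaGetErrorName(err_)); \
        } \
    } while (0)

// Typical usage:
//   CUDA_CHECK(cudaMemcpy(dst, src, nbytes, cudaMemcpyDeviceToHost));
//   my_kernel<<<blocks, threads>>>(args);
//   CUDA_CHECK(cudaGetLastError());        // synchronous launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // asynchronous (possibly sticky) kernel errors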
Detecting Sticky Errors
How do you detect a sticky error?
Unfortunately, you cannot just check the numeric code value for a few specific enum values, because some codes
can be both sticky and non-sticky.
For example, cudaErrorLaunchFailure could be a synchronous launch failure
from too many threads-per-block (non-sticky) or an asynchronous error from a null pointer dereference in a kernel (sticky).
Since cudaGetLastError returns the current error, but also clears the error flag for the next call,
you might think this would work:
bool aussie_is_sticky_error_FAILS(bool warn)  // Buggy!
{
    cudaError_t err = cudaGetLastError();  // Clear prior error
    err = cudaGetLastError();              // Twice
    if (err != cudaSuccess) {              // Sticky error?
        if (warn) fprintf(stderr, "CUDA STICKY ERROR: %d %s\n",
            (int)err, cudaGetErrorName(err));
        return true;
    }
    return false;  // not sticky
}
Actually, no, this will never report a sticky error.
The function cudaGetLastError will return cudaSuccess the second time, even if you’re
in a sticky error state.
It seems that sticky errors are not sticky for this CUDA Runtime API function.
Unfortunately, the only reliable way that I know of to detect a sticky error is to issue
a dummy call to the CUDA runtime, such as cudaMemcpy or cudaMalloc:
bool aussie_is_sticky_error(bool warn)
{
    // Do a dummy cudaMalloc to see if it's a sticky error...
    void *vdummy = NULL;
    cudaError_t err = cudaGetLastError();  // Clear prior error
    err = cudaMalloc(&vdummy, 1);          // Sticky error?
    if (err != cudaSuccess) {
        if (warn) fprintf(stderr, "CUDA STICKY ERROR: %d %s\n", (int)err, cudaGetErrorName(err));
        return true;  // Yes, sticky..
    }
    return false;  // not sticky
}
It would probably be better to use a dummy cudaMemcpy call instead, since the cudaMalloc approach above needlessly fragments
device memory (and the one-byte allocation is never freed).
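A sketch of that cudaMemcpy-based variant might look like this (the function name, the two-byte scratch buffer, and the choice to allocate it lazily and reuse it are my own assumptions):

#include <cstdio>
#include <cuda_runtime.h>

bool aussie_is_sticky_error_memcpy(bool warn)
{
    // Dummy 1-byte device-to-device copy: fails if the context is in a sticky error state.
    // The scratch buffer is allocated once and reused, to avoid fragmenting device memory.
    static char *s_scratch = NULL;
    cudaError_t err = cudaGetLastError();  // Clear any prior non-sticky error
    if (s_scratch == NULL) {
        err = cudaMalloc((void **)&s_scratch, 2);
        if (err != cudaSuccess) {
            // Caveat: this could also be a plain out-of-memory failure, not a sticky error.
            if (warn) fprintf(stderr, "CUDA STICKY ERROR: %d %s\n", (int)err, cudaGetErrorName(err));
            return true;
        }
    }
    err = cudaMemcpy(s_scratch, s_scratch + 1, 1, cudaMemcpyDeviceToDevice);
    if (err != cudaSuccess) {
        if (warn) fprintf(stderr, "CUDA STICKY ERROR: %d %s\n", (int)err, cudaGetErrorName(err));
        return true;  // Yes, sticky
    }
    return false;  // not sticky
}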
Actually, here's a cleaner way suggested on the NVIDIA Forums: call cudaDeviceSynchronize twice.
The drawback is that it forces a synchronization,
which would negate the benefit of any data transfer overlapping optimizations you're using.
bool aussie_is_sticky_error_SYNCHRONIZED(bool warn)
{
    // Call cudaDeviceSynchronize twice to test stickiness
    cudaError_t err = cudaDeviceSynchronize();  // First call
    if (err != cudaSuccess) {
        err = cudaGetLastError();               // Clear it
        err = cudaDeviceSynchronize();          // Second call...
        if (err != cudaSuccess) {               // Sticky error?
            if (warn) fprintf(stderr, "CUDA STICKY ERROR: %d %s\n", (int)err, cudaGetErrorName(err));
            return true;  // Yes, sticky..
        }
    }
    return false;  // not sticky
}
What Causes Sticky Errors?
Which kernel device errors cause sticky errors? I wrote some dummy kernels to trigger crashing code.
__global__ void null_deref_local(float *f, int n)
{
    int *ptr = NULL;
    *ptr = 1;
}

__global__ void array_underflow_write(float *f, int n)
{
    f[-1] = 0.0;
}

__global__ void array_overflow_write(float *f, int n)
{
    f[n + 1] = 0.0;
}

__global__ void array_underflow_read(float *f, int n)
{
    volatile int x = f[-1];
    x = x;
}

__global__ void array_overflow_read(float *f, int n)
{
    volatile int x = f[n + 1];
    x = x;
}

__global__ void cudamalloc_uninit_read(float *f, int n)
{
    volatile int x = f[3];
    x = x;
}

__global__ void do_nothing_kernel(float *f, int n)
{
    volatile int x = 0;
    x = x;
}
The idea with “volatile” is to prevent the CUDA compiler
from optimizing my bad code away.
Maybe I’m giving it too much credit, because it didn’t even
remove my blatantly obvious null dereference.
Additionally, I used a test harness to launch the kernels:
fnptr<<<1,1>>>(dest_v, n);  // Launch kernel
// Check for CUDA synchronous launch errors
err = cudaPeekAtLastError();  // Any error?
if (err != cudaSuccess) {
    // ... etc... (did not occur)
}
err = cudaDeviceSynchronize();  // Wait for completion
if (err == cudaSuccess) {
    fprintf(stderr, "%s: no error\n", name);
}
else {
    if (aussie_is_sticky_error(false)) {
        fprintf(stderr, "%s: CUDA sticky error: %d %s\n", name, (int)err, cudaGetErrorName(err));
    }
    else {
        fprintf(stderr, "%s: CUDA non-sticky error: %d %s\n", name, (int)err, cudaGetErrorName(err));
    }
}
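For context, the fnptr, name, and dest_v variables in the harness above need to come from a surrounding driver; a plausible table of the dummy kernels (a sketch only, not the exact code used for these results) looks like this:

typedef void (*kernel_fn)(float *, int);

struct kernel_test {
    kernel_fn fn;      // dummy kernel to launch
    const char *name;  // label used in the output
};

// One entry per dummy kernel above; run one per process invocation,
// since a sticky error poisons the whole CUDA context.
static const kernel_test s_tests[] = {
    { null_deref_local,       "Null dereference local" },
    { array_underflow_write,  "Array underflow write" },
    { array_underflow_read,   "Array underflow read" },
    { cudamalloc_uninit_read, "cudaMalloc uninitialized read" },
    { array_overflow_write,   "Array overflow write" },
    { array_overflow_read,    "Array overflow read" },
    { do_nothing_kernel,      "Do-nothing kernel" },
};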
Here are my results, summarized from multiple invocations (because of the sticky one!):
Null dereference local: CUDA sticky error: 719 cudaErrorLaunchFailure
Array underflow write: no error
Array underflow read: no error
cudaMalloc uninitialized read: no error
Array overflow write: no error
Array_overflow read: no error
Just between you and me, I did initially have a “cudaMalloc uninitialized write” and it didn’t fail.
But then I removed it.
Compute Sanitizer did a lot better at finding problems than the basic CUDA runtime. For example, here’s the output from the “array underflow write” kernel:
========= Invalid __global__ write of size 4 bytes
=========     at 0x20 in array_underflow_write(float *, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7b5cf32001fc is out of bounds
=========     and is 4 bytes before the nearest allocation at 0x7b5cf3200200 of size 400 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
Overall, these results constitute a non-peer reviewed, statistical sample of one, based on one GPU on one particular platform at one time of year in one country. Totally guaranteed results!
Sticky Error Recovery
The main way to “recover” from this situation is to shut down the whole application,
such as by calling exit or abort.
You can print a nice, helpful message to your users for good supportability,
but that’s about all you can do.
Your host code looks like this:
if (aussie_is_sticky_error(false)) {
    fprintf(stderr, "I apologize profusely for my very existence!\n");
    abort();
}
If you aspire to being a perfect programmer, you can close all your open files to flush the buffers, and free all your allocated memory, but the operating system will do that anyway.
In the category of facts you didn’t want to know,
this situation is analogous to the old-school Unix crashes in CPU code
that cause core dumps, such as segmentation faults and illegal instructions.
You can try to register a signal handler function to
intercept the SIGSEGV or SIGILL signals,
and then try to return from your signal handler, because you want to keep going.
Unfortunately, this fails because
the CPU will just re-raise the same signal, so it spins, and you can’t recover.
Just like GPU sticky errors,
the best you can do is register a signal handler that prints a grovelling message and then aborts.
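For completeness, a minimal sketch of such a last-gasp handler on the CPU side (the function names are mine) might be:

#include <csignal>
#include <cstdio>
#include <cstdlib>

// Print a grovelling message and abort; returning would just re-raise the signal.
// (fprintf is not strictly async-signal-safe, but this is a last-gasp diagnostic anyway.)
void aussie_crash_signal_handler(int sig)
{
    fprintf(stderr, "Fatal signal %d: I apologize profusely, aborting.\n", sig);
    abort();
}

void aussie_install_crash_handlers()
{
    std::signal(SIGSEGV, aussie_crash_signal_handler);  // segmentation fault
    std::signal(SIGILL, aussie_crash_signal_handler);   // illegal instruction
}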
Multi-Process Fix for Sticky Errors
This is in the category of fixes that don’t really fix it: There is a way to try a more extensive recovery of an application that fails with a CUDA sticky error. The solution is: abort your process, and try again.
This idea works if:
- Parent process is the main controlling application on the CPU.
- Parent process launches a sub-process (e.g., fork and exec).
- Child sub-process launches the CUDA kernel.
If the child process detects a sticky error in a CUDA error code, it can report the failure to the parent and then shut itself down. The parent can then detect the child's failure status and try again by launching a new child process to re-do the entire kernel.
The advantage of this method is that a single failed kernel doesn't kill your entire application, so it can be resilient to a transient failure on the GPU. The downside is that the entire kernel has to be re-done, with no partial results available.
The extra work you have to do for this includes:
- Parent: Fiddly file descriptor work in fork-exec sequences (like it's 1990 all over again).
- Child: Detect sticky errors in your CUDA error check macros (making them even more spectacular).
- Child: Report success or failure status back to the parent process.
- Parent: Check the child sub-process return status and re-try (but not infinitely).
It’s a certain kind of fun.
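As a rough illustration of that parent/child retry logic, here's a hedged sketch using plain fork and waitpid (no exec, for brevity); the retry limit and the run_kernel_in_child function are placeholders of mine:

#include <cstdio>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Placeholder for the child-side work: launch the CUDA kernel(s), check the
// error codes (e.g., with aussie_is_sticky_error), and return EXIT_FAILURE on a sticky error.
static int run_kernel_in_child()
{
    // ... launch kernel, check CUDA error codes ...
    return EXIT_SUCCESS;
}

// Parent-side retry loop. Assumes the parent itself never initializes CUDA,
// since a CUDA context generally cannot be used safely across a fork.
bool run_kernel_with_retry(int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        pid_t pid = fork();
        if (pid < 0) return false;  // fork failed
        if (pid == 0) {             // Child: do the GPU work and exit
            _exit(run_kernel_in_child());
        }
        int status = 0;             // Parent: wait for the child
        if (waitpid(pid, &status, 0) < 0) return false;
        if (WIFEXITED(status) && WEXITSTATUS(status) == EXIT_SUCCESS) {
            return true;            // Kernel run succeeded
        }
        fprintf(stderr, "Child attempt %d failed; retrying...\n", attempt);
    }
    return false;                   // Gave up (but not infinitely)
}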