
Chapter 18. Advanced CUDA C++ Techniques

  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler


Advanced CUDA Kernel Declarations

The simplest types of CUDA kernels look like basic C++ functions with an extra “__global__” specifier. However, you can also use many of the more advanced types of function declarations to create a wider variety of kernel functions. Examples include:

  • Class member functions — defined in the style: MyClass::mykernel
  • Function overloading — offer multiple versions of the kernel.
  • Templated kernels — using the “template” C++ keyword (see the sketch after this list).
  • Function pointers — uses random * and () sequences and yet somehow works.
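
For instance, here is a minimal sketch of a templated kernel; the kernel name and the scaling operation are purely illustrative, but the same source then serves both float and double vectors:

    // Templated kernel: one definition covers multiple element types.
    template <typename T>
    __global__ void scale_vector(T *v, int n, T factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= factor;
    }

    // Launching selects the instantiation, like any other C++ template:
    //     scale_vector<float> <<< blocks, nthreads >>> (dev_v, n, 2.0f);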

Nevertheless, be aware that the kernel code is not running in a standard C++ environment, and the NVCC compiler does not support every type of capability for kernels. For example, here are some of the limitations:

  • No return type — a kernel must be declared void and cannot return a value.
  • Reference parameters not supported.
  • Default arguments for parameters not supported.

Function Pointer Kernels. Another trick you can do with kernel functions: use function pointers. Although they’re a little mind-bending at first, function pointers are supported for kernel launches (and other things) in CUDA C++, and have been a part of standard C++ for decades.

This idea is very useful for wrapping your kernel launches in boilerplate code, such as the calculation of blocks and threads for the grid dimensions, and managing the up-and-down host-device data transfers. Here’s the idea of a general kernel launch wrapper to do different vector operations:

    void kernel_launch(void (*myvectorfnptr)(float *v, int n), float *v, int n)
    {
        // Set up ... e.g., cudaMalloc, cudaMemcpy, etc.
        // Grid dimensions ... e.g., compute blocks and nthreads

        myvectorfnptr <<< blocks, nthreads >>> (v, n);

        // Check & synchronize ... e.g., cudaGetLastError, cudaDeviceSynchronize
        // Get results ... e.g., cudaMemcpy
        // Cleanup ... e.g., cudaFree
    }

And here’s how you’d call it:

    __global__ void vector_sum(float *v, int n)
    {
        // Sum reduction of vector ...
    }

    // ....
    kernel_launch(vector_sum, v, n);

In the call to the “kernel_launch” function, we have passed the function name “vector_sum” as a function pointer. Note that we didn’t use any “()” after the name. The name of a function, used by itself without parentheses, is a function pointer in CUDA C++, and it points to the code for that function. We can call a function pointer just by adding a “(...)” sequence after it. Function pointers can have parameter lists, or can have zero parameters.
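
To flesh out the earlier skeleton, here is one possible version of the wrapper. This is only a sketch, assuming the kernel modifies a float vector in place (CUDACHK is the error-checking macro style used in the kernel timeout snippet later in this chapter):

    void kernel_launch(void (*myvectorfnptr)(float *v, int n), float *v, int n)
    {
        // Allocate device memory and copy the input vector up to the GPU.
        float *dev_v = NULL;
        size_t nbytes = n * sizeof(float);
        CUDACHK( cudaMalloc((void**)&dev_v, nbytes) );
        CUDACHK( cudaMemcpy(dev_v, v, nbytes, cudaMemcpyHostToDevice) );

        // Compute the grid dimensions.
        int nthreads = 256;
        int blocks = (n + nthreads - 1) / nthreads;

        // Launch whichever kernel was passed in via the function pointer.
        myvectorfnptr <<< blocks, nthreads >>> (dev_v, n);
        CUDACHK( cudaGetLastError() );
        CUDACHK( cudaDeviceSynchronize() );

        // Copy the results back down to the host, and clean up.
        CUDACHK( cudaMemcpy(v, dev_v, nbytes, cudaMemcpyDeviceToHost) );
        CUDACHK( cudaFree(dev_v) );
    }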

Persistent Kernels

Nobody’s telling your kernels they have to exit. Non-exiting threads can remain resident in the GPU, and receive more work in a type of job scheduling architecture. This can be very efficient for some types of kernel where there is a continual stream of work (e.g., batching or chunked prefill in AI engines).

The advantages of persistent threads include:

  • Avoids kernel launch overhead
  • Low latency (fast response time to new queries)

The disadvantages of a persistent kernel architecture include:

  • Monopolized GPU
  • Code implementation complexity
  • Busy wait
  • GPU never idle (consumes electricity)
  • Kernel timeouts

The architecture of a persistent kernel has the threads simply waiting for work to be queued. Typically, this would come from the host code, which acts as the overarching controller of the work queue.
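
Here is a minimal sketch of the idea, assuming a single thread block and a one-slot work “queue” shared with the host (e.g., via pinned mapped host memory, or managed memory on a device that supports concurrent access). The WorkSlot struct and its field names are hypothetical, purely for illustration:

    // Persistent kernel sketch: one block spins, waiting for queued work.
    struct WorkSlot {
        float *data;    // device-visible pointer to the vector to process
        int    n;       // number of elements
        int    ready;   // host sets to 1 when new work is queued
        int    quit;    // host sets to 1 to ask the kernel to exit
    };

    __global__ void persistent_worker(volatile WorkSlot *slot)
    {
        __shared__ int action;   // 0 = no work, 1 = work, 2 = quit
        for (;;) {
            if (threadIdx.x == 0) {   // thread 0 polls the slot (busy wait)
                action = slot->quit ? 2 : (slot->ready ? 1 : 0);
            }
            __syncthreads();
            if (action == 2) return;   // host asked us to exit
            if (action == 1) {
                for (int i = threadIdx.x; i < slot->n; i += blockDim.x) {
                    slot->data[i] *= 2.0f;   // placeholder element-wise work
                }
                __threadfence_system();      // make results visible to the host
                __syncthreads();
                if (threadIdx.x == 0) slot->ready = 0;   // mark slot consumed
            }
            __syncthreads();   // all threads done with "action" before next poll
        }
    }

The host side would launch persistent_worker<<<1, 256>>>(slot) once, and then repeatedly fill in the data and n fields and set ready to 1 for each new piece of work.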

Why do you need the GPU threads to be persistent? Why not have the host code wait for work and then launch the kernel threads? The main reason is to avoid the delay from kernel launches, thereby improving response times.

Similarly, a persistent architecture could have a small kernel managing the work queue, which then launches a larger kernel via dynamic parallelism. But that also undermines the advantage of having the large kernel pre-launched, so as to have a very low latency.
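
For illustration, a minimal dynamic-parallelism sketch of that controller idea might look like the following (the kernel names are hypothetical, and it must be compiled with relocatable device code, i.e., nvcc -rdc=true):

    // A small controller kernel launches a bigger worker kernel from the device.
    __global__ void big_worker(float *v, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= 2.0f;   // placeholder work
    }

    __global__ void small_controller(float *v, int n)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            int nthreads = 256;
            int blocks = (n + nthreads - 1) / nthreads;
            big_worker <<< blocks, nthreads >>> (v, n);   // device-side launch
        }
    }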

Note that there have been some problems, whereby a persistent long-running kernel starves a non-persistent kernel. These are related to the “lazy module loading” optimization, which can be configured off with the CUDA_MODULE_LOADING environment variable (i.e., change to “EAGER” from default value “LAZY”). Alternatively, a workaround is to dry-run the non-persistent kernel prior to launching the persistent kernel, thereby forcing it to be loaded once in the GPU.

Generally, it would make sense to use the persistent kernel architecture only when there is an expectation of a continual feed of work jobs. In that situation, the downsides are limited, because the GPU is always busy.

Kernel Timeouts

Actually, your GPU can definitely tell your kernels to exit, and it doesn’t ask nicely. There are “kernel timeouts” where the GPU will kill a kernel after a set period of time. Running a persistent kernel requires taking control of these timeouts.

You can determine programmatically in CUDA C++ whether kernel timeouts are enabled on the GPU by using the cudaGetDeviceProperties API and the kernelExecTimeoutEnabled property.

    // Kernel timeout property (enabled/disabled)
    int device_number = 0;
    cudaDeviceProp prop;
    CUDACHK( cudaGetDeviceProperties(&prop, device_number) );
    bool kernel_timeout = (prop.kernelExecTimeoutEnabled != 0); 
    printf("Kernel timeout: %s\n", 
        kernel_timeout ? (char*)"enabled" : (char*)"disabled");

If you’re running a persistent kernel, it would make sense to check that the timeout has been disabled, and emit a severe error if not.
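
For example, here is a minimal sketch of that check, reusing the prop structure queried above (the error handling shown is just illustrative):

    // Refuse to start a persistent kernel if the watchdog timeout is enabled.
    if (prop.kernelExecTimeoutEnabled) {
        fprintf(stderr, "ERROR: kernel timeout is enabled; "
            "a persistent kernel would be killed by the watchdog\n");
        exit(EXIT_FAILURE);
    }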

The length of a CUDA timeout is typically 5 to 10 seconds. When a timeout occurs for a kernel, a CUDA Runtime error is raised: cudaErrorLaunchTimeout (702), also known as CUDA_ERROR_LAUNCH_TIMEOUT in the driver API, from those heady, iconoclastic days when we used underscores.

You can’t directly change the kernel timeout settings from within CUDA C++. There are operating system configuration settings that can be modified, including changing Windows registry settings, disabling the watchdog timer, adjusting Linux kernel parameters, or altering the configuration properties of the GPU driver.

Lookup Tables

Lookup tables are such a well-known optimization that they’re just called LUTs. The idea is to precalculate a lot of data results, thereby avoiding that processing cost at runtime.

AI engines provide some good examples. Let’s say you have a vector of intermediate results (“activations”) and you want to apply the sigmoid function to them:

    sigmoid(x) := 1 / (1 + exp(-x))

In basic C++, the code is:

    float sigmoid(float x)
    {
        return 1.0f / ( 1.0f + expf( - x) );
    }

So, we know how to write a super-fast GPU kernel that processes all the vector elements in parallel, since this is an element-wise algorithm (i.e., it’s actually “embarrassingly parallel” in the vernacular).

The problem is that each of our kernel threads is going to have to compute the above function, which includes an exponential and some other operations. Isn’t there a faster way?

Well, the first point is this: what are you worried about? There’s a built-in CUDA function for exponentials (the __expf fast-math intrinsic), and it’s basically a single machine code instruction these days, because hardware engineers have long since baked these basic mathematical functions into the GPU’s special function units. Then it’s a negation, an addition, and a division, so it’s not really that much arithmetic cost, and we can do them all in parallel with GPU threads.
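
For the record, a minimal sketch of that element-wise kernel might look like this, using the __expf fast-math intrinsic (the kernel name and launch configuration are illustrative):

    // Element-wise sigmoid over a vector: one thread per element.
    __global__ void sigmoid_kernel(float *v, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            v[i] = 1.0f / (1.0f + __expf(-v[i]));   // fast exponential intrinsic
        }
    }

It would be launched as sigmoid_kernel<<<blocks, nthreads>>>(dev_v, n) with enough threads to cover the vector.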

Anyway, back to my point about LUTs. If we wanted to super-optimize it, we could precompute this computation for all 32-bit input float values, and then have a table of float values that are the results. The input is the bits of the float converted to a 32-bit unsigned integer (as in Chapter 17), which is used as the array index. The code looks something like this:

    float sigmoid_precomp(float x)
    {
        unsigned int offset = *(unsigned int*)&x;
        return g_sigmoid_LUT[offset];
    }

As you can see, this is faster! How big is the precomputed “g_sigmoid_LUT” table? Well, it’s 2^32 (approximately 4.3 billion) entries times four bytes for the float result values, which is about 17 Gigabytes. It’s a little more than will fit into the GPU’s 64KB of constant memory, but we shall persevere anyway.

We could use a smaller table. For example, the rightmost bits of a float are the least-significant bits of its mantissa. If we ignore 8 of them, we lose only about 2 or 3 decimal digits of precision, which are not important here. Hence, the code becomes:

    float sigmoid_precomp(float x)
    {
        unsigned int offset = *(unsigned int*)&x;
        offset >>= 8;
        return g_sigmoid_LUT[offset];
    }

Note that it’s a good thing we’re using an unsigned integer type, because right shift of a negative signed integer is implementation-defined in older C++ standards. It might shift in zeros on the left, or it might sign-extend.

How big is our LUT? Well, now it’s 2^24 times 4, which is about 16.7 million times 4, so our LUT is about 67 Megabytes. It’s still too big for our caches or constant memory, but it will fit in GPU global memory. To get it into global memory, we need to do one of the following:

    (a) upload it from the CPU each time the kernel is launched, or

    (b) declare a global __device__ array for the LUT data, or

    (c) use a persistent kernel with that memory uploaded once.

I feel like our GPU needs a ROM chip here, but I don’t think it has one. The easiest solution is to declare a big global array with the __device__ specifier, which allows the CUDA runtime to manage that memory, paging it in and out of GPU memory, and gives the table the lifetime of the application.
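
As a sketch of option (b), combined with an upload in the style of option (a), the table could be a global __device__ array that the host fills once via cudaMemcpyToSymbol. The host_LUT parameter here is an assumed host-side copy of the table, computed or loaded from a file:

    // Option (b): a global __device__ array holding the 24-bit LUT (~67 MB).
    __device__ float g_sigmoid_LUT[1 << 24];

    // Host code: copy the precomputed table into the device array once.
    void upload_sigmoid_LUT(const float *host_LUT)
    {
        size_t nbytes = (size_t)(1 << 24) * sizeof(float);
        CUDACHK( cudaMemcpyToSymbol(g_sigmoid_LUT, host_LUT, nbytes) );
    }

If the __device__ array is instead defined with an initializer in generated source code (as in the next section), no upload step is needed at all.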

There’s another problem, though. If each of our threads is computing the sigmoid of one vector element, the values in the vectors won’t be the same. In fact, they’ll be effectively random, so each thread will be accessing a random index into the LUT. These are certainly not adjacent addresses, and so it’s not a coalesced access pattern into the global memory.

After all this, we come to a sudden realization about LUT algorithms, and their characteristics:

  • Large data requirement for the precomputed LUT, and
  • Non-coalesced almost random pattern of accessing the LUT.

And then there’s the dreaded realization of an alternate reality:

      It might be better on the CPU.

So, them’s your choices:

  • Use parallel kernel threads without a LUT.
  • Upload 67 MB every time to GPU global memory.
  • Set aside 67 MB global memory and run a persistent kernel.
  • Run it on the CPU instead.

You pays your money and you takes your chances.

Source Code Precomputation

No matter what method you choose to implement your LUT, there’s the problem that you need to fill the LUT by computing the sigmoid function about 16.7 million times for a 24-bit LUT (67MB size), or about 4.3 billion times for a 32-bit LUT (17GB size).

When are you going to run them? You could do it at application startup time, although that will be a bit like an old Windows box starting up. Maybe you could code the whirring disk drive sound-effects for fun? Alternatively, you could do the calculations offline and write them to a binary file, and then you can load the binary file at startup time.

There’s a better solution: let the CUDA C++ compiler sort it out. Instead of writing our results to a binary file, here’s the trick:

  Create a C++ source code file.

It’s just a single variable definition, with an array initializer about a mile long. Then you can compile that C++ source code file, just like any other code. Our generator program looks like this:

    void generate_sigmoid_LUT(FILE *fp, bool gpu)
    {
        unsigned int maxi = (1 << 24);
        if (gpu) fprintf(fp, "__device__ ");
        fprintf(fp, "float g_sigmoid_LUT[]={\n");
        for (unsigned int i = 0; i < maxi; i++) {
            unsigned int bits = i << 8;   // undo the ">>= 8" used at lookup time
            float f = *(float *)&bits;
            if (!isfinite(f)) f = 0.0f;   // Inf/NaN bit patterns get a dummy entry
            fprintf(fp, "%10.10f", sigmoid(f));
            if (i + 1 < maxi) fprintf(fp, ",");
            if (i % 5 == 4) fprintf(fp, "\n");
        }
        fprintf(fp, "};\n");
    }
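
To run it offline, a minimal driver might look like this (the output filename is illustrative):

    int main(void)
    {
        FILE *fp = fopen("sigmoid_lut.cu", "w");   // generated CUDA source file
        if (!fp) { perror("sigmoid_lut.cu"); return 1; }
        generate_sigmoid_LUT(fp, /*gpu=*/true);    // emit the __device__ array
        fclose(fp);
        return 0;
    }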

It’s not the tightest code I’ve ever written, but this is to run offline anyway. The output is huge and looks like this:

    __device__ float g_sigmoid_LUT[]={
        // ... lots of data
        0.6224593520,0.6224593520,0.6224594116,0.6224594116,0.6224594116,
        0.6224594116,0.6224594116,0.6224594712,0.6224594712,0.6224595308,
        0.6224595308,0.6224595308,0.6224595308,0.6224595308,0.6224595308,
        // ... more data
    };

There are practical problems with the file size, however. If our LUT has 16.7 million numbers, and we use 10 digits in ASCII for the numeric constants in source code, our output file will be over 167 MB, which is quite a large C++ source file for the nvcc compiler to handle.

When I tested it, my generated CUDA C++ file was 221MB, so I tried to compile it. Unfortunately, nvcc is probably still spinning and thinking, as you read this. Just kidding, but it did take a few minutes. Kudos to all the compiler engineers at NVIDIA!

 
