Chapter 3. GPU Vectorization

  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler

One-Dimensional Kernels

Vectorization is the basic optimization performed by GPUs, whereby each computation in a vector is parallelized into its own thread. In this way, all of the operations in a vector are performed in parallel.

The simplest CUDA C++ kernels operate on all the elements of a one-dimensional array or vector. There are two main types of one-dimensional kernels that you may need to implement:

  • Vertical (element-wise) — super-fast!
  • Horizontal (reductions) — slower.

The reason they are called “vertical” or “horizontal” is shrouded in mystery from time immemorial. After coming up empty in a search, I asked an AI about the origin of the terms, and it replied that the reason was based on whether the GPU was mounted horizontally or vertically inside the computer case. Umm, no. When I corrected it, then the AI suggested it was that vertical operations are “column-wise” whereas horizontal operations are “row-wise,” which is at least closer to an answer, but it’s really talking about matrices not vectors. When I asked who “coined” the terms, the AI claimed they “evolved organically” in the industry.

Vertical kernels. The simplest one-dimensional kernels are “vertical” or “element-wise” kernels, where each operation is only on one of the vector’s elements, and they are not combined. Examples include:

  • Clear a vector to zero
  • Set all vector elements to a scalar value
  • Add a scalar to each element of a vector
  • Scale a vector (multiply/divide each element by a scalar)
  • Exponentiate each number in a vector (e.g., Softmax normalization in AI).
  • Convert all negative values to zero (e.g., the ReLU activation function in AI; see the sketch below).

I should get censured by the AI Worldwide Members Society for mentioning “divide.” AI programmers never divide, but use reciprocal multiplication, thank you very much.
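
As a concrete illustration of these element-wise operations, here is a minimal sketch of a ReLU kernel (the kernel name is illustrative, not from the book's source code), using the same one-thread-per-element launch pattern that is examined in detail later in this chapter:

    __global__ void aussie_vector_relu(float* v, int n)
    {
        // One thread per element: compute the global thread index
        int id = blockIdx.x * blockDim.x + threadIdx.x;

        if (id < n) { // Safety
            if (v[id] < 0.0f) v[id] = 0.0f;  // Convert negatives to zero
        }
    }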

The above examples all involved only a single vector, but there are also common vertical operations on two vectors (or more):

  • Copy a vector to a second vector
  • Add a second vector to the first one (e.g., add a bias vector in AI)
  • Add two vectors to create a third vector (common)
  • Multiply two vectors element-wise (not very useful)
  • SAXPY (Single-precision AX Plus Y), sketched below
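
For example, SAXPY computes y = a*x + y element-wise over two vectors and a scalar. A minimal vertical-kernel sketch (the kernel name is an illustrative assumption) looks like this:

    __global__ void aussie_saxpy(float a, const float* x, float* y, int n)
    {
        // One thread per element pair
        int id = blockIdx.x * blockDim.x + threadIdx.x;

        if (id < n) { // Safety
            y[id] = a * x[id] + y[id];  // y = a*x + y, element-wise
        }
    }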

Horizontal reductions. The other type is the horizontal or “reduction” kernel, whereby multiple elements of a vector are combined. The reason they are often called “reductions” is that they “reduce” a vector (many numbers) down to a scalar (single number). Typical examples include:

  • Sum of all vector elements
  • Maximum of all vector elements
  • Minimum of all vector elements
  • Dot product (scalar product) of two vectors

Note that the dot product multiplies the vector elements in pairs, and then adds up the results, so it’s like a combination of “vertical multiply” and “horizontal add” kernels. All AI engines are based on 2D matrix multiplications, which are built from 1D vector dot products, which is why you needed to know that.
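
As a deliberately naive sketch of that idea (illustrative only, and not the recommended reduction pattern), a dot product can be written with one multiply per thread and an atomicAdd to combine the partial products into a single scalar; production reduction kernels use shared memory and warp-level primitives instead:

    __global__ void aussie_dot_product_naive(const float* x, const float* y, float* result, int n)
    {
        // "Vertical multiply": one thread per element pair
        int id = blockIdx.x * blockDim.x + threadIdx.x;

        if (id < n) { // Safety
            // "Horizontal add": atomically accumulate into one scalar
            // (*result must be zeroed on the device before the launch)
            atomicAdd(result, x[id] * y[id]);
        }
    }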

Vertical Kernels

Vertical kernels can run so fast that we call them “embarrassingly parallel,” because each parallel computation has no dependency on any others. The only trick is trying to figure out in what order we’re going to process each vector element.

There are multiple ways to parallelize a one-dimensional element-wise operation on a vector in CUDA threads. Let’s examine some of them:

  • Total parallelism — each thread processes one element.
  • Segmented threads — each thread processes a contiguous segment of the vector.
  • Grid-stride loops — each thread “strides” through the array, processing non-adjacent elements.
  • Loop unrolled versions of all of these.

You might think at first blush that the best would be segmented processing, because that means adjacent addresses, which is good for “coalescing” of memory accesses. Well, unfortunately, that’s when Data identifies a “sequential thought pattern” in the CUDA episode of Star Trek. The segmented version actually has the worst performance with 100% non-coalesced accesses!

Total Thread Parallelism

This method launches one GPU thread per vector element. Such a plan would seem ideal, and it certainly does run fast for small-to-medium vector sizes.

    __global__ void aussie_clear_vector_basic(float* v, int n)
    {
        // Compute offset
        int id = blockIdx.x * blockDim.x + threadIdx.x;

        if (id < n) { // Safety
            v[id] = 0.0f;  // Clear one vector element
        }
    }

This is launched with a thread for every single vector element:

    int nthreads = 256;
    int blocks = (n + nthreads - 1) / nthreads;
    aussie_clear_vector_basic <<< blocks, nthreads >>> (dv, n);

We choose a threads-per-block value of 256 as a rule of thumb: it should be a multiple of the warp size (32), and cannot exceed 1024.
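
For completeness, here is a minimal host-side sketch of how the device vector dv might be allocated and the kernel launched (an assumption for illustration, with error checking omitted):

    int n = 4096;                        // Vector length (assumed)
    float* dv = NULL;                    // Device pointer
    cudaMalloc((void**)&dv, n * sizeof(float));  // Allocate on the GPU

    int nthreads = 256;
    int blocks = (n + nthreads - 1) / nthreads;
    aussie_clear_vector_basic <<< blocks, nthreads >>> (dv, n);

    cudaDeviceSynchronize();             // Wait for the kernel to finish
    cudaFree(dv);                        // Release device memory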

This is a super-fast kernel with “total parallelism” because every thread sets only one vector element. Hence, every vector element is processed in parallel. Note that we could even dispense with the safety if statement if we could be sure that n was a multiple of 256. The kernel is tighter:

    __global__ void aussie_clear_vector_unsafe(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        v[id] = 0.0f;  // Clear one vector element
    }

And we could do a safety check on the host side, which can even be compiled out of production code (assertions disappear when NDEBUG is defined):

    assert(n % 256 == 0);
    int nthreads = 256;
    int blocks = (n + nthreads - 1) / nthreads;
    aussie_clear_vector_unsafe <<< blocks, nthreads >>> (dv, n);

And we can further micro-optimize the arithmetic for the restricted values of nthreads:

    int blocks = n / nthreads; // Simpler; requires n % nthreads == 0

One of the limitations should be fairly obvious: what if the vector has more elements than the maximum number of threads we’re allowed to run on a GPU?

CUDA is limited to 1024 threads per block, but it can handle a rather large number of blocks. Even if we don’t hit a hard limit, launching too many blocks means they get scheduled in “waves” of computation, and we’ve created another inefficiency, because launching so many blocks into the GPU is not free.

The second problem is that each kernel thread does so little work that the kernel launch overhead, especially across multiple waves, can become a significant percentage of the cost. It would be better to launch fewer blocks so that there are fewer threads to launch. The solution to our problems is to use non-total parallelism, so that each thread does some extra work by looping through the array.

Segmented Loop Kernel

In this version, each thread processes multiple vector elements in a contiguous segment of the vector. Let’s just do a simple one that does 4 assignments in each thread.

    __global__ void aussie_clear_vector_segment4(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        const int seglen = 4;  // Segment length to process
        id *= seglen;
        if (id + seglen - 1 < n) { // Safety
            for (int offset = 0; offset < seglen; offset++, id++) {
                v[id] = 0.0f;
            }
        }
    }

The trick is this line, which ensures that thread 0 starts at 0, thread 1 starts at 4, thread 2 starts at 8, and so on:

    id *= seglen;

Hence, each thread is doing 4 times the work of those in the prior example. However, this means we also need to launch 4 times fewer total threads.

    int nthreads = 256;
    int blocks = (n + nthreads - 1) / nthreads;
    blocks /= 4;   // Use 4 times fewer blocks

Optionally, we can add various safety self-tests:

    assert(n % 256 == 0);
    assert(blocks * 4 * nthreads == n);

Why aren’t these good CUDA kernels? Well, there are two reasons:

    1. The inner loop is inefficient, and

    2. The memory access pattern is not coalesced.

We can fix the inefficiency of the inner loop using loop unrolling:

    __global__ void aussie_clear_vector_segment4_unrolled(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        const int seglen = 4;  // Segment length to process
        id *= seglen;
        if (id + seglen - 1 < n) { // Safety
            v[id] = 0.0f;
            v[id + 1] = 0.0f;
            v[id + 2] = 0.0f;
            v[id + 3] = 0.0f;
        }
    }

But, it’s still not efficient because it’s not a “coalesced” memory access pattern.

Well, let’s consider a simple example: to process 40 elements in 10 threads, we use one thread for elements 0..3, another thread for 4..7, another for 8..11, and so on. If you think about the parallel execution, you’ll realize that in the first iteration the 10 parallel threads will process addresses:

    0, 4, 8, 12, etc.

And in the second iteration they will process, in parallel, addresses:

    1, 5, 9, 13, etc.

Although each thread is processing contiguous addresses (sequentially), the parallel processing of a full warp of 32 threads is not accessing adjacent memory addresses. Clearly, this is not a coalesced memory access pattern. See Chapter 9 for more details on coalescing.

Grid-Stride Loop Kernel

These loops are the GPU celebrities, always snapping selfies at any AI convention, because they’re both fast and flexible (like The Flash and Deadpool combined?). Here’s an example:

    __global__ void aussie_clear_vector_kernel_grid_stride(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for ( ; id < n; id += stride) {
            v[id] = 0.0f;  // Clear one vector element
        }
    }

Now, the weird thing about grid-stride loops is that it doesn’t really matter how many blocks or threads we spawn — it still works! This will work:

    int nthreads = 256;
    int blocks = (n + nthreads - 1) / nthreads;

That is just like “total parallelism” with only one assignment per thread. This spawns one thread for every vector element.

And this will also work, with now four assignments per thread, and four times fewer total threads spawned:

    int work_per_thread = 4;
    int nthreads = 256;
    int blocks = (n + nthreads - 1) / nthreads;
    blocks /= work_per_thread;   // Use fewer blocks

And even this will work with completely sequential iterations in one thread only, which is inefficient but occasionally useful for debugging:

    int nthreads = 1;
    int blocks = 1;

Striding. The idea with grid-stride loops is that we have a whole warp of threads processing contiguous memory. Weirdly, this means that each thread is processing elements that are non-contiguous. This is called “striding” through the array, at intervals equal to the total number of threads across the whole grid (for example, 256 threads per block times 16 blocks gives a stride of 4,096 elements), which is why we call them “grid-stride loops.”

Note that the meaning of “stride” in these CUDA kernels is the number of threads across the grid of multiple blocks. Other applications in Computer Science use the term to mean the number of bytes between array elements, which isn’t quite the same thing.

Analysis of Grid-Stride Loops

Grid-stride loops are extremely fast, but a little tricky to code up. However, once coded correctly, they are also easy to use in any grid dimensions, because each loop adjusts its iterations so that the threads combine to exactly cover the entire vector. In fact, grid-stride loops will even work if you launch a kernel with one block and one thread, which is useful in debugging.

Grid-stride loops are widely used in CUDA C++ kernels. Clearly someone in Engineering discovered these types of kernels, because if the Marketing Department had been involved earlier, we’d be calling them “Jumping Superfast Amplifiers” or “Flash Astral Stride Turnips (FAST)” or something like that.

Actually, no, I got it! We should call them “Gazelles” because Grid-Stride Loop is GSL. I should work in Marketing, and I’m sure that I would just love it, every single day.

Anyway, the reason that grid-stride loops work so well is counter-intuitive. If we have 100 vector elements processed by 10 threads, then the first thread does 0, 10, 20, etc. And the second thread does 1, 11, 21, etc. However, when you look at the whole 10 threads in parallel, at the first iteration they are processing:

    0, 1, 2, 3, ... 9

And the second lock-step computation of all 10 threads processes:

    10, 11, 12, 13, ... 19

The computation progresses over vector elements 0..9 then 10..19, and so on. This is the very definition of coalesced access in parallel by each warp. Hence, as you’d expect from such a great name, Gazelles are fast!
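
The earlier list of approaches also mentioned loop-unrolled versions. As a hedged sketch (not taken from the book’s source code), an unrolled grid-stride kernel can perform four coalesced assignments per loop iteration, with a remainder loop for any leftover elements:

    __global__ void aussie_clear_vector_grid_stride_unroll4(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        int i = id;
        // Main unrolled loop: 4 strided (but warp-coalesced) assignments
        for ( ; i + 3 * stride < n; i += 4 * stride) {
            v[i] = 0.0f;
            v[i + stride] = 0.0f;
            v[i + 2 * stride] = 0.0f;
            v[i + 3 * stride] = 0.0f;
        }
        // Remainder loop for any leftover elements
        for ( ; i < n; i += stride) {
            v[i] = 0.0f;
        }
    }

The unrolling doesn’t change the access pattern: at each of the four assignments, adjacent threads in a warp still touch adjacent addresses, so the accesses stay coalesced.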
