
Chapter 9. Coalescing and Striding

  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler


Memory Access Patterns

In addition to choosing the fastest type of memory, it also matters in what order the kernel accesses data. These are called “memory access patterns” and include issues such as data locality and coalescing.

Contiguous Memory Blocks. Many CUDA kernels perform better when operating on contiguous blocks of memory, usually by accessing adjacent memory addresses within a warp. Generally, try to store your data so that related elements stay together in memory, rather than needing to be accessed randomly. Some of the problem areas in this regard include:

  • Multidimensional arrays (i.e., matrices and tensors).
  • Hierarchical data structures (e.g., trees and tries).
  • Permutation arrays (e.g., sparse representations using permutations).
  • Lookup tables (e.g., precomputations of activation functions).

Linearize Multi-Dimensional Data. Don’t store matrices or tensors in a complex multi-indirection layout with pointers. Instead, CUDA prefers having all of the data in one contiguous linearized array, where you have to manually compute the index offset when doing 2D or 3D accesses. This helps GPU parallelization by achieving both data locality and coalesced memory accesses.
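
As an illustration, here is a minimal sketch of a kernel using manual row-major indexing into a linearized matrix (the kernel name and launch shape are illustrative):

    __global__ void aussie_scale_matrix_linearized(float* m, int rows, int cols, float factor)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < rows && col < cols) {
            int idx = row * cols + col;   // Manual 2D-to-1D index (row-major)
            m[idx] *= factor;             // One contiguous array, no pointer indirection
        }
    }

Because threadIdx.x varies fastest within a warp, adjacent threads touch adjacent columns of the same row, so the accesses to the linearized array are coalesced.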

Structure-of-Arrays (SoA). To achieve contiguous storage of data in arrays, it is recommended to use the Structure-of-Arrays (SoA) storage structure, rather than the Array-of-Structures (AoS) layout. This helps optimizations that linearize complex data and use contiguous access for data locality and coalescing.
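
For example, here is a minimal sketch of the two layouts for a set of 3D points (the type names are illustrative):

    struct PointAoS { float x, y, z; };   // Array-of-Structures: PointAoS points[N];
                                          // thread i reads points[i].x, so consecutive
                                          // threads read addresses 12 bytes apart.

    struct PointsSoA {                    // Structure-of-Arrays: one array per field
        float* x;                         // x[0..N-1] is contiguous
        float* y;                         // y[0..N-1] is contiguous
        float* z;                         // z[0..N-1] is contiguous
    };                                    // thread i reads x[i], so consecutive
                                          // threads read adjacent floats (coalesced).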

Coalesced Memory Access Patterns. This technique involves having all threads in a warp or block accessing memory locations that are adjacent to each other. Coalesced memory access patterns are optimized in the GPU hardware, and allow the warp to access an entire contiguous block of memory in parallel. This is faster than accessing memory with intervals between the locations. Storing multi-dimensional data in a linearized format, and preferring Structure-of-Arrays (SoA) over Arrays-of-Structures (AoS) are two ways to gain coalesced accesses. Non-coalesced memory usage is also called a “scattered” memory access pattern.

Grid-Stride Loops. The performance advantage of grid-stride loops is that they achieve coalesced memory access patterns, which are efficient across a full warp. Another advantage is that they are flexible to any number of blocks and threads, so they allow optimization of occupancy rates and wave management without needing to alter the internal methods of the kernel. This type of loop is also safe from array bounds violations and simple to debug, because it also works in a fully sequential mode with a single thread, in which case it has a stride of one. Although best known for use in linear kernels (e.g., vector operations), there are generalizations of grid-stride loops to two-dimensional and three-dimensional kernels, which is similar to tiling.

Avoid Scatter and Gather. These memory access patterns are the polar opposite of coalescing, so it's little surprise that they run slowly. Try to avoid using scatter/gather methods in your algorithms wherever possible.

Coalesced Data Access

An important way to optimize CUDA kernel data access costs is to use a “coalesced” pattern of accessing chunks of data in global memory. Coalesced data accesses are parallel accesses to contiguous memory addresses. In other words, this means to access contiguous memory with a group of threads, all at the same time. The basic idea is that:

  • Each thread accesses an adjacent memory address
  • None of the threads access the same address (no contention)

For a warp of 32 threads, it would access 32 adjacent memory addresses in a contiguous block, with each thread accessing a separate address. These coalesced accesses to global memory are faster in a GPU than random accesses to global memory. Kernels run faster when the memory being accessed is together.

Note that this is similar to the CPU speedup from accessing contiguous memory blocks in sequence, because of memory cache prefetch optimizations. However, that refers to sequential CPU logic, which we have totally forgotten now that we've got a better parallel processor (called a GPU).

What do coalesced memory accesses look like in a kernel? Nothing exciting, let me assure you. In fact, they're very basic and unexciting:

    __global__ void aussie_clear_vector_coalesced(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) {
            v[id] = 0.0;  // Clear one vector element..
        }
    }

Why is this coalesced? Because every thread accesses a different element of the vector v in global memory. For simplicity, let's assume that n==128 and we're running 2 blocks of 64 threads each (i.e., 2 warps of 32 threads in each block). When it runs, all 128 threads can run in parallel, with each warp executing in lock-step, and each warp processes its 32 elements of v[id] at the same time. The first warp has 32 threads that access indices 0..31, the second warp is also in the first block, and it accesses 32..63, and the next two warps in the second block access array elements 64..95 and 96..127.

These accesses are all coalesced. Each warp of 32 threads accesses all 32 of the contiguous memory elements, all in parallel.

But that’s just an ordinary CUDA kernel, I hear you say. That’s correct, and the basic thread model of kernels tends to use coalesced memory accesses, unless we try hard not to.

Non-Coalesced Memory Accesses

Let’s look at a different kernel, where each thread is clearing 4 elements of the vector, rather than just one. This is not usually a good plan, unless our vectors are so large that we can’t run enough threads to do each element in a separate thread. Hence, here’s the code:

    __global__ void aussie_clear_vector_non_coalesced(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        id *= 4;
        if (id < n) {
            v[id] = 0.0;  // Clear 4 vector elements..
            v[id + 1] = 0.0;
            v[id + 2] = 0.0;
            v[id + 3] = 0.0;
        }
    }

Now, if you’ve got your old CPU sequential C++ programming hat on, this code looks better. It looks like a loop unrolling optimization that does 4 assignments in a row, and should run a lot faster.

Not at all! Everything is upside-down in parallel world (which is why Australians like it), because this kernel is actually:

  • Slower when run in parallel, and
  • Using non-coalesced memory accesses

Because each kernel thread is clearing 4 elements, we need to launch 4 times fewer threads, so it's much less parallel. And each thread is doing 4 accesses to global memory, one after another, in an apparently sequential algorithm. This is much less parallelization, except to the extent that the compiler and hardware can overlap the four independent stores within each thread. We really should expect this kernel to run about 4 times slower.

Furthermore, it's not a coalesced memory access pattern. Let's just examine the first warp of 32 threads. These will set the id variable to 0..31, which is then quadrupled to the values 0, 4, ..., 124 (31*4). This seems superficially like it might be a coalesced memory pattern, since the 4 assignments in each thread together iterate over a contiguous memory block of 128 vector elements, and ensure that elements v[0]..v[127] are all cleared properly.

But it’s not a coalesced access order. The first assignment v[id] in these 32 threads will set v[0], v[4], v[8], etc. This is not a coalesced pattern, since none of these addresses are adjacent.

The second assignment is also non-coalesced, which also reduces performance. Each of the threads will run it in lock-step, but v[id+1] will set v[1], v[5], v[9], etc. Similarly, the third and fourth assignments are not coalesced.

Strided Memory Accesses

The way to get coalesced memory accesses, if you can’t have each thread accessing only one address, is to have each thread “stride” through the array. In this way, each thread can do multiple memory access operations, but each step of the sequence has all the 32 threads in a warp accessing adjacent memory addresses, so as to achieve coalescing.

How does a stride work? We want to have 32 threads, each doing 4 assignments, but in such a way that each of the 4 assignments covers a contiguous range of 32 vector elements.

Obviously, we want the first assignment v[id] to set v[0]...v[31] for the 32 threads in the warp. The second assignment should set v[32]...v[63]. The third should cover 64..95 and the fourth has 96..127. If we look at just the first thread in the warp, this should set:

  • First assignment — v[0]
  • Second assignment — v[32]
  • Third assignment — v[64]
  • Fourth assignment — v[96]

You can see the pattern: each thread should advance by 32 vector elements between assignments. This value of 32 is called the stride of the array. Or, rather, the number of bytes is more properly called the stride of an array, but you get the idea.

Here’s our first attempt at a strided kernel with coalesced memory access patterns:

    __global__ void aussie_clear_vector_stride(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) {
            int stride = 32;
            v[id] = 0.0;  // Clear 4 vector elements.. STRIDED!
            v[id + stride] = 0.0;  // BUG!!
            v[id + 2*stride] = 0.0;
            v[id + 3*stride] = 0.0;
        }
    }

Well, this is indeed coalescing its memory accesses, but not always in the right place, because it's just a little buggy. The test “id<n” does not prevent array bounds overflow for the subsequent assignments such as “v[id+stride]”. Hence, this code writes past the end of the array whenever id is within bounds but id+stride, id+2*stride, or id+3*stride is not. It really needs an “if” safety test before each of the four assignments, rather than just the first.

The way to fix it is actually to use a “stride loop” idiom, where array bounds are checked on each iteration. Note that this code isn’t yet one of the fabled “grid-stride loops” that live in the GPU unicorn forest. This is just a basic stride loop, but not over the grid size.

    __global__ void aussie_clear_vector_stride_loop(float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = 32;
        for (int i = 0; i < 4; i++, id += stride) {
            if (id < n) {
                v[id] = 0.0;  // Clear 4 vector elements.. STRIDED!
            }
        }
    }

The loop version is basically the non-loop stride version re-rolled into a short loop. It does the same set of array accesses in the same order, but it is safe because the “id<n” test is performed on each iteration. This is a coalesced access order, as discussed above, because each assignment runs across a warp in lock-step. The contiguous memory addresses of v are processed like this by the 32 threads in the first warp: 0..31, 32..63, 64..95, and 96..127.
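
For comparison, here is a minimal sketch of the full grid-stride loop idiom discussed earlier, where the stride is computed from the launch configuration rather than hard-coded, so the same kernel handles any number of blocks, threads, and elements (the kernel name is illustrative):

    __global__ void aussie_clear_vector_grid_stride(float* v, int n)
    {
        int stride = blockDim.x * gridDim.x;   // Total threads in the whole grid
        for (int id = blockIdx.x * blockDim.x + threadIdx.x; id < n; id += stride) {
            v[id] = 0.0;   // Each pass is coalesced across the warp
        }
    }

With one thread per element this degenerates to the simple coalesced kernel above, and with a single thread in a single block the stride is one and the loop runs sequentially over the whole vector, which is the debugging property mentioned earlier.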

Contiguous Memory Blocks

CUDA C++ typically requires that matrices and tensors are stored in linearized contiguous memory, whether in CPU RAM or GPU VRAM. A critical part of optimizing backend engines is to manage the storage of data in a contiguous memory block, so that they have a sequential address space for high data locality and coalesced access patterns.

Processing chunks of data in parallel is the main optimization used in GPU acceleration. All of the vectors, matrices, and tensors need their underlying data in a block. For example, this applies to both pre-computed static weights and dynamic interim activation results in an AI engine’s computations.

Processing data that is in adjacent addresses is much faster than jumping all over the place for both host and device code. Vectors should obviously be stored in a simple contiguous array of memory. Less obviously, similar comments apply to the memory storage of matrices and tensors.

The use of contiguous memory is an important optimization for both sequential and parallel algorithms. The reasons that memory blocks are more efficient include:

  • Data locality (cache hits)
  • Data block GPU uploads (model weights from memory-to-cache)
  • Coalesced access patterns (GPU data access)
  • Predictive cache pipelining (in CPU sequential accesses)

Data locality refers to using data in the same or similar address locations. This is helpful for the cache hit rate, because data that is already in the cache is much faster to access than a non-cached RAM memory address.

GPU uploads from CPU RAM to the GPU’s RAM (VRAM) are done in blocks via cudaMemcpy. Obviously, we don’t want to be uploading random bits of data from different parts of the RAM.
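
For example, here is a minimal sketch of uploading one linearized matrix in a single block transfer (the names and sizes are illustrative, and error checks and cleanup are omitted for brevity):

    void upload_matrix_example()
    {
        const int rows = 1024, cols = 1024;
        size_t numbytes = (size_t)rows * cols * sizeof(float);

        float* h_matrix = new float[(size_t)rows * cols];  // One contiguous host block
        float* d_matrix = nullptr;
        cudaMalloc(&d_matrix, numbytes);                   // One contiguous device block

        // ... fill h_matrix ...

        cudaMemcpy(d_matrix, h_matrix, numbytes, cudaMemcpyHostToDevice);  // One block upload
    }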

Non-GPU architectures also benefit from the use of contiguous memory. This is obviously true of CPU SIMD instructions (e.g., AVX on x86), but even in sequential execution, the CPU has its own RAM caching methods and often has other optimizations of memory accesses. Predictive cache pipelining (hardware prefetching) is where the CPU attempts to predict what the next memory location will be, and loads it into the cache ahead of time, before being asked. This pipelining of memory accesses is much faster than waiting for each address lookup to complete in turn.

Typically, predictive cache pipelining uses the simple heuristic that the next sequential memory address is the most likely next request, which assumes that data is being processed in order of the addresses. Hence, scanning an array in reverse can defeat this heuristic on such CPUs. Similarly, jumping around to different memory addresses, such as scanning a column of a matrix using a large “stride,” is also inefficient on a CPU. GPUs are different from CPUs, and striding through an array can be a perfect optimization for coalescing in GPU kernel threads.

Fast Memory Block Operations

The slow way to do things in arrays is one element at a time. The faster way is to use the standard memory block functions on the whole array. There are a number of standard functions that operate on array data or memory blocks and they are very fast. Memory block operations in the standard C++ libraries are implemented using fast assembly language behind the scenes. The main functions in the standard C++ library that operate on binary bytes in a memory block are:

  • memset, cudaMemset — set bytes to a value, or clear bytes to zero.
  • memcpy, cudaMemcpy — copy bytes in a block (non-overlapping).
  • memmove — copy bytes, allowing overlapping blocks (no CUDA equivalent).
  • memcmp — compare two blocks of bytes (no CUDA equivalent).
  • memchr — search for a byte in a block (no CUDA equivalent).

Note that unlike the standard string functions (such as strlen), these functions do not assume a block is null-terminated by a zero byte. Zero is simply a binary value, and these functions don’t stop at a zero byte. All of these functions operate on a block of memory with a fixed byte length.

Each compiler environment typically offers some extra non-standard byte-wise functions that are also fast. Some of the less standardized C++ intrinsics that operate on memory blocks include:

  • _memccpy — copy bytes up to a specified sentinel byte.
  • memicmp / _memicmp — compare bytes ignoring letter case.
  • bcopy — copy a block of bytes.
  • bzero — clear a block of bytes to zero.
  • bcmp — compare bytes in two blocks.
  • _byteswap_uint64 — swap the bytes of an integer (Microsoft).
  • __builtin_bswap16 — swap the bytes in an integer (GCC), with 32-bit and 64-bit versions.

Memory Initialization Optimizations

There are several schools of thought with regard to initialization costs:

  • Don’t initialize (speedy) but run compute-sanitizer often!
  • Initialize everything to zero so it’s safe for everyone (slower).
  • Initialize to non-zero bytes to shake out more “use of uninitialized memory” type bugs.
  • Initialize debug versions with non-zero bytes, but compile-out for production (faster for customers).
  • Initialize to non-zero bytes for debug versions, but initialize to zero for production (safer for customers).

Making that choice is above my pay grade, but you’re free to choose whatever you like. It’s not even the full list, because the “shake out more bugs” idea has a few variants:

  • Regularly use compute-sanitizer in the nightly builds.
  • Use a debug library that quickly marks memory with magic values and detects bad accesses based on the magic numbers.
  • Use a full debug library that tracks all allocated addresses in a hash table.

Initialize memory blocks with cudaMemset and memset

The cudaMemset function is for initializing device memory (or host-accessible memory in Unified Memory), as in this typical call:

    cudaMemset(dptr, 0, n * sizeof(float));   // Initialize

The memset function is a lower-level function in standard C++ that sets all of a memory block to a byte value. It can be used in both host and device code, and is widely used as a fast way to initialize a block of memory to all zeros. Typical usage is therefore:

    memset(&x, 0, sizeof(x));   // Initialize

Almost all usages of cudaMemset or memset will be for the zero byte. The only other usage I’ve seen is to fill memory with a dummy non-zero byte as a debugging tactic to catch uses of uninitialized memory. Here’s an example:

    cudaMemset(dptr, 0x55, sz);  // Magic value for debug

memset sizeof problem. Here’s a common glitch in using memset inside functions with an array parameter:

    void zero_array(int arr[10])
    {
        memset(arr, 0, sizeof(arr));  // Bug
    }

The problem is not memset, but the sizeof operator on function parameters. An array parameter in a function is like a hologram and isn’t really there. It’s not really an array, but a pointer: a parameter declared as int arr[10] is really just int* arr. Hence, sizeof(arr) is probably only 4 or 8, rather than the 40 bytes of the whole array, leaving most of the array uninitialized, which is an insidious bug.
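
One way to avoid this pitfall, sketched below, is to pass the element count explicitly, or to take the array by reference so the compiler keeps its real size (the function names are illustrative):

    #include <cstring>    // memset
    #include <cstddef>    // size_t

    void zero_array_fixed(int* arr, size_t n)
    {
        memset(arr, 0, n * sizeof(int));   // Correct: the caller supplies the count
    }

    template <size_t N>
    void zero_array_ref(int (&arr)[N])     // Reference-to-array preserves the length
    {
        memset(arr, 0, sizeof(arr));       // Here sizeof(arr) really is N * sizeof(int)
    }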

Personally, I also recommend a memset debug wrapper function to catch this kind of problem at runtime. Alternatively, maybe a tricky preprocessor macro can detect it at compile-time with a static_assert somehow. These techniques are discussed in my book CUDA C++ Debugging.

memset portability issue. Even though it’s a fast zeroing method, the use of memset to zero bytes has an obscure portability problem on any architecture where all-bytes-zero is not the representation of zero for every data type. However, on most standard platforms, all-bytes-zero is correct for all the basic types: integer zero (regardless of endianness), floating-point zero (positive zero is all bits zero in IEEE 754), and the null pointer.
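
As an illustration, here is a hedged sketch of the two approaches for an array of pointers; on mainstream platforms both produce the same result (the function name is illustrative):

    void clear_pointer_array()
    {
        float* ptrs[8];

        memset(ptrs, 0, sizeof(ptrs));    // Fast: all-bytes-zero (fine on common platforms)

        for (auto& p : ptrs) {            // Fully portable: assigns the true null pointer
            p = nullptr;
        }
    }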

Fast memory block copying with cudaMemcpy and memcpy

The fast way to copy an entire memory block is with cudaMemcpy or memcpy. The cudaMemcpy API is widely used in CUDA C++ programs to copy data between host and device.

The memcpy function tends to be used at a lower level for data manipulation. This applies to copying vectors and to matrices and tensors that are linearized into a contiguous array. Rather than copy each element of an array, one at a time, in a loop, the memcpy standard library function can be used to copy the entire block in one statement:

    memcpy(destarr, srcarr, sizeof(srcarr)); // Copy bytes

Note that our use of the sizeof operator here is risky if the variable is an array parameter of a function, because sizeof will then return the smaller value representing the size of a pointer.

Note that memcpy is a bitwise copy of the array intended for simple data types. For example, it won’t run C++ copy constructors if applied to an array of objects.

The memcpy function does a very fast memory block copy. It is like strcpy in that the destination is the first parameter, but memcpy will copy everything, even null bytes and hidden padding bytes, and will always copy a fixed number of bytes.

memcpy overlapping blocks error: memcpy is a super-fast byte copy, but is unsafe, because it does not have well-defined behavior if the source and destination blocks overlap. Note that cudaMemcpy does not seem to have this problem (there’s no cudaMemmove function), but overlapping parameter ranges might be a coding bug anyway. And don’t worry, cudaMemcpy doesn’t get left out, because it has enough of its own problems, like ensuring the pointers are the right types, and the copy mode is the correct direction.

The only downside with memcpy is that it can fail with overlapping ranges for the source and destination blocks, so if you are shuffling arrays up or down one element using memcpy, then you have to be careful, because the results on overlapping ranges are undefined. Here’s a buggy example of using memcpy to remove the first character of a string in place:

    memcpy(s, s+1, strlen(s+1)+1);  // Bug

The problem is that the blocks starting at “s” and “s+1” overlap, so the behavior is undefined, and whether this code runs correctly depends on the library implementation. The fix is simply to use memmove, which always works correctly for overlaps:

    memmove(s, s+1, strlen(s+1)+1);  // Correct

The memmove function is a safer version of memcpy, which also works correctly if the memory blocks overlap. If the source and destination blocks don’t overlap, it’s the same as memcpy, except probably slightly slower. If they do overlap, then memmove conceptually will copy the source to a temporary area, and then copy it to the destination block.

memcmp byte comparisons

The memcmp function does a byte-wise comparison of two memory blocks. There’s no CUDA equivalent (i.e., no cudaMemcmp function), but an alternative is the element-wise equality test thrust::equal.
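
For example, here is a minimal sketch of comparing two device arrays with thrust::equal (the function and vector names are illustrative):

    #include <thrust/device_vector.h>
    #include <thrust/equal.h>

    bool device_vectors_equal(const thrust::device_vector<float>& a,
                              const thrust::device_vector<float>& b)
    {
        return a.size() == b.size()
            && thrust::equal(a.begin(), a.end(), b.begin());  // Element-wise test on the GPU
    }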

The memcmp return value is like strcmp, returning 0 for equality, and a negative or positive value otherwise. However, note that memcmp is not like strcmp, and will not stop when it finds a zero byte.

memcmp return value. A pitfall with memcmp is that you cannot assume that it returns 1 or -1, but must compare the return result to zero (like the strcmp function).

    if (memcmp(&a, &b, sizeof(a)) == 1)  // Bug
    if (memcmp(&a, &b, sizeof(a)) > 0)   // Correct

memcmp object equality testing bug. Looking at the memcmp function, you might think of it as an opportunity to do a fast equality/inequality test on large objects by simply doing a byte-wise test. You would not be the first to think that.

Unfortunately, there are several obscure pitfalls with this approach. There are multiple ways in C++ that the data in two memory blocks might be “identical” in a programming sense, while the underlying bytes differ:

  • Padding bytes (alignment)
  • Two types of floating-point zero
  • Multiple types of floating-point NaN
  • Bitfield data members (unset padding bits)

You can only use memcmp for memory block comparisons (e.g., tensor equality) if you’re sure that these situations cannot occur, such as a contiguous array of primitive data types. Depending on context, you may or may not want floating-point zero and NaN values to map as equal. Take care with this!
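
As an illustration, here is a hedged sketch of the padding-bytes pitfall; the exact padding depends on the compiler’s alignment rules, so the result may vary (the struct and function names are illustrative):

    struct Rec { char tag; int value; };    // Typically 3 padding bytes after 'tag'

    bool padding_pitfall_demo()
    {
        Rec a, b;
        memset(&a, 0xAA, sizeof(a));        // Different garbage in the padding bytes
        memset(&b, 0xBB, sizeof(b));
        a.tag = b.tag = 'x';                // Every named member is equal...
        a.value = b.value = 42;
        return memcmp(&a, &b, sizeof(a)) == 0;  // ...but this may report "not equal"
    }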

 
