Chapter 8. CUDA Memory Optimizations
-
Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
-
by David Spuler
Memory Optimization Techniques
Memory processing is often an important part of optimizing CUDA C++ kernels. The costs of memory arise from:
- Memory access costs
- Memory transfer costs
Here are some of the main techniques:
- Using faster types of memory
- Reducing overall memory usage
- Coalescing and striding memory access patterns
- Caching and data locality
- Overlapping data transfers and computation
- Avoiding memory access contention
- Optimizing allocated memory
Let’s examine them in more detail below.
CUDA Memory Hierarchy
The GPU has several layers of memory storage with different speeds and sizes. Hence, the simple idea behind many CUDA memory optimizations:
Use the fastest type of memory!
The hierarchy of CUDA memory types, from fastest to slowest, is roughly:
- Scalar constants and numerics (no memory)
- Registers
- Shared memory (block scope) and the L1 cache
- L2 cache
- Local memory (including the stack; physically in device memory, but cached)
- Constant memory (global, but read-only, with its own cache)
- Global memory
Unified Memory. Note that CUDA's Unified Memory capability is not a distinct type of physical memory. It refers to the software capability whereby CUDA allows both the host and the device to access the same memory address space, managed behind the scenes by the CUDA Runtime and the GPU hardware. This benefits both performance and programmer productivity by allowing automatic optimization of certain types of memory transfers and data flow in an application.
Warp memory? Note also that there's not really any warp-specific type of memory. The famous "warp shuffle" CUDA intrinsics do not have a separate memory mechanism in the normal sense, but are actually a register-to-register transfer of data between threads in the warp, controlled by criteria such as which threads in the warp are participating and what data movement or operation the shuffle performs. These warp-scope operations are a way to avoid using shared memory via warp-wide data exchange, but they don't use a distinct warp-only type of memory.
Numeric Constants. Constant scalar values such as number constants and named symbolic constants in CUDA C++ are very efficient. These are examples:
#define MYCONST 32
const int myconst = 32;
These constants do not consume memory, as they are not really stored anywhere. If you examine the PTX assembly listings (using the "-keep" compiler option), you can see that they are "stored" in the instruction stream itself, as operands of the GPU machine-code instructions. However, larger constants such as string literals are stored in constant memory.
This means that if you can make anything a constant, you should absolutely do so. Rather surprisingly, the best example of using constant numbers is actually AI and LLMs. All of an LLM's matrices and vectors have fixed sizes, known at compile-time, so these dimensions can be used as constants in the inference engine. Furthermore, putting constant values into the code allows the compiler's optimizer to perform many additional optimizations (e.g., constant folding, constant propagation, constant expression evaluation). Hence, CUDA C++ programmers writing AI engine backends should use constants for these values wherever possible, rather than trying to generalize the code for all sizes.
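For example, here is a minimal sketch (with a hypothetical kernel and a made-up embedding size) of fixing an LLM dimension at compile time, so the compiler can fold it into the generated code:
constexpr int EMBED_DIM = 4096;  // Hypothetical model dimension, fixed at compile time

__global__ void add_bias_vec(float *v, const float *bias)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < EMBED_DIM)    // Compile-time constant: no extra kernel parameter needed
        v[i] += bias[i];
}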
Alas, most variables cannot be made into constants, but must be stored in some type of memory. Memory optimizations apply to any of these various levels of memory. Let’s examine each of the types of memory in turn.
Registers
The compiler does a great job of putting all of the simple variables in a CUDA C++ kernel into registers. The way to declare a variable that will usually land in a register is simply:
int x;
Hence, the typical index computations, and the builtin variables
like blockIdx and threadIdx, are all stored in GPU registers.
You can check this by examining the PTX assembly code
using the “-keep” compiler option.
Registers are the fastest type of memory on the GPU, and they are specific to a thread. However, only a limited number of registers is available, and a "register spill" occurs when a kernel's variables exceed the available registers and must be spilled to local memory instead.
You can review the GPU register limits by calling cudaGetDeviceProperties
using the regsPerBlock property, such as:
// Maximum registers per block
int device_number = 0;
cudaDeviceProp prop;
CUDACHK( cudaGetDeviceProperties(&prop, device_number) );
int kreg = (int)prop.regsPerBlock / 1024;
printf("Max Registers (whole GPU): %d registers (%dK)\n", (int)prop.regsPerBlock, kreg);
Technically, this is the “maximum” registers a block can use, but if you used this many, there’d be no other registers for other workloads. The registers are shared by all blocks running on a multiprocessor. Hence, you need to manage the registers across all your kernels, and any other workloads running on the same GPU.
Avoiding Register Spills
Methods to reduce the number of registers, and reduce the likelihood of register spills into local memory, include:
- Minimize temporary variable usage (e.g., in index or stride calculations)
- Minimize kernel function parameters
- Limit the scope and lifetime of any local variables
- Be careful with loop unrolling
- Use compile-time optimizations (e.g., const, constexpr, etc.)
If you want to know about registers, there's a "-v" option to the ptxas PTX assembler
that reports each kernel's register usage (and also its shared memory usage).
The option you need to add to the nvcc compiler is:
--ptxas-options=-v
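For example, a compile command with this option (assuming a source file named mykernel.cu) might look like:
nvcc --ptxas-options=-v -c mykernel.cu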
There’s some irony here: the original versions of C and C++ languages had
a “register” keyword, that allowed specification of which local variables
should be stored in CPU registers.
This was long before GPUs even existed.
But it's no longer in the official C++ language, having been deprecated in C++11 and removed in C++17,
because compilers became better than humans
at deciding which CPU registers to use.
And yet, here we are talking about GPU register allocation.
CUDA programmers want the register keyword back!
Local Memory and the Stack
Each thread on the GPU has a limited amount of local memory, which holds the thread's execution stack and any local variables that do not fit in registers. Local memory is not as fast as registers (it physically resides in device memory), but it is cached in L1 and L2, so spills are not as costly as uncached global memory accesses. You can simply declare local variables in CUDA C++ device functions, and they are automatically allocated to either registers or local memory.
The size of the local memory for the “stack” is a limit property of the GPU that you can check (and set). Here’s an example of printing it out:
// Get stack size
size_t val = 0;
CUDACHK( cudaDeviceGetLimit(&val, cudaLimitStackSize) );
double kb = val / ((double)1024);
printf("Device: stack size = %d bytes (%3.2f KB)\n", (int)val, kb);
You can also set the stack size using cudaDeviceSetLimit.
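Here's a minimal sketch of setting it (the 2KB value is just an illustrative assumption, not a recommendation):
// Set the per-thread stack size to 2KB (hypothetical value)
CUDACHK( cudaDeviceSetLimit(cudaLimitStackSize, 2 * 1024) );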
Another optimization method is to use the alloca function
for dynamically allocated stack memory.
This memory can be faster than global memory when it stays resident in the cache,
but there's only a limited amount, and it only has thread scope,
being on the stack of each thread.
If you want to share data between threads without using shared memory,
you'll have to use warp shuffle or warp reduce primitives.
Shared Memory
Shared memory is declared using the “__shared__” double-underscore specifier.
This makes it very convenient to declare and access,
but it’s not so simple to use efficiently.
There are two types of shared memory: static and dynamic.
You can declare a static block of shared memory, with size known at compile-time,
by using __shared__ in a thread-level local variable declaration,
such as:
__global__ void mykernel()
{
    __shared__ float myshared[128]; // Static
    // ...
}
You can declare "dynamic" amounts of shared memory as a parameter to a kernel launch, which specifies the size in bytes. The GPU allocates that amount of shared memory (if it's not too large), and links it with a variable declared in the kernel code. For example:
int sharedsz = 128 * sizeof(float); // Bytes
mykernel <<< blocks, threads, sharedsz >>> (); // Dynamic
Note that the above kernel launch syntax does not use __shared__.
Instead, you need to declare a variable at file scope in your device code with both __shared__ and extern (and without a fixed size),
such as:
extern __shared__ float myshared[]; // Dynamic

__global__ void mykernel()
{
    // ...
}
Obviously, for CUDA to connect the two parts,
just like in Highlander, there can be only one extern shared memory variable.
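Putting the two parts together, here's a minimal sketch of a kernel that stages data in dynamic shared memory (the block-wide reversal is just a placeholder computation, launched with a single block of n threads):
extern __shared__ float buf[];   // Dynamic shared memory, sized at launch

__global__ void reverse_block(float *data, int n)
{
    int i = threadIdx.x;
    buf[i] = data[i];            // Stage this block's data in shared memory
    __syncthreads();             // Wait until all threads have finished loading
    data[i] = buf[n - 1 - i];    // Read it back in reverse order
}

// Launch with n threads and n*sizeof(float) bytes of dynamic shared memory:
// reverse_block <<< 1, n, n * sizeof(float) >>> (device_ptr, n);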
Efficiency of shared memory. Using shared memory is faster than using global memory, and it has the scope of all the threads in a block. That block-level scope is more restricted than global memory, which can be accessed by all blocks, but within a block shared memory can be very efficient. However, warp-level shuffle or reduction primitives can often be used for even faster kernels.
Shared memory size. There is only a limited amount of shared memory available to each block. And I mean small, like I’m back in 1982 programming a Commodore 64, because it can be as little as 48K. Yeah, I know, that looks like a typo, where I should have written 48G. Imagine trying to sell an iPhone with only 48K.
On some of the GPUs, the shared memory and L1 cache have to split the available memory. If there's only 64K per multiprocessor, this has to be partitioned between L1 cache and shared memory (i.e., 16K L1 cache and 48K shared memory, or 16K shared memory and 48K L1 cache, or 32K-32K if you want to be trendy). You need to treat shared memory like it's gold. And speaking of gold, I actually had a TRS-80 CoCo that wasn't rose gold, but I digress.
You can manage this split between shared memory and L1 cache as a runtime policy.
In CUDA C++ from the host code,
you can use the APIs
cudaDeviceSetCacheConfig (all kernels)
or cudaFuncSetCacheConfig (one kernel).
I’m not going to give you a code snippet, because you can just look it up
on Stack Overflow anyway.
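(Okay, fine, here's a minimal sketch anyway, assuming a kernel named mykernel and that you want to favor shared memory over L1:)
// Prefer shared memory over L1 cache for all kernels...
CUDACHK( cudaDeviceSetCacheConfig(cudaFuncCachePreferShared) );
// ...or for just one kernel:
CUDACHK( cudaFuncSetCacheConfig(mykernel, cudaFuncCachePreferShared) );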
Anyway, here are some of the GPU shared memory sizes:
- B100/B200 — 128K
- A100 — 164K
- V100 — 64K
Yes, it seems that the recent B100/B200 GPUs (Blackwell) have less shared memory than an A100 (Ampere). I’m sure there’s a very good reason for that, probably something to do with optimizing the transistors on the blah blah blah. I’m not a hardware engineer.
Anyway, that shared memory is per multiprocessor,
so it’s not quite global across the entire GPU
for everyone.
You can find the shared memory size of a GPU in CUDA C++
using the cudaGetDeviceProperties function
with the sharedMemPerBlock property.
Here’s an example:
// Shared memory per block
int device_number = 0;
cudaDeviceProp prop;
CUDACHK( cudaGetDeviceProperties(&prop, device_number) );
double sharedkb = (double)prop.sharedMemPerBlock / (1024.0f);
double sharedmeg = (double)prop.sharedMemPerBlock / (1024.0f * 1024.0f);
double sharedgig = sharedmeg / 1024.0f; // In your dreams!
printf("Shared memory: %ld bytes (%3.2f KB, %3.2f MB)\n",
(long int)prop.sharedMemPerBlock, sharedkb, sharedmeg);
However, you cannot increase it!
There’s no cudaSetDeviceProperties API in CUDA,
because these are a GPU’s read-only hardware properties.
Bank Conflicts (Shared Memory Contention)
Shared memory can suffer performance degradation if there is contention in accesses to shared memory from multiple threads. These are called “bank conflicts” and should be avoided for better performance. Take care with the memory access patterns when using shared memory to avoid this performance pitfall.
Bank conflicts occur when multiple threads attempt to access the same “bank” of shared memory at the same time. This means that each thread must access a separate bank, or else the accesses become serialized.
I feel like I'm in opposite world! Didn't we learn that "coalesced memory access patterns" need adjacent memory addresses in global memory? But now it's the exact reverse for shared memory? Adjacent addresses are the worst?
Nope! Wrong!
Actually, successive memory addresses map to different banks. That’s not how I would think of “banks” and maybe they should have called it “stripes” or something involving animals (zebras?).
But that's actually good news, because it means that our optimizations whereby each thread in a warp accesses an adjacent memory address will still work. That means: grid-stride loops work for shared memory! Well, actually the "stride" won't be the grid size, but now who's being picky? So, we'll name them "zebra loops" instead, or "gazelles."
Note that it won't work as well if the threads are accessing adjacent elements
that are wider than 32 bits (the default bank width), such as a double, which is 64 bits.
In this case, a thread that accesses a 64-bit value can trigger a bank conflict
by touching two 32-bit locations across two banks (I mean, zebras).
Fortunately, there are CUDA C++ runtime APIs to change it:
cudaDeviceSetSharedMemConfig with policies
cudaSharedMemBankSizeFourByte
or cudaSharedMemBankSizeEightByte.
I’ll let you figure out which is which.
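Here's a minimal sketch, assuming your kernel mostly works on 64-bit doubles:
// Use 8-byte shared memory banks when the data is mostly doubles
CUDACHK( cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte) );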
Another way to avoid bank conflicts is to change from the use of shared memory to warp-level primitives. These include “warp shuffle” and “warp reduce” primitives, which are very fast, but are limited in scope to the warp (32 threads), whereas shared memory has block scope level.
Constant Memory for Read-Only Data
Constant memory is declared using the “__constant__” special double-underscore specifier.
Another optimization for memory usage is
to use the faster constant memory for read-only data.
However, there is a much more limited size of constant memory
compared to the read-write global memory.
Constant data can be declared at global scope in device code by adding __constant__ to a global
variable declaration:
__constant__ float constarr[128];
In the above example, the variable is an array. Although this is the most common usage, it does not need to be of array type.
In the device code, you can just access this variable by its name.
Note that __constant__ relates to the memory address and its read-only nature in the device.
This specifier does not imply const or constexpr or other similar specifiers.
However, in host code you cannot just use the name of the variable, because that is a device address. Instead, you need to use the CUDA APIs to get its generalized address in Unified Memory.
From the host side, constant memory can be accessed via the CUDA runtime APIs,
such as cudaMemcpyToSymbol and cudaMemcpyFromSymbol.
It is also possible to use cudaGetSymbolAddress to get the address
in Unified Memory and then use cudaMemcpy with this address.
Interestingly, both methods can be used to modify the variable in the device memory. Umm, this means that the host can modify constant memory on the device. When is a constant not a constant?
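For example, here's a minimal sketch of initializing the constarr example above from the host (using hypothetical host-side data):
float hostdata[128];   // Host-side source data (hypothetical)
// ... fill hostdata ...
CUDACHK( cudaMemcpyToSymbol(constarr, hostdata, sizeof(hostdata)) );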
The size of constant memory is small, as low as 64KB,
even on recent GPUs.
You can print out the size of the constant memory on the GPU
using the totalConstMem property from cudaGetDeviceProperties:
// Constant memory size
int device_number = 0;
cudaDeviceProp prop;
CUDACHK( cudaGetDeviceProperties(&prop, device_number) );
double constkb = (double)prop.totalConstMem / (1024.0f);
printf("Constant memory: %ld bytes (%3.2f KB)\n", (long int)prop.totalConstMem, constkb);
Global Memory
Global memory is memory on the GPU device that can be accessed by any kernel.
It has an address in Unified Memory that can be used to access it.
The most common method is to use cudaMalloc on the host to create allocated global
memory on the device.
float *device_ptr = NULL;
int sz = n * sizeof(float); // bytes
CUDACHK( cudaMalloc((void**)&device_ptr, sz) );
After the above code sequence, a global block of memory has been allocated on the device, and its generalized address has been passed back to the host code. The host code cannot directly examine this memory:
float val = device_ptr[0]; // Fails on host
However, the host code can use this address in other CUDA API calls
that require the device pointer, such as cudaMemcpy or cudaMemset.
The device code cannot immediately use the allocated memory in "device_ptr" for the simple reason that it doesn't have the address. It's only on the host side. For this reason, it's typical for a host-side kernel launch to pass this address to the device as a function parameter to the kernel function. For example:
mykernel <<< blocks, nthreads >>> (device_ptr, n);
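For completeness, here's a minimal sketch of the receiving kernel (the doubling is just a placeholder computation):
__global__ void mykernel(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // Global thread index
    if (i < n)
        v[i] *= 2.0f;   // The device pointer behaves like a normal array on the device
}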
Optimizing global memory. Global memory is the largest and also the slowest, so consider any algorithmic changes that can reduce the amount of data stored globally. Use of global memory should be replaced with shared memory, local memory, constant memory, or register usage where possible. In other words, use anything else!
Other optimizations such as the smaller data sizes in quantization will inherently reduce global memory usage. Note that “global” memory is not available to everyone else running across the GPU, but only to your application (well, we hope so anyway).
You can print out the global memory size for your GPU via the cudaGetDeviceProperties API
with the totalGlobalMem property:
// Total global memory on GPU
int device_number = 0;
cudaDeviceProp prop;
CUDACHK( cudaGetDeviceProperties(&prop, device_number) );
double totalmeg = (double)prop.totalGlobalMem / (1024.0f * 1024.0f);
double totalgig = totalmeg / 1024.0f;
printf("Global memory: %ld bytes (%3.2f GB)\n", (long int)prop.totalGlobalMem, totalgig);
Modern NVIDIA GPUs have many dozens of gigabytes of global memory. Here’s a list of some of them:
- B100/B200 (Blackwell) — 192GB
- H100 (Hopper) — 80GB
- A100 (Ampere) — 80GB
- V100 (Volta) — 16GB or 32GB
- P100 (Pascal) — 16GB
Caching and Data Locality
Data Locality. This is a method of optimizing data cache accesses to speed up memory accesses. A good example of this speedup is tiled matrix multiplication, which does small local computations on 2-D “tiles” within a matrix.
Memory Cache Optimizations. There are multiple levels of hardware caching for memory in a GPU. Various optimizations aim to maximize cache hits, so that memory accesses are as fast as possible. Optimizations with data locality are based upon speeding up the cache by increasing the cache hit percentage.
L2 Cache Optimizations. The L2 cache is fast and one optimization is to aim to maximize its use. One possibility is to modify the kernel so as to ensure that the total amount of data being used fits in the L2 cache in its entirety.
The size of the L2 cache can be reported programmatically
via cudaGetDeviceProperties and the l2CacheSize property:
int device_number = 0;
cudaDeviceProp prop;
CUDACHK( cudaGetDeviceProperties(&prop, device_number) );
double l2kb = (double)prop.l2CacheSize / (1024.0f);
double l2meg = (double)prop.l2CacheSize / (1024.0f * 1024.0f);
printf("L2 cache memory: %ld bytes (%3.2f KB, %3.2f MB)\n",
(long int)prop.l2CacheSize, l2kb, l2meg);
Memory Prefetching. The GPU has various hardware capabilities of “prefetching” data from memory. This involves making a reasonable guess as to what address will next be required by the kernel. One simple heuristic is an adjacent address to the most recently accessed memory. Thus, optimizations such as data locality and coalesced memory access patterns may benefit from this hardware optimization behind-the-scenes.
CUDA C++ has the cudaMemPrefetchAsync runtime API,
which can be used for prefetching
at a relatively high level
in Unified Memory.
For example, this can be used to prefetch data to be ready
for a kernel.
However, this is not a way to control low-level memory caches.
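Here's a minimal sketch of prefetching managed (Unified Memory) data to the GPU before a kernel launch (the device number and sizes are illustrative assumptions):
float *data = NULL;
size_t bytes = n * sizeof(float);
CUDACHK( cudaMallocManaged(&data, bytes) );   // Unified (managed) memory
// ... initialize data on the host ...
int device = 0;
CUDACHK( cudaMemPrefetchAsync(data, bytes, device, 0) );  // Prefetch to GPU 0 on the default stream
mykernel <<< blocks, nthreads >>> (data, n);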
The GCC compiler has an intrinsic function __builtin_prefetch
that can be used by the programmer to give hints to the data prefetching algorithm.
However, this isn’t currently supported in CUDA C++ device kernels,
where it gives a compilation error.
Hence, it can only be used in host code, when it is passed through to GCC on Linux.
A complicated method of using simple C++ assignments to local variables
can be used to prefetch data into registers, for fast computation in threads.
The details of this approach are discussed in Van der Wijngaart and Oh (2022), cited in the references below.
The general method is based on using
assignments to four double local variables of the values from four global memory locations,
analogous to:
double d0 = v[0];
double d1 = v[1];
double d2 = v[2];
double d3 = v[3];
Their actual method is more complicated than this, but also more efficient. The kernel then uses the local variables for its computations (i.e., they'll be stored in registers). As an alternative approach, the article also considers the use of shared memory for prefetching optimizations, but found problems with bank conflicts.
Tiled Memory Access Algorithms. Tiling is operating on a small subset of the data space, such as a two-dimensional “tile” (or a three-dimensional “block” in tensors). The gain from “tiling” an algorithm is from increased data locality and reduced storage of temporary data, which makes much better cache utilization. The classic example is tiled matrix multiplication algorithms, but many kernels can benefit from this approach.
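As a concrete illustration, here's a minimal sketch of a tiled matrix multiplication kernel (assuming square n x n matrices with n a multiple of the tile width, and a launch grid of (n/TILE, n/TILE) blocks of TILE x TILE threads):
#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float Asub[TILE][TILE];   // Tile of A in shared memory
    __shared__ float Bsub[TILE][TILE];   // Tile of B in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; t++) {
        // Each thread loads one element of each tile from global memory
        Asub[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        Bsub[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                 // Wait for the whole tile to be loaded

        for (int k = 0; k < TILE; k++)   // Multiply-accumulate within the tile
            sum += Asub[threadIdx.y][k] * Bsub[k][threadIdx.x];
        __syncthreads();                 // Wait before overwriting the tiles
    }
    C[row * n + col] = sum;
}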
Memory-Aware AI Attention Algorithms. The self-attention module in AI engines is known to be memory-bound. Hence, there are several cutting edge algorithms that aim to reduce the memory accesses during attention computations, including:
- Multi-Query Attention (MQA)
- Group Query Attention (GQA)
- Flash Attention (already with versions 1, 2, and 3)
- Paged Attention
There is also the combination: Paged Flash Attention.
Reverse Block Processing. There are some types of algorithms whereby better cache utilization occurs by having the blocks process the data in reverse order. However, these are the exception, not the rule, and reverse accesses can sometimes worsen performance by undermining data prefetch caches.
Memory Size Reduction
If you’ve got a lot of data, it’s going to consume a lot of memory, especially on the GPU. There are various ways to reduce the amount of memory needed to store the information.
Smaller Data Sizes (Quantization). Using smaller data sizes is cheaper to compute and also uses less memory. This is best known in “quantization” of LLMs from 32-bit floating-point to smaller sizes such as 8-bit integers or even 4-bit integers. If the data is smaller, both the access and transfer costs are lower (and also computation costs).
Are you using CUDA C++ for an LLM backend with quantization? Generally speaking, the following notes apply to quantization levels in AI applications, when reducing FP32 (32-bit floating-point) to smaller data types in the LLM:
- FP16 quantization (16-bit floating point) — this is standard, with 50% memory gain and minimal decline in accuracy.
- INT16 quantization (16-bit integers) — also very effective and accurate, plus allowing integer arithmetic.
- INT8 quantization (8-bit integers) — very commonly used for a four-fold memory reduction with a slight accuracy loss.
- INT4 quantization (4-bit nibbles) — surprisingly, this is widely used for an eight-fold size reduction and acceptable accuracy loss in many applications.
- Binary quantization (1-bit) — very fast because it can be implemented with addition replacing multiplication, but is generally regarded as having too large accuracy degradation.
- Ternary quantization (1.5-bit) — like binary, this is very fast via addition, but inaccurate.
- INT32 quantization (32-bit integer) — not widely used, as offers no memory gain, and integer multiplication is not much faster than floating-point multiplication on modern hardware.
These are the main ones in industry practice, but research has numerous other options, such as FP8 and FP4 (low-bit floating-point), and also other integer sizes such as 2-bit (INT2), or weird bit sizes like 3-bit, 5-bit, and so on. There does seem to be some promise in techniques involving FP8, since some GPUs now support FP8. Also deserving more attention is INT2 quantization, which can be implemented as additions (one or two), and is more accurate than binary or ternary quantization. Finally, here's a weird wrinkle: sometimes floating-point addition is slower than floating-point multiplication, because of IEEE 754 oddities.
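As a concrete illustration of the INT8 case, here's a minimal sketch of symmetric quantization of an FP32 vector (the scale parameter would normally be computed from the data's maximum absolute value):
// Quantize FP32 values to INT8 using a per-tensor scale: q = round(x / scale)
__global__ void quantize_int8(const float *in, signed char *out, float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float q = rintf(in[i] / scale);        // Round to nearest integer
        q = fminf(fmaxf(q, -127.0f), 127.0f);  // Clamp to the INT8 range
        out[i] = (signed char)q;
    }
}
// Dequantization is the reverse: x is approximately scale * q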
Reduce Object Sizes.
If you’re using any struct or class objects in your kernels, there are some ways to
play with their in-memory size.
The first point: print it out!
Use the sizeof operator to know which objects are large,
and also whether your changes make any difference.
The various techniques to get smaller structures and objects in C++ include:
- Biggest data members first (rule-of-thumb is largest to smallest).
- Try different orders of data members (i.e., consider “packing” and alignment issues).
- Bitfields (use type unsigned so you don't need a sign bit).
- Small data types: review bool and enum versus int, short, or unsigned char.
- Unions (if you must).
Note that these changes can affect speed. Bitfields have runtime cost for packing and unpacking. The first object in a class is often fastest to access, because its offset is zero (optimized away). Hence, moving the largest data member first may interfere with the “most frequently used data member first” speed optimization. Also, fiddling with ordering can affect data locality, either for better or for worse.
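Here's a small illustration of why member ordering matters (the exact sizes depend on the compiler and platform; 24 and 16 bytes are typical on 64-bit targets):
struct BadOrder {       // Typically 24 bytes
    char flag;          // 1 byte + 7 bytes padding (to align the double)
    double value;       // 8 bytes
    char tag;           // 1 byte + 7 bytes trailing padding
};

struct GoodOrder {      // Typically 16 bytes
    double value;       // 8 bytes
    char flag;          // 1 byte
    char tag;           // 1 byte + 6 bytes trailing padding
};

printf("%d vs %d bytes\n", (int)sizeof(struct BadOrder), (int)sizeof(struct GoodOrder));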
Advanced Memory Optimization Techniques
Warp Shuffle and Warp Reduction. Algorithms can avoid using shared memory by using the much faster warp data-sharing intrinsics. There are various types, including warp shuffle primitives and horizontal reductions with warp reduction intrinsics.
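For example, here's a minimal sketch of a warp-level sum reduction using the shuffle intrinsics, which needs no shared memory at all:
// Sum a value across all 32 threads of a warp using register-to-register shuffles
__inline__ __device__ float warp_reduce_sum(float val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);  // Add the value from the lane 'offset' higher
    return val;   // Lane 0 ends up holding the warp's total
}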
Memory alignment. There are several cases where aligned memory accesses are more efficient than non-aligned addresses. This is now less important in advanced GPUs than in older architectures, but is still a consideration. Generally, accessing memory via a non-aligned address does not cause a crash or any error, but can slow down the request. This was once a serious efficiency problem, but is less important when optimizing on the more recent GPU architectures.
Texture and Surface Memory Optimizations. Although originally designed for graphics processing, the capabilities of texture memory can be used to optimize any algorithm. The advantages of texture memory include that it is cached and does not suffer from poor performance for non-coalesced access patterns. If your algorithm requires more unpredictable access to data, this may be a good option rather than non-coalesced access to other types of memory.
Kernel Fusion. This optimization merges two separate kernels into one kernel. This does not usually reduce computation, because the arithmetic from both kernels must still be performed. However, kernel fusion reduces memory accesses by avoiding the storing and re-loading of any temporary data that is created by the first kernel and used by the second kernel. Hence, the total cost of memory access is significantly reduced.
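Here's a minimal sketch of the idea using two hypothetical element-wise kernels: unfused, the intermediate result is written to and re-read from global memory; fused, it never leaves a register.
// Unfused: two kernels, with y stored to global memory and then re-loaded
__global__ void scale_vec(float *y, const float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}
__global__ void add_bias(float *y, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += b;
}

// Fused: one kernel, one global read and one global write per element
__global__ void scale_add_bias(float *y, const float *x, float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + b;
}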
References
- Yuan Lin and Vinod Grover, Jan 15, 2018, Using CUDA Warp-Level Primitives, NVIDIA Technical Blog, https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
- Rob Van der Wijngaart and Fred Oh, Mar 23, 2022, Boosting Application Performance with GPU Memory Prefetching, NVIDIA Technical Blog, https://developer.nvidia.com/blog/boosting-application-performance-with-gpu-memory-prefetching/
- Mark Harris, Jan 28, 2013, Using Shared Memory in CUDA C/C++, NVIDIA Technical Blog, https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/