Chapter 11. CUDA Heap Memory Allocation
-
Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
-
by David Spuler
Heap Memory Optimizations
In an ideal world, we wouldn’t use memory allocation inside GPU kernels. In the real world, there are optimization techniques for that. These techniques can also be used for host code, because the issues with dynamic memory allocation are similar for CPU and GPU code.
Larger Device Heap Size.
You can increase the heap memory size for GPU kernels via the cudaDeviceSetLimit function
with the cudaLimitMallocHeapSize property.
But maybe sit on your hands before you do this,
because needing a bigger heap is a reminder that perhaps your kernel should use less heap memory.
You can see the heap size of your GPU by getting the “limit” property for the heap. Here’s how to print it:
// Get the current device heap size
size_t val = 0;
CUDACHK( cudaDeviceGetLimit(&val, cudaLimitMallocHeapSize) );
double meg = val / ((double)1024*1024);
printf("Device: heap size = %d bytes (%3.2f MB)\n", (int)val, meg);
Kernel Memory Allocation Optimizations.
Lots of malloc and free calls in a kernel can really slow it down.
This is true for all C++ code, but memory allocation inside CUDA kernels is especially
sluggish, and there’s only a limited heap size for threads
on the GPU, which you can query via cudaDeviceGetLimit and cudaLimitMallocHeapSize.
The general types of optimizations include:
- Change to non-heap memory (various ways).
- Fewer allocation calls in general.
- Less data overall with smaller data types.
- Reduce fragmentation with consistent sizing.
- Allocate later and free earlier.
Instead of allocating memory, try restructuring the computation as iterations over existing data, or store the data in other memory (e.g., local variables, shared memory, or pre-allocated buffers).
For small amounts of data, the performance of alloca,
which allocates stack memory, should be compared with the normal heap allocations
with malloc or new.
If it’s slow, just do it less! Don’t needlessly allocate heap memory for convenience. For example, instead of an allocated buffer, have a dynamic buffer class which includes a fixed size of data, and only allocates heap memory if the total data exceeds this default size. Rather than catering to the general case with allocated memory, specialize your algorithm for smaller sizes to avoid needing memory allocations.
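Here’s a minimal host-side sketch of such a hybrid buffer class (the class name and default size are hypothetical, and a real version would also need to handle copying):
// Hypothetical small-buffer class: small requests use a fixed inline array,
// and only larger requests fall back to heap allocation.
template <typename T, int FIXEDSIZE = 64>
class SmallBuffer {
    T fixed_[FIXEDSIZE];   // inline storage: no heap allocation needed
    T *heap_ = nullptr;    // only used when n > FIXEDSIZE
    T *data_ = fixed_;
public:
    explicit SmallBuffer(int n) {
        if (n > FIXEDSIZE) {
            heap_ = new T[n];   // heap allocation only for the large case
            data_ = heap_;
        }
    }
    ~SmallBuffer() { delete[] heap_; }   // no-op when no heap was used
    T *data() { return data_; }
};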
Fragmentation of the heap is another issue. Lots of allocations of small blocks, or interleaving allocations of differently sized blocks, can cause “fragmentation” of your memory heap. The allocator does its best, but sometimes it needs your help. Find ways to not only allocate fewer blocks overall, but keep the different sizes to a minimum.
An advanced way to avoid allocations and deallocations on the device is to use a “persistent kernel” architecture. This is a kernel that never exits, so you can pre-allocate the memory blocks you need once, rather than repeatedly allocating and de-allocating them.
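Here’s a minimal sketch of the idea, assuming the host signals shutdown via a flag in managed or mapped memory (the kernel name, flag, and work-processing details are hypothetical):
// Hypothetical persistent kernel: allocate scratch memory once, then loop
// processing work until the host sets a quit flag in managed/mapped memory.
__global__ void persistentWorker(volatile int *quit_flag, float *work_data, int n)
{
    // One heap allocation for the whole lifetime of the kernel...
    float *scratch = (float*)malloc(256 * sizeof(float));
    if (!scratch) return;   // device heap exhausted
    while (!*quit_flag) {   // volatile: re-read the flag from memory each iteration
        // ... poll for and process pending items in work_data using scratch ...
    }
    free(scratch);   // ...and one de-allocation, when the kernel finally exits
}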
Late Allocation and Early Free
A typical simple CUDA C++ kernel, such as a test version of vector addition, has this sequence:
1. Allocate three local host vectors (malloc).
2. Allocate three device vectors (cudaMalloc).
3. Store the input data into the two host vectors (host code).
4. Copy the two input vectors (cudaMemcpy host-to-device).
5. Launch the kernel for vector addition (<<<...>>> syntax).
6. Synchronize host-device (cudaDeviceSynchronize or implicitly).
7. Download the results vector (cudaMemcpy device-to-host).
8. Process the local results vector (host code).
9. De-allocate three device vectors (cudaFree).
10. De-allocate three local vectors (free).
But that sequence is not optimal: it retains heap memory on both host and device over periods where that memory is not needed. The results vector should be allocated later, and the two input vectors should be freed earlier. Here’s a much better sequence:
1. Allocate two local host input vectors (malloc).
2. Store the input data into the two host vectors (host code).
3. Allocate two (not three) device vectors (cudaMalloc).
4. Copy the two input vectors (cudaMemcpy host-to-device).
5. De-allocate two local input vectors (free) — earlier!
6. Allocate the third results device vector (cudaMalloc) — later!
7. Launch the kernel for vector addition.
8. Synchronize host-device.
9. De-allocate the two input device vectors (cudaFree) — earlier!
10. Allocate the third results local vector (malloc) — later!
11. Download the results vector (cudaMemcpy device-to-host).
12. De-allocate the third results device vector (cudaFree) — earlier!
13. Process the local results vector (host code).
14. De-allocate third local results vector (free).
It’s more steps, but it’s more efficient in its use of dynamic memory
on both the host and the device.
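Here’s a minimal host-side sketch of this ordering, assuming a simple vecAddKernel defined elsewhere and the CUDACHK error-checking macro used above:
const int n = 1000000;
size_t bytes = n * sizeof(float);
// 1-2. Allocate and fill the two host input vectors.
float *h_a = (float*)malloc(bytes);
float *h_b = (float*)malloc(bytes);
for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }
// 3-4. Allocate only the two device input vectors, then upload.
float *d_a, *d_b, *d_c;
CUDACHK( cudaMalloc((void**)&d_a, bytes) );
CUDACHK( cudaMalloc((void**)&d_b, bytes) );
CUDACHK( cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) );
CUDACHK( cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice) );
// 5. Free the host inputs early -- the device has its own copies now.
free(h_a); free(h_b);
// 6. Allocate the device results vector late, just before the launch.
CUDACHK( cudaMalloc((void**)&d_c, bytes) );
// 7-8. Launch and synchronize.
vecAddKernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
CUDACHK( cudaDeviceSynchronize() );
// 9. Free the device inputs early -- only the results are still needed.
CUDACHK( cudaFree(d_a) );
CUDACHK( cudaFree(d_b) );
// 10-11. Allocate the host results vector late, then download.
float *h_c = (float*)malloc(bytes);
CUDACHK( cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost) );
// 12. Free the device results vector as soon as it has been downloaded.
CUDACHK( cudaFree(d_c) );
// 13-14. Process the results on the host, then free them.
// ... use h_c ...
free(h_c);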
There are variations on this and my suggestion is possibly not the best overall sequence.
Arguably, you should prioritize GPU time over host time,
so maybe you shouldn’t spend the time to de-allocate the two vectors with free
on the host
until later.
So, our policy to maximize the critical section where
the GPU is working should perhaps be
to de-optimize the host memory allocation:
- Call host malloc before any GPU interactions, and
- Delay host free calls until all GPU work is finished.
Furthermore, this is a highly sequential algorithm that needs much bigger changes to run fast on the GPU. Instead of only using the default CUDA stream, the faster method is to break the vectors into smaller segments across multiple streams, allowing much greater parallelization.
As another idea, you might consider using the asynchronous allocation primitives such as cudaMallocAsync
to overlap the device memory allocation and the host code that loads the input data locally.
You might also think that you can do the free asynchronously,
with a further speedup, but the behavior of cudaFreeAsync is a little disappointing.
Asynchronous Memory Allocation
CUDA C++ has two asynchronous functions for memory allocation
since CUDA 11.2: cudaMallocAsync and
cudaFreeAsync.
There are a few opportunities for optimization with these primitives,
but there are also
various pitfalls in the change from synchronous (blocking) calls
to cudaMalloc and cudaFree,
which have implicit host-device synchronization, to launching their asynchronous equivalents
on a stream.
Another major risk of the unsynchronized versions is using the memory
from cudaMallocAsync before the allocation has completed in stream order (e.g., from the host, or from another stream, without synchronizing first).
Changing to cudaFreeAsync is unlikely to trigger a new “use-after-free” or “double-free” fault,
because that bug would have already occurred in the cudaFree blocking version.
Using asynchronous allocation with cudaMallocAsync can be part of overlapping
memory management and kernel execution.
This is discussed for data transfer optimizations
in Chapter 10.
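Here’s a minimal sketch of the stream-ordered pattern, assuming h_buf, bytes, n, blocks, threads, and myKernel are defined elsewhere (fill_input is a hypothetical host-side setup function, and h_buf should be pinned memory for the copy to overlap properly):
cudaStream_t stream;
CUDACHK( cudaStreamCreate(&stream) );
// Enqueue the device allocation on the stream (this call returns immediately).
float *d_buf = nullptr;
CUDACHK( cudaMallocAsync((void**)&d_buf, bytes, stream) );
// Meanwhile, the host can prepare the input data in parallel.
fill_input(h_buf, n);   // hypothetical host-side setup
// The copy and kernel are stream-ordered after the allocation, so this is safe.
CUDACHK( cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream) );
myKernel<<<blocks, threads, 0, stream>>>(d_buf, n);
// Enqueue the free in stream order, after the kernel has consumed the buffer.
CUDACHK( cudaFreeAsync(d_buf, stream) );
CUDACHK( cudaStreamSynchronize(stream) );
CUDACHK( cudaStreamDestroy(stream) );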
The use of cudaFreeAsync is not as useful as you might hope.
For starters, you have to set up a stream, including creating it,
and then destroying it later (so you don’t have a “stream leak”).
But it’s also not that efficient.
Arguably, you should be able to just throw out an asynchronous de-allocation request,
and then forget about it completely, since nobody’s ever going to be waiting for that memory.
However, cudaFreeAsync does not actually free the memory until
a synchronization point on the stream,
so you need to incur this stream synchronization delay at some point,
and the heap memory is still used up until you do.
Heterogeneous Memory Management (HMM)
HMM is a CUDA C++ feature announced in 2023 that simplifies host and kernel memory allocation. This is a software-based layer that merges the two main sources of allocated memory:
- System memory allocation (e.g., malloc)
- CUDA-managed memory allocation (e.g., cudaMalloc)
Effectively a generalization of Unified Memory, HMM allows memory allocated on either the
host or the GPU to be accessed by the other party.
This works for memory allocated by any primitive,
including standard C++ memory allocation
(i.e., malloc, calloc, new),
and CUDA memory allocation (e.g., cudaMalloc or cudaMallocManaged).
It also works for memory-mapped blocks via the mmap system call.
Thus, the programmer is relieved of the burden of needing
to explicitly manage the various sources of memory allocation.
This new capability simplifies various kinds of coordination between the CPU and GPU. Some examples of areas where direct access can be beneficial:
- Device access to large memory blocks on the CPU (without transferring them!)
- Configuration flags and message-passing between host and device.
- Synchronization between host and device (e.g., atomics).
- Memory-mapped I/O (directly accessible from the GPU).
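Here’s a minimal sketch of the programming model, assuming an HMM-capable system (a recent driver with the open-source GPU kernel modules and a supported GPU); incrementKernel is a hypothetical example kernel:
// Hypothetical kernel that updates plain malloc'd host memory in place.
__global__ void incrementKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Host code: plain system allocation, with no cudaMalloc and no cudaMemcpy.
int n = 1024;
int *data = (int*)malloc(n * sizeof(int));
for (int i = 0; i < n; i++) data[i] = i;
// With HMM, the kernel can dereference the malloc'd pointer directly;
// the driver migrates or maps the pages as needed.
incrementKernel<<<(n + 255) / 256, 256>>>(data, n);
CUDACHK( cudaDeviceSynchronize() );
printf("data[0] = %d\n", data[0]);   // the host sees the updated values
free(data);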
Custom Memory Allocators
If you don’t like how much memory overhead there is from allocated memory, you can define your own allocator. But here’s a warning: these methods are not for the feeble or uncommitted, and can get quite involved. What’s more, they can even be a de-optimization, because it’s hard to beat the standard allocators, which have had many years of optimization work.
Memory Pooling.
This is a memory management technique for allocated memory on the heap (e.g., malloc)
that aims to reduce the costs from memory fragmentation.
You can take control of the memory allocation algorithm by using your own “pools”
of allocated memory blocks.
This works particularly well if you have a large number of allocated memory blocks
that are exactly the same size.
A common example is where a particular C++ class object is allocated a lot,
in which case you can override its memory allocation
with class-specific new and delete operators.
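Here’s a minimal host-side sketch of the class-specific approach, using a simple free list as the pool (the Node class is hypothetical, and this version is not thread-safe):
// Hypothetical class with class-specific allocation from a free-list pool.
class Node {
    static Node *freelist_;   // pool of recycled Node-sized blocks
public:
    float value = 0.0f;
    Node *next = nullptr;
    void *operator new(size_t sz) {
        if (freelist_) {               // reuse a block from the pool
            Node *p = freelist_;
            freelist_ = freelist_->next;
            return p;
        }
        return ::operator new(sz);     // fall back to the global heap
    }
    void operator delete(void *ptr) {
        Node *p = (Node*)ptr;          // push the block back onto the pool
        p->next = freelist_;
        freelist_ = p;
    }
};
Node *Node::freelist_ = nullptr;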
Dynamic Memory Allocators. You can also try to define your own generic memory allocator routines to replace the default versions provided by CUDA. But honestly, you’d have to get up very early in the morning to do better than what’s already been done. But if you must do it, this is one way:
- Macro interception of malloc, calloc, and free.
- Link-time interception of the global new and delete operators.
If you think you’re up for it, feel free to take on this challenge!
Allocating Stack Memory: alloca
The alloca function can be used to allocate stack memory
rather than heap memory.
It works in both host and device code,
but this optimization may be less applicable to host code,
simply because it has a huge heap space.
The main reason to try stack allocation
is that the thread’s stack can be faster to access than heap-allocated global memory.
Although the alloca function can be used in device code
to dynamically allocate memory blocks on the thread’s stack,
there are various advantages and disadvantages
that should be considered carefully.
The main advantages of alloca include:
- Stack memory can be faster to access.
- The alloca function is itself fast.
- Automatic de-allocation on function exit when the stack unwinds.
- Helps avoid heap memory fragmentation.
The pitfalls include:
- Alignment problems (there’s also aligned_alloc).
- Limited stack size (i.e., alloca may fail in device code).
- Addresses are invalid after function exit (i.e., the scope is reduced and the lifetime is shorter than with heap allocations).
- Uninitialized memory by default (but neither do malloc or cudaMalloc initialize their memory).
- No way to de-allocate it programmatically before the function exits.
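Here’s a minimal sketch of device-side alloca, assuming a toolkit and architecture that support it (CUDA 11.3 or later, compute capability 5.2 or newer); the kernel and scratch size are hypothetical:
#include <alloca.h>

// Each thread allocates a small scratch buffer on its own stack.
__global__ void scratchKernel(const float *in, float *out, int n, int scratchlen)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float *scratch = (float*)alloca(scratchlen * sizeof(float));  // stack, not heap
    float sum = 0.0f;
    for (int j = 0; j < scratchlen; j++) {
        scratch[j] = in[i] * (float)j;   // use the per-thread scratch space
        sum += scratch[j];
    }
    out[i] = sum;   // the scratch memory is reclaimed automatically on return
}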
There is only a limited amount of local memory per thread,
although you can view or increase this amount
using the cudaDeviceGetLimit and cudaDeviceSetLimit functions
with the cudaLimitStackSize property.
Here’s how to print the value:
// Get the current per-thread stack size
size_t val = 0;
CUDACHK( cudaDeviceGetLimit(&val, cudaLimitStackSize) );
double kb = val / ((double)1024);
printf("Device: stack size = %d bytes (%3.2f KB)\n", (int)val, kb);
Note that stack overflow is a much bigger risk when using alloca in device code,
because the GPU stack is so small, and alloca does not report failure the way malloc does,
so keep the allocation sizes well within the stack limit.
But, of course, you’re always using good programming style and
checking the return values of every CUDA primitive, right?
Memory Leaks
A common problem in managing allocated memory is
leaking memory,
from blocks that are not de-allocated by free or cudaFree.
Finding memory leaks is often considered a debugging task,
but it also helps with performance,
so it’s both “debugging” and “deslugging.”
Nevertheless, many of the debugging tools are helpful in detecting leaked blocks:
- compute-sanitizer
- valgrind (limited)
Compute Sanitizer is more focused on memory errors than leaks with its default settings.
The full memory leak detection available in compute-sanitizer can be enabled via the “leak checking”
command-line argument:
compute-sanitizer --leak-check=full myexecutable
You can even use the Linux valgrind tool to chase down memory leaks,
if you prefer.
I’m not sure how fully it works for device code allocations,
but it certainly works on host code.
References
- Vivek Kini and Jake Hemstad, Jul 27, 2021, Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1, NVIDIA Technical Blog, https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/
- Ram Cherukuri, Dec 16, 2020, Enhancing Memory Allocation with New NVIDIA CUDA 11.2 Features, NVIDIA Technical Blog, https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/
- Vivek Kini and Jake Hemstad, Jul 27, 2021, Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 2, NVIDIA Technical Blog, https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-2/
- John Hubbard, Gonzalo Brito, Chirayu Garg, Nikolay Sakharnykh and Fred Oh, Aug 22, 2023, Simplifying GPU Application Development with Heterogeneous Memory Management, NVIDIA Technical Blog, https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
- Mark Harris, Jun 19, 2017, Unified Memory for CUDA Beginners, https://developer.nvidia.com/blog/unified-memory-cuda-beginners/