Chapter 11. CUDA Heap Memory Allocation
-
Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
-
by David Spuler
Heap Memory Optimizations
In an ideal world, we wouldn’t use memory allocation inside GPU kernels. In the real world, there are optimization techniques for that. These techniques can also be used for host code, because the issues with dynamic memory allocation are similar for CPU and GPU code.
Larger Device Heap Size.
You can increase the heap memory size for GPU kernels via the cudaDeviceSetLimit function
with the cudaLimitMallocHeapSize property.
But maybe sit on your hands before you do this,
because needing a bigger heap is a reminder that perhaps your kernel should use less heap memory.
You can see the heap size of your GPU by getting the “limit” property for the heap. Here’s how to print it:
// Get the current device heap size
size_t val = 0;
CUDACHK( cudaDeviceGetLimit(&val, cudaLimitMallocHeapSize) );
double meg = val / ((double)1024*1024);
printf("Device: heap size = %d bytes (%3.2f MB)\n", (int)val, meg);
Kernel Memory Allocation Optimizations.
Lots of malloc and free calls in a kernel can really slow it down.
This is true for all C++ code, but memory allocation inside CUDA kernels is especially
sluggish, and there’s only a limited heap size for threads
on the GPU, which you can query via cudaDeviceGetLimit and cudaLimitMallocHeapSize.
The general types of optimizations include:
- Change to non-heap memory (various ways).
- Fewer allocation calls in general.
- Less data overall with smaller data types.
- Reduce fragmentation with consistent sizing.
- Allocate later and free earlier.
Instead of allocating memory, try restructuring the computation as iterations over existing data, or store the data in other memory (e.g., local variables, shared memory, or pre-allocated buffers).
For small amounts of data, the performance of alloca,
which allocates stack memory, should be compared with the normal heap allocations
with malloc or new.
If it’s slow, just do it less! Don’t needlessly allocate heap memory for convenience. For example, instead of an allocated buffer, have a dynamic buffer class which includes a fixed size of data, and only allocates heap memory if the total data exceeds this default size. Rather than catering to the general case with allocated memory, specialize your algorithm for smaller sizes to avoid needing memory allocations.
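Here’s a minimal host-side sketch of such a hybrid buffer class (the class name and default size are hypothetical, and a real version would also need to handle copying):
// Hypothetical small-buffer class: small requests use a fixed inline array,
// and only larger requests fall back to heap allocation.
template <typename T, int FIXEDSIZE = 64>
class SmallBuffer {
    T fixed_[FIXEDSIZE];   // inline storage: no heap allocation needed
    T *heap_ = nullptr;    // only used when n > FIXEDSIZE
    T *data_ = fixed_;
public:
    explicit SmallBuffer(int n) {
        if (n > FIXEDSIZE) {
            heap_ = new T[n];   // heap allocation only for the large case
            data_ = heap_;
        }
    }
    ~SmallBuffer() { delete[] heap_; }   // no-op when no heap was used
    T *data() { return data_; }
};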
Fragmentation of the heap is another issue. Lots of allocations of small blocks, or interleaving allocations of differently sized blocks, can cause “fragmentation” of your memory heap. The allocator does its best, but sometimes it needs your help. Find ways to not only allocate fewer blocks overall, but keep the different sizes to a minimum.
An advanced way to avoid allocations and deallocations on the device is to use a “persistent kernel” architecture. This is a kernel that never exits, so you can pre-allocate the memory blocks you need once, rather than repeatedly allocating and de-allocating them.
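Here’s a minimal sketch of the idea, assuming the host signals shutdown via a flag in managed or mapped memory (the kernel name, flag, and work-processing details are hypothetical):
// Hypothetical persistent kernel: allocate scratch memory once, then loop
// processing work until the host sets a quit flag in managed/mapped memory.
__global__ void persistentWorker(volatile int *quit_flag, float *work_data, int n)
{
    // One heap allocation for the whole lifetime of the kernel...
    float *scratch = (float*)malloc(256 * sizeof(float));
    if (!scratch) return;   // device heap exhausted
    while (!*quit_flag) {   // volatile: re-read the flag from memory each iteration
        // ... poll for and process pending items in work_data using scratch ...
    }
    free(scratch);   // ...and one de-allocation, when the kernel finally exits
}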
Late Allocation and Early Free
A typical simple CUDA C++ kernel, such as a test version of vector addition, has this sequence:
1. Allocate three local host vectors (malloc).
2. Allocate three device vectors (cudaMalloc).
3. Store the input data into the two host vectors (host code).
4. Copy the two input vectors (cudaMemcpy host-to-device).
5. Launch the kernel for vector addition (<<<...>>> syntax).
6. Synchronize host-device (cudaDeviceSynchronize or implicitly).
7. Download the results vector (cudaMemcpy device-to-host).
8. Process the local results vector (host code).
9. De-allocate three device vectors (cudaFree).
10. De-allocate three local vectors (free).
But that sequence is not optimal: it retains heap memory on both host and device over periods where that memory is not needed. The results vector should be allocated later, and the two input vectors should be freed earlier. Here’s a much better sequence:
1. Allocate two local host input vectors (malloc).
2. Store the input data into the two host vectors (host code).
3. Allocate two (not three) device vectors (cudaMalloc).
4. Copy the two input vectors (cudaMemcpy host-to-device).
5. De-allocate two local input vectors (free) — earlier!
6. Allocate the third results device vector (cudaMalloc) — later!
7. Launch the kernel for vector addition.
8. Synchronize host-device.
9. De-allocate the two input device vectors (cudaFree) — earlier!
10. Allocate the third results local vector (malloc) — later!
11. Download the results vector (cudaMemcpy device-to-host).
12. De-allocate the third results device vector (cudaFree) — earlier!
13. Process the local results vector (host code).
14. De-allocate third local results vector (free).
It’s more steps, but it’s more efficient in its use of dynamic memory
on both the host and the device.
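Here’s a minimal host-side sketch of this ordering, assuming a simple vecAddKernel defined elsewhere and the CUDACHK error-checking macro used above:
const int n = 1000000;
size_t bytes = n * sizeof(float);
// 1-2. Allocate and fill the two host input vectors.
float *h_a = (float*)malloc(bytes);
float *h_b = (float*)malloc(bytes);
for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }
// 3-4. Allocate only the two device input vectors, then upload.
float *d_a, *d_b, *d_c;
CUDACHK( cudaMalloc((void**)&d_a, bytes) );
CUDACHK( cudaMalloc((void**)&d_b, bytes) );
CUDACHK( cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) );
CUDACHK( cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice) );
// 5. Free the host inputs early -- the device has its own copies now.
free(h_a); free(h_b);
// 6. Allocate the device results vector late, just before the launch.
CUDACHK( cudaMalloc((void**)&d_c, bytes) );
// 7-8. Launch and synchronize.
vecAddKernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
CUDACHK( cudaDeviceSynchronize() );
// 9. Free the device inputs early -- only the results are still needed.
CUDACHK( cudaFree(d_a) );
CUDACHK( cudaFree(d_b) );
// 10-11. Allocate the host results vector late, then download.
float *h_c = (float*)malloc(bytes);
CUDACHK( cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost) );
// 12. Free the device results vector as soon as it has been downloaded.
CUDACHK( cudaFree(d_c) );
// 13-14. Process the results on the host, then free them.
// ... use h_c ...
free(h_c);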
There are variations on this and my suggestion is possibly not the best overall sequence.
Arguably, you should prioritize GPU time over host time,
so maybe you shouldn’t spend the time to de-allocate the two vectors with free
on the host
until later.
So, our policy to maximize the critical section where
the GPU is working should perhaps be
to de-optimize the host memory allocation:
- Call host malloc before any GPU interactions, and
- Delay host free calls until all GPU work is finished.
Furthermore, this is a highly sequential algorithm that needs much bigger changes to run fast on the GPU. Instead of only using the default CUDA stream, the faster method is to break the vectors into smaller segments across multiple streams, allowing much greater parallelization.
As another idea, you might consider using the asynchronous allocation primitives such as cudaMallocAsync
to overlap the device memory allocation and the host code that loads the input data locally.
You might also think that you can do the free asynchronously,
with a further speedup, but the behavior of cudaFreeAsync is a little disappointing.
Asynchronous Memory Allocation
CUDA C++ has two asynchronous functions for memory allocation
since CUDA 11.2: cudaMallocAsync and
cudaFreeAsync.
There are a few opportunities for optimization with these primitives,
but there are also
various pitfalls in the change from synchronous (blocking) calls
to cudaMalloc and cudaFree,
which have implicit host-device synchronization, to launching their asynchronous equivalents
on a stream.
Another major risk of the unsynchronized versions is using the memory
from cudaMallocAsync before the allocation has completed in stream order (e.g., from the host, or from another stream, without synchronizing first).
Changing to cudaFreeAsync is unlikely to trigger a new “use-after-free” or “double-free” fault,
because that bug would have already occurred in the cudaFree blocking version.
Using asynchronous allocation with cudaMallocAsync can be part of overlapping
memory management and kernel execution.
This is discussed for data transfer optimizations
in Chapter 10.
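Here’s a minimal sketch of the stream-ordered pattern, assuming h_buf, bytes, n, blocks, threads, and myKernel are defined elsewhere (fill_input is a hypothetical host-side setup function, and h_buf should be pinned memory for the copy to overlap properly):
cudaStream_t stream;
CUDACHK( cudaStreamCreate(&stream) );
// Enqueue the device allocation on the stream (this call returns immediately).
float *d_buf = nullptr;
CUDACHK( cudaMallocAsync((void**)&d_buf, bytes, stream) );
// Meanwhile, the host can prepare the input data in parallel.
fill_input(h_buf, n);   // hypothetical host-side setup
// The copy and kernel are stream-ordered after the allocation, so this is safe.
CUDACHK( cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream) );
myKernel<<<blocks, threads, 0, stream>>>(d_buf, n);
// Enqueue the free in stream order, after the kernel has consumed the buffer.
CUDACHK( cudaFreeAsync(d_buf, stream) );
CUDACHK( cudaStreamSynchronize(stream) );
CUDACHK( cudaStreamDestroy(stream) );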
The use of cudaFreeAsync is not as useful as you might hope.
For starters, you have to set up a stream, including creating it,
and then destroying it later (so you don’t have a “stream leak”).
But it’s also not that efficient.
Arguably, you should be able to just throw out an asynchronous de-allocation request,
and then forget about it completely, since nobody’s ever going to be waiting for that memory.
However, cudaFreeAsync does not actually free the memory until
a synchronization point on the stream,
so you need to incur this stream synchronization delay at some point,
and the heap memory is still used up until you do.
Heterogeneous Memory Management (HMM)
HMM is a CUDA C++ feature announced in 2023 that simplifies host and kernel memory allocation. This is a software-based layer that merges the two main sources of allocated memory:
- System memory allocation (e.g., malloc)
- CUDA-managed memory allocation (e.g., cudaMalloc)
Effectively a generalization of Unified Memory, HMM allows memory allocated on either the
host or the GPU to be accessed by the other party.
This works for memory allocated by any primitive,
including standard C++ memory allocation
(i.e., malloc, calloc, new),
and CUDA memory allocation (e.g., cudaMalloc or cudaMallocManaged).
It also works for memory-mapped blocks via the mmap system call.
Thus, the programmer is relieved of the burden of needing
to explicitly manage the various sources of memory allocation.
This new capability simplifies various kinds of coordination between the CPU and GPU. Some examples of areas where direct access can be beneficial:
- Device access to large memory blocks on the CPU (without transferring them!)
- Configuration flags and message-passing between host and device.
- Synchronization between host and device (e.g., atomics).
- Memory-mapped I/O (directly accessible from the GPU).
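Here’s a minimal sketch of the programming model, assuming an HMM-capable system (a recent driver with the open-source GPU kernel modules and a supported GPU); incrementKernel is a hypothetical example kernel:
// Hypothetical kernel that updates plain malloc'd host memory in place.
__global__ void incrementKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Host code: plain system allocation, with no cudaMalloc and no cudaMemcpy.
int n = 1024;
int *data = (int*)malloc(n * sizeof(int));
for (int i = 0; i < n; i++) data[i] = i;
// With HMM, the kernel can dereference the malloc'd pointer directly;
// the driver migrates or maps the pages as needed.
incrementKernel<<<(n + 255) / 256, 256>>>(data, n);
CUDACHK( cudaDeviceSynchronize() );
printf("data[0] = %d\n", data[0]);   // the host sees the updated values
free(data);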
Custom Memory Allocators
If you don’t like how much memory overhead there is from allocated memory, you can define your own allocator. But here’s a warning: these methods are not for the feeble or uncommitted, and can get quite involved. What’s more, they can even be a de-optimization, because it’s hard to beat the standard allocators, which have had many years of optimization work.
Memory Pooling.
This is a memory management technique for allocated memory on the heap (e.g., malloc)
that aims to reduce the costs from memory fragmentation.
You can take control of the memory allocation algorithm by using your own “pools”
of allocated memory blocks.
This works particularly well if you have a large number of allocated memory blocks
that are exactly the same size.
A common example is where a particular C++ class object is allocated a lot,
in which case you can override its memory allocation
with class-specific new and delete operators.
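Here’s a minimal host-side sketch of the class-specific approach, using a simple free list as the pool (the Node class is hypothetical, and this version is not thread-safe):
// Hypothetical class with class-specific allocation from a free-list pool.
class Node {
    static Node *freelist_;   // pool of recycled Node-sized blocks
public:
    float value = 0.0f;
    Node *next = nullptr;
    void *operator new(size_t sz) {
        if (freelist_) {               // reuse a block from the pool
            Node *p = freelist_;
            freelist_ = freelist_->next;
            return p;
        }
        return ::operator new(sz);     // fall back to the global heap
    }
    void operator delete(void *ptr) {
        Node *p = (Node*)ptr;          // push the block back onto the pool
        p->next = freelist_;
        freelist_ = p;
    }
};
Node *Node::freelist_ = nullptr;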
Dynamic Memory Allocators. You can also try to define your own generic memory allocator routines to replace the default versions provided by CUDA. But honestly, you’d have to get up very early in the morning to do better than what’s already been done. But if you must do it, this is one way:
- Macro interception of malloc, calloc, and free.
- Link-time interception of the global new and delete operators.
If you think you’re up for it, feel free to take on this challenge!
Allocating Stack Memory: alloca
The alloca function can be used to allocate stack memory
rather than heap memory.
It works in both host and device code,
but this optimization may be less applicable to host code,
simply because it has a huge heap space.
The main reason to try stack allocation
is that the thread’s stack can be faster to access than heap-allocated global memory.
Although the alloca function can be used in device code
to dynamically allocate memory blocks on the thread’s stack,
there are various advantages and disadvantages
that should be considered carefully.
The main advantages of alloca include:
- Stack memory can be faster to access.
- The alloca function is itself fast.
- Automatic de-allocation on function exit when the stack unwinds.
- Helps avoid heap memory fragmentation.
The pitfalls include:
- Alignment problems (there’s also aligned_alloc).
- Limited stack size (i.e., alloca may fail in device code).
- Addresses are invalid after function exit (i.e., the scope is reduced and the lifetime is shorter than with heap allocations).
- Uninitialized memory by default (but neither do malloc or cudaMalloc initialize their memory).
- No way to de-allocate it programmatically before the function exits.
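Here’s a minimal sketch of device-side alloca, assuming a toolkit and architecture that support it (CUDA 11.3 or later, compute capability 5.2 or newer); the kernel and scratch size are hypothetical:
#include <alloca.h>

// Each thread allocates a small scratch buffer on its own stack.
__global__ void scratchKernel(const float *in, float *out, int n, int scratchlen)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float *scratch = (float*)alloca(scratchlen * sizeof(float));  // stack, not heap
    float sum = 0.0f;
    for (int j = 0; j < scratchlen; j++) {
        scratch[j] = in[i] * (float)j;   // use the per-thread scratch space
        sum += scratch[j];
    }
    out[i] = sum;   // the scratch memory is reclaimed automatically on return
}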
There is only a limited amount of local memory per thread,
although you can view or increase this amount
using the cudaDeviceGetLimit and cudaDeviceSetLimit functions
with the cudaLimitStackSize property.
Here’s how to print the value:
// Get the current per-thread stack size
size_t val = 0;
CUDACHK( cudaDeviceGetLimit(&val, cudaLimitStackSize) );
double kb = val / ((double)1024);
printf("Device: stack size = %d bytes (%3.2f KB)\n", (int)val, kb);
Note that stack overflow is a much bigger risk when using alloca in device code,
because the GPU stack is so small, and alloca does not report failure the way malloc does,
so keep the allocation sizes well within the stack limit.
But, of course, you’re always using good programming style and
checking the return values of every CUDA primitive, right?
Memory Leaks
A common problem in managing allocated memory is
leaking memory,
from blocks that are not de-allocated by free or cudaFree.
Finding memory leaks is often considered a debugging task,
but it also helps with performance,
so it’s both “debugging” and “deslugging.”
Nevertheless, many of the debugging tools are helpful in detecting leaked blocks:
- compute-sanitizer
- valgrind (limited)
Compute Sanitizer is more focused on memory errors than leaks with its default settings.
The full memory leak detection available in compute-sanitizer can be enabled via the “leak checking”
command-line argument:
compute-sanitizer --leak-check=full myexecutable
You can even use the Linux valgrind tool to chase down memory leaks,
if you prefer.
I’m not sure how fully it works for device code allocations,
but it certainly works on host code.
References
- Vivek Kini and Jake Hemstad, Jul 27, 2021, Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1, NVIDIA Technical Blog, https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/
- Ram Cherukuri, Dec 16, 2020, Enhancing Memory Allocation with New NVIDIA CUDA 11.2 Features, NVIDIA Technical Blog, https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/
- Vivek Kini and Jake Hemstad, Jul 27, 2021, Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 2, NVIDIA Technical Blog, https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-2/
- John Hubbard, Gonzalo Brito, Chirayu Garg, Nikolay Sakharnykh and Fred Oh, Aug 22, 2023, Simplifying GPU Application Development with Heterogeneous Memory Management, NVIDIA Technical Blog, https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
- Mark Harris, Jun 19, 2017, Unified Memory for CUDA Beginners, https://developer.nvidia.com/blog/unified-memory-cuda-beginners/