
  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler

Chapter 2. Optimizing CUDA Programs

The best thing about optimizing programs written in CUDA C++ is that someone else has already done all the work for you: the hardware engineers. Who knew that silicon could work so well? The GPU underneath the CUDA APIs is so blazingly fast that it really doesn’t matter what minuscule software optimizations you try to apply. Your code will run fast no matter how hard you try to slow it down with debug wrappers.

So, go ahead, dig out your favorite bubble sorts and linear searches. Throw away that book by Donald Knuth, because you no longer need it. You’ll never have to code another hash table in your life.

Oh, wait!

There are over 100,000 research papers written every year about optimizing AI software engines and their LLMs. Maybe there’s still some CUDA C++ coding work to be done...

General Optimization Principles

There are many aspects to C++ code optimization on the CUDA platform. Some of the principles are general to all software programming, whereas some things are CUDA-specific.

Parallel Algorithms. What are you trying to do? Can it be parallelized for the GPU? That’s the key. Honestly, if you can figure out how to run your algorithm in parallel, let CUDA do the grunt work on the GPU, and the rest is easy. Parallelization of software algorithms has spawned a thousand dissertations, but some general ideas include:

  • Parallelize tasks that run sequentially (obviously).
  • Overlap tasks.
  • Split data into sub-parts (i.e., parallelize partial computations); see the sketch after this list.
  • Break tasks into smaller sub-tasks that can be parallelized.
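For example, the simplest way to split data across the GPU is the classic one-element-per-thread pattern. Here is a minimal sketch (the kernel name and launch sizes are purely illustrative):

    __global__ void vector_add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) {                                    // guard against extra threads
            c[i] = a[i] + b[i];
        }
    }

    // Host-side launch: enough 256-thread blocks to cover all n elements.
    // int threads = 256;
    // int blocks = (n + threads - 1) / threads;
    // vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);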

Other Algorithm Improvements. In addition to parallelization, there are algorithm-level improvements. Consider using optimizations such as:

  • Incremental calculations
  • Low-precision calculations
  • Precomputations
  • Approximate algorithms
  • Specialized versions of general algorithms
  • Lazy algorithms

Parallel Data Structures. Choose data structures that store their data contiguously (linearized), as they tend to be parallelizable, such as:

  • Arrays
  • Vectors, matrices, and tensors
  • Bit sets, bit vectors, Bloom filters

Inherently sequential data structures are often a poor fit for GPUs, although GPU-friendly versions of many of them are already available in CUDA libraries. Examples include:

  • Binary trees
  • Tries
  • Linked lists
  • Queues and deques

Leverage CUDA Libraries. Don’t code up things that have already been done by the folks at NVIDIA. There are many libraries for all sorts of applications, ranging from AI to games to science. These libraries offer extensive coverage of many of the common and obscure algorithms and data structures.

  • cuBLAS — Basic Linear Algebra (e.g., vector and matrix operations).
  • CUTLASS — Advanced linear algebra operations.
  • CCCL — CUDA C++ Core Libraries including CUB, Thrust, and libcudacxx.
  • cuFFT — Fast Fourier Transforms.
  • cuDNN — Deep Neural Networks.
  • CUDART — CUDA Runtime library.
  • cuRAND — Random number generation.
  • cuSPARSE — Sparse matrix operations.
  • NVML — NVIDIA Management Library for low-level GPU access (similar to the nvidia-smi command).
  • NVRTC — NVIDIA Runtime Compilation library for dynamic compilation of code (cool!).
  • CUDA-X — suite of libraries and micro-services for AI and other CUDA applications.
  • RAPIDS — suite of GPU-accelerated AI libraries for CUDA.

Profilers. The capabilities of the NVIDIA performance profiling tools (Nsight Systems and Nsight Compute) are amazing. There are GUI versions that integrate with your IDE, as well as command-line versions. You can also add timing code to your own CUDA C++ programs, but you may not need to.

Compiler Auto-Optimizations. Modern compilers have amazing capabilities in optimizing code for you. Ensure that you have set all the optimizer flags, and also added “hints” to your code (e.g., restricted pointers, const, constexpr).
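For example, here is a minimal sketch of a kernel signature with such hints (the kernel itself is purely illustrative): const marks read-only data, and the __restrict__ qualifiers promise the compiler that the pointers don’t alias.

    __global__ void scale_add(const float* __restrict__ in,   // read-only, non-aliasing
                              float* __restrict__ out,        // write-only, non-aliasing
                              float scale, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = scale * in[i] + 1.0f;
        }
    }

    // Compile with optimization enabled, e.g.: nvcc -O3 prog.cu -o prog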

Compile-Time Optimizations. The run-time cost of anything that can be computed at compile-time is zero. Actually, literally, zero. You can compute a lot of things at compile-time using the various constant specifiers, specialized versions for particular sizes, and also via advanced template techniques.
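As a small sketch of the idea (names and sizes are illustrative), constexpr values and template parameters are resolved entirely by the compiler:

    constexpr int kBits = 10;
    constexpr int kTableSize = 1 << kBits;        // computed at compile time, not on the GPU

    template <int BlockSize>                      // block size fixed at compile time
    __global__ void scale_kernel(float* data, float factor, int n) {
        int i = blockIdx.x * BlockSize + threadIdx.x;   // BlockSize is a literal constant here
        if (i < n) {
            data[i] *= factor;
        }
    }

    // Launch a specialized instantiation, e.g.:
    // scale_kernel<256><<<blocks, 256>>>(d_data, 2.0f, n);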

Memory Costs. Many algorithms on a GPU are memory-bound, because the computations run faster than the memory access latency. The GPU has multiple levels of memory and caches (e.g., registers are the fastest, and global memory is the slowest), and there are various ways to optimize CUDA memory accesses. An important part of CUDA C++ optimizations is to think about data transfer costs and memory access costs.

Synchronization Delays. Undoubtedly, some parts of your algorithm will need to wait for other parts to complete. These inherently sequential steps require synchronization primitives. But don’t overuse them or add them where they’re not needed, as unnecessary synchronization slows down the entire algorithm.

Networking Cost. Typical CUDA applications transfer data between the CPU host and the GPU device over an internal bus (e.g., PCIe). However, if you have a multi-GPU application, or a large application that runs across multiple servers, then you’ll also need to consider the network cost. There are various protocols to learn, such as Direct Memory Access (DMA) and its remote version (RDMA).

Optimizing CUDA Kernels

Let’s be honest. The only reason you’re using CUDA is for speed. The whole goal of writing CUDA kernels is to tap into the awesome parallelization power of NVIDIA’s GPU cores. Hence, the biggest way to “optimize” a CUDA C++ kernel is to figure out the very fastest way to massively parallelize the whole thing. Algorithms, algorithms, algorithms!

This is a whole way of thinking, combining your knowledge of the problem being solved with knowledge of how to get the most out of CUDA kernels. Think about how to launch the most threads in parallel, how to avoid dependencies between computations, and how to reduce the total amount of data transfers and data accesses. A lot of the general CUDA optimization techniques apply broadly to many types of applications, from generative AI to games to scientific processing.

Although we all know that CUDA C++ programs are very special, nevertheless, the usual general advice applies:

  • Profile your code!
  • Don’t micro-optimize!
  • Optimize your overall algorithms first.
  • Focus on optimizing the busiest kernels.
  • Pretend to be typing when your boss walks in.

The main high-level techniques for CUDA kernel speedups include:

  • Parallelizing computations (surprise!)
  • Reducing arithmetic workload (i.e., fewer FLOPs).
  • Memory management optimizations (e.g., coalescing, locality).
  • Data access and data transfer optimizations.
  • High occupancy and load balancing of GPU streaming multiprocessors.
  • Networking optimizations for multi-server data sharing.

At the algorithm level, this means strategies such as:

  • Parallelize your algorithms at the top-level (task parallelism).
  • Smaller data sizes (e.g., quantization of AI models).
  • Fewer arithmetic operations (e.g., reducing multiplications by pruning LLMs).
  • Memory-efficient algorithms (e.g., tiled matrix multiplication, or memory-efficient attention via FlashAttention or PagedAttention).

Compute Optimizations

Compute organization means organizing the computation so that it is done efficiently on your GPUs:

  • High GPU occupancy rates (device utilization) via grid sizing and load balancing.
  • Wave optimization (e.g., single-wave kernels, avoiding the “tail effect”).
  • Streams (task parallelism); see the stream sketch after this list.
  • CUDA Graphs (graph-structured compute path organization).
  • Vectorization (do-it-yourself or compiler-optimized).
  • Dynamic parallelism (further concurrency via kernels launching kernels).
  • GPU isolation (limiting your GPU usage to benefit other workloads).
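Here is a minimal sketch of the streams item, assuming two independent kernels (the kernels and sizes are purely illustrative):

    __global__ void kernel_a(float* x, int n) {   // illustrative workload
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }
    __global__ void kernel_b(float* y, int n) {   // illustrative workload
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] *= 2.0f;
    }

    void launch_overlapped(float* d_x, float* d_y, int n) {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        kernel_a<<<blocks, threads, 0, s1>>>(d_x, n);   // runs in stream s1
        kernel_b<<<blocks, threads, 0, s2>>>(d_y, n);   // may run concurrently in stream s2
        cudaStreamSynchronize(s1);                      // wait for both streams to finish
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }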

Compute reduction means optimizing the arithmetic workload on your GPUs:

  • Tiling for improved data locality (e.g., matrix multiplication kernels).
  • Kernel fusion (combining two kernels into one); see the sketch after this list.
  • Kernel fission (splitting one kernel into two).
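As a sketch of kernel fusion (with purely illustrative elementwise kernels), two launches that each make a full pass over the data can be merged into one launch with a single read and write per element:

    // Unfused: two launches, two round trips through global memory.
    __global__ void scale(float* x, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }
    __global__ void add_bias(float* x, float b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += b;
    }

    // Fused: one launch, one read and one write per element.
    __global__ void scale_add_bias(float* x, float s, float b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * s + b;
    }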

Instruction processing optimizations include:

  • Optimize Instruction-Level Parallelism (ILP)
  • Increase instruction locality in code sequences (avoid GPU instruction cache misses).
  • Thread coarsening (more instructions per thread, such as via loop unrolling or stride loops).
  • Avoid using scatter and gather operations.
  • Lazy loading of GPU modules in kernel initialization.

Synchronization optimizations include:

  • Overlap computations where possible (asynchronously).
  • Don’t block in the host code unnecessarily.
  • Run kernel threads in lock-step wherever possible.
  • Use __syncwarp() sparingly.
  • Same for __syncthreads() primitives.
  • Avoid redundant barriers (unnecessary synchronization).

Some specific CUDA kernel optimizations include:

  • Choose grid and block sizes carefully, and test different values.
  • Optimize the “occupancy” rates of blocks/threads.
  • Use different configurations on different GPU architectures.
  • Optimize horizontal reductions (many ways).
  • Use warp horizontal reduction primitives (see the sketch after this list).
  • Warp coherence (avoiding warp divergence).
  • Minimize wasted redundant threads (i.e., extra threads).
  • Use multiples of 32 for data structure sizes.
  • Use texture memory to optimize spatial algorithms (e.g., those with 2D locality).
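For the warp reduction primitives, here is a minimal sketch of a warp-level horizontal sum using shuffle intrinsics (assuming all 32 lanes of the warp are active):

    __device__ float warp_reduce_sum(float val) {
        // Each step adds in the value from a lane "offset" positions higher.
        for (int offset = 16; offset > 0; offset /= 2) {
            val += __shfl_down_sync(0xffffffff, val, offset);
        }
        return val;   // lane 0 now holds the sum across the warp
    }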

Other ways to run GPUs faster:

  • Buy the next generation of GPU (don’t wait for budget approval).
  • Overclock your GPU! (as if it didn’t run hot enough already).

Finally, don’t forget that you’re still programming in the C++ language! A large number of C++ optimizations still apply to CUDA C++, such as:

  • Compile-time optimizations (e.g., const, constexpr, inline, restricted pointers, templates, etc.).
  • Auto-vectorization via “hints” (e.g., restricted pointers, #pragma, loop unrolling hints).
  • Use preprocessor directives (e.g., #if/#ifdef) to reduce live code.
  • Operator strength reductions (e.g., bitshift for multiply/divide, bitwise-and replaces remainder).
  • Loop optimizations (e.g., loop unrolling, loop fusion, loop fission; see the sketch after this list).
  • General usage of tight C++ coding practices.
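As a sketch of the loop unrolling and strength reduction items (the kernel is purely illustrative):

    __global__ void process(const float* __restrict__ in, float* __restrict__ out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = in[i];
        #pragma unroll                 // hint: fully unroll this fixed-count loop
        for (int k = 0; k < 4; ++k) {
            x = x * 0.5f + 1.0f;       // strength reduction: multiply by 0.5f rather than divide by 2.0f
        }
        int group = i >> 5;            // strength reduction: i / 32 as a bit shift
        int lane  = i & 31;            // strength reduction: i % 32 as a bitwise-and
        out[i] = x + (float)(group + lane);
    }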

That’s already a long list, but we’re only halfway, because we haven’t talked about memory yet.

Memory Optimizations

Computation is half the puzzle. The other half is the data, which means optimizing memory accesses and data transfers. Many CUDA kernels are memory-bound because the RAM chips can’t keep up with the raw speed of the GPU.

Memory optimizations at a high level include:

  • Faster memory usage (e.g., prefer shared memory over global memory).
  • Cache optimizations (e.g., data locality, instruction locality).
  • Memory access patterns (e.g., coalesced accesses).

Faster types of memory should be chosen:

  • Maximize use of registers (spilling to local memory only if unavoidable).
  • Reduce use of global memory (e.g., by using shared memory; see the sketch after this list).
  • Use warp shuffle operations rather than shared memory (or warp reduction primitives).
  • Consider malloc heap memory versus alloca stack memory for kernel dynamic memory.
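As a sketch of staging reused data in shared memory (an illustrative neighbor-sum kernel that treats block boundaries as zero, just to keep the sketch short):

    #define TILE 256   // assumes the block size is also 256
    __global__ void neighbor_sum(const float* __restrict__ in, float* __restrict__ out, int n) {
        __shared__ float tile[TILE];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tile[threadIdx.x] = in[i];    // one global read per element
        }
        __syncthreads();                  // wait until the whole tile is loaded
        if (i < n) {
            float left  = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : 0.0f;
            float right = (threadIdx.x + 1 < blockDim.x && i + 1 < n) ? tile[threadIdx.x + 1] : 0.0f;
            out[i] = left + tile[threadIdx.x] + right;   // neighbor reuse comes from shared memory
        }
    }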

Fewer accesses to memory:

  • Kernel fusion (combining two kernels into one).
  • Memory-aware algorithms (e.g., Flash attention, paged attention).

Memory access optimizations include:

  • Coalesced data access sequences in kernels.
  • Grid-stride loop idiom for coalesced data access (see the sketch after this list).
  • Data locality optimizations (e.g., tiled algorithms).
  • Use read-only cache memory for constant data (i.e., __ldg).
  • Prefer Structure-of-Arrays (SoA) over Array-of-Structures (AoS) to arrange data contiguously.
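Here is a minimal sketch of the grid-stride loop idiom (a SAXPY-style kernel); on each pass of the loop, adjacent threads touch adjacent elements, so the accesses stay coalesced:

    __global__ void saxpy(float a, const float* __restrict__ x, float* __restrict__ y, int n) {
        int stride = blockDim.x * gridDim.x;          // total number of threads in the grid
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
            y[i] = a * x[i] + y[i];                   // coalesced reads and writes
        }
    }

    // Works for any n with any launch configuration, e.g.:
    // saxpy<<<128, 256>>>(2.0f, d_x, d_y, n);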

Memory cache optimization is an important method:

  • L2 cache-aware methods
  • Reverse block access for cache optimization
  • Instruction cache optimizations
  • Memory prefetching

Avoid these memory access problems that will lead to increased latency:

  • Avoid unaligned addresses in memory accesses.
  • Avoid “bank conflict” contention on accesses to shared memory from multiple threads.
  • Avoid exceeding Unified Memory limits (starts page swapping GPU memory).
  • Use 128-byte aligned addresses for fully optimized coalesced data accesses.

Host-device data transfer optimizations include:

  • Reduce host-to-device data transfers of input data (i.e., cudaMemcpy CPU-to-GPU).
  • Reduce device-to-host data transfers of results (also cudaMemcpy).
  • Reduce global-to-shared memory transfers.
  • Use page-locked “pinned” memory for faster host-to-device data uploads (e.g., cudaMallocHost); see the sketch after this list.
  • Use Unified Memory for shared CPU-GPU data (e.g., cudaMallocManaged).
  • Overlap data transfers with other GPU computations (e.g., cudaMemcpyAsync).
  • Heterogeneous Memory Management (HMM).
  • Zero-copy memory for host and device joint access to fast memory.
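Here is a minimal sketch combining pinned memory with an asynchronous, overlapped transfer (error checking omitted; the kernel is only a placeholder):

    void upload_example() {
        const size_t n = 1 << 20;                            // illustrative element count
        float *h_buf = nullptr, *d_buf = nullptr;
        cudaMallocHost((void**)&h_buf, n * sizeof(float));   // page-locked ("pinned") host memory
        cudaMalloc((void**)&d_buf, n * sizeof(float));

        cudaStream_t copy_stream, compute_stream;
        cudaStreamCreate(&copy_stream);
        cudaStreamCreate(&compute_stream);

        // Upload asynchronously while unrelated work runs on the compute stream.
        cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, copy_stream);
        // other_kernel<<<blocks, threads, 0, compute_stream>>>(...);   // placeholder kernel

        cudaDeviceSynchronize();          // wait for both streams before cleanup
        cudaFreeHost(h_buf);              // pinned memory is freed with cudaFreeHost
        cudaFree(d_buf);
        cudaStreamDestroy(copy_stream);
        cudaStreamDestroy(compute_stream);
    }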

Multi-GPU and network optimizations for data transfer:

  • Multi-GPU peer-to-peer memory access.
  • Memory-mapped I/O.
  • NVLink transfers.
  • Remote Direct Memory Access (RDMA).
  • Lazy connection establishment for the NCCL communications library.

Jeepers! That’s a lot of optimization techniques. If I wrote a chapter on each of them, you’d be here till next Tuesday.

 
