Aussie AI Blog
List of CUDA C++ Optimization Techniques
September 22, 2025
by David Spuler, Ph.D.
This is a compilation of CUDA C++ coding efficiency techniques from various books and articles:
- CUDA C++ Optimization, David Spuler, June 2024.
- C++ Ultra Low Latency, David Spuler, July 2025.
- C++ Low Latency, David Spuler, March 2025.
- Generative AI in C++, David Spuler, March 2024.
- 500+ LLM Inference Optimization Techniques (blog article)
Here’s the list:
Overall GPU System Optimizations:
- Buy a bigger GPU
- Buy the next generation GPU
- Buy more GPUs
- Buy a faster CPU
- Buy a faster NIC
- Buy a faster switch
- Buy a CPU-GPU superchip
- Buy a rack system
- Overclock your GPU (at your own risk)
- Overclock your CPU (ditto)
Porting CPU code to GPU:
- Parallelize your overall algorithm at a high level first
- The CPU still controls the overall processing sequence (the GPU acts as a subordinate coprocessor)
- Avoid swapping compute from CPU to GPU and back again (data transfer cost)
- Unrolled loops in CPU code parallelize well on a GPU.
- CPU SIMD code (x86 AVX or Arm Neon) is a good candidate for GPU vectorization.
- Porting CPU multithreading directly to GPU kernels is often a poor strategy.
- Sharded data tables indicate a likely parallel algorithm for the GPU.
- Linearized data structures parallelize well (i.e., arrays, matrices, tensors)
- Dynamic data structures may not work well on GPUs (e.g., linked lists, tries, binary trees)
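As a concrete illustration of the porting pattern above, here is a minimal sketch (the names are illustrative, not from any particular codebase) of moving a simple CPU loop to a CUDA kernel: the loop body stays the same, but the loop itself becomes one thread per element.

```cpp
#include <cuda_runtime.h>

// CPU version: an easily-parallelized element-wise loop.
void vector_add_cpu(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
}

// GPU version: the loop is replaced by one thread per element.
__global__ void vector_add_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard against overshoot in the last block
}

void vector_add_gpu(const float* d_a, const float* d_b, float* d_c, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // round up to cover all n elements
    vector_add_kernel<<<blocks, threads>>>(d_a, d_b, d_c, n);
}
```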
General CUDA Algorithm Optimizations:
- Use CUDA libraries rather than rolling your own (e.g., cuBLAS, cuDNN, CUTLASS)
- Use open-source CUDA libraries (e.g., RAPIDS)
- Element-wise "vertical" operations (e.g., vector add)
- Reductions or "horizontal" algorithms (e.g., min/max/sum/avg of a vector)
- Combined element-wise and reduction operations (e.g., vector dot product; see the sketch after this list)
- Latency hiding
- More complicated access patterns (e.g. matrix multiplication)
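To make the combined pattern concrete, here is a hedged dot-product sketch: an element-wise multiply feeds a shared-memory tree reduction. The 256-thread block size is an assumption baked into the shared array, and *result must be zeroed before launch.

```cpp
#include <cuda_runtime.h>

// Dot product: element-wise "vertical" step plus a "horizontal" reduction.
// Assumes blockDim.x == 256 and that *result is zeroed before launch.
__global__ void dot_product_kernel(const float* a, const float* b,
                                   float* result, int n) {
    __shared__ float partial[256];                          // one slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? a[i] * b[i] : 0.0f;    // element-wise step
    __syncthreads();
    // Tree reduction within the block (the "horizontal" step).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(result, partial[0]);   // combine the per-block partial sums
}
```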
Profiler Tools for GPU:
- Roofline analysis: determine whether a kernel is memory-bound or compute-bound
- nvprof
- ncu
- NVIDIA Visual Profiler
- Nsight Systems
- Nsight Compute
- Nsight Graphics
- Nsight Deep Learning Designer
- CUDA Profiling Tools Interface (CUPTI)
- NVIDIA Tools Extension SDK (NVTX)
- Nsight Perf SDK
- Nsight Tools JupyterLab Extension
- Code your own C or C++ timing code (e.g., clock, chrono)
- CUDA C++ timers
- CUDA event timers (see the sketch after this list)
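For the last item, here is a minimal CUDA event-timer sketch; the kernel being timed is a trivial stand-in. Note that events time GPU-side execution, not host code.

```cpp
#include <cuda_runtime.h>

__global__ void work_kernel(float* x, int n) {   // trivial stand-in workload
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = sqrtf(x[i]) + 1.0f;
}

// Time a kernel launch with CUDA events.
float time_kernel_ms(float* d_x, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);                            // enqueue start marker
    work_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);                             // enqueue stop marker
    cudaEventSynchronize(stop);                        // wait for the stop event
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);            // elapsed milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```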
Compiler Settings and Hints for CUDA C++:
- nvcc optimizer flags for host code: -O or --optimize
- nvcc optimizer flags for device code: -dopt or --dopt kind
- Link-time optimization flags: --dlink-time-opt (-dlto)
- Flush-to-Zero (FTZ) mode: -ftz=true
- Fast math mode: --use_fast_math
- Division lower precision: -prec-div=false
- Square root lower precision: -prec-sqrt=false
- Disable Compiler Debug Flags: “-g” (host debug) and “-G” (device debug)
Compiler Hints and Compile-Time Techniques:
- restricted pointers: "__restrict__"
- inline functions
- const, constexpr, consteval, constinit, etc.
- Examine PTX assembly with "-keep" (if you must)
- Disable debugging print statements
- Consider disabling kernel assert() calls
- Template metaprogramming (mostly superseded by constexpr)
- [[likely]] and [[unlikely]] path attributes (C++20)
- [[fallthrough]] (C++17)
- [[expects]], [[ensures]], [[assert]] (contract attributes proposed for C++20 but not adopted)
- __assume (CUDA 11.2)
- __builtin_assume (CUDA 11.2)
- __builtin_assume_aligned (CUDA 11.2)
- __builtin_unreachable (CUDA 11.3)
- GCC builtin functions are available in host code.
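Here is a short sketch combining two of these hints: __restrict__ promises the compiler there is no pointer aliasing, and __builtin_assume (CUDA 11.2+) feeds it a fact it cannot prove on its own. The warp-multiple size is an assumption of this example, not a general requirement.

```cpp
// Compile-time hints: no aliasing between out/in, and n is a multiple of 32.
__global__ void scale_kernel(float* __restrict__ out,
                             const float* __restrict__ in,
                             float factor, int n) {
    __builtin_assume(n % 32 == 0);   // caller guarantees a warp-multiple size
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;     // loads/stores can be freely scheduled
}
```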
CUDA Memory Hierarchy Optimizations:
- Minimize global memory usage
- Shared memory usage
- Avoid overuse of shared memory (it's limited)
- Constant read-only memory
- Warp shuffle and other warp-level primitives
- Local memory (on the stack)
- Registers
- Avoid register spills (e.g., not too many local variables; careful with scope/lifetime)
- Texture memory (good for 2D/3D spatial algorithms)
- General C++ memory reduction optimizations still apply (e.g., smaller data sizes)
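As an example of the constant-memory level of the hierarchy, here is a hedged sketch of evaluating a polynomial whose coefficients sit in __constant__ memory, so every thread reads the same values through the constant cache. The names and the 16-coefficient limit are illustrative.

```cpp
#include <cuda_runtime.h>

__constant__ float d_coeffs[16];   // read-only, cached, broadcast to all threads

// Evaluate a polynomial with Horner's rule; all threads share d_coeffs.
__global__ void polynomial_kernel(const float* in, float* out, int n, int degree) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i], y = 0.0f;
    for (int k = degree; k >= 0; k--)
        y = y * x + d_coeffs[k];
    out[i] = y;
}

// Host side: upload the coefficients once before launching, e.g.:
//   cudaMemcpyToSymbol(d_coeffs, h_coeffs, (degree + 1) * sizeof(float));
```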
CUDA Memory Allocation Optimizations:
- Dynamic memory allocation (should be minimized in both CPU and GPU code)
- Asynchronous memory allocation
- Late allocation & early free
- Avoid memory leaks
- Allocated stack memory (alloca)
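A sketch of the asynchronous, late-allocate/early-free pattern using the stream-ordered allocator (cudaMallocAsync/cudaFreeAsync, CUDA 11.2+); the kernel is a hypothetical stand-in.

```cpp
#include <cuda_runtime.h>

__global__ void transform_kernel(float* data, int n) {   // hypothetical workload
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Allocate late, free early, all ordered on the stream, avoiding the
// device-wide synchronization behavior of plain cudaMalloc/cudaFree.
void process_on_stream(cudaStream_t stream, const float* h_in, float* h_out, int n) {
    size_t bytes = n * sizeof(float);
    float* d_buf = nullptr;
    cudaMallocAsync((void**)&d_buf, bytes, stream);          // late allocation
    cudaMemcpyAsync(d_buf, h_in, bytes, cudaMemcpyHostToDevice, stream);
    transform_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_out, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaFreeAsync(d_buf, stream);                            // early free
}
```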
CUDA Memory Access Optimizations:
- Use CUDA builtin memory functions (e.g., memset/cudaMemset, memcpy/cudaMemcpy, etc.)
- Coalesced memory access to global memory
- Cache locality (e.g., tiling) but note that it's different on CPU versus GPU
- Contiguous memory blocks (even for 2D matrices or 3D tensors)
- Minimize use of scatter and gather approaches
- Avoid bank conflicts with shared memory
- Avoid unaligned addresses in memory accesses
- Use 128-byte aligned addresses for fully optimized coalesced data accesses
- Avoid exceeding Unified Memory limits
- Use data structure sizes that are multiples of 32 (to split across warps)
- Reduce total memory accesses in algorithms (e.g., tiling, kernel fusion)
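To illustrate coalescing, here is a contrast sketch: in the first kernel, a warp's 32 threads touch 32 consecutive floats, which combine into a few wide memory transactions; in the second, a large stride scatters the warp across many cache lines (the stride indexing is deliberately artificial).

```cpp
// Coalesced: thread i reads element i, so a warp covers one contiguous span.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced anti-pattern: adjacent threads are 'stride' elements apart.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;   // artificial scattered index
        out[j] = in[j];
    }
}
```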
Data Layout "Packing" Optimizations:
- Row-major versus column-major ordering (cache-friendly memory layout)
- Padding addition (to align data with cache sizes or boundaries)
- Matrix transpose (precomputed or on-the-fly)
- Blocking/tiling pre-copy (to faster memory)
- Block reordering
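Several of these items combine in the classic shared-memory matrix transpose: a tile is staged through shared memory so that both the global read and the global write are coalesced, and one column of padding removes shared-memory bank conflicts. This is a sketch; TILE = 32 is assumed to match a dim3(TILE, TILE) block in the launch configuration.

```cpp
#define TILE 32   // launch with dim3 block(TILE, TILE)

// Transpose a rows x cols row-major matrix into a cols x rows matrix.
__global__ void transpose_kernel(const float* in, float* out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];   // +1 padding column: no bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];    // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;     // swapped block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}
```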
CUDA C++ Compute Optimization Techniques:
- Maximize warp coherence / avoid thread divergence (warp divergence)
- Instruction cache locality (e.g., tight loops/blocks)
- Branchless coding tricks (reduce thread divergence)
- Instruction-level parallelism (ILP) of GPU instructions
- Total thread parallelism (thin kernels)
- Grid-stride loops (GSL); see the sketch after this list
- Avoid segmented loop kernels (prefer grid-stride loops)
- Loop unrolling
- Don't use recursion
- Grid size optimizations
- Different GPUs need different grid configurations
- Maximize occupancy
- Use occupancy calculators
- Explicit predication with if(cond) and if(!cond) (rarely)
- Minimize unused redundant threads (e.g., where if (i < n) is false)
- Avoid accidental redundant threads (multiple threads computing the same thing)
- Blocktiling (block-level tiling)
- Warptiling (warp-level tiling)
- Thread tiling (thread-level tiling)
- Double buffering (e.g., in MatMul)
- Thread swizzling
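Here is the grid-stride loop sketch referenced above: the kernel is correct for any n under any grid size, and sizing the grid to the GPU rather than to n also gives a mild thread-coarsening effect.

```cpp
// Grid-stride loop: each thread strides through the array by the total
// number of threads in the grid, so any launch configuration covers all n.
__global__ void saxpy_grid_stride(float a, const float* x, float* y, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Example launch, sized to the device rather than to n:
//   saxpy_grid_stride<<<4 * numSMs, 256>>>(2.0f, d_x, d_y, n);
```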
GPU Thread Overhead Reduction:
- Goldilocks threads (not too many, not too few)
- Thread coarsening (don't have threads doing too little work)
- Persistent GPU kernels
- CUDA thread pools
- Producer-consumer thread pools
- Megakernel thread pools (multi-action producer-consumer kernels)
- Avoid undisciplined megakernels (e.g. an entire CPU thread converted to a GPU kernel)
- Lazy loading of GPU kernels (as part of kernel initialization)
- Work stealing (in thread pools/work queues)
CPU-GPU Communication Optimizations:
- Upload fixed data to the GPU once only (e.g., AI model weights)
- Pass small data by value in kernel launch parameters as an alternative to cudaMemcpy
- Pinned host memory blocks (on CPU)
- Unified Memory for shared CPU-GPU data (e.g., cudaMallocManaged)
- Avoid downloading interim results back to the CPU
- Overlap host-to-device data transfers with GPU compute kernels (see the sketch after this list)
- Segmented overlapped CPU-GPU communication (partial overlapping)
- Reduce data sizes being transferred (e.g., smaller types, data compression)
- Batch multiple small data transfers into a single upload.
- Direct Memory Access (DMA)
- GPUDirect
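A sketch of the pinned-memory and overlap items above: page-locked host memory makes cudaMemcpyAsync truly asynchronous, and two streams let chunk k+1 upload while chunk k is still being processed. The kernel, names, and the assumption that n divides evenly into chunks are all illustrative.

```cpp
#include <cstring>
#include <cuda_runtime.h>

__global__ void process_kernel(float* data, int n) {   // stand-in for real work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void pipelined_upload(const float* h_src, int n, int chunks) {
    float* h_pinned = nullptr;
    cudaMallocHost((void**)&h_pinned, n * sizeof(float));   // pinned (page-locked)
    memcpy(h_pinned, h_src, n * sizeof(float));
    float* d_buf = nullptr;
    cudaMalloc((void**)&d_buf, n * sizeof(float));

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    int chunk = n / chunks;                   // assumes n % chunks == 0
    for (int c = 0; c < chunks; c++) {
        cudaStream_t s = streams[c % 2];      // alternate streams for overlap
        int off = c * chunk;
        cudaMemcpyAsync(d_buf + off, h_pinned + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        process_kernel<<<(chunk + 255) / 256, 256, 0, s>>>(d_buf + off, chunk);
    }
    cudaDeviceSynchronize();                  // wait for the whole pipeline
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
}
```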
Advanced GPU Communication Methods:
- Zero-copy NIC-to-GPU memory strategies
- Use sticky sessions to avoid copying user data between servers (e.g., KV caches)
- Multi-GPU communication technologies
- Multi-GPU peer-to-peer memory access
- Remote Direct Memory Access (RDMA)
- NVLink
- Lazy connection establishment for NCCL protocol
Strategies for Combined CPU-GPU Techniques:
- Combine CPU and GPU code (general parallel execution)
- Use CPU SIMD (AVX or Arm Neon) with GPU code
- Use CPU SIMD + GPU + CPU Instruction-Level Parallelism (ILP)
Synchronization Cost Optimizations:
- Don't overuse explicit host synchronization with cudaDeviceSynchronize()
- Avoid implicit host synchronizations (in various CUDA runtime APIs such as cudaMemcpy)
- Use cudaPeekAtLastError to check for errors without adding synchronization (see the sketch after this list)
- Avoid redundant barriers (unnecessary synchronization)
- Minimize calls to __syncwarp()
- Minimize calls to __syncthreads()
- Print statements can become a bottleneck.
- Accidentally leaving “serialized kernel launch” settings enabled (e.g., CUDA_LAUNCH_BLOCKING).
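As referenced above, here is a sketch of a launch-error check that does not force synchronization: cudaPeekAtLastError reads the error state without clearing it and without blocking the host, unlike wrapping every launch in cudaDeviceSynchronize.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Check for kernel launch errors without a device-wide synchronization.
#define CUDA_CHECK_LAUNCH()                                          \
    do {                                                             \
        cudaError_t err = cudaPeekAtLastError();                     \
        if (err != cudaSuccess)                                      \
            fprintf(stderr, "CUDA launch error: %s\n",               \
                    cudaGetErrorString(err));                        \
    } while (0)

// Usage: my_kernel<<<blocks, threads>>>(...); CUDA_CHECK_LAUNCH();
```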
Advanced CUDA Optimizations:
- Asynchronous CUDA operations
- Heterogeneous Memory Management (HMM)
- CUDA Dynamic parallelism (kernels launching kernels)
- CUDA Streams
- CUDA Graphs (see the sketch after this list)
- Load balancing of SM compute
- BF16x9 emulation of FP32 arithmetic (Blackwell)
- CUDA SIMD data types (e.g. float2, float3, float4)
- CUDA intrinsic functions (e.g., __fmul_rn(), __fdividef(), etc.)
- Tensor cores
- Memory pools
- Grouped GEMM APIs
- Native FP4 data type
- Native FP8 data type
- Thread Block Clusters ("super blocks")
- Dynamic Cluster Launching
- Cooperative Groups API (with the "cg::" prefix)
- BF16 "brain float" and other floating-point numeric formats
- Custom memory allocators
- Shared memory tiling methods
- Unified L1/Shared/Texture Cache (Unified Cache Hierarchy)
- Fused epilogues
- Fused prologues
- Shared epilogues
- Block-Scaled Quantization (FP32 to FP4/FP8 with per-block scaling factors)
- Block Floating-Point (BFP)
- UMMA (Universal Matrix Multiply Accumulate) instructions for mixed-precision GEMM
- Wave-level optimizations
- Single-wave kernels
- Avoid the "last wave" or "tail effect" problem
- Lookup tables
- Source code precomputation
- GPU isolation (multi-GPU)
- Persistent L2 cache
- Tensor Memory Accelerator (TMA)
- Multicast TMA loads
- Distributed shared memory
- Memory prefetching (e.g. cudaMemPrefetchAsync)
- Inline PTX assembly code inside C++
- Advanced branchless methods: branch fusion, tail merging, control-flow melding, thread data remapping
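Finally, here is the CUDA Graphs sketch referenced in the list above: a fixed ten-kernel sequence is captured once and then replayed with one cudaGraphLaunch per iteration, amortizing per-launch CPU overhead. The kernel and counts are illustrative, and the cudaGraphInstantiate call uses the CUDA 12 three-argument signature.

```cpp
#include <cuda_runtime.h>

__global__ void step_kernel(float* x, int n) {   // stand-in for real work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

void run_with_graph(float* d_x, int n, int iterations) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 10; k++)     // capture 10 launches as one graph
        step_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, 0);   // CUDA 12 signature

    for (int it = 0; it < iterations; it++)        // one launch replays all 10
        cudaGraphLaunch(graph_exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```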
General C++ Low-Latency and Efficiency Techniques:
- List of 600+ low-latency C++ techniques
More AI Research Topics
Read more about:
Aussie AI Advanced C++ Coding Books:
- CUDA C++ Optimization: Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging: Get your copy from Amazon: CUDA C++ Debugging
- C++ AVX Optimization: CPU SIMD Vectorization: Get your copy from Amazon: C++ AVX Optimization
- C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations: Get your copy from Amazon: C++ Ultra-Low Latency
- Advanced C++ Memory Techniques: Efficiency & Safety: Get your copy from Amazon: Advanced C++ Memory Techniques
- Safe C++: Fixing Memory Safety Issues: Get it from Amazon: Safe C++: Fixing Memory Safety Issues
- Efficient C++ Multithreading: Modern Concurrency Optimization: Get your copy from Amazon: Efficient C++ Multithreading
- Efficient Modern C++ Data Structures: Get your copy from Amazon: Efficient C++ Data Structures
- Low Latency C++: Multithreading and Hotpath Optimizations: Get your copy from Amazon: Low Latency C++