Aussie AI Blog
List of CUDA C++ Optimization Techniques
September 22, 2025
by David Spuler, Ph.D.
This is a compilation of CUDA C++ coding efficiency techniques from various books and articles:
- CUDA C++ Optimization, David Spuler, June 2024.
- C++ Ultra Low Latency, David Spuler, July 2025.
- C++ Low Latency, David Spuler, March 2025.
- Generative AI in C++, David Spuler, March 2024.
- 500+ LLM Inference Optimization Techniques (blog article)
Here’s the list:
Overall GPU System Optimizations:
- Buy a bigger GPU
- Buy the next generation GPU
- Buy more GPUs
- Buy a faster CPU
- Buy a faster NIC
- Buy a faster switch
- Buy a CPU-GPU superchip
- Buy a rack system
- Overclock your GPU (at your own risk)
- Overclock your CPU (ditto)
Porting CPU code to GPU:
- Parallelize your overall algorithm at a high level first
- The CPU still controls the overall processing sequence (the GPU acts as a subordinate coprocessor)
- Avoid swapping compute from CPU to GPU and back again (data transfer cost)
- Unrolled loops in CPU code parallelize well on a GPU.
- CPU SIMD code (x86 AVX or Arm Neon) is a good candidate for GPU vectorization.
- Porting CPU multithreading directly to GPU kernels is often a poor strategy.
- Sharded data tables indicate a likely parallel algorithm for the GPU.
- Linearized data structures parallelize well (i.e., arrays, matrices, tensors)
- Dynamic data structures may not work well on GPUs (e.g., linked lists, tries, binary trees)
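As a concrete illustration of the porting pattern above, here is a minimal sketch (the names are illustrative, not from any particular codebase) of moving a simple CPU loop to a CUDA kernel: the loop body stays the same, but the loop itself becomes one thread per element.

```cpp
#include <cuda_runtime.h>

// CPU version: an easily-parallelized element-wise loop.
void vector_add_cpu(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
}

// GPU version: the loop is replaced by one thread per element.
__global__ void vector_add_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard against overshoot in the last block
}

void vector_add_gpu(const float* d_a, const float* d_b, float* d_c, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // round up to cover all n elements
    vector_add_kernel<<<blocks, threads>>>(d_a, d_b, d_c, n);
}
```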
General CUDA Algorithm Optimizations:
- Use CUDA libraries rather than rolling your own (e.g., cuBLAS, cuDNN, CUTLASS)
- Use open-source CUDA libraries (e.g., RAPIDS)
- Element-wise "vertical" operations (e.g., vector add)
- Reductions or "horizontal" algorithms (e.g., min/max/sum/avg of a vector)
- Combined element-wise and reduction operations (e.g., vector dot product; see the sketch after this list)
- Latency hiding
- More complicated access patterns (e.g. matrix multiplication)
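To make the combined pattern concrete, here is a hedged dot-product sketch: an element-wise multiply feeds a shared-memory tree reduction. The 256-thread block size is an assumption baked into the shared array, and *result must be zeroed before launch.

```cpp
#include <cuda_runtime.h>

// Dot product: element-wise "vertical" step plus a "horizontal" reduction.
// Assumes blockDim.x == 256 and that *result is zeroed before launch.
__global__ void dot_product_kernel(const float* a, const float* b,
                                   float* result, int n) {
    __shared__ float partial[256];                          // one slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? a[i] * b[i] : 0.0f;    // element-wise step
    __syncthreads();
    // Tree reduction within the block (the "horizontal" step).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(result, partial[0]);   // combine the per-block partial sums
}
```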
Profiler Tools for GPU:
- Roofline analysis: determine whether a kernel is memory-bound or compute-bound
- nvprof
- ncu
- NVIDIA Visual Profiler
- Nsight Systems
- Nsight Compute
- Nsight Graphics
- Nsight Deep Learning Designer
- CUDA Profiling Tools Interface (CUPTI)
- NVIDIA Tools Extension SDK (NVTX)
- Nsight Perf SDK
- Nsight Tools JupyterLab Extension
- Code your own C or C++ timing code (e.g., clock, chrono)
- CUDA C++ timers
- CUDA event timers (see the sketch after this list)
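For the last item, here is a minimal CUDA event-timer sketch; the kernel being timed is a trivial stand-in. Note that events time GPU-side execution, not host code.

```cpp
#include <cuda_runtime.h>

__global__ void work_kernel(float* x, int n) {   // trivial stand-in workload
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = sqrtf(x[i]) + 1.0f;
}

// Time a kernel launch with CUDA events.
float time_kernel_ms(float* d_x, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);                            // enqueue start marker
    work_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);                             // enqueue stop marker
    cudaEventSynchronize(stop);                        // wait for the stop event
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);            // elapsed milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```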
Compiler Settings and Hints for CUDA C++:
- nvcc optimizer flags for host code: -O or --optimize
- nvcc optimizer flags for device code: -dopt or --dopt kind
- Link-time optimization flags: --dlink-time-opt (-dlto)
- Flush-to-Zero (FTZ) mode: -ftz=true
- Fast math mode: --use_fast_math
- Division lower precision: -prec-div=false
- Square root lower precision: -prec-sqrt=false
- Disable Compiler Debug Flags: “-g” (host debug) and “-G” (device debug)
Compiler Hints and Compile-Time Techniques:
- restricted pointers: "__restrict__"
- inline functions
- const, constexpr, consteval, constinit, etc.
- Examine PTX assembly with "-keep" (if you must)
- Disable debugging print statements
- Consider disabling kernel assert() calls
- Template metaprogramming (mostly superseded by constexpr)
- [[likely]] and [[unlikely]] path attributes (C++20)
- [[fallthrough]] (C++17)
- [[expects]], [[ensures]], [[assert]] (contract attributes proposed for C++20 but not adopted)
- __assume (CUDA 11.2)
- __builtin_assume (CUDA 11.2)
- __builtin_assume_aligned (CUDA 11.2)
- __builtin_unreachable (CUDA 11.3)
- GCC builtin functions are available in host code.
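Here is a short sketch combining two of these hints: __restrict__ promises the compiler there is no pointer aliasing, and __builtin_assume (CUDA 11.2+) feeds it a fact it cannot prove on its own. The warp-multiple size is an assumption of this example, not a general requirement.

```cpp
// Compile-time hints: no aliasing between out/in, and n is a multiple of 32.
__global__ void scale_kernel(float* __restrict__ out,
                             const float* __restrict__ in,
                             float factor, int n) {
    __builtin_assume(n % 32 == 0);   // caller guarantees a warp-multiple size
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;     // loads/stores can be freely scheduled
}
```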
CUDA Memory Hierarchy Optimizations:
- Minimize global memory usage
- Shared memory usage
- Avoid overuse of shared memory (it's limited)
- Constant read-only memory
- Warp shuffle and other warp-level primitives
- Local memory (on the stack)
- Registers
- Avoid register spills (e.g., not too many local variables; careful with scope/lifetime)
- Texture memory (good for 2D/3D spatial algorithms)
- General C++ memory reduction optimizations still apply (e.g., smaller data sizes)
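As an example of the constant-memory level of the hierarchy, here is a hedged sketch of evaluating a polynomial whose coefficients sit in __constant__ memory, so every thread reads the same values through the constant cache. The names and the 16-coefficient limit are illustrative.

```cpp
#include <cuda_runtime.h>

__constant__ float d_coeffs[16];   // read-only, cached, broadcast to all threads

// Evaluate a polynomial with Horner's rule; all threads share d_coeffs.
__global__ void polynomial_kernel(const float* in, float* out, int n, int degree) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i], y = 0.0f;
    for (int k = degree; k >= 0; k--)
        y = y * x + d_coeffs[k];
    out[i] = y;
}

// Host side: upload the coefficients once before launching, e.g.:
//   cudaMemcpyToSymbol(d_coeffs, h_coeffs, (degree + 1) * sizeof(float));
```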
CUDA Memory Allocation Optimizations:
- Dynamic memory allocation (should be minimized in both CPU and GPU code)
- Asynchronous memory allocation
- Late allocation & early free
- Avoid memory leaks
- Allocated stack memory (alloca)
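A sketch of the asynchronous, late-allocate/early-free pattern using the stream-ordered allocator (cudaMallocAsync/cudaFreeAsync, CUDA 11.2+); the kernel is a hypothetical stand-in.

```cpp
#include <cuda_runtime.h>

__global__ void transform_kernel(float* data, int n) {   // hypothetical workload
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Allocate late, free early, all ordered on the stream, avoiding the
// device-wide synchronization behavior of plain cudaMalloc/cudaFree.
void process_on_stream(cudaStream_t stream, const float* h_in, float* h_out, int n) {
    size_t bytes = n * sizeof(float);
    float* d_buf = nullptr;
    cudaMallocAsync((void**)&d_buf, bytes, stream);          // late allocation
    cudaMemcpyAsync(d_buf, h_in, bytes, cudaMemcpyHostToDevice, stream);
    transform_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_out, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaFreeAsync(d_buf, stream);                            // early free
}
```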
CUDA Memory Access Optimizations:
- Use CUDA builtin memory functions (e.g., memset/cudaMemset, memcpy/cudaMemcpy, etc.)
- Coalesced memory access to global memory
- Cache locality (e.g., tiling) but note that it's different on CPU versus GPU
- Contiguous memory blocks (even for 2D matrices or 3D tensors)
- Minimize use of scatter and gather approaches
- Avoid bank conflicts with shared memory
- Avoid unaligned addresses in memory accesses
- Use 128-byte aligned addresses for fully optimized coalesced data accesses
- Avoid exceeding Unified Memory limits
- Use data structure sizes that are multiples of 32 (to split across warps)
- Reduce total memory accesses in algorithms (e.g., tiling, kernel fusion)
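To illustrate coalescing, here is a contrast sketch: in the first kernel, a warp's 32 threads touch 32 consecutive floats, which combine into a few wide memory transactions; in the second, a large stride scatters the warp across many cache lines (the stride indexing is deliberately artificial).

```cpp
// Coalesced: thread i reads element i, so a warp covers one contiguous span.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced anti-pattern: adjacent threads are 'stride' elements apart.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;   // artificial scattered index
        out[j] = in[j];
    }
}
```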
Data Layout "Packing" Optimizations:
- Row-major versus column-major ordering (cache-friendly memory layout)
- Padding addition (to align data with cache sizes or boundaries)
- Matrix transpose (precomputed or on-the-fly)
- Blocking/tiling pre-copy (to faster memory)
- Block reordering
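Several of these items combine in the classic shared-memory matrix transpose: a tile is staged through shared memory so that both the global read and the global write are coalesced, and one column of padding removes shared-memory bank conflicts. This is a sketch; TILE = 32 is assumed to match a dim3(TILE, TILE) block in the launch configuration.

```cpp
#define TILE 32   // launch with dim3 block(TILE, TILE)

// Transpose a rows x cols row-major matrix into a cols x rows matrix.
__global__ void transpose_kernel(const float* in, float* out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];   // +1 padding column: no bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];    // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;     // swapped block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}
```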
CUDA C++ Compute Optimization Techniques:
- Maximize warp coherence / avoid thread divergence (warp divergence)
- Instruction cache locality (e.g., tight loops/blocks)
- Branchless coding tricks (reduce thread divergence)
- Instruction-level parallelism (ILP) of GPU instructions
- Total thread parallelism (thin kernels)
- Grid-stride loops (GSL); see the sketch after this list
- Avoid segmented loop kernels (prefer grid-stride loops)
- Loop unrolling
- Don't use recursion
- Grid size optimizations
- Different GPUs need different grid configurations
- Maximize occupancy
- Use occupancy calculators
- Explicit predication with if(cond) and if(!cond) (rarely)
- Minimize unused redundant threads (e.g., where if (i < n) is false)
- Avoid accidental redundant threads (multiple threads computing the same thing)
- Blocktiling (block-level tiling)
- Warptiling (warp-level tiling)
- Thread tiling (thread-level tiling)
- Double buffering (e.g., in MatMul)
- Thread swizzling
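Here is the grid-stride loop sketch referenced above: the kernel is correct for any n under any grid size, and sizing the grid to the GPU rather than to n also gives a mild thread-coarsening effect.

```cpp
// Grid-stride loop: each thread strides through the array by the total
// number of threads in the grid, so any launch configuration covers all n.
__global__ void saxpy_grid_stride(float a, const float* x, float* y, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Example launch, sized to the device rather than to n:
//   saxpy_grid_stride<<<4 * numSMs, 256>>>(2.0f, d_x, d_y, n);
```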
GPU Thread Overhead Reduction:
- Goldilocks threads (not too many, not too few)
- Thread coarsening (don't have threads doing too little work)
- Persistent GPU kernels
- CUDA thread pools
- Producer-consumer thread pools
- Megakernel thread pools (multi-action producer-consumer kernels)
- Avoid undisciplined megakernels (e.g. an entire CPU thread converted to a GPU kernel)
- Lazy loading of GPU kernels (as part of kernel initialization)
- Work stealing (in thread pools/work queues)
CPU-GPU Communication Optimizations:
- Upload fixed data to the GPU once only (e.g., AI model weights)
- Pass small data by value in kernel launch parameters as an alternative to cudaMemcpy
- Pinned host memory blocks (on CPU)
- Unified Memory for shared CPU-GPU data (e.g., cudaMallocManaged)
- Avoid downloading interim results back to the CPU
- Overlap host-to-device data transfers with GPU compute kernels (see the sketch after this list)
- Segmented overlapped CPU-GPU communication (partial overlapping)
- Reduce data sizes being transferred (e.g., smaller types, data compression)
- Batch multiple small data transfers into a single upload.
- Direct Memory Access (DMA)
- GPUDirect
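A sketch of the pinned-memory and overlap items above: page-locked host memory makes cudaMemcpyAsync truly asynchronous, and two streams let chunk k+1 upload while chunk k is still being processed. The kernel, names, and the assumption that n divides evenly into chunks are all illustrative.

```cpp
#include <cstring>
#include <cuda_runtime.h>

__global__ void process_kernel(float* data, int n) {   // stand-in for real work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void pipelined_upload(const float* h_src, int n, int chunks) {
    float* h_pinned = nullptr;
    cudaMallocHost((void**)&h_pinned, n * sizeof(float));   // pinned (page-locked)
    memcpy(h_pinned, h_src, n * sizeof(float));
    float* d_buf = nullptr;
    cudaMalloc((void**)&d_buf, n * sizeof(float));

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    int chunk = n / chunks;                   // assumes n % chunks == 0
    for (int c = 0; c < chunks; c++) {
        cudaStream_t s = streams[c % 2];      // alternate streams for overlap
        int off = c * chunk;
        cudaMemcpyAsync(d_buf + off, h_pinned + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        process_kernel<<<(chunk + 255) / 256, 256, 0, s>>>(d_buf + off, chunk);
    }
    cudaDeviceSynchronize();                  // wait for the whole pipeline
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
}
```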
Advanced GPU Communication Methods:
- Zero-copy NIC-to-GPU memory strategies
- Use sticky sessions to avoid copying user data between servers (e.g., KV caches)
- Multi-GPU communication technologies
- Multi-GPU peer-to-peer memory access
- Remote Direct Memory Access (RDMA)
- NVLink
- Lazy connection establishment for NCCL protocol
Strategies for Combined CPU-GPU Techniques:
- Combine CPU and GPU code (general parallel execution)
- Use CPU SIMD (AVX or Arm Neon) with GPU code
- Use CPU SIMD + GPU + CPU Instruction-Level Parallelism (ILP)
Synchronization Cost Optimizations:
- Don't overuse explicit host synchronization with cudaDeviceSynchronize()
- Avoid implicit host synchronizations (in various CUDA runtime APIs such as cudaMemcpy)
- Use cudaPeekAtLastError to check for errors without adding synchronization (see the sketch after this list)
- Avoid redundant barriers (unnecessary synchronization)
- Minimize calls to __syncwarp()
- Minimize calls to __syncthreads()
- Print statements can become a bottleneck.
- Accidentally leaving “serialized kernel launch” settings enabled (e.g., CUDA_LAUNCH_BLOCKING).
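As referenced above, here is a sketch of a launch-error check that does not force synchronization: cudaPeekAtLastError reads the error state without clearing it and without blocking the host, unlike wrapping every launch in cudaDeviceSynchronize.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Check for kernel launch errors without a device-wide synchronization.
#define CUDA_CHECK_LAUNCH()                                          \
    do {                                                             \
        cudaError_t err = cudaPeekAtLastError();                     \
        if (err != cudaSuccess)                                      \
            fprintf(stderr, "CUDA launch error: %s\n",               \
                    cudaGetErrorString(err));                        \
    } while (0)

// Usage: my_kernel<<<blocks, threads>>>(...); CUDA_CHECK_LAUNCH();
```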
Advanced CUDA Optimizations:
- Asynchronous CUDA operations
- Heterogeneous Memory Management (HMM)
- CUDA Dynamic parallelism (kernels launching kernels)
- CUDA Streams
- CUDA Graphs (see the sketch after this list)
- Load balancing of SM compute
- BF16x9 emulation of FP32 arithmetic (Blackwell)
- CUDA SIMD data types (e.g. float2, float3, float4)
- CUDA intrinsic functions (e.g., __fmul_rn(), __fdividef(), etc.)
- Tensor cores
- Memory pools
- Grouped GEMM APIs
- Native FP4 data type
- Native FP8 data type
- Thread Block Clusters ("super blocks")
- Dynamic Cluster Launching
- Cooperative Groups API (with the "cg::" prefix)
- BF16 "brain float" and other floating-point numeric formats
- Custom memory allocators
- Shared memory tiling methods
- Unified L1/Shared/Texture Cache (Unified Cache Hierarchy)
- Fused epilogues
- Fused prologues
- Shared epilogues
- Block-Scaled Quantization (FP32 to FP4/FP8 with per-block scaling factors)
- Block Floating-Point (BFP)
- UMMA (Universal Matrix Multiply Accumulate) instructions for mixed-precision GEMM
- Wave-level optimizations
- Single-wave kernels
- Avoid the "last wave" or "tail effect" problem
- Lookup tables
- Source code precomputation
- GPU isolation (multi-GPU)
- Persistent L2 cache
- Tensor Memory Accelerator (TMA)
- Multicast TMA loads
- Distributed shared memory
- Memory prefetching (e.g. cudaMemPrefetchAsync)
- Inline PTX assembly code inside C++
- Advanced branchless methods: branch fusion, tail merging, control-flow melding, thread data remapping
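Finally, here is the CUDA Graphs sketch referenced in the list above: a fixed ten-kernel sequence is captured once and then replayed with one cudaGraphLaunch per iteration, amortizing per-launch CPU overhead. The kernel and counts are illustrative, and the cudaGraphInstantiate call uses the CUDA 12 three-argument signature.

```cpp
#include <cuda_runtime.h>

__global__ void step_kernel(float* x, int n) {   // stand-in for real work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

void run_with_graph(float* d_x, int n, int iterations) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 10; k++)     // capture 10 launches as one graph
        step_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, 0);   // CUDA 12 signature

    for (int it = 0; it < iterations; it++)        // one launch replays all 10
        cudaGraphLaunch(graph_exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```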
General C++ Low-Latency and Efficiency Techniques:
- List of 600+ low-latency C++ techniques
More AI Research Topics
Read more about:
Aussie AI Advanced C++ Coding Books:
- CUDA C++ Optimization: Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging: Get your copy from Amazon: CUDA C++ Debugging
- C++ AVX Optimization: CPU SIMD Vectorization: Get your copy from Amazon: C++ AVX Optimization
- C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations: Get your copy from Amazon: C++ Ultra-Low Latency
- Advanced C++ Memory Techniques: Efficiency & Safety: Get your copy from Amazon: Advanced C++ Memory Techniques
- Safe C++: Fixing Memory Safety Issues: Get it from Amazon: Safe C++: Fixing Memory Safety Issues
- Efficient C++ Multithreading: Modern Concurrency Optimization: Get your copy from Amazon: Efficient C++ Multithreading
- Efficient Modern C++ Data Structures: Get your copy from Amazon: Efficient C++ Data Structures
- Low Latency C++: Multithreading and Hotpath Optimizations: Get your copy from Amazon: Low Latency C++