Aussie AI Blog

List of CUDA C++ Optimization Techniques

  • September 22, 2025
  • by David Spuler, Ph.D.

This is a compilation of CUDA C++ coding efficiency techniques from various books and articles.

Here’s the list:

    Overall GPU System Optimizations:
  1. Buy a bigger GPU
  2. Buy the next generation GPU
  3. Buy more GPUs
  4. Buy a faster CPU
  5. Buy a faster NIC
  6. Buy a faster switch
  7. Buy a CPU-GPU superchip
  8. Buy a rack system
  9. Overclock your GPU (at your own risk)
  10. Overclock your CPU (ditto)

    Porting CPU code to GPU:
  11. Parallelize your overall algorithm at a high level first
  12. The CPU still controls the overall processing sequence (the GPU is like a slave processor)
  13. Avoid swapping compute from CPU to GPU and back again (network cost)
  14. Unrolled loops in CPU code parallelize well on a GPU.
  15. CPU SIMD code (x86 AVX or Arm Neon) is a good candidate for GPU vectorization.
  16. Porting CPU multithreading directly to GPU kernels is often a poor strategy.
  17. Sharded data tables indicate a likely parallel algorithm for the GPU.
  18. Linearized data structures parallelize well (e.g., arrays, matrices, tensors); a loop-porting sketch follows this list
  19. Dynamic data structures may not work well on GPUs (e.g., linked lists, tries, binary trees)
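
For example, here is a minimal sketch of porting a sequential CPU loop over an array to a one-thread-per-element CUDA kernel (the array, size, and scale factor are illustrative placeholders):

    // Minimal sketch: porting a sequential CPU array-scaling loop to a CUDA kernel.
    #include <cuda_runtime.h>

    // CPU version: one sequential pass over the array.
    void scale_cpu(float *a, float s, int n) {
        for (int i = 0; i < n; i++) a[i] *= s;
    }

    // GPU version: one thread per element, guarded against overrun.
    __global__ void scale_kernel(float *a, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= s;
    }

    // Host launch: enough 256-thread blocks to cover all n elements.
    void scale_gpu(float *d_a, float s, int n) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale_kernel<<<blocks, threads>>>(d_a, s, n);
    }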

    General CUDA Algorithm Optimizations:
  20. Use CUDA libraries rather than roll-your-own (e.g., cuBLAS, cuDNN, CUTLASS, etc.)
  21. Use open-source CUDA libraries (e.g. RAPIDS)
  22. Element-wise "vertical" operations (e.g., vector add), as sketched after this list
  23. Reductions or "horizontal" algorithms (e.g., min/max/sum/avg of a vector)
  24. Combined element-wise and reductions (e.g., vector dot-product)
  25. Latency hiding
  26. More complicated access patterns (e.g. matrix multiplication)
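
As an illustration of the element-wise style, here is a minimal vector-add kernel sketch (reductions such as dot product additionally need cross-thread combining, e.g., via shared memory or warp shuffles):

    // Minimal sketch of an element-wise ("vertical") operation: vector add.
    __global__ void vector_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];   // each thread produces one output element
    }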

    Profiler Tools for GPU:
  27. Roofline analysis: determine if kernel is memory-bound or compute-bound
  28. nvprof
  29. ncu
  30. NVIDIA Visual Profiler
  31. Nsight Systems
  32. Nsight Compute
  33. Nsight Graphics
  34. Nsight Deep Learning Designer
  35. CUDA Profiling Tools Interface (CUPTI)
  36. NVIDIA Tools Extension SDK (NVTX)
  37. Nsight Perf SDK
  38. Nsight Tools JupyterLab Extension
  39. Code your own C or C++ timing code (e.g., clock, chrono)
  40. CUDA C++ timers
  41. CUDA event timers (a timing sketch follows this list)
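
Here is a minimal, self-contained sketch of CUDA event timing around a kernel launch (the kernel and array size are placeholders for whatever you are measuring):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void double_kernel(float *a, int n) {  // placeholder kernel to time
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d_a = nullptr;
        cudaMalloc(&d_a, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);                          // mark start on the default stream
        double_kernel<<<(n + 255) / 256, 256>>>(d_a, n);
        cudaEventRecord(stop);                           // mark stop after the kernel
        cudaEventSynchronize(stop);                      // wait for the stop event

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);          // elapsed GPU time in milliseconds
        printf("Kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_a);
        return 0;
    }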

    Compiler Settings and Hints for CUDA C++:
  42. nvcc optimizer flags for host code: -O or --optimize
  43. nvcc optimizer flags for device code: -dopt or --dopt kind
  44. Link-time optimization flags: --dlink-time-opt (-dlto)
  45. Flush-to-Zero (FTZ) mode: -ftz=true
  46. Fast math mode: --use_fast_math
  47. Division lower precision: -prec-div=false
  48. Square root lower precision: -prec-sqrt=false
  49. Avoid compiler debug flags in release builds: -g (host debug) and -G (device debug)

    Compiler Hints and Compile-Time Techniques:
  50. restricted pointers: "__restrict__" (see the kernel sketch after this list)
  51. inline functions
  52. const, constexpr, consteval, constinit, etc.
  53. Examine PTX assembly with "-keep" (if you must)
  54. Disable debugging print statements
  55. Consider disabling kernel assert() calls
  56. Template metaprogramming (mostly superseded by constexpr)
  57. [[likely]] and [[unlikely]] path attributes (C++20)
  58. [[fallthrough]] (C++17)
  59. [[expects]], [[ensures]], [[assert]] contract attributes (proposed for C++20 but removed from the draft; a reworked contracts design targets C++26)
  60. __assume (CUDA 11.2)
  61. __builtin_assume (CUDA 11.2)
  62. __builtin_assume_aligned (CUDA 11.2)
  63. __builtin_unreachable (CUDA 11.3)
  64. GCC builtin functions are available in host code.
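
As a small illustration of restricted pointers plus a compile-time assumption hint, here is a hedged SAXPY-style sketch (the divisibility promise passed to __builtin_assume is purely illustrative):

    // Minimal sketch of restricted pointers plus a compile-time assumption hint.
    __global__ void saxpy_restrict(float * __restrict__ y,
                                   const float * __restrict__ x,
                                   float a, int n)
    {
        __builtin_assume(n % 32 == 0);   // illustrative promise: n is a warp multiple
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // __restrict__ asserts that x and y never alias, which lets the
            // compiler schedule loads and stores more aggressively.
            y[i] = a * x[i] + y[i];
        }
    }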

    CUDA Memory Hierarchy Optimizations:
  65. Minimize global memory usage
  66. Shared memory usage (a block-reduction sketch follows this list)
  67. Avoid overuse of shared memory (it's limited)
  68. Constant read-only memory
  69. Warp shuffle and other warp-level primitives
  70. Local memory (on the stack)
  71. Registers
  72. Avoid register spills (e.g., not too many local variables; careful with scope/lifetime)
  73. Texture memory (good for 2D/3D spatial algorithms)
  74. General C++ memory reduction optimizations still apply (e.g., smaller data sizes)
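
Here is a minimal sketch combining shared memory and warp shuffle primitives in a block-level sum reduction (it assumes blockDim.x is a multiple of 32, and a second pass over the per-block partial sums is still needed):

    // Minimal sketch: block-level sum reduction with shared memory and warp shuffles.
    // Assumes blockDim.x is a multiple of 32; writes one partial sum per block.
    __global__ void block_sum(const float *in, float *block_out, int n)
    {
        __shared__ float warp_sums[32];          // one slot per warp in the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 32;
        int warp = threadIdx.x / 32;

        float v = (i < n) ? in[i] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)     // reduce within each warp
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) warp_sums[warp] = v;                  // stash per-warp sums
        __syncthreads();

        if (warp == 0) {                                     // first warp combines them
            int nwarps = blockDim.x / 32;
            v = (lane < nwarps) ? warp_sums[lane] : 0.0f;
            for (int offset = 16; offset > 0; offset >>= 1)
                v += __shfl_down_sync(0xffffffff, v, offset);
            if (lane == 0) block_out[blockIdx.x] = v;
        }
    }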

    CUDA Memory Allocation Optimizations:
  75. Dynamic memory allocation (should be minimized in both CPU and GPU code)
  76. Asynchronous memory allocation (sketched after this list)
  77. Late allocation & early free
  78. Avoid memory leaks
  79. Allocated stack memory (alloca)
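
For asynchronous (stream-ordered) allocation, here is a minimal sketch using cudaMallocAsync and cudaFreeAsync (CUDA 11.2+ with memory pool support; the buffer size and the memset standing in for real work are placeholders):

    #include <cuda_runtime.h>

    // Minimal sketch of stream-ordered allocation.
    void async_alloc_example(cudaStream_t stream, size_t nbytes)
    {
        void *d_buf = nullptr;
        cudaMallocAsync(&d_buf, nbytes, stream);    // allocation ordered on the stream
        cudaMemsetAsync(d_buf, 0, nbytes, stream);  // placeholder for real work on the buffer
        cudaFreeAsync(d_buf, stream);               // free early, still stream-ordered
        // No host synchronization needed here: allocate, use, and free are all
        // ordered by the stream.
    }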

    CUDA Memory Access Optimizations:
  80. Use CUDA builtin memory functions (e.g. memset/cudaMemset, memcpy/cudaMemcpy, etc.)
  81. Coalesced memory access to global memory (see the sketch after this list)
  82. Cache locality (e.g., tiling) but note that it's different on CPU versus GPU
  83. Contiguous memory blocks (even for 2D matrices or 3D tensors)
  84. Minimize use of scatter and gather approaches
  85. Avoid bank conflicts with shared memory
  86. Avoid unaligned addresses in memory accesses
  87. Use 128-byte aligned addresses for fully optimized coalesced data accesses
  88. Avoid exceeding Unified Memory limits
  89. Use data structure sizes that are multiples of 32 (to split across warps)
  90. Reduce total memory accesses in algorithms (e.g., tiling, kernel fusion)
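
Here is a minimal sketch contrasting coalesced and strided global memory access patterns (the stride parameter is illustrative):

    // Minimal sketch: coalesced versus strided global memory access.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];        // neighboring threads touch neighboring addresses
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int j = (i * stride) % n;     // neighboring threads hit distant addresses
            out[j] = in[j];               // uncoalesced: many more memory transactions
        }
    }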

    Data Layout "Packing" Optimizations:
  91. Row-major versus column-major ordering (Cache-friendly memory-layout)
  92. Padding addition (to align data with cache sizes or boundaries)
  93. Matrix transpose (precomputed or on-the-fly; a tiled sketch follows this list)
  94. Blocking/tiling pre-copy (to faster memory)
  95. Block reordering
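
Here is a minimal sketch of a tiled matrix transpose using a padded shared-memory tile, which touches on padding, tiling, and on-the-fly transposition (TILE and the launch shape are illustrative choices):

    // Minimal sketch: tiled matrix transpose with a padded shared-memory tile.
    // Launch with dim3 block(TILE, TILE) and a grid covering the cols x rows input.
    #define TILE 32

    __global__ void transpose_tiled(const float *in, float *out, int rows, int cols)
    {
        __shared__ float tile[TILE][TILE + 1];     // +1 column avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;   // input column
        int y = blockIdx.y * TILE + threadIdx.y;   // input row
        if (x < cols && y < rows)
            tile[threadIdx.y][threadIdx.x] = in[y * cols + x];    // coalesced read
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;  // output column
        int ty = blockIdx.x * TILE + threadIdx.y;  // output row
        if (tx < rows && ty < cols)
            out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y]; // coalesced write
    }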

    CUDA C++ Compute Optimization Techniques:
  96. Maximize warp coherence / avoid thread divergence (warp divergence)
  97. Instruction cache locality (e.g., tight loops/blocks)
  98. Branchless coding tricks (reduce thread divergence)
  99. Instruction-level parallelism (ILP) of GPU instructions
  100. Total thread parallelism (thin kernels)
  101. Grid-stride loops (GSL), sketched after this list
  102. Avoid segmented loop kernels (prefer grid-stride loops)
  103. Loop unrolling
  104. Don't use recursion
  105. Grid size optimizations
  106. Different GPUs need different grid configurations
  107. Maximize occupancy
  108. Use occupancy calculators
  109. Explicit predication with if(cond) and if(!cond) (rarely)
  110. Minimize unused redundant threads (e.g., where if (i < n) is false).
  111. Avoid accidental redundant threads (multiple threads computing the same thing)
  112. Blocktiling (block-level tiling)
  113. Warptiling (warp-level tiling)
  114. Thread tiling (thread-level tiling)
  115. Double buffering (e.g., in MatMul)
  116. Thread swizzling
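
Here is the classic grid-stride loop pattern as a minimal sketch (the scaling operation is a placeholder for real per-element work):

    // Minimal sketch of a grid-stride loop: correct for any n, any grid size.
    __global__ void scale_gridstride(float *a, float s, int n)
    {
        int stride = gridDim.x * blockDim.x;   // total number of threads in the grid
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            a[i] *= s;                         // each thread strides across the array
    }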

    GPU Thread Overhead Reduction:
  117. Goldilocks threads (not too many, not too few)
  118. Thread coarsening (don't have threads doing too little work); sketched after this list
  119. Persistent GPU kernels
  120. CUDA thread pools
  121. Producer-consumer thread pools
  122. Megakernel thread pools (multi-action producer-consumer kernels)
  123. Avoid undisciplined megakernels (e.g. an entire CPU thread converted to a GPU kernel)
  124. Lazy loading of GPU kernels (as part of kernel initialization)
  125. Work stealing (in thread pools/work queues)
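
Here is a minimal sketch of thread coarsening, where each thread handles COARSE elements spaced a whole grid apart so that each individual access stays coalesced (COARSE=4 is an illustrative choice):

    // Minimal sketch of thread coarsening: each thread handles COARSE elements.
    // Launch with roughly (n + COARSE - 1) / COARSE threads in total.
    #define COARSE 4

    __global__ void scale_coarsened(float *a, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int total = gridDim.x * blockDim.x;
        for (int k = 0; k < COARSE; k++) {
            int idx = i + k * total;           // elements spaced a full grid apart,
            if (idx < n) a[idx] *= s;          // so each warp's accesses stay coalesced
        }
    }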

    CPU-GPU Communication Optimizations:
  126. Upload fixed data to GPU once only (e.g., AI model weights)
  127. Pass small data by value in kernel launch parameters as an alternative to cudaMemcpy
  128. Pinned host memory blocks (on CPU)
  129. Unified Memory for shared CPU-GPU data (e.g., cudaMallocManaged)
  130. Avoid downloading interim results back to the CPU
  131. Overlap host-device CPU-to-GPU communication with GPU compute kernels (a two-stream sketch follows this list)
  132. Segmented overlapped CPU-GPU communication (partial overlapping)
  133. Reduce data sizes being transferred (e.g., smaller types, data compression)
  134. Batch multiple small data transfers into a single upload.
  135. Direct Memory Access (DMA)
  136. GPUDirect
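
Here is a minimal sketch of overlapping host-to-device copies with kernel work using pinned host memory and two streams (the kernel, chunk count, and even-division assumption are placeholders):

    #include <cuda_runtime.h>

    __global__ void process(float *d, int n) {      // placeholder compute kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;
    }

    // Minimal sketch: overlap host-to-device copies with kernel work using
    // two streams. h_data must be pinned (cudaMallocHost) for async copies,
    // and n is assumed to divide evenly into nchunks for brevity.
    void overlapped_copy_compute(float *h_data, float *d_data, int n, int nchunks)
    {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        int chunk = n / nchunks;
        for (int c = 0; c < nchunks; c++) {
            cudaStream_t str = s[c % 2];
            size_t off = (size_t)c * chunk;
            cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, str);
            // This kernel queues behind its own copy, but overlaps with the
            // next chunk's copy running on the other stream.
            process<<<(chunk + 255) / 256, 256, 0, str>>>(d_data + off, chunk);
        }
        cudaStreamSynchronize(s[0]);
        cudaStreamSynchronize(s[1]);
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }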

    Advanced GPU Communication Methods:
  137. Zero-copy NIC-to-GPU memory strategies
  138. Use sticky sessions to avoid copying user data between servers (e.g., KV caches)
  139. Multi-GPU communication technologies
  140. Multi-GPU peer-to-peer memory access (sketched after this list)
  141. Remote Direct Memory Access (RDMA)
  142. NVLink
  143. Lazy connection establishment for NCCL protocol
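
Here is a minimal sketch of enabling peer-to-peer access between two GPUs, assuming the devices actually support a P2P path (e.g., NVLink or PCIe P2P):

    #include <cuda_runtime.h>

    // Minimal sketch: enable bidirectional peer-to-peer access between two GPUs.
    bool enable_p2p(int dev0, int dev1)
    {
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, dev0, dev1);
        cudaDeviceCanAccessPeer(&can10, dev1, dev0);
        if (!can01 || !can10) return false;     // no NVLink/PCIe P2P path available

        cudaSetDevice(dev0);
        cudaDeviceEnablePeerAccess(dev1, 0);    // flags argument must be 0
        cudaSetDevice(dev1);
        cudaDeviceEnablePeerAccess(dev0, 0);
        return true;
        // Afterwards, cudaMemcpyPeerAsync (or direct dereference of the peer's
        // pointers inside kernels) moves data GPU-to-GPU without host staging.
    }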

    Strategies for Combined CPU-GPU Techniques:
  144. Combine CPU and GPU code (general parallel execution)
  145. Use CPU SIMD (AVX or Arm Neon) with GPU code
  146. Use CPU SIMD + GPU + CPU Instruction-Level Parallelism (ILP)

    Synchronization Cost Optimizations:
  147. Don't overuse explicit host synchronization with cudaDeviceSynchronize()
  148. Avoid implicit host synchronizations (in various CUDA runtime APIs such as "cudaMemcpy")
  149. Use cudaPeekAtLastError to avoid implicit synchronization (see the sketch after this list).
  150. Avoid redundant barriers (unnecessary synchronization)
  151. Minimize calls to __syncwarp()
  152. Minimize calls to __syncthreads()
  153. Print statements (host or device printf) can become a bottleneck.
  154. Accidentally leaving “serialized kernel launch” settings enabled (e.g., CUDA_LAUNCH_BLOCKING).
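
As an example of checking kernel launch errors without stalling the host, here is a minimal sketch using cudaPeekAtLastError (the kernel is a placeholder):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work_kernel(float *a, int n) {   // placeholder kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] += 1.0f;
    }

    void launch_checked(float *d_a, int n)
    {
        work_kernel<<<(n + 255) / 256, 256>>>(d_a, n);

        // cudaPeekAtLastError reports launch-configuration errors immediately,
        // without blocking the host and without clearing the error state.
        cudaError_t err = cudaPeekAtLastError();
        if (err != cudaSuccess)
            fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

        // A cudaDeviceSynchronize() here would stall the CPU until the kernel
        // finished; only synchronize when the result is actually needed.
    }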

    Advanced CUDA Optimizations:
  155. Asynchronous CUDA operations
  156. Heterogeneous Memory Management (HMM)
  157. CUDA Dynamic parallelism (kernels launching kernels)
  158. CUDA Streams
  159. CUDA Graphs
  160. Load balancing of SM compute
  161. BF16x9 emulation of FP32 arithmetic (Blackwell)
  162. CUDA SIMD data types (e.g., float2, float3, float4); a float4 sketch follows this list
  163. CUDA intrinsic functions (e.g., __fmul_rn(), __fdividef(), etc.)
  164. Tensor cores
  165. Memory pools
  166. Grouped GEMM APIs
  167. Native FP4 data type
  168. Native FP8 data type
  169. Thread Block Clusters ("super blocks")
  170. Dynamic Cluster Launching
  171. Cooperative Groups API (with the "cg::" prefix)
  172. BF16 "brain float" and other floating-point numeric formats
  173. Custom memory allocators
  174. Shared memory tiling methods
  175. Unified L1/Shared/Texture Cache (Unified Cache Hierarchy)
  176. Fused Epilogues
  177. Fused prologues
  178. Shared Epilogues
  179. Block-Scaled Quantization (FP32 to FP4/FP8 with per-block scaling factors)
  180. Block Floating-Point (BFP)
  181. UMMA (Universal Matrix Multiply Accumulate) instructions for mixed-precision GEMM
  182. Wave-level optimizations
  183. Single-wave kernels
  184. Avoid the "last wave" or "tail effect" problem
  185. Lookup tables
  186. Source code precomputation
  187. GPU isolation (multi-GPU)
  188. Persistent L2 cache
  189. Tensor Memory Accelerator (TMA)
  190. Multicast TMA loads
  191. Distributed shared memory
  192. Memory prefetching (e.g. cudaMemPrefetchAsync)
  193. Inline PTX assembly code inside C++
  194. Advanced branchless methods: branch fusion, tail merging, control-flow melding, thread data remapping
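
Here is a minimal sketch of the built-in SIMD-style vector types, using float4 loads and stores to move 128 bits per instruction (it assumes 16-byte-aligned pointers and an element count divisible by 4):

    // Minimal sketch of CUDA's built-in vector types: one float4 load/store
    // moves 128 bits per instruction. Assumes 16-byte-aligned pointers and
    // an element count that is a multiple of 4 (n4 = n / 4).
    __global__ void add_float4(const float4 *a, const float4 *b, float4 *c, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 va = a[i], vb = b[i], vc;
            vc.x = va.x + vb.x;
            vc.y = va.y + vb.y;
            vc.z = va.z + vb.z;
            vc.w = va.w + vb.w;
            c[i] = vc;                     // single 128-bit store per thread
        }
    }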

    General C++ Low-Latency and Efficiency Techniques:
    List of 600+ low-latency C++ techniques

More AI Research Topics

Read more about:

Aussie AI Advanced C++ Coding Books



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging



C++ AVX Optimization: CPU SIMD Vectorization:
  • Introduction to AVX SIMD intrinsics
  • Vectorization and horizontal reductions
  • Low latency tricks and branchless programming
  • Instruction-level parallelism and out-of-order execution
  • Loop unrolling & double loop unrolling

Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization



C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
  • Low-level C++ efficiency techniques
  • C++ multithreading optimizations
  • AI LLM inference backend speedups
  • Low latency data structures
  • Multithreading optimizations
  • General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency



Advanced C++ Memory Techniques: Efficiency & Safety:
  • Memory optimization techniques
  • Memory-efficient data structures
  • DIY memory safety techniques
  • Intercepting memory primitives
  • Preventive memory safety
  • Memory reduction optimizations

Get your copy from Amazon: Advanced C++ Memory Techniques



Safe C++: Fixing Memory Safety Issues:
  • The memory safety debate
  • Memory and non-memory safety
  • Pragmatic approach to safe C++
  • Rust versus C++
  • DIY memory safety methods
  • Safe standard C++ library

Get it from Amazon: Safe C++: Fixing Memory Safety Issues



Efficient C++ Multithreading: Modern Concurrency Optimization:
  • Multithreading optimization techniques
  • Reduce synchronization overhead
  • Standard container multithreading
  • Multithreaded data structures
  • Memory access optimizations
  • Sequential code optimizations

Get your copy from Amazon: Efficient C++ Multithreading



Efficient Modern C++ Data Structures:
  • Data structures overview
  • Modern C++ container efficiency
  • Time & space optimizations
  • Contiguous data structures
  • Multidimensional data structures

Get your copy from Amazon: Efficient C++ Data Structures



Low Latency C++: Multithreading and Hotpath Optimizations: advanced coding book:
  • Low Latency for AI and other applications
  • C++ multithreading optimizations
  • Efficient C++ coding
  • Time and space efficiency
  • C++ slug catalog

Get your copy from Amazon: Low Latency C++