Aussie AI Blog

List of 600+ Low-Latency C++ Techniques

  • September 22, 2025
  • by David Spuler, Ph.D.

List of C++ Low-Latency and Efficiency Techniques

This is a compilation of coding efficiency and low-latency C++ programming techniques from various books and articles.

Here’s the long list:

    Low Latency C++ General Software Approaches:
  1. Cache warming
  2. Core pinning (“affinity”)
  3. False sharing (avoiding)
  4. Branch prediction optimizations
  5. Hotpath optimizations
  6. Slowpath removal
  7. Kernel bypass
  8. Lock contention (reducing)
  9. Lock-free programming (with atomics and memory ordering issues)
  10. Thread pools
  11. SIMD CPU instructions
  12. Inline assembly language (“asm” statements)
  13. Intrinsic functions (often closely mapping to machine code instructions)
  14. In-memory logging
  15. Cache locality (for L1/L2/L3 memory caches and instruction caches)
  16. Specialized data structures
  17. Thread-Local Storage (TLS) (“thread_local” type in C++11)
  18. Shared memory (e.g., shmctl “shared memory control” function, shmget, shm_open, ftruncate)
  19. Memory mapped files/devices (e.g., mmap, munmap)
  20. Asynchronous programming (std::async)
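
As a small illustration of the last item, asynchronous programming with std::async lets a thread hand work off and keep going. This is a minimal sketch (the function name parallel_sum and the half/half split are illustrative, not from the list):

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sketch: offload half of a summation with std::async so the calling
// thread can do useful work concurrently; std::launch::async forces a
// real thread rather than deferred execution.
long long parallel_sum(const std::vector<int>& v) {
    auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
    // Sum the first half on another thread...
    std::future<long long> fut = std::async(std::launch::async,
        [&v, mid] { return std::accumulate(v.begin(), mid, 0LL); });
    // ...while this thread sums the second half.
    long long back = std::accumulate(mid, v.end(), 0LL);
    return fut.get() + back;
}
```

Whether this is actually faster depends on data size and thread-launch overhead, so benchmark it.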

    Concurrency-Friendly Data Structures:
  21. Read-only data structures
  22. Reader-friendly data structures (e.g., many readers, one writer)
  23. Copy-on-write data structures (for readers)
  24. Versioned data structures (for readers)
  25. Partition data across threads (vertically: columns)
  26. Shard data across threads (horizontally: rows)
  27. Read-Copy-Update (RCU)—mostly the same as copy-on-write.
  28. NUMA-aware data structures—reduce cross-node communications
  29. Transactional memory (synchronization efficiency, reduces contention) — use atomic/isolated transactions (an emerging technology)
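
To make the copy-on-write/RCU idea concrete, here is a minimal sketch assuming a single writer: readers grab an immutable snapshot via the atomic shared_ptr free functions (C++11; deprecated in C++20 in favor of std::atomic<std::shared_ptr>) and never block, while the writer copies, mutates the copy, and publishes it with an atomic swap. The class name CowConfig is illustrative:

```cpp
#include <atomic>
#include <map>
#include <memory>
#include <string>

// Copy-on-write publishing: readers are lock-free; one writer swaps in
// a fresh copy. Old snapshots stay valid until their last reader drops them.
class CowConfig {
    using Map = std::map<std::string, int>;
    std::shared_ptr<const Map> data_ = std::make_shared<const Map>();
public:
    std::shared_ptr<const Map> snapshot() const {
        return std::atomic_load(&data_);          // reader: no lock taken
    }
    void update(const std::string& k, int v) {    // single writer only
        auto copy = std::make_shared<Map>(*data_); // copy...
        (*copy)[k] = v;                            // ...modify the copy...
        std::atomic_store(&data_,                  // ...publish atomically
                          std::shared_ptr<const Map>(std::move(copy)));
    }
}; 
```

A snapshot taken before an update keeps seeing the old data, which is exactly the versioned-data-structure behavior described above.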

    Hotpath Optimizations:
  30. Optimize all steps in the hotpath (e.g., data ingestion, decision, trade execution, logging, risk management)
  31. Profile the hotpath specifically (e.g., a test mode that always runs the hotpath)
  32. Examine assembly code of the hotpath
  33. Avoid memory allocation calls on hotpath (e.g., memory pools, preallocation)
  34. Avoid free/deallocation of memory on hotpath
  35. Use preallocated memory on hotpath
  36. Review data de-serialization and serialization costs
  37. Use in-memory databases for any significant amounts of incoming data
  38. Keep the client network connection warm (method depends on the API)
  39. Re-use objects to avoid constructor/destructor calls on hotpath

    General Tuning Advice:
  40. Avoid micro-optimization
  41. Avoid optimizing error handling code (it’s a slowpath)
  42. Loop optimizations (see below)
  43. Avoid nested loops
  44. Tune inner loop for nested loops
  45. Avoid excessive function wrapper overhead

    Performance Profiling Tools:
  46. gprof
  47. perf
  48. prof (older)
  49. pixie (older)

    Lock Contention Reduction:
  50. Late lock acquisition
  51. Early lock release
  52. Short critical section of code
  53. Generally reduce total numbers of locks used
  54. Locking fine-grain vs coarse-grain
  55. Use fine-grain locks for contested resources
  56. Use a hybrid fine-grain/coarse-grain lock strategy
  57. Release locks before significant computation
  58. Copy data to temporary variables to unlock before computation
  59. Release locks before blocking for I/O
  60. Release locks before blocking for system calls
  61. Release locks before blocking for networking
  62. Tolerate lockless output overlaps
  63. std::shared_mutex and std::shared_lock — multiple reads, one writer.
  64. Double lock check method (check first without a lock)
  65. Use message-passing via std::promise and std::future rather than shared memory.
  66. Thread-specific queues and “work stealing” design pattern
  67. Use a lock-free queue data structure
  68. thread_local keyword (C++11)
  69. std::lock_guard (C++11)
  70. std::lock_guard early release by scope control
  71. std::unique_lock (C++11) (more granular control than std::lock_guard)
  72. std::scoped_lock (C++17)
  73. Locking with timeouts (try locks)
  74. Avoid spinlock busy waiting
  75. Exponential backoff to avoid spinlock costs
    See also “lock-free programming”
    See also “concurrency-friendly data structures”
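
Several of the items above (short critical sections, copying to temporaries, releasing before computation) combine into one pattern, sketched below with an illustrative Stats class: hold the lock only long enough to copy the shared data, then compute on the local copy with no lock held.

```cpp
#include <mutex>
#include <numeric>
#include <vector>

// Sketch: keep the critical section to a bare copy; do the expensive
// arithmetic after the lock_guard's scope releases the mutex.
class Stats {
    std::mutex m_;
    std::vector<double> samples_;
public:
    void add(double x) {
        std::lock_guard<std::mutex> g(m_);
        samples_.push_back(x);
    }
    double mean() {
        std::vector<double> local;
        {   // short critical section: copy shared data, then scope-exit unlocks
            std::lock_guard<std::mutex> g(m_);
            local = samples_;
        }
        // heavy computation happens with no lock held
        if (local.empty()) return 0.0;
        return std::accumulate(local.begin(), local.end(), 0.0)
               / static_cast<double>(local.size());
    }
};
```

The copy itself costs something, so this trade-off pays off when the computation is much longer than the copy, or contention is high.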

    Thread/lock overhead reduction (generally):
  76. Reduce thread launch overhead
  77. Reduce thread destruction overhead
  78. Reduce lock acquisition/release overhead
  79. Reduce lock contention overhead
  80. std::make_shared() or std::allocate_shared() do only one allocation (the object and its control block combined), whereas std::shared_ptr<type> p(new type) does two (the object and the control block are allocated separately).
  81. Weak pointers (std::weak_ptr) can delay the deallocation of a shared_ptr and its object even after the main reference count is zero.
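
The make_shared point can be observed directly by counting heap allocations. This sketch replaces the global operator new purely for demonstration (the counter g_allocs and struct Widget are illustrative); on mainstream standard libraries the two-step construction performs two allocations and make_shared performs one.

```cpp
#include <cstdlib>
#include <memory>
#include <new>

// Demonstration-only global allocation counter.
static int g_allocs = 0;
void* operator new(std::size_t n) {
    ++g_allocs;
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

struct Widget { int x = 0; };

int count_two_step() {
    int before = g_allocs;
    std::shared_ptr<Widget> p(new Widget);   // object, then control block
    return g_allocs - before;
}
int count_make_shared() {
    int before = g_allocs;
    auto p = std::make_shared<Widget>();     // one fused allocation
    return g_allocs - before;
}
```

The exact counts depend on the standard library, but fewer allocations for make_shared is the expected outcome.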

    System code optimizations (general ideas):
  82. Avoid system calls to reduce context switches (in Linux)
  83. Use C++ “intrinsics” functions (highly optimized assembly-level code)

    Linux socket programming:
  84. Non-blocking sockets versus using select() with a timeout—allows the thread to do “other” useful work rather than just wait.
  85. poll() or the epoll API (epoll_create/epoll_wait) rather than waiting

    Context Switching Reduction:
  86. Thread counts (not too many threads)
  87. Thread specialization
  88. Producer-consumer thread model (a form of thread specialization)
  89. Use custom thread pools with only preallocated memory block pools.
  90. Spinlocks avoid context switches (especially good if the spin lasts only a short time)
  91. Avoid context switch cost by having a thread do “other” work, rather than just blocking.

    Cache Locality Optimizations:
  92. Tiling/blocking algorithms
  93. Tiling/blocking matrix multiplication (MatMul/GEMM)
  94. Smaller data type sizes for increased locality
  95. Choose a CPU with a larger L1 “cache line size” (64 bytes is typical; 128 on some CPUs)
  96. std::hardware_destructive_interference_size, std::hardware_constructive_interference_size (C++17)
  97. std::initializer_list (C++11) can be used as a lightweight container with contiguous elements
    See also “cache warming (prefetch)” optimizations
    See also “false sharing (avoid)” optimizations
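
The tiling/blocking items above can be sketched for matrix multiplication: process the matrices in small square tiles so a tile of B stays cache-resident while it is reused. This is a minimal row-major version; the tile size of 32 is an assumption to tune per cache, and the function name matmul_tiled is illustrative.

```cpp
#include <algorithm>
#include <vector>

// Tiled (blocked) MatMul: C += A * B for n x n row-major matrices.
// The i/k/j inner ordering keeps the innermost loop unit-stride over C and B.
void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int n, int tile = 32) {
    for (int ii = 0; ii < n; ii += tile)
        for (int kk = 0; kk < n; kk += tile)
            for (int jj = 0; jj < n; jj += tile)
                for (int i = ii; i < std::min(ii + tile, n); ++i)
                    for (int k = kk; k < std::min(kk + tile, n); ++k) {
                        float a = A[i * n + k];   // reused across the j loop
                        for (int j = jj; j < std::min(jj + tile, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

For large n this typically beats the naive triple loop mainly because B's tile is not evicted between reuses.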

    Instruction Cache Locality Optimizations:
  98. Prefer shorter blocks of code in the hotpath
  99. Consider not inlining function calls (for instruction cache locality)
    See also “branch prediction optimizations”

    Branch Prediction Optimizations (General):
  100. Branch elimination
  101. Branch compiler hints
  102. Branch prediction heuristics
  103. Branch profiling (two-phase)
  104. Branchless programming
  105. Tools—measure branch prediction data (e.g., perf)

    Branch Reductions Techniques:
  106. Algorithm-level changes to reduce branches
  107. Keep loop bodies short (shorter branches)
  108. Reduce far branching (e.g., function calls)
  109. Reduce overall use of function calls (see function call optimizations)
  110. Reduce use of if statements
  111. Reduce use of loops
  112. Reduce use of break statements (in loops, not switch!)
  113. Reduce use of continue statements
  114. Reduce use of switch statements
  115. Reduce short-circuiting in &&/|| operators
  116. Reduce short-circuiting of ?: ternary operator
  117. Avoid virtual function calls (hidden dynamic branches)
  118. Avoid pointer-to-functions (hidden dynamic branches; blocks inlining)
  119. Avoid function objects/functors (hidden dynamic branches)
  120. Avoid lambda functions passed as arguments (depends on how well the optimizer can handle them)
  121. Reduce long if-else-if sequences
  122. Reduce nested if-else sequences
  123. Avoid branches depending on anything unpredictable
  124. Avoid branches depending on user inputs
  125. Avoid branches depending on random numbers
  126. Avoid branches depending on system clocks
  127. Sort array data for efficient branch prediction, if scanning through the array comparing the data (e.g., before testing for error range)
    See also “compile-time optimizations” (remove branches at compile-time)
    See also “loop optimizations” (reduce loop iterations, e.g., loop unrolling)

    Branch Prediction Heuristics:
  128. Common case code in if block
  129. Uncommon case code in else block
  130. Error handling code in else block (uncommon code)
  131. Avoid zero-iteration loops (never entered)
  132. Avoid single-iteration loops (never loop back)

    Branch Prediction Compiler Hints:
  133. [[likely]] and [[unlikely]] path attributes (C++20)
  134. likely()/unlikely() wrapper macros for the [[likely]]/[[unlikely]] attributes (C++20)
  135. __builtin_expect (GCC)
  136. Define LIKELY and UNLIKELY macros with __builtin_expect (pre-C++20)
  137. [[noreturn]] (C++11)
  138. [[assume(expression)]] attribute (C++23)
  139. hot (GCC function attribute)
  140. GCC __builtin_unreachable
  141. std::unreachable—helps branch prediction (C++23)
  142. [[fallthrough]] — more for safety than speed (C++17)
  143. -fdelayed-branch compiler flag
  144. -fguess-branch-probability compiler flag
  145. -fif-conversion and -fif-conversion2 compiler flags
  146. Use “likely” and “unlikely” in custom assertion macros
  147. Use “likely” and “unlikely” in error handling code macros

    Branch Profiling:
  148. -fprofile-arcs (GCC option)
  149. -fprofile-generate (GCC command-line argument)
  150. -fprofile-use (GCC command-line argument)
  151. Branch profiling with 100% hotpath (test modes)

    Branchless Programming Techniques:
  152. Ternary operator preferred over if statements (if CMOV instruction)
  153. Boolean variables as 0 or 1 in arithmetic
  154. Logical operators (&&/||) as 0 or 1 in arithmetic
  155. Bitwise operators (&/|) replace logical operators (&&/||)
  156. Sign bit extension bit masks
  157. Lookup tables for branchless programming
  158. XOR trick to swap two integer variables without a temporary variable
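
Two classic branchless sketches combining the sign-mask and Boolean-arithmetic items above. Note the usual caveats: right-shifting a negative int is technically implementation-defined (arithmetic shift on all mainstream compilers), and abs_branchless overflows on INT_MIN just as std::abs does.

```cpp
#include <cstdint>

// Sign-mask absolute value: mask is all-ones when x is negative, else zero.
int32_t abs_branchless(int32_t x) {
    int32_t mask = x >> 31;      // arithmetic shift assumed
    return (x ^ mask) - mask;    // two's-complement negate when negative
}

// Branchless min via conditional select: (a < b) is 0 or 1, so the
// negated mask picks either (a ^ b) or zero.
int32_t min_branchless(int32_t a, int32_t b) {
    return b ^ ((a ^ b) & -static_cast<int32_t>(a < b));
}
```

Whether these beat an if statement depends on the CPU; compilers often emit CMOV for the plain version anyway, so measure first.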

    Slowpath Removal:
  159. Optimize error checking pathways
  160. Remove error checking tests
  161. Defer error checking tests to later
  162. Combine error checking tests together (and do it later)
  163. Avoid adding error checks deeper in the call hierarchy
  164. Never-failing functions (cannot return an error)
  165. Don’t use memory allocation (avoids memory allocation failure)

    Cache Warming Methods:
  166. Prefetch memory primitives
  167. __builtin_prefetch (GCC)
  168. _mm_prefetch (x86 SSE intrinsic, in <xmmintrin.h>)
  169. volatile on temporary variables
  170. Dry-run execution mode
  171. Branchless dry-run execution with arr[2] declarations
  172. Use read-only cache warming pathways (avoids cache invalidation for other threads)
  173. Use deep cache warming all the way down into the NIC
  174. Optimize cache warming code by reducing data reads (relies on cache line sizes)
  175. Reduce cache warming code to the maximum size of the memory cache (avoids redundant cache warming when cache is already full).
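
A minimal sketch of the __builtin_prefetch item, assuming GCC/Clang: while summing an array, issue a prefetch a few elements ahead so the data arrives before it is needed. The look-ahead distance of 8 is a guess to tune against memory latency.

```cpp
#include <cstddef>

// Software prefetch during a linear scan (GCC/Clang builtin).
// Arguments: address, rw (0 = read), temporal locality (0-3).
long long sum_with_prefetch(const int* a, std::size_t n) {
    long long s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0, 1);
        s += a[i];
    }
    return s;
}
```

Hardware prefetchers already handle simple linear scans well; explicit prefetching pays off mostly for pointer-chasing or strided access patterns.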

    False Sharing (Avoiding):
  176. Using alignas(64) or 128 or 256 to avoid false sharing (C++11)
  177. Use alignas on all shared memory or atomics (C++11)
  178. Tools to automatically detect false sharing (e.g., Linux perf c2c; Valgrind’s DRD may not detect it)
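
The alignas idea above in a minimal sketch: pad per-thread counters out to a full cache line so two cores incrementing adjacent counters never fight over one line. The 64-byte figure is the common line size; std::hardware_destructive_interference_size (C++17) is the portable query where the library provides it.

```cpp
#include <atomic>

// One counter per thread, each on its own cache line; alignas(64) forces
// both the alignment and (via padding) a 64-byte size.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(alignof(PaddedCounter) == 64, "aligned to a cache line");
static_assert(sizeof(PaddedCounter) == 64, "no two counters share a line");

PaddedCounter counters[4];   // e.g., indexed by thread id
```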

    Parallelism (General Categories):
  179. Multithreading
  180. Multiprocess
  181. Vectorization
  182. Pipelining
  183. Parallel execution modes (C++17)
  184. Coroutines (C++20)

    Advanced C++ Concurrency Data Structures:
  185. Read-only (“immutable”) data structures
  186. Lock-free algorithms and data structures
  187. Linear search can be efficient for small sizes because of cache prefetching (e.g., rather than binary search; also doesn’t need sorting maintained)

    SIMD Instructions:
  188. AVX (x86 CPUs)
  189. ARM Neon
  190. std::simd (experimental/C++26)
  191. <immintrin.h>

    Linux O/S Optimizations:
  192. Process priority upgrades (“nice” command or system call)
  193. Disable unimportant processes
  194. Overclocking CPU
  195. Overclocking GPU
  196. Disable SELinux (Security-Enhanced Linux)
  197. Disable accounting mode in Linux (should be off anyway)

    Linux Kernel Optimizations:
  198. Scheduling algorithm kernel modifications
  199. Tweak TCP/UDP network buffer settings (Linux kernel)
  200. Turn off file “last access date” storage (“noatime” in /etc/fstab)

    System Hardware Optimizations (Categories):
  201. Processor hardware (CPU)
  202. Network optimizations
  203. Disk optimizations
  204. RAM Memory optimizations

    Processor Hardware Major Categories of Optimizations:
  205. CPU
  206. GPU
  207. NPU
  208. FPGA
  209. ASIC

    Networking Hardware Optimizations (Categories):
  210. NIC
  211. Switches
  212. Load balancer devices
  213. Packet buffer size of a switch (optimizing for)

    Networking Transmission/Protocol Optimizations (Categories):
  214. Physical proximity
  215. Co-Lo
  216. TCP
  217. UDP (faster than TCP but unreliable)
  218. Optical networking (optical fiber cables)
  219. Microwave network transmission
  220. Packet fragment manipulations (e.g., out-of-order)
  221. Reduce packet fragment collation overhead
  222. Reduce packet consistency checking (error safety overhead)

    Networking Software Optimizations:
  223. TcpDirect/Onload
  224. SolarFlare/OpenOnload (kernel bypass)
  225. Exablaze (NIC with kernel bypass support)
  226. DMA
  227. PCIe bus
  228. Compress data sizes for your network transmissions
  229. Sticky sessions (avoids needing to send user-specific caches between servers)
  230. Shared storage rather than other server-to-server networking (e.g., NAS/SAN)
  231. Use custom wrappers for TCP and UDP network processing

    GPU & Distributed Networking Optimizations:
  232. RDMA
  233. NVLink
  234. InfiniBand
  235. RoCE
  236. GPUDirect
  237. PXN

    Deployment Optimizations (Website backends):
  238. DNS optimizations
  239. Round-Robin DNS (RRDNS)
  240. SSL time optimizations
  241. etags (website server speedup)
  242. Multiple identical servers architecture
  243. Use subdomains for static files
  244. CDN for static files
  245. Compression modes enabled
  246. Static files compressed
  247. Minify static files (CSS, JavaScript)
  248. Merge multiple small files together
  249. Use smaller image files (low precision)
  250. Merge multiple small icon images into one image file
  251. Cache duration settings
  252. Database optimizations (various, e.g., MySQL/MariaDB/MongoDB)
  253. Database indexes
  254. Application server optimizations (e.g., Tomcat)

    Apache/Nginx Subprocess Optimizations:
  255. Use FastCGI (FCGI), not classic CGI integrations
  256. Flush stdout of subprocesses (sends partial output earlier to Apache or Nginx)
  257. Close stdout of subprocesses before shutdown sequence (finishes earlier to Apache or Nginx)
  258. Early tests for violations and invalidity (fails quickly)

    Algorithm Enhancements:
  259. Precomputation (lookup tables)
  260. Precomputation to data file
  261. Precomputation of source code
  262. Incremental algorithms
  263. Data structure augmentation
  264. Parallelization
  265. Vectorization
  266. Caching
  267. Lazy evaluation
  268. Common case first
  269. Simple case first
  270. Approximate tests first
  271. Bounding box approximate tests
  272. Bounding sphere approximate tests
  273. Avoiding sqrt by using arithmetic on squares
  274. Integer arithmetic on squares: avoid floating-point by using arithmetic on squares
  275. Use variance not standard-deviation (arithmetic on squares)
  276. Approximations
  277. Compute budget algorithms
  278. Probabilistic/stochastic algorithms
  279. Skipping algorithms
  280. Heuristic algorithms
  281. Greedy algorithms

    Memory Reduction Strategies:
  282. Take care with memory reduction as some methods can reduce speed (trade-offs)
  283. Reduce allocated memory
  284. Smaller data sizes
  285. Pack data into smaller integer sizes
  286. Pack data into bits
  287. Pack data using bit-fields
  288. Pack data into unions
  289. Use std::bitset
  290. Use std::vector<bool> (it is a special bit-packed template instantiation)
  291. Structure packing (also for class data members): reorder different-sized data members for better packing and fewer padding bytes
  292. Structure packing: biggest data types first (heuristic)
  293. Structure packing: MSVS /d1reportSingleClassLayout compiler option to report on it
  294. #pragma pack reduces padding to reduce size, but may worsen structure access costs
  295. Stack data reductions
  296. Avoid deallocation of heap memory when in shutting-down mode
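
A minimal sketch of the structure-packing items: reordering members biggest-first removes padding. Exact sizes are ABI-dependent (the struct names are illustrative), but on a typical 64-bit ABI the first layout is 24 bytes and the second is 16.

```cpp
// Poor ordering: char, 7 bytes padding, double, char, 7 bytes tail padding.
struct Padded {
    char tag;
    double value;
    char flag;
};

// Biggest-first ordering: double, char, char, 6 bytes tail padding.
struct Reordered {
    double value;
    char tag;
    char flag;
};
```

The savings multiply when these objects sit in large arrays, which also improves cache locality.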

    Heap Allocated Memory Reduction Strategies:
  297. Fewer allocated memory blocks
  298. Avoid frequent small allocations
  299. Preallocation of dynamic memory
  300. Memory fragmentation avoidance
  301. Memory leak avoidance
  302. Merge memory allocations together
  303. Memory pools (fixed-size allocations, often a type of preallocation)
  304. Memory pool with O(1) deletion and O(1) insertion via permutation array
  305. Merge fixed-size allocated objects into a large array
  306. Custom memory allocators (generalized)
  307. Class-specific memory allocator
  308. Custom global memory allocator
  309. Late allocation (allocate memory as late as possible)
  310. Early free memory (deallocate as early as possible)
  311. Early delete memory (deallocate early)
  312. Avoid realloc (slow, memory fragmentation)
  313. Smart dynamic buffers (hybrid of allocated and non-allocated memory)
  314. std::aligned_alloc - memory alignment improvement (C++17)
  315. std::aligned_union (C++11)
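
The memory-pool items above can be sketched as a fixed-size pool with an intrusive free list: all slots are preallocated up front, and allocate/deallocate are O(1) pointer swaps with no call into the system allocator. This is a single-threaded illustration (Pool is an assumed name); a real hotpath pool would add thread safety or be thread-local.

```cpp
#include <cstddef>

// Fixed-size memory pool: N slots of storage for T, recycled via a free list
// threaded through the unused slots themselves.
template <class T, std::size_t N>
class Pool {
    union Slot {
        Slot* next;                                  // link while free
        alignas(T) unsigned char storage[sizeof(T)]; // payload while in use
    };
    Slot slots_[N];
    Slot* free_ = nullptr;
public:
    Pool() {
        for (std::size_t i = 0; i < N; ++i) {  // thread every slot onto the list
            slots_[i].next = free_;
            free_ = &slots_[i];
        }
    }
    void* allocate() {               // O(1); nullptr when exhausted
        if (!free_) return nullptr;
        Slot* s = free_;
        free_ = s->next;
        return s->storage;
    }
    void deallocate(void* p) {       // O(1) recycle
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = free_;
        free_ = s;
    }
};
```

The caller still placement-news a T into the returned storage and destroys it before deallocating; the pool only manages raw slots.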

    Static Memory Size Reductions:
  316. Avoid large global arrays and buffers
  317. Avoid large static arrays and buffers
  318. Avoid large static C++ data members
  319. String literal memory reductions

    Stack Memory Size Reductions:
  320. Avoid large local arrays and buffers
  321. Avoid large function non-reference parameter arrays and buffers
  322. Use pass-by-reference on large function parameters
  323. Use integer parameters as local variables
  324. Consider stack versus memory allocation
  325. Flattening/reducing function call hierarchy
  326. Inline small functions (compiler can disappear them)
  327. Use #define macros for small functions (versus inlining)
    See also: function call hierarchy flattening
    See also: recursion avoidance

    Code Size Reduction Strategies:
  328. Code size reductions
  329. DLLs versus static libraries
  330. Remove executable debug information
  331. Avoid the compiler “-g” debug option
  332. Avoid the compiler “-p” profiler option
  333. Unix strip command
  334. Avoid large inline functions (instruction cache locality)
  335. Don’t overuse “always inline” or “force inline”
  336. Template overuse
  337. Google “bloaty” tool

    Standard Library Optimizations (STL Optimizations):
  338. String processing efficiency (e.g., “+” for std::string can be slow)
  339. std::vector of non-trivial class objects calls constructor/destructors
  340. Control array size for std::vector using “reserve()”
  341. Use std::sort rather than qsort
  342. bsearch is not your friend
  343. Consider hard-coded arrays versus std::array versus std::vector
  344. Compare the first letters of strings before calling strcmp
  345. Consider type casts to int versus round(), ceil(), floor()
  346. Avoid printf/fprintf format string processing with putchar/putc/fputc or puts/fputs
  347. Hand-code versions of abs and fabs/fabsf that don’t handle Inf/NaN numbers (but benchmark it).
  348. Change strlen("literal") to char arr[]="literal" and use sizeof(arr)-1
  349. Don’t use strlen(s) in a for loop condition
  350. Consider your own atoi/itoa versions that don’t handle all the obscure cases.
  351. Avoid sprintf and snprintf (both are slow)
  352. sync_with_stdio(false)
  353. std::stringstream is slow (hand-code text field processing instead)
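
The strlen-in-a-loop item deserves a concrete sketch, because the broken form is quadratic: strlen re-scans the whole string on every iteration. Hoisting it into the initializer makes the scan linear (count_digits is an illustrative function).

```cpp
#include <cctype>
#include <cstring>

// Good: strlen(s) evaluated once, not once per iteration.
int count_digits(const char* s) {
    int count = 0;
    for (std::size_t i = 0, n = std::strlen(s); i < n; ++i)
        if (std::isdigit(static_cast<unsigned char>(s[i])))
            ++count;
    return count;
}
// Bad (for comparison): for (size_t i = 0; i < strlen(s); ++i) ...
```

The cast to unsigned char before isdigit also avoids undefined behavior on negative char values.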

    Data Structures:
  354. Hashing (basic)
  355. Perfect hashing
  356. Bit vectors
  357. Bit sets
  358. Bloom filters (bit vectors + hashing)
  359. Binary tree
  360. Sorted arrays
  361. Unsorted arrays
  362. Stacks
  363. Queues
  364. Dequeues
  365. Vector hashing
  366. Permutation arrays
  367. Locality-sensitive hashing (LSH)
  368. Bit signatures (vector algorithm)
  369. K-means clustering (vector algorithm)
  370. Hyper-cube (vector algorithm)
  371. Approximate nearest neighbor (ANN) (vector algorithm)

    Variable Optimizations:
  372. Prefer int types to char or short (usually)
  373. Prefer int types to unsigned int (usually)
  374. Prefer int types to size_t (usually it’s an unsigned long; consider uint32_t)
  375. Avoid unnecessary initializations
  376. Re-use objects to avoid initializations/destruction
  377. Avoid temporary variables
  378. Use reference variables instead of full temporary variables
  379. Avoid creating temporary objects
  380. Put commonly used data fields first in struct/class
  381. Declare variables as close as possible to usage
  382. if initializer syntax (C++17)
  383. switch initializer syntax (C++17)
  384. Avoid bit-fields (smaller but slower to access or set)
  385. Use memory alignment primitives to avoid slow-downs
  386. Put the most-used data member first (it has a zero offset)
  387. Order data members most used to least used (smaller offsets are faster, in theory)
  388. Array initializer lists as local variables (re-initialized each call)
  389. Structure of arrays (SoA) data layout is more vectorizable than Array of Structures (AoS).
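
The last item in a minimal sketch: with Structure-of-Arrays (SoA), each field is contiguous, so a sweep over one field is unit-stride and auto-vectorizable, unlike Array-of-Structures (AoS) where fields interleave. ParticlesSoA and advance_x are illustrative names.

```cpp
#include <cstddef>
#include <vector>

// SoA layout: all x coordinates contiguous, so this loop vectorizes well.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    void advance_x(float dt, const std::vector<float>& vx) {
        for (std::size_t i = 0; i < x.size(); ++i)
            x[i] += vx[i] * dt;    // unit-stride loads and stores
    }
};
// AoS equivalent (for comparison): struct Particle { float x, y, z; };
// std::vector<Particle> — updating x then strides over y and z too.
```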

    Arithmetic Optimizations:
  390. Operator strength reduction
  391. Reciprocal multiplication
  392. Integer arithmetic
  393. Use float not double

    Expression Optimizations:
  394. Expression transformations
  395. const
  396. mutable keyword — bypasses const (C++98) (speedy but unsafe)
  397. Common subexpression elimination (CSE)
  398. Constant folding
  399. Template fold expressions (C++17) are concise but can expand to a lot of computation
  400. Expression templates—avoids explicit temporary variables, compiler optimizes it better.
  401. Constant propagation
  402. Redundant assignment removal
  403. Strength reduction
  404. Algebraic identities
  405. Implicit type conversions (avoiding; type consistency)
  406. explicit keyword (prevent implicit type conversions) (C++98)
  407. Brace initialization syntax {} (avoids implicit narrowing conversions)
  408. auto variable declarations avoid accidental temporaries and implicit type conversions.
  409. Don’t mix float/double types (including their constants)
  410. Don’t mix integer types
  411. Prefer signed integers over unsigned types
  412. Short-circuiting of sub-expressions (using &&/||/?:)
  413. Register allocation optimizations
  414. mprotect page system call — used as optimization to make memory writeable
  415. <algorithm> simple algorithms: min, max, etc.
  416. Range check faster with casts via “(unsigned)i < MAX” not “i >= 0 && i < MAX”
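
The single-compare range check in the last item works because a negative int cast to unsigned wraps to a huge value, so one comparison covers both bounds:

```cpp
// One comparison replaces "i >= 0 && i < max": negative i wraps to a
// value far above any sane max after the cast to unsigned.
constexpr bool in_range(int i, unsigned max) {
    return static_cast<unsigned>(i) < max;
}
```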

    Memory Block Operations:
  417. Prefer contiguous memory blocks (locality, efficient block operations, etc.)
  418. Different class types can allow block copying: POD (Plain Old Data), trivial types, standard layout types (e.g., check in a template using std::is_trivial)
  419. Copy arrays by wrapping them in a dummy struct
  420. Copy arrays with memcpy
  421. Compare arrays with memcmp (very dangerous: padding bytes, negative zero, NaNs)
  422. Use memcpy not memmove if arguments won’t overlap.
  423. Linearize multi-dimensional arrays (contiguous memory blocks)

    Operator Strength Reduction Optimizations:
  424. Replace * with bitshifts
  425. Replace * with addition
  426. Replace x*2 with x+x
  427. Replace % with bitwise-and (&)
  428. Replace % with increment and test
  429. Replace % with type casts (if byte sizes)
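
A minimal sketch of the %-to-bitwise-and item, the classic ring-buffer indexing trick: for a power-of-two size N, i % N equals i & (N - 1) for unsigned i, replacing a division with a single AND. The function name wrap and N = 1024 are illustrative.

```cpp
// i % N as a bitmask; valid only because N is a power of two.
constexpr unsigned wrap(unsigned i) {
    constexpr unsigned N = 1024;   // must be a power of two
    return i & (N - 1);            // same result as i % N
}
```

Compilers already do this for constant power-of-two divisors, so the explicit form matters mostly when the size is a runtime variable the compiler cannot see.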

    Bitwise Optimizations:
  430. Intrinsic bitwise functions
  431. CLZ (count leading zeros) bitwise intrinsics
  432. CTZ (count trailing zeros) bitwise intrinsics
  433. Popcount bitwise intrinsics (set bit count)
  434. Kernighan bit trick (find highest bit set)
  435. Fast NOR/NAND/XNOR via assembly instructions
  436. Fast LOG2 of integers
  437. Fast largest power-of-two of integers
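
Several of the items above reduce to the count-leading-zeros and popcount intrinsics, sketched here with the GCC/Clang builtins (C++20 offers portable equivalents in <bit>: std::countl_zero, std::popcount):

```cpp
#include <cstdint>

// Fast integer log2 via count-leading-zeros (undefined for x == 0).
int log2_floor(uint32_t x) {
    return 31 - __builtin_clz(x);
}

// Largest power of two <= x, i.e., the highest set bit isolated.
uint32_t largest_pow2(uint32_t x) {
    return 1u << log2_floor(x);
}

// Set-bit count (population count).
int bits_set(uint32_t x) {
    return __builtin_popcount(x);
}
```

On modern x86/ARM these compile to single instructions (LZCNT/CLZ, POPCNT).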

    Floating-Point Optimizations:
  438. Convert float to 32-bit integers (float bit manipulations)
  439. FTZ (Flush to Zero) mode
  440. DAZ (Denormals Are Zero) mode
  441. LOG2 of floating-point is the exponent
  442. Zero/negative zero bitwise tests
  443. Disallow negative zero (to use faster zero comparisons)
  444. NaN (Not-a-Number) bitwise tests
  445. Inf/-Inf bitwise tests
  446. Avoid denormalized numbers
  447. Disable denormalized numbers (subnormals) (compiler/library modes)
  448. Avoid underflow in floating-point (ignore it)
  449. Avoid overflow in floating-point (ignore it)
  450. memcmp float vector equality (disallow special values for fast float vector equality comparison)
  451. Fast detection of special values in float vectors (bitwise operations)
  452. Floating-point intrinsic functions (various)
  453. Exponent addition: bitshifting floating-point by addition of the exponent bits
  454. Sign bit flipping/extraction/setting (bitwise tricks)

    Compiler Settings for Floating-Point:
  455. GCC -ffast-math option — faster math mode.
  456. GCC -fno-math-errno — faster math calls by not setting errno (also helps the compiler vectorize libm calls).
  457. GCC -ffinite-math-only
  458. GCC -fno-trapping-math
  459. MSVS /fp:precise, /fp:strict, /fp:fast
  460. Disable floating-point exceptions

    Loop Optimizations:
  461. Exit loops early (e.g., break or return statements)
  462. Finish loop body early (i.e., continue statement)
  463. Correct choice of loop
  464. Loop unrolling
  465. #pragma unroll
  466. Loop fusion
  467. Loop perforation (probabilistic)
  468. Loop tiling/blocking
  469. Loop fission
  470. Loop reversal (don’t use!)
  471. Loop code motion (“hoisting”)
  472. Loop distribution
  473. Loop iterator strength reduction
  474. Loop coalescing
  475. Loop collapsing
  476. Loop peeling
  477. Loop splitting
  478. Loop interchange
  479. Loop sentinel
  480. Loop strip mining (loop sectioning)
  481. Loop spreading
  482. Loop normalization
  483. Loop skewing
  484. Loop interleaving
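
The loop-sentinel item in a minimal sketch: plant the search key past the end of the data so the scan cannot run off the array, cutting the loop to a single comparison per iteration instead of a bounds test plus a match test. It needs a writable container with a spare slot (find_with_sentinel is an illustrative name).

```cpp
#include <vector>

// Sentinel linear search: returns the index of key, or -1 if absent.
int find_with_sentinel(std::vector<int>& v, int key) {
    v.push_back(key);              // sentinel guarantees the loop terminates
    int i = 0;
    while (v[i] != key) ++i;       // one comparison per iteration
    v.pop_back();                  // restore the original contents
    return i < static_cast<int>(v.size()) ? i : -1;  // sentinel-only hit = -1
}
```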

    If Statement Optimizations:
  485. Replace if-else-if sequences with switch.
  486. Replace if-else-if sequences with lookup table loop.

    Switch Statement Optimizations:
  487. Use compact numeric ranges in switch (compiler can use a LUT)

    Compile-Time Optimizations:
  488. inline functions
  489. always_inline specifier
  490. GCC flatten function attribute (inlines the calls inside the function)
  491. gnu_inline GCC specifier
  492. Keep inline functions short (helps compiler to inline)
  493. Keep inline functions in header files (source available to all its calls)
  494. Avoid making a virtual function “inline”—compiles but usually is a slug.
  495. sizeof
  496. Use sizeof with static_assert (e.g., portability checks)
  497. Virtual functions cannot be inlined (although it compiles)
  498. Pointer-to-function usages of functions cannot be inlined
  499. Function objects (functors) cannot always be inlined
  500. Lambda functions cannot always be inlined
  501. inline variables (C++17) (helps with linking)
  502. static_assert (compile-time assertions)
  503. const is good
  504. constexpr (C++11) is great
  505. constexpr functions allow if, switch, loops, etc. (C++14)
  506. constexpr lambda functions (C++17)
  507. constexpr and placement new (C++26)
  508. References to constexpr variables (C++26)
  509. if constexpr statements (C++17)
  510. constinit (C++20)
  511. consteval (C++20)
  512. if consteval (C++23)
  513. Type traits <type_traits> (C++11)
  514. typeid is slow (RTTI)
  515. std::is_same_v (type trait test)
  516. Template specialization (for specific types)
  517. Template specialization (for constant integers)
  518. Variadic templates (C++11)
  519. Template Meta Programming (TMP) still works, but prefer constexpr
  520. Auto-vectorization (by compiler)
  521. Auto-unrolling of loops (by compiler)
  522. SFINAE tricks (mostly an issue for compiler engineers)

    Pointer Aliasing:
  523. Reorganize functions with awareness of pointer aliasing issues
  524. Restricted pointers (to avoid pointer aliasing slowdowns)
  525. -fstrict-aliasing compiler option (alternative to using “restrict”)

    Pointer Arithmetic:
  526. Loop pointer arithmetic
  527. End pointer address tricks (Loop pointer arithmetic)
  528. Use references not pointers (avoids null testing)
  529. Prefer postfix operations with the *ptr++ idiom (not prefix ++ptr)
  530. Pointer comparison tricks
  531. Pointer difference tricks
  532. Avoid safe pointer class wrappers (prefer raw pointers for speed)

    Pointer Optimizations (Other):
  533. reinterpret_cast (effectively a free compile-time cast)
  534. Avoid dynamic_cast (to downcast from a base to a derived class, which can be helpful for specializing member calls, but dynamic casts can be expensive at runtime because of RTTI)

    Function Optimizations:
  535. Return early from functions
  536. Flatten function call hierarchies
  537. Callbacks are an extra layer of function call
  538. Lambda functions are convenient but are an extra function call layer (though often inlined)
  539. Function objects (functors) are an extra function call
  540. Avoid recursion (completely; we’re not in High School anymore)
  541. Replace simple recursion with a loop
  542. Replace complex recursion with a stack
  543. Tail recursion elimination
  544. Recursion higher base level
  545. Collapse recursion levels
  546. Specialize functions with default arguments (use two versions)
  547. Specialize functions with void and non-void versions (if return value often ignored)
  548. Avoid function pointers (cannot be inline or constexpr)
  549. Merge multiple Boolean function parameters into a “config” object with Boolean data fields.
  550. noexcept attributes allow compiler to avoid adding extra code (C++11)
  551. std::initializer_list can be used to return multiple values (benchmark against other methods)

    C++ Class Optimizations:
  552. friend functions (bypass interfaces)
  553. friend classes (bypass interfaces)
  554. Return references rather than objects
  555. Avoid temporary class objects in expressions
  556. Add extra member functions to avoid temporary object creation
  557. Pass objects by reference to functions (i.e., “&” or “const&”)
  558. Disable copy constructors with “private” or “= delete
  559. Disable assignment operators with “private” and “= delete
  560. Declare assignment operators with void return type (except when defaulting)
  561. Re-use objects to avoid constructor and destructor calls
  562. Avoid calling the destructor when in shutting down mode
  563. Uninitialized memory algorithms, e.g., std::uninitialized_fill (C++17)
  564. CRTP (Curiously Recurring Template Pattern): derived class derives from base class which is itself a template involving a pointer to the derived class (optimizes polymorphism to be compile-time, avoiding virtual function calls; also this allows more inlining of these calls.)
  565. Move constructors
  566. Move assignment operators
  567. std::move (C++11, C++14) is usually a compile-time cast.
  568. Return object reference types (not complicated objects)
  569. Avoid virtual function calls with explicit calls to the specific function
  570. Specialize inherited member functions (for the more restrictive type)
  571. Avoid overloading the postfix increment/decrement operators
  572. Block the overloaded postfix increment/decrement operators (void body or =delete)
  573. Consider skipping destructor cleanup if program is shutting down
  574. Avoid accidental double initialization of data members in constructors
  575. Avoid redundant initialization of same members in both constructor and “setup” methods
  576. Specialize member functions with default arguments (use two versions instead)
  577. Default constructors/destructors with “=default” may be more efficient than hand-coded versions.
  578. Trick for singleton pattern in multithreading — initialize a function-local static variable; the C++11 standard guarantees once-only initialization, with other threads blocking until it completes (the Meyers singleton).
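As a sketch of the CRTP item above (class and function names here are illustrative, not from any particular library), the base class is a template over the derived class, so the "polymorphic" call dispatches statically:

```cpp
#include <cassert>

// CRTP sketch: Shape<Derived> calls the derived implementation via a
// static_cast, so dispatch is resolved at compile time -- no vtable,
// no virtual call, and the call can be inlined.
template <typename Derived>
struct Shape {
    int area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Square : Shape<Square> {
    int side;
    explicit Square(int s) : side(s) {}
    int area_impl() const { return side * side; }
};
```

The trade-off versus virtual functions: each derived type produces a distinct base instantiation, so CRTP bases cannot be stored heterogeneously behind one pointer type.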

    Advanced C++ Compiler Optimizations:
  579. Copy elision (compiler auto-optimization with avoidance of a copy constructor in certain cases)
  580. Guaranteed copy elision (C++17)
  581. Named return value elision (a type of copy elision)
  582. Temporary return value elision (a type of copy elision)
  583. Copy elision in exception handling (special case for copy elision)
  584. Allocation elision (new operator) (C++14)
  585. Use xvalue or “expiring value” optimizations (various)
  586. Trick: to disallow creating an object on the stack, make its destructor private.
  587. Trick: to disallow creating an object on the heap, make its new and new[] operators private.
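The last trick above can be sketched as follows (a minimal illustration with a made-up class name): deleting the class's allocation operators makes heap creation a compile-time error while stack objects still work. Note that with C++11, "= delete" is the modern replacement for making the operators private:

```cpp
#include <cassert>
#include <cstddef>

// Disallow heap allocation by deleting operator new and operator new[].
struct StackOnly {
    int value = 0;
    static void* operator new(std::size_t) = delete;
    static void* operator new[](std::size_t) = delete;
};
// StackOnly* p = new StackOnly;   // compile-time error
```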

    Byte Block Operations in C++ Classes: (Use with extreme care!)
  588. memset/bzero to zero in a constructor — fast but dangerous: overwrites internal “vtable” data in the object if the class has any virtual functions, and does not call the constructors of its data members or base classes; also incompatible with a member initializer list, because the memset in the constructor body runs afterwards and zeroes any members the list had set.
  589. memcpy to bitwise copy in a copy constructor or assignment operator — fast but dangerous, improperly copies internal vtable data in object if class has any virtual functions, does not deeply copy any of its members or base class members nor call their constructors.
  590. memcpy to bitwise copy in a move constructor or move assignment operator — fast but dangerous; improperly copies “vtable” data.
  591. memcmp to bitwise compare for equality/inequality tests — fast but fails in many situations due to pitfalls: padding bytes, bit-field members, negative versus positive zero floating-point values, NaN floating-point values.
  592. Virtual inheritance — usually for pure virtual base classes; avoids double objects if the same base class is inherited in two different ways.
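Given the pitfalls listed above, one defensive sketch (with illustrative names) is to gate any memcpy-style copy behind std::is_trivially_copyable, so misuse on a class with virtual functions or owning members fails at compile time rather than corrupting objects at run time:

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// A plain struct with no virtual functions and no owning pointers,
// so a bitwise copy is well-defined.
struct Point { int x; int y; };

static_assert(std::is_trivially_copyable<Point>::value,
              "bitwise copy is only valid for trivially copyable types");

Point fast_copy(const Point& src) {
    Point dst;
    std::memcpy(&dst, &src, sizeof(Point));   // safe here
    return dst;
}
```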

    Timing C++ Methods:
  593. std::chrono C++ class (highly granular)
  594. clock() C/C++ function
  595. time command (Linux shell)
  596. time() function (granularity is only in seconds)
  597. gettimeofday()
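A minimal std::chrono timing harness along the lines of the first item (the function name is illustrative); steady_clock is monotonic, which makes it the usual choice for measuring intervals:

```cpp
#include <cassert>
#include <chrono>

// Time a callable and return the elapsed microseconds.
template <typename Func>
long long time_microseconds(Func f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(
        stop - start).count();
}
```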

    Benchmarking C++ Methods:
  598. Loop unrolling for accurate benchmarking
  599. Use volatile specifier for accurate benchmarking
  600. Loop overhead measurement for accurate benchmarking
  601. Google Benchmark: Apache 2 license; code: https://github.com/google/benchmark
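The first three benchmarking techniques above combine naturally in a hand-rolled harness (a sketch with made-up names): a volatile sink stops the optimizer deleting the measured work, and timing an empty loop lets the loop overhead be subtracted from the result:

```cpp
#include <cassert>
#include <chrono>

volatile long g_sink = 0;            // volatile: prevents dead-code elimination

void work()  { g_sink = g_sink + 1; }
void empty() {}                      // measures loop overhead only

// Average nanoseconds per call over 'iters' iterations.
double avg_ns(void (*fn)(), int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

Net cost of the measured operation is then roughly avg_ns(work, N) - avg_ns(empty, N).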

    Compiler Settings:
  602. Optimizer settings
  603. Optimizing for space/memory size (compiler flags)

    General Build & Software Development Practices for Efficiency:
  604. Maintain separate builds for slow testables versus production executables
  605. Compile-out assertions
  606. Compile-out self-testing code
  607. Compile-out debug code or tracing code
  608. Ensure test code is not accidentally left in production (test a global flag based on these macros at startup)
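Compiling-out debug or tracing code typically looks like the following sketch (YDEBUG is a hypothetical project flag invented here; NDEBUG is the standard macro that disables assert()):

```cpp
#include <cassert>
#include <cstdio>

// Debug tracing that compiles to nothing unless YDEBUG is defined.
#if defined(YDEBUG)
#define DBG_TRACE(msg) std::fprintf(stderr, "TRACE: %s\n", (msg))
constexpr bool kDebugBuild = true;
#else
#define DBG_TRACE(msg) ((void)0)   // expands to a no-op in production
constexpr bool kDebugBuild = false;
#endif
// At startup, production code can check kDebugBuild to catch a
// debug/test build accidentally shipped to production.
```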

    CUDA C++ GPU Optimizations:
  609. Coalesced memory accesses
  610. Thread specialization (GPU)
  611. GPU thread pools
  612. Producer-consumer thread pools
  613. GPU kernel optimizations
  614. Striding (GPU kernels)
  615. Overlapping GPU uploads and compute
  616. Overlapping with recomputation/rematerialization
  617. Offloading to CPU
  618. Pinned memory blocks
  619. Warp divergence (warp coherence)
  620. Grid optimizations
  621. Grid size optimizations

    Core Utility Classes (Efficiency Helpers): (to build for overall efficiency practices)
  622. Bitwise macro library (bitflag management)
  623. Floating-point fast bitwise operations macro library
  624. Benchmarking/timing library
  625. Smart buffer library (reduce allocations by combining allocated/non-allocated memory management)
  626. TCP/UDP wrapper library
  627. Specialized data structures for small amounts of data (faster than STL)
  628. Sorted array and binary search (small array size)
  629. Lock-free queues
  630. Perfect hashing library
  631. Bit vector data structures (possibly based on STL)
  632. Bit set data structures (possibly based on STL)
  633. Bloom filter library
  634. Vector hashing library
  635. Caching utilities library
  636. Source code precomputation library
  637. Basic data and statistics on vectors (e.g., averages, std dev/variance, etc.)
  638. Incremental vector algorithms (averages, min, max, etc.)
  639. Branchless coding primitives library
  640. Graph library for locking analysis
  641. Data compression library
  642. Approximate tests library
  643. Math library (versus STL)
  644. Memory pools library (fixed-size custom memory allocators)
  645. Custom memory allocator library
  646. Placement new operator versions
  647. Placement delete operator (write your own)
  648. Multi-dimensional array library (linearize your vectors/matrices/tables/tensors)
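As a sketch of the memory-pool items above (all names invented here), a free list threaded through preallocated slots gives O(1) allocate/free with no system calls on the hotpath:

```cpp
#include <cassert>
#include <cstddef>

// Fixed-size memory pool: SlotCount slots of SlotSize bytes each,
// linked into a free list; allocate/deallocate are O(1).
template <std::size_t SlotSize, std::size_t SlotCount>
class FixedPool {
    union Slot { Slot* next; unsigned char bytes[SlotSize]; };
    Slot slots_[SlotCount];
    Slot* free_head_ = nullptr;
public:
    FixedPool() {
        // Thread every slot onto the free list.
        for (std::size_t i = 0; i < SlotCount; ++i) {
            slots_[i].next = free_head_;
            free_head_ = &slots_[i];
        }
    }
    void* allocate() {
        if (!free_head_) return nullptr;   // pool exhausted
        Slot* s = free_head_;
        free_head_ = s->next;
        return s;
    }
    void deallocate(void* p) {
        Slot* s = static_cast<Slot*>(p);
        s->next = free_head_;
        free_head_ = s;
    }
};
```

A production version would also need to handle alignment for the stored type and (if shared across threads) either a lock or a lock-free free list.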


    AI Kernel Optimizations (using LLM Inference Optimizations for non-AI low latency applications): (subset of methods to consider)
    Reference: 500+ LLM Inference Optimization Techniques (blog article)

  649. Kernel fusion
  650. Kernel fission
  651. Kernel tiling/blocking
  652. Quantization (integer-based approximation of floating-point)
  653. Low-bit quantization
  654. Binary quantization (1-bit)
  655. Integer-only arithmetic
  656. Floating-point quantization (FP16/FP8/FP4)
  657. Mixed precision quantization
  658. Logarithmic quantization
  659. Dyadic quantization
  660. Low rank matrices
  661. MatMul/GEMM optimizations (many)
  662. MatMul data locality optimizations
  663. Sparse MatMul
  664. Approximate matrix multiplication
  665. Contiguous memory block matrix multiplication
  666. Cached transpose MatMul
  667. Fused transpose MatMul
  668. Tiled/blocked MatMul
  669. Sparsification (Pruning/Sparsity)
  670. Token pruning (input compression)
  671. Token skipping
  672. Token merging
  673. Data compression algorithms
  674. Early exiting (of layers)
  675. Caching optimizations
  676. Vector computation caching
  677. Zero skipping
  678. Negative skipping
  679. Padding optimizations
  680. Zero padding removal
  681. Zero-multiplication arithmetic
  682. Adder/addition (zero-multiply)
  683. Bitshifts (zero-multiply)
  684. Bitshift-add (zero-multiply)
  685. Double bitshift-add (zero-multiply)
  686. Add-as-integer (zero-multiply)
  687. Logarithmic arithmetic (zero-multiply)
  688. Hadamard element-wise matrix multiplication
  689. End-to-end integer arithmetic
  690. Table lookup matrix multiplication
  691. Weight clustering (grouped quantization)
  692. Vector quantization
  693. Parameter sharing
  694. Activation function optimizations (non-linear functions)
  695. Precomputation of Activation functions
  696. Approximation of Activation functions
  697. Integer-only approximation of Activation functions
  698. Fused activation functions
  699. Normalization optimizations (non-linear vector data functions)
  700. Fused normalization optimizations
  701. FFN optimizations (double MatMul)
  702. FFN approximations
  703. FFN integer-only
  704. Decoding algorithm optimizations
  705. Speculative decoding
  706. Multi-token decoding
  707. Ensemble decoding
  708. Consensus/majority-vote decoding
  709. Easy-hard queries
  710. Batching computations
  711. Advanced number systems
  712. Posit numbers
  713. Dyadic numbers
  714. Hybrid number systems
  715. Fixed point numbers (integers not floating-point)
  716. Block floating-point (BFP) hybrids
  717. Logarithmic number system (LNS)
  718. Disaggregation (prefill/decoding)
  719. Computation re-use
  720. Conditional computation
  721. Approximate caching
  722. Addition arithmetic optimizations
  723. Approximate addition
  724. Bitwise arithmetic optimizations
  725. Fast multiplication arithmetic
  726. Approximate multiplication
  727. Logarithmic approximate multiplication
  728. Approximate division
  729. Bitserial arithmetic
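To illustrate the tiling/blocking items above with a CPU-side sketch (TILE is a tuning parameter chosen here for illustration, not a value from the article), a blocked matrix multiply keeps sub-blocks resident in cache rather than streaming whole rows and columns:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Tiled (blocked) n x n matrix multiply, row-major: C += A * B.
// Working on TILE x TILE sub-blocks improves cache locality over
// the naive triple loop.
constexpr int TILE = 32;

void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int n) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < std::min(ii + TILE, n); ++i)
                    for (int k = kk; k < std::min(kk + TILE, n); ++k) {
                        float a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + TILE, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```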

More AI Research Topics

Read more about:

Aussie AI Advanced C++ Coding Books



C++ AVX Optimization: CPU SIMD Vectorization:
  • Introduction to AVX SIMD intrinsics
  • Vectorization and horizontal reductions
  • Low latency tricks and branchless programming
  • Instruction-level parallelism and out-of-order execution
  • Loop unrolling & double loop unrolling

Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization



C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
  • Low-level C++ efficiency techniques
  • C++ multithreading optimizations
  • AI LLM inference backend speedups
  • Low latency data structures
  • Multithreading optimizations
  • General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency



Advanced C++ Memory Techniques: Efficiency & Safety:
  • Memory optimization techniques
  • Memory-efficient data structures
  • DIY memory safety techniques
  • Intercepting memory primitives
  • Preventive memory safety
  • Memory reduction optimizations

Get your copy from Amazon: Advanced C++ Memory Techniques



Safe C++: Fixing Memory Safety Issues:
  • The memory safety debate
  • Memory and non-memory safety
  • Pragmatic approach to safe C++
  • Rust versus C++
  • DIY memory safety methods
  • Safe standard C++ library

Get it from Amazon: Safe C++: Fixing Memory Safety Issues



Efficient C++ Multithreading: Modern Concurrency Optimization:
  • Multithreading optimization techniques
  • Reduce synchronization overhead
  • Standard container multithreading
  • Multithreaded data structures
  • Memory access optimizations
  • Sequential code optimizations

Get your copy from Amazon: Efficient C++ Multithreading



Efficient Modern C++ Data Structures:
  • Data structures overview
  • Modern C++ container efficiency
  • Time & space optimizations
  • Contiguous data structures
  • Multidimensional data structures

Get your copy from Amazon: Efficient C++ Data Structures



Low Latency C++: Multithreading and Hotpath Optimizations: advanced coding book:
  • Low Latency for AI and other applications
  • C++ multithreading optimizations
  • Efficient C++ coding
  • Time and space efficiency
  • C++ slug catalog

Get your copy from Amazon: Low Latency C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging