Aussie AI Blog

List of 600+ Low-Latency C++ Techniques

  • Updated: March 1st 2026
  • by David Spuler, Ph.D.

List of C++ Low-Latency and Efficiency Techniques

This is a compilation of coding-efficiency and low-latency C++ programming techniques from various books and articles.

Here’s the long list:

    Low Latency C++ General Software Approaches:
  1. Cache warming
  2. Core pinning (“affinity”)
  3. False sharing (avoiding it!)
  4. Branch prediction optimizations
  5. Hotpath optimizations
  6. System optimizations
  7. Slowpath removal
  8. Kernel bypass
  9. Lock contention (reducing delays)
  10. Lock-free programming (with atomics and memory ordering issues)
  11. Thread pools
  12. SIMD CPU instructions (including AVX x86, ARM Neon)
  13. Inline assembly language (“asm” statements)
  14. Intrinsic functions (often closely mapping to machine code instructions)
  15. In-memory logging
  16. Cache locality (for L1/L2/L3 memory caches and instruction caches)
  17. Specialized data structures
  18. Thread-Local Storage (TLS) (“thread_local” type in C++11)
  19. Shared memory (e.g., shmctl “shared memory control” function, shmget, shm_open, ftruncate)
  20. Memory mapped files/devices (e.g., mmap, munmap)
  21. Asynchronous programming (std::async)
  22. Instruction-level parallelism (ILP)

    Concurrency-Friendly Data Structures:
  23. Parallel data structures (overview)
  24. Read-only data structures
  25. Reader-friendly data structures (e.g., many readers, one writer)
  26. Copy-on-write data structures (for readers)
  27. Versioned data structures (for readers)
  28. Partition data across threads (vertically: columns)
  29. Shard data across threads (horizontally: rows)
  30. Read-Copy-Update (RCU) — mostly the same as copy-on-write.
  31. NUMA-aware data structures—reduce cross-node communications
  32. Transactional memory (synchronization efficiency, reduces contention) — use atomic/isolated transactions (an emerging technology)

    Optimized Low-Latency Data Structures:
  33. Linear array with full unrolling (for small n)
  34. Branchless binary search (sorted array)
  35. Swiss hash tables (cache-optimized hashing with probing)
  36. Cached wrapper (around data structures)
  37. Bucket array data structure (extending array with delayed deletion)
  38. Hive data structure (generalizes bucket arrays, std::hive in C++26)

    Hotpath Optimizations:
  39. Hotpath optimizations (overview)
  40. Optimize all steps in the hotpath (e.g., data ingestion, decision, trade execution, logging, risk management)
  41. Profile the hotpath specifically (e.g., a test mode that always runs the hotpath)
  42. Examine assembly code of the hotpath
  43. Avoid memory allocation calls on hotpath (e.g., memory pools, preallocation)
  44. Avoid free/deallocation of memory on hotpath
  45. Use preallocated memory on hotpath
  46. Review data de-serialization and serialization costs
  47. Use in-memory databases for any significant amounts of incoming data
  48. Keep the client network connection warm (method depends on the API)
  49. Re-use objects to avoid constructor/destructor calls on hotpath

    General Tuning Advice:
  50. Avoid micro-optimization
  51. Avoid optimizing error handling code (it’s a slowpath)
  52. Loop optimizations (see below)
  53. Avoid nested loops
  54. Tune inner loop for nested loops
  55. Avoid excessive function wrapper overhead

    Performance Profiling Tools:
  56. gprof
  57. perf
  58. prof (older)
  59. pixie (older)

    Lock Contention Reduction:
  60. Lock contention (overview)
  61. Late lock acquisition
  62. Early lock release
  63. Short critical section of code
  64. Generally reduce total numbers of locks used
  65. Locking fine-grain vs coarse-grain
  66. Use fine-grain locks for contested resources
  67. Use a hybrid fine-grain/coarse-grain lock strategy
  68. Release locks before significant computation
  69. Copy data to temporary variables to unlock before computation
  70. Release locks before blocking for I/O
  71. Release locks before blocking for system calls
  72. Release locks before blocking for networking
  73. Tolerate lockless output overlaps
  74. std::shared_mutex (C++17) and std::shared_lock (C++14) — multiple readers, one writer.
  75. Double lock check method (check first without a lock)
  76. Use message-passing via std::promise and std::future rather than shared memory.
  77. Thread-specific queues and “work stealing” design pattern
  78. Use a lock-free queue data structure
  79. thread_local keyword (C++11)
  80. std::lock_guard (C++11)
  81. std::lock_guard early release by scope control
  82. std::unique_lock (C++11) (more granular control than std::lock_guard)
  83. std::scoped_lock (C++17)
  84. Locking with timeouts (try locks)
  85. Avoid spinlock busy waiting
  86. Exponential backoff to avoid spinlock costs
    See also “lock-free programming”
    See also “concurrency-friendly data structures”

    Thread/lock overhead reduction (generally):
  87. Reduce thread launch overhead
  88. Reduce thread destruction overhead
  89. Reduce lock acquisition/release overhead
  90. Reduce lock contention overhead
  91. std::make_shared() or std::allocate_shared() do only one allocation (object and control block combined), whereas std::shared_ptr<T> p(new T) does two allocations (the object and the control block are separate).
  92. Weak pointers (std::weak_ptr) can delay the deallocation of a shared_ptr and its object even after the main reference count is zero.

    System code optimizations (general ideas):
  93. Avoid system calls to reduce context switches (in Linux)
  94. Use C++ “intrinsics” functions (highly optimized assembly-level code)

    Linux socket programming:
  95. Non-blocking sockets versus using select() with a timeout—allows thread to do “other” useful work rather than just wait.
  96. poll() or epoll (epoll_wait) system calls rather than busy waiting

    Context Switching Reduction:
  97. Thread counts (not too many threads)
  98. Thread specialization
  99. Thread specialization (producer-consumer thread model)
  100. Use custom thread pools with only preallocated memory block pools.
  101. Spinlocks avoid context switches (especially good if the spin lasts only a short time)
  102. Avoid context switch cost by having a thread do “other” work, rather than just blocking.

    Cache Locality Optimizations:
  103. Tiling/blocking algorithms
  104. Tiling/blocking matrix multiplication (MatMul/GEMM)
  105. Smaller data type sizes for increased locality
  106. Choose a CPU with a larger L1 “cache line size” (64-256 bytes common)
  107. std::hardware_destructive_interference_size, std::hardware_constructive_interference_size (C++17)
  108. std::initializer_list (C++11) can be used as a lightweight container with contiguous elements
    See also “cache warming (prefetch)” optimizations
    See also “false sharing (avoid)” optimizations

    Instruction Cache Locality Optimizations:
  109. Prefer shorter blocks of code in the hotpath
  110. Consider not inlining function calls (for instruction cache locality)
    See also “branch prediction optimizations”

    Branch Prediction Optimizations (General):
  111. Branch prediction (overview)
  112. Branch elimination
  113. Branch compiler hints
  114. Branch prediction heuristics
  115. Branch profiling (two-phase)
  116. Branchless programming
  117. Tools—measure branch prediction data (e.g., perf)

    Branch Reductions Techniques:
  118. Algorithm-level changes to reduce branches
  119. Keep loop bodies short (shorter branches)
  120. Reduce far branching (e.g., function calls)
  121. Reduce overall use of function calls (see function call optimizations)
  122. Reduce use of if statements
  123. Reduce use of loops
  124. Reduce use of break statements (in loops, not switch!)
  125. Reduce use of continue statements
  126. Reduce use of switch statements
  127. Reduce short-circuiting in &&/|| operators
  128. Reduce short-circuiting of ?: ternary operator
  129. Avoid virtual function calls (hidden dynamic branches)
  130. Avoid pointer-to-functions (hidden dynamic branches; blocks inlining)
  131. Avoid function objects/functors (hidden dynamic branches)
  132. Avoid lambda functions passed as arguments (depends on how well the optimizer can handle them)
  133. Reduce long if-else-if sequences
  134. Reduce nested if-else sequences
  135. Avoid branches depending on anything unpredictable
  136. Avoid branches depending on user inputs
  137. Avoid branches depending on random numbers
  138. Avoid branches depending on system clocks
  139. Sort array data for efficient branch prediction, if scanning through the array comparing the data (e.g., before testing for error range)
    See also “compile-time optimizations” (remove branches at compile-time)
    See also “loop optimizations” (reduce loop iterations, e.g., loop unrolling)

    Branch Prediction Heuristics:
  140. Common case code in if block
  141. Uncommon case code in else block
  142. Error handling code in else block (uncommon code)
  143. Avoid zero-iteration loops (never entered)
  144. Avoid single-iteration loops (never loop back)

    Branch Prediction Compiler Hints:
  145. [[likely]] and [[unlikely]] path attributes (C++20)
  146. likely() and unlikely() expression macros (not standard C++; typically wrappers around __builtin_expect)
  147. __builtin_expect (GCC)
  148. Define LIKELY and UNLIKELY macros with __builtin_expect (pre-C++20)
  149. [[noreturn]] (C++11)
  150. [[assume(expression)]] attribute (C++23)
  151. hot (GCC function attribute)
  152. GCC __builtin_unreachable
  153. std::unreachable—helps branch prediction (C++23)
  154. [[fallthrough]] — more for safety than speed (C++17)
  155. -fdelayed-branch compiler flag
  156. -fguess-branch-probability compiler flag
  157. -fif-conversion and -fif-conversion2 compiler flags
  158. Use “likely” and “unlikely” in custom assertion macros
  159. Use “likely” and “unlikely” in error handling code macros

    Branch Profiling:
  160. -fprofile-arcs (GCC option)
  161. -fprofile-generate (GCC command-line argument)
  162. -fprofile-use (GCC command-line argument)
  163. Branch profiling with 100% hotpath (test modes)

    Branchless Programming Techniques:
  164. Branchless programming (overview)
  165. Ternary operator preferred over if statements (compiler may emit a branchless CMOV instruction)
  166. Boolean variables as 0 or 1 in arithmetic
  167. Logical operators (&&/||) as 0 or 1 in arithmetic
  168. Bitwise operators (&/|) replace logical operators (&&/||)
  169. Sign bit extension bit masks
  170. Lookup tables for branchless programming (maybe, take care with cache locality)
  171. XOR trick to swap two integer variables without a temporary variable

    Slowpath Removal:
  172. Slowpath removal (overview)
  173. Optimize error checking pathways
  174. Remove error checking tests
  175. Defer error checking tests to later
  176. Combine error checking tests together (and do it later)
  177. Avoid adding error checks deeper in the call hierarchy
  178. Never-failing functions (cannot return an error)
  179. Don’t use memory allocation (avoids memory allocation failure)

    Cache Warming Methods:
  180. Cache warming (overview)
  181. Prefetch memory primitives
  182. __builtin_prefetch (GCC)
  183. _mm_prefetch (x86 SSE intrinsic)
  184. volatile on temporary variables
  185. Dry-run execution mode
  186. Branchless dry-run execution with arr[2] declarations
  187. Use read-only cache warming pathways (avoids cache invalidation for other threads)
  188. Use deep cache warming all the way down into the NIC
  189. Optimize cache warming code by reducing data reads (relies on cache line sizes)
  190. Reduce cache warming code to the maximum size of the memory cache (avoids redundant cache warming when cache is already full).

    False Sharing (Avoiding):
  191. False sharing (overview)
  192. Using alignas(64) or 128 or 256 to avoid false sharing (C++11)
  193. Use alignas on all shared memory or atomics (C++11)
  194. Tools to automatically detect false sharing (DRD fails?)

    Parallelism (General Categories):
  195. Multithreading
  196. Multiprocess
  197. Vectorization
  198. Pipelining
  199. Parallel execution modes (C++17)
  200. Coroutines (C++20)

    Advanced C++ Concurrency Data Structures:
  201. Read-only (“immutable”) data structures
  202. Lock-free algorithms and data structures
  203. Linear search can be efficient for small sizes because of cache prefetching (e.g., rather than binary search; also doesn’t need sorting maintained)

    SIMD Instructions:
  204. AVX (x86 CPUs)
  205. ARM Neon
  206. std::simd (experimental/C++26)
  207. <immintrin.h>

    Linux O/S Optimizations:
  208. Process priority upgrades (“nice” command or system call)
  209. Disable unimportant processes
  210. Overclocking CPU
  211. Overclocking GPU
  212. Disable SELinux (Security-Enhanced Linux)
  213. Disable accounting mode in Linux (should be off anyway)

    Linux Kernel Optimizations:
  214. Scheduling algorithm kernel modifications
  215. Tweak TCP/UDP network buffer settings (Linux kernel)
  216. Turn off file “last access date” storage (“noatime” in /etc/fstab)

    System Hardware Optimizations (Categories):
  217. System optimizations (overview)
  218. Processor hardware (CPU)
  219. Network optimizations
  220. Disk optimizations
  221. RAM Memory optimizations

    Processor Hardware Major Categories of Optimizations:
  222. CPU
  223. GPU
  224. NPU
  225. FPGA
  226. ASIC

    Networking Hardware Optimizations (Categories):
  227. NIC
  228. Switches
  229. Load balancer devices
  230. Size of the packet buffer of a switch (optimizing for)

    Networking Transmission/Protocol Optimizations (Categories):
  231. Physical proximity
  232. Co-Lo
  233. TCP
  234. UDP (faster than TCP but unreliable)
  235. Optical networking (optical fiber cables)
  236. Microwave network transmission
  237. Packet fragment manipulations (e.g., out-of-order)
  238. Reduce packet fragment collation overhead
  239. Reduce packet consistency checking (error safety overhead)

    Networking Software Optimizations:
  240. TcpDirect/Onload
  241. SolarFlare/OpenOnload (kernel bypass)
  242. Exablaze (NIC with kernel bypass support)
  243. DMA
  244. PCIe bus
  245. Compress data sizes for your network transmissions
  246. Sticky sessions (avoids needing to send user-specific caches between servers)
  247. Shared storage rather than other server-to-server networking (e.g., NAS/SAN)
  248. Use custom wrappers for TCP and UDP network processing

    GPU & Distributed Networking Optimizations:
  249. RDMA
  250. NVLink
  251. InfiniBand
  252. RoCE
  253. GPUDirect
  254. PXN

    Deployment Optimizations (Website backends):
  255. DNS optimizations
  256. Round-Robin DNS (RRDNS)
  257. SSL time optimizations
  258. etags (website server speedup)
  259. Multiple identical servers architecture
  260. Use subdomains for static files
  261. CDN for static files
  262. Compression modes enabled
  263. Static files compressed
  264. Minify static files (CSS, JavaScript)
  265. Merge multiple small files together
  266. Use smaller image files (low precision)
  267. Merge multiple small icon images into one image file
  268. Cache duration settings
  269. Database optimizations (various, e.g., MySQL/MariaDB/MongoDB)
  270. Database indexes
  271. Application server optimizations (e.g., Tomcat)

    Apache/Nginx Subprocess Optimizations:
  272. Use FCGI not classic CGI integrations
  273. Flush stdout of subprocesses (sends partial output earlier to Apache or Nginx)
  274. Close stdout of subprocesses before shutdown sequence (finishes earlier to Apache or Nginx)
  275. Early tests for violations and invalidity (fails quickly)

    Algorithm Enhancements:
  276. Algorithm optimizations (overview)
  277. Precomputation
  278. Lookup tables (with care, reduces data cache locality)
  279. — Precomputation to data file
  280. Precomputation of source code
  281. Incremental algorithms
  282. Data structure augmentation
  283. Parallelization
  284. Vectorization
  285. Caching
  286. Lazy evaluation
  287. Common case first
  288. Simple case first
  289. Approximate tests first
  290. Integer arithmetic (not floating-point)
  291. — Avoiding sqrt by using arithmetic on squares
  292. — Integer arithmetic on squares: avoid floating-point by using arithmetic on squares
  293. — Use variance not standard-deviation (arithmetic on squares)
  294. Approximations
  295. — Linear approximations
  296. — Bounding box approximate tests
  297. — Bounding sphere approximate tests
  298. Compute budget algorithms
  299. Probabilistic/stochastic algorithms
  300. Skipping algorithms
  301. Heuristic algorithms
  302. Greedy algorithms

    Memory Reduction Strategies:
  303. Take care with memory reduction as some methods can reduce speed (trade-offs)
  304. Reduce allocated memory
  305. Smaller data sizes
  306. Pack data into smaller integer sizes
  307. Pack data into bits
  308. Pack data using bit-fields
  309. Pack data into unions
  310. Use std::bitset
  311. Use std::vector<bool> (it is a special bit-packed template instantiation)
  312. Structure packing (also for class data members): reorder different-sized data members for better packing and fewer padding bytes
  313. Structure packing: biggest data types first (heuristic)
  314. Structure packing: MSVS /d1reportSingleClassLayout compiler option to report on it
  315. #pragma pack reduces padding to reduce size, but may worsen structure access costs
  316. Stack data reductions
  317. Avoid deallocation of heap memory when in shutting-down mode

    Heap Allocated Memory Reduction Strategies:
  318. Fewer allocated memory blocks
  319. Avoid frequent small allocations
  320. Preallocation of dynamic memory
  321. Memory fragmentation avoidance
  322. Memory leak avoidance
  323. Merge memory allocations together
  324. Memory pools (fixed-size allocations, often a type of preallocation)
  325. Memory pool with O(1) deletion and O(1) insertion via permutation array
  326. Merge fixed-size allocated objects into a large array
  327. Custom memory allocators (generalized)
  328. Class-specific memory allocator
  329. Custom global memory allocator
  330. Late allocation (allocate memory as late as possible)
  331. Early free memory (deallocate as early as possible)
  332. Early delete memory (deallocate early)
  333. Avoid realloc (slow, memory fragmentation)
  334. Smart dynamic buffers (hybrid of allocated and non-allocated memory)
  335. std::aligned_alloc — aligned memory allocation (C++17)
  336. std::aligned_union (C++11)

    Static Memory Size Reductions:
  337. Avoid large global arrays and buffers
  338. Avoid large static arrays and buffers
  339. Avoid large static C++ data members
  340. String literal memory reductions

    Stack Memory Size Reductions:
  341. Stack memory reduction (overview)
  342. Avoid large local arrays and buffers
  343. Avoid large function non-reference parameter arrays and buffers
  344. Use pass-by-reference on large function parameters
  345. Use integer parameters as local variables
  346. Consider stack versus memory allocation
  347. Flattening/reducing function call hierarchy
  348. Inline small functions (compiler can disappear them)
  349. Use #define macros for small functions (versus inlining)
    See also: function call hierarchy flattening
    See also: recursion avoidance

    Code Size Reduction Strategies:
  350. Code size reductions
  351. DLLs versus static libraries
  352. Remove executable debug information
  353. Avoid the compiler “-g” debug option
  354. Avoid the compiler “-p” profiler option
  355. Unix strip command
  356. Avoid large inline functions (instruction cache locality)
  357. Don’t overuse “always inline” or “force inline”
  358. Template overuse
  359. Google “bloaty” tool

    Standard Library Optimizations (STL Optimizations):
  360. String processing efficiency (e.g., “+” for std::string can be slow)
  361. std::vector of non-trivial class objects calls constructor/destructors
  362. Control array size for std::vector using “reserve()”
  363. Use std::sort rather than qsort
  364. bsearch is not your friend
  365. Consider hard-coded arrays versus std::array versus std::vector
  366. Compare the first letters of strings before calling strcmp
  367. Consider type casts to int versus round(), ceil(), floor()
  368. Avoid printf/fprintf format string processing with putchar/putc/fputc or puts/fputs
  369. Hand-code versions of abs and fabs/fabsf that don’t handle Inf/NaN numbers (but benchmark it).
  370. Change strlen("literal") to char arr[]="literal" and use sizeof(arr)-1
  371. Don’t use strlen(s) in a for loop condition
  372. Consider your own atoi/itoa versions that don’t handle all the obscure cases.
  373. Avoid sprintf and snprintf (both are slow)
  374. std::ios::sync_with_stdio(false)
  375. std::stringstream is slow (hand-code text field processing instead)

    Data Structures:
  376. Hashing (basic)
  377. Perfect hashing
  378. Bit vectors
  379. Bit sets
  380. Bloom filters (bit vectors + hashing)
  381. Binary tree
  382. Arrays
  383. Sorted arrays
  384. Unsorted arrays
  385. Order of insertion maintaining data structures
  386. LRU Caching
  387. Stacks
  388. Queues
  389. Deques (double-ended queues)
  390. Ring buffer (circular buffer)
  391. Vector hashing
  392. Permutation arrays
  393. Locality-sensitive hashing (LSH)
  394. Bit signatures (vector algorithm)
  395. K-means clustering (vector algorithm)
  396. Hyper-cube (vector algorithm)
  397. Approximate nearest neighbor (ANN) (vector algorithm)

    Variable Optimizations:
  398. Prefer int types to char or short (usually)
  399. Prefer int types to unsigned int (usually)
  400. Prefer int types to size_t (usually it’s an unsigned long; consider uint32_t)
  401. Avoid unnecessary initializations
  402. Re-use objects to avoid initializations/destruction
  403. Avoid temporary variables
  404. Use reference variables instead of full temporary variables
  405. Avoid creating temporary objects
  406. Put commonly used data fields first in struct/class
  407. Declare variables as close as possible to usage
  408. if initializer syntax (C++17)
  409. switch initializer syntax (C++17)
  410. Avoid bit-fields (smaller but slower to access or set)
  411. Use memory alignment primitives to avoid slow-downs
  412. Put the most-used data member first (it has a zero offset)
  413. Order data members most used to least used (smaller offsets are faster, in theory)
  414. Array initializer lists as local variables (re-initialized each call)
  415. Structure of arrays (SoA) data layout is more vectorizable than Array of Structures (AoS).

    Arithmetic Optimizations:
  416. Operator strength reduction
  417. Reciprocal multiplication
  418. Integer arithmetic
  419. Use float not double

    Expression Optimizations:
  420. Expression transformations
  421. const
  422. mutable keyword — bypasses const (C++98) (speedy but unsafe)
  423. Common subexpression elimination (CSE)
  424. Constant folding
  425. Template fold expressions (C++17) are concise but often lots of computation
  426. Expression templates—avoids explicit temporary variables, compiler optimizes it better.
  427. Constant propagation
  428. Redundant assignment removal
  429. Strength reduction
  430. Algebraic identities
  431. Implicit type conversions (avoiding; type consistency)
  432. explicit keyword (prevent implicit type conversions) (C++98)
  433. Brace initialization syntax {} (avoids implicit narrowing conversions)
  434. auto variable declarations avoid accidental temporaries and implicit type conversions.
  435. Don’t mix float/double types (including their constants)
  436. Don’t mix integer types
  437. Prefer signed integers over unsigned types
  438. Short-circuiting of sub-expressions (using &&/||/?:)
  439. Register allocation optimizations
  440. mprotect page system call — used as optimization to make memory writeable
  441. <algorithm> simple algorithms: min, max, etc.
  442. Range check faster with casts via “(unsigned)i < MAX” not “i >= 0 && i < MAX”

    Memory Block Operations:
  443. Prefer contiguous memory blocks (locality, efficient block operations, etc.)
  444. Different class types can allow block copying: POD (Plain Old Data), trivial types, standard layout types (e.g., check in a template using std::is_trivial)
  445. Copy arrays by wrapping them in a dummy struct
  446. Copy arrays with memcpy
  447. Compare arrays with memcmp (very dangerous: padding bytes, negative zero, NaNs)
  448. Use memcpy not memmove if arguments won’t overlap.
  449. Linearize multi-dimensional arrays (contiguous memory blocks)

    Operator Strength Reduction Optimizations:
  450. Arithmetic optimizations (overview)
  451. Replace * with bitshifts
  452. Replace * with addition
  453. Replace x*2 with x+x
  454. Replace % with bitwise-and (&)
  455. Replace % with increment and test
  456. Replace % with type casts (if byte sizes)

    Bitwise Optimizations:
  457. Bitwise optimizations (overview)
  458. Intrinsic bitwise functions
  459. CLZ (count leading zeros) bitwise intrinsics
  460. CTZ (count trailing zeros) bitwise intrinsics
  461. Popcount bitwise intrinsics (set bit count)
  462. Kernighan bit trick (find highest bit set)
  463. Fast NOR/NAND/XNOR via assembly instructions
  464. Fast LOG2 of integers
  465. Fast largest power-of-two of integers

    Floating-Point Optimizations:
  466. Floating-point optimizations (overview)
  467. Convert float to 32-bit integers (float bit manipulations)
  468. FTZ mode (Flush to Zero) mode
  469. DAZ mode (Denormals Are Zero)
  470. LOG2 of floating-point is the exponent
  471. Zero/negative zero bitwise tests
  472. Disallow negative zero (to use faster zero comparisons)
  473. NaN (Not-a-Number) bitwise tests
  474. Inf/-Inf bitwise tests
  475. Avoid denormalized numbers
  476. Disable denormalized numbers (subnormals) (compiler/library modes)
  477. Avoid underflow in floating-point (ignore it)
  478. Avoid overflow in floating-point (ignore it)
  479. memcmp float vector equality (disallow special values for fast float vector equality comparison)
  480. Fast detection of special values in float vectors (bitwise operations)
  481. Floating-point intrinsic functions (various)
  482. Exponent addition: bitshifting floating-point by addition of the exponent bits
  483. Sign bit flipping/extraction/setting (bitwise tricks)

    Compiler Settings for Floating-Point:
  484. GCC -ffast-math option — faster math mode.
  485. GCC -fno-math-errno — faster math multithreading by not setting errno.
  486. GCC -ffinite-math-only
  487. GCC -fno-trapping-math
  488. MSVS /fp:precise, /fp:strict, /fp:fast
  489. Disable floating-point exceptions

    Loop Optimizations:
  490. Loop vectorization
  491. Exit loops early (e.g., break or return statements)
  492. Finish loop body early (i.e., continue statement)
  493. Correct choice of loop
  494. Loop unrolling
  495. #pragma unroll
  496. Loop fusion
  497. Loop perforation (probabilistic iteration skipping)
  498. Loop tiling/blocking
  499. Loop fission
  500. Loop reversal (don’t use!)
  501. Loop code motion (“hoisting”)
  502. Loop distribution
  503. Loop iterator strength reduction
  504. Loop coalescing
  505. Loop collapsing
  506. Loop peeling
  507. Loop splitting
  508. Loop interchange
  509. Loop sentinel
  510. Loop strip mining (loop sectioning)
  511. Loop spreading
  512. Loop normalization
  513. Loop skewing
  514. Loop interleaving
  515. Loop reordering

    If Statement Optimizations:
  516. Replace if-else-if sequences with switch.
  517. Replace if-else-if sequences with lookup table loop.

    Switch Statement Optimizations:
  518. Use compact numeric ranges in switch (compiler can use a LUT)

    Compile-Time Optimizations:
  519. Compile-time optimizations (overview)
  520. Zero runtime cost operations
  521. inline functions
  522. always_inline specifier
  523. — GCC flatten attribute (inlines calls inside the function body)
  524. gnu_inline GCC specifier
  525. — Keep inline functions short (helps compiler to inline)
  526. — Keep inline functions in header files (source available to all its calls)
  527. — Avoid making a virtual function “inline”—compiles but usually is a slug.
  528. — Virtual functions cannot be inlined (although it compiles)
  529. — Pointer-to-function usages of functions cannot be inlined
  530. — Function objects (functors) cannot always be inlined
  531. — Lambda functions cannot always be inlined
  532. sizeof
  533. — Use sizeof with static_assert (e.g., portability checks)
  534. inline variables (C++17) (helps with linking)
  535. static_assert (compile-time assertions)
  536. Constant specifiers
  537. const is good
  538. constexpr (C++11) is great
  539. constexpr functions allow if, switch, loops, etc. (C++14)
  540. constexpr lambda functions (C++17)
  541. constexpr and placement new (C++26)
  542. — References to constexpr variables (C++26)
  543. if constexpr statements
  544. constinit
  545. consteval
  546. if consteval (C++23)
  547. Type traits <type_traits> (C++11)
  548. typeid is slow (RTTI)
  549. std::is_same_v (type trait test)
  550. Template specialization (for specific types)
  551. Template specialization (for constant integers)
  552. Variadic templates (C++11)
  553. Template Meta Programming (TMP) still works, but prefer constexpr
  554. Auto-vectorization (by compiler)
  555. Auto-unrolling of loops (by compiler)
  556. SFINAE tricks (mostly an issue for compiler engineers)

    Pointer Aliasing:
  557. Reorganize functions with awareness of pointer aliasing issues
  558. Restricted pointers (to avoid pointer aliasing slowdowns)
  559. -fstrict-aliasing compiler option (alternative to using “restrict”)

    Pointer Arithmetic:
  560. Pointer arithmetic (overview)
  561. Loop pointer arithmetic
  562. End pointer address tricks (Loop pointer arithmetic)
  563. Use references not pointers (avoids null testing)
  564. Prefer postfix operations with the *ptr++ idiom (not prefix ++ptr)
  565. Pointer comparison tricks
  566. Pointer difference tricks
  567. Avoid safe pointer class wrappers (prefer raw pointers for speed)

    Pointer Optimizations (Other):
  568. reinterpret_cast (a pure compile-time cast with no runtime cost)
  569. Avoid dynamic_cast (to downcast from a base to a derived class, which can be helpful for specializing member calls, but dynamic casts can be expensive at runtime because of RTTI)

    Function Optimizations:
  570. Return early from functions
  571. Flatten function call hierarchies
  572. Callbacks are an extra layer of function call
  573. Lambda functions are convenient but are an extra function call layer (though often inlined)
  574. Function objects (functors) are an extra function call
  575. Specialize functions with default arguments (use two versions)
  576. Specialize functions with void and non-void versions (if return value often ignored)
  577. Avoid function pointers (cannot be inline or constexpr)
  578. Merge multiple Boolean function parameters into a “config” object with Boolean data fields.
  579. noexcept attributes allow compiler to avoid adding extra code (C++11)
  580. std::initializer_list can be used to return multiple values (benchmark against other methods)

    Recursion Optimizations:
  581. Avoid recursion (completely; we’re not in High School anymore)
  582. Replace simple recursion with a loop
  583. Replace complex recursion with a stack
  584. Tail recursion elimination
  585. Recursion higher base level
  586. Collapse recursion levels

    C++ Class Optimizations:
  587. friend functions (bypass interfaces)
  588. friend classes (bypass interfaces)
  589. Return references rather than objects
  590. Avoid temporary class objects in expressions
  591. Add extra member functions to avoid temporary object creation
  592. Pass objects by reference to functions (i.e., “&” or “const&”)
  593. Disable copy constructors with “private” or “= delete
  594. Disable assignment operators with “private” and “= delete
  595. Declare assignment operators with void return type (except when defaulting)
  596. Re-use objects to avoid constructor and destructor calls
  597. Avoid calling the destructor when in shutting down mode
  598. Uninitialized memory algorithms, e.g., std::uninitialized_fill (with more added in C++17, such as std::uninitialized_move)
  599. CRTP (Curiously Recurring Template Pattern): derived class derives from base class which is itself a template involving a pointer to the derived class (optimizes polymorphism to be compile-time, avoiding virtual function calls; also this allows more inlining of these calls.)
  600. Move semantics
  601. — Move constructors
  602. — Move assignment operators
  603. std::move (C++11) is just a compile-time cast to an rvalue reference (no code is generated).
  604. — Return object reference types (not complicated objects)
  605. Avoid virtual function calls with explicit calls to the specific function
  606. Specialize inherited member functions (for the more restrictive type)
  607. Avoid overloading the postfix increment/decrement operators
  608. Block the overloaded postfix increment/decrement operators (void body or =delete)
  609. Consider skipping destructor cleanup if program is shutting down
  610. Avoid accidental double initialization of data members in constructors
  611. Avoid redundant initialization of same members in both constructor and “setup” methods
  612. Specialize member functions with default arguments (use two versions instead)
  613. Default constructors/destructors with “=default” may be more efficient than hand-coded versions.
  614. Singleton-pattern trick for multithreading — initialize a function-local static variable; since C++11 the compiler guarantees once-only initialization, with other threads blocking until the initializing thread finishes.

    Advanced C++ Compiler Optimizations:
  615. Copy elision (compiler auto-optimization with avoidance of a copy constructor in certain cases)
  616. — Guaranteed copy elision (C++17)
  617. — Named return value elision (a type of copy elision)
  618. — Temporary return value elision (a type of copy elision)
  619. — Copy elision in exception handling (special case for copy elision)
  620. — Allocation elision (new operator) (C++14)
  621. Use xvalue or “expiring value” optimizations (various)
  622. Trick: to disallow creating an object on the stack, make its destructor private.
  623. Trick: to disallow creating an object on the heap, make its new and new[] operators private.

    Byte Block Operations in C++ Classes: (Use with extreme care!)
  624. memset/bzero to zero an object in a constructor — fast but dangerous: it overwrites the internal “vtable” pointer if the class has any virtual functions, it does not call the constructors of data members or base classes, and it cannot be combined with an initializer list, because the zeroing overwrites any members the list already set.
  625. memcpy to bitwise-copy in a copy constructor or assignment operator — fast but dangerous: it improperly copies the internal vtable pointer if the class has any virtual functions, and it neither deep-copies any members or base classes nor calls their copy constructors.
  626. memcpy to bitwise-copy in a move constructor or move assignment operator — fast but dangerous: it also improperly copies the “vtable” pointer.
  627. memcmp to bitwise-compare for equality/inequality tests — fast but fails in many situations due to pitfalls: padding bytes, bit-field members, negative versus positive floating-point zero, and NaN floating-point values.
  628. Virtual inheritance — usually for pure virtual base classes; avoids double objects if the same base class is inherited in two different ways.

    Timing C++ Methods:
  629. std::chrono C++ class (highly granular)
  630. clock() C/C++ function
  631. time command (Linux shell)
  632. time() function (granularity is only in seconds)
  633. gettimeofday()

    Benchmarking C++ Methods:
  634. Loop unrolling for accurate benchmarking
  635. Use the volatile qualifier for accurate benchmarking
  636. Loop overhead measurement for accurate benchmarking
  637. Google Benchmark: Apache 2 license; code: https://github.com/google/benchmark

    Compiler Settings:
  638. Optimizer settings
  639. Optimizing for space/memory size (compiler flags)

    General Build & Software Development Practices for Efficiency:
  640. Maintain separate builds for slow testables versus production executables
  641. Compile-out assertions
  642. Compile-out self-testing code
  643. Compile-out debug code or tracing code
  644. Ensure test code not accidentally left in production (test a global flag based on these macros at startup)

    CUDA C++ GPU Optimizations:
  645. Coalesced memory accesses
  646. Thread specialization (GPU)
  647. GPU thread pools
  648. Producer-consumer thread pools
  649. GPU kernel optimizations
  650. Striding (GPU kernels)
  651. Overlapping GPU uploads and compute
  652. Overlapping with recomputation/rematerialization
  653. Offloading to CPU
  654. Pinned memory blocks
  655. Warp divergence (warp coherence)
  656. Grid optimizations
  657. Grid size optimizations

    Core Utility Classes (Efficiency Helpers): (to build for overall efficiency practices)
  658. Bitwise macro library (bitflag management)
  659. Floating-point fast bitwise operations macro library
  660. Benchmarking/timing library
  661. Smart buffer library (reduce allocations by combining allocated/non-allocated memory management)
  662. TCP/UDP wrapper library
  663. Specialized data structures for small amounts of data (faster than STL)
  664. Sorted array and binary search (small array size)
  665. Lock-free queues
  666. Perfect hashing library
  667. Bit vector data structures (possibly based on STL)
  668. Bit set data structures (possibly based on STL)
  669. Bloom filter library
  670. Vector hashing library
  671. Caching utilities library
  672. Source code precomputation library
  673. Basic data and statistics on vectors (e.g., averages, std dev/variance, etc.)
  674. Incremental vector algorithms (averages, min, max, etc.)
  675. Branchless coding primitives library
  676. Graph library for locking analysis
  677. Data compression library
  678. Approximate tests library
  679. Math library (versus STL)
  680. Memory pools library (fixed-size custom memory allocators)
  681. Custom memory allocator library
  682. Placement new operator versions
  683. Placement delete operator (write your own)
  684. Multi-dimensional array library (linearize your vectors/matrices/tables/tensors)


    AI Kernel Optimizations: (applying LLM inference optimizations to non-AI low-latency applications; a subset of methods to consider)
    Reference: 500+ LLM Inference Optimization Techniques (blog article)

  685. Kernel fusion
  686. Kernel fission
  687. Kernel tiling/blocking
  688. Quantization (integer-based approximation of floating-point)
  689. Low-bit quantization
  690. Binary quantization (1-bit)
  691. Integer-only arithmetic
  692. Floating-point quantization (FP16/FP8/FP4)
  693. Mixed precision quantization
  694. Logarithmic quantization
  695. Dyadic quantization
  696. Low rank matrices
  697. MatMul/GEMM optimizations (many)
  698. MatMul data locality optimizations
  699. Sparse MatMul
  700. Approximate matrix multiplication
  701. Contiguous memory block matrix multiplication
  702. Cached transpose MatMul
  703. Fused transpose MatMul
  704. Tiled/blocked MatMul
  705. Sparsification (Pruning/Sparsity)
  706. Token pruning (input compression)
  707. Token skipping
  708. Token merging
  709. Data compression algorithms
  710. Early exiting (of layers)
  711. Caching optimizations
  712. Vector computation caching
  713. Zero skipping
  714. Negative skipping
  715. Padding optimizations
  716. Zero padding removal
  717. Zero-multiplication arithmetic
  718. Adder/addition (zero-multiply)
  719. Bitshifts (zero-multiply)
  720. Bitshift-add (zero-multiply)
  721. Double bitshift-add (zero-multiply)
  722. Add-as-integer (zero-multiply)
  723. Logarithmic arithmetic (zero-multiply)
  724. Hadamard element-wise matrix multiplication
  725. End-to-end integer arithmetic
  726. Table lookup matrix multiplication
  727. Weight clustering (grouped quantization)
  728. Vector quantization
  729. Parameter sharing
  730. Activation function optimizations (non-linear functions)
  731. Precomputation of Activation functions
  732. Approximation of Activation functions
  733. Integer-only approximation of Activation functions
  734. Fused activation functions
  735. Normalization optimizations (non-linear vector data functions)
  736. Fused normalization optimizations
  737. FFN optimizations (double MatMul)
  738. FFN approximations
  739. FFN integer-only
  740. Decoding algorithm optimizations
  741. Speculative decoding
  742. Multi-token decoding
  743. Ensemble decoding
  744. Consensus/majority-vote decoding
  745. Easy-hard queries
  746. Batching computations
  747. Advanced number systems
  748. Posit numbers
  749. Dyadic numbers
  750. Hybrid number systems
  751. Fixed point numbers (integers not floating-point)
  752. Block floating-point (BFP) hybrids
  753. Logarithmic number system (LNS)
  754. Disaggregation (prefill/decoding)
  755. Computation re-use
  756. Conditional computation
  757. Approximate caching
  758. Addition arithmetic optimizations
  759. Approximate addition
  760. Bitwise arithmetic optimizations
  761. Fast multiplication arithmetic
  762. Approximate multiplication
  763. Logarithmic approximate multiplication
  764. Approximate division
  765. Bitserial arithmetic

Free AI and C++ Books

Generative AI programming books:

  1. The Sweetest Lesson: Your Brain Versus AI, November 2025: full text online, free PDF available
  2. RAG Optimization: Accurate and Efficient LLM Applications, June 2025: full text online, free PDF available
  3. Generative AI Applications: Planning, Design and Implementation, November 2024: full text online, free PDF available
  4. Generative AI in C++ (Spuler, March 2024): full text online, free PDF available, table of contents, bonus materials, reference lists, source code

CUDA C++ GPU Programming Books:

  1. CUDA C++ Optimization: Coding Faster GPU Kernels, July 2024: full text online, bonus materials, free PDF available
  2. CUDA C++ Debugging: Safer GPU Kernel Programming, July 2024: full text online, free PDF available

Modern C++ Programming Books

  1. C++ AVX Optimization: CPU SIMD Vectorization, 2025: full text online, free PDF available
  2. C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations, 2025: full text online, free PDF available
  3. Advanced C++ Memory Techniques: Efficiency and Safety, 2025: full text online, free PDF available
  4. Efficient C++ Multithreading: Modern Concurrency Optimization, 2025: free PDF available
  5. Efficient Modern C++ Data Structures: Container and Algorithm Optimizations, 2025: free PDF available
  6. C++ Low Latency: Multithreading and Hotpath Optimizations, 2025: free PDF available
  7. Safe C++: Fixing Memory Safety Issues, Oct 2024: full text online, free PDF available

Aussie AI Advanced C++ Coding Books



C++ AVX Optimization: CPU SIMD Vectorization:
  • Introduction to AVX SIMD intrinsics
  • Vectorization and horizontal reductions
  • Low latency tricks and branchless programming
  • Instruction-level parallelism and out-of-order execution
  • Loop unrolling & double loop unrolling

Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization



C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
  • Low-level C++ efficiency techniques
  • C++ multithreading optimizations
  • AI LLM inference backend speedups
  • Low latency data structures
  • Multithreading optimizations
  • General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency



Advanced C++ Memory Techniques: Efficiency & Safety:
  • Memory optimization techniques
  • Memory-efficient data structures
  • DIY memory safety techniques
  • Intercepting memory primitives
  • Preventive memory safety
  • Memory reduction optimizations

Get your copy from Amazon: Advanced C++ Memory Techniques



Safe C++: Fixing Memory Safety Issues:
  • The memory safety debate
  • Memory and non-memory safety
  • Pragmatic approach to safe C++
  • Rust versus C++
  • DIY memory safety methods
  • Safe standard C++ library

Get it from Amazon: Safe C++: Fixing Memory Safety Issues



Efficient C++ Multithreading: Modern Concurrency Optimization:
  • Multithreading optimization techniques
  • Reduce synchronization overhead
  • Standard container multithreading
  • Multithreaded data structures
  • Memory access optimizations
  • Sequential code optimizations

Get your copy from Amazon: Efficient C++ Multithreading



Efficient Modern C++ Data Structures:
  • Data structures overview
  • Modern C++ container efficiency
  • Time & space optimizations
  • Contiguous data structures
  • Multidimensional data structures

Get your copy from Amazon: Efficient C++ Data Structures



Low Latency C++: Multithreading and Hotpath Optimizations: advanced coding book:
  • Low Latency for AI and other applications
  • C++ multithreading optimizations
  • Efficient C++ coding
  • Time and space efficiency
  • C++ slug catalog

Get your copy from Amazon: Low Latency C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging