Aussie AI Blog
List of 600+ Low-Latency C++ Techniques
-
September 22, 2025
-
by David Spuler, Ph.D.
List of C++ Low-Latency and Efficiency Techniques
This is a compilation of coding efficiency and low latency C++ programming techniques from various books and articles:
- C++ Ultra Low Latency, David Spuler, July 2025.
- C++ Low Latency, David Spuler, March 2025.
- CUDA C++ Optimization, David Spuler, June 2024.
- Generative AI in C++, David Spuler, March 2024.
- 500+ LLM Inference Optimization Techniques (blog article)
Here’s the long list:
-
Low Latency C++ General Software Approaches:
- Cache warming
- Core pinning (“affinity”)
- False sharing (avoiding)
- Branch prediction optimizations
- Hotpath optimizations
- Slowpath removal
- Kernel bypass
- Lock contention (reducing)
- Lock-free programming (with atomics and memory ordering issues)
- Thread pools
- SIMD CPU instructions
- Inline assembly language (“asm” statements)
- Intrinsic functions (often closely mapping to machine code instructions)
- In-memory logging
- Cache locality (for L1/L2/L3 memory caches and instruction caches)
- Specialized data structures
- Thread-Local Storage (TLS) (the thread_local keyword in C++11)
- Shared memory (e.g., the shmctl "shared memory control" function, shmget, shm_open, ftruncate)
- Memory-mapped files/devices (e.g., mmap, munmap)
- Asynchronous programming (std::async)
Concurrency-Friendly Data Structures:
- Read-only data structures
- Reader-friendly data structures (e.g., many readers, one writer)
- Copy-on-write data structures (for readers)
- Versioned data structures (for readers)
- Partition data across threads (vertically: columns)
- Shard data across threads (horizontally: rows)
- Read-Copy-Update (RCU) (mostly the same as copy-on-write)
- NUMA-aware data structures (reduce cross-node communications)
- Transactional memory (synchronization efficiency, reduces contention): use atomic/isolated transactions (an emerging technology)
Hotpath Optimizations:
- Optimize all steps in the hotpath (e.g., data ingestion, decision, trade execution, logging, risk management)
- Profile the hotpath specifically (e.g., a test mode that always runs the hotpath)
- Examine assembly code of the hotpath
- Avoid memory allocation calls on hotpath (e.g., memory pools, preallocation)
- Avoid free/deallocation of memory on hotpath
- Use preallocated memory on hotpath
- Review data de-serialization and serialization costs
- Use in-memory databases for any significant amounts of incoming data
- Keep the client network connection warm (method depends on the API)
- Re-use objects to avoid constructor/destructor calls on hotpath
General Tuning Advice:
- Avoid micro-optimization
- Avoid optimizing error handling code (it’s a slowpath)
- Loop optimizations (see below)
- Avoid nested loops
- Tune inner loop for nested loops
- Avoid excessive function wrapper overhead
Performance Profiling Tools:
- gprof
- perf
- prof (older)
- pixie (older)
Lock Contention Reduction:
- Late lock acquisition
- Early lock release
- Short critical section of code
- Generally reduce total numbers of locks used
- Locking fine-grain vs coarse-grain
- Use fine-grain locks for contested resources
- Use a hybrid fine-grain/coarse-grain lock strategy
- Release locks before significant computation
- Copy data to temporary variables to unlock before computation
- Release locks before blocking for I/O
- Release locks before blocking for system calls
- Release locks before blocking for networking
- Tolerate lockless output overlaps
- std::shared_mutex and std::shared_lock (multiple readers, one writer)
- Double-checked locking method (check first without a lock)
- Use message-passing via std::promise and std::future rather than shared memory
- Thread-specific queues and the "work stealing" design pattern
- Use a lock-free queue data structure
- thread_local keyword (C++11)
- std::lock_guard (C++11)
- std::lock_guard early release by scope control
- std::unique_lock (C++11) (more granular control than std::lock_guard)
- std::scoped_lock (C++17)
- Locking with timeouts (try locks)
- Avoid spinlock busy waiting
- Exponential backoff to avoid spinlock costs
See also “lock-free programming”
See also “concurrency-friendly data structures”
Thread/lock overhead reduction (generally):
- Reduce thread launch overhead
- Reduce thread destruction overhead
- Reduce lock acquisition/release overhead
- Reduce lock contention overhead
- std::make_shared() or std::allocate_shared() do only one allocation (the object and control block are combined), whereas constructing a shared_ptr<type> from a raw new pointer does two allocations (the object and the control block are separate).
- Weak pointers (std::weak_ptr) can delay the deallocation of a shared_ptr's control block (and, with make_shared, the object's storage) even after the strong reference count reaches zero.
System code optimizations (general ideas):
- Avoid system calls to reduce context switches (in Linux)
- Use C++ “intrinsics” functions (highly optimized assembly-level code)
Linux socket programming:
- Non-blocking sockets versus using select() with a timeout (allows the thread to do "other" useful work rather than just wait)
- Use the poll() or epoll system calls rather than blocking waits
Context Switching Reduction:
- Thread counts (not too many threads)
- Thread specialization
- Thread specialization (producer-consumer thread model)
- Use custom thread pools with only preallocated memory block pools.
- Spinlocks avoid context switches (especially good if the spin lasts only a short time)
- Avoid context switch cost by having a thread do “other” work, rather than just blocking.
Cache Locality Optimizations:
- Tiling/blocking algorithms
- Tiling/blocking matrix multiplication (MatMul/GEMM)
- Smaller data type sizes for increased locality
- Choose a CPU with a larger L1 “cache line size” (64-256 bytes common)
- std::hardware_destructive_interference_size and std::hardware_constructive_interference_size (C++17)
- std::initializer_list (C++11) can be used as a lightweight container with contiguous elements
See also “cache warming (prefetch)” optimizations
See also “false sharing (avoid)” optimizations
Instruction Cache Locality Optimizations:
- Prefer shorter blocks of code in the hotpath
- Consider not inlining function calls (for instruction cache locality)
See also “branch prediction optimizations”
Branch Prediction Optimizations (General):
- Branch elimination
- Branch compiler hints
- Branch prediction heuristics
- Branch profiling (two-phase)
- Branchless programming
- Tools to measure branch prediction data (e.g., perf)
Branch Reduction Techniques:
- Algorithm-level changes to reduce branches
- Keep loop bodies short (shorter branches)
- Reduce far branching (e.g., function calls)
- Reduce overall use of function calls (see function call optimizations)
- Reduce use of if statements
- Reduce use of loops
- Reduce use of break statements (in loops, not switch!)
- Reduce use of continue statements
- Reduce use of switch statements
- Reduce short-circuiting in &&/|| operators
- Reduce short-circuiting of the ?: ternary operator
- Avoid virtual function calls (hidden dynamic branches)
- Avoid pointers-to-functions (hidden dynamic branches; blocks inlining)
- Avoid function objects/functors (hidden dynamic branches)
- Avoid lambda functions passed as arguments (depends on how well the optimizer can handle them)
- Reduce long if-else-if sequences
- Reduce nested if-else sequences
- Avoid branches depending on anything unpredictable
- Avoid branches depending on user inputs
- Avoid branches depending on random numbers
- Avoid branches depending on system clocks
- Sort array data for efficient branch prediction, if scanning through the array comparing the data (e.g., before testing for error range)
See also “compile-time optimizations” (remove branches at compile-time)
See also “loop optimizations” (reduce loop iterations, e.g., loop unrolling)
Branch Prediction Heuristics:
- Common case code in the if block
- Uncommon case code in the else block
- Error handling code in the else block (uncommon code)
- Avoid zero-iteration loops (never entered)
- Avoid single-iteration loops (never loop back)
Branch Prediction Compiler Hints:
- [[likely]] and [[unlikely]] path attributes (C++20)
- likely()/unlikely() macro-style wrappers over the C++20 attributes
- __builtin_expect (GCC)
- Define LIKELY and UNLIKELY macros with __builtin_expect (pre-C++20)
- [[noreturn]] (C++11)
- [[assume(expression)]] attribute (C++23)
- hot (GCC function attribute)
- GCC __builtin_unreachable
- std::unreachable (C++23) (helps branch prediction)
- [[fallthrough]] (C++17) (more for safety than speed)
- -fdelayed-branch compiler flag
- -fguess-branch-probability compiler flag
- -fif-conversion and -fif-conversion2 compiler flags
- Use "likely" and "unlikely" hints in custom assertion macros
- Use "likely" and "unlikely" hints in error handling code macros
Branch Profiling:
- -fprofile-arcs (GCC option)
- -fprofile-generate (GCC command-line argument)
- -fprofile-use (GCC command-line argument)
- Branch profiling with 100% hotpath (test modes)
Branchless Programming Techniques:
- Ternary operator preferred over if statements (if it compiles to a CMOV instruction)
- Boolean variables as 0 or 1 in arithmetic
- Logical operators (&&/||) as 0 or 1 in arithmetic
- Bitwise operators (&/|) replace logical operators (&&/||)
- Sign bit extension bit masks
- Lookup tables for branchless programming
- XOR trick to swap two integer variables without a temporary variable
Slowpath Removal:
- Optimize error checking pathways
- Remove error checking tests
- Defer error checking tests to later
- Combine error checking tests together (and do it later)
- Avoid adding error checks deeper in the call hierarchy
- Never-failing functions (cannot return an error)
- Don’t use memory allocation (avoids memory allocation failure)
Cache Warming Methods:
- Prefetch memory primitives
- __builtin_prefetch (GCC)
- _mm_prefetch (x86 intrinsic)
- volatile on temporary variables
- Dry-run execution mode
- Branchless dry-run execution with arr[2] declarations
- Use read-only cache warming pathways (avoids cache invalidation for other threads)
- Use deep cache warming all the way down into the NIC
- Optimize cache warming code by reducing data reads (relies on cache line sizes)
- Limit cache warming code to the maximum size of the memory cache (avoids redundant cache warming when the cache is already full)
False Sharing (Avoiding):
- Use alignas(64) (or 128 or 256) to avoid false sharing (C++11)
- Use alignas on all shared memory or atomics (C++11)
- Tools to automatically detect false sharing (e.g., perf c2c; Valgrind's DRD may not detect it)
Parallelism (General Categories):
- Multithreading
- Multiprocess
- Vectorization
- Pipelining
- Parallel execution modes (C++17)
- Coroutines (C++20)
Advanced C++ Concurrency Data Structures:
- Read-only ("immutable") data structures
- Lock-free algorithms and data structures
- Linear search can be efficient for small sizes because of cache prefetching (e.g., rather than binary search; it also doesn't need the data kept sorted)
SIMD Instructions:
- AVX (x86 CPUs)
- ARM Neon
- std::simd (experimental/C++26)
- The <immintrin.h> intrinsics header
Linux O/S Optimizations:
- Process priority upgrades (the nice command or system call)
- Disable unimportant processes
- Overclock the CPU
- Overclock the GPU
- Disable Security-Enhanced (SE) Linux
- Disable accounting mode in Linux (should be off anyway)
Linux Kernel Optimizations:
- Scheduling algorithm kernel modifications
- Tweak TCP/UDP network buffer settings (Linux kernel)
- Turn off file "last access date" storage (noatime in /etc/fstab)
System Hardware Optimizations (Categories):
- Processor hardware (CPU)
- Network optimizations
- Disk optimizations
- RAM Memory optimizations
Processor Hardware Optimizations (Major Categories):
- CPU
- GPU
- NPU
- FPGA
- ASIC
Networking Hardware Optimizations (Categories):
- NIC
- Switches
- Load balancer devices
- Size of the packet buffer of a switch (optimizing for)
Networking Transmission/Protocol Optimizations (Categories):
- Physical proximity
- Co-Lo
- TCP
- UDP (faster than TCP but unreliable)
- Optical networking (optical fiber cables)
- Microwave network transmission
- Packet fragment manipulations (e.g., out-of-order)
- Reduce packet fragment collation overhead
- Reduce packet consistency checking (error safety overhead)
Networking Software Optimizations:
- TcpDirect/Onload
- SolarFlare/OpenOnload (kernel bypass)
- Exablaze (NIC with kernel bypass support)
- DMA
- PCIe bus
- Compress data sizes for your network transmissions
- Sticky sessions (avoids needing to send user-specific caches between servers)
- Shared storage rather than other server-to-server networking (e.g., NAS/SAN)
- Use custom wrappers for TCP and UDP network processing
GPU & Distributed Networking Optimizations:
- RDMA
- NVLink
- Infiniband
- RoCE
- GPUDirect
- PXN
Deployment Optimizations (Website backends):
- DNS optimizations
- Round-Robin DNS (RRDNS)
- SSL time optimizations
- etags (website server speedup)
- Multiple identical servers architecture
- Use subdomains for static files
- CDN for static files
- Compression modes enabled
- Static files compressed
- Minify static files (CSS, JavaScript)
- Merge multiple small files together
- Use smaller image files (low precision)
- Merge multiple small icon images into one image file
- Cache duration settings
- Database optimizations (various, e.g., MySQL/MariaDB/MongoDB)
- Database indexes
- Application server optimizations (e.g., Tomcat)
Apache/Nginx Subprocess Optimizations:
- Use FCGI rather than classic CGI integrations
- Flush stdout of subprocesses (sends partial output to Apache or Nginx earlier)
- Close stdout of subprocesses before the shutdown sequence (lets Apache or Nginx finish earlier)
- Early tests for violations and invalidity (fail quickly)
Algorithm Enhancements:
- Precomputation (lookup tables)
- Precomputation to data file
- Precomputation of source code
- Incremental algorithms
- Data structure augmentation
- Parallelization
- Vectorization
- Caching
- Lazy evaluation
- Common case first
- Simple case first
- Approximate tests first
- Bounding box approximate tests
- Bounding sphere approximate tests
- Avoid sqrt by using arithmetic on squares
- Integer arithmetic on squares: avoid floating-point by using arithmetic on squares
- Use variance, not standard deviation (arithmetic on squares)
- Approximations
- Compute budget algorithms
- Probabilistic/stochastic algorithms
- Skipping algorithms
- Heuristic algorithms
- Greedy algorithms
Memory Reduction Strategies:
- Take care with memory reduction, as some methods can reduce speed (trade-offs)
- Reduce allocated memory
- Smaller data sizes
- Pack data into smaller integer sizes
- Pack data into bits
- Pack data using bit-fields
- Pack data into unions
- Use std::bitset
- Use std::vector<bool> (it is a special bit-packed template specialization)
- Structure packing (also for class data members): reorder different-sized data members for better packing and fewer padding bytes
- Structure packing: biggest data types first (heuristic)
- Structure packing: the MSVS /d1reportSingleClassLayout compiler option reports on it
- #pragma pack reduces padding to reduce size, but may worsen structure access costs
- Stack data reductions
- Avoid deallocation of heap memory when in shutting-down mode
Heap Allocated Memory Reduction Strategies:
- Fewer allocated memory blocks
- Avoid frequent small allocations
- Preallocation of dynamic memory
- Memory fragmentation avoidance
- Memory leak avoidance
- Merge memory allocations together
- Memory pools (fixed-size allocations, often a type of preallocation)
- Memory pool with O(1) deletion and O(1) insertion via permutation array
- Merge fixed-size allocated objects into a large array
- Custom memory allocators (generalized)
- Class-specific memory allocator
- Custom global memory allocator
- Late allocation (allocate memory as late as possible)
- Early free of memory (deallocate as early as possible)
- Early delete of memory (deallocate early)
- Avoid realloc (slow; causes memory fragmentation)
- Smart dynamic buffers (hybrid of allocated and non-allocated memory)
- std::aligned_alloc for memory alignment (C++17)
- std::aligned_union (C++11)
Static Memory Size Reductions:
- Avoid large global arrays and buffers
- Avoid large static arrays and buffers
- Avoid large static C++ data members
- String literal memory reductions
Stack Memory Size Reductions:
- Avoid large local arrays and buffers
- Avoid large function non-reference parameter arrays and buffers
- Use pass-by-reference on large function parameters
- Use integer parameters as local variables
- Consider stack versus memory allocation
- Flattening/reducing function call hierarchy
- Inline small functions (compiler can disappear them)
- Use #define macros for small functions (versus inlining)
See also: function call hierarchy flattening
See also: recursion avoidance
Code Size Reduction Strategies:
- Code size reductions
- DLLs versus static libraries
- Remove executable debug information
- Avoid the compiler "-g" debug option
- Avoid the compiler "-p" profiler option
- The Unix strip command
- Avoid large inline functions (instruction cache locality)
- Don't overuse "always inline" or "force inline"
- Avoid template overuse
- Google's "bloaty" tool
Standard Library Optimizations (STL Optimizations):
- String processing efficiency (e.g., "+" for std::string can be slow)
- std::vector of non-trivial class objects calls constructors/destructors
- Control array size for std::vector using reserve()
- Use std::sort rather than qsort
- bsearch is not your friend
- Consider hard-coded arrays versus std::array versus std::vector
- Compare the first letters of strings before calling strcmp
- Consider type casts to int versus round(), ceil(), floor()
- Avoid printf/fprintf format string processing with putchar/putc/fputc or puts/fputs
- Hand-code versions of abs and fabs/fabsf that don't handle Inf/NaN numbers (but benchmark it)
- Change strlen("literal") to char arr[]="literal" and use sizeof(arr)-1
- Don't use strlen(s) in a for loop condition
- Consider your own atoi/itoa versions that don't handle all the obscure cases
- Avoid sprintf and snprintf (both are slow)
- sync_with_stdio(false)
- std::stringstream is slow (hand-code text field processing instead)
Data Structures:
- Hashing (basic)
- Perfect hashing
- Bit vectors
- Bit sets
- Bloom filters (bit vectors + hashing)
- Binary tree
- Sorted arrays
- Unsorted arrays
- Stacks
- Queues
- Deques
- Vector hashing
- Permutation arrays
- Locality-sensitive hashing (LSH)
- Bit signatures (vector algorithm)
- K-means clustering (vector algorithm)
- Hyper-cube (vector algorithm)
- Approximate nearest neighbor (ANN) (vector algorithm)
Variable Optimizations:
- Prefer int types to char or short (usually)
- Prefer int types to unsigned int (usually)
- Prefer int types to size_t (usually it's an unsigned long; consider uint32_t)
- Avoid unnecessary initializations
- Re-use objects to avoid initialization/destruction
- Avoid temporary variables
- Use reference variables instead of full temporary variables
- Avoid creating temporary objects
- Put commonly used data fields first in a struct/class
- Declare variables as close as possible to usage
- if initializer syntax (C++17)
- switch initializer syntax (C++17)
- Avoid bit-fields (smaller but slower to access or set)
- Use memory alignment primitives to avoid slow-downs
- Put the most-used data member first (it has a zero offset)
- Order data members most used to least used (smaller offsets are faster, in theory)
- Array initializer lists as local variables (re-initialized each call)
- Structure-of-Arrays (SoA) data layout is more vectorizable than Array-of-Structures (AoS)
Arithmetic Optimizations:
- Operator strength reduction
- Reciprocal multiplication
- Integer arithmetic
- Use float not double
Expression Optimizations:
- Expression transformations
- const
- mutable keyword bypasses const (C++98) (speedy but unsafe)
- Common subexpression elimination (CSE)
- Constant folding
- Template fold expressions (C++17) are concise but often a lot of computation
- Expression templates (avoid explicit temporary variables; the compiler optimizes them better)
- Constant propagation
- Redundant assignment removal
- Strength reduction
- Algebraic identities
- Implicit type conversions (avoiding; type consistency)
- explicit keyword to prevent implicit type conversions (C++98)
- Brace initialization syntax {} (avoids implicit narrowing conversions)
- auto variable declarations avoid accidental temporaries and implicit type conversions
- Don't mix float/double types (including their constants)
- Don't mix integer types
- Prefer signed integers over unsigned types
- Short-circuiting of sub-expressions (using &&/||/?:)
- Register allocation optimizations
- The mprotect page system call, used as an optimization to make memory writable
- <algorithm> simple algorithms: min, max, etc.
- Range check faster with casts via "(unsigned)i < MAX" not "i >= 0 && i < MAX"
Memory Block Operations:
- Prefer contiguous memory blocks (locality, efficient block operations, etc.)
- Some class types allow block copying: POD (Plain Old Data), trivial types, standard-layout types (e.g., check in a template using std::is_trivial)
- Copy arrays by wrapping them in a dummy struct
- Copy arrays with memcpy
- Compare arrays with memcmp (very dangerous: padding bytes, negative zero, NaNs)
- Use memcpy not memmove if arguments won't overlap
- Linearize multi-dimensional arrays (contiguous memory blocks)
Operator Strength Reduction Optimizations:
- Replace * with bitshifts
- Replace * with addition
- Replace x*2 with x+x
- Replace % with bitwise-and (&)
- Replace % with increment and test
- Replace % with type casts (if byte sizes)
Bitwise Optimizations:
- Intrinsic bitwise functions
- CLZ (count leading zeros) bitwise intrinsics
- CTZ (count trailing zeros) bitwise intrinsics
- Popcount bitwise intrinsics (set bit count)
- Kernighan bit trick (count set bits by repeatedly clearing the lowest set bit)
- Fast NOR/NAND/XNOR via assembly instructions
- Fast LOG2 of integers
- Fast largest power-of-two of integers
Floating-Point Optimizations:
- Convert float to 32-bit integers (float bit manipulations)
- FTZ (Flush To Zero) mode
- DAZ (Denormals Are Zero) mode
- LOG2 of floating-point is the exponent
- Zero/negative-zero bitwise tests
- Disallow negative zero (to use faster zero comparisons)
- NaN (Not-a-Number) bitwise tests
- Inf/-Inf bitwise tests
- Avoid denormalized numbers
- Disable denormalized numbers (subnormals) (compiler/library modes)
- Avoid underflow in floating-point (ignore it)
- Avoid overflow in floating-point (ignore it)
- memcmp float vector equality (disallow special values for fast float vector equality comparison)
- Fast detection of special values in float vectors (bitwise operations)
- Floating-point intrinsic functions (various)
- Exponent addition: bitshift floating-point numbers by adding to the exponent bits
- Sign bit flipping/extraction/setting (bitwise tricks)
Compiler Settings for Floating-Point:
- GCC -ffast-math option (faster math mode)
- GCC -fno-math-errno (faster math in multithreading by not setting errno)
- GCC -ffinite-math-only
- GCC -fno-trapping-math
- MSVS /fp:precise, /fp:strict, /fp:fast
- Disable floating-point exceptions
Loop Optimizations:
- Exit loops early (e.g., break or return statements)
- Finish the loop body early (i.e., a continue statement)
- Correct choice of loop
- Loop unrolling
- #pragma unroll
- Loop fusion
- Loop perforation (probabilistic)
- Loop tiling/blocking
- Loop fission
- Loop reversal (don’t use!)
- Loop code motion (“hoisting”)
- Loop distribution
- Loop iterator strength reduction
- Loop coalescing
- Loop collapsing
- Loop peeling
- Loop splitting
- Loop interchange
- Loop sentinel
- Loop strip mining (loop sectioning)
- Loop spreading
- Loop normalization
- Loop skewing
- Loop interleaving
If Statement Optimizations:
- Replace if-else-if sequences with switch
- Replace if-else-if sequences with a lookup table
Switch Statement Optimizations:
- Use compact numeric ranges in switch (the compiler can use a LUT)
Compile-Time Optimizations:
- inline functions
- always_inline specifier
- GCC flatten attribute
- gnu_inline GCC specifier
- Keep inline functions short (helps the compiler to inline)
- Keep inline functions in header files (source available to all their call sites)
- Avoid making a virtual function inline (it compiles but is usually a slug)
- sizeof
- Use sizeof with static_assert (e.g., portability checks)
- Virtual functions cannot be inlined (although it compiles)
- Pointer-to-function usages of functions cannot be inlined
- Function objects (functors) cannot always be inlined
- Lambda functions cannot always be inlined
- inline variables (C++17) (helps with linking)
- static_assert (compile-time assertions)
- const is good
- constexpr (C++11) is great
- constexpr functions allow if, switch, loops, etc. (C++14)
- constexpr lambda functions (C++17)
- constexpr and placement new (C++26)
- References to constexpr variables (C++26)
- if constexpr statements
- constinit
- consteval
- if consteval (C++23)
- Type traits <type_traits> (C++11)
- typeid is slow (RTTI)
- std::is_same_v (type trait test)
- Template specialization (for specific types)
- Template specialization (for constant integers)
- Variadic templates (C++11)
- Template Meta-Programming (TMP) still works, but prefer constexpr
- Auto-vectorization (by compiler)
- Auto-unrolling of loops (by compiler)
- SFINAE tricks (mostly an issue for compiler engineers)
Pointer Aliasing:
- Reorganize functions with awareness of pointer aliasing issues
- Restricted pointers (to avoid pointer aliasing slowdowns)
- -fstrict-aliasing compiler option (alternative to using restrict)
Pointer Arithmetic:
- Loop pointer arithmetic
- End-pointer address tricks (loop pointer arithmetic)
- Use references not pointers (avoids null testing)
- Prefer postfix operations with the *ptr++ idiom (not prefix ++ptr)
- Pointer comparison tricks
- Pointer difference tricks
- Avoid safe pointer class wrappers (prefer raw pointers for speed)
Pointer Optimizations (Other):
- reinterpret_cast helps the optimizer and is effectively a free compile-time hint
- Avoid dynamic_cast (to downcast from a base to a derived class, which can be helpful for specializing member calls, but dynamic casts can be expensive at runtime because of RTTI)
Function Optimizations:
- Return early from functions
- Flatten function call hierarchies
- Callbacks are an extra layer of function call
- Lambda functions are convenient but are an extra function call layer (though often inlined)
- Function objects (functors) are an extra function call
- Avoid recursion (completely; we’re not in High School anymore)
- Replace simple recursion with a loop
- Replace complex recursion with a stack
- Tail recursion elimination
- Recursion higher base level
- Collapse recursion levels
- Specialize functions with default arguments (use two versions)
- Specialize functions with void and non-void versions (if the return value is often ignored)
- Avoid function pointers (cannot be inline or constexpr)
- Merge multiple Boolean function parameters into a "config" object with Boolean data fields
- noexcept attributes allow the compiler to avoid adding extra code (C++11)
- std::initializer_list can be used to return multiple values (benchmark against other methods)
C++ Class Optimizations:
- friend functions (bypass interfaces)
- friend classes (bypass interfaces)
- Return references rather than objects
- Avoid temporary class objects in expressions
- Add extra member functions to avoid temporary object creation
- Pass objects by reference to functions (i.e., "&" or "const &")
- Disable copy constructors with private or = delete
- Disable assignment operators with private or = delete
- Declare assignment operators with a void return type (except when defaulting)
- Re-use objects to avoid constructor and destructor calls
- Avoid calling the destructor when in shutting-down mode
- Uninitialized memory algorithms, e.g., std::uninitialized_fill (C++17)
- CRTP (Curiously Recurring Template Pattern): the derived class derives from a base class that is a template parameterized on the derived class (this makes the polymorphism compile-time, avoiding virtual function calls, and allows more inlining of those calls)
- Move constructors
- Move assignment operators
- std::move (C++11/C++14) is usually a compile-time cast
- Return object reference types (not complicated objects)
- Avoid virtual function calls with explicit calls to the specific function
- Specialize inherited member functions (for the more restrictive type)
- Avoid overloading the postfix increment/decrement operators
- Block the overloaded postfix increment/decrement operators (void body or = delete)
- Consider skipping destructor cleanup if the program is shutting down
- Avoid accidental double initialization of data members in constructors
- Avoid redundant initialization of the same members in both the constructor and "setup" methods
- Specialize member functions with default arguments (use two versions instead)
- Default constructors/destructors with =default may be more efficient than hand-coded versions
- Trick for the singleton pattern in multithreading: one thread initializes a function-local static variable while other threads block; once-only initialization is guaranteed by the C++ compiler
Advanced C++ Compiler Optimizations:
- Copy elision (a compiler auto-optimization that avoids a copy constructor call in certain cases)
- Guaranteed copy elision (C++17)
- Named return value elision (a type of copy elision)
- Temporary return value elision (a type of copy elision)
- Copy elision in exception handling (special case of copy elision)
- Allocation elision (new operator) (C++14)
- Use xvalue or "expiring value" optimizations (various)
- Trick: to disallow creating an object on the stack, make its destructor private
- Trick: to disallow creating an object on the heap, make its new and new[] operators private
Byte Block Operations in C++ Classes (use with extreme care!):
- memset/bzero to zero in a constructor: fast but dangerous; overwrites the internal "vtable" data in the object if the class has any virtual functions, does not call constructors of its data members or base class members, and cannot be combined with an initializer list, since it overwrites with zeros any members the initializer list has set.
- memcpy to bitwise-copy in a copy constructor or assignment operator: fast but dangerous; improperly copies internal vtable data in the object if the class has any virtual functions, and neither deep-copies its members or base class members nor calls their constructors.
- memcpy to bitwise-copy in a move constructor or move assignment operator: fast but dangerous; improperly copies the "vtable".
- memcmp to bitwise-compare for equality/inequality tests: fast but fails in many situations due to pitfalls: padding bytes, bit-field members, negative versus positive zero floating-point values, NaN floating-point values.
- Virtual inheritance: usually for pure virtual base classes; avoids duplicate sub-objects when the same base class is inherited via two different paths.
Timing C++ Methods:
- std::chrono C++ classes (highly granular)
- clock() C/C++ function
- time command (Linux shell)
- time() function (granularity is only in seconds)
- gettimeofday() function
Benchmarking C++ Methods:
- Loop unrolling for accurate benchmarking
- Use the volatile specifier for accurate benchmarking
- Loop overhead measurement for accurate benchmarking
- Google Benchmark: Apache 2 license; code: https://github.com/google/benchmark
Compiler Settings:
- Optimizer settings
- Optimizing for space/memory size (compiler flags)
General Build & Software Development Practices for Efficiency:
- Maintain separate builds for slow testables versus production executables
- Compile-out assertions
- Compile-out self-testing code
- Compile-out debug code or tracing code
- Ensure test code not accidentally left in production (test a global flag based on these macros at startup)
CUDA C++ GPU Optimizations:
- Coalesced memory accesses
- Thread specialization (GPU)
- GPU thread pools
- Producer-consumer thread pools
- GPU kernel optimizations
- Striding (GPU kernels)
- Overlapping GPU uploads and compute
- Overlapping with recomputation/rematerialization
- Offloading to CPU
- Pinned memory blocks
- Warp divergence (warp coherence)
- Grid optimizations
- Grid size optimizations
Core Utility Classes (Efficiency Helpers): (to build for overall efficiency practices)
- Bitwise macro library (bitflag management)
- Floating-point fast bitwise operations macro library
- Benchmarking/timing library
- Smart buffer library (reduce allocations by combining allocated/non-allocated memory management)
- TCP/UDP wrapper library
- Specialized data structures for small amounts of data (faster than STL)
- Sorted array and binary search (small array size)
- Lock-free queues
- Perfect hashing library
- Bit vector data structures (possibly based on STL)
- Bit set data structures (possibly based on STL)
- Bloom filter library
- Vector hashing library
- Caching utilities library
- Source code precomputation library
- Basic data and statistics on vectors (e.g., averages, std dev/variance, etc.)
- Incremental vector algorithms (averages, min, max, etc.)
- Branchless coding primitives library
- Graph library for locking analysis
- Data compression library
- Approximate tests library
- Math library (versus STL)
- Memory pools library (fixed-size custom memory allocators)
- Custom memory allocator library
- Placement new operator versions
- Placement delete operator (write your own)
- Multi-dimensional array library (linearize your vectors/matrices/tables/tensors)
AI Kernel Optimizations (using LLM Inference Optimizations for non-AI low latency applications): (subset of methods to consider)
Reference: 500+ LLM Inference Optimization Techniques (blog article)
- Kernel fusion
- Kernel fission
- Kernel tiling/blocking
- Quantization (integer-based approximation of floating-point)
- Low-bit quantization
- Binary quantization (1-bit)
- Integer-only arithmetic
- Floating-point quantization (FP16/FP8/FP4)
- Mixed precision quantization
- Logarithmic quantization
- Dyadic quantization
- Low rank matrices
- MatMul/GEMM optimizations (many)
- MatMul data locality optimizations
- Sparse MatMul
- Approximate matrix multiplication
- Contiguous memory block matrix multiplication
- Cached transpose MatMul
- Fused transpose MatMul
- Tiled/blocked MatMul
- Sparsification (Pruning/Sparsity)
- Token pruning (input compression)
- Token skipping
- Token merging
- Data compression algorithms
- Early exiting (of layers)
- Caching optimizations
- Vector computation caching
- Zero skipping
- Negative skipping
- Padding optimizations
- Zero padding removal
- Zero-multiplication arithmetic
- Adder/addition (zero-multiply)
- Bitshifts (zero-multiply)
- Bitshift-add (zero-multiply)
- Double bitshift-add (zero-multiply)
- Add-as-integer (zero-multiply)
- Logarithmic arithmetic (zero-multiply)
- Hadamard element-wise matrix multiplication
- End-to-end integer arithmetic
- Table lookup matrix multiplication
- Weight clustering (grouped quantization)
- Vector quantization
- Parameter sharing
- Activation function optimizations (non-linear functions)
- Precomputation of Activation functions
- Approximation of Activation functions
- Integer-only approximation of Activation functions
- Fused activation functions
- Normalization optimizations (non-linear vector data functions)
- Fused normalization optimizations
- FFN optimizations (double MatMul)
- FFN approximations
- FFN integer-only
- Decoding algorithm optimizations
- Speculative decoding
- Multi-token decoding
- Ensemble decoding
- Consensus/majority-vote decoding
- Easy-hard queries
- Batching computations
- Advanced number systems
- Posit numbers
- Dyadic numbers
- Hybrid number systems
- Fixed point numbers (integers not floating-point)
- Block floating-point (BFP) hybrids
- Logarithmic number system (LNS)
- Disaggregation (prefill/decoding)
- Computation re-use
- Conditional computation
- Approximate caching
- Addition arithmetic optimizations
- Approximate addition
- Bitwise arithmetic optimizations
- Fast multiplication arithmetic
- Approximate multiplication
- Logarithmic approximate multiplication
- Approximate division
- Bitserial arithmetic
- Copy arrays by wrapping them in a dummy struct (struct assignment copies the array member in one step)
More AI Research Topics
Read more about:
Aussie AI Advanced C++ Coding Books
C++ AVX Optimization: CPU SIMD Vectorization:
Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization

C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
Get your copy from Amazon: C++ Ultra-Low Latency

Advanced C++ Memory Techniques: Efficiency & Safety:
Get your copy from Amazon: Advanced C++ Memory Techniques

Safe C++: Fixing Memory Safety Issues:
Get it from Amazon: Safe C++: Fixing Memory Safety Issues

Efficient C++ Multithreading: Modern Concurrency Optimization:
Get your copy from Amazon: Efficient C++ Multithreading

Efficient Modern C++ Data Structures:
Get your copy from Amazon: Efficient C++ Data Structures

Low Latency C++: Multithreading and Hotpath Optimizations: advanced coding book:
Get your copy from Amazon: Low Latency C++

CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization

CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging