Aussie AI Blog
List of 600+ Low-Latency C++ Techniques
-
September 22, 2025
-
by David Spuler, Ph.D.
List of C++ Low-Latency and Efficiency Techniques
This is a compilation of coding efficiency and low latency C++ programming techniques from various books and articles:
- C++ Ultra Low Latency, David Spuler, July 2025.
- C++ Low Latency, David Spuler, March 2025.
- CUDA C++ Optimization, David Spuler, June 2024.
- Generative AI in C++, David Spuler, March 2024.
- 500+ LLM Inference Optimization Techniques (blog article)
Here’s the long list:
-
Low Latency C++ General Software Approaches:
- Cache warming
- Core pinning (“affinity”)
- False sharing (avoiding)
- Branch prediction optimizations
- Hotpath optimizations
- Slowpath removal
- Kernel bypass
- Lock contention (reducing)
- Lock-free programming (with atomics and memory ordering issues)
- Thread pools
- SIMD CPU instructions
- Inline assembly language (“asm” statements)
- Intrinsic functions (often closely mapping to machine code instructions)
- In-memory logging
- Cache locality (for L1/L2/L3 memory caches and instruction caches)
- Specialized data structures
- Thread-Local Storage (TLS) (the thread_local keyword in C++11)
- Shared memory (e.g., the shmctl "shared memory control" function, shmget, shm_open, ftruncate)
- Memory-mapped files/devices (e.g., mmap, munmap)
- Asynchronous programming (std::async)
Concurrency-Friendly Data Structures:
- Read-only data structures
- Reader-friendly data structures (e.g., many readers, one writer)
- Copy-on-write data structures (for readers)
- Versioned data structures (for readers)
- Partition data across threads (vertically: columns)
- Shard data across threads (horizontally: rows)
- Read-Copy-Update (RCU) (mostly the same as copy-on-write)
- NUMA-aware data structures (reduce cross-node communications)
- Transactional memory (synchronization efficiency, reduces contention): use atomic/isolated transactions (an emerging technology)
Hotpath Optimizations:
- Optimize all steps in the hotpath (e.g., data ingestion, decision, trade execution, logging, risk management)
- Profile the hotpath specifically (e.g., a test mode that always runs the hotpath)
- Examine assembly code of the hotpath
- Avoid memory allocation calls on hotpath (e.g., memory pools, preallocation)
- Avoid free/deallocation of memory on hotpath
- Use preallocated memory on hotpath
- Review data de-serialization and serialization costs
- Use in-memory databases for any significant amounts of incoming data
- Keep the client network connection warm (method depends on the API)
- Re-use objects to avoid constructor/destructor calls on hotpath
General Tuning Advice:
- Avoid micro-optimization
- Avoid optimizing error handling code (it’s a slowpath)
- Loop optimizations (see below)
- Avoid nested loops
- Tune inner loop for nested loops
- Avoid excessive function wrapper overhead
Performance Profiling Tools:
- gprof
- perf
- prof (older)
- pixie (older)
Lock Contention Reduction:
- Late lock acquisition
- Early lock release
- Short critical section of code
- Generally reduce total numbers of locks used
- Locking fine-grain vs coarse-grain
- Use fine-grain locks for contested resources
- Use a hybrid fine-grain/coarse-grain lock strategy
- Release locks before significant computation
- Copy data to temporary variables to unlock before computation
- Release locks before blocking for I/O
- Release locks before blocking for system calls
- Release locks before blocking for networking
- Tolerate lockless output overlaps
- std::shared_mutex and std::shared_lock (multiple readers, one writer)
- Double-checked locking method (check first without a lock)
- Use message-passing via std::promise and std::future rather than shared memory
- Thread-specific queues and the "work stealing" design pattern
- Use a lock-free queue data structure
- thread_local keyword (C++11)
- std::lock_guard (C++11)
- std::lock_guard early release by scope control
- std::unique_lock (C++11) (more granular control than std::lock_guard)
- std::scoped_lock (C++17)
- Locking with timeouts (try locks)
- Avoid spinlock busy waiting
- Exponential backoff to avoid spinlock costs
See also “lock-free programming”
See also “concurrency-friendly data structures”
Thread/lock overhead reduction (generally):
- Reduce thread launch overhead
- Reduce thread destruction overhead
- Reduce lock acquisition/release overhead
- Reduce lock contention overhead
- std::make_shared() or std::allocate_shared() do only one allocation (the object and control block are combined), whereas constructing a shared_ptr<type> from a raw new pointer does two allocations (the object and the control block are separate).
- Weak pointers (std::weak_ptr) can delay the deallocation of a shared_ptr's control block (and, with make_shared, the object's storage) even after the strong reference count reaches zero.
System code optimizations (general ideas):
- Avoid system calls to reduce context switches (in Linux)
- Use C++ “intrinsics” functions (highly optimized assembly-level code)
Linux socket programming:
- Non-blocking sockets versus using select() with a timeout (allows the thread to do "other" useful work rather than just wait)
- Use the poll() or epoll system calls rather than blocking waits
Context Switching Reduction:
- Thread counts (not too many threads)
- Thread specialization
- Thread specialization (producer-consumer thread model)
- Use custom thread pools with only preallocated memory block pools.
- Spinlocks avoid context switches (especially good if the spin lasts only a short time)
- Avoid context switch cost by having a thread do “other” work, rather than just blocking.
Cache Locality Optimizations:
- Tiling/blocking algorithms
- Tiling/blocking matrix multiplication (MatMul/GEMM)
- Smaller data type sizes for increased locality
- Choose a CPU with a larger L1 “cache line size” (64-256 bytes common)
- std::hardware_destructive_interference_size and std::hardware_constructive_interference_size (C++17)
- std::initializer_list (C++11) can be used as a lightweight container with contiguous elements
See also “cache warming (prefetch)” optimizations
See also “false sharing (avoid)” optimizations
Instruction Cache Locality Optimizations:
- Prefer shorter blocks of code in the hotpath
- Consider not inlining function calls (for instruction cache locality)
See also “branch prediction optimizations”
Branch Prediction Optimizations (General):
- Branch elimination
- Branch compiler hints
- Branch prediction heuristics
- Branch profiling (two-phase)
- Branchless programming
- Tools to measure branch prediction data (e.g., perf)
Branch Reduction Techniques:
- Algorithm-level changes to reduce branches
- Keep loop bodies short (shorter branches)
- Reduce far branching (e.g., function calls)
- Reduce overall use of function calls (see function call optimizations)
- Reduce use of if statements
- Reduce use of loops
- Reduce use of break statements (in loops, not switch!)
- Reduce use of continue statements
- Reduce use of switch statements
- Reduce short-circuiting in &&/|| operators
- Reduce short-circuiting of the ?: ternary operator
- Avoid virtual function calls (hidden dynamic branches)
- Avoid pointers-to-functions (hidden dynamic branches; blocks inlining)
- Avoid function objects/functors (hidden dynamic branches)
- Avoid lambda functions passed as arguments (depends on how well the optimizer can handle them)
- Reduce long if-else-if sequences
- Reduce nested if-else sequences
- Avoid branches depending on anything unpredictable
- Avoid branches depending on user inputs
- Avoid branches depending on random numbers
- Avoid branches depending on system clocks
- Sort array data for efficient branch prediction, if scanning through the array comparing the data (e.g., before testing for error range)
See also “compile-time optimizations” (remove branches at compile-time)
See also “loop optimizations” (reduce loop iterations, e.g., loop unrolling)
Branch Prediction Heuristics:
- Common case code in the if block
- Uncommon case code in the else block
- Error handling code in the else block (uncommon code)
- Avoid zero-iteration loops (never entered)
- Avoid single-iteration loops (never loop back)
Branch Prediction Compiler Hints:
- [[likely]] and [[unlikely]] path attributes (C++20)
- likely()/unlikely() macro-style wrappers over the C++20 attributes
- __builtin_expect (GCC)
- Define LIKELY and UNLIKELY macros with __builtin_expect (pre-C++20)
- [[noreturn]] (C++11)
- [[assume(expression)]] attribute (C++23)
- hot (GCC function attribute)
- GCC __builtin_unreachable
- std::unreachable (C++23) (helps branch prediction)
- [[fallthrough]] (C++17) (more for safety than speed)
- -fdelayed-branch compiler flag
- -fguess-branch-probability compiler flag
- -fif-conversion and -fif-conversion2 compiler flags
- Use "likely" and "unlikely" hints in custom assertion macros
- Use "likely" and "unlikely" hints in error handling code macros
Branch Profiling:
- -fprofile-arcs (GCC option)
- -fprofile-generate (GCC command-line argument)
- -fprofile-use (GCC command-line argument)
- Branch profiling with 100% hotpath (test modes)
Branchless Programming Techniques:
- Ternary operator preferred over if statements (if it compiles to a CMOV instruction)
- Boolean variables as 0 or 1 in arithmetic
- Logical operators (&&/||) as 0 or 1 in arithmetic
- Bitwise operators (&/|) replace logical operators (&&/||)
- Sign bit extension bit masks
- Lookup tables for branchless programming
- XOR trick to swap two integer variables without a temporary variable
Slowpath Removal:
- Optimize error checking pathways
- Remove error checking tests
- Defer error checking tests to later
- Combine error checking tests together (and do it later)
- Avoid adding error checks deeper in the call hierarchy
- Never-failing functions (cannot return an error)
- Don’t use memory allocation (avoids memory allocation failure)
Cache Warming Methods:
- Prefetch memory primitives
- __builtin_prefetch (GCC)
- _mm_prefetch (x86 intrinsic)
- volatile on temporary variables
- Dry-run execution mode
- Branchless dry-run execution with arr[2] declarations
- Use read-only cache warming pathways (avoids cache invalidation for other threads)
- Use deep cache warming all the way down into the NIC
- Optimize cache warming code by reducing data reads (relies on cache line sizes)
- Limit cache warming code to the maximum size of the memory cache (avoids redundant cache warming when the cache is already full)
False Sharing (Avoiding):
- Use alignas(64) (or 128 or 256) to avoid false sharing (C++11)
- Use alignas on all shared memory or atomics (C++11)
- Tools to automatically detect false sharing (e.g., perf c2c; Valgrind's DRD may not detect it)
Parallelism (General Categories):
- Multithreading
- Multiprocess
- Vectorization
- Pipelining
- Parallel execution modes (C++17)
- Coroutines (C++20)
Advanced C++ Concurrency Data Structures:
- Read-only ("immutable") data structures
- Lock-free algorithms and data structures
- Linear search can be efficient for small sizes because of cache prefetching (e.g., rather than binary search; it also doesn't need the data kept sorted)
SIMD Instructions:
- AVX (x86 CPUs)
- ARM Neon
- std::simd (experimental/C++26)
- The <immintrin.h> intrinsics header
Linux O/S Optimizations:
- Process priority upgrades (the nice command or system call)
- Disable unimportant processes
- Overclock the CPU
- Overclock the GPU
- Disable Security-Enhanced (SE) Linux
- Disable accounting mode in Linux (should be off anyway)
Linux Kernel Optimizations:
- Scheduling algorithm kernel modifications
- Tweak TCP/UDP network buffer settings (Linux kernel)
- Turn off file "last access date" storage (noatime in /etc/fstab)
System Hardware Optimizations (Categories):
- Processor hardware (CPU)
- Network optimizations
- Disk optimizations
- RAM Memory optimizations
Processor Hardware Optimizations (Major Categories):
- CPU
- GPU
- NPU
- FPGA
- ASIC
Networking Hardware Optimizations (Categories):
- NIC
- Switches
- Load balancer devices
- Size of the packet buffer of a switch (optimizing for)
Networking Transmission/Protocol Optimizations (Categories):
- Physical proximity
- Co-Lo
- TCP
- UDP (faster than TCP but unreliable)
- Optical networking (optical fiber cables)
- Microwave network transmission
- Packet fragment manipulations (e.g., out-of-order)
- Reduce packet fragment collation overhead
- Reduce packet consistency checking (error safety overhead)
Networking Software Optimizations:
- TcpDirect/Onload
- SolarFlare/OpenOnload (kernel bypass)
- Exablaze (NIC with kernel bypass support)
- DMA
- PCIe bus
- Compress data sizes for your network transmissions
- Sticky sessions (avoids needing to send user-specific caches between servers)
- Shared storage rather than other server-to-server networking (e.g., NAS/SAN)
- Use custom wrappers for TCP and UDP network processing
GPU & Distributed Networking Optimizations:
- RDMA
- NVLink
- Infiniband
- RoCE
- GPUDirect
- PXN
Deployment Optimizations (Website backends):
- DNS optimizations
- Round-Robin DNS (RRDNS)
- SSL time optimizations
- etags (website server speedup)
- Multiple identical servers architecture
- Use subdomains for static files
- CDN for static files
- Compression modes enabled
- Static files compressed
- Minify static files (CSS, JavaScript)
- Merge multiple small files together
- Use smaller image files (low precision)
- Merge multiple small icon images into one image file
- Cache duration settings
- Database optimizations (various, e.g., MySQL/MariaDB/MongoDB)
- Database indexes
- Application server optimizations (e.g., Tomcat)
Apache/Nginx Subprocess Optimizations:
- Use FCGI rather than classic CGI integrations
- Flush stdout of subprocesses (sends partial output to Apache or Nginx earlier)
- Close stdout of subprocesses before the shutdown sequence (lets Apache or Nginx finish earlier)
- Early tests for violations and invalidity (fail quickly)
Algorithm Enhancements:
- Precomputation (lookup tables)
- Precomputation to data file
- Precomputation of source code
- Incremental algorithms
- Data structure augmentation
- Parallelization
- Vectorization
- Caching
- Lazy evaluation
- Common case first
- Simple case first
- Approximate tests first
- Bounding box approximate tests
- Bounding sphere approximate tests
- Avoid sqrt by using arithmetic on squares
- Integer arithmetic on squares: avoid floating-point by using arithmetic on squares
- Use variance, not standard deviation (arithmetic on squares)
- Approximations
- Compute budget algorithms
- Probabilistic/stochastic algorithms
- Skipping algorithms
- Heuristic algorithms
- Greedy algorithms
Memory Reduction Strategies:
- Take care with memory reduction, as some methods can reduce speed (trade-offs)
- Reduce allocated memory
- Smaller data sizes
- Pack data into smaller integer sizes
- Pack data into bits
- Pack data using bit-fields
- Pack data into unions
- Use std::bitset
- Use std::vector<bool> (it is a special bit-packed template specialization)
- Structure packing (also for class data members): reorder different-sized data members for better packing and fewer padding bytes
- Structure packing: biggest data types first (heuristic)
- Structure packing: the MSVS /d1reportSingleClassLayout compiler option reports on it
- #pragma pack reduces padding to reduce size, but may worsen structure access costs
- Stack data reductions
- Avoid deallocation of heap memory when in shutting-down mode
Heap Allocated Memory Reduction Strategies:
- Fewer allocated memory blocks
- Avoid frequent small allocations
- Preallocation of dynamic memory
- Memory fragmentation avoidance
- Memory leak avoidance
- Merge memory allocations together
- Memory pools (fixed-size allocations, often a type of preallocation)
- Memory pool with O(1) deletion and O(1) insertion via permutation array
- Merge fixed-size allocated objects into a large array
- Custom memory allocators (generalized)
- Class-specific memory allocator
- Custom global memory allocator
- Late allocation (allocate memory as late as possible)
- Early free of memory (deallocate as early as possible)
- Early delete of memory (deallocate early)
- Avoid realloc (slow; causes memory fragmentation)
- Smart dynamic buffers (hybrid of allocated and non-allocated memory)
- std::aligned_alloc for memory alignment (C++17)
- std::aligned_union (C++11)
Static Memory Size Reductions:
- Avoid large global arrays and buffers
- Avoid large static arrays and buffers
- Avoid large static C++ data members
- String literal memory reductions
Stack Memory Size Reductions:
- Avoid large local arrays and buffers
- Avoid large function non-reference parameter arrays and buffers
- Use pass-by-reference on large function parameters
- Use integer parameters as local variables
- Consider stack versus memory allocation
- Flattening/reducing function call hierarchy
- Inline small functions (compiler can disappear them)
- Use #define macros for small functions (versus inlining)
See also: function call hierarchy flattening
See also: recursion avoidance
Code Size Reduction Strategies:
- Code size reductions
- DLLs versus static libraries
- Remove executable debug information
- Avoid the compiler "-g" debug option
- Avoid the compiler "-p" profiler option
- The Unix strip command
- Avoid large inline functions (instruction cache locality)
- Don't overuse "always inline" or "force inline"
- Avoid template overuse
- Google's "bloaty" tool
Standard Library Optimizations (STL Optimizations):
- String processing efficiency (e.g., "+" for std::string can be slow)
- std::vector of non-trivial class objects calls constructors/destructors
- Control array size for std::vector using reserve()
- Use std::sort rather than qsort
- bsearch is not your friend
- Consider hard-coded arrays versus std::array versus std::vector
- Compare the first letters of strings before calling strcmp
- Consider type casts to int versus round(), ceil(), floor()
- Avoid printf/fprintf format string processing with putchar/putc/fputc or puts/fputs
- Hand-code versions of abs and fabs/fabsf that don't handle Inf/NaN numbers (but benchmark it)
- Change strlen("literal") to char arr[]="literal" and use sizeof(arr)-1
- Don't use strlen(s) in a for loop condition
- Consider your own atoi/itoa versions that don't handle all the obscure cases
- Avoid sprintf and snprintf (both are slow)
- sync_with_stdio(false)
- std::stringstream is slow (hand-code text field processing instead)
Data Structures:
- Hashing (basic)
- Perfect hashing
- Bit vectors
- Bit sets
- Bloom filters (bit vectors + hashing)
- Binary tree
- Sorted arrays
- Unsorted arrays
- Stacks
- Queues
- Deques
- Vector hashing
- Permutation arrays
- Locality-sensitive hashing (LSH)
- Bit signatures (vector algorithm)
- K-means clustering (vector algorithm)
- Hyper-cube (vector algorithm)
- Approximate nearest neighbor (ANN) (vector algorithm)
Variable Optimizations:
- Prefer int types to char or short (usually)
- Prefer int types to unsigned int (usually)
- Prefer int types to size_t (usually it's an unsigned long; consider uint32_t)
- Avoid unnecessary initializations
- Re-use objects to avoid initialization/destruction
- Avoid temporary variables
- Use reference variables instead of full temporary variables
- Avoid creating temporary objects
- Put commonly used data fields first in a struct/class
- Declare variables as close as possible to usage
- if initializer syntax (C++17)
- switch initializer syntax (C++17)
- Avoid bit-fields (smaller but slower to access or set)
- Use memory alignment primitives to avoid slow-downs
- Put the most-used data member first (it has a zero offset)
- Order data members most used to least used (smaller offsets are faster, in theory)
- Array initializer lists as local variables (re-initialized each call)
- Structure-of-Arrays (SoA) data layout is more vectorizable than Array-of-Structures (AoS)
Arithmetic Optimizations:
- Operator strength reduction
- Reciprocal multiplication
- Integer arithmetic
- Use float not double
Expression Optimizations:
- Expression transformations
- const
- mutable keyword bypasses const (C++98) (speedy but unsafe)
- Common subexpression elimination (CSE)
- Constant folding
- Template fold expressions (C++17) are concise but often a lot of computation
- Expression templates (avoid explicit temporary variables; the compiler optimizes them better)
- Constant propagation
- Redundant assignment removal
- Strength reduction
- Algebraic identities
- Implicit type conversions (avoiding; type consistency)
- explicit keyword to prevent implicit type conversions (C++98)
- Brace initialization syntax {} (avoids implicit narrowing conversions)
- auto variable declarations avoid accidental temporaries and implicit type conversions
- Don't mix float/double types (including their constants)
- Don't mix integer types
- Prefer signed integers over unsigned types
- Short-circuiting of sub-expressions (using &&/||/?:)
- Register allocation optimizations
- The mprotect page system call, used as an optimization to make memory writable
- <algorithm> simple algorithms: min, max, etc.
- Range check faster with casts via "(unsigned)i < MAX" not "i >= 0 && i < MAX"
Memory Block Operations:
- Prefer contiguous memory blocks (locality, efficient block operations, etc.)
- Some class types allow block copying: POD (Plain Old Data), trivial types, standard-layout types (e.g., check in a template using std::is_trivial)
- Copy arrays by wrapping them in a dummy struct
- Copy arrays with memcpy
- Compare arrays with memcmp (very dangerous: padding bytes, negative zero, NaNs)
- Use memcpy not memmove if arguments won't overlap
- Linearize multi-dimensional arrays (contiguous memory blocks)
Operator Strength Reduction Optimizations:
- Replace * with bitshifts
- Replace * with addition
- Replace x*2 with x+x
- Replace % with bitwise-and (&)
- Replace % with increment and test
- Replace % with type casts (if byte sizes)
Bitwise Optimizations:
- Intrinsic bitwise functions
- CLZ (count leading zeros) bitwise intrinsics
- CTZ (count trailing zeros) bitwise intrinsics
- Popcount bitwise intrinsics (set bit count)
- Kernighan bit trick (count set bits by repeatedly clearing the lowest set bit)
- Fast NOR/NAND/XNOR via assembly instructions
- Fast LOG2 of integers
- Fast largest power-of-two of integers
Floating-Point Optimizations:
- Convert float to 32-bit integers (float bit manipulations)
- FTZ (Flush To Zero) mode
- DAZ (Denormals Are Zero) mode
- LOG2 of floating-point is the exponent
- Zero/negative-zero bitwise tests
- Disallow negative zero (to use faster zero comparisons)
- NaN (Not-a-Number) bitwise tests
- Inf/-Inf bitwise tests
- Avoid denormalized numbers
- Disable denormalized numbers (subnormals) (compiler/library modes)
- Avoid underflow in floating-point (ignore it)
- Avoid overflow in floating-point (ignore it)
- memcmp float vector equality (disallow special values for fast float vector equality comparison)
- Fast detection of special values in float vectors (bitwise operations)
- Floating-point intrinsic functions (various)
- Exponent addition: bitshift floating-point numbers by adding to the exponent bits
- Sign bit flipping/extraction/setting (bitwise tricks)
Compiler Settings for Floating-Point:
- GCC -ffast-math option (faster math mode)
- GCC -fno-math-errno (faster math in multithreading by not setting errno)
- GCC -ffinite-math-only
- GCC -fno-trapping-math
- MSVS /fp:precise, /fp:strict, /fp:fast
- Disable floating-point exceptions
Loop Optimizations:
- Exit loops early (e.g., break or return statements)
- Finish the loop body early (i.e., a continue statement)
- Correct choice of loop
- Loop unrolling
- #pragma unroll
- Loop fusion
- Loop perforation (probabilistic)
- Loop tiling/blocking
- Loop fission
- Loop reversal (don’t use!)
- Loop code motion (“hoisting”)
- Loop distribution
- Loop iterator strength reduction
- Loop coalescing
- Loop collapsing
- Loop peeling
- Loop splitting
- Loop interchange
- Loop sentinel
- Loop strip mining (loop sectioning)
- Loop spreading
- Loop normalization
- Loop skewing
- Loop interleaving
If Statement Optimizations:
- Replace if-else-if sequences with switch
- Replace if-else-if sequences with a lookup table
Switch Statement Optimizations:
- Use compact numeric ranges in switch (the compiler can use a LUT)
Compile-Time Optimizations:
- inline functions
- always_inline specifier
- GCC flatten attribute
- gnu_inline GCC specifier
- Keep inline functions short (helps the compiler to inline)
- Keep inline functions in header files (source available to all their call sites)
- Avoid making a virtual function inline (it compiles but is usually a slug)
- sizeof
- Use sizeof with static_assert (e.g., portability checks)
- Virtual functions cannot be inlined (although it compiles)
- Pointer-to-function usages of functions cannot be inlined
- Function objects (functors) cannot always be inlined
- Lambda functions cannot always be inlined
- inline variables (C++17) (helps with linking)
- static_assert (compile-time assertions)
- const is good
- constexpr (C++11) is great
- constexpr functions allow if, switch, loops, etc. (C++14)
- constexpr lambda functions (C++17)
- constexpr and placement new (C++26)
- References to constexpr variables (C++26)
- if constexpr statements
- constinit
- consteval
- if consteval (C++23)
- Type traits <type_traits> (C++11)
- typeid is slow (RTTI)
- std::is_same_v (type trait test)
- Template specialization (for specific types)
- Template specialization (for constant integers)
- Variadic templates (C++11)
- Template Meta-Programming (TMP) still works, but prefer constexpr
- Auto-vectorization (by compiler)
- Auto-unrolling of loops (by compiler)
- SFINAE tricks (mostly an issue for compiler engineers)
Pointer Aliasing:
- Reorganize functions with awareness of pointer aliasing issues
- Restricted pointers (to avoid pointer aliasing slowdowns)
- -fstrict-aliasing compiler option (alternative to using restrict)
Pointer Arithmetic:
- Loop pointer arithmetic
- End-pointer address tricks (loop pointer arithmetic)
- Use references not pointers (avoids null testing)
- Prefer postfix operations with the *ptr++ idiom (not prefix ++ptr)
- Pointer comparison tricks
- Pointer difference tricks
- Avoid safe pointer class wrappers (prefer raw pointers for speed)
Pointer Optimizations (Other):
- reinterpret_cast helps the optimizer and is effectively a free compile-time hint
- Avoid dynamic_cast (to downcast from a base to a derived class, which can be helpful for specializing member calls, but dynamic casts can be expensive at runtime because of RTTI)
Function Optimizations:
- Return early from functions
- Flatten function call hierarchies
- Callbacks are an extra layer of function call
- Lambda functions are convenient but are an extra function call layer (though often inlined)
- Function objects (functors) are an extra function call
- Avoid recursion (completely; we’re not in High School anymore)
- Replace simple recursion with a loop
- Replace complex recursion with a stack
- Tail recursion elimination
- Recursion higher base level
- Collapse recursion levels
- Specialize functions with default arguments (use two versions)
- Specialize functions with void and non-void versions (if the return value is often ignored)
- Avoid function pointers (cannot be inline or constexpr)
- Merge multiple Boolean function parameters into a "config" object with Boolean data fields
- noexcept attributes allow the compiler to avoid adding extra code (C++11)
- std::initializer_list can be used to return multiple values (benchmark against other methods)
C++ Class Optimizations:
- friend functions (bypass interfaces)
- friend classes (bypass interfaces)
- Return references rather than objects
- Avoid temporary class objects in expressions
- Add extra member functions to avoid temporary object creation
- Pass objects by reference to functions (i.e., "&" or "const &")
- Disable copy constructors with private or = delete
- Disable assignment operators with private or = delete
- Declare assignment operators with a void return type (except when defaulting)
- Re-use objects to avoid constructor and destructor calls
- Avoid calling the destructor when in shutting-down mode
- Uninitialized memory algorithms, e.g., std::uninitialized_fill (C++17)
- CRTP (Curiously Recurring Template Pattern): the derived class derives from a base class that is a template parameterized on the derived class (this makes the polymorphism compile-time, avoiding virtual function calls, and allows more inlining of those calls)
- Move constructors
- Move assignment operators
- std::move (C++11/C++14) is usually a compile-time cast
- Return object reference types (not complicated objects)
- Avoid virtual function calls with explicit calls to the specific function
- Specialize inherited member functions (for the more restrictive type)
- Avoid overloading the postfix increment/decrement operators
- Block the overloaded postfix increment/decrement operators (void body or = delete)
- Consider skipping destructor cleanup if the program is shutting down
- Avoid accidental double initialization of data members in constructors
- Avoid redundant initialization of the same members in both the constructor and "setup" methods
- Specialize member functions with default arguments (use two versions instead)
- Default constructors/destructors with =default may be more efficient than hand-coded versions
- Trick for the singleton pattern in multithreading: one thread initializes a function-local static variable while other threads block; once-only initialization is guaranteed by the C++ compiler
Advanced C++ Compiler Optimizations:
- Copy elision (a compiler auto-optimization that avoids a copy constructor call in certain cases)
- Guaranteed copy elision (C++17)
- Named return value elision (a type of copy elision)
- Temporary return value elision (a type of copy elision)
- Copy elision in exception handling (special case of copy elision)
- Allocation elision (new operator) (C++14)
- Use xvalue or "expiring value" optimizations (various)
- Trick: to disallow creating an object on the stack, make its destructor private
- Trick: to disallow creating an object on the heap, make its new and new[] operators private
Byte Block Operations in C++ Classes (use with extreme care!):
- memset/bzero to zero in a constructor: fast but dangerous; overwrites the internal "vtable" data in the object if the class has any virtual functions, does not call constructors of its data members or base class members, and cannot be combined with an initializer list, since it overwrites with zeros any members the initializer list has set.
- memcpy to bitwise-copy in a copy constructor or assignment operator: fast but dangerous; improperly copies internal vtable data in the object if the class has any virtual functions, and neither deep-copies its members or base class members nor calls their constructors.
- memcpy to bitwise-copy in a move constructor or move assignment operator: fast but dangerous; improperly copies the "vtable".
- memcmp to bitwise-compare for equality/inequality tests: fast but fails in many situations due to pitfalls: padding bytes, bit-field members, negative versus positive zero floating-point values, NaN floating-point values.
- Virtual inheritance: usually for pure virtual base classes; avoids duplicate sub-objects when the same base class is inherited via two different paths.
Timing C++ Methods:
- std::chrono C++ classes (highly granular)
- clock() C/C++ function
- time command (Linux shell)
- time() function (granularity is only in seconds)
- gettimeofday() function
Benchmarking C++ Methods:
- Loop unrolling for accurate benchmarking
- Use the volatile specifier for accurate benchmarking
- Loop overhead measurement for accurate benchmarking
- Google Benchmark: Apache 2 license; code: https://github.com/google/benchmark
Compiler Settings:
- Optimizer settings
- Optimizing for space/memory size (compiler flags)
General Build & Software Development Practices for Efficiency:
- Maintain separate builds for slow testables versus production executables
- Compile-out assertions
- Compile-out self-testing code
- Compile-out debug code or tracing code
- Ensure test code not accidentally left in production (test a global flag based on these macros at startup)
CUDA C++ GPU Optimizations:
- Coalesced memory accesses
- Thread specialization (GPU)
- GPU thread pools
- Producer-consumer thread pools
- GPU kernel optimizations
- Striding (GPU kernels)
- Overlapping GPU uploads and compute
- Overlapping with recomputation/rematerialization
- Offloading to CPU
- Pinned memory blocks
- Warp divergence (warp coherence)
- Grid optimizations
- Grid size optimizations
Core Utility Classes (Efficiency Helpers): (to build for overall efficiency practices)
- Bitwise macro library (bitflag management)
- Floating-point fast bitwise operations macro library
- Benchmarking/timing library
- Smart buffer library (reduce allocations by combining allocated/non-allocated memory management)
- TCP/UDP wrapper library
- Specialized data structures for small amounts of data (faster than STL)
- Sorted array and binary search (small array size)
- Lock-free queues
- Perfect hashing library
- Bit vector data structures (possibly based on STL)
- Bit set data structures (possibly based on STL)
- Bloom filter library
- Vector hashing library
- Caching utilities library
- Source code precomputation library
- Basic data and statistics on vectors (e.g., averages, std dev/variance, etc.)
- Incremental vector algorithms (averages, min, max, etc.)
- Branchless coding primitives library
- Graph library for locking analysis
- Data compression library
- Approximate tests library
- Math library (versus STL)
- Memory pools library (fixed-size custom memory allocators)
- Custom memory allocator library
- Placement new operator versions
- Placement delete operator (write your own)
- Multi-dimensional array library (linearize your vectors/matrices/tables/tensors)
AI Kernel Optimizations (using LLM Inference Optimizations for non-AI low latency applications): (subset of methods to consider)
Reference: 500+ LLM Inference Optimization Techniques (blog article)
- Kernel fusion
- Kernel fission
- Kernel tiling/blocking
- Quantization (integer-based approximation of floating-point)
- Low-bit quantization
- Binary quantization (1-bit)
- Integer-only arithmetic
- Floating-point quantization (FP16/FP8/FP4)
- Mixed precision quantization
- Logarithmic quantization
- Dyadic quantization
- Low rank matrices
- MatMul/GEMM optimizations (many)
- MatMul data locality optimizations
- Sparse MatMul
- Approximate matrix multiplication
- Contiguous memory block matrix multiplication
- Cached transpose MatMul
- Fused transpose MatMul
- Tiled/blocked MatMul
- Sparsification (Pruning/Sparsity)
- Token pruning (input compression)
- Token skipping
- Token merging
- Data compression algorithms
- Early exiting (of layers)
- Caching optimizations
- Vector computation caching
- Zero skipping
- Negative skipping
- Padding optimizations
- Zero padding removal
- Zero-multiplication arithmetic
- Adder/addition (zero-multiply)
- Bitshifts (zero-multiply)
- Bitshift-add (zero-multiply)
- Double bitshift-add (zero-multiply)
- Add-as-integer (zero-multiply)
- Logarithmic arithmetic (zero-multiply)
- Hadamard element-wise matrix multiplication
- End-to-end integer arithmetic
- Table lookup matrix multiplication
- Weight clustering (grouped quantization)
- Vector quantization
- Parameter sharing
- Activation function optimizations (non-linear functions)
- Precomputation of Activation functions
- Approximation of Activation functions
- Integer-only approximation of Activation functions
- Fused activation functions
- Normalization optimizations (non-linear vector data functions)
- Fused normalization optimizations
- FFN optimizations (double MatMul)
- FFN approximations
- FFN integer-only
- Decoding algorithm optimizations
- Speculative decoding
- Multi-token decoding
- Ensemble decoding
- Consensus/majority-vote decoding
- Easy-hard queries
- Batching computations
- Advanced number systems
- Posit numbers
- Dyadic numbers
- Hybrid number systems
- Fixed point numbers (integers not floating-point)
- Block floating-point (BFP) hybrids
- Logarithmic number system (LNS)
- Disaggregation (prefill/decoding)
- Computation re-use
- Conditional computation
- Approximate caching
- Addition arithmetic optimizations
- Approximate addition
- Bitwise arithmetic optimizations
- Fast multiplication arithmetic
- Approximate multiplication
- Logarithmic approximate multiplication
- Approximate division
- Bitserial arithmetic
- Copy arrays by wrapping them in a dummy struct (struct assignment copies the array member in one step)
More AI Research Topics
Read more about:
Aussie AI Advanced C++ Coding Books
C++ AVX Optimization: CPU SIMD Vectorization:
Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization

C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
Get your copy from Amazon: C++ Ultra-Low Latency

Advanced C++ Memory Techniques: Efficiency & Safety:
Get your copy from Amazon: Advanced C++ Memory Techniques

Safe C++: Fixing Memory Safety Issues:
Get it from Amazon: Safe C++: Fixing Memory Safety Issues

Efficient C++ Multithreading: Modern Concurrency Optimization:
Get your copy from Amazon: Efficient C++ Multithreading

Efficient Modern C++ Data Structures:
Get your copy from Amazon: Efficient C++ Data Structures

Low Latency C++: Multithreading and Hotpath Optimizations: advanced coding book:
Get your copy from Amazon: Low Latency C++

CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization

CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging