Aussie AI Blog

500+ LLM Inference Optimization Techniques

  • Updated: April 23, 2026
  • by David Spuler, Ph.D.

LLM Inference Optimization

We do a lot of research on inference optimization techniques, so here's a very long list of all the techniques for which we have research papers. There are more than 500 (600+ now!), but see the blog post links below if you only want to know about the latest LLM inference techniques.

Update in April 2026: Some of the newer techniques added to this list:

Update in Feb 2026: As we head into 2026, some of the more recent areas of attention include:

Research areas that remain as hot as always include:

Areas of new research relevance include:

And no doubt much more to come in 2026!

Update in March 2025: Well, now we're into 2025 and this list has outgrown its title. There are over 600 items on the list below, all of which are related to LLM efficiency. The main change in 2025 is that the recent releases of "reasoning models" have spawned a new area of research in optimizing the efficiency of LLM reasoning algorithms such as Chain-of-Thought.

Free AI C++ books: for more about LLM optimization, read books online or download a PDF:

Popular articles: additional research articles on faster LLM inference:

More lists: lots of general efficiency optimization information:

LLM Inference Optimizations List

Here's the list! It's over 600 and growing!

    Reasoning Efficiency Optimization (REO): the latest hot research area in 2025!
  1. Reasoning inference optimization (RIO) (blog article)
  2. Chain-of-Thought (CoT) optimization
  3. CoT token reduction
  4. CoT step skipping
  5. CoT path reduction
  6. CoT early stopping
  7. CoT reasoning decoding
  8. Constrained CoT
  9. Coconut
  10. Concise CoT
  11. Hidden CoT (interim steps in latent space)
  12. CoT prompt sequence optimizations
  13. CoT sparsity
  14. CoT distillation
  15. Long context CoT
  16. — Small Reasoning Model (SRM)
  17. Reasoning tokens
  18. Adaptive inference time compute
  19. — One-step reasoning models (e.g. DeepSeek R1's long answers)
  20. — Augmented scaffold + Small Reasoning Model
  21. Reasoning caching

    Inference Modes and Token API optimizations:
  22. "Fast mode" inference (e.g. from OpenAI or Anthropic)
  23. Cached tokens
  24. Batched tokens
  25. Low batch size inference
  26. Priority batching
  27. API model routing features

    Model compression main subtypes:
  28. Model compression (overview)
  29. Pruning (overview)
  30. Quantization (overview)
  31. Knowledge Distillation (KD)
  32. Parameter sharing (weight sharing)
  33. Low-rank matrices
  34. Small Language Models (SLMs)
  35. Data compression algorithms

    Pruning main types:
  36. Dynamic pruning
  37. Hybrid pruning
  38. Unstructured pruning
  39. Semi-Structured Pruning
  40. Structured pruning

    Layerwise structured pruning subtypes (depth dimension):
  41. Depthwise structural pruning (overview)
  42. Static layer pruning
  43. Layer pruning
  44. Dynamic layer pruning
  45. Layer skipping
  46. Layer approximation
  47. Shallow decoder architecture
  48. Layer reordering
  49. Layer Importance

    Early exiting (dynamic layerwise pruning): Pruning all the layers from the point of exit:
  50. Early exit (overview)
  51. — Confidence-based exit policy
  52. — Patience-based exit policy
  53. — Entropy-based exit policy
  54. — Learned exit points or exit policies
  55. Early exit KV cache fixes
  56. Early exit knowledge distillation
  57. Early exit speculative decoding
  58. Early exit in training
  59. Layer freezing
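
As a rough illustration of a confidence-based exit policy, here is a minimal Python sketch: each layer has a small exit head, and inference stops as soon as the head's softmax confidence crosses a threshold. The toy layers, exit heads, and the 0.9 threshold are all invented for illustration; real early-exit models train the exit heads alongside the backbone.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_forward(layers, x, threshold=0.9):
    """Run layers in order; after each, a per-layer exit head produces
    logits, and we exit as soon as max softmax probability >= threshold."""
    for i, (layer, exit_head) in enumerate(layers):
        x = layer(x)
        probs = softmax(exit_head(x))
        if max(probs) >= threshold:
            return i, probs  # exited early at layer i
    return len(layers) - 1, probs  # fell through to the last layer

# Toy 3-layer "model": each layer doubles the vector; exit heads are identity.
layers = [
    (lambda v: [2 * e for e in v], lambda v: v),
    (lambda v: [2 * e for e in v], lambda v: v),
    (lambda v: [2 * e for e in v], lambda v: v),
]
layer_idx, probs = early_exit_forward(layers, [1.0, 0.0], threshold=0.9)
```

With this toy setup, confidence grows layer by layer and the model exits at the second layer, skipping the third entirely.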

    Width-wise structured pruning subtypes:
  60. Widthwise structural pruning (overview)
  61. Attention head pruning
  62. Slimmable networks (width pruning)
  63. FFN pruning
  64. Channel pruning
  65. Filter pruning

    Length-wise structured pruning subtypes:
  66. Lengthwise structural pruning (longitudinal/input/end-to-end)
  67. Token pruning (input pruning)
  68. Dynamic token pruning
  69. Prompt compression
  70. Context compression
  71. Token merging
  72. Token skipping
  73. Token dropping
  74. Zero padding removal
  75. Token reduction
  76. Token compression
  77. Input text compression

    Model dimension embedding pruning subtypes:
  78. Embedding-dimension pruning
  79. Embedding pruning
  80. Embedding matrix compression (embedding pruning)
  81. Embedding low-rank matrix factorization
  82. Unembedding matrix (output embeddings)

    Hybrid multi-dimensional pruning:
  83. Multi-dimensional pruning
  84. Dual pruning
  85. Triple pruning
  86. Quadruple pruning
  87. 3D CNN model pruning
  88. Pyramid inference

    Transformer component pruning:
  89. Normalization pruning
  90. Positional embeddings pruning
  91. Softmax pruning
  92. Skip connection pruning (residual connection removal)

    Unstructured pruning subtypes:
  93. Unstructured pruning (overview)
  94. Magnitude pruning
  95. Movement pruning
  96. — Gradual pruning

    Quantization theory and major subtypes:
  97. Post-Training Quantization (PTQ)
  98. Quantization-Aware Training (QAT)
  99. Activation Quantization
  100. Outlier-aware quantization (outlier management)
  101. Dequantization

    Quantization overall algorithms:
  102. Uniform quantization
  103. Non-Uniform quantization
  104. Symmetric quantization
  105. Asymmetric quantization
  106. GPTQ: Generative Pre-trained Transformer Quantization
  107. AQLM: Additive Quantization of Language Models
  108. SpQR: Sparse Quantized Representations

    Integer quantization subtypes:
  109. Integer quantization (overview)
  110. Integer-only arithmetic quantization
  111. Fixed-point quantization (integer)
  112. Low-bit integer quantization (overview)
  113. Binary quantization
  114. Ternary quantization
  115. 2-bit quantization (INT2)
  116. 3-bit quantization (INT3)
  117. 4-bit quantization (INT4)
  118. 5-bit quantization (INT5)
  119. 6-bit quantization (INT6)
  120. 7-bit quantization (INT7)
  121. 8-bit quantization (INT8)
  122. 9-bit quantization (INT9)
  123. 10-bit quantization (INT10)
  124. 11-bit quantization (INT11)
  125. 12-bit quantization (INT12)
  126. 16-bit quantization (INT16)
  127. 32-bit quantization (INT32)
  128. — W4A4 quantization
  129. — W4A4KV4 quantization
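
To make the integer case concrete, here is a minimal sketch of symmetric uniform INT8 quantization, one common baseline scheme (real quantizers also handle per-channel or per-group scales, zero points for asymmetric ranges, and outliers):

```python
def quantize_int8(weights):
    """Symmetric uniform INT8 quantization: scale by max |w| onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats by multiplying back by the scale."""
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The quantization error `err` is bounded by half a quantization step (scale/2); the narrower the weight range, the smaller the step.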

    Floating-point quantization subtypes:
  130. Floating-point quantization
  131. FP4 quantization
  132. FP6 quantization
  133. FP8 quantization
  134. FP16 quantization
  135. FP32 quantization

    Quantization error mitigation and metrics:
  136. Quantization errors
  137. Outlier mitigation methods
  138. — Mean-Squared Error (MSE)
  139. — SNR degradation

    Outlier mitigation in quantization:
  140. AWQ: Activation-Aware Weight Quantization

    Other uncommon quantization subtypes:
  141. Logarithmic power-of-two quantization (bitshift quantization)
  142. Double bitshift power-of-two quantization
  143. Division quantization
  144. Cluster-based quantization (Weight clustering)
  145. Hashing-based weight clustering
  146. Dyadic quantization
  147. Fake quantization
  148. Simulated quantization
  149. Stochastic quantization (probabilistic)

    Mixed-precision quantization subtypes:
  150. Mixed-precision quantization

    Granularity-level quantization subtypes:
  151. Granular quantization (overview)
  152. Layerwise Quantization
  153. Blockwise Quantization
  154. — K-quantization
  155. Vector quantization

    Knowledge distillation subtypes:
  156. Knowledge Distillation (overview)
  157. Ensemble Distillation
  158. Unnatural instructions (data sets)
  159. Dataset Distillation
  160. Black Box Distillation
  161. White Box Distillation

    Parameter/weight sharing subtypes:
  162. Parameter/Weight sharing (overview)
  163. Activation sharing
  164. Layer fusion
  165. Clustering (Weights)
  166. Attention head fusion
  167. FFN fusion (sharing parameters)
  168. KV cache layer fusion (depthwise)
  169. KV cache head fusion (widthwise)

    Activation function optimizations:
  170. Activation function optimizations (overview)
  171. Activation function approximation
  172. Integer-only activation functions
  173. Fused activation functions (kernel fusion)
  174. Fused RELU
  175. Fused GELU
  176. Fused SwiGLU
  177. Activation alternatives/replacements
  178. Activation function pruning/removal (bilinear layers)
  179. Activation function reordering

    Normalization optimization types:
  180. Normalization algorithm optimizations (overview)
  181. Approximate normalization
  182. Norm reordering (pre-norm/post-norm)
  183. Integer-only normalization
  184. Normalization alternatives/replacements
  185. Fused normalization (e.g. "fused LayerNorm" in kernel fusion)

    Softmax optimization types:
  186. Softmax optimizations (overview)
  187. Softmax pruning
  188. Approximate Softmax
  189. Softmax alternatives/replacements
  190. Integer-only Softmax
  191. Fused Softmax
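
Most of the Softmax optimizations above start from the standard numerically stable formulation, which subtracts the maximum logit before exponentiating so that exp() cannot overflow. A minimal reference version (this is the baseline that fused and approximate variants then speed up):

```python
import math

def softmax_stable(logits):
    """Numerically stable softmax: subtract the max before exponentiating,
    so exp() never overflows even for very large logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A naive softmax would compute exp(1000) and overflow here.
probs = softmax_stable([1000.0, 1001.0, 1002.0])
```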

    Feed-Forward Network (FFN) optimization types:
  192. FFN optimizations (overview)
  193. FFN pruning
  194. FFN approximation
  195. Fused add-bias
  196. Bias vector pruning
  197. FFN sparsity
  198. FFN alternatives/replacements
  199. Integer-only FFN
  200. FFN fusion (shared parameters)
  201. Inter-FFN fusion (merging two FFNs)
  202. Intra-FFN fusion (with piecewise linear approximations) (merging two linear projections in one FFN)
  203. — Bias vector addition optimizations
  204. — Bias vector pruning (no bias!)
  205. — FFN matrix merging (similar to "intra-FFN fusion")
  206. — Bulging FFN (per-layer FFN size increases)

    MatMul/GEMM optimization types:
  207. MatMul/GEMM kernel optimizations (overview)
  208. Faster matrix multiplication (e.g. Winograd, Strassen)
  209. Approximate matrix multiplication
  210. Transpose cache
  211. Fused multiply-add (FMA)
  212. Fused transpose
  213. Vector dot product optimization
  214. Sparse MatMul/GEMM
  215. — Tiled MatMul
  216. — Triangular MatMul optimizations (causal masking in attention)
  217. — Tiled skipping
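
As a sketch of tiling, the triple loop is blocked so that each small tile of A and B is reused while it is hot in cache; the result is identical to the naive loop, only the iteration order changes. Tile size and the example matrices are arbitrary here; production GEMM kernels pick tile sizes to match cache and register geometry.

```python
def matmul_tiled(A, B, tile=2):
    """Tiled matrix multiply: process tile x tile blocks for cache locality;
    the result matches a naive triple loop exactly."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = C[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] = s
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_tiled(A, B)
```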

    Positional Encoding optimizations:
  218. Positional encoding optimization (overview)
  219. RoPE (Rotary Positional Encoding)
  220. Pruning positional encoding (removal/NoPE)
  221. — Positional encoding approximation
  222. — Integer-only positional encoding
  223. Partial RoPE (p-RoPE)
  224. — RoPE rescaling
  225. — Attention with Linear Biases (ALiBi)
  226. — Relative Attention Biases (RAB)

    NAS subtypes:
  227. Neural Architecture Search (NAS)
  228. Dynamic NAS
  229. Embedding Size Optimization (embeddings NAS)

    Platform-specific optimization subtypes:
  230. On-device inference (native phone and PC AI)
  231. AI Phones
  232. AI PCs (desktops/laptops)
  233. Edge device inference (IoT/mobile/PC)
  234. Hybrid cloud-on-device inference

    Decoding algorithm subtypes:
  235. Decoding algorithms (overview)
  236. Non-autoregressive decoding
  237. Greedy decoding
  238. Top-k decoding
  239. Top-p decoding
  240. Min-P Sampling
  241. Flash decoding
  242. Beam search decoding
  243. Edit decoding
  244. Contrastive decoding
  245. — Approximate top-k algorithms
  246. — Bidirectional decoding
  247. Constrained decoding
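
For example, top-p (nucleus) sampling keeps the smallest set of tokens whose cumulative probability reaches p, then samples only within that set. A minimal sketch (the logits and cutoff are illustrative only; real decoders work on full vocabularies and batch this on the GPU):

```python
import math, random

def top_p_sample(logits, p=0.9, rng=random):
    """Top-p (nucleus) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then sample from that set."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    kept, cum = [], 0.0
    for prob, idx in ranked:
        kept.append((prob, idx))
        cum += prob
        if cum >= p:
            break  # nucleus is complete
    r = rng.random() * cum
    for prob, idx in kept:
        r -= prob
        if r <= 0:
            return idx
    return kept[-1][1]

rng = random.Random(0)
token = top_p_sample([2.0, 1.0, 0.1, -3.0], p=0.9, rng=rng)
```

With a fixed seed the draw is reproducible; the lowest-probability token is excluded from the nucleus and can never be sampled.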

    Parallel Decoding algorithms:
  248. Parallel decoding
  249. Blockwise parallel decoding
  250. n-gram parallel decoding
  251. Lookahead decoding
  252. Medusa decoding
  253. Consensus decoding
  254. — Mutually-guided decoding
  255. — Multi-token generation
  256. — Eagle decoding

    Speculative decoding subtypes:
  257. Speculative decoding (overview)
  258. Generalized speculative decoding
  259. Aggressive decoding
  260. Lookup decoding
  261. Retrieval lookup decoding
  262. Prompt lookup decoding
  263. — Multi-query prompt lookup decoding (across entire LLM history)
  264. Self speculative decoding
  265. Tree speculative decoding
  266. Superposed decoding
  267. Hierarchical speculative decoding
  268. Heuristic speculative decoding
  269. Multi-token speculative decoding
  270. Sequential speculative decoding
  271. Eagle speculative decoding
  272. — Redrafting
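
The draft-and-verify skeleton behind many of these variants can be sketched as follows. The toy integer "models" and the greedy acceptance rule are invented for illustration; real implementations verify all draft positions in a single batched target pass and use a probabilistic acceptance test rather than exact greedy matching.

```python
def speculative_decode(target_next, draft_next, prompt, k=4, n_tokens=8):
    """Greedy speculative decoding sketch: a cheap draft model proposes k
    tokens; the target model accepts the longest prefix matching its own
    greedy choice, plus one corrected token on the first mismatch."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft proposes k tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Target verifies (one call per position here; batched in practice).
        ctx = list(out)
        for t in draft:
            t_target = target_next(ctx)
            ctx.append(t_target)
            if t_target != t:
                break  # mismatch: keep target's token, discard the rest
        out = ctx
    return out[len(prompt):][:n_tokens]

# Toy models over integer tokens: the target counts up by 1; the draft
# agrees except it stumbles after multiples of 3.
target_next = lambda ctx: ctx[-1] + 1
draft_next = lambda ctx: ctx[-1] + (2 if ctx[-1] % 3 == 0 else 1)
tokens = speculative_decode(target_next, draft_next, [0], k=4, n_tokens=6)
```

Because several draft tokens are usually accepted per target step, the expensive target model runs far fewer sequential iterations than plain autoregressive decoding, with identical greedy output.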

    Parameter Efficient Fine-Tuning (PEFT) subtypes:
  273. PEFT (overview)
  274. LoRA
  275. Multi-LoRA inference
  276. QLoRA (Quantized Low-Rank Adapters)
  277. LoRA inference optimizations (load/unload)
  278. Prompt Tuning (Extended Vocabulary PEFT)
  279. Prefix Tuning

    Mixture-of-Experts (MoE):
  280. Mixture of Experts (MoE)
  281. MoE-specific compute optimizations
  282. Hybrid MoE (dense FFN)
  283. — Shared experts
  284. — MoE routing optimizations
  285. — MoE gating optimizations

    Tool Integration Optimizations: LLMs using tools has gone mainstream, and there is also newer research on speeding it up:
  286. Tool optimizations
  287. — Tool execution pipelining (overlap with prefill or decode)
  288. — Speculative tool execution
  289. — Tool token reduction
  290. — Concise tool output
  291. — Disaggregated tool execution
  292. — Multi-tool parallel execution

    Ensemble multi-LLM subtypes:
  293. Ensemble inference (overview of multi-model AI engines)
  294. Model selection algorithms
  295. Big-little architectures
  296. Cascades
  297. Collaborative inference
  298. Consensus decoding
  299. — Swarm ensemble architectures
  300. — Committee ensemble architectures
  301. — Ensemble averaging
  302. Easy-hard queries
  303. Submodels (Many-Models-in-One)
  304. Distributed Inference

    Orchestration, Deployment and Serving:
  305. Cloud inference servers
  306. Orchestration frameworks
  307. Scheduling optimizations
  308. Serving
  309. Load balancing
  310. Batching
  311. Static Batching
  312. Dynamic Batching
  313. Continuous batching
  314. Deployment
  315. Serverless
  316. Networking optimizations
  317. In-flight batching

    Attention optimization subtypes:
  318. Attention optimizations (overview)
  319. Multi-Head Attention (MHA)
  320. Group Query Attention (GQA)
  321. Multi-Query Attention (MQA)
  322. Sparse attention
  323. Local attention
  324. Memory-efficient attention algorithms
  325. Flash Attention
  326. Paged Attention
  327. Linear attention
  328. Cross attention
  329. Tree attention
  330. Sliding window attention
  331. Approximate attention heads
  332. Attention alternatives/replacements
  333. Fused MHA
  334. Low-rank matrix attention
  335. Medusa attention
  336. Block attention
  338. Fused head attention
  339. Hybrid local-global attention
  340. FFT attention
  341. Additive attention
  342. Multiplicative attention
  343. Graph attention
  344. Attention sink
  345. Attention steering
  346. Bilinear attention
  347. Attention-free methods
  348. Star attention
  349. Ring attention
  350. — Flex attention
  351. — Razor attention
  352. — Contiguous QKV tensor
  353. — Relative Attention Bias (RAB)
  354. Lightning attention
  355. Multi-head Latent Attention (MLA) (DeepSeek)
  357. — Round attention
  358. Delta attention
  359. Gated attention
  360. KIVI attention
  361. K=V (KV compute sharing)
  362. Bulging attention (per-layer attention module size increases)

    Attention compute optimizations:
  363. Chunked attention
  364. QKV computation optimizations
  365. Mixture-of-Heads (MOH) Attention (MoE+MHA)
  366. Mixture-of-Attention (MoA) (MoE attention)

    Long context optimizations (attention):
  367. Long context models
  368. Length generalization
  369. Quadratic attention complexity
  370. Long RAG

    Caching optimizations:
  371. Caching (overview)
  372. Inference Cache (text-to-text)
  373. Inference cache (global KV caching)
  374. Prompt caching
  375. Input Similarity-Based Caching (frame skipping in video)
  376. Semantic caching (text-to-text)
  377. Semantic KV caching
  378. Vector database caching
  379. Chatbot caching
  380. Vector Caching (Vector hashing)
  381. Caching vector dot products
  382. Caching general theory

    KV cache optimizations:
  383. KV Caching (overview)
  384. KV cache global (multi-query KV caching)
  385. KV cache reuse
  386. Global semantic KV caching (difficult!)
  387. Context cache (global KV caching)
  388. Prefix KV Caching
  389. KV cache recomputation with early exit
  390. Session KV cache (multi-turn KV caching)
  391. Substring/fused/concatenated KV cache (Lengthwise-fused KV caching)
  392. — Paged KV caching (related to paged attention)
  393. — KV cache offloading (to CPU)
  394. — KV sharding
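
The core idea of KV caching is that each decode step projects only the newest token and appends its key/value pair to the cache, instead of recomputing projections for the whole sequence. A toy single-head, scalar-valued sketch (the "projections" are simple scalar weights here, purely for illustration):

```python
import math

def attend(q, keys, values):
    """Single-query attention over cached keys/values (scalars here)."""
    scores = [q * k for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return sum(wi * v for wi, v in zip(w, values)) / total

class KVCache:
    """Append-only KV cache: each decode step projects only the newest
    token and reuses all previously cached keys and values."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, wq=1.0, wk=1.0, wv=1.0):
        self.keys.append(wk * x)    # project and cache the new key
        self.values.append(wv * x)  # project and cache the new value
        return attend(wq * x, self.keys, self.values)

cache = KVCache()
outputs = [cache.step(x) for x in [1.0, 2.0, 3.0]]
```

Per-step cost grows only with sequence length (one new projection plus one attention row), which is exactly why the memory size of this ever-growing cache becomes the next bottleneck, motivating the compression techniques below.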

    KV cache memory size reduction:
  395. KV cache compression
  396. KV cache quantization
  397. KV cache sparsity
  398. KV cache token pruning
  399. — Salient token-based KV cache token pruning
  400. KV cache eviction policies
  401. KV cache layer fusion
  402. KV cache layer pruning
  403. KV Cache low-rank matrix factorization
  404. — Cyclic KV cache (Rolling buffer KV cache or circular KV cache)
  405. — KV cache token merging
  406. — KV head fusion
  407. — KV head pruning
  408. — KV mixed-precision quantization
  409. — KV context compression
  410. — KV block pruning
  411. — SnapKV

    Non-Multiplication AI Models:
  412. Zero-Multiplication Models (overview)
  413. Binary quantization
  414. Ternary quantization
  415. 2-bit quantization (INT2)
  416. Adder networks
  417. Bitshift-add networks
  418. Bitshift power-of-2 quantization (logarithmic quantization)
  419. Double bitshift quantization
  420. Add-as-integer networks
  421. Logarithmic Models
  422. Bitwise neural networks
  423. Diff-squared networks
  424. Log-sum-exp (LSE) networks
  425. Max-Plus networks
  426. Min-Max-Plus networks
  427. Morphological networks
  428. Trigonometric approximate inference
  429. Weightless Neural Networks (WNNs)
  430. XNOR networks
  431. Hadamard elementwise matrix multiplication models
  432. Other addition-related zero-multiplication networks
  433. Table lookups replace multiplication
  434. Other multiplication-free neural networks
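
For instance, bitshift (power-of-two) quantization rounds each weight to a signed power of two, so multiplying an integer activation by a weight reduces to a shift. A minimal sketch (the encoding is simplified; real schemes pack sign and exponent into a few bits):

```python
import math

def quantize_pow2(w):
    """Bitshift (logarithmic) quantization: round |w| to the nearest power
    of two, keeping the sign, so multiplication becomes a shift."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    return sign, exp

def mul_by_weight(x_int, sign, exp):
    """Multiply an integer activation by a power-of-two weight via shifting."""
    return sign * (x_int << exp if exp >= 0 else x_int >> -exp)

sign, exp = quantize_pow2(3.7)    # 3.7 rounds to +2**2 = 4
y = mul_by_weight(10, sign, exp)  # computed as 10 << 2
```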

    Advanced Number System optimizations:
  435. Advanced Number Systems (overview)
  436. Posit number system (PNS)
  437. Residue number system (RNS)
  438. Dyadic numbers
  439. Double-base number system (DBNS)
  440. Dynamic number systems
  441. Hybrid number systems
  442. Tropical algebra (max-plus)
  443. MiniMax algebra
  444. Multi-dimensional logarithmic number system (MDLNS)
  445. Multiple-Base Number System (MBNS)
  446. — Semi-Logarithmic Number System (SLNS)
  447. — Lattice algebra

    Logarithmic Number System optimizations:
  448. Logarithmic number system (LNS) (overview)
  449. End-to-end LNS logarithmic model
  450. LNS addition and subtraction
  451. LNS in AI models
  452. LNS Hardware Acceleration
  453. LNS mathematical and algorithmic theory
  454. LNS algebra
  455. LNS extensions
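
The appeal of LNS is that multiplication becomes plain addition in the log domain; addition, conversely, becomes the hard operation and typically needs table lookups (Gaussian logarithms). A toy sketch of the multiplication side:

```python
import math

def to_lns(x):
    """Represent a positive value in the logarithmic number system as log2(x)."""
    return math.log2(x)

def lns_mul(a_log, b_log):
    """LNS multiplication is just addition of the log representations."""
    return a_log + b_log

def from_lns(x_log):
    """Convert back from the log domain."""
    return 2.0 ** x_log

a, b = 8.0, 4.0
product = from_lns(lns_mul(to_lns(a), to_lns(b)))
```

In a hardware LNS datapath the values stay in the log domain end-to-end, so the conversions here would happen only at the model boundary.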

    Prefill phase optimizations:
  456. Prefill optimizations (overview)
  457. Chunked prefill
  458. Disaggregated prefill scheduling (Phase splitting)
  459. Deep prefill, shallow decoder architecture
  460. Mini-prefill recomputation
  461. Prefill first-layer precomputation
  462. Prefill last-layer FFN skipping
  463. Prefill first-token optimizations
  464. Layerwise Pipelined Prefill-Decoding

    Parallel Programming Optimization Techniques:
  465. Parallelization techniques (overview)
  466. Hardware acceleration
  467. Hardware-software co-design
  468. Vectorization
  469. Pipelining (pipeline parallelism)
  470. Overlapping (new)
  471. Overlapping communications and computation (new)
  472. Overlapping rematerialization (new)
  473. Overlapping memory access & computation (new)
  474. Offloading
  475. Partitioning
  476. Dataflow optimizations
  477. — Sharding
  478. — Overlapping
  479. Data parallelism
  480. Query parallelism
  481. Tensor parallelism
  482. Model parallelism
  483. — Prefetching
  484. — Speculative execution
  485. Sequence Parallelism
  486. Skeleton-of-Thought (Query Parallelism)

    Hardware Optimizations:
  487. Hardware Acceleration (overview)
  488. Software accelerations
  489. Hardware-software co-design
  490. GPU
  491. GPU software platforms
  492. Multi-GPU
  493. CPU Execution
  494. Single Instruction Multiple Data (SIMD)
  495. AVX (AVX/AVX-2/AVX-512)
  496. — ARM NEON
  497. Neural Processing Unit (NPU)
  498. — Overclocking CPU
  499. — Overclocking GPU
  500. Assembly language

    RAG Architecture Optimizations:
  501. RAG architectures (overview)
  502. RAG cache
  503. RAG optimizations
  504. — RAG retriever datastore indexing
  505. Advanced RAG
  506. — Speculative RAG
  507. Reranker in RAG
  508. — Chunk-specific global KV caching
  509. — Chunk-specific prefix KV caching
  510. RAG Knowledge Graph
  511. RAG Ontologies/Taxonomies
  512. RAG fusion
  513. Mini-RAG (single-document RAG)

    Sparsity Optimizations:
  514. Sparsification techniques (overview)
  515. Activation Sparsity
  516. Dynamic Sparsity
  517. Block sparsity
  518. Vector sparsity
  519. Tensor sparsity
  520. Sparse matrix kernels
  521. Outlier-aware sparsification
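
A standard way to exploit sparsity is a compressed storage format such as CSR (compressed sparse row), where a matrix-vector product never touches zero entries. A minimal sketch (real sparse kernels add blocking and vectorization on top of this structure):

```python
def to_csr(dense):
    """Convert a dense matrix to CSR (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product: zero entries are skipped entirely."""
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for i in range(row_ptr[r], row_ptr[r + 1]):
            s += values[i] * x[col_idx[i]]
        y.append(s)
    return y

M = [[0, 2, 0], [1, 0, 3], [0, 0, 0]]
vals, cols, ptr = to_csr(M)
y = csr_matvec(vals, cols, ptr, [1.0, 1.0, 1.0])
```

Work is proportional to the number of nonzeros, not the matrix size, which is the whole payoff of sparsification at high sparsity levels.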

    Memory Utilization Optimizations:
  522. Memory optimization techniques (overview)
  523. Parameter sharing
  524. Model compression
  525. Low-bit integer quantization
  526. Binary quantization
  527. Ternary quantization
  528. Layer fusion
  529. Recomputation: trading time for space
  530. Memory-bound versus CPU-bound
  531. Data locality optimization
  532. Compute-in-Memory (CIM) architectures (also called PIM)
  533. — Memory cache management algorithms
  534. Kernel operator fusion
  535. — Flash Inference (FlashInfer)
  536. — Checkpointing
  537. Offloading
  538. SSD storage

    Numerical representation subtypes:
  539. Floating-point representations (overview)
  540. Floating Point Bit Tricks
  541. Block floating-point arithmetic
  542. Fixed point number system (FXP) optimizations
  543. Floating point number system (FLP) optimizations
  544. Floating point bitwise arithmetic
  545. FTZ/DAZ floating point CPU settings

    Kernel optimizations:
  546. Kernel optimizations (overview)
  547. Kernel operator fusion (merging, aka "kernel fusion" or "fusion")
  548. — Fused epilogues (post-MatMul fusion: fused MatMul then activation/normalization)
  549. — Fused prologues (pre-MatMul fusion: fused activation/normalization then MatMul)
  550. Kernel fission (splitting one kernel apart)
  551. Kernel tiling
  552. — Operator reordering
  553. Graph operator fusion (Deep learning compilers)

    Computation optimizations:
  554. Advanced AI Mathematics
  555. Approximate activation functions
  556. Caching / memoization
  557. Computation reuse
  558. Precomputation
  559. Source code precomputation
  560. Conditional computation
  561. Approximations
  562. Integer-only arithmetic quantization
  563. Weight precomputations
  564. Zero-skipping
  565. Low-Level Zero Skipping
  566. High-Level Zero Skipping
  567. Negative skipping
  568. Approximate caching
  569. End-to-End integer inference
  570. Padding usage
  571. Incremental inference (new)
  572. BF16x9 emulation of FP32 computations (on Blackwell GPU)
  573. FP64 arithmetic emulation using 8-bit/16-bit/32-bit computations
  574. Thread block clusters (Blackwell/Rubin)

    Arithmetic optimizations:
  575. Integer operations
  576. Addition optimizations
  577. Bitwise operation tricks
  578. Approximate addition
  579. Multiplication algorithms
  580. Approximate division
  581. Approximate multiplication
  582. Bitwise operator inference
  583. Bitserial operations
  584. Division optimizations
  585. Logarithmic approximate multiplication
  586. Integer Dot Product
  587. Vector dot product optimization

    Advanced matrix algebra optimizations:
  588. Matrix Algebra (overview)
  589. Approximate matrix multiplication
  590. Butterfly matrices
  591. Monarch matrices
  592. Sparse matrices (sparsification)

    Low-rank matrix optimizations:
  593. Low-rank matrix factorization (overview)
  594. — Tensor decomposition
  595. — Tucker decomposition
  596. Embedding low-rank matrix factorization
  597. KV Cache low-rank matrix factorization

    Transformer architectural optimizations:
  598. Transformer architectures (overview)
  599. Transformer low-level optimizations (overview)
  600. — Adaptive Inference (dynamic inference)
  601. Integer-only Transformers
  602. Approximate Transformers
  603. Decoder-Only Architectures
  604. Encoder-Only Architectures
  605. Encoder-Decoder Architectures

    Transformers and LLMs:
  606. Open source models
  607. Inference frameworks
  608. Open source frameworks

    Next-Generation Transformer architectures:
  609. Next-generation architectures (overview)
  610. Hybrid Transformer architectures
  611. Newer Transformer architectures
  612. BERT (encoder)
  613. — State Space Models (SSMs)
  614. Mamba
  615. RWKV
  616. Knowledge graph AI architectures
  617. Compound AI architectures
  618. Large Concept Model (LCM)

    General Classes of Optimization Techniques:
  619. Dynamic inference (adaptive inference)
  620. Skipping
  621. Heuristics
  622. Probabilistic optimizations
  623. Approximate computing
  624. Code optimizations
  625. Deep learning compilers
  626. Incremental algorithms
  627. Fuzzy logic
  628. Inference budget (with adaptive inference)

    Loop Optimizations:
  629. Loop optimizations (overview)
  630. Inference loop optimizations
  631. Loop fusion (merging loops)
  632. Loop unrolling
  633. Loop perforation
  634. Loop reordering
  635. Loop tiling
  636. Loop reversal
  637. Loop fission (splitting a loop)
  638. — Loop interleave
  639. Loop interchange
  640. Loop coalescing
  641. Loop-invariant code motion ("hoisting")
  642. Loop distribution
  643. Pointer arithmetic
  644. Loop peeling (unrolling first iterations)
  645. Loop splitting
  — Loop sentinel
  646. Loop collapsing
  647. Loop normalization
  648. Loop strip mining (Loop sectioning)
  649. Loop skewing
  650. Loop spreading
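
Two of the simplest of these, loop fusion and loop-invariant code motion, can be shown in a few lines (Python here purely for illustration; in practice these transformations matter most in compiled kernels, where fusion also removes an intermediate buffer):

```python
def normalize_twopass(xs, scale):
    """Two separate loops: one to scale, one to add the bias, with an
    intermediate list between them."""
    tmp = []
    for x in xs:              # loop 1
        tmp.append(x * scale)
    out = []
    for t in tmp:             # loop 2
        out.append(t + 1.0)
    return out

def normalize_fused(xs, scale):
    """Loop fusion: one pass, no intermediate buffer. The loop-invariant
    scale is hoisted into a local before the loop (explicit here for
    illustration of loop-invariant code motion)."""
    s = scale                 # hoisted loop-invariant
    return [x * s + 1.0 for x in xs]

result = normalize_fused([1.0, 2.0, 3.0], 2.0)
```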

    Low-Level Coding Efficiency:
  651. Code optimizations (overview)
  652. Constant folding
  653. Common subexpression elimination
  654. Algebraic identities
  655. Strength reduction
  656. Type consistency
  657. Reciprocal multiplication
  658. References vs pointers
  659. Compile-time optimizations
  660. Pointer arithmetic
  661. Algorithm-level optimizations
  662. Lazy evaluation
  663. Memory reduction heuristics

    Data Structures for AI optimization:
  664. Hashing
  665. Perfect hashing
  666. Look-up tables (LUTs)
  667. Bloom filters
  668. — Trees
  669. — Tries
  671. Bitserial operations
  672. Permutation arrays

    Vector Data Structures:
  673. Parallel data structures
  674. Bit vectors
  675. Vector hashing
  676. Locality-Sensitive Hashing (LSH)
  677. Vector dot product caching
  678. — Bit signatures (vector algorithm)
  679. — K-means clustering (vector algorithm)
  680. — Hyper-Cube (vector algorithm)
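
Random-hyperplane LSH illustrates bit signatures: each signature bit is the sign of a dot product with a random hyperplane, so vectors pointing in similar directions share most bits and their Hamming distance approximates angular distance. A minimal sketch (dimensions, seed, and signature width are arbitrary):

```python
import random

def bit_signature(vec, hyperplanes):
    """Random-hyperplane LSH: each bit is the sign of one dot product,
    so similar vectors tend to share signature bits."""
    sig = 0
    for h in hyperplanes:
        dot = sum(a * b for a, b in zip(vec, h))
        sig = (sig << 1) | (1 if dot >= 0 else 0)
    return sig

def hamming(x, y):
    """Number of differing bits between two signatures."""
    return bin(x ^ y).count("1")

rng = random.Random(42)
planes = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]
a = bit_signature([1.0, 0.0, 0.0], planes)
b = bit_signature([0.9, 0.1, 0.0], planes)   # near-duplicate of a
c = bit_signature([-1.0, 0.0, 0.0], planes)  # opposite direction
```

Comparing 8-bit signatures is far cheaper than computing full dot products, which is why bit signatures work well as a pre-filter for vector caching and nearest-neighbour lookup.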

    Convolution Optimizations in CNNs:
  681. Convolution optimizations (overview)
  682. Grouped convolutions
  683. Depth-wise separable convolutions

    Tokenization and Vocabulary Optimizations:
  684. Tokenization (overview)
  685. Tokenizer and model inference latency
  686. Semantic tokenization
  687. Tokenization for Machine Vision
  688. Tokenization of non-English languages
  689. Vocabulary optimizations:
  690. Vocabulary size
  691. Lexical shortlisting
  692. Vocabulary trimming
  693. Vocabulary expansion
  694. Dynamic vocabulary pruning

    Overall summaries of AI optimizations:
  695. Deslugging AI engines
  696. Accuracy-degrading optimizations
  697. Accuracy-retaining optimizations
  698. Uncommon inference optimizations

Not Enough?

More inference optimization resources:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++ programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

Free AI and C++ Books

Generative AI programming books:

  1. The Sweetest Lesson: Your Brain Versus AI, November 2025: full text online, free PDF available
  2. RAG Optimization: Accurate and Efficient LLM Applications, June 2025: full text online, free PDF available
  3. Generative AI Applications: Planning, Design and Implementation, November 2024: full text online, free PDF available
  4. Generative AI in C++ (Spuler, March 2024): full text online, free PDF available, table of contents, bonus materials, reference lists, source code

CUDA C++ GPU Programming Books:

  1. CUDA C++ Optimization: Coding Faster GPU Kernels, July 2024: full text online, bonus materials, free PDF available
  2. CUDA C++ Debugging: Safer GPU Kernel Programming, July 2024: full text online, free PDF available

Modern C++ Programming Books

  1. C++ AVX Optimization: CPU SIMD Vectorization, 2025: full text online, free PDF available
  2. C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations, 2025: full text online, free PDF available
  3. Advanced C++ Memory Techniques: Efficiency and Safety, 2025: full text online, free PDF available
  4. Efficient C++ Multithreading: Modern Concurrency Optimization, 2025: free PDF available
  5. Efficient Modern C++ Data Structures: Container and Algorithm Optimizations, 2025: free PDF available
  6. C++ Low Latency: Multithreading and Hotpath Optimizations, 2025: free PDF available
  7. Safe C++: Fixing Memory Safety Issues, Oct 2024: full text online, free PDF available

More AI Research Topics

Read more about: