Aussie AI Blog

500+ LLM Inference Optimization Techniques

  • Updated: May 12, 2026
  • by David Spuler, Ph.D.

LLM Inference Optimization

We do a lot of research on inference optimization techniques, so here's a very long list of all the techniques about which we have research papers. There are more than 500 (600+ now!), but see the blog post links below if you only want to know about the latest LLM inference techniques.

Update in May, 2026: Some of the newer research areas added to the list:

  • Infinite context
  • Context extension
  • Shallow prefill
  • Algebraic integer number system
  • Kernel synthesis
  • Infill/FIM optimizations
  • Infill suffix KV correction
  • KV pinning (e.g., for system prompt)
  • Whole layer fused kernels
  • KV shifting
  • KV reversal
  • KV correction
  • KV layer propagation
  • RoPE efficiency optimizations

Update in April, 2026: Some of the newer techniques added to this list:

Update in Feb 2026: As we head into 2026, some of the more recent areas of attention include:

Research areas that remain as hot as always include:

Areas of new research relevance include:

And no doubt much more to come in 2026!

Update in March 2025: well, now we're into 2025 and this list has outgrown its title. There are over 600 items on the list below, all of which are related to LLM efficiency. The main change in 2025 is that the recent release of "reasoning models" has spawned a new area of research in optimizing the efficiency of LLM reasoning algorithms such as Chain-of-Thought.

Free AI C++ books: for more about LLM optimization, read books online or download a PDF:

Popular articles: additional research articles on faster LLM inference:

More lists: lots of general efficiency optimization information:

LLM Inference Optimizations List

Here's the list! It's over 600 and growing!

    Reasoning Efficiency Optimization (REO): it's the latest hot research area in 2025!
  1. Reasoning inference optimization (RIO) (blog article)
  2. Chain-of-Thought (CoT) optimization
  3. CoT token reduction
  4. CoT step skipping
  5. CoT path reduction
  6. CoT early stopping
  7. CoT reasoning decoding
  8. Constrained CoT
  9. Coconut
  10. Concise CoT
  11. Hidden CoT (interim steps in latent space)
  12. CoT prompt sequence optimizations
  13. CoT sparsity
  14. CoT distillation
  15. Long context CoT
  16. — Small Reasoning Model (SRM)
  17. Reasoning tokens
  18. Adaptive inference time compute
  19. — One-step reasoning models (e.g. DeepSeek R1's long answers)
  20. — Augmented scaffold + Small Reasoning Model
  21. Reasoning caching

    Inference Modes and Token API optimizations:
  22. "Fast mode" inference (e.g. from OpenAI or Anthropic)
  23. Cached tokens
  24. Batched tokens
  25. Low batch size inference
  26. Priority batching
  27. API model routing features

    Model compression main subtypes:
  28. Model compression (overview)
  29. Pruning (overview)
  30. Quantization (overview)
  31. Knowledge Distillation (KD)
  32. Parameter sharing (weight sharing)
  33. Low-rank matrices
  34. Small Language Models (SLMs)
  35. Data compression algorithms

    Pruning main types:
  36. Dynamic pruning
  37. Hybrid pruning
  38. Unstructured pruning
  39. Semi-Structured Pruning
  40. Structured pruning

    Layerwise structured pruning subtypes (depth dimension):
  41. Depthwise structural pruning (overview)
  42. Static layer pruning
  43. Layer pruning
  44. Dynamic layer pruning
  45. Layer skipping
  46. Layer approximation
  47. Shallow decoder architecture
  48. Layer reordering
  49. Layer Importance

    Early exiting (dynamic layerwise pruning): Pruning all the layers from the point of exit:
  50. Early exit (overview)
  51. — Confidence-based exit policy
  52. — Patience-based exit policy
  53. — Entropy-based exit policy
  54. — Learned exit points or exit policies
  55. Early exit KV cache fixes
  56. Early exit knowledge distillation
  57. Early exit speculative decoding
  58. Early exit in training
  59. Layer freezing
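
As a rough illustration of the confidence-based exit policy listed above, here is a minimal Python sketch. The names `early_exit_forward`, `exit_head`, and `threshold` are hypothetical, and the toy "layers" are plain functions standing in for Transformer layers; a real engine would attach a trained classifier head at each candidate exit point.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_forward(x, layers, exit_head, threshold=0.9):
    """Run layers in order and stop as soon as the exit head is confident.

    Returns (probs, layers_executed). All layers after the exit point
    are skipped, which is the "dynamic layer pruning" effect.
    """
    probs = None
    for depth, layer in enumerate(layers, start=1):
        x = layer(x)
        probs = softmax(exit_head(x))
        if max(probs) >= threshold:  # confidence-based exit policy
            return probs, depth
    return probs, len(layers)

# Toy model (assumption): each layer doubles the hidden vector,
# and the exit head treats the hidden vector directly as logits.
toy_layers = [lambda h: [2.0 * v for v in h] for _ in range(6)]
probs, used = early_exit_forward([0.5, 0.0], toy_layers, exit_head=lambda h: h)
```

In this toy run the top probability crosses the threshold partway through the stack, so the remaining layers are never executed. Patience-based and entropy-based policies would change only the test inside the loop.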

    Width-wise structured pruning subtypes:
  60. Widthwise structural pruning (overview)
  61. Attention head pruning
  62. Slimmable networks (width pruning)
  63. FFN pruning
  64. Channel pruning
  65. Filter pruning

    Length-wise structured pruning subtypes:
  66. Lengthwise structural pruning (longitudinal/input/end-to-end)
  67. Token pruning (input pruning)
  68. Dynamic token pruning
  69. Prompt compression
  70. Context compression
  71. Token merging
  72. Token skipping
  73. Token dropping
  74. Zero padding removal
  75. Token reduction
  76. Token compression
  77. Input text compression

    Model dimension embedding pruning subtypes:
  78. Embedding-dimension pruning
  79. Embedding pruning
  80. Embedding matrix compression (embedding pruning)
  81. Embedding low-rank matrix factorization
  82. Unembedding matrix (output embeddings)

    Hybrid multi-dimensional pruning:
  83. Multi-dimensional pruning
  84. Dual pruning
  85. Triple pruning
  86. Quadruple pruning
  87. 3D CNN model pruning
  88. Pyramid inference

    Transformer component pruning:
  89. Normalization pruning
  90. Positional embeddings pruning
  91. Softmax pruning
  92. Skip connection pruning (residual connection removal)

    Unstructured pruning subtypes:
  93. Unstructured pruning (overview)
  94. Magnitude pruning
  95. Movement pruning
  96. — Gradual pruning

    Quantization theory and major subtypes:
  97. Post-Training Quantization (PTQ)
  98. Quantization-Aware Training (QAT)
  99. Activation Quantization
  100. Outlier-aware quantization (outlier management)
  101. Dequantization

    Quantization overall algorithms:
  102. Uniform quantization
  103. Non-Uniform quantization
  104. Symmetric quantization
  105. Asymmetric quantization
  106. GPTQ: post-training quantization for generative pre-trained Transformers
  107. AQLM: Additive Quantization of Language Models
  108. SpQR: Sparse Quantized Representations
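
To make the symmetric versus asymmetric distinction in the list above concrete, here is a minimal sketch of uniform quantization in plain Python. The function names are hypothetical, and it uses one per-tensor scale computed from the min/max range, which is only one of several possible calibration choices.

```python
def quantize_symmetric(xs, bits=8):
    # Symmetric: the range is centered on zero, so no zero-point is needed.
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = max(abs(x) for x in xs) / qmax or 1.0   # fall back to 1.0 if all zeros
    q = [max(-qmax, min(qmax, round(x / scale))) for x in xs]
    return q, scale

def dequantize_symmetric(q, scale):
    return [v * scale for v in q]

def quantize_asymmetric(xs, bits=8):
    # Asymmetric: map [min, max] onto [0, 2^bits - 1] using a zero-point offset.
    qmax = 2 ** bits - 1                            # 255 for 8 bits
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / qmax or 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(qmax, round(x / scale) + zero_point)) for x in xs]
    return q, scale, zero_point

def dequantize_asymmetric(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

weights = [-1.0, 0.0, 0.5, 2.0]
q_sym, s_sym = quantize_symmetric(weights)
q_asym, s_asym, zp = quantize_asymmetric(weights)
```

For a skewed range like this one, the asymmetric scheme ends up with a smaller step size because it does not spend codes on the unused negative range, at the cost of carrying a zero-point through the arithmetic.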

    Integer quantization subtypes:
  109. Integer quantization (overview)
  110. Integer-only arithmetic quantization
  111. Fixed-point quantization (integer)
  112. Low-bit integer quantization (overview)
  113. Binary quantization
  114. Ternary quantization
  115. 2-bit quantization (INT2)
  116. 3-bit quantization (INT3)
  117. 4-bit quantization (INT4)
  118. 5-bit quantization (INT5)
  119. 6-bit quantization (INT6)
  120. 7-bit quantization (INT7)
  121. 8-bit quantization (INT8)
  122. 9-bit quantization (INT9)
  123. 10-bit quantization (INT10)
  124. 11-bit quantization (INT11)
  125. 12-bit quantization (INT12)
  126. 16-bit INT16 quantization
  127. 32-bit INT32 quantization
  128. — W4A4 quantization
  129. — W4A4KV4 quantization

    Floating-point quantization subtypes:
  130. Floating-point quantization
  131. FP4 quantization
  132. FP6 quantization
  133. FP8 quantization
  134. FP16 quantization
  135. FP32 quantization

    Quantization error mitigation and metrics:
  136. Quantization errors
  137. Outlier mitigation methods
  138. — Mean-Squared Error (MSE)
  139. — SNR degradation

    Outlier mitigation in quantization:
  140. AWQ: Activation‑Aware Weight Quantization

    Other uncommon quantization subtypes:
  141. Logarithmic power-of-two quantization (bitshift quantization)
  142. Double bitshift power-of-two quantization
  143. Division quantization
  144. Cluster-based quantization (Weight clustering)
  145. Hashing-based weight clustering
  146. Dyadic quantization
  147. Fake quantization
  148. Simulated quantization
  149. Stochastic quantization (probabilistic)

    Mixed-precision quantization subtypes:
  150. Mixed-precision quantization

    Granularity-level quantization subtypes:
  151. Granular quantization (overview)
  152. Layerwise Quantization
  153. Blockwise Quantization
  154. — K-quantization
  155. Vector quantization

    Knowledge distillation subtypes:
  156. Knowledge Distillation (overview)
  157. Ensemble Distillation
  158. Unnatural instructions (data sets)
  159. Dataset Distillation
  160. Black Box Distillation
  161. White Box Distillation

    Parameter/weight sharing subtypes:
  162. Parameter/Weight sharing (overview)
  163. Activation sharing
  164. Layer fusion
  165. Clustering (Weights)
  166. Attention head fusion
  167. FFN fusion (sharing parameters)
  168. KV cache layer fusion (depthwise)
  169. KV cache head fusion (widthwise)

    Activation function optimizations:
  170. Activation function optimizations (overview)
  171. Activation function approximation
  172. Integer-only activation functions
  173. Fused activation functions (kernel fusion)
  174. Fused ReLU
  175. Fused GELU
  176. Fused SwiGLU
  177. Activation alternatives/replacements
  178. Activation function pruning/removal (bilinear layers)
  179. Activation function reordering

    Normalization optimization types:
  180. Normalization algorithm optimizations (overview)
  181. Approximate normalization
  182. Norm reordering (pre-norm/post-norm)
  183. Integer-only normalization
  184. Normalization alternatives/replacements
  185. Fused normalization (e.g. "fused LayerNorm" in kernel fusion)

    Softmax optimization types:
  186. Softmax optimizations (overview)
  187. Softmax pruning
  188. Approximate Softmax
  189. Softmax alternatives/replacements
  190. Integer-only Softmax
  191. Fused Softmax

    Feed-Forward Network (FFN) optimization types:
  192. FFN optimizations (overview)
  193. FFN pruning
  194. FFN approximation
  195. Fused add-bias
  196. Bias vector pruning
  197. FFN sparsity
  198. FFN alternatives/replacements
  199. Integer-only FFN
  200. FFN fusion (shared parameters)
  201. Inter-FFN fusion (merging two FFNs)
  202. Intra-FFN fusion (with piecewise linear approximations) (merging two linear projections in one FFN)
  203. — Bias vector addition optimizations
  204. — Bias vector pruning (no bias!)
  205. — FFN matrix merging (similar to "intra-FFN fusion")
  206. — Bulging FFN (per-layer FFN size increases)

    MatMul/GEMM optimization types:
  207. MatMul/GEMM kernel optimizations (overview)
  208. Faster matrix multiplication (e.g. Winograd, Strassen)
  209. Approximate matrix multiplication
  210. Transpose cache
  211. Fused multiply-add (FMA)
  212. Fused transpose
  213. Vector dot product optimization
  214. Sparse MatMul/GEMM
  215. — Tiled MatMul
  216. — Triangular MatMul optimizations (causal masking in attention)
  217. — Tiled skipping

    Positional Encoding optimizations:
  218. Positional encoding optimization (overview)
  219. RoPE (Rotary Positional Encoding)
  220. Pruning positional encoding (removal/NoPE)
  221. — Positional encoding approximation
  222. — Integer-only positional encoding
  223. Partial RoPE (p-RoPE)
  224. — RoPE rescaling
  225. — Attention with Linear Biases (ALiBi)
  226. — Relative Attention Biases (RAB)

    NAS subtypes:
  227. Neural Architecture Search (NAS)
  228. Dynamic NAS
  229. Embedding Size Optimization (embeddings NAS)

    Platform-specific optimization subtypes:
  230. On-device inference (native phone and PC AI)
  231. AI Phones
  232. AI PCs (desktops/laptops)
  233. Edge device inference (IoT/mobile/PC)
  234. Hybrid cloud-on-device inference

    Decoding algorithm subtypes:
  235. Decoding algorithms (overview)
  236. Non-autoregressive decoding
  237. Greedy decoding
  238. Top-k decoding
  239. Top-p decoding
  240. Min-P Sampling
  241. Flash decoding
  242. Beam search decoding
  243. Edit decoding
  244. Contrastive decoding
  245. — Approximate top-k algorithms
  246. — Bidirectional decoding
  247. Constrained decoding
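
The top-k and top-p entries above differ only in how they choose which tokens survive before sampling. The following is a small sketch in plain Python over an already-normalized probability list; the helper names are hypothetical, and production engines typically apply these filters to logits on the GPU.

```python
def _renormalize(probs, keep):
    # Zero out everything outside `keep`, then rescale to sum to 1.
    total = sum(probs[i] for i in keep)
    kept = set(keep)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    # Keep the k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)

probs = [0.5, 0.3, 0.15, 0.05]
after_top_k = top_k_filter(probs, k=2)
after_top_p = top_p_filter(probs, p=0.9)
```

Greedy decoding is the special case of top-k with k=1. A sampler would then draw from the filtered distribution, for example with `random.choices`.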

    Parallel Decoding algorithms:
  248. Parallel decoding
  249. Blockwise parallel decoding
  250. n-gram parallel decoding
  251. Lookahead decoding
  252. Medusa decoding
  253. Consensus decoding
  254. — Mutually-guided decoding
  255. — Multi-token generation
  256. — Eagle decoding

    Speculative decoding subtypes:
  257. Speculative decoding (overview)
  258. Generalized speculative decoding
  259. Aggressive decoding
  260. Lookup decoding
  261. Retrieval lookup decoding
  262. Prompt lookup decoding
  263. — Multi-query prompt lookup decoding (across entire LLM history)
  264. Self speculative decoding
  265. Tree speculative decoding
  266. Superposed decoding
  267. Hierarchical speculative decoding
  268. Heuristic speculative decoding
  269. Multi-token speculative decoding
  270. Sequential speculative decoding
  271. Eagle speculative decoding
  272. — Redrafting

    Parameter Efficient Fine-Tuning (PEFT) subtypes:
  273. PEFT (overview)
  274. LoRA
  275. Multi-LoRA inference
  276. QLoRA (Quantized Low-Rank Adapters)
  277. LoRA inference optimizations (load/unload)
  278. Prompt Tuning (Extended Vocabulary PEFT)
  279. Prefix Tuning

    Mixture-of-Experts (MoE):
  280. Mixture of Experts (MoE)
  281. MoE-specific compute optimizations
  282. Hybrid MoE (dense FFN)
  283. — Shared experts
  284. — MoE routing optimizations
  285. — MoE gating optimizations

    Tool Integration Optimizations: LLMs using tools has gone mainstream, and there is also newer research on speeding it up:
  286. Tool optimizations
  287. — Tool execution pipelining (overlap with prefill or decode)
  288. — Speculative tool execution
  289. — Tool token reduction
  290. — Concise tool output
  291. — Disaggregated tool execution
  292. — Multi-tool parallel execution

    Ensemble multi-LLM subtypes:
  293. Ensemble inference (overview of multi-model AI engines)
  294. Model selection algorithms
  295. Big-little architectures
  296. Cascades
  297. Collaborative inference
  298. Consensus decoding
  299. — Swarm ensemble architectures
  300. — Committee ensemble architectures
  301. — Ensemble averaging
  302. Easy-hard queries
  303. Submodels (Many-Models-in-One)
  304. Distributed Inference

    Orchestration, Deployment and Serving:
  305. Cloud inference servers
  306. Orchestration frameworks
  307. Scheduling optimizations
  308. Serving
  309. Load balancing
  310. Batching
  311. Static Batching
  312. Dynamic Batching
  313. Continuous batching
  314. Deployment
  315. Serverless
  316. Networking optimizations
  317. In-flight batching

    Attention optimization subtypes:
  318. Attention optimizations (overview)
  319. Multi-Head Attention (MHA)
  320. Group Query Attention (GQA)
  321. Multi-Query Attention (MQA)
  322. Sparse attention
  323. Local attention
  324. Memory-efficient attention algorithms
  325. Flash Attention
  326. Paged Attention
  327. Linear attention
  328. Cross attention
  329. Tree attention
  330. Sliding window attention
  331. Approximate attention heads
  332. Attention alternatives/replacements
  333. Fused MHA
  334. Low-rank matrix attention
  335. Medusa attention
  336. Block attention
  337. Cross attention
  338. Fused head attention
  339. Hybrid local-global attention
  340. FFT attention
  341. Additive attention
  342. Multiplicative attention
  343. Graph attention
  344. Attention sink
  345. Attention steering
  346. Bilinear attention
  347. Attention-free methods
  348. Star attention
  349. Ring attention
  350. — Flex attention
  351. — Razor attention
  352. — Contiguous QKV tensor
  353. — Relative Attention Bias (RAB)
  354. Lightning attention
  355. Multihead Latent Attention (MLA) (DeepSeek)
  356. — FFT attention
  357. — Round attention
  358. Delta attention
  359. Gated attention
  360. KIVI attention
  361. K=V (KV compute sharing)
  362. Bulging attention (per-layer attention module size increases)

    Attention compute optimizations:
  363. Chunked attention
  364. QKV computation optimizations
  365. Mixture-of-Heads (MoH) Attention (MoE+MHA)
  366. Mixture-of-Attention (MoA) (MoE attention)

    Long context optimizations (attention):
  367. Long context models
  368. Length generalization
  369. Quadratic attention complexity
  370. Long RAG
  371. Infinite context models
  372. — Context extension (e.g., extending RoPE/YaRN)

    Caching optimizations:
  373. Caching (overview)
  374. Inference Cache (text-to-text)
  375. Inference cache (global KV caching)
  376. Prompt caching
  377. Input Similarity-Based Caching (frame skipping in video)
  378. Semantic caching (text-to-text)
  379. Semantic KV caching
  380. Vector database caching
  381. Chatbot caching
  382. Vector Caching (Vector hashing)
  383. Caching vector dot products
  384. Caching general theory

    KV cache optimizations:
  385. KV Caching (overview)
  386. KV cache global (multi-query KV caching)
  387. KV cache reuse
  388. Global semantic KV caching (difficult!)
  389. Context cache (global KV caching)
  390. Prefix KV Caching
  391. KV cache recomputation with early exit
  392. Session KV cache (multi-turn KV caching)
  393. Substring/fused/concatenated KV cache (Lengthwise-fused KV caching)
  394. — Paged KV caching (related to paged attention)
  395. — KV cache offloading (to CPU)
  396. — KV sharding

    KV cache memory size reduction:
  397. KV cache compression
  398. KV cache quantization
  399. KV cache sparsity
  400. KV cache token pruning
  401. — Salient token-based KV cache token pruning
  402. KV cache eviction policies
  403. KV cache layer fusion
  404. KV cache layer pruning
  405. KV Cache low-rank matrix factorization
  406. — Cyclic KV cache (Rolling buffer KV cache or circular KV cache)
  407. — KV cache token merging
  408. — KV head fusion
  409. — KV head pruning
  410. — KV mixed-precision quantization
  411. — KV context compression
  412. — KV block pruning
  413. — SnapKV

    Non-Multiplication AI Models:
  414. Zero-Multiplication Models (overview)
  415. Binary quantization
  416. Ternary quantization
  417. 2-bit quantization (INT2)
  418. Adder networks
  419. Bitshift-add networks
  420. Bitshift power-of-2 quantization (logarithmic quantization)
  421. Double bitshift quantization
  422. Add-as-integer networks
  423. Logarithmic Models
  424. Bitwise neural networks
  425. Diff-squared networks
  426. Log-sum-exp (LSE) networks
  427. Max-Plus networks
  428. Min-Max-Plus networks
  429. Morphological networks
  430. Trigonometric approximate inference
  431. Weightless Neural Networks (WNNs)
  432. XNOR networks
  433. Hadamard elementwise matrix multiplication models
  434. Other addition-related zero-multiplication networks
  435. Table lookups replace multiplication
  436. Other multiplication-free neural networks

    Advanced Number System optimizations:
  437. Advanced Number Systems (overview)
  438. Posit number system (PNS)
  439. Residue number system (RNS)
  440. Dyadic numbers
  441. Double-base number system (DBNS)
  442. Dynamic number systems
  443. Hybrid number systems
  444. Tropical algebra (max-plus)
  445. MiniMax algebra
  446. Multi-dimensional logarithmic number system (MDLNS)
  447. Multiple-Base Number System (MBNS)
  448. — Semi-Logarithmic Number System (SLNS)
  449. — Lattice algebra
  450. Algebraic integer number system

    Logarithmic Number System optimizations:
  451. Logarithmic number system (LNS) (overview)
  452. End-to-end LNS logarithmic model
  453. LNS addition and subtraction
  454. LNS in AI models
  455. LNS Hardware Acceleration
  456. LNS mathematical and algorithmic theory
  457. LNS algebra
  458. LNS extensions

    Prefill phase optimizations:
  459. Prefill optimizations (overview)
  460. Chunked prefill
  461. Disaggregated prefill scheduling (Phase splitting)
  462. Deep prefill, shallow decoder architecture
  463. Mini-prefill recomputation
  464. Prefill first-layer precomputation
  465. Prefill last-layer FFN skipping
  466. Prefill first-token optimizations
  467. Layerwise Pipelined Prefill-Decoding
  468. Shallow prefill

    Parallel Programming Optimization Techniques:
  469. Parallelization techniques (overview)
  470. Hardware acceleration
  471. Hardware-software co-design
  472. Vectorization
  473. Pipelining (pipeline parallelism)
  474. Overlapping (new)
  475. Overlapping communications and computation (new)
  476. Overlapping rematerialization (new)
  477. Overlapping memory access & computation (new)
  478. Offloading
  479. Partitioning
  480. Dataflow optimizations
  481. — Sharding
  482. — Overlapping
  483. Data parallelism
  484. Query parallelism
  485. Tensor parallelism
  486. Model parallelism
  487. — Prefetching
  488. — Speculative execution
  489. Sequence Parallelism
  490. Skeleton-of-Thought (Query Parallelism)

    Hardware Optimizations:
  491. Hardware Acceleration (overview)
  492. Software accelerations
  493. Hardware-software co-design
  494. GPU
  495. GPU software platforms
  496. Multi-GPU
  497. CPU Execution
  498. Single Instruction Multiple Data (SIMD)
  499. AVX (AVX/AVX-2/AVX-512)
  500. — ARM NEON
  501. Neural Processing Unit (NPU)
  502. — Overclocking CPU
  503. — Overclocking GPU
  504. Assembly language

    RAG Architecture Optimizations:
  505. RAG architectures (overview)
  506. RAG cache
  507. RAG optimizations
  508. — RAG retriever datastore indexing
  509. Advanced RAG
  510. — Speculative RAG
  511. Reranker in RAG
  512. — Chunk-specific global KV caching
  513. — Chunk-specific prefix KV caching
  514. RAG Knowledge Graph
  515. RAG Ontologies/Taxonomies
  516. RAG fusion
  517. Mini-RAG (single-document RAG)

    Sparsity Optimizations:
  518. Sparsification techniques (overview)
  519. Activation Sparsity
  520. Dynamic Sparsity
  521. Block sparsity
  522. Vector sparsity
  523. Tensor sparsity
  524. Sparse matrix kernels
  525. Outlier-aware sparsification
  526. — N:M sparsity
  527. — 2:4 sparsity (supported in NVIDIA "Sparse Tensor Cores")
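
As an illustration of N:M sparsity from the list above, here is a sketch that prunes a flat weight list so that each group of M consecutive weights keeps only its N largest-magnitude entries (2:4 by default). The function name is hypothetical, and the sketch covers only the magnitude-based selection; hardware-accelerated formats also store compact index metadata for the surviving weights, which is not modeled here.

```python
def prune_n_of_m(weights, n=2, m=4):
    """Zero all but the n largest-magnitude weights in each group of m."""
    out = list(weights)
    for start in range(0, len(out), m):
        group = range(start, min(start + m, len(out)))
        # Indices of the n largest magnitudes within this group.
        keep = set(sorted(group, key=lambda i: abs(out[i]), reverse=True)[:n])
        for i in group:
            if i not in keep:
                out[i] = 0.0
    return out

row = [0.1, -0.9, 0.4, 0.05, 3.0, -2.0, 1.0, 0.0]
pruned = prune_n_of_m(row)
```

Because exactly half of every group is zero, the pattern is regular enough for hardware to skip the zeros, unlike unstructured magnitude pruning at the same 50% ratio.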

    Memory Utilization Optimizations:
  528. Memory optimization techniques (overview)
  529. Parameter sharing
  530. Model compression
  531. Low-bit integer quantization
  532. Binary quantization
  533. Ternary quantization
  534. Layer fusion
  535. Recomputation: trading time for space
  536. Memory-bound versus CPU-bound
  537. Data locality optimization
  538. Compute-in-Memory (CIM) architectures (also called PIM)
  539. — Memory cache management algorithms
  540. Kernel operator fusion
  541. — Flash Inference (FlashInfer)
  542. — Checkpointing
  543. Offloading
  544. SSD storage

    Numerical representation subtypes:
  545. Floating-point representations (overview)
  546. Floating Point Bit Tricks
  547. Block floating-point arithmetic
  548. Fixed point number system (FXP) optimizations
  549. Floating point number system (FLP) optimizations
  550. Floating point bitwise arithmetic
  551. FTZ/DAZ floating point CPU settings

    Kernel optimizations:
  552. Kernel optimizations (overview)
  553. Kernel operator fusion (merging, aka "kernel fusion" or "fusion")
  554. — Fused epilogues (post-MatMul fusion: fused MatMul then activation/normalization)
  555. — Fused prologues (pre-MatMul fusion: fused activation/normalization then MatMul)
  556. Kernel fission (splitting one kernel apart)
  557. Kernel tiling
  558. — Operator reordering
  559. Graph operator fusion (Deep learning compilers)
  560. Kernel synthesis
  561. — Whole-layer fused kernels (fused attention with FFN/GLU)

    Infill / Fill-in-Middle (FIM) optimizations:
  562. Infill harness optimizations
  563. Infill prefix reordering
  564. Infill suffix KV correction

    Computation optimizations:
  565. Advanced AI Mathematics
  566. Approximate activation functions
  567. Caching / memoization
  568. Computation reuse
  569. Precomputation
  570. Source code precomputation
  571. Conditional computation
  572. Approximations
  573. Integer-only arithmetic quantization
  574. Weight precomputations
  575. Zero-skipping
  576. Low-Level Zero Skipping
  577. High-Level Zero Skipping
  578. Negative skipping
  579. Approximate caching
  580. End-to-End integer inference
  581. Padding usage
  582. Incremental inference (new)
  583. BF16x9 emulation of FP32 computations (on Blackwell GPU)
  584. FP64 arithmetic emulation using 8-bit/16-bit/32-bit computations
  585. Thread block clusters (Blackwell/Rubin)

    Arithmetic optimizations:
  586. Integer operations
  587. Addition optimizations
  588. Bitwise operation tricks
  589. Approximate addition
  590. Multiplication algorithms
  591. Approximate division
  592. Approximate multiplication
  593. Bitwise operator inference
  594. Bitserial operations
  595. Division optimizations
  596. Logarithmic approximate multiplication
  597. Integer Dot Product
  598. Vector dot product optimization

    Advanced matrix algebra optimizations:
  599. Matrix Algebra (overview)
  600. Approximate matrix multiplication
  601. Butterfly matrices
  602. Monarch matrices
  603. Sparse matrices (sparsification)

    Low-rank matrix optimizations:
  604. Low-rank matrix factorization (overview)
  605. — Tensor decomposition
  606. — Tucker decomposition
  607. Embedding low-rank matrix factorization
  608. KV Cache low-rank matrix factorization

    Transformer architectural optimizations:
  609. Transformer architectures (overview)
  610. Transformer low-level optimizations (overview)
  611. — Adaptive Inference (dynamic inference)
  612. Integer-only Transformers
  613. Approximate Transformers
  614. Decoder-Only Architectures
  615. Encoder-Only Architectures
  616. Encoder-Decoder Architectures

    Transformers and LLMs:
  617. Open source models
  618. Inference frameworks
  619. Open source frameworks

    Next-Generation Transformer architectures:
  620. Next-generation architectures (overview)
  621. Hybrid Transformer architectures
  622. Newer Transformer architectures
  623. BERT (encoder)
  624. — State Space Models (SSMs)
  625. Mamba
  626. RWKV
  627. Knowledge graph AI architectures
  628. Compound AI architectures
  629. Large Concept Model (LCM)

    General Classes of Optimization Techniques:
  630. Dynamic inference (adaptive inference)
  631. Skipping
  632. Heuristics
  633. Probabilistic optimizations
  634. Approximate computing
  635. Code optimizations
  636. Deep learning compilers
  637. Incremental algorithms
  638. Fuzzy logic
  639. Inference budget (with adaptive inference)

    Loop Optimizations:
  640. Loop optimizations (overview)
  641. Inference loop optimizations
  642. Loop fusion (merging loops)
  643. Loop unrolling
  644. Loop perforation
  645. Loop reordering
  646. Loop tiling
  647. Loop reversal
  648. Loop fission (splitting a loop)
  649. — Loop interleave
  650. Loop interchange
  651. Loop coalescing
  652. Loop-invariant code motion ("hoisting")
  653. Loop distribution
  654. Pointer arithmetic
  655. Loop peeling (unrolling first iterations)
  656. Loop splitting; Loop sentinel
  657. Loop collapsing
  658. Loop normalization
  659. Loop strip mining (Loop sectioning)
  660. Loop skewing
  661. Loop spreading

    Low-Level Coding Efficiency:
  662. Code optimizations (overview)
  663. Constant folding
  664. Common subexpression elimination
  665. Algebraic identities
  666. Strength reduction
  667. Type consistency
  668. Reciprocal multiplication
  669. References vs pointers
  670. Compile-time optimizations
  671. Pointer arithmetic
  672. Algorithm-level optimizations
  673. Lazy evaluation
  674. Memory reduction heuristics

    Data Structures for AI optimization:
  675. Hashing
  676. Perfect hashing
  677. Look-up tables (LUTs)
  678. Bloom filters
  679. — Trees
  680. — Tries
  681. Bloom filters
  682. Bitserial operations
  683. Permutation arrays
  684. — Radix trees (compressed tries) used in Radix Attention for prefix KV caches.
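
To show how a trie supports prefix KV cache lookup, here is a minimal sketch with hypothetical names. It uses a plain (uncompressed) trie over token IDs rather than a true radix tree, which would additionally merge single-child chains into one edge; the lookup logic is otherwise the same idea.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixNode

class PrefixIndex:
    """Tracks which token prefixes already have KV cache data (assumed stored elsewhere)."""

    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        # Record that KV data exists for every prefix of `tokens`.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def longest_prefix(self, tokens):
        # Number of leading tokens whose KV data can be reused.
        node, matched = self.root, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched

index = PrefixIndex()
index.insert([1, 2, 3, 4])   # e.g. system prompt plus first query
index.insert([1, 2, 9])      # a second query sharing the first two tokens
reusable = index.longest_prefix([1, 2, 3, 7])
```

A new request only needs prefill for the tokens after the matched prefix, which is where the saving comes from when many requests share a system prompt.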

    Vector Data Structures:
  685. Parallel data structures
  686. Bit vectors
  687. Vector hashing
  688. Locality-Sensitive Hashing (LSH)
  689. Vector dot product caching
  690. — Bit signatures (vector algorithm)
  691. — K-means clustering (vector algorithm)
  692. — Hyper-Cube (vector algorithm)

    Convolution Optimizations in CNNs:
  693. Convolution optimizations (overview)
  694. Grouped convolutions
  695. Depth-wise separable convolutions

    Tokenization and Vocabulary Optimizations:
  696. Tokenization (overview)
  697. Tokenizer and model inference latency
  698. Semantic tokenization
  699. Tokenization for Machine Vision
  700. Tokenization of non-English languages
  701. Vocabulary optimizations
  702. Vocabulary size
  703. Lexical shortlisting
  704. Vocabulary trimming
  705. Vocabulary expansion
  706. Dynamic vocabulary pruning

    Overall summaries of AI optimizations:
  707. Deslugging AI engines
  708. Accuracy-degrading optimizations
  709. Accuracy-retaining optimizations
  710. Uncommon inference optimizations

Not Enough?

More inference optimization resources:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

Free AI and C++ Books

Generative AI programming books:

  1. The Sweetest Lesson: Your Brain Versus AI, November 2025: full text online, free PDF available
  2. RAG Optimization: Accurate and Efficient LLM Applications, June 2025: full text online, free PDF available
  3. Generative AI Applications: Planning, Design and Implementation, November 2024: full text online, free PDF available
  4. Generative AI in C++ (Spuler, March 2024): full text online, free PDF available, table of contents, bonus materials, reference lists, source code

CUDA C++ GPU Programming Books:

  1. CUDA C++ Optimization: Coding Faster GPU Kernels, July 2024: full text online, bonus materials, free PDF available
  2. CUDA C++ Debugging: Safer GPU Kernel Programming, July 2024: full text online, free PDF available

Modern C++ Programming Books

  1. C++ AVX Optimization: CPU SIMD Vectorization, 2025: full text online, free PDF available
  2. C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations, 2025: full text online, free PDF available
  3. Advanced C++ Memory Techniques: Efficiency and Safety, 2025: full text online, free PDF available
  4. Efficient C++ Multithreading: Modern Concurrency Optimization, 2025: free PDF available
  5. Efficient Modern C++ Data Structures: Container and Algorithm Optimizations, 2025: free PDF available
  6. C++ Low Latency: Multithreading and Hotpath Optimizations, 2025: free PDF available
  7. Safe C++: Fixing Memory Safety Issues, Oct 2024: full text online, free PDF available

More AI Research Topics

Read more about: