Aussie AI

Appendix: 500+ LLM Inference Optimization Techniques

  • Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
  • by David Spuler and Michael Sharpe

Inference Optimization Research

The LLM is usually the main bottleneck for latency in a RAG architecture, and inference optimization is an important part of tuning the overall application. We do a lot of research on inference optimization techniques, so here’s a very long list of all the techniques about which we have research papers.

Inference optimization has become a hot area of research as the industry evolves to the point where inference costs are about 95% of overall compute. This is a change from the early days, when training expense far exceeded inference costs. The trend is driven by both an increase in inference demand and a decline in training costs. Specific factors include:

    (a) more users, which means more queries, which means more inference computations,

    (b) commercial and open source pre-trained models (rather than training your own),

    (c) faster training and fine-tuning methods (e.g., LoRA and multi-LoRA),

    (d) RAG architectures replacing fine-tuning, and

    (e) multi-step reasoning algorithms (e.g., OpenAI’s o1 model).

The first four of these factors have been going on for a year or two, but the last point is recent: inference is the new way to reason!

What’s Hot in Inference Optimization Research?

The change in focus towards inference has spawned a deluge of research papers on speeding up inference, aiming to offer lower latency to users and to reduce costs. Some of the hottest research sub-areas for speeding up inference include:

  1. Hardware optimizations. The biggest opportunity for inference speedup is probably in hardware rather than software. There’s the upcoming NVIDIA Blackwell architecture, which is apparently delayed as I write this, along with several AI-specific hardware startups such as Groq and Etched receiving large funding rounds. I’m not an expert on the hardware opportunities, so I’ll leave it there.
  2. KV cache compression. The KV cache was initially a speedup for inference, but it has become a memory hog, especially for long context processing. Hence, there are numerous research papers on making the KV cache data use less memory (see KV cache compression research). In particular, KV cache quantization is becoming standard in industry framework implementations, such as the 4-bit quantized KV cache data used by Character.AI and Apple Intelligence (a minimal quantization sketch appears after this list). There are several fancier types of KV cache compression in the research, such as KV cache layer pruning (depth dimension) and KV cache token pruning (input prompt length dimension). Notably, an implementation of KV cache layer fusion is used by Character.AI’s inference backend for companionbots.
  3. Context caching. The simplest cache is a text-to-text full "inference cache," and there’s also semantic caching based on embedding vector similarity. However, saving the KV cache but re-running decoding has various advantages, and this approach is gaining attention in both research and industry, where it is usually termed “context caching” or “prompt caching.” Google has recently released “context caching” features, Anthropic has added “prompt caching,” and this type of caching is also appearing in other inference frameworks, such as vLLM and DeepSeek. Expect many more to follow! See: context caching research.
  4. Prefix KV caching. There are many cases where Transformers re-process the same prefix of tokens, such as chatbot multi-turn conversational context, global system instructions (prepended), RAG chunks (prepended), and re-used documents. Instead, you can load the KV cache data from a prefix KV cache, so only the last few tokens need to be processed and the added latency is minimal (see the sketch after this list). Prefix KV caching is also getting implemented in frameworks, including vLLM, DeepSeek, and Character.AI’s backend. Interestingly, DeepSeek offers lower pricing for “cached tokens,” which reflects the lower cost.
  5. Multi-LoRA. The idea of using multiple LoRA adapters for efficiently supporting multiple fine-tuned models got a massive boost from Apple Intelligence. There are many research papers now focused on further optimizing the load-time and inference characteristics of multi-LoRA architectures and other types of Parameter-Efficient Fine-Tuning (PEFT).
  6. Memory-efficient attention algorithms. The two leading contenders for optimizing the memory access patterns of attention are Flash Attention and Paged Attention, and you can even combine them! There are also their precursors, Multi-Query Attention (MQA) and Grouped Query Attention (GQA), which are still in use and still being researched. See memory-efficient attention optimization.
  7. Linear attention. Another way to reduce the memory cost of attention is simply to access memory less! Algorithms in this category include local attention and other types of linear attention. As a recent example in industry, Character.AI’s inference backend uses a hybrid layerwise attention scheme that alternates between local and global attention across different layers. There’s a lot of research happening on optimizing the attention mechanism, because of its annoying quadratic complexity. See research on attention optimization.
  8. Zero-multiplication models. MIT researchers released a model architecture that replaces matrix multiplication with element-wise multiplication, which is the “Hadamard product.” Basic matrix multiplication is O(n³) whereas Hadamard computations are O(n²), so it’s potentially a factor-of-n reduction in multiplications, and also a simpler algorithm that’s more amenable to follow-up kernel optimizations like kernel fusion (see the operation-count sketch after this list). See Hadamard multiplication models. There are actually at least ten other types of zero-multiplication models in the literature (e.g., adder models, shift-add, logarithmic, power-of-two, max-plus, min-max, weightless neural networks, etc.). There’s also the well-known method of avoiding multiplication with low-bit quantization: both binary quantization and ternary quantization can be implemented via addition, albeit with accuracy loss.
  9. Speculative decoding. Increasing the parallelization of the decoding algorithm via speculative decoding is a perennially hot area of research, and it’s a speedup that has long been used in production backends (a minimal draft-and-verify sketch appears after this list). Various generalizations have been discovered, such as generalized speculative decoding, heuristic speculative decoding, self-speculative decoding, retrieval lookup decoding, prompt lookup decoding, and several other methods.
  10. Multi-token generation. Generalizing the decoding algorithm to output multiple tokens in parallel is a clear gain in efficiency, and some research is starting to show promise. These require an entirely different type of model architecture for both training and inference. There are also some multi-token drafting methods starting to be used to optimize speculative decoding algorithms. See: parallel decoding research.
  11. Prefill optimizations. There has been a burst of new research examining the cost of the prefill operation, which creates the KV cache and is the reason for the initial latency before the first token is output. Hence, prefill time is important for user responsiveness in any interactive use case. In particular, research has found that prefill is compute-bound, whereas the decoding phase is memory-bound. Accordingly, there is much research on prefill phase optimizations, chunked prefill (sketched after this list), and disaggregated scheduling of the prefill and decoding phases on GPU platforms. Note also that the KV caching methods discussed above can optimize prefill by avoiding it completely!
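
To make the KV cache quantization idea from item 2 concrete, here is a minimal sketch of group-wise 4-bit quantization of a K or V tensor using NumPy. It is illustrative only: the group size is an arbitrary choice, and real engines pack two 4-bit codes per byte and fuse the dequantization into the attention kernels rather than materializing full-precision tensors.

    # Minimal sketch of group-wise 4-bit KV cache quantization (illustrative only).
    import numpy as np

    def quantize_kv_4bit(kv, group_size=64):
        """Asymmetric 4-bit quantization of a KV tensor, per group of values."""
        orig_shape = kv.shape
        groups = kv.reshape(-1, group_size)            # [num_groups, group_size]
        lo = groups.min(axis=1, keepdims=True)
        hi = groups.max(axis=1, keepdims=True)
        scale = (hi - lo) / 15.0                       # 4 bits -> 16 levels (0..15)
        scale = np.where(scale == 0.0, 1.0, scale)     # guard against constant groups
        codes = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
        return codes, scale.astype(np.float16), lo.astype(np.float16), orig_shape

    def dequantize_kv_4bit(codes, scale, lo, orig_shape):
        """Reconstruct an approximate FP32 KV tensor from the 4-bit codes."""
        return (codes.astype(np.float32) * scale + lo).reshape(orig_shape)

    # Fake K tensor: [batch, heads, seq_len, head_dim]
    k = np.random.randn(1, 8, 1024, 64).astype(np.float32)
    codes, scale, lo, shape = quantize_kv_4bit(k)
    k_hat = dequantize_kv_4bit(codes, scale, lo, shape)
    print("mean abs error:", float(np.abs(k - k_hat).mean()))
    # Roughly 8x smaller than FP32 storage (4 bits vs 32 bits per value),
    # before the per-group scale/offset overhead and byte packing.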
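
The prefix KV caching idea from item 4 can be sketched as a lookup table keyed by the token prefix. The model.prefill() call below is a hypothetical stand-in for a real engine's prefill API; the point is simply that a cache hit skips re-processing the shared prefix, and only the new suffix tokens are prefilled.

    # Minimal sketch of prefix KV caching: reuse the KV cache for a shared
    # token prefix (e.g., system prompt + RAG chunks) and prefill only the suffix.
    # `model.prefill(...)` is a hypothetical stand-in for a real engine API.
    import hashlib

    class PrefixKVCache:
        def __init__(self):
            self._store = {}          # prefix key -> KV cache object

        @staticmethod
        def _key(prefix_tokens):
            # Real systems hash fixed-size token blocks; a whole-prefix hash
            # is enough for this sketch.
            return hashlib.sha256(repr(prefix_tokens).encode("utf-8")).hexdigest()

        def get(self, prefix_tokens):
            return self._store.get(self._key(prefix_tokens))

        def put(self, prefix_tokens, kv_cache):
            self._store[self._key(prefix_tokens)] = kv_cache

    def prefill_with_prefix_cache(model, cache, prefix_tokens, suffix_tokens):
        kv = cache.get(prefix_tokens)
        if kv is None:
            kv = model.prefill(prefix_tokens)              # slow path: full prefix prefill
            cache.put(prefix_tokens, kv)
        # Fast path on a cache hit: only the suffix tokens need prefill.
        return model.prefill(suffix_tokens, past_kv=kv)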
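
The arithmetic saving claimed in item 8 is easy to verify by counting multiplications: an element-wise (Hadamard) product of two n×n matrices needs n² multiplications, versus roughly n³ for a standard matrix multiply, a factor-of-n difference. The snippet below only counts operations; it is not the MIT architecture itself.

    # Operation-count comparison: Hadamard (element-wise) product vs matmul.
    import numpy as np

    n = 512
    A = np.random.randn(n, n)
    B = np.random.randn(n, n)

    hadamard = A * B          # element-wise: n*n multiplications, O(n^2)
    matmul = A @ B            # standard matmul: n*n*n multiply-adds, O(n^3)

    print("Hadamard multiplications:", n * n)        #     262,144
    print("MatMul multiplications:  ", n * n * n)    # 134,217,728 (a factor of n more)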
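
For item 9, the core draft-and-verify loop of speculative decoding can be sketched as below. The draft_model.generate_greedy() and target_model.next_tokens_greedy() calls are hypothetical interfaces, and the acceptance rule here is simple greedy agreement; production implementations use a probabilistic accept/reject rule that preserves the target model's output distribution.

    # Minimal sketch of speculative decoding with greedy verification.
    # Hypothetical interfaces: generate_greedy() returns n draft tokens;
    # next_tokens_greedy() returns the target's greedy choice at each draft
    # position, computed in a single parallel forward pass.
    def speculative_decode(target_model, draft_model, prompt_tokens,
                           max_new_tokens=128, draft_len=4):
        tokens = list(prompt_tokens)
        generated = 0
        while generated < max_new_tokens:
            draft = draft_model.generate_greedy(tokens, n=draft_len)     # cheap drafting
            verified = target_model.next_tokens_greedy(tokens, draft)    # one big-model pass
            n_accept = 0
            while n_accept < draft_len and draft[n_accept] == verified[n_accept]:
                n_accept += 1                                            # accept matching prefix
            if n_accept < draft_len:
                # Draft diverged: keep the accepted prefix plus the target's correction,
                # so at least one token is produced per big-model pass.
                new_tokens = draft[:n_accept] + [verified[n_accept]]
            else:
                new_tokens = draft                                       # entire draft accepted
            tokens.extend(new_tokens)
            generated += len(new_tokens)
        return tokens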
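
For item 11, chunked prefill splits a long prompt into fixed-size chunks so that a scheduler can interleave the compute-bound prefill of one request with the memory-bound decode steps of other in-flight requests. The model.prefill_chunk() and model.decode_step() calls are hypothetical engine calls used only to illustrate the scheduling pattern.

    # Minimal sketch of chunked prefill with interleaved decoding.
    # `model.prefill_chunk` and `model.decode_step` are hypothetical engine calls.
    CHUNK_SIZE = 512

    def chunked_prefill(model, prompt_tokens, other_decode_requests):
        kv_cache = None
        for start in range(0, len(prompt_tokens), CHUNK_SIZE):
            chunk = prompt_tokens[start:start + CHUNK_SIZE]
            # Extend this request's KV cache by one chunk (compute-bound work).
            kv_cache = model.prefill_chunk(chunk, past_kv=kv_cache)
            # Between chunks, service decode steps for other in-flight requests
            # (memory-bound work), instead of blocking them behind a long prefill.
            for req in other_decode_requests:
                req.output_tokens.append(model.decode_step(req))
        return kv_cache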

LLM Inference Optimizations List

Here’s the list! It’s over 500 and growing!

If you’re reading the e-book version, then the links should be live for each topic. If not, you can see the live list, with links and updates, at this URL: https://www.aussieai.com/blog/llm-inference-optimization

    Model compression main subtypes:
  1. Model compression (overview)
  2. Pruning (overview)
  3. Quantization (overview)
  4. Knowledge Distillation (KD)
  5. Parameter sharing (weight sharing)
  6. Low-rank matrices
  7. Small Language Models (SLMs)
  8. Data compression algorithms

    Pruning main types:
  9. Dynamic pruning
  10. Hybrid pruning
  11. Unstructured pruning
  12. Semi-Structured Pruning
  13. Structured pruning

    Layerwise structured pruning subtypes (depth dimension):
  14. Depthwise structural pruning (overview)
  15. Static layer pruning
  16. Layer pruning
  17. Early exit
  18. Dynamic layer pruning
  19. Layer skipping
  20. Layer approximation
  21. Shallow decoder architecture
  22. Layer reordering
  23. Layer Importance

    Width-wise structured pruning subtypes:
  24. Widthwise structural pruning (overview)
  25. Attention head pruning
  26. Slimmable networks (width pruning)
  27. FFN pruning
  28. Channel pruning
  29. Filter pruning

    Length-wise structured pruning subtypes:
  30. Lengthwise structural pruning (longitudinal/input/end-to-end)
  31. Token pruning (input pruning)
  32. Dynamic token pruning
  33. Prompt compression
  34. Context compression
  35. Token merging
  36. Token skipping
  37. Token dropping
  38. Zero padding removal

    Model dimension embedding pruning subtypes:
  39. Embedding-dimension pruning
  40. Embedding pruning
  41. Embedding matrix compression (embedding pruning)
  42. Embedding low-rank matrix factorization
  43. Unembedding matrix (output embeddings)

    Hybrid multi-dimensional pruning:
  44. Multi-dimensional pruning
  45. Dual pruning
  46. Triple pruning
  47. Quadruple pruning
  48. 3D CNN model pruning

    Transformer component pruning:
  49. Normalization pruning
  50. Positional embeddings pruning
  51. Softmax pruning
  52. Skip connection pruning (residual connection removal)

    Unstructured pruning subtypes:
  53. Unstructured pruning (overview)
  54. Magnitude pruning
  55. Movement pruning
  56. — Gradual pruning

    Quantization subtypes:
  57. Post-Training Quantization (PTQ)
  58. Quantization-Aware Training (QAT)
  59. Activation Quantization
  60. Outlier-aware quantization

    Integer quantization subtypes:
  61. Integer quantization (overview)
  62. Integer-only arithmetic quantization
  63. Fixed-point quantization (integer)
  64. Low-bit integer quantization (overview)
  65. Binary quantization
  66. Ternary quantization
  67. 2-bit quantization (INT2)
  68. 3-bit quantization (INT3)
  69. 4-bit quantization (INT4)
  70. 5-bit quantization (INT5)
  71. 6-bit quantization (INT6)
  72. 7-bit quantization (INT7)
  73. 8-bit quantization (INT8)
  74. 9-bit quantization (INT9)
  75. 10-bit quantization (INT10)
  76. 11-bit quantization (INT11)
  77. 12-bit quantization (INT12)
  78. 16-bit quantization (INT16)
  79. 32-bit quantization (INT32)

    Floating-point quantization subtypes:
  80. Floating-point quantization
  81. FP4 quantization
  82. FP6 quantization
  83. FP8 quantization
  84. FP16 quantization
  85. FP32 quantization

    Other quantization subtypes:
  86. Mixed-precision quantization
  87. Logarithmic power-of-two quantization (bitshift quantization)
  88. Double bitshift power-of-two quantization
  89. Division quantization
  90. Cluster-based quantization (Weight clustering)
  91. Hashing-based weight clustering
  92. Dyadic quantization
  93. Fake quantization
  94. Simulated quantization
  95. Stochastic quantization (probabilistic)

    Granularity-level quantization subtypes:
  96. Granular quantization (overview)
  97. Layerwise Quantization
  98. Blockwise Quantization
  99. Vector quantization

    Knowledge distillation subtypes:
  100. Knowledge Distillation (overview)
  101. Ensemble Distillation
  102. Unnatural instructions (data sets)
  103. Dataset Distillation

    Parameter/weight sharing subtypes:
  104. Parameter/Weight sharing (overview)
  105. Activation sharing
  106. Layer fusion
  107. Clustering (Weights)
  108. Attention head fusion
  109. FFN fusion
  110. KV cache layer fusion (depthwise)
  111. KV cache head fusion (widthwise)

    Activation function optimizations:
  112. Activation function optimizations (overview)
  113. Activation function approximation
  114. Integer-only activation functions
  115. Fused activation functions (kernel fusion)
  116. Fused RELU
  117. Fused GELU
  118. Fused SwiGLU
  119. Activation alternatives/replacements
  120. Activation function pruning/removal (bilinear layers)
  121. Activation function reordering

    Normalization optimization types:
  122. Normalization algorithm optimizations (overview)
  123. Approximate normalization
  124. Norm reordering (pre-norm/post-norm)
  125. Integer-only normalization
  126. Normalization alternatives/replacements
  127. Fused normalization (e.g., "fused LayerNorm" in kernel fusion)

    Softmax optimization types:
  128. Softmax optimizations (overview)
  129. Softmax pruning
  130. Approximate Softmax
  131. Softmax alternatives/replacements
  132. Integer-only Softmax
  133. Fused Softmax

    Feed-Forward Network (FFN) optimization types:
  134. FFN optimizations (overview)
  135. FFN pruning
  136. FFN approximation
  137. Fused add-bias
  138. Bias vector pruning
  139. FFN sparsity
  140. FFN alternatives/replacements
  141. Integer-only FFN
  142. — Bias optimizations

    MatMul/GEMM optimization types:
  143. MatMul/GEMM kernel optimizations (overview)
  144. Faster matrix multiplication (e.g., Winograd, Strassen)
  145. Approximate matrix multiplication
  146. Transpose cache
  147. Fused multiply-add (FMA)
  148. Fused transpose
  149. Vector dot product optimization
  150. Sparse MatMul/GEMM
  151. — Tiled MatMul

    Positional Encoding optimizations:
  152. Positional encoding optimization (overview)
  153. RoPE (Rotary Positional Encoding)
  154. Pruning positional encoding (removal/NoPE)
  155. — Positional encoding approximation
  156. — Integer-only positional encoding

    NAS subtypes:
  157. Neural Architecture Search (NAS)
  158. Dynamic NAS
  159. Embedding Size Optimization (embeddings NAS)

    Platform-specific optimization subtypes:
  160. On-device inference (native phone and PC AI)
  161. AI Phones
  162. AI PCs (desktops/laptops)
  163. Edge device inference (IoT/mobile/PC)
  164. Hybrid cloud-on-device inference

    Decoding algorithm subtypes:
  165. Decoding algorithms (overview)
  166. Non-autoregressive decoding
  167. Greedy decoding
  168. Top-k decoding
  169. Top-p decoding
  170. Min-P Sampling
  171. Flash decoding
  172. Beam search decoding
  173. Edit decoding
  174. Contrastive decoding
  175. — Approximate top-k algorithms
  176. — Bidirectional decoding
  177. Constrained decoding

    Parallel Decoding algorithms:
  178. Parallel decoding
  179. Blockwise parallel decoding
  180. n-gram parallel decoding
  181. Lookahead decoding
  182. Medusa decoding
  183. Consensus decoding
  184. — Mutually-guided decoding
  185. — Multi-token generation
  186. — Eagle decoding

    Speculative decoding subtypes:
  187. Speculative decoding (overview)
  188. Generalized speculative decoding
  189. Aggressive decoding
  190. Lookup decoding
  191. Retrieval lookup decoding
  192. Prompt lookup decoding
  193. Self speculative decoding
  194. Tree speculative decoding
  195. Superposed decoding
  196. Hierarchical speculative decoding
  197. Heuristic speculative decoding
  198. Multi-token speculative decoding
  199. Sequential speculative decoding
  200. — Redrafting

    Parameter Efficient Fine-Tuning (PEFT) subtypes:
  201. PEFT (overview)
  202. LoRA
  203. Multi-LoRA inference
  204. QLoRA (Quantized Low-Rank Adapters)
  205. LoRA inference optimizations (load/unload)
  206. Prompt Tuning (Extended Vocabulary PEFT)

    Ensemble multi-LLM subtypes:
  207. Ensemble inference (overview of multi-model AI engines)
  208. Mixture of Experts (MoE)
  209. Model selection algorithms
  210. Big-little architectures
  211. Cascades
  212. Collaborative inference
  213. Consensus decoding
  214. — Swarm ensemble architectures
  215. — Committee ensemble architectures
  216. — Ensemble averaging
  217. Easy-hard queries
  218. Submodels (Many-Models-in-One)
  219. Distributed Inference

    Orchestration, Deployment and Serving:
  220. Cloud inference servers
  221. Orchestration frameworks
  222. Scheduling optimizations
  223. Serving
  224. Load balancing
  225. Batching
  226. Continuous batching
  227. Deployment
  228. Serverless
  229. Networking optimizations
  230. In-flight batching

    Attention optimization subtypes:
  231. Attention optimizations (overview)
  232. Multi-Head Attention (MHA)
  233. Group Query Attention (GQA)
  234. Multi-Query Attention (MQA)
  235. Sparse attention
  236. Local attention
  237. Memory-efficient attention algorithms
  238. Flash Attention
  239. Paged Attention
  240. Linear attention
  241. Cross attention
  242. Tree attention
  243. Sliding window attention
  244. Approximate attention heads
  245. Attention alternatives/replacements
  246. Fused MHA
  247. Low-rank matrix attention
  248. Medusa attention
  249. Block attention
  251. Fused head attention
  252. Hybrid local-global attention
  253. FFT attention
  254. QKV computation optimizations
  255. Additive attention
  256. Multiplicative attention
  257. Graph attention
  258. Chunked attention
  259. Attention sink
  260. Attention steering
  261. Bilinear attention
  262. Attention-free methods
  263. Mixture-of-Heads (MOH) Attention (MoE+MHA)
  264. Star attention
  265. Flex attention
  266. Razor attention
  267. Contiguous QKV tensor
  268. Relative Attention Bias (RAB)

    Long context optimizations (attention):
  269. Long context models
  270. Length generalization
  271. Quadratic attention complexity
  272. Long RAG

    Caching optimizations:
  273. Caching (overview)
  274. Inference Cache (text-to-text)
  275. Inference cache (global KV caching)
  276. Prompt caching
  277. Input Similarity-Based Caching (frame skipping in video)
  278. Semantic caching (text-to-text)
  279. Semantic KV caching
  280. Vector database caching
  281. Chatbot caching
  282. Vector Caching (Vector hashing)
  283. Caching vector dot products
  284. Caching general theory

    KV cache optimizations:
  285. KV Caching (overview)
  286. KV cache global (multi-query KV caching)
  287. KV cache reuse
  288. Global semantic KV caching (difficult!)
  289. Context cache (global KV caching)
  290. Prefix KV Caching
  291. KV cache recomputation with early exit
  292. Session KV cache (multi-turn KV caching)
  293. Substring/fused KV cache (Lengthwise-fused KV caching)
  294. — Paged KV caching (related to paged attention)

    KV cache memory size reduction:
  295. KV cache compression
  296. KV cache quantization
  297. KV cache sparsity
  298. KV cache token pruning
  299. KV cache eviction policies
  300. KV cache layer fusion
  301. KV cache layer pruning
  302. KV Cache low-rank matrix factorization
  303. — Cyclic KV cache (Rolling buffer KV cache or circular KV cache)

    Non-Multiplication AI Models:
  304. Zero-Multiplication Models (overview)
  305. Binary quantization
  306. Ternary quantization
  307. 2-bit quantization (INT2)
  308. Adder networks
  309. Bitshift-add networks
  310. Bitshift power-of-2 quantization (logarithmic quantization)
  311. Double bitshift quantization
  312. Add-as-integer networks
  313. Logarithmic Models
  314. Bitwise neural networks
  315. Diff-squared networks
  316. Log-sum-exp (LSE) networks
  317. Max-Plus networks
  318. Min-Max-Plus networks
  319. Morphological networks
  320. Trigonometric approximate inference
  321. Weightless Neural Networks (WNNs)
  322. XNOR networks
  323. Hadamard elementwise matrix multiplication models
  324. Other addition-related zero-multiplication networks
  325. Table lookups replace multiplication
  326. Other multiplication-free neural networks

    Advanced Number System optimizations:
  327. Advanced Number Systems (overview)
  328. Posit number system (PNS)
  329. Residue number system (RNS)
  330. Dyadic numbers
  331. Double-base number system (DBNS)
  332. Dynamic number systems
  333. Hybrid number systems
  334. Tropical algebra (max-plus)
  335. MiniMax algebra
  336. Multi-dimensional logarithmic number system (MDLNS)
  337. Multiple-Base Number System (MBNS)
  338. — Semi-Logarithmic Number System (SLNS)
  339. — Lattice algebra

    Logarithmic Number System optimizations:
  340. Logarithmic number system (LNS) (overview)
  341. End-to-end LNS logarithmic model
  342. LNS addition and subtraction
  343. LNS in AI models
  344. LNS Hardware Acceleration
  345. LNS mathematical and algorithmic theory
  346. LNS algebra
  347. LNS extensions

    Prefill phase optimizations:
  348. Prefill optimizations (overview)
  349. Chunked prefill
  350. Disaggregated prefill scheduling (Phase splitting)
  351. Deep prefill, shallow decoder architecture
  352. Mini-prefill recomputation

    Parallel Programming Optimization Techniques:
  353. Parallelization techniques (overview)
  354. Hardware acceleration
  355. Hardware-software co-design
  356. Vectorization
  357. Pipelining (pipeline parallelism)
  358. Overlapping (new)
  359. Overlapping communications and computation (new)
  360. Overlapping rematerialization (new)
  361. Overlapping memory access & computation (new)
  362. Offloading
  363. Partitioning
  364. Dataflow optimizations
  365. — Sharding
  366. — Overlapping
  367. Data parallelism
  368. Query parallelism
  369. Tensor parallelism
  370. Model parallelism
  371. — Prefetching
  372. — Speculative execution
  373. Sequence Parallelism
  374. Skeleton-of-Thought (Query Parallelism)

    Hardware Optimizations:
  375. Hardware Acceleration (overview)
  376. Software accelerations
  377. Hardware-software co-design
  378. GPU
  379. GPU software platforms
  380. Multi-GPU
  381. CPU Execution
  382. Single Instruction Multiple Data (SIMD)
  383. AVX (AVX/AVX-2/AVX-512)
  384. — ARM NEON
  385. Neural Processing Unit (NPU)
  386. — Overclocking CPU
  387. — Overclocking GPU
  388. Assembly language

    RAG Architecture Optimizations:
  389. RAG architectures (overview)
  390. RAG cache
  391. RAG optimizations
  392. — RAG retriever datastore indexing
  393. Advanced RAG
  394. — Speculative RAG
  395. Reranker in RAG
  396. — Chunk-specific global KV caching
  397. — Chunk-specific prefix KV caching
  398. RAG Knowledge Graph

    Sparsity Optimizations:
  399. Sparsification techniques (overview)
  400. Activation Sparsity
  401. Dynamic Sparsity
  402. Block sparsity
  403. Vector sparsity
  404. Tensor sparsity
  405. Sparse matrix kernels
  406. Outlier-aware sparsification

    Memory Utilization Optimizations:
  407. Memory optimization techniques (overview)
  408. Parameter sharing
  409. Model compression
  410. Low-bit integer quantization
  411. Binary quantization
  412. Ternary quantization
  413. Layer fusion
  414. Recomputation: trading time for space
  415. Memory-bound versus CPU-bound
  416. — Data locality optimization
  417. — Compute-in-Memory (CIM) architectures
  418. — Memory cache management algorithms
  419. Kernel operator fusion
  420. — Flash Inference (FlashInfer)
  421. — Checkpointing
  422. Offloading

    Numerical representation subtypes:
  423. Floating-point representations (overview)
  424. Floating Point Bit Tricks
  425. Block floating-point arithmetic
  426. Fixed point number system (FXP) optimizations
  427. Floating point number system (FLP) optimizations
  428. Floating point bitwise arithmetic
  429. FTZ/DAZ floating point CPU settings

    Kernel optimizations:
  430. Kernel optimizations (overview)
  431. Kernel operator fusion (merging)
  432. Kernel fission (splitting)
  433. Kernel tiling
  434. — Operator reordering
  435. Graph operator fusion (Deep learning compilers)

    Computation optimizations:
  436. Advanced AI Mathematics
  437. Approximate activation functions
  438. Caching / memoization
  439. Computation reuse
  440. Precomputation
  441. Source code precomputation
  442. Conditional computation
  443. Approximations
  444. Integer-only arithmetic quantization
  445. Weight precomputations
  446. Zero-skipping
  447. Low-Level Zero Skipping
  448. High-Level Zero Skipping
  449. Negative skipping
  450. Approximate caching
  451. End-to-End integer inference
  452. Padding usage
  453. Incremental inference (new)

    Arithmetic optimizations:
  454. Integer operations
  455. Addition optimizations
  456. Bitwise operation tricks
  457. Approximate addition
  458. Multiplication algorithms
  459. Approximate division
  460. Approximate multiplication
  461. Bitwise operator inference
  462. Bitserial operations
  463. Division optimizations
  464. Logarithmic approximate multiplication
  465. Integer Dot Product
  466. Vector dot product optimization

    Advanced matrix algebra optimizations:
  467. Matrix Algebra (overview)
  468. Approximate matrix multiplication
  469. Butterfly matrices
  470. Monarch matrices
  471. Sparse matrices (sparsification)

    Low-rank matrix optimizations:
  472. Low-rank matrix factorization (overview)
  473. — Tensor decomposition
  474. — Tucker decomposition
  475. Embedding low-rank matrix factorization
  476. KV Cache low-rank matrix factorization

    Transformer architectural optimizations:
  477. Transformer architectures (overview)
  478. Transformer low-level optimizations (overview)
  479. — Adaptive Inference
  480. Integer-only Transformers
  481. Approximate Transformers
  482. Decoder-Only Architectures
  483. Encoder-Only Architectures
  484. Encoder-Decoder Architectures

    Transformers and LLMs:
  485. Open source models
  486. Inference frameworks
  487. Open source frameworks

    Next-Generation Transformer architectures:
  488. Next-generation architectures (overview)
  489. Hybrid Transformer architectures
  490. Newer Transformer architectures
  491. BERT (encoder)
  492. — State Space Models (SSMs)
  493. Mamba
  494. RWKV
  495. Knowledge graph AI architectures
  496. Compound AI architectures

    General Classes of Optimization Techniques:
  497. Dynamic inference (adaptive inference)
  498. Skipping
  499. Heuristics
  500. Probabilistic optimizations
  501. Approximate computing
  502. Code optimizations
  503. Deep learning compilers
  504. Incremental algorithms
  505. Fuzzy logic

    Loop Optimizations:
  506. Loop optimizations (overview)
  507. Inference loop optimizations
  508. Loop fusion (merging loops)
  509. Loop unrolling
  510. Loop perforation
  511. Loop reordering
  512. Loop tiling
  513. Loop reversal
  514. Loop fission (splitting a loop)
  515. — Loop interleave
  516. Loop interchange
  517. Loop coalescing
  518. Loop-invariant code motion ("hoisting")
  519. Loop distribution
  520. Pointer arithmetic
  521. Loop peeling (unrolling first iterations)
  522. Loop splitting
  — Loop sentinel
  523. Loop collapsing
  524. Loop normalization
  525. Loop strip mining (Loop sectioning)
  526. Loop skewing
  527. Loop spreading

    Low-Level Coding Efficiency:
  528. Code optimizations (overview)
  529. Constant folding
  530. Common subexpression elimination
  531. Algebraic identities
  532. Strength reduction
  533. Type consistency
  534. Reciprocal multiplication
  535. References vs pointers
  536. Compile-time optimizations
  537. Pointer arithmetic
  538. Algorithm-level optimizations
  539. Lazy evaluation
  540. Memory reduction heuristics

    Data Structures for AI optimization:
  541. Hashing
  542. Perfect hashing
  543. Look-up tables (LUTs)
  544. Bloom filters
  545. — Trees
  546. — Tries
  548. Bitserial operations
  549. Permutation arrays

    Vector Data Structures:
  550. Parallel data structures
  551. Bit vectors
  552. Vector hashing
  553. Locality-Sensitive Hashing (LSH)
  554. Vector dot product caching
  555. — Bit signatures (vector algorithm)
  556. — K-means clustering (vector algorithm)
  557. — Hyper-Cube (vector algorithm)

    Convolution Optimizations in CNNs:
  558. Convolution optimizations (overview)
  559. Grouped convolutions
  560. Depth-wise separable convolutions

    Tokenization and Vocabulary Optimizations:
  561. Tokenization (overview)
  562. Tokenizer and model inference latency
  563. Semantic tokenization
  564. Tokenization for Machine Vision
  565. Tokenization of non-English languages
  566. Vocabulary optimizations
  567. Vocabulary size
  568. Lexical shortlisting
  569. Vocabulary trimming
  570. Vocabulary expansion
  571. Dynamic vocabulary pruning

    Overall summaries of AI optimizations:
  572. Deslugging AI engines
  573. Accuracy-degrading optimizations
  574. Accuracy-retaining optimizations
  575. Uncommon inference optimizations

     
