Aussie AI
Appendix: 500+ LLM Inference Optimization Techniques
Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
by David Spuler and Michael Sharpe
Inference Optimization Research
The LLM is usually the main bottleneck for latency in a RAG architecture, and inference optimization is an important part of tuning the overall application. We do a lot of research on inference optimization techniques, so here’s a very long list of all the techniques about which we have research papers.
Inference optimization has become a hot area of research as the industry evolves to the point where inference costs are about 95% of overall compute. This is a change from the early days, when training expense far exceeded inference costs. This trend is driven by both an increase in inference demand and a decline in training costs. Specific factors include:
(a) more users, which means more queries, which means more inference computations,
(b) commercial and open source pre-trained models (rather than training your own),
(c) faster training and fine-tuning methods (e.g., LoRA and multi-LoRA),
(d) RAG architectures replacing fine-tuning, and
(e) multi-step reasoning algorithms (e.g., OpenAI’s o1 model).
The first four of these factors have been going on for a year or two, but the last point is recent: inference is the new way to reason!
What’s Hot in Inference Optimization Research?
The change in focus towards inference has spawned a deluge of research papers on speeding up inference, aiming to offer lower latency to users and reduce costs. Some of the hottest research sub-areas for speeding up inference include:
- Hardware optimizations. The biggest opportunity for inference speedup is probably in hardware rather than software. There’s the upcoming NVIDIA Blackwell architecture, which is apparently delayed as I write this, along with several AI-specific hardware startups such as Groq and Etched receiving large funding rounds. I’m not an expert on the hardware opportunities, so I’ll leave it there.
- KV cache compression. The KV cache was initially a speedup for inference, but it’s become a memory hog, especially for long context processing. Hence, there are numerous research papers on making the KV cache data use less memory (see KV cache compression research). In particular, KV cache quantization is becoming standard in industry framework implementations, such as 4-bit quantized KV cache data used by Character.AI and Apple Intelligence (a minimal sketch of 4-bit KV cache quantization appears after this list). There are several fancier types of KV cache compression in the research, such as KV cache layer pruning (depth dimension) and KV cache token pruning (input prompt length dimension). Notably, an implementation of KV cache layer fusion is used by Character.AI’s inference backend for companionbots.
- Context caching. The simplest cache is a text-to-text full "inference cache," and there’s also semantic caching based on embedding vector similarity (a sketch of a simple semantic cache appears after this list). However, the idea of saving the KV cache but re-running decoding has various advantages, and it is gaining attention in both research and industry. This is usually termed “context caching” or “prompt caching” in the industry. Google has recently released “context caching” features, Anthropic has added “prompt caching,” and this type of caching is also appearing in other inference frameworks, such as vLLM and DeepSeek. Expect many more to follow! See: context caching research.
- Prefix KV caching. There are many cases where Transformers re-process the same prefix of tokens, such as chatbot multi-turn conversational context, global system instructions (prepended), RAG chunks (prepended), and re-used documents. Instead, you can just load the KV cache data from a prefix KV cache, so the latency is minimal and only the last few tokens still need to be processed (see the prefix cache sketch after this list). Prefix KV caching is also getting implemented in frameworks, including vLLM, DeepSeek, and Character.AI’s backend. Interestingly, DeepSeek offers lower pricing for “cached tokens,” which reflects the lower cost.
- Multi-LoRA. The idea of using multiple LoRA adapters to efficiently support multiple fine-tuned models got a massive boost from Apple Intelligence. There are many research papers now focused on further optimizing the load-time and inference characteristics of multi-LoRA architectures and other types of Parameter-Efficient Fine-Tuning (PEFT) (a sketch of the basic multi-adapter computation appears after this list).
- Memory-efficient attention algorithms. The two leading contenders for optimizing attention via its memory access patterns are Flash Attention and Paged Attention, and you can even combine them! There are also their precursors, Multi-Query Attention (MQA) and Grouped Query Attention (GQA), which are still in use and still being researched. See memory-efficient attention optimization.
- Linear attention. Another way to reduce memory cost is to simply access it less! Algorithms like this include local attention and other types of linear attention. As a recent example in industry, Character.AI’s inference backend uses a hybrid layerwise attention scheme that alternates between local and global attention across different layers. There’s a lot of research happening in optimizing the attention mechanism, because of its annoying quadratic complexity. See research on attention optimization.
- Zero-multiplication models. MIT researchers released a model architecture that replaces matrix multiplication with element-wise multiplication, known as the “Hadamard product.” Basic matrix multiplication is O(n³) whereas the Hadamard product is O(n²), so it’s potentially a reduction by a factor of n, and it’s also a simpler algorithm that’s more amenable to follow-up kernel optimizations like kernel fusion (see the short comparison after this list). See Hadamard multiplication models. There are actually at least ten other types of zero-multiplication models in the literature (e.g., adder models, shift-add, logarithmic, power-of-two, max-plus, min-max, weightless neural networks, etc.). There’s also the well-known method of avoiding multiplication with low-bit quantization. Both binary quantization and ternary quantization can be implemented via addition, albeit with accuracy loss.
- Speculative decoding. Increased parallelization of the decoding algorithm via speculative decoding is a perennially hot area of research. It’s a speedup that has long been used in production backends (the basic draft-and-verify loop is sketched after this list). Various generalizations have been discovered, such as generalized speculative decoding, heuristic speculative decoding, self-speculative decoding, retrieval lookup decoding, prompt lookup decoding, and several other methods.
- Multi-token generation. Generalizing the decoding algorithm to output multiple tokens in parallel is a clear gain in efficiency, and some research is starting to show promise. These require an entirely different type of model architecture for both training and inference. There are also some multi-token drafting methods starting to be used to optimize speculative decoding algorithms. See: parallel decoding research.
- Prefill optimizations. There has been a burst of new research examining the cost of the prefill operation, which creates the KV cache and is the reason for the initial latency before the first token is output. Hence, prefill time is important for user responsiveness in any interactive use case. In particular, research has found that prefill is compute-bound, whereas the decoding phase is memory-bound. As a result, there is much research on prefill phase optimizations, chunked prefill (sketched after this list), and disaggregated scheduling of prefill and decoding phases on GPU platforms. Note also that KV caching methods as discussed above can optimize prefill by avoiding it completely!
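A few code sketches for the items above follow. First, KV cache quantization: this is a minimal sketch of 4-bit group quantization of a cached K/V tensor in NumPy. The group size, the symmetric scaling, and storing the 4-bit codes unpacked in int8 are illustrative assumptions, not the scheme used by any particular framework.

import numpy as np

def quantize_kv_4bit(kv, group_size=64):
    # kv: float32 array whose total size is a multiple of group_size (assumed).
    flat = kv.astype(np.float32).reshape(-1, group_size)
    # Symmetric per-group scale: map the largest |value| to the 4-bit range [-7, 7].
    scales = np.maximum(np.abs(flat).max(axis=1, keepdims=True), 1e-8) / 7.0
    codes = np.clip(np.round(flat / scales), -7, 7).astype(np.int8)  # 4-bit codes, stored unpacked in int8
    return codes, scales

def dequantize_kv_4bit(codes, scales, shape):
    # Recover an approximate float32 tensor for use in the attention computation.
    return (codes.astype(np.float32) * scales).reshape(shape)

# Example: a cache of 128 tokens with head dimension 64.
kv = np.random.randn(128, 64).astype(np.float32)
codes, scales = quantize_kv_4bit(kv)
kv_approx = dequantize_kv_4bit(codes, scales, kv.shape)
print("max abs error:", float(np.abs(kv - kv_approx).max()))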
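Context caching builds on the simpler text-to-text and semantic caches, so here is a sketch of those two levels: an exact-match dictionary plus a fallback lookup that compares embedding vectors by cosine similarity. The embed_fn interface and the 0.95 threshold are placeholder assumptions; a production system would use a vector database for the nearest-neighbor search.

import numpy as np

class SemanticCache:
    # Text-to-text inference cache with an embedding-similarity fallback.

    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn = embed_fn      # maps text -> 1-D NumPy vector (assumed interface)
        self.threshold = threshold    # cosine similarity cutoff (illustrative value)
        self.exact = {}               # prompt text -> response text
        self.entries = []             # list of (normalized embedding, response)

    def get(self, prompt):
        # 1. Exact text-to-text hit.
        if prompt in self.exact:
            return self.exact[prompt]
        # 2. Semantic hit: nearest cached embedding by cosine similarity.
        if self.entries:
            q = self.embed_fn(prompt)
            q = q / np.linalg.norm(q)
            sims = [float(np.dot(q, e)) for e, _ in self.entries]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.entries[best][1]
        return None                   # cache miss: caller runs full inference

    def put(self, prompt, response):
        self.exact[prompt] = response
        e = self.embed_fn(prompt)
        self.entries.append((e / np.linalg.norm(e), response))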
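Prefix KV caching amounts to keying stored KV data by the exact token prefix it was computed from. The sketch below shows that bookkeeping; the model.prefill(tokens, past_kv=...) interface in the usage comment is hypothetical, and real frameworks such as vLLM manage reuse at the level of fixed-size KV blocks rather than whole prefixes.

import hashlib

class PrefixKVCache:
    # Reuse KV cache data for prompts that share a common token prefix.

    def __init__(self):
        self.store = {}   # prefix hash -> (prefix_length, kv_tensors)

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(str(list(tokens)).encode("utf-8")).hexdigest()

    def save(self, prefix_tokens, kv_tensors):
        self.store[self._key(prefix_tokens)] = (len(prefix_tokens), kv_tensors)

    def lookup(self, tokens, known_prefix_len):
        # If the first known_prefix_len tokens were cached, return the KV data
        # plus the suffix that still needs prefill; otherwise (None, all tokens).
        key = self._key(tokens[:known_prefix_len])
        if key in self.store:
            cached_len, kv = self.store[key]
            return kv, tokens[cached_len:]
        return None, tokens

# Usage sketch (model.prefill is a hypothetical interface):
#   kv, suffix = cache.lookup(prompt_tokens, len(system_prompt_tokens))
#   if kv is None:
#       kv = model.prefill(prompt_tokens)          # full prefill
#   else:
#       kv = model.prefill(suffix, past_kv=kv)     # only the new tokens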
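The multi-LoRA idea is easiest to see in the arithmetic: each adapter is a low-rank update B·A added to a frozen base weight matrix W, so a server can keep one copy of W in memory and swap tiny (A, B) pairs per request. The dimensions, rank, and scaling factor below are illustrative values, not any particular model's configuration.

import numpy as np

def lora_forward(x, W, adapters, adapter_id, scaling=1.0):
    # x: (d_in,) input activation
    # W: (d_out, d_in) frozen base weight matrix, shared by all adapters
    # adapters: dict of adapter_id -> (A, B) with A: (r, d_in), B: (d_out, r)
    base = W @ x
    A, B = adapters[adapter_id]
    return base + scaling * (B @ (A @ x))   # low-rank update, rank r << d_in

# Two fine-tuned "models" sharing one base weight matrix:
d_in, d_out, r = 512, 512, 8
W = np.random.randn(d_out, d_in) * 0.02
adapters = {
    "support-bot": (np.random.randn(r, d_in) * 0.02, np.zeros((d_out, r))),  # B starts at zero, as in standard LoRA init
    "legal-bot":   (np.random.randn(r, d_in) * 0.02, np.zeros((d_out, r))),
}
x = np.random.randn(d_in)
y = lora_forward(x, W, adapters, "support-bot")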
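For the zero-multiplication item, the O(n³)-versus-O(n²) point is just the difference between a matrix multiply and a Hadamard (element-wise) product of the same two n×n matrices, shown below. This only illustrates the operation counts; it says nothing about the surrounding model architecture.

import numpy as np

n = 4
A = np.random.randn(n, n)
B = np.random.randn(n, n)

# Standard matrix multiplication: n*n*n = n^3 scalar multiplications.
matmul = A @ B

# Hadamard (element-wise) product: n*n = n^2 scalar multiplications.
hadamard = A * B

print(matmul.shape, hadamard.shape)   # both (n, n), but very different costs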
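The draft-and-verify control flow of speculative decoding is sketched below with hypothetical draft_model and target_model functions that each return a single greedy next token. Real implementations score all k drafted positions in one batched target-model pass and accept or reject probabilistically, but the loop has the same shape.

def speculative_decode(target_model, draft_model, tokens, max_new=64, k=4):
    # tokens: list of token ids; draft_model(tokens) and target_model(tokens)
    # are assumed interfaces returning the next token id for a sequence.
    while max_new > 0:
        # 1. Draft: the small model proposes k tokens cheaply.
        draft = []
        ctx = list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)

        # 2. Verify: the large model checks each drafted token in order.
        #    (In practice all k positions are scored in one batched pass.)
        accepted = 0
        for i, t in enumerate(draft):
            if target_model(tokens + draft[:i]) == t:
                accepted += 1
            else:
                break

        # 3. Keep the verified prefix plus one corrective token from the target.
        tokens = tokens + draft[:accepted]
        tokens.append(target_model(tokens))
        max_new -= accepted + 1   # may overshoot max_new slightly; fine for a sketch
    return tokens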
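Finally, chunked prefill: the compute-bound prefill pass is split into fixed-size chunks so a serving scheduler can interleave memory-bound decode steps from other requests between chunks. The model.prefill(chunk, past_kv=...) interface below is again a hypothetical stand-in.

def chunked_prefill(model, prompt_tokens, chunk_size=512):
    # Build the KV cache for a long prompt one chunk at a time.
    # model.prefill(tokens, past_kv) is a hypothetical interface that extends
    # the KV cache with the given tokens and returns the updated cache.
    kv_cache = None
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        kv_cache = model.prefill(chunk, past_kv=kv_cache)
        # A serving scheduler can run pending decode steps for other requests
        # between chunks, instead of blocking on one long prefill.
    return kv_cache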
LLM Inference Optimizations List
Here’s the list! It’s over 500 and growing!
If you’re reading the e-book version, then the links should be live for each topic. If not, you can see the live list, with links and updates, at this URL: https://www.aussieai.com/blog/llm-inference-optimization
Model compression main subtypes:
- Model compression (overview)
- — Pruning (overview)
- — Quantization (overview)
- — Knowledge Distillation (KD)
- — Parameter sharing (weight sharing)
- — Low-rank matrices
- — Small Language Models (SLMs)
- — Data compression algorithms
Pruning main types:
- Dynamic pruning
- Hybrid pruning
- Unstructured pruning
- Semi-Structured Pruning
- Structured pruning
Layerwise structured pruning subtypes (depth dimension):
- Depthwise structural pruning (overview)
- — Static layer pruning
- — Layer pruning
- — Early exit
- — Dynamic layer pruning
- — Layer skipping
- — Layer approximation
- — Shallow decoder architecture
- — Layer reordering
- — Layer Importance
Width-wise structured pruning subtypes:
- Widthwise structural pruning (overview)
- — Attention head pruning
- — Slimmable networks (width pruning)
- — FFN pruning
- — Channel pruning
- — Filter pruning
Length-wise structured pruning subtypes:
- Lengthwise structural pruning (longitudinal/input/end-to-end)
- — Token pruning (input pruning)
- — Dynamic token pruning
- — Prompt compression
- — Context compression
- — Token merging
- — Token skipping
- — Token dropping
- — Zero padding removal
Model dimension embedding pruning subtypes:
- Embedding-dimension pruning
- — Embedding pruning
- — Embedding matrix compression (embedding pruning)
- — Embedding low-rank matrix factorization
- — Unembedding matrix (output embeddings)
Hybrid multi-dimensional pruning:
- Multi-dimensional pruning
- — Dual pruning
- — Triple pruning
- — Quadruple pruning
- — 3D CNN model pruning
Transformer component pruning:
- Normalization pruning
- Positional embeddings pruning
- Softmax pruning
- Skip connection pruning (residual connection removal)
Unstructured pruning subtypes:
- Unstructured pruning (overview)
- — Magnitude pruning
- — Movement pruning
- — Gradual pruning
Quantization subtypes:
- Post-Training Quantization (PTQ)
- Quantization-Aware Training (QAT)
- Activation Quantization
- Outlier-aware quantization
Integer quantization subtypes:
- Integer quantization (overview)
- — Integer-only arithmetic quantization
- — Fixed-point quantization (integer)
- — Low-bit integer quantization (overview)
- — Binary quantization
- — Ternary quantization
- — 2-bit quantization (INT2)
- — 3-bit quantization (INT3)
- — 4-bit quantization (INT4)
- — 5-bit quantization (INT5)
- — 6-bit quantization (INT6)
- — 7-bit quantization (INT7)
- — 8-bit quantization (INT8)
- — 9-bit quantization (INT9)
- — 10-bit quantization (INT10)
- — 11-bit quantization (INT11)
- — 12-bit quantization (INT12)
- — 16-bit quantization (INT16)
- — 32-bit quantization (INT32)
Floating-point quantization subtypes:
- Floating-point quantization
- — FP4 quantization
- — FP6 quantization
- — FP8 quantization
- — FP16 quantization
- — FP32 quantization
Other quantization subtypes:
- Mixed-precision quantization
- Logarithmic power-of-two quantization (bitshift quantization)
- Double bitshift power-of-two quantization
- Division quantization
- Cluster-based quantization (Weight clustering)
- Hashing-based weight clustering
- Dyadic quantization
- Fake quantization
- Simulated quantization
- Stochastic quantization (probabilistic)
Granularity-level quantization subtypes:
- Granular quantization (overview)
- — Layerwise Quantization
- — Blockwise Quantization
- — Vector quantization
Knowledge distillation subtypes:
- Knowledge Distillation (overview)
- — Ensemble Distillation
- — Unnatural instructions (data sets)
- — Dataset Distillation
Parameter/weight sharing subtypes:
- Parameter/Weight sharing (overview)
- — Activation sharing
- — Layer fusion
- — Clustering (Weights)
- — Attention head fusion
- — FFN fusion
- — KV cache layer fusion (depthwise)
- — KV cache head fusion (widthwise)
Activation function optimizations:
- Activation function optimizations (overview)
- — Activation function approximation
- — Integer-only activation functions
- — Fused activation functions (kernel fusion)
- — Fused RELU
- — Fused GELU
- — Fused SwiGLU
- — Activation alternatives/replacements
- — Activation function pruning/removal (bilinear layers)
- — Activation function reordering
Normalization optimization types:
- Normalization algorithm optimizations (overview)
- — Approximate normalization
- — Norm reordering (pre-norm/post-norm)
- — Integer-only normalization
- — Normalization alternatives/replacements
- — Fused normalization (e.g., "fused LayerNorm" in kernel fusion)
Softmax optimization types:
- Softmax optimizations (overview)
- — Softmax pruning
- — Approximate Softmax
- — Softmax alternatives/replacements
- — Integer-only Softmax
- — Fused Softmax
Feed-Forward Network (FFN) optimization types:
- FFN optimizations (overview)
- — FFN pruning
- — FFN approximation
- — Fused add-bias
- — Bias vector pruning
- — FFN sparsity
- — FFN alternatives/replacements
- — Integer-only FFN
- — Bias optimizations
MatMul/GEMM optimization types:
- MatMul/GEMM kernel optimizations (overview)
- — Faster matrix multiplication (e.g., Winograd, Strassen)
- — Approximate matrix multiplication
- — Transpose cache
- — Fused multiply-add (FMA)
- — Fused transpose
- — Vector dot product optimization
- — Sparse MatMul/GEMM
- — Tiled MatMul
Positional Encoding optimizations:
- Positional encoding optimization (overview)
- — RoPE (Rotary Positional Encoding)
- — Pruning positional encoding (removal/NoPE)
- — Positional encoding approximation
- — Integer-only positional encoding
NAS subtypes:
- Neural Architecture Search (NAS)
- — Dynamic NAS
- — Embedding Size Optimization (embeddings NAS)
Platform-specific optimization subtypes:
- On-device inference (native phone and PC AI)
- AI Phones
- AI PCs (desktops/laptops)
- Edge device inference (IoT/mobile/PC)
- Hybrid cloud-on-device inference
Decoding algorithm subtypes:
- Decoding algorithms (overview)
- — Non-autoregressive decoding
- — Greedy decoding
- — Top-k decoding
- — Top-p decoding
- — Min-P Sampling
- — Flash decoding
- — Beam search decoding
- — Edit decoding
- — Contrastive decoding
- — Approximate top-k algorithms
- — Bidirectional decoding
- — Constrained decoding
Parallel Decoding algorithms:
- Parallel decoding
- — Blockwise parallel decoding
- — n-gram parallel decoding
- — Lookahead decoding
- — Medusa decoding
- — Consensus decoding
- — Mutually-guided decoding
- — Multi-token generation
- — Eagle decoding
Speculative decoding subtypes:
- Speculative decoding (overview)
- — Generalized speculative decoding
- — Aggressive decoding
- — Lookup decoding
- — Retrieval lookup decoding
- — Prompt lookup decoding
- — Self speculative decoding
- — Tree speculative decoding
- — Superposed decoding
- — Hierarchical speculative decoding
- — Heuristic speculative decoding
- — Multi-token speculative decoding
- — Sequential speculative decoding
- — Redrafting
Parameter Efficient Fine-Tuning (PEFT) subtypes:
- PEFT (overview)
- — LoRA
- — Multi-LoRA inference
- — QLoRA (Quantized Low-Rank Adapters)
- — LoRA inference optimizations (load/unload)
- — Prompt Tuning (Extended Vocabulary PEFT)
Ensemble multi-LLM subtypes:
- Ensemble inference (overview of multi-model AI engines)
- — Mixture of Experts (MoE)
- — Model selection algorithms
- — Big-little architectures
- — Cascades
- — Collaborative inference
- — Consensus decoding
- — Swarm ensemble architectures
- — Committee ensemble architectures
- — Ensemble averaging
- — Easy-hard queries
- — Submodels (Many-Models-in-One)
- — Distributed Inference
Orchestration, Deployment and Serving:
- Cloud inference servers
- Orchestration frameworks
- Scheduling optimizations
- Serving
- Load balancing
- Batching
- Continuous batching
- Deployment
- Serverless
- Networking optimizations
- In-flight batching
Attention optimization subtypes:
- Attention optimizations (overview)
- — Multi-Head Attention (MHA)
- — Group Query Attention (GQA)
- — Multi-Query Attention (MQA)
- — Sparse attention
- — Local attention
- — Memory-efficient attention algorithms
- — Flash Attention
- — Paged Attention
- — Linear attention
- — Cross attention
- — Tree attention
- — Sliding window attention
- — Approximate attention heads
- — Attention alternatives/replacements
- — Fused MHA
- — Low-rank matrix attention
- — Medusa attention
- — Block attention
- — Fused head attention
- — Hybrid local-global attention
- — FFT attention
- — QKV computation optimizations
- — Additive attention
- — Multiplicative attention
- — Graph attention
- — Chunked attention
- — Attention sink
- — Attention steering
- — Bilinear attention
- — Attention-free methods
- — Mixture-of-Heads (MOH) Attention (MoE+MHA)
- — Star attention
- — Flex attention
- — Razor attention
- — Contiguous QKV tensor
- — Relative Attention Bias (RAB)
Long context optimizations (attention):
- — Long context models
- — Length generalization
- — Quadratic attention complexity
- — Long RAG
Caching optimizations:
- Caching (overview)
- — Inference Cache (text-to-text)
- — Inference cache (global KV caching)
- — Prompt caching
- — Input Similarity-Based Caching (frame skipping in video)
- — Semantic caching (text-to-text)
- — Semantic KV caching
- — Vector database caching
- — Chatbot caching
- — Vector Caching (Vector hashing)
- — Caching vector dot products
- — Caching general theory
KV cache optimizations:
- KV Caching (overview)
- — KV cache global (multi-query KV caching)
- — KV cache reuse
- — Global semantic KV caching (difficult!)
- — Context cache (global KV caching)
- — Prefix KV Caching
- — KV cache recomputation with early exit
- — Session KV cache (multi-turn KV caching)
- — Substring/fused KV cache (Lengthwise-fused KV caching)
- — Paged KV caching (related to paged attention)
KV cache memory size reduction:
- KV cache compression
- — KV cache quantization
- — KV cache sparsity
- — KV cache token pruning
- — KV cache eviction policies
- — KV cache layer fusion
- — KV cache layer pruning
- — KV Cache low-rank matrix factorization
- — Cyclic KV cache (Rolling buffer KV cache or circular KV cache)
Non-Multiplication AI Models:
- Zero-Multiplication Models (overview)
- — Binary quantization
- — Ternary quantization
- — 2-bit quantization (INT2)
- — Adder networks
- — Bitshift-add networks
- — Bitshift power-of-2 quantization (logarithmic quantization)
- — Double bitshift quantization
- — Add-as-integer networks
- — Logarithmic Models
- — Bitwise neural networks
- — Diff-squared networks
- — Log-sum-exp (LSE) networks
- — Max-Plus networks
- — Min-Max-Plus networks
- — Morphological networks
- — Trigonometric approximate inference
- — Weightless Neural Networks (WNNs)
- — XNOR networks
- — Hadamard elementwise matrix multiplication models
- — Other addition-related zero-multiplication networks
- — Table lookups replace multiplication
- — Other multiplication-free neural networks
Advanced Number System optimizations:
- Advanced Number Systems (overview)
- — Posit number system (PNS)
- — Residue number system (RNS)
- — Dyadic numbers
- — Double-base number system (DBNS)
- — Dynamic number systems
- — Hybrid number systems
- — Tropical algebra (max-plus)
- — MiniMax algebra
- — Multi-dimensional logarithmic number system (MDLNS)
- — Multiple-Base Number System (MBNS)
- — Semi-Logarithmic Number System (SLNS)
- — Lattice algebra
Logarithmic Number System optimizations:
- Logarithmic number system (LNS) (overview)
- — End-to-end LNS logarithmic model
- — LNS addition and subtraction
- — LNS in AI models
- — LNS Hardware Acceleration
- — LNS mathematical and algorithmic theory
- — LNS algebra
- — LNS extensions
Prefill phase optimizations:
- Prefill optimizations (overview)
- — Chunked prefill
- — Disaggregated prefill scheduling (Phase splitting)
- — Deep prefill, shallow decoder architecture
- — Mini-prefill recomputation
Parallel Programming Optimization Techniques:
- Parallelization techniques (overview)
- — Hardware acceleration
- — Hardware-software co-design
- — Vectorization
- — Pipelining (pipeline parallelism)
- — Overlapping (new)
- — Overlapping communications and computation (new)
- — Overlapping rematerialization (new)
- — Overlapping memory access & computation (new)
- — Offloading
- — Partitioning
- — Dataflow optimizations
- — Sharding
- — Data parallelism
- — Query parallelism
- — Tensor parallelism
- — Model parallelism
- — Prefetching
- — Speculative execution
- — Sequence Parallelism
- — Skeleton-of-Thought (Query Parallelism)
Hardware Optimizations:
- Hardware Acceleration (overview)
- — Software accelerations
- — Hardware-software co-design
- — GPU
- — GPU software platforms
- — Multi-GPU
- — CPU Execution
- — Single Instruction Multiple Data (SIMD)
- — AVX (AVX/AVX-2/AVX-512)
- — ARM NEON
- — Neural Processing Unit (NPU)
- — Overclocking CPU
- — Overclocking GPU
- — Assembly language
RAG Architecture Optimizations:
- RAG architectures (overview)
- — RAG cache
- — RAG optimizations
- — RAG retriever datastore indexing
- — Advanced RAG
- — Speculative RAG
- — Reranker in RAG
- — Chunk-specific global KV caching
- — Chunk-specific prefix KV caching
- — RAG Knowledge Graph
Sparsity Optimizations:
- Sparsification techniques (overview)
- — Activation Sparsity
- — Dynamic Sparsity
- — Block sparsity
- — Vector sparsity
- — Tensor sparsity
- — Sparse matrix kernels
- — Outlier-aware sparsification
Memory Utilization Optimizations:
- Memory optimization techniques (overview)
- — Parameter sharing
- — Model compression
- — Low-bit integer quantization
- — Binary quantization
- — Ternary quantization
- — Layer fusion
- — Recomputation: trading time for space
- — Memory-bound versus CPU-bound
- — Data locality optimization
- — Compute-in-Memory (CIM) architectures
- — Memory cache management algorithms
- — Kernel operator fusion
- — Flash Inference (FlashInfer)
- — Checkpointing
- — Offloading
Numerical representation subtypes:
- Floating-point representations (overview)
- — Floating Point Bit Tricks
- — Block floating-point arithmetic
- — Fixed point number system (FXP) optimizations
- — Floating point number system (FLP) optimizations
- — Floating point bitwise arithmetic
- — FTZ/DAZ floating point CPU settings
Kernel optimizations:
- Kernel optimizations (overview)
- — Kernel operator fusion (merging)
- — Kernel fission (splitting)
- — Kernel tiling
- — Operator reordering
- — Graph operator fusion (Deep learning compilers)
Computation optimizations:
- Advanced AI Mathematics
- Approximate activation functions
- Caching / memoization
- Computation reuse
- Precomputation
- Source code precomputation
- Conditional computation
- Approximations
- Integer-only arithmetic quantization
- Weight precomputations
- Zero-skipping
- — Low-Level Zero Skipping
- — High-Level Zero Skipping
- Negative skipping
- Approximate caching
- End-to-End integer inference
- Padding usage
- Incremental inference (new)
Arithmetic optimizations:
- Integer operations
- Addition optimizations
- Bitwise operation tricks
- Approximate addition
- Multiplication algorithms
- Approximate division
- Approximate multiplication
- Bitwise operator inference
- Bitserial operations
- Division optimizations
- Logarithmic approximate multiplication
- Integer Dot Product
- Vector dot product optimization
Advanced matrix algebra optimizations:
- Matrix Algebra (overview)
- — Approximate matrix multiplication
- — Butterfly matrices
- — Monarch matrices
- — Sparse matrices (sparsification)
Low-rank matrix optimizations:
- Low-rank matrix factorization (overview)
- — Tensor decomposition
- — Tucker decomposition
- — Embedding low-rank matrix factorization
- — KV Cache low-rank matrix factorization
Transformer architectural optimizations:
- Transformer architectures (overview)
- — Transformer low-level optimizations (overview)
- — Adaptive Inference
- — Integer-only Transformers
- — Approximate Transformers
- — Decoder-Only Architectures
- — Encoder-Only Architectures
- — Encoder-Decoder Architectures
Transformers and LLMs:
- Open source models
- Inference frameworks
- Open source frameworks
Next-Generation Transformer architectures:
- Next-generation architectures (overview)
- — Hybrid Transformer architectures
- — Newer Transformer architectures
- — BERT (encoder)
- — State Space Models (SSMs)
- — Mamba
- — RWKV
- — Knowledge graph AI architectures
- — Compound AI architectures
General Classes of Optimization Techniques:
- Dynamic inference (adaptive inference)
- Skipping
- Heuristics
- Probabilistic optimizations
- Approximate computing
- Code optimizations
- Deep learning compilers
- Incremental algorithms
- Fuzzy logic
Loop Optimizations:
- Loop optimizations (overview)
- — Inference loop optimizations
- — Loop fusion (merging loops)
- — Loop unrolling
- — Loop perforation
- — Loop reordering
- — Loop tiling
- — Loop reversal
- — Loop fission (splitting a loop)
- — Loop interleave
- — Loop interchange
- — Loop coalescing
- — Loop-invariant code motion ("hoisting")
- — Loop distribution
- — Pointer arithmetic
- — Loop peeling (unrolling first iterations)
- — Loop splitting
- — Loop sentinel
- — Loop collapsing
- — Loop normalization
- — Loop strip mining (Loop sectioning)
- — Loop skewing
- — Loop spreading
Low-Level Coding Efficiency:
- Code optimizations (overview)
- — Constant folding
- — Common subexpression elimination
- — Algebraic identities
- — Strength reduction
- — Type consistency
- — Reciprocal multiplication
- — References vs pointers
- — Compile-time optimizations
- — Pointer arithmetic
- — Algorithm-level optimizations
- — Lazy evaluation
- — Memory reduction heuristics
Data Structures for AI optimization:
- — Hashing
- — Perfect hashing
- — Look-up tables (LUTs)
- — Bloom filters
- — Trees
- — Tries
- — Bitserial operations
- — Permutation arrays
Vector Data Structures:
- — Parallel data structures
- — Bit vectors
- — Vector hashing
- — Locality-Sensitive Hashing (LSH)
- — Vector dot product caching
- — Bit signatures (vector algorithm)
- — K-means clustering (vector algorithm)
- — Hyper-Cube (vector algorithm)
Convolution Optimizations in CNNs:
- Convolution optimizations (overview)
- — Grouped convolutions
- — Depth-wise separable convolutions
Tokenization and Vocabulary Optimizations:
- Tokenization (overview)
- — Tokenizer and model inference latency
- — Semantic tokenization
- — Tokenization for Machine Vision
- — Tokenization of non-English languages
- Vocabulary optimizations:
- — Vocabulary size
- — Lexical shortlisting
- — Vocabulary trimming
- — Vocabulary expansion
- — Dynamic vocabulary pruning
Overall summaries of AI optimizations:
- — Deslugging AI engines
- — Accuracy-degrading optimizations
- — Accuracy-retaining optimizations
- — Uncommon inference optimizations