Chapter 3. Faster RAG
Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
by David Spuler and Michael Sharpe
RAG Speed Optimizations
Firstly, RAG architectures are themselves inherently an optimization. RAG was created because fine-tuning was too expensive and had various other limitations (e.g., poor attribution and explainability), although Parameter-Efficient Fine-Tuning (PEFT) techniques have since attacked the inefficiencies of fine-tuning, so maybe it's a tie between RAG and FT/PEFT.
Secondly, you can further optimize your RAG architecture. To start with, many of the major LLM optimizations also work on the RAG LLM, so there are many ways to do this (e.g., quantization, pruning, inference optimizations, etc.), and they can all be used on the LLM underneath the overall RAG architecture.
Furthermore, there are a few techniques that are specifically applicable to RAG architectures, not just LLMs. The main RAG-specific speedups include optimizations to either:
(a) non-LLM RAG components, or
(b) the RAG prompt tokens.
Some examples of RAG non-LLM optimizations include:
- RAG vector database speedups (e.g., indexing, all the usual database stuff).
- Keyword versus vector lookups in the retriever (e.g., hybrid keyword-vector search, metadata search, etc.; see the hybrid search sketch after this list).
- Caching — multiple types (e.g., caching in the retriever versus the LLM parts).
- FAQs — serve known questions directly from a list of frequently asked questions and their pre-written answers, although the LLM still needs to be involved to seed the conversation history.
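As a concrete illustration of the retriever-side ideas above, here is a minimal Python sketch of hybrid keyword-vector search that fuses two rankings with reciprocal rank fusion (RRF). The keyword and vector scoring functions are toy stand-ins (term overlap and term-count cosine similarity); a real retriever would use BM25 for the keyword side and a trained embedding model plus a vector index for the dense side.

```python
# Minimal sketch of hybrid keyword + vector retrieval with reciprocal
# rank fusion (RRF). The scoring functions are toy stand-ins only.
from collections import Counter
from math import sqrt

def keyword_score(query: str, doc: str) -> float:
    """Toy keyword relevance: count of shared terms."""
    q_terms = Counter(query.lower().split())
    d_terms = Counter(doc.lower().split())
    return float(sum((q_terms & d_terms).values()))

def vector_score(query: str, doc: str) -> float:
    """Toy 'dense' relevance: cosine similarity of term-count vectors."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def rrf_fuse(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Reciprocal rank fusion: combine several ranked lists of doc ids."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, docs: list[str], top_k: int = 3) -> list[int]:
    kw_rank = sorted(range(len(docs)), key=lambda i: keyword_score(query, docs[i]), reverse=True)
    vec_rank = sorted(range(len(docs)), key=lambda i: vector_score(query, docs[i]), reverse=True)
    return rrf_fuse([kw_rank, vec_rank])[:top_k]

if __name__ == "__main__":
    docs = [
        "RAG retrieves text chunks from a vector database.",
        "Quantization shrinks model weights to fewer bits.",
        "Hybrid search combines keyword and vector lookups.",
    ]
    print(hybrid_search("keyword and vector search", docs))
```

The fusion step is the key design choice: RRF only needs the two rank orderings, not comparable scores, which is why it is a popular way to merge keyword and vector results without tuning score weights.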
There are also some RAG-specific techniques on the “length” dimension (i.e., input tokens) that apply to an input prompt extended with extra prepended “context” tokens from the RAG chunks. Some examples include:
- Chunk compression (e.g., chunk pre-summarization)
- Prompt compression
- Context compression
- Prompt lookup decoding (an extension of speculative decoding; a sketch of the candidate lookup follows this list)
- Prefix global KV cache
- Precomputed KV cache (for each RAG chunk)
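To make prompt lookup decoding more concrete, here is a minimal sketch of its candidate-proposal step, assuming token ids are plain Python integers. The function looks for the most recent occurrence of the last few generated tokens inside the (long, RAG-augmented) prompt and copies the tokens that followed that match as draft tokens; the verification forward pass that accepts or rejects the draft, as in speculative decoding, lives inside the inference engine and is not shown.

```python
# Minimal sketch of the candidate-proposal step in prompt lookup decoding.
# Token ids are plain ints here; a real implementation would operate on
# tokenizer output and run the draft-verification pass in the engine.

def prompt_lookup_candidates(prompt_ids: list[int],
                             generated_ids: list[int],
                             ngram_size: int = 3,
                             max_draft: int = 5) -> list[int]:
    """Return draft tokens copied from the prompt after a matching n-gram."""
    if len(generated_ids) < ngram_size:
        return []
    pattern = generated_ids[-ngram_size:]
    # Scan the prompt for the most recent occurrence of the n-gram.
    for start in range(len(prompt_ids) - ngram_size, -1, -1):
        if prompt_ids[start:start + ngram_size] == pattern:
            follow = prompt_ids[start + ngram_size:start + ngram_size + max_draft]
            if follow:
                return follow
    return []  # no match: fall back to ordinary one-token-at-a-time decoding

# Example: the model has just emitted tokens that also appear in a RAG chunk,
# so the next few prompt tokens become the draft.
prompt = [10, 11, 12, 13, 14, 15, 16, 17]
generated = [99, 12, 13, 14]
print(prompt_lookup_candidates(prompt, generated))   # -> [15, 16, 17]
```

This works particularly well for RAG because the model often quotes or paraphrases the retrieved chunks, so the draft tokens copied from the prompt are frequently accepted.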
RAG is not the only architecture to use prepended context. For example, chatbots prepend the conversation history, so many of these approaches apply there too.
Types of RAG Speed Optimizations
Optimizing the latency of an LLM system using Retrieval Augmented Generation (RAG) can be achieved in various ways. The main approaches are:
- RAG component optimizations
- General LLM inference optimizations
- Text-to-text caching (“inference cache”; see the caching sketch after this list)
- Global KV caching methods (general)
- RAG-specific KV caching
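As a sketch of text-to-text caching, the class below stores final responses keyed by a normalized prompt string, so an identical (or trivially re-spaced) question skips the LLM call entirely. The `call_llm` argument is a placeholder for whatever model or API the application actually uses; a production cache would also need size limits, expiry, and care around per-user context.

```python
# Minimal sketch of a text-to-text inference cache: identical (or trivially
# normalized) prompts skip the LLM entirely and return the stored answer.
import hashlib

class InferenceCache:
    def __init__(self):
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        normalized = " ".join(prompt.lower().split())  # normalize case/whitespace
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_generate(self, prompt: str, call_llm) -> str:
        key = self._key(prompt)
        if key not in self._store:
            self._store[key] = call_llm(prompt)   # cache miss: run the LLM
        return self._store[key]                   # cache hit: no LLM call

# Usage with a stubbed-out model call:
cache = InferenceCache()
answer = cache.get_or_generate("What is RAG?", lambda p: "stub answer for: " + p)
repeat = cache.get_or_generate("what is  RAG? ", lambda p: "never called")
assert answer == repeat
```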
Every component in a RAG application architecture is critical to the overall latency as seen by the user. Hence, we can look at optimizing any of the main components:
- Retriever — use any of the well-known vector database optimizations, such as indexes and embeddings caching (a small embeddings caching sketch follows this list).
- Serving stack — general web architectural optimizations.
- LLM optimizations — various well-known industry and research approaches.
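For the retriever item above, here is a minimal sketch of embeddings caching: repeated queries reuse a previously computed query embedding instead of re-running the embedding model. Both the embedding function and the `vector_db.search` call are hypothetical placeholders for whatever embedding model and vector database the application actually uses.

```python
# Minimal sketch of embeddings caching in the retriever: repeated queries
# reuse the previously computed query embedding instead of re-running the
# embedding model. The embedding function and vector DB API are placeholders.
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_query_embedding(query: str) -> tuple[float, ...]:
    # Placeholder embedding; a real retriever would call its embedding model here.
    return tuple(float(ord(c)) for c in query[:8])

def retrieve(query: str, vector_db) -> list[str]:
    embedding = cached_query_embedding(query)      # cache hit on repeated queries
    return vector_db.search(embedding, top_k=5)    # assumed vector DB interface
```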
LLM Inference Optimization
Optimizing LLM inference is a well-known problem, with literally thousands of research papers, and we have a list of 500 inference optimization techniques (see the Appendix).
The first point about optimizing the LLM in a RAG application is that most of the general LLM optimization ideas apply. You’re probably familiar with the various techniques:
- Buy a better GPU (or rent one)
- Quantization
- Pruning
- Small Language Models (SLMs)
Using smaller models, via either model compression or SLMs, is an optimization that is particularly relevant to RAG applications. The goal with RAG’s use of an LLM is not to have all the answers pre-trained into the model parameters, but rather to have the LLM rely on the text chunks in its input prompt. Hence, RAG applications don’t need massive LLMs, and it might even be preferable to use a smaller model, as it avoids conflict between trained knowledge and the content of the RAG text chunks.
In addition to changing the model itself, the software kernels in the inference engine can be optimized. There are various inference software improvements that apply to a RAG LLM just as well as they apply to any other LLM:
- Attention algorithm optimizations — e.g., Flash attention, Paged attention.
- Speculative decoding (parallel decoding; a toy accept/verify sketch follows this list)
- Caching (inference caches, KV caching)
- Matrix multiplication kernels (GEMM/GEMV/MatMul)
- Feed-Forward Network (FFN) optimizations
- Sparsification (static or dynamic)
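To illustrate speculative decoding from the list above, here is a toy greedy accept/verify loop in plain Python. Both the draft and target models are stand-in functions over integer token lists; a real engine would verify all drafted positions in a single batched forward pass of the target LLM (which is where the speedup comes from) and would handle sampling rather than greedy matching.

```python
# Toy sketch of the speculative decoding accept/verify loop (greedy variant):
# a cheap draft proposes several tokens, the target model checks them, and
# the agreed prefix is accepted in one round.

def speculative_step(context: list[int],
                     draft_next_tokens,      # fn(context, k) -> list[int]
                     target_next_token,      # fn(context) -> int
                     k: int = 4) -> list[int]:
    """Return the tokens accepted in one draft-and-verify round."""
    draft = draft_next_tokens(context, k)
    accepted: list[int] = []
    for token in draft:
        check = target_next_token(context + accepted)  # batched in a real engine
        if check != token:
            accepted.append(check)   # target disagrees: use its token and stop
            return accepted
        accepted.append(token)       # target agrees: keep the drafted token
    # All k draft tokens accepted: the target's next prediction is a free bonus.
    accepted.append(target_next_token(context + accepted))
    return accepted

# Toy usage: the "draft" copies from a fixed sequence; the "target" agrees
# on the first two tokens only.
seq = [1, 2, 3, 4, 5, 6]
draft_fn = lambda ctx, k: seq[len(ctx):len(ctx) + k]
target_fn = lambda ctx: seq[len(ctx)] if len(ctx) < 2 else 0
print(speculative_step([], draft_fn, target_fn))   # -> [1, 2, 0]
```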
Maybe let’s not list all 500 of the techniques here. Well, actually, the full list is to be found in the Appendix!
RAG Speed Optimization Research Papers
Research papers on optimization of RAG architectures:
- Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457
- Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444 Code: https://github.com/YaoJiayi/CacheBlend.git (Generalizes prefix KV caching to KV cache fusion with selective recomputation of some KV cache data.)
- Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 7 May 2024, FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065 (Optimize RAG by appending rather than prepending documents, and modifying the attention for improvements in KV caching, by shimming or replacing some of the CUDA GPU low-level memory management APIs to avoid the need to rewrite kernels with extra higher-level memory management code.)
- Priyank Rathod, May 21, 2024, Efficient Usage of RAG Systems in the World of LLMs, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171625877.73379410/v1
- Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, Jindong Chen, 25 May 2024, Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection, https://arxiv.org/abs/2405.16178
- Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 1 Jul 2024, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219 Project: https://github.com/FudanDNN-NLP/RAG (Attempts to optimize the entire RAG system, including the various options for different RAG modules in the RAG pipeline, such as optimal methods for chunking, retrieval, embedding models, vector databases, prompt compression, reranking, repacking, summarizers, and other components.)
- Dr. Ashish Bamania, Jun 18, 2024, Google’s New Algorithms Just Made Searching Vector Databases Faster Than Ever: A Deep Dive into how Google’s ScaNN and SOAR Search algorithms supercharge the performance of Vector Databases, https://levelup.gitconnected.com/googles-new-algorithms-just-made-searching-vector-databases-faster-than-ever-36073618d078
- Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, Chen-Yu Lee, Tomas Pfister, 11 Jul 2024, Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting, https://arxiv.org/abs/2407.08223
- Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami, 11 Jul 2024, Characterizing Prompt Compression Methods for Long Context Inference, https://arxiv.org/abs/2407.08892
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Eric Yang, Jonathan Amar, Jong Ha Lee, Bhawesh Kumar, Yugang Jia, 25 Jul 2024, The Geometry of Queries: Query-Based Innovations in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.18044
- Vahe Aslanyan, June 11, 2024, Next-Gen Large Language Models: The Retrieval-Augmented Generation (RAG) Handbook, https://www.freecodecamp.org/news/retrieval-augmented-generation-rag-handbook/
- Thomas Merth, Qichen Fu, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024 (v2), Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation, https://arxiv.org/abs/2404.06910 (Process each RAG chunk in parallel and choose a final output.)
- Zhi Jing, Yongye Su, Yikun Han, Bo Yuan, Haiyun Xu, Chunjiang Liu, Kehai Chen, Min Zhang, 6 Feb 2024 (v2), When Large Language Models Meet Vector Databases: A Survey, https://arxiv.org/abs/2402.01763
- Anthropic, 20 Sept 2024, Introducing Contextual Retrieval, https://www.anthropic.com/news/contextual-retrieval
- David Spuler, September 26, 2024, RAG Optimization via Caching, https://www.aussieai.com/blog/rag-optimization-caching
- Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343
- Zhangchi Feng, Dongdong Kuang, Zhongyuan Wang, Zhijie Nie, Yaowei Zheng, Richong Zhang, 15 Oct 2024 (v2), EasyRAG: Efficient Retrieval-Augmented Generation Framework for Automated Network Operations, https://arxiv.org/abs/2410.10315 https://github.com/BUAADreamer/EasyRAG
- Tolga Şakar and Hakan Emekci, 30 October 2024, Maximizing RAG efficiency: A comparative analysis of RAG methods, Natural Language Processing. doi:10.1017/nlp.2024.53, https://www.cambridge.org/core/journals/natural-language-processing/article/maximizing-rag-efficiency-a-comparative-analysis-of-rag-methods/D7B259BCD35586E04358DF06006E0A85 https://www.cambridge.org/core/services/aop-cambridge-core/content/view/D7B259BCD35586E04358DF06006E0A85/S2977042424000530a.pdf/div-class-title-maximizing-rag-efficiency-a-comparative-analysis-of-rag-methods-div.pdf
- Sarayavalasaravikiran, Nov 2024, Optimizing RAG with Embedding Tuning, https://ai.plainenglish.io/optimizing-rag-with-embedding-tuning-2508af2ec049