Aussie AI
Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
by David Spuler and Michael Sharpe
Chapter 5. RAG Architecture Optimizations
Overview of RAG Components
A typical RAG application has a number of components, wrapped around the main LLM and its inference engine. Each of these components can be optimized for either speed or smartness or maybe even both if we’re lucky. Some of the main RAG-specific components include:
- RAG orchestrator component (running the whole show)
- Retriever (two types)
- Embedding vector generator
- Vector database
- Keyword datastore
- Reranker
- Packer (sometimes called the “augmenter”)
A more fully-fledged RAG application backend would have some additional components:
- Prompt shield — block jailbreaks and other user trickiness.
- Error handler — sometimes things go awry.
- Citation manager — show references for “explainability” of answers.
- Image results handler — show pretty product pictures to cashed-up users.
- Observability — add logging, monitoring, and instrumentation for MLOps.
- Security credential management — allow users to log in!
Some of the possible latency optimizations achieved by adding extra RAG components include (the semantic cache is sketched after this list):
- Inference cache (text-to-text or text-to-KV cache)
- Semantic cache (text-to-text)
- Prefix caching
- Context token compression (chunk compression)
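As a concrete illustration of the semantic cache idea above, here is a minimal Python sketch. The embedding function is a placeholder (any text-to-vector model), the 0.95 threshold is an illustrative tuning value, and the brute-force cosine-similarity scan would be replaced by a vector index in a real system.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    """Text-to-text cache keyed on query embeddings rather than exact strings."""
    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn = embed_fn        # placeholder: any text-to-vector function
        self.threshold = threshold      # similarity required to count as a hit
        self.entries = []               # list of (embedding, cached response) pairs

    def lookup(self, query):
        qvec = self.embed_fn(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine_similarity(qvec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query, response):
        self.entries.append((self.embed_fn(query), response))

# Usage: call lookup() before the LLM; on a cache miss, call the LLM and then store().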
Reranker Component
The reranker component aims to select and prioritize the best chunks for the LLM to use. The basic idea is:
- Retriever returns several chunks
- Reranker orders them in priority of relevance
- Packer merges the chunks with the user’s query and other global instructions
- One final LLM request answers the user’s question
Why do we need a reranker? After all, doesn’t the vector database return a list of chunks sorted in order of relevance? Yes, but the reranker is useful for:
- Reconciling the two different rankings returned by the keyword lookup and the vector database (one simple fusion method is sketched after this list).
- Offering an extra point to optimize accuracy of the system.
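One simple way to do that reconciliation, not mandated by any particular RAG framework, is reciprocal rank fusion. A minimal sketch, assuming each retriever returns an ordered list of chunk IDs:

def reciprocal_rank_fusion(keyword_ranking, vector_ranking, k=60):
    """Merge two ranked lists of chunk IDs; each chunk scores 1/(k + rank) per list."""
    scores = {}
    for ranking in (keyword_ranking, vector_ranking):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Chunk "c2" ranks well in both lists, so it comes out on top.
print(reciprocal_rank_fusion(["c1", "c2", "c3"], ["c2", "c4", "c1"]))
# -> ['c2', 'c1', 'c4', 'c3']

More accurate (and more expensive) rerankers score each chunk against the query with a model, as in the papers below.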
Here are some research papers specific to the reranker component:
- Vahe Aslanyan, June 11, 2024, Next-Gen Large Language Models: The Retrieval-Augmented Generation (RAG) Handbook, https://www.freecodecamp.org/news/retrieval-augmented-generation-rag-handbook/
- Benjamin Clavié, 30 Aug 2024, rerankers: A Lightweight Python Library to Unify Ranking Methods, https://arxiv.org/abs/2408.17344 https://arxiv.org/pdf/2408.17344
- Vivedha Elango, Sep 2024, Search in the age of AI- Retrieval methods for Beginners, https://ai.gopubby.com/search-in-the-age-of-ai-retrieval-methods-for-beginners-557621e12ded
- Zhangchi Feng, Dongdong Kuang, Zhongyuan Wang, Zhijie Nie, Yaowei Zheng, Richong Zhang, 15 Oct 2024 (v2), EasyRAG: Efficient Retrieval-Augmented Generation Framework for Automated Network Operations, https://arxiv.org/abs/2410.10315 https://github.com/BUAADreamer/EasyRAG
- Rama Akkiraju, Anbang Xu, Deepak Bora, Tan Yu, Lu An, Vishal Seth, Aaditya Shukla, Pritam Gundecha, Hridhay Mehta, Ashwin Jha, Prithvi Raj, Abhinav Balasubramanian, Murali Maram, Guru Muthusamy, Shivakesh Reddy Annepally, Sidney Knowles, Min Du, Nick Burnett, Sean Javiya, Ashok Marannan, Mamta Kumari, Surbhi Jha, Ethan Dereszenski, Anupam Chakraborty, Subhash Ranjan, Amina Terfai, Anoop Surya, Tracey Mercer, Vinodh Kumar Thanigachalam, Tamar Bar, Sanjana Krishnan, Samy Kilaru, Jasmine Jaksic, Nave Algarici, Jacob Liberman, Joey Conway, Sonu Nayyar, Justin Boitano, 10 Jul 2024, FACTS About Building Retrieval Augmented Generation-based Chatbots, NVIDIA Research, https://arxiv.org/abs/2407.07858
- Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)
- Y Huang, T Gao, J Zhang, X Liu, G Wang, 2024, Adapting Large Language Models for Biomedicine though Retrieval-Augmented Generation with Documents Scoring, 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2024, pages 5770-5775, DOI: 10.1109/BIBM62325.2024.10822725, https://www.computer.org/csdl/proceedings-article/bibm/2024/10822725/23oodpoidfq (Using an LLM-based reranker for medical research documents.)
- MS Tamber, R Pradeep, J Lin, Jan 2025, LiT and Lean: Distilling Listwise Rerankers into Encoder-Decoder Models, https://cs.uwaterloo.ca/~jimmylin/publications/Tamber_Lin_ECIR2025.pdf
- Bharani Subramaniam, 13 February 2025, Emerging Patterns in Building GenAI Products, https://martinfowler.com/articles/gen-ai-patterns/
- Tanay Varshney, Annie Surla, Nave Algarici, Isabel Hulseman and Cherie Wang, Mar 06, 2025, How Using a Reranking Microservice Can Improve Accuracy and Costs of Information Retrieval, https://developer.nvidia.com/blog/how-using-a-reranking-microservice-can-improve-accuracy-and-costs-of-information-retrieval/
- Ghadir Alselwi, Hao Xue, Shoaib Jameel, Basem Suleiman, Flora D. Salim, Imran Razzak, 19 Mar 2025, Long Context Modeling with Ranked Memory-Augmented Retrieval, https://arxiv.org/abs/2503.14800
- Jiashuo Sun, Xianrui Zhong, Sizhe Zhou, Jiawei Han, 16 May 2025 (v2), DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation, https://arxiv.org/abs/2505.07233 https://github.com/GasolSun36/DynamicRAG
Packer
The packer component in a RAG application takes the user’s query and the RAG chunks and “packs” them according to a prompt template. The output from the packer is the final input text that is sent to the LLM. The usual order of the final text is:
- Global instructions (e.g., “please use only the following data for answering the question.”)
- Chunks (1 or more)
- User question
The general guidance for optimizing the packing phase includes:
- Put the user’s query last.
- Order the chunks in reverse (worst-to-best).
- Put the global instructions first (as a fixed prefix).
- Don’t insert the user’s query into the global instructions.
Putting the user’s question last helps in both accuracy and speed. With the question at the end of the prompt, it’s “closest” to the newly output answer tokens, which helps the LLM pay attention to the question when creating more answer tokens in the decoding phase.
Leaving the user’s question till the end also ensures a common prefix for all RAG queries, which enables the “prefix caching” optimization. Putting the user’s query first would ruin that caching speedup. Note that a higher-level “inference cache” of text-to-text cached query-response results, for optimizing recurring user queries, is a separate optimization, and that idea works regardless of the ordering, because it’s at the top level. Prefix caching is an optimization that helps latency for all user queries, without needing them to be the same or even similar.
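The actual KV-cache reuse happens inside the inference engine, but a rough sketch of the keying logic shows why a fixed prefix matters. Here compute_prefill(text, cached_state=None) is a placeholder for the engine's prefill step, not a real API:

import hashlib

prefix_cache = {}   # hash of the fixed prefix -> precomputed prefill state (placeholder)

def prefill_with_prefix_cache(prompt, fixed_prefix, compute_prefill):
    """Reuse precomputed state for the shared fixed prefix across all user queries."""
    if not prompt.startswith(fixed_prefix):
        return compute_prefill(prompt, None)      # query-first prompts get no reuse
    key = hashlib.sha256(fixed_prefix.encode()).hexdigest()
    if key not in prefix_cache:
        prefix_cache[key] = compute_prefill(fixed_prefix, None)   # pay the prefill cost once
    remainder = prompt[len(fixed_prefix):]
    return compute_prefill(remainder, prefix_cache[key])          # only the new tokens are processed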
Ordering of the RAG chunks also matters for both accuracy and speed. The packer has a list of chunks, which have been ranked by the retriever and the “reranker” component. But how should they be packed: best to worst (sorted), worst to best (reverse), or some other ordering?
There is some “best practices” research suggesting that the accuracy-optimal algorithm for packing is “reverse,” which means putting the highest-ranked chunk last. This places the best chunk nearest to the user’s question, which helps attention. Note that this somewhat contradicts other research showing that LLMs are generally best at finding information at the start of a sequence, next-best at the end, while information in the middle is a muddle.
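Putting this guidance together, a minimal packer sketch might look like the following. The instruction text is illustrative, and it assumes the reranker supplies chunks already sorted best-first:

GLOBAL_INSTRUCTIONS = (
    "Please use only the following data for answering the question.\n"
    "If you do not know the answer, respond with 'I do not know.'")

def pack_prompt(chunks_best_first, user_question):
    """Fixed global instructions first, chunks worst-to-best, user question last."""
    ordered = list(reversed(chunks_best_first))   # best chunk ends up nearest the question
    parts = [GLOBAL_INSTRUCTIONS]                 # fixed prefix enables prefix caching
    parts.extend(ordered)
    parts.append("Question: " + user_question)
    return "\n\n".join(parts)

print(pack_prompt(
    ["Best chunk about the returns policy.", "Less relevant chunk."],
    "What is your returns policy?"))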
Chunk ordering also affects latency. Firstly, the number of chunks and the total token count significantly affect LLM speed, so perhaps the packer could prefer shorter chunks, or even compress them to fewer words. Configuring the retriever and/or reranker to return fewer chunks is an important optimization. Prefix caching is also affected by the packer’s choice of ordering, but this optimization doesn’t work as well for chunks because their ordering can change, so there’s not always the same prefix (unlike global instructions, which are fixed text).
Cache-aware chunk ordering is probably not desirable, as it sacrifices accuracy for speed. There is a “cache-aware prefix caching” speed optimization for RAG chunks, whereby the packer puts any chunks for which it already has a cached prefix at the front of the chunk sequence, to maximize cache usage. Although faster, this undermines the whole point of the reranker, which focuses on presenting the most relevant information to the LLM for the best accuracy.
The overall prompt template used by the packer also affects both accuracy and speed. Good prompt engineering helps with RAG accuracy and is a very important part of designing a RAG system (see Chapter 7). However, the size and layout of the prompt instructions can also affect speed. If the template is too long, then the LLM processes a lot of extra tokens for every user query, although this is alleviated somewhat by prefix caching if the text is fixed. Similarly, the idea of inserting the user’s question twice, at both the start and end of the prompt template, might help accuracy, but the changing query at the front ruins the prefix caching speed optimization.
Some of this research on text ordering may need to be re-done. Newer models are much better at handling “long context” inputs with a high level of accuracy, so the guidance to put the best information at the end (reverse chunk ordering), or near the beginning, may no longer matter for some advanced models, and probably won’t matter at all for any models in the longer term.
Keyword Lookup Datastore
The keyword lookup component is a datastore mapping keywords to RAG chunks. It is somewhat optional, as a RAG architecture can make do with only a vector database component, using only embeddings-based search. However, combining keyword search with vector database search can lead to more accurate results.
The area of searching text for keywords has been around for many decades, usually called “information retrieval.” Some of the better-known algorithms for looking up text, and for the related problem of similarity search, include:
- Term Frequency—Inverse Document Frequency (TF-IDF)
- Okapi Best Match 25 (BM25)
- Facebook AI Similarity Search (FAISS)
- Navigable Small World (NSW)
- Hierarchical Navigable Small World (HNSW)
The accuracy of a keyword lookup datastore is integral to the overall accuracy of the RAG system. Both the keyword lookup and the vector database lookup need to perform well in order to give the LLM the most relevant input data.
Latency can also be affected by adding a keyword datastore, as it adds another step in the RAG sequence. Optimizing the speed of the keyword datastore component involves choices such as:
- Comparing multiple keyword datastores
- Parallel execution of keyword and vector lookups
- Datastore indexes
- Caching
The keyword datastore needs to ensure good performance in terms of both accuracy and speed as the number of stored chunks increases. This needs to be tested and tuned as part of the initial build of the RAG system, and also on an ongoing basis as more data is added to the datastore.
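As an example of the parallel-execution option above, one lookup's latency can be hidden behind the other. A sketch using Python's standard thread pool, with placeholder search functions standing in for the real keyword datastore and vector database calls:

from concurrent.futures import ThreadPoolExecutor

def keyword_search(query):          # placeholder for the keyword datastore lookup
    return ["chunk-k1", "chunk-k2"]

def vector_search(query):           # placeholder for the vector database lookup
    return ["chunk-v1", "chunk-k1"]

def hybrid_retrieve(query):
    """Run the keyword and vector lookups concurrently to hide one lookup's latency."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        keyword_future = pool.submit(keyword_search, query)
        vector_future = pool.submit(vector_search, query)
        return keyword_future.result(), vector_future.result()

keyword_hits, vector_hits = hybrid_retrieve("returns policy")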
BM25 Keyword Lookup Algorithm
As an example of the keyword lookup component, let’s examine the “Best Match 25” or BM25 algorithm. This is a longstanding keyword search algorithm that predates much of the modern AI work, and is simpler to understand than embedding-based vector search. Keyword search algorithms such as BM25 can be used as RAG keyword retrievers.
At a high level, BM25 consists of the following steps:
1. Process the query to remove the small words (e.g., “the,” “and,” “this,” “that,” etc.), leaving the core words of the query.
2. Use a reverse index to find all the documents (chunks) that match each word.
3. Score and rank these chunks or documents.
4. Return the top-ranked chunks as the keyword lookup results. (Despite the name, the “25” in BM25 refers to the version of the weighting scheme, not the number of results returned.)
The score for ranking is based on:
- How many of the query terms appear in a chunk/document.
- How frequently each word appears in the chunk/document.
- How rare each word is across all documents.
- Length of the chunk/document.
All of these statistics can be brute-force precalculated for each word, so only the combination of each metric needs to be calculated at runtime. Presumably, other weights could also be added to the calculation; for example, your knowledge base may have a domain-specific vocabulary whose terms have a stronger affinity to particular documents.
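A toy BM25 retriever, for illustration only, might look like the following. The stopword list, whitespace tokenization, and the conventional defaults k1=1.5 and b=0.75 are all simplifications of what a production keyword datastore would do:

import math
from collections import Counter, defaultdict

STOPWORDS = {"the", "and", "this", "that", "a", "an", "of", "to", "is", "it", "what"}

class BM25Index:
    """Tiny in-memory BM25 retriever: inverted index plus the classic scoring formula."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [[w for w in doc.lower().split() if w not in STOPWORDS]
                     for doc in documents]
        self.doc_freq = defaultdict(int)          # how many documents contain each term
        self.inverted = defaultdict(set)          # term -> set of document ids
        for doc_id, terms in enumerate(self.docs):
            for term in set(terms):
                self.doc_freq[term] += 1
                self.inverted[term].add(doc_id)
        self.avg_len = sum(len(d) for d in self.docs) / max(len(self.docs), 1)

    def score(self, query_terms, doc_id):
        terms = self.docs[doc_id]
        counts = Counter(terms)
        score = 0.0
        for term in query_terms:
            tf = counts[term]                     # how often the term appears in this document
            if tf == 0:
                continue
            n_t = self.doc_freq[term]             # how many documents contain the term (rarity)
            idf = math.log((len(self.docs) - n_t + 0.5) / (n_t + 0.5) + 1.0)
            norm = tf + self.k1 * (1 - self.b + self.b * len(terms) / self.avg_len)
            score += idf * (tf * (self.k1 + 1)) / norm
        return score

    def search(self, query, top_k=10):
        query_terms = [w for w in query.lower().split() if w not in STOPWORDS]
        candidates = set()
        for term in query_terms:                  # step 2: reverse index lookup
            candidates |= self.inverted.get(term, set())
        ranked = sorted(candidates,
                        key=lambda d: self.score(query_terms, d), reverse=True)
        return ranked[:top_k]                     # steps 3 and 4: score, rank, return top hits

index = BM25Index(["The returns policy allows refunds within 30 days.",
                   "Shipping is free for orders over fifty dollars."])
print(index.search("what is the returns policy"))   # -> [0]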
This overall algorithm is used as the keyword-based retriever component of the RAG architecture. At the very end of this algorithm, the top-ranked chunks or documents are passed back as the result of the keyword lookup, and are then incorporated into the overall RAG algorithm.
Overall RAG Algorithm
The overall RAG algorithm involves quite a few steps, and is the basis of the RAG architecture. The software component that executes this overarching algorithm is called the “RAG orchestrator,” or simply the “RAG backend.”
The RAG orchestrator has quite a busy life, because it has to supervise a lot of different personalities. There are several other individual components used by the algorithm:
- Embedding model — converts the user’s query string into a vector of semantic embeddings.
- Retriever — gets text chunks from a vector database, a keyword database, or both.
- Reranker — orders the returned text chunks into a “ranking” from most relevant to least.
- Packer — merges all the text chunks and the user query into the prompt template.
- Citation manager — keeps track of the citations to show users, based on the text chunks.
The basic technical algorithm flow for a user request in a RAG architecture can be something like this:
a. Receive the user’s question (input).
b. Use the user’s question to do a text-based (keyword) search on the index and get the top X hits (of documents or snippets).
c. Calculate the “embedding” for the user’s question (a vector that shows its semantic meaning in numbers).
d. Calculate the embeddings for the top X hits (from text search) and add these embedding vectors to the vector database.
e. Do a vector search on embeddings and get the top Y hits.
f. Filter the top X hits (text-based) and top Y hits (vector-based) to find overlaps; this overlap represents the best of both the text-based and vector-based hits. If there is no overlap, select some from each.
g. Combine the top hits with any summarization from previous questions.
h. Get the contents from the top hits and use prompt engineering to create a question something like:
“Given <summary>, <chunk 1>, <chunk 2>, <chunk 3>, answer <question from user>.
Only respond with content from the provided data.
If you do not know the answer, respond with I do not know.
Cite the content used.”
i. Send the prompt to the LLM, and receive the answer back from the LLM.
j. Resolve any citations in the answers back to URLs the end user can click on, e.g., Confluence page, Jira ticket/comment/solution, etc.
k. Summarize the conversation to date using the model (i.e., context for any subsequent questions).
l. Send back answer + summarization (perhaps encoded). The idea is the encoded summarization will not be shown for this answer, but will only be used internally by the RAG components for follow-up questions.
m. The client or caller is responsible for context management, which means ensuring that conversations end quickly and new topics result in new conversations. Otherwise, the context fills up quickly, the LLM forgets what it’s already said, and things get confusing.
The above algorithm is thorough in generating two sets of hits (top X and top Y). It’s not strictly necessary to do two searches (one with text keywords and one with vector embeddings), as vector embeddings alone are often good enough. Alternatively, text-based keyword searches are often cheaper, and the vector lookups could be skipped. At the end of the day, the goal is to find the chunks most likely to contain answers to the user’s question.
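Stripped of the special cases, the orchestrator’s request path reduces to a short skeleton. Every parameter below is a placeholder for one of the components described earlier, not a real API, and the two-search variant is collapsed into a single retrieve call:

def answer_query(user_question, conversation_summary,
                 retrieve, rerank, pack, call_llm, resolve_citations):
    """RAG orchestrator skeleton: retrieve, rerank, pack, one LLM call, cite, summarize."""
    chunks = retrieve(user_question)                    # keyword and/or vector lookup
    if not chunks:
        return "I do not know.", conversation_summary   # see Special Cases below
    ranked = rerank(user_question, chunks)              # most relevant first
    prompt = pack(conversation_summary, ranked, user_question)
    answer = call_llm(prompt)                           # the single final LLM request
    answer = resolve_citations(answer, ranked)          # map citations back to clickable URLs
    new_summary = call_llm("Summarize this conversation so far:\n" + prompt + "\n" + answer)
    return answer, new_summary                          # summary is kept for follow-up questions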
Special Cases
If only that were all there was to code for a RAG system. Here are some more special cases to consider:
- No chunks are returned by the retriever (two cases: keyword retrieval and vector database retrieval).
- Returned chunks are scored so low by the reranker that it’s effectively the same as having no chunks (i.e., all are scored below a relevance threshold).
- The query requires a “tool” or “function call” (e.g., a clock is needed to answer: “What time is it?”).
- The query requires an external data source, such as a web search, which may or may not be supported by your RAG system.
My brain is in pain. How many distinct input cases is that?
There’s not one right answer in such cases. Zero returned chunks could be handled by simply bailing out with a fixed, unfriendly message (“Error: no chunks”) or with a friendlier LLM response based on a different prompt template (“I don’t know the answer to that, but here’s a quote from Hamlet.”).
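A sketch of that fallback logic, assuming the reranker attaches a relevance score to each chunk and that the threshold value has been tuned for the particular system:

RELEVANCE_THRESHOLD = 0.3   # illustrative value; needs tuning per system

FALLBACK_TEMPLATE = (
    "The knowledge base had no relevant documents for this question.\n"
    "Politely tell the user you do not know, and suggest rephrasing.\n"
    "Question: {question}")

def choose_prompt(scored_chunks, user_question, pack_fn):
    """Switch to a fallback prompt template when retrieval effectively fails."""
    useful = [chunk for chunk, score in scored_chunks if score >= RELEVANCE_THRESHOLD]
    if not useful:                  # covers both zero chunks and all-low-relevance chunks
        return FALLBACK_TEMPLATE.format(question=user_question)
    return pack_fn(useful, user_question)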
On the other hand, maybe you still want to sell your customer something even if you’ve got nothing. For example, the RAG system is a clothing advisor, and the user’s query is:
“What should I wear today?”
That’s unlikely to have any good matches to specific RAG documents about clothing products. The simple and efficient solution would be to just give a generic sales spiel response, which could be a canned document.
But if personalization was desirable (yes), and a weather tool was available (maybe), and your LLM has been trained to recognize day-specific or season-specific searches (probably not), then the weather tool could be called, and the returned weather conditions could be substituted into the query. With better keywords, a good selection of clothing could then be retrieved from the RAG store. The query would effectively become:
“What should I wear when it is chilly with a chance of rain?”
Note that this idea actually needs two tools. To get the weather, you first need to know the website user’s general location from their IP address or other cookies, which is another tool.
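A sketch of that two-tool rewrite, where both tool functions are hypothetical stand-ins rather than real APIs:

def get_location_from_ip(ip_address):
    """Hypothetical tool: map the user's IP address to a rough location."""
    return "Sydney, Australia"

def get_weather(location):
    """Hypothetical tool: look up current conditions for a location."""
    return "chilly with a chance of rain"

def rewrite_clothing_query(user_query, ip_address):
    """Substitute live weather conditions into a day-specific clothing query."""
    location = get_location_from_ip(ip_address)     # tool one: location
    conditions = get_weather(location)              # tool two: weather
    return "What should I wear when it is " + conditions + "?"

print(rewrite_clothing_query("What should I wear today?", "203.0.113.42"))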
References
General research papers examining the component-level architectures of RAG applications include:
- Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 1 Jul 2024, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219 Project: https://github.com/FudanDNN-NLP/RAG (Attempts to optimize the entire RAG system, including the various options for different RAG modules in the RAG pipeline, such as optimal methods for chunking, retrieval, embedding models, vector databases, prompt compression, reranking, repacking, summarizers, and other components.)
- Chaitanya Sharma, 28 May 2025, Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers, https://arxiv.org/abs/2506.00054