Aussie AI
Chapter 1. Introduction to RAG
-
Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
-
by David Spuler and Michael Sharpe
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique for merging external data sources with AI-based query answering. Really, it’s just a fancy way to say: look something up in a database and then get the LLM to summarize it. When it works well, RAG combines the speed of searching an information database with the elegance of fluent writing from an LLM.
RAG is a fundamental technique in generative AI that extends the knowledge of an LLM without fine-tuning. Rather than train new knowledge in the LLM’s parameters, we instead look up the extra information by searching a database. The LLM receives the user’s prompt and the extra information found by the RAG lookup (called the “retriever” component). The LLM then uses its summarization and natural language capabilities to answer the user’s question, based on the extra RAG text as input context.
RAG is commonly used as the go-to alternative to fine-tuning an LLM on a business’s specialist data. For example, to create a chatbot that knows about your products, you could use fine-tuning to build a custom LLM. The more efficient way is to leave your LLM unchanged, put your special documents into a RAG database (e.g., your entire website), and then have the LLM search these documents using a RAG architecture.
The current AI assistant capabilities of Google and Bing use a RAG-like architecture, although it is more like a mega-RAG architecture, with a rather large database of documents. The way it works is that Google or Bing first searches the entire internet (however they do that), and then the LLM summarizes the handful of retrieved internet documents into the final AI answer.
Why RAG?
The need for RAG arose from situations where a base LLM would otherwise have to be fine-tuned with proprietary data. When generative AI first became hot, the cost of fine-tuning was prohibitive, and the idea of using a datastore of small pieces of data came about instead.
The original authors didn’t know how popular RAG would become, or else surely they would have chosen a better name. I mean, how about Flash Fine-Tuning, or something like that? But, no, we have Retrieval-Augmented Generation (RAG), for better or worse.
RAG addresses the same use cases that fine-tuning was needed for, of which the most well-known is the customer support chatbot that uses internal business data about products. Hence, the general characteristics of a RAG application are:
- Interactive
- Question and answer focus
- Purposeful, not just “chatting”
- Based on business-specific information
In technical terms, the goals of a RAG architecture include:
- Focus the LLM responses toward business-specific information.
- Reduce hallucinations by giving the LLM more accurate input data.
- Customize the LLM faster than with fine-tuning.
- Keep the application up-to-date by simply adding more documents.
Lately, fine-tuning has been staging a comeback with new research into Parameter-Efficient Fine-Tuning (PEFT). Again, we have the need for a better name. Notably, Apple went for that and more with “Apple Intelligence,” which brazenly seeks to take over the “AI” acronym as well. The on-device engine for Apple Intelligence uses PEFT via the Multi-LoRA architecture to run fast on iPhones and Macs, but that’s really the topic for another whole book. RAG is still widely used, especially in implementing customer support chatbots that rely on business-specific information.
RAG is also often used as a technique to manage large tool outputs, which is frequently necessary in agentic applications. A tool returns a large result: the result gets chunked, and then only the chunks necessary to satisfy the goal are retrieved. This approach does not always make sense, but it helps when tool results are too large to pass to the LLM in full.
Overall RAG Architecture
RAG is an architecture whereby the AI is integrated with an external document search mechanism. There are three main components:
- Datastore
- Retriever
- Generator
The datastore could be a classic database (e.g., SQL or MongoDB) with keyword lookup or a vector database with semantic lookup. The use of semantic lookup can give more meaningful document search results, with better model answers, but requires two additional steps. Firstly, the user’s query must be converted into a vector format that represents its semantic meaning (called an “embedding”). Secondly, a vector database lookup is required using that embedding vector. There are various commercial and open source vector databases available.
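For example, here is a minimal sketch of a semantic lookup in Python, assuming a hypothetical embed() function as a stand-in for a real embedding model, and a small in-memory NumPy array standing in for a vector database:

```python
# Minimal sketch of semantic lookup: embed the query, then find the
# nearest document snippets by cosine similarity. A real deployment
# would call an embedding model and a vector database; here embed()
# is a hypothetical stand-in and the "database" is an in-memory list.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function; replace with a real embedding model call."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # deterministic fake vector
    return rng.standard_normal(384)

# "Datastore": precomputed embeddings for each document snippet.
snippets = [
    "Our Model X widget supports Bluetooth 5.0.",
    "Returns are accepted within 30 days of purchase.",
    "The Model Y widget requires a USB-C charger.",
]
doc_vectors = np.stack([embed(s) for s in snippets])

def semantic_lookup(query: str, top_k: int = 2) -> list[str]:
    q = embed(query)
    # Cosine similarity between the query vector and every snippet vector.
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(-sims)[:top_k]
    return [snippets[i] for i in best]

print(semantic_lookup("How do I charge the Model Y?"))
```

With a keyword datastore, semantic_lookup() would instead be a full-text query against a search engine; the rest of the pipeline is unchanged.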
The “retriever” component looks up the user’s query in a datastore of documents, using either keyword search or vector search. This is effectively a search component that accesses a database and finds all the related material. Typically, it returns excerpts or snippets of relevant text, rather than full documents.
The role of the “generator” component in RAG is to receive the document excerpts back from the retriever, and collate that into a prompt for the AI model. The snippets of text are merged as context for the user’s question, and the combined prompt is sent to the AI engine. Hence, the role of the generator is mainly one of prompt engineering and forwarding requests to the LLM, and it tends to be a relatively simple component.
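As a rough illustration, the generator can be little more than a prompt template plus a forwarding call. The following is a minimal sketch, where call_llm() is a hypothetical stand-in for whatever LLM API the application uses:

```python
# Minimal sketch of the "generator" component: merge the retrieved
# snippets into context for the user's question and forward the
# combined prompt to the LLM.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's chat/completions API."""
    raise NotImplementedError

def build_prompt(question: str, snippets: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def generate_answer(question: str, snippets: list[str]) -> str:
    return call_llm(build_prompt(question, snippets))
```

The grounding instruction at the top of the prompt is the main piece of prompt engineering: it tells the model to prefer the retrieved snippets over its own trained knowledge.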
How Does RAG Work?
As an example, let us assume you are creating an AI system that answers questions about your company’s latest product offerings, perhaps with details already on your website, or with nice glossy PDF marketing documents that describe the products. How do you get the AI system to answer questions about those products?
In the RAG approach, the model itself doesn’t know about the data or documents with details of your newest products. Instead, the engine in a RAG architecture knows how to:
(a) search your company documents for the most relevant ones (retriever), and
(b) summarize relevant parts of the documents into an answer for the user’s question (generator).
Unlike fine-tuning, the RAG approach does not use your company documents as training data that you cram into an updated model. Instead, the documents are a source of input data that is integrated via a retriever search component, and sent as input to the AI engine using an unchanged model. RAG may require some “prompt engineering” that combines the document search results and a user’s query, but the foundational model itself does not change.
The RAG component typically consists of a datastore of documents and a search mechanism. A typical setup would be a vector database containing documents that are indexed according to a semantic vectorization of their contents. The search mechanism would first vectorize the incoming query into its semantic components, then find the documents with the “nearest” matching vectors, which indicates a close semantic affinity.
Document Snippets. Typically, the results from the “retriever” component would be small sections or snippets of documents, rather than full-size documents. Small sections are desirable because:
(a) it would be costly to make an AI engine process a large document, and
(b) it helps the AI find the most relevant information quickly.
The retrieved snippets or portions of documents would be returned to the AI. They would be prepended to the user’s search query as “context” for the AI engine. Prompt engineering would then be used to ensure that the AI engine responds to the user query using information from the context document sections.
RAG Project Design
General implementation steps in a typical RAG project are as follows:
1. Data identification. Identify the proprietary data you want to make the RAG system an expert on. This will also mean working out how to ingest the data. For example, it might be a JIRA customer support database, a Confluence space, or a directory of PDF files on disk.
Generally, any type of “knowledge base” can be used. Some common internal examples are product documentation, HR benefits information, company policies and other internal training materials. External examples are ticketing systems (carefully scrubbed), customer product documentation, and support information.
Note that the base of knowledge will change and get bigger over time. It is necessary to ponder the refresh operation, because purging and starting over can be expensive, especially if embeddings are being calculated.
2. Sizing. Determine the size of a “chunk” of data. The context size of the model matters here, because the chunks need to be substantially smaller than the context size. When a user asks a question, the system will be given 3-5 chunks of knowledge excerpts that are pertinent to the question, and these snippets will be combined with the question. Furthermore, if a back-and-forth dialog needs to occur between the user and the model, extra room needs to be available in the context size for follow-up questions.
Note that there can be two context sizes in play: the context size of the LLM that will generate the answer, and the context size of the model producing the embeddings. It’s common for these to be different, with the embedding generation typically using a “cheaper” engine. If a RAG system has previously been implemented and is now being revised, you should recheck the initial assumptions, because model context sizes have increased and model costs have improved.
3. Splitting sections. Determine how to “chunk” the data. A boundary for the chunk needs to be identified, which might be sections after a heading if it’s a web page. Larger sections might need the heading and one or two paragraphs in each chunk. Content from a bug tracking system might use the main description as one chunk, the follow-up comments as another chunk or multiple chunks, and the identified solution as another chunk. It’s often beneficial to “overlap” chunks to hide the fact that chunking occurs (a minimal chunking-and-embedding sketch appears after this list).
4. Text-based database upload. Each chunk needs to be organized and stored in a text-based search engine, like Elasticsearch, Azure Cognitive Search, etc. For documents and web pages, the organization can be minimal, but for a ticketing system with a complex structure (e.g., problem descriptions, comments, solutions), the pieces all need to be related somehow.
5. Vector database upload. The embedding for each chunk needs to be calculated and stored in a vector database. You can think of the embedding as a “summarization” of the chunk from the perspective of the model. The returned vector is high-dimensional, typically with hundreds of dimensions, each capturing some aspect of the content’s meaning. The idea is that, from the model’s perspective, similar content produces similar vectors, so a vector database can quickly find chunks with related data using vector lookup.
6. Optimizations. The embeddings can sometimes be calculated using a lazy evaluation algorithm (avoiding the cost of embedding calculations for never-used documents), but this can also slow down inference, and full precomputation is faster (a lazy-embedding cache is sketched below). The model used for calculating embeddings does not need to be the same as the model answering the questions. Hence, a cheaper model can be used for embedding, such as a dedicated embedding model, whereas a more capable model like GPT-4 could be used to answer questions.
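To make steps 3 to 5 concrete, here is a minimal ingestion sketch in Python. It derives a chunk size from an assumed context budget (step 2), splits text into overlapping chunks (step 3), and stores each chunk’s text and embedding in simple in-memory lists that stand in for the text index and the vector database (steps 4 and 5). The embed() function is a hypothetical stand-in for a real embedding model, token counts are approximated by word counts, and the specific numbers are illustrative assumptions rather than recommendations:

```python
# Minimal ingestion sketch: size chunks against a context budget, split
# documents into overlapping chunks, and store text plus embeddings.
import numpy as np

CONTEXT_SIZE = 8192               # assumed context window of the answering LLM
RESERVED_FOR_DIALOG = 2048        # room kept for the question, answer, and follow-ups
CHUNKS_PER_QUERY = 5              # how many chunks will be sent with each question
CHUNK_TOKENS = (CONTEXT_SIZE - RESERVED_FOR_DIALOG) // CHUNKS_PER_QUERY

def chunk_text(text: str, size: int = CHUNK_TOKENS, overlap: int = 50) -> list[str]:
    """Split text into ~size-word chunks, overlapping to hide chunk boundaries."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap
    return chunks

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call; replace with a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

# Stand-ins for the text search index and the vector database.
chunk_store: list[str] = []
vector_store: list[np.ndarray] = []

def ingest(document: str) -> None:
    for chunk in chunk_text(document):
        chunk_store.append(chunk)          # text-based index (step 4)
        vector_store.append(embed(chunk))  # vector database upload (step 5)
```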
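And here is a minimal sketch of the lazy-evaluation idea from step 6, reusing the hypothetical embed() function from the sketch above: a chunk’s embedding is computed only the first time it is needed and then cached, avoiding the cost of embedding never-used chunks at the price of a slower first lookup:

```python
# Minimal sketch of lazy embedding evaluation with caching.
from functools import lru_cache

@lru_cache(maxsize=None)
def lazy_embedding(chunk: str):
    # The first call pays the embedding cost; later calls hit the cache.
    return embed(chunk)  # embed() is the hypothetical function defined above
```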
When Not To Use RAG
RAG is not for everything and everyone. There are plenty of AI use cases where RAG is not the appropriate architecture. The main alternative architectures to consider are:
- Raw LLM (without any extra lookups)
- Fine-Tuned LLM
- Data source integrations
- Agentic architectures
- Chatbots and companionbots
- Non-interactive LLM applications
- Multiple sub-models specialized from one main model
Some of the indicators against using RAG would therefore be:
- No proprietary or internal company-specific data — any frontier LLM has probably been pre-trained on public and evergreen data (i.e., RAG chunks are not needed).
- Needs are more suited to fine-tuning — see Chapter 6 for choosing RAG versus fine-tuning (or both).
- Main input data sources are dynamic and rapidly changing — a data source integration might make more sense.
- Aim is to perform “actions” for the user — agent architectures are needed whereas RAG may not be (see Chapter 15).
- The primary purpose is “chatting” rather than answering questions.
- Purpose is a long-term project like booking a vacation — agentic architectures with planning are needed.
- Programmatic generation of many articles or LLM responses — batch API architecture without RAG is probably best.
- Multiple special-purpose models are needed — using Parameter-Efficient Fine-Tuning (PEFT) such as multi-LoRA is probably better than using the same model with different RAG datastores.
Hence, RAG has a specific place in the hierarchy of various generative AI architectures. It is needed when a business or company has some specific proprietary information, whether public or internal, and an AI application is needed that is specifically focused on answering questions about that information. If the requirements fit that description, then the options are really to use RAG or fine-tuning, and although we’re a little biased, RAG is better!
Refreshing an Old RAG Application
You’re not the only one with a sub-par customer support RAG application from a year or more ago. Here’s the brutal truth: it’s not great. On the other hand, everyone’s had plenty of experiences with real human customer support agents that were also “not great,” so your RAG automation is not the worst thing. But it’s giving your customers a non-optimal experience, which you could definitely improve to drive greater business value.
Here are some thoughts on ways to “refresh” your RAG AI application using the insight from a year or two of experience with live use of RAG architectures.
- Logging. Make sure you add sufficient logging so you can analyze the user questions being asked and the LLM responses given. Sometimes you have to follow a full conversation via the logs to determine where things became difficult.
- Analyze the logs! Employees and customers are familiar with the industry’s terminology, but a naive chatbot often is not and can get confused. Tweaking indexes or preprocessing questions can help. Tweaking the vector DB is more indirect (toss it and rebuild it with more context).
- User feedback. Add a mechanism to get feedback about answer quality, such as a thumbs up/down, although a 1-5 star rating is better (but will see lower engagement). Provide something that lets you determine whether users thought the answer was good.
- Longer contexts, more chunks. Early chatbots had small context windows; better options are likely available today. Sometimes those small sizes have propagated into the code. A hard-coded limit or a truncation somewhere is easy to spot. More subtle limits arise from the assumption that only 5 chunks of data will be provided to the LLM, where it might be possible to provide more today. It might also be possible to have bigger chunks now, or to provide the best chunk plus some surrounding context chunks.
- Scope limits. If you put a chatbot on a knowledge base covering all of your products, it’s possible that there are conflicts in the knowledge base: X means one thing for one product, but something different for other products. Questions about X could be based on matches from both products, and the LLM will mash the information together. It’s useful to allow the user to provide context to narrow the search, such as a drop-down menu of products. This way, questions about X for one product will not get contaminated by RAG chunk results from another product.
- Query cleanup. Preprocessing the questions may help, such as cleaning up grammar, and narrowing the search context down, or perhaps widening it. The preprocessing can also check whether the question matches an FAQ and feed those results into the LLM too. It might even be that the user’s question is matched to an FAQ, and that FAQ question is fed into the system instead.
- Hybrid vector search. Using just a vector database to find documents often gives poor results. Vector searches combined with full-text searches and keyword searches can be useful (a simple hybrid merge is sketched after this list). For example, “Explain the ERR025 lost connection error?” will likely produce a better match using a keyword search on “ERR025” than via a vector search of embeddings.
- Chunk attributes. Add context to the documents being retrieved as chunks. For example, if the knowledge base is HTML, there might be tags or comments in the HTML that help guide the search engine, and these are useful to pass along with the chunks. If such metadata is not available, consider adding it.
- Related questions. Generate embeddings of previous questions, and use them to search a vector store for similar questions and answers, along with the source chunks from those previous answers. It’s even better if those answers got a good 1-5 rating or a thumbs up. These top chunks should be combined with the chunks found from the vector search of the content and the text-based searches.
- HyDE mechanism. Run the question through the LLM first to get a preliminary answer, then use the question and that answer together to retrieve chunks. Often, the question plus the initial answer produces a better retrieval. This is the Hypothetical Document Embeddings (HyDE) mechanism for more relevant retrieval. Effectively, you ask the LLM directly for a candidate answer (without RAG chunks), before using the user’s question and this preliminary answer to do the RAG retrieval (see the sketch after this list).
- Chunk context. The organization of the data chunks can be useful. If the document store has a structure, that structure can be important context. For example, if the document store is a wiki that has comments, or a ticketing system that has comments and perhaps links to a knowledge base, it’s useful to provide all of this data to the LLM. It can be important that, if the question matches a comment in the retrieval phase, then the comment’s parent document gets captured too, and perhaps even the knowledge base article, related articles, or the document hierarchy.
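For the hybrid search idea above, one simple way to merge a keyword-search ranking with a vector-search ranking is reciprocal rank fusion (RRF). This is a minimal sketch; the two input rankings are assumed to come from whatever full-text engine and vector database the application already uses, and the chunk IDs are invented for illustration:

```python
# Minimal sketch of hybrid retrieval via reciprocal rank fusion (RRF):
# merge several ranked lists of chunk IDs into one combined ranking.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine ranked lists of chunk IDs; a higher RRF score ranks earlier."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "ERR025" ranks highly in the keyword results even if the
# vector search would have missed it.
keyword_hits = ["kb-err025", "kb-network", "kb-login"]
vector_hits = ["kb-network", "kb-wifi", "kb-err025"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```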
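And here is a minimal sketch of the HyDE mechanism described above, reusing the hypothetical call_llm(), semantic_lookup(), and generate_answer() functions from the earlier sketches in this chapter:

```python
# Minimal sketch of HyDE-style retrieval: get a draft answer from the
# LLM alone, retrieve chunks using the question plus that draft, then
# answer again with the retrieved chunks as grounded context.

def hyde_answer(question: str) -> str:
    # Step 1: draft (possibly hallucinated) answer, with no RAG chunks.
    draft = call_llm(f"Briefly answer this question: {question}")
    # Step 2: retrieve chunks using the question plus the draft answer.
    chunks = semantic_lookup(question + "\n" + draft, top_k=5)
    # Step 3: generate the final answer grounded in the retrieved chunks.
    return generate_answer(question, chunks)
```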