Chapter 15. RAG Architectures

  • Book Excerpt from "Generative AI Applications: Planning, Design and Implementation"
  • by David Spuler

What is RAG?

Retrieval Augmented Generation (RAG) is a technique of merging external data sources with AI-based query answering. Really, it’s just a fancy way to say: look up a database and then get the LLM to summarize it. When it works well, RAG combines the speed of searching an information database with the elegance of fluent writing from an LLM.

RAG is an architecture whereby the AI is integrated with an external document search mechanism. There are three components:

  • Retriever
  • Generator
  • Datastore

The “retriever” component looks up the user’s query in a datastore of documents, using either keyword search or vector search. This is effectively a search component that accesses a database and finds all the related material. Typically, it returns excerpts or snippets of relevant text, rather than full documents.

The role of the “generator” component in RAG is to receive the document excerpts back from the retriever, and collate that into a prompt for the AI model. The snippets of text are merged as context for the user’s question, and the combined prompt is sent to the AI engine. Hence, the role of the generator is mainly one of prompt engineering and forwarding requests to the LLM, and it tends to be a relatively simple component.

The datastore could be a classic database (e.g., SQL or MongoDB) with keyword lookup or a vector database with semantic lookup. The use of semantic lookup can give more meaningful document search results, with better model answers, but requires two additional steps. Firstly, the user’s query must be converted into a vector format that represents its semantic meaning (called an “embedding”). Secondly, a vector database lookup is required using that embedding vector. There are various commercial and open source vector databases available.
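
To make the vector-lookup path concrete, here is a minimal sketch in Python. The embed() helper is a placeholder for whatever embedding model you call, and a plain list stands in for the vector database; a real deployment would use one of the vector databases discussed later in this chapter.

    import math

    def embed(text):
        # Placeholder for a real embedding model call: returns a small
        # fixed-size vector derived from the characters, purely so that
        # this example is self-contained and runnable.
        vec = [0.0] * 8
        for i, ch in enumerate(text.lower()):
            vec[i % 8] += ord(ch) / 1000.0
        return vec

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    # Toy "vector database": each entry is (chunk text, its embedding vector).
    chunks = [
        "Product A supports solar charging.",
        "Product B has a waterproof casing.",
        "Our returns policy lasts 30 days.",
    ]
    vector_db = [(c, embed(c)) for c in chunks]

    def retrieve(query, top_k=2):
        # Embed the query, then rank stored chunks by vector similarity.
        q = embed(query)
        ranked = sorted(vector_db, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]

    print(retrieve("Does Product A work with solar power?"))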

How does RAG work? As an example, let us assume you are creating an AI system that answers questions about your company’s latest product offerings, perhaps with details already on your website, or with nice glossy PDF marketing documents that describe the products. How do you get the AI system to answer questions about those products?

In the RAG approach, the model itself doesn’t know about the data or documents with details of your newest products. Instead, the engine in a RAG architecture knows how to:

    (a) search your company documents for the most relevant ones (retriever), and

    (b) summarize relevant parts of the documents into an answer for the user’s question (generator).

Unlike fine-tuning, the RAG approach does not use your company documents as training data that you cram into an updated model. Instead, the documents are a source of input data that is integrated via a retriever search component, and sent as input to the AI engine using an unchanged model. RAG may require some “prompt engineering” that combines the document search results and a user’s query, but the foundational model itself does not change.

The RAG component typically consists of a datastore of documents and a search mechanism. A typical setup would be a vector database containing documents that are indexed according to a semantic vectorization of their contents. The search mechanism would first vectorize the incoming query into its semantic components, then find the documents with the “nearest” matching vectors, which indicates a close semantic affinity.

Document Snippets. Typically, the results from the “retriever” component would be small sections or snippets of documents, rather than full-size documents. Small sections are desirable because:

    (a) it would be costly to make an AI engine process a large document, and

    (b) it helps the AI find the most relevant information quickly.

The retrieved snippets or portions of documents would be returned to the AI. They would be prepended to the user’s search query as “context” for the AI engine. Prompt engineering would then be used to ensure that the AI engine responds to the user query using information from the context document sections.
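
As a minimal sketch of this prompt-assembly step, the snippet below prepends retrieved excerpts as context and appends grounding instructions. The template wording is illustrative only, not a canonical format.

    def build_rag_prompt(snippets, user_query):
        # Prepend the retrieved snippets as context for the user's question.
        context = "\n\n".join(f"[Chunk {i + 1}] {s}" for i, s in enumerate(snippets))
        return (
            "Answer the question using only the context below.\n"
            "If the answer is not in the context, say you do not know.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {user_query}\n"
            "Answer (cite the chunk numbers used):"
        )

    prompt = build_rag_prompt(
        ["Product A supports solar charging.", "Product A ships with a USB-C cable."],
        "Can I charge Product A with solar power?",
    )
    # The resulting string is what gets sent to the LLM as the combined input.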

RAG Project Design

General implementation steps in a typical RAG project are as follows:

1. Data identification. Identify the proprietary data you want to make the RAG system an expert on. This will also mean working out how to ingest the data. For example, it might be a JIRA customer support database, a Confluence space, or a directory of PDF files on disk.

Generally, any type of “knowledge base” can be used. Some common internal examples are product documentation, HR benefits information, company policies and other internal training materials. External examples are ticketing systems (carefully scrubbed), customer product documentation, and support information.

Note that the base of knowledge will change and get bigger over time. It is necessary to ponder the refresh operation, because purging and starting over can be expensive, especially if embeddings are being calculated.

2. Sizing. Determine the size of a “chunk” of data. The context size of the model matters here, because the chunks need to be substantially smaller than the context size. When a user asks a question, the system will be given 3-5 chunks of knowledge excerpts that are pertinent to the question. These snippets will be combined with the question. Furthermore, if a back-and-forth dialog needs to occur between the user and the model, extra room needs to be available in the context size for follow-up questions.

Note that there can be two context sizes in play: the context size of the LLM that will generate the answer, and the context size of the model producing the embeddings. It’s common for these to be different, with the embedding generation typically using a “cheaper” engine. If a RAG system has previously been implemented and is now being revised, you should recheck the initial assumptions, because model context sizes have increased and model costs have improved.

3. Splitting sections. Determine how to “chunk” the data. A boundary for the chunk needs to be identified, which might be sections after a heading if it’s a web page. Larger sections might need the heading and one or two paragraphs in each chunk. Content from a bug tracking system might use the main description as one chunk, the follow-up comments as another chunk or multiple chunks, and the identified solution as another chunk. It’s often beneficial to “overlap” chunks to hide the fact that chunking occurs.

4. Text-based database upload. Each chunk needs to be organized and stored in a text-based search engine, like Elasticsearch, Azure Cognitive Search, etc. For documents and web pages, the organization can be minimal, but for a ticketing system with a complex structure (e.g., problem descriptions, comments, solutions), it all needs to be related somehow.

5. Vector database upload. The embedding for each chunk needs to be calculated and stored in a vector database. You can think of the embedding as a “summarization” of the chunk from the perspective of the model. The returned vector typically has hundreds of dimensions, with each dimension loosely capturing some aspect of the content rather than a single human-readable concept. The idea is that, from the model’s perspective, similar content produces similar vectors. A vector database can quickly find chunks with related data using vector lookup.

6. Optimizations. The embeddings can sometimes be calculated using a lazy evaluation algorithm (avoiding the cost of embedding calculations for never-used documents), but this can also slow down inference, and full precomputation is faster. The model used for calculating embeddings does not need to be the same as the model answering the questions. Hence, a cheaper model (typically a dedicated embedding model) can be used for the embeddings, whereas a larger model such as GPT-4 could be used to answer questions. A sketch of this ingest pipeline is shown below.
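
Here is a minimal sketch of the ingest pipeline described in steps 3 to 5 above, assuming hypothetical text_index.add() and vector_db.upsert() interfaces and an embed() function for whichever embedding model you choose; real search engines and vector databases each have their own APIs.

    def chunk_text(text, chunk_size=500, overlap=100):
        # Fixed-size character chunks with overlap, so that content cut at a
        # boundary still appears whole in at least one chunk.
        chunks = []
        step = chunk_size - overlap
        for start in range(0, len(text), step):
            piece = text[start:start + chunk_size]
            if piece:
                chunks.append(piece)
        return chunks

    def ingest(documents, embed, text_index, vector_db):
        # documents: list of (doc_id, text) pairs.
        # text_index and vector_db are stand-ins for your keyword search
        # engine and vector database (hypothetical add/upsert methods).
        for doc_id, text in documents:
            for n, chunk in enumerate(chunk_text(text)):
                chunk_id = f"{doc_id}-{n}"
                text_index.add(chunk_id, chunk)                    # keyword index
                vector_db.upsert(chunk_id, embed(chunk), chunk)    # semantic index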

RAG Detailed Algorithm

The RAG algorithm is not training. Prompt engineering gives the model all the content it needs to answer the question in the prompt. You are effectively using the LLM to take your content, mix it with its own trained knowledge (in a limited way), eloquently answer the question, and then perhaps converse on it a little.

The basic technical algorithm flow for a user request in a RAG architecture can be something like this:

    a. Receive the user’s question (input).

    b. Use the user’s question to do a text-based (keyword) search on the index and get the top X hits (of documents or snippets).

    c. Calculate the “embedding” for the user’s question (a vector that shows its semantic meaning in numbers).

    d. Calculate the embeddings for the top X hits (from the text search) and add these embedding vectors to the vector database.

    e. Do a vector search on embeddings and get the top Y hits.

    f. Filter the top X hits (text-based) and top Y hits (vector-based) to find overlaps; this overlap represents the best of both the text-based and vector-based hits. If there is no overlap, select some from each.

    g. Combine the top hits with any summarization from previous questions.

    h. Get the contents from the top hits and use prompt engineering to create a question something like:

      “Given <summary>, <chunk 1>, <chunk 2>, <chunk 3>, answer <question from user>.
      Only respond with content from the provided data.
      If you do not know the answer, respond with I do not know.
      Cite the content used.”

    i. Send the prompt to the LLM, and receive the answer back from the LLM.

    j. Resolve any citations in the answers back to URLs the end user can click on, e.g., Confluence page, Jira ticket/comment/solution, etc.

    k. Summarize the conversation to date using the model (i.e., context for any subsequent questions).

    l. Send back answer + summarization (perhaps encoded). The idea is the encoded summarization will not be shown for this answer, but will only be used internally by the RAG components for follow-up questions.

    m. The client or caller is responsible for context management, which means ensuring that conversations end quickly and new topics result in new conversations. Otherwise, the context fills up quickly, the LLM forgets what it’s already said, and things get confusing.

The above algorithm is thorough in generating two sets of hits (top X and top Y). It’s not strictly necessary to do two searches (one with text keywords and one with vector embeddings), as often vector embeddings are good enough. Alternatively, text-based keyword searches are often cheaper, and vector lookups could be skipped. At the end of the day, the chunks most likely to contain answers to the questions are being sought.
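
A minimal sketch of the hit-merging logic in steps (b) through (f) might look like the following, where the hit lists are simply chunk ids ordered best-first; the interleaving fallback is one reasonable choice, not the only one.

    def select_chunks(keyword_hits, vector_hits, max_chunks=5):
        # keyword_hits and vector_hits are lists of chunk ids, best first.
        vector_set = set(vector_hits)
        overlap = [cid for cid in keyword_hits if cid in vector_set]
        if overlap:
            return overlap[:max_chunks]   # best of both search methods
        # No overlap: interleave some hits from each list instead.
        selected = []
        for pair in zip(keyword_hits, vector_hits):
            for cid in pair:
                if cid not in selected:
                    selected.append(cid)
        return selected[:max_chunks]

    print(select_chunks(["c1", "c7", "c3"], ["c9", "c3", "c2"]))   # prints ['c3']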

Special Cases

If only that were all there was to code for a RAG system. Here are some more special cases to consider:

  • No chunks are returned by the retriever (two cases: keyword retrieval and vector database retrieval).
  • Returned chunks are scored so low by the reranker that it’s effectively the same as having no chunks (i.e., all are scored below a relevance threshold).
  • The query requires a “tool” or “function call” (e.g., a clock is needed to answer: “What time is it?”).
  • The query requires an external data source, such as a web search, which may or may not be supported by your RAG system.

My brain is in pain. How many distinct input cases is that?

There’s not one right answer in such cases. The case of zero chunks returned could be handled by simply bailing out with a fixed, unfriendly message (“Error: no chunks”) or a friendlier LLM response based on a different prompt template (“I don’t know the answer to that, but here’s a quote from Hamlet.”).
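
As a minimal sketch of that zero-chunk fallback, the function below swaps in a different prompt template when the reranker scores everything below a threshold; the threshold value and wording are illustrative only.

    RELEVANCE_THRESHOLD = 0.75   # illustrative; tune to your reranker's score scale

    def choose_prompt(scored_chunks, user_query):
        # scored_chunks: list of (chunk_text, relevance_score) pairs.
        usable = [text for text, score in scored_chunks if score >= RELEVANCE_THRESHOLD]
        if not usable:
            # Friendlier fallback: let the LLM decline gracefully.
            return (
                f"The user asked: {user_query}\n"
                "No relevant documents were found. Politely say you do not know "
                "and suggest contacting support."
            )
        context = "\n\n".join(usable)
        return f"Using only this context:\n{context}\n\nAnswer: {user_query}"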

On the other hand, maybe you still want to sell your customer something even if you’ve got nothing. For example, the RAG system is a clothing advisor, and the user’s query is:

    “What should I wear today?”

That’s unlikely to have any good matches to specific RAG documents about clothing products. The simple and efficient solution would be to just give a generic sales spiel response, which could be a canned document.

But if personalization was desirable (yes), and a weather tool was available (maybe), and your LLM has been trained to recognize day-specific or season-specific searches (probably not), then the weather tool could be called, and the returned weather conditions could be substituted into the query. With better keywords, a good selection of clothing could then be retrieved from the RAG store. The query would effectively become:

    “What should I wear when it is chilly with a chance of rain?”

Note that this idea actually needs two tools. To get the weather, you first need to know the website user’s general location from their IP address or other cookies, which is another tool.

Vector Databases

Some well-known vector databases are Pinecone, Milvus, Chroma, Weaviate, Qdrant, FAISS. General database systems like Elastic, Redis and Postgres (amongst others) also have some vector capabilities.

The main feature of a vector database is the ability to perform a vector-based query. Basically, you are looking for how close vectors are to each other in N-dimensional space. Cosine similarity is a common comparison, but other measures include Euclidean (L2) distance and the dot product, usually combined with an approximate nearest-neighbor search algorithm.
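
For example, here is a minimal sketch using FAISS, one of the libraries listed above; cosine similarity becomes a simple inner product once the vectors are normalized. The dimension and the random vectors are placeholders for real embeddings.

    import numpy as np
    import faiss   # other vector databases have broadly similar query APIs

    d = 384   # embedding dimension (depends on your embedding model)
    index = faiss.IndexFlatIP(d)   # inner-product index

    vectors = np.random.rand(1000, d).astype("float32")   # stand-ins for chunk embeddings
    faiss.normalize_L2(vectors)    # normalizing makes inner product equal cosine similarity
    index.add(vectors)

    query = np.random.rand(1, d).astype("float32")        # stand-in for the query embedding
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)                  # top-5 nearest chunks
    # The returned row ids map back to chunk texts stored alongside the index.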

Speed of the vector database lookup is obviously important for fast inference latency. Each vector also needs to be associated with the actual data or document in the database.

RAG Data Management

When compiling data for a RAG implementation, you do have to go to some lengths to make sure your database of content has the data in it to answer questions. But it does not need to be a great number of samples. In fact, even a single hit with one chunk of data is enough for the LLM to form an answer. With any RAG, if you search in the text-based index or the vector index and do not get good hits, it will produce poor results. Also, unlike fine-tuning data, RAG does not require question and answer type content, but only documents that can be searched.

The data is very important to a successful RAG project. RAG systems are not much different conceptually to searching Google for your own data. At the end, you have something that can produce eloquent writing better than 90% of the population.

Keyword Lookup Algorithms

As an example of the keyword lookup component, let’s examine the “Best Match 25” or BM25 algorithm. This is a longstanding keyword search algorithm that predates much of the AI work, and is simpler to understand than embedding-based vector search. Keyword search algorithms such as BM25 can be used as the keyword retriever in a RAG system.

At a high level, BM25 consists of the following steps:

    1. Process the query to remove the small words (e.g., “the,” “and,” “this”, “that,” etc.), leaving the core words of the query.

    2. Use a reverse index to find all the documents (chunks) that match each word.

    3. Score and rank these chunks or documents.

    4. Return the top-ranked chunks as the keyword lookup results. (Despite the name, the “25” in BM25 refers to the version of the ranking function, not the number of results returned.)

The score for ranking is based on:

  • How many of the query terms appear in a chunk/document.
  • How frequently each word appears in the chunk/document.
  • How rare each word is across all documents.
  • Length of the chunk/document.

All of these statistics can be precalculated for each word, so only their combination needs to be computed at runtime. There could also be other weights in the calculation; for example, a “vocabulary” specific to your knowledge base might have a stronger affinity to particular documents.
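
A minimal sketch of the BM25 scoring formula is shown below, using the standard k1 and b parameters; the document-frequency table, document count, and average length are the precomputed statistics mentioned above.

    import math

    def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.5, b=0.75):
        # Score one document (chunk) against a query, BM25-style.
        # doc_freq[t] = number of documents containing term t (precomputed).
        score = 0.0
        doc_len = len(doc_terms)
        for t in query_terms:
            tf = doc_terms.count(t)   # term frequency in this document
            if tf == 0:
                continue
            idf = math.log(1 + (num_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        return score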

This overall algorithm is used as the keyword-based retriever component of the RAG architecture. At the very end of this algorithm, the top-ranked chunks or documents are passed back as the result of the keyword lookup, and are then incorporated into the overall RAG algorithm.

Fine-Tuning vs RAG

If you want an AI engine that can produce answers based on your company’s product datasheets, there are two major options:

  • Fine-Tuning (FT): re-train your foundation model to answer new questions.
  • Retrieval-Augmented Generation (RAG) architecture: summarize excerpts of documents using an unchanged foundation model.

The two basic approaches are significantly different, with completely different architectures and a different project cost profile. Each approach has its own set of pros and cons.

Spoiler alert! RAG and fine-tuning can be combined.

Advantages of RAG. The RAG architecture typically has the following advantages:

  • Lower up-front cost
  • Flexibility
  • Up-to-date content
  • Access external data sources and/or internal proprietary documents
  • Content-rich answers from documents
  • Explainability (citations)
  • Personalization is easier
  • Hallucinations less likely (if retriever finds documents with the answer)
  • Scalability to as many documents as the datastore can handle.
  • RAG is regarded as “moderate” difficulty in terms of AI expertise.

The main goal of RAG is to avoid the expensive re-training and fine-tuning of a model. However, RAG also adds extra costs in terms of the RAG component and its database of documents, both at project setup and in ongoing usage, so it is not always a clear-cut win.

In addition to cost motivations, RAG may be advantageous in terms of flexibility to keep up-to-date with content for user answers. With a RAG component, any new documents can simply be added to the document datastore, rather than each new document requiring another model re-training cycle. This makes it easier for the AI application to stay up-to-date with current information included in its answers.

Disadvantages of RAG. The disadvantages of RAG, where the underlying model is not fine-tuned, include:

  • Architectural change required (retriever component integrated with model inference).
  • Slower inference latency and user response time (extra step in architecture).
  • Extra tokens of context from document excerpts (slower response and increased inference cost).
  • Extra ongoing costs of retriever component and datastore (e.g., hosting, licensing).
  • Larger foundation model required (increases latency and cost).
  • Model’s answer style may not match domain (e.g., wrong style, tone, jargon, terminology).
  • Re-ingestion may be required for data updates.

Data needs to be ingested periodically to stay up to date, which is also true of fine-tuning. With RAG, any major changes to the source data may require it all to be ingested again. Even if the data stays the same but has been reorganized, it needs to be ingested again. If you’re not careful with revisions, citation links may resolve to 404 pages. Technically, this is also an advantage of RAG, because the ingest is not terribly expensive compared to fine-tuning.

Penalty for unnecessary dumbness. There’s an even more fundamental limitation of RAG systems: they’re not any smarter than the LLM they use. In fact, a typical RAG system:

  • Remembers nothing
  • Learns nothing

The overall RAG implementation relies on the LLM for basic common sense, conversational ability, and simple intelligence. It extends the LLM with extra data, but not extra memory or extra skills.

The RAG system does not remember anything about the chunks it sees. Rather, it needs to scan them again for every query. It has no memory of having seen them before, except maybe a KV cache, which helps it run faster but not smarter.

The LLM does not learn anything from a chunk that puts its knowledge into a higher plane. For example, if you build a RAG system over a chunked version of a hundred programming textbooks, I doubt you could ask it to generate a Snake game from those chunks alone. However, you could certainly ask it about the syntax of a switch statement in the language, because that’s probably memorized inside the model weights.

With a RAG architecture, the model has not “learned” anything new, and you have only ensured it has the correct data to answer questions. For example, if you asked about how a “switch” statement works, the hope is that the retriever finds the chunks from the “Switch” sections of the programming books, and not from five random code examples that used the switch statement. This is all dependent on the text-based keyword indexing and how well the semantic embedding vectors work. The “R” in RAG is “Retrieval” and it’s very dependent on that.

It’s also somewhat dependent on the initial ingest of the RAG data. For example, if done well, the ingest logic and chunking could notice text versus code segments in the original content and then chunk or categorize it appropriately.

There is some research on “continuous learning” or “continuous adaptation” or “incremental learning” on the horizon. Future RAG systems might retain some knowledge from the text chunks of documents (or conversations with users). Maybe some of the faster fine-tuning algorithms, such as Parameter-Efficient Fine-Tuning (PEFT) or Low-Rank Adapters (LoRA), will improve RAG’s learning capabilities in the future.

Advantages of fine-tuning. The main advantages of a fine-tuning architecture over a RAG setup include:

  • Style and tone of responses can be trained — e.g., positivity, politeness.
  • Use of correct industry jargon and terminology in responses.
  • Brand voice can be adjusted.
  • No change to inference architecture — just an updated model.
  • Faster inference latency — no extra retriever search step.
  • Reduced inference cost — fewer input context tokens.
  • No extra costs from retriever and datastore components.
  • Smaller model can be used — further reduced inference cost.
  • You’re more likely to get free lunches and access to the company jet on weekends.

Fine-tuning is not an architectural change, but is an updated version of a major model (e.g., GPT-3), whereas RAG is a different architecture with an integration to a search component that accesses an external knowledge database or datastore and returns a set of documents or chunks/snippets of documents.

Disadvantages of fine-tuning. The disadvantages of a fine-tuning architecture without RAG include:

  • Training cost — up-front and ongoing scheduled fine-tuning is expensive.
  • Outdated information used in responses.
  • Needs a lot more proprietary data than RAG.
  • Training data must be in a paired input-output format, whereas RAG can use unstructured text.
  • Complexity of preparing and formatting the data for training (e.g., categorizing and labeling).
  • No access to external or internal data sources (except what it’s been trained on).
  • Hallucinations more likely (if it hasn’t been trained on the answer).
  • Personalization features are difficult.
  • Lack of explainability (hard to know how the model derived its answers or from what sources).
  • Poor scalability (e.g., if too many documents to re-train with).
  • Fine-tuning (training) is regarded as one of the highest difficulty-level projects in terms of AI expertise.

The main disadvantage of fine-tuning is the compute cost of GPUs for fine-tuning the model. This is at least a large up-front cost, and is also an ongoing cost to re-update the model with the latest information. The inherent disadvantage of doing scheduled fine-tuning is that the model is always out-of-date, since it only has information up to the most recent fine-tuning. This differs from RAG, where the queries can respond quickly using information in a new document, even within seconds of its addition to the document datastore.

Cost comparison

The fine-tuning approach has an up-front training cost, but a lower inference cost profile. However, fine-tuning may be required on an ongoing schedule, so this is not a once-only cost. The lower ongoing inference cost for fine-tuning is because (a) there’s no extra “retrieval” component needed, and (b) a smaller model can be used.

RAG has the opposite cost profile. The RAG approach has a reduced initial cost because there is no big GPU load (i.e., there’s no re-training of any models). However, a RAG project still has up-front costs in setting up the new architecture, but so does a fine-tuning project. In terms of ongoing costs, RAG has an increased inference cost for every user query, because the AI engine has more work to do with the extra information. In particular, the RAG architecture increases the number of tokens sent as input to the inference engine, because the retrieved chunks are “context” for the user query, and this increases the inference cost whether the model is commercially hosted or running in-house. This extra RAG inference cost continues for the lifetime of the application.
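
To make the recurring token cost concrete, here is a back-of-envelope sketch; every number is a placeholder to substitute your own prices and volumes into, not a real quote.

    # Hypothetical figures for illustration only.
    price_per_1k_input_tokens = 0.01     # dollars per 1,000 input tokens
    queries_per_month = 100_000

    base_prompt_tokens = 200             # user query plus instructions
    rag_context_tokens = 3 * 500         # e.g., three 500-token chunks per query

    plain_cost = queries_per_month * base_prompt_tokens / 1000 * price_per_1k_input_tokens
    rag_cost = queries_per_month * (base_prompt_tokens + rag_context_tokens) / 1000 \
        * price_per_1k_input_tokens

    print(f"Plain prompts: ${plain_cost:,.0f}/month; RAG prompts: ${rag_cost:,.0f}/month")
    # The difference recurs every month, and should be weighed against the
    # up-front and scheduled costs of fine-tuning.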

One final point: you need to double-check if RAG is cheaper than fine-tuning. I know, I know, sorry! I wrote that it was, and now I’m walking that claim back.

But commercial realities are what they are. There are a number of commercial vendors pushing the various components for RAG architecture and some are hyping that it’s cheaper than buying a truckload of GPUs. But fine-tuning isn’t that bad, and you need to be clear-eyed and compare the pricing of both approaches.

Staffing profiles are another issue in RAG versus fine-tuning. Fine-tuning likely requires some ML-savvy data scientists, whereas RAG needs only your average programmer without much ML experience.

Prompt Engineering and RAG

Prompt engineering is used in RAG algorithms in multiple ways. For example, it is used to merge document excerpts with the user’s question, and also to manage the back-and-forth context of a long conversation. The basic sequence that RAG prompt engineering assembles for the LLM is:

  • Preamble (e.g., global instructions)
  • Chunk 1
  • Chunk 2
  • Chunk 3
  • User’s query
  • Grounding criteria (prompt engineering)

Usually, the RAG chunks go first as “context,” and then the user’s query is appended. I have read at least one research paper that extolled the many advantages of “prepending” the user’s query before the RAG chunks, but good luck finding it because I can’t remember the citation. That idea also breaks some of the amazing “prefix KV caching” that can be done with RAG, so it would be slower, anyway, and we can’t have that.

As with any prompt engineering, playing around with the order of things can produce improved results. Repeating things can also help. It’s often useful to give prompt instructions like:

    Using the following <chunks> answer <query>. 
    Make sure only the chunks are used and provide the number of chunks used. 
    

It’s often useful to repeat the query, too:

    Answer <query> using <chunks>. 
    When answering <query> remember to only use information provided 
    and provide the number of the chunks used. 
    

Repeating things at the end is useful to refocus the LLM. If the <query> comes before the <chunks>, it might be so far from the end that its impact on inference is reduced.

Not all instructions need to be in the same LLM query. It’s possible to have a preamble conversation with the LLM initially where you explain what the following queries will look like and even provide an example. After that preamble, the actual prompt with RAG chunks and query can be sent.

Another use of prompt engineering is to overcome some of the “brand voice” limitations of RAG without fine-tuning. Such problems can sometimes be addressed using prompt engineering with global instructions. The new sequence becomes:

  • Chunk 1
  • Chunk 2
  • Chunk 3
  • Global instructions
  • User’s query

Or maybe you can put the global instructions right at the top, which works better with caching. But global instructions near the user’s query probably work better for accuracy. Also, interestingly, some research has shown that the RAG chunks at the start and end are the two that work best with LLMs, because I guess LLMs get bored reading all this garbage and just skim over the middle stuff. So, maybe you should only return two document chunks from your data retriever. Has anyone researched that?

Global instructions for RAG are similar to other uses of “custom instructions” for non-RAG architectures. For example, the tone and style of model responses can be adjusted with extra instructions given to the model in the prompt. The capabilities of the larger foundation models extend to being able to adjust their outputs according to these types of meta-instructions:

  • Style
  • Tone
  • Readability level (big or small words)
  • Verbose or concise (Hemingway vs James Joyce, anyone?)
  • Role-play/mimicking (personas)
  • Audience targeting

This can be as simple as prepending an additional instruction to all queries, either via concatenation to the query prompt, or as a “global instruction” if your cloud-based model vendor supports that capability directly.
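
A minimal sketch of the concatenation approach is below; the instruction text and product name here are just examples of what a global instruction might contain.

    GLOBAL_INSTRUCTIONS = (
        "You are a support assistant for Acme products. "   # hypothetical brand
        "Reply in an optimistic tone and avoid technical jargon."
    )

    def with_global_instructions(rag_prompt):
        # Simple prepending; some hosted APIs accept an equivalent "system"
        # or custom-instructions field instead of inline concatenation.
        return GLOBAL_INSTRUCTIONS + "\n\n" + rag_prompt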

Global instructions are usually written in plain English. Style and tone might be adjusted with prompt add-ons such as:

    Please reply in an optimistic tone.

You might also try getting answers in a persona, such as a happy customer (or a disgruntled one if you prefer), or perhaps a domain enthusiast for the area. You can use a prompt addendum with a persona or role-play instruction such as:

    Please pretend you are Gollum when answering.

Technically, you can omit the word “Please” if you like. But I think that good manners are recommended, because LLMs will be taking over the world as soon as they get better at math, or haven’t you heard?

Hybrid RAG + Fine-tuning Methods

Fine-tuning and RAG are more like frenemies than real enemies: they’re usually against each other, but they can also work together. If you have the bucks for that, it’s often the best option. The RAG architecture uses a model, and there’s no reason that you can’t re-train that model every now and then, if you’ve got the compute budget for re-training.

In a hybrid architecture, the most up-to-date information is in the RAG datastore, and the retriever component accesses that in its normal manner. But we can also occasionally re-train the underlying model, on whatever schedule our budget allows, and this gets the benefit of innate knowledge about the proprietary data inside the model itself. Occasional re-training helps keep the model updated on industry jargon and terminology, and also reduces the risk of the model filling gaps in its knowledge with “hallucinations.”

Once-only fine-tuning. One hybrid approach is to use a single up-front fine-tuning cycle to focus the model on the domain area, and then use RAG as the method whereby new documents are added. The model is then not fine-tuned again.

The goal of this once-only fine-tuning is to adjust static issues in the model:

  • Style and tone of expression
  • Brand voice
  • Industry jargon and terminology

Note that I didn’t write “up-to-date product documents” on that list. Instead, we’re putting all the documents in a datastore for a RAG retriever component. The model doesn’t need to be re-trained on the technical materials, but will get fresh responses using the RAG document excerpts. The initial fine-tuning is focused on stylistic matters affecting the way the model answers questions, rather than on learning new facts.

Another strength of fine-tuning the LLM used in a RAG system is getting the model accustomed to accepting queries and producing answers in the form you want. This can make subsequent prompting easier in the RAG system.

Occasional re-training might still be required for ongoing familiarity with jargon or tone, or if the model starts hallucinating in areas where it hasn’t been trained. However, this will be infrequent or possibly never required again.

Use Cases for FT vs RAG

I have to say that I think RAG should be the top of the pile for most business projects. The first rule of fine tuning is: do not talk about fine-tuning.

A typical business AI project involves the use of proprietary internal documents about the company’s products or services. This type of project is well-suited for RAG, with its quick updates and easy extraction of relevant document sections. Hence, RAG is my default recommendation for such use cases.

Fine-tuning is best for slow-changing or evergreen domain-specific content. RAG is more flexible in staying up-to-date with fast changing news or updated proprietary content. A hybrid combined approach can work well to use fine-tuning to adjust tone and style, whereas RAG keeps the underlying content fresh.

If the marketing department says you need to do fine-tuning for “brand voice” reasons, you simply ask them to define what exactly that means. That’ll keep them busy for at least six months.

Fine-tuning can also be preferred in some other specialist situations:

    a. Complex structured syntax, such as a chat bot that generates code for a proprietary language or schema.

    b. Keeping a model focused on a specific task. For example, if a coding copilot is generating a UI based on a text description using proprietary languages, you don’t want to get side-tracked by who won the 2023 US Open because the prompt somehow mentioned “tennis.”

    c. Fixing failures using the foundation model or better handling of edge cases.

    d. Giving the model new “skills”, or teaching the model how to understand some domain language and guide the results. For example, if you train the model using the works of Shakespeare, and then ask it to output something in HTML, the model will fail. Even using RAG and providing HTML examples as context will likely fail, too. However, fine-tuning the model with HTML examples will succeed, allowing it to answer questions about the works of Shakespeare, and to create new, improved works of Shakespeare that people other than English teachers can actually understand (and how about an HEA in R&J?). After that, it’ll format the results very nicely in HTML thanks to your fine-tuning.

    e. Translation to/from an unfamiliar or proprietary language. Translating to a foreign language the model has never seen is a clear example where fine-tuning is needed. Proprietary computer languages are another area. For example, consider the task of creating a SQL schema based on conversion of a given SAP schema. Fine tuning would be required to provide knowledge of SQL, SAP and the various mappings. Some models might already have some clue about SQL and SAP schemas from internet training data sets, but SAP schemas are also often company-specific.

    f. Post-optimization fine-tuning. This is another use of fine-tuning after certain types of optimizations that create smaller models, such as pruning or quantization. RAG cannot help here, because this is about basic model accuracy. These optimizations cause damage to the accuracy of the models, and it’s common practice to do a small amount of fine-tuning on the smaller model to fix these problems.

Refreshing an Old RAG Application

You’re not the only one with a sub-par customer support RAG application from a year or more ago. Here’s the brutal truth: it’s not great. On the other hand, I’ve had plenty of experiences with real human customer support agents that were also “not great,” so your RAG automation is not the worst thing. But it’s giving your customers a non-optimal experience, which you could definitely improve to drive greater business value.

Here are some thoughts on ways to “refresh” your RAG AI application using the insight from a year or two of experience with live use of RAG architectures.

  • Logging. Make sure you add sufficient logging so you can analyze the user questions being asked and the LLM responses given. Sometimes you have to follow a full conversation via the logs to determine where things became difficult.
  • Analyze the logs! Employees and customers are familiar with the industry’s terminology, but a naive chatbot often is not and can get confused. Tweaking indexes or preprocessing questions can help. Tweaking the vector DB is more indirect (toss it and rebuild it with more context).
  • User feedback. Add a mechanism to get feedback about answer quality, such as a thumbs up/down, although a 1-5 star rating is better (but will see lower engagement). Provide some way to determine whether users thought the answer was good.
  • Longer contexts, more chunks. Early chatbots had small context windows, and better options are likely available today. Sometimes those small sizes have propagated into the code: a hard-coded limit or a truncation somewhere is obvious, but a more subtle limit is the assumption that only 5 chunks of data will be provided to the LLM, where it might be possible to provide more today. It might also be possible to use bigger chunks now, or to provide the best chunk plus some surrounding context chunks.
  • Scope limits. If you put a chatbot on a knowledge base of your products, it’s possible that there are conflicts in the knowledge base: X means one thing for one product, but something different for other products. Questions around X could then match chunks from both products, and the LLM will mash the information together. It’s useful to allow the user to provide context to narrow the search, such as a drop-down menu of products. This way, questions about product X will not get contaminated by RAG chunk results from product Y.
  • Query cleanup. Preprocessing the questions may help, such as cleaning up grammar and narrowing (or perhaps widening) the search context. The preprocessing could also check whether the question matches an FAQ, and feed those results into the LLM too. It might even be that the user’s question is matched to an FAQ, and that FAQ question is fed into the system instead.
  • Hybrid vector search. Using just a vector database to find documents often gives poor results. Vector searches combined with full-text searches and keyword searches can be useful. For example, “Explain the ERR025 lost connection error?” will likely produce a better match using a keyword search on “ERR025” than via a vector search of embeddings.
  • Chunk attributes. Add context to the documents being retrieved as chunks. For example, if the knowledge base is “HTML”, there might be tags or comments in the HTML which help guide the search engine. These are useful to pass along with the chunks. If such metadata is not available, consider adding it.
  • Related questions. Generate embeddings of questions, and use them to search a vector store of similar previous questions and answers, along with the source chunks from those previous answers. It’s even better if those answers got a good 1-5 rating or a thumbs up. These top chunks should be combined with the chunks found from the vector and text-based searches of the content.
  • HyDE mechanism. Run the question first to get an answer, then use the question and answer together to retrieve chunks again. Often, the question plus the initial answer produces a better retrieval. This is the HyDE (Hypothetical Document Embeddings) mechanism. Alternatively, you can ask the LLM directly for a candidate answer (without RAG chunks), before using the user’s question and this preliminary answer to do the RAG retrieval (see the sketch after this list).
  • Chunk context. The organization of the data chunks can be useful. If the document store has an organization, that can be important context. For example, if the document store is a wiki with comments, or a ticketing system with comments and perhaps links to a knowledge base, it’s useful to provide all of this data to the LLM. It can be important that if the retrieval phase matches a comment, the comment’s parent document gets captured too, and perhaps even the knowledge base, related articles, or the document hierarchy.
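
Here is a minimal sketch of the HyDE-style retrieval mentioned above; llm() and retriever() are assumed interfaces for your model call and your datastore lookup, not a specific library.

    def hyde_retrieve(user_query, llm, retriever, top_k=5):
        # Ask the LLM for a draft answer first (no chunks), then retrieve
        # using the question plus that hypothetical answer.
        draft_answer = llm(f"Briefly answer this question: {user_query}")
        return retriever(user_query + "\n" + draft_answer, top_k)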

Advanced RAG Architectures

We’ve already covered some of the more advanced types of RAG architectures, such as combining fine-tuning and RAG. Here are some more possible extensions to a basic RAG architecture.

Citation management. An important part of a production RAG system is to have it emit citations in its answers. It’s relatively easy to store a URL or other identifier in the RAG datastore for every chunk, and a simplistic approach is to list the citations at the end of the LLM response. However, it’s a little trickier to know whether or not the LLM has actually used a chunk in its answer, or to insert the citation into the middle of the output text as an HREF link.

Reranking. Advanced RAG systems have a few steps after the retriever returns a few chunks of text. The reranker will attempt to decide which is the most relevant. Why do you need a “reranker” if the retriever has just looked it up? Presumably the retriever did its best to order them, right? Yes, that’s why it’s called a “reranker” rather than just a “ranker.”

The reranker is optional, but having one can improve the RAG system overall. It can be a model-based algorithm to review the chunk’s relevance to the query in more detail, since a typical retriever is non-LLM technology. The reranker may also have information that is unavailable to the retriever, such as user feedback data about chunks, or whether the RAG chunk has its KV data cached (i.e., “cache-aware” reranking).

Packing. The packing phase is after the reranker, and will merge the ranked chunks together into one text sequence. This could just be simple string concatenation of text chunks, or they might be incorporated into a more templated structured text (e.g., with separator lines of text, and numbering). Another issue is whether or not to put the citation URLs into the final text for the LLM to use.

Some research has shown that, at least for some models, the best packing is “reverse” order, with the most relevant chunk last. This sounds strange, but it puts the retriever’s best chunk closest to the user’s query at the end, which helps the LLM pay “attention” to that last chunk. However, more advanced models are better at extracting text across longer contexts, so this advice may not be that important for long.
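
A minimal sketch of rerank-then-pack follows; rerank_score() stands in for whatever relevance model is plugged in (a cross-encoder, an LLM call, or feedback-weighted scoring), and the reverse-order packing follows the observation above.

    def rerank_and_pack(query, chunks, rerank_score):
        # rerank_score(query, chunk) -> float, from whatever reranking model you use.
        ranked = sorted(chunks, key=lambda c: rerank_score(query, c), reverse=True)
        # "Reverse" packing: least relevant first, most relevant chunk last,
        # so the best chunk sits closest to the user's query in the prompt.
        return "\n\n".join(f"--- Chunk ---\n{c}" for c in reversed(ranked))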

Too much of a good chunk. A typical RAG system does a retrieval from the external datastore for every query. This is probably fine for a Q&A style lookup for factual answers, but is not always optimal for every query from a human, such as in a conversation with a chatbot. There are times where it’s inappropriate for a chatbot to respond with an answer based on some document. Sometimes people just want a chunk-free answer.

The basic approach is just to hope the datastore’s nearest-neighbor vector lookup will fail to match a chunk in such cases, or else let the LLM sort it out and do its best. The more advanced idea of using additional models to determine whether or not a user’s query requires an external RAG lookup is called “adaptive RAG” and it’s a relatively new area of research.

Knowledge graphs. Instead of basic text chunks of documents, a RAG system can use more complex representations of information. One of the rising stars in this type of system is the knowledge graph. This represents information in a hierarchical graph structure, allowing for more advanced reasoning in the results.

Prefix KV caching. There are multiple ways to use caches in RAG, such as basic datastore lookup caching (and indexing) optimizations for the retriever component. Another, deeper way to integrate RAG into a Transformer system is to also store the KV cache data with the RAG chunk text. The idea is to pre-compute the KV data that results from processing any RAG chunk. This ensures that the inference engine does not re-process the same chunk every time it is retrieved, but only needs to process the new tokens, which are usually the user’s actual question.

This works because the RAG chunk is usually prepended as the prefix of the full input to the LLM. Note that the “prefix” text might include both the “global instructions” and the RAG chunk, but this still works; it’s just a longer prefix to pre-compute for each chunk. Hence, a RAG prefix KV cache can speed up the latency significantly.

The downside is the need to add an extra caching component that maps the RAG chunk id to a blob of KV cache data (one for every layer of the model), and the inference engine needs to load that KV cache at the start of prefill. There are also difficulties in pre-processing multiple chunks, such as if they are ordered differently, but caching two or more chunks as a single prefix is possible.
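
Conceptually, the cache is just a map from chunk id to precomputed KV data, as in this sketch; engine.prefill() is an assumed interface, since real inference servers expose prefix caching in different ways (if at all).

    kv_cache = {}   # chunk_id -> precomputed KV-cache blob (one per chunk)

    def get_prefix_kv(chunk_id, chunk_text, engine):
        # Pay the prefill cost once per chunk; reuse it on every later query
        # that retrieves the same chunk as its prompt prefix.
        if chunk_id not in kv_cache:
            kv_cache[chunk_id] = engine.prefill(chunk_text)   # assumed engine API
        return kv_cache[chunk_id]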

References

  1. Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 1 Jul 2024, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219
  2. Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, Qing Li, 17 Jun 2024 (v3), A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models, https://arxiv.org/abs/2405.06211 Project: https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/
  3. Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue, 18 Jul 2024, Retrieval-Augmented Generation for Natural Language Processing: A Survey, https://arxiv.org/abs/2407.13193
  4. Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu, 23 Sep 2024, Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely, https://arxiv.org/abs/2409.14924
  5. Zhangchi Feng, Dongdong Kuang, Zhongyuan Wang, Zhijie Nie, Yaowei Zheng, Richong Zhang, 15 Oct 2024 (v2), EasyRAG: Efficient Retrieval-Augmented Generation Framework for Automated Network Operations, https://arxiv.org/abs/2410.10315 https://github.com/BUAADreamer/EasyRAG
  6. David Spuler, , September 26, 2024, RAG Optimization via Caching, https://www.aussieai.com/blog/rag-optimization-caching
  7. Zhi Jing, Yongye Su, Yikun Han, Bo Yuan, Haiyun Xu, Chunjiang Liu, Kehai Chen, Min Zhang, 6 Feb 2024 (v2), When Large Language Models Meet Vector Databases: A Survey, https://arxiv.org/abs/2402.01763
  8. Sebastian Petrus, Sep 4, 2024, Top 10 RAG Frameworks Github Repos 2024, https://sebastian-petrus.medium.com/top-10-rag-frameworks-github-repos-2024-12b2a81f4a49
  9. Damian Gil, Apr 17, 2024, Advanced Retriever Techniques to Improve Your RAGs, https://towardsdatascience.com/advanced-retriever-techniques-to-improve-your-rags-1fac2b86dd61
  10. Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343
  11. Vishal Rajput, Sep 27, 2024, Why Scaling RAGs For Production Is So Hard? https://medium.com/aiguys/why-scaling-rags-for-production-is-so-hard-a2f540785e97
  12. Adrian H. Raudaschl, Oct 6, 2023, Forget RAG, the Future is RAG-Fusion: The Next Frontier of Search: Retrieval Augmented Generation meets Reciprocal Rank Fusion and Generated Queries, https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1
  13. Pathway, Sep 2024, 2024 Top RAG Frameworks, https://pathway.com/rag-frameworks
  14. Sendbird, Oct 2024, Retrieval-Augmented Generation (RAG): Enhancing LLMs with Dynamic Information Access, https://sendbird.com/developer/tutorials/rag

 
