
Chapter 6. Fine-Tuning vs RAG

  • Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
  • by David Spuler and Michael Sharpe

Fine-Tuning vs RAG

If you want an AI engine that can produce answers based on your company’s product datasheets, there are two major options:

  • Fine-Tuning (FT): re-train your foundation model to answer new questions.
  • Retrieval-Augmented Generation (RAG) architecture: summarize excerpts of documents using an unchanged foundation model.

The two basic approaches are significantly different, with completely different architectures and a different project cost profile. Each approach has its own set of pros and cons.

Spoiler alert! RAG and fine-tuning can be combined.

Advantages of RAG

The RAG architecture typically has the following advantages:

  • Lower up-front cost
  • Flexibility
  • Up-to-date content
  • Access external data sources and/or internal proprietary documents
  • Content-rich answers from documents
  • Explainability (citations)
  • Personalization is easier
  • Hallucinations less likely (if retriever finds documents with the answer)
  • Scalability to as many documents as the datastore can handle.
  • RAG is regarded as “moderate” difficulty in terms of AI expertise.

The main goal of RAG is to avoid the expensive re-training and fine-tuning of a model. However, RAG also adds extra costs in terms of the RAG component and its database of documents, both at project setup and in ongoing usage, so it is not always a clear-cut win.

In addition to cost motivations, RAG may be advantageous in terms of flexibility in keeping user answers up-to-date. With a RAG component, any new documents can simply be added to the document datastore, rather than each new document requiring another model re-training cycle. This makes it easier for the AI application to include current information in its answers.
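
To make that flexibility concrete, here is a minimal sketch of the RAG ingestion path. Everything in it is illustrative: the embed() function is a toy stand-in for a real embedding model, the datastore is a plain in-memory list rather than a vector database, and the file name is hypothetical. The key point is that add_document() only writes to the datastore; no model weights are touched.

    # Minimal sketch of the RAG ingestion path (illustrative only).
    from collections import Counter

    datastore = []  # list of (chunk_text, embedding_vector) entries

    def embed(text):
        # Toy bag-of-words "embedding"; a production system would call
        # an embedding model or a hosted embeddings API here.
        return Counter(text.lower().split())

    def add_document(text, chunk_size=200):
        # Split into fixed-size word chunks, embed each chunk, and store it.
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunk = " ".join(words[i:i + chunk_size])
            datastore.append((chunk, embed(chunk)))

    # A new product datasheet is searchable as soon as this returns;
    # no fine-tuning or model re-training cycle is involved.
    add_document(open("new_datasheet.txt").read())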

Disadvantages of RAG

The disadvantages of RAG, where the underlying model is not fine-tuned, include:

  • Architectural change required (retriever component integrated with model inference).
  • Slower inference latency and user response time (extra step in architecture).
  • Extra tokens of context from document excerpts (slower response and increased inference cost).
  • Extra ongoing costs of retriever component and datastore (e.g., hosting, licensing).
  • Larger foundation model required (increases latency and cost).
  • Model’s answer style may not match domain (e.g., wrong style, tone, jargon, terminology).
  • Re-ingestion may be required for data updates.

Data needs to be re-ingested periodically to stay up to date, which is also true of FT. With RAG, any major changes to the source data may require it all to be ingested again. Even if the data itself stays the same but has been reorganized, it needs to be ingested again. If you’re not careful with revisions, citation links may resolve to 404 pages. Technically, this is also an advantage of RAG, because ingestion is not terribly expensive compared to fine-tuning.

Statelessness of RAG

There’s an even more fundamental limitation of RAG systems: they’re not any smarter than the LLM they use. In fact, a typical RAG system:

  • Remembers nothing
  • Learns nothing

The overall RAG implementation relies on the LLM for basic common sense, conversational ability, and simple intelligence. It extends the LLM with extra data, but not extra memory or extra skills.

The RAG system does not remember anything about the chunks it sees. Rather, it needs to scan them again for every query. It has no memory of having seen them before, except maybe a KV cache, which helps it run faster but not smarter.

The LLM does not learn anything from a chunk that lifts its knowledge onto a higher plane. For example, if you give a RAG system a chunked version of a hundred programming textbooks, I doubt you could ask it to generate a Snake game. However, you could certainly ask it about the syntax of a switch statement in the language, because that’s probably memorized inside the model weights.

With a RAG architecture, the model has not “learned” anything new, and you have only ensured it has the correct data to answer questions. For example, if you asked about how a “switch” statement works, the hope is that the retriever finds the chunks from the “Switch” sections of the programming books, and not from five random code examples that used the switch statement. This is all dependent on the text-based keyword indexing and how well the semantic embedding vectors work. The “R” in RAG is “Retrieval” and it’s very dependent on that.
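
To show what that dependence on retrieval looks like, here is a minimal sketch of the semantic search step, reusing the toy embed() function and in-memory datastore from the ingestion sketch above; call_llm() is a placeholder for whatever foundation model you use. A real retriever would use a vector database and often combine keyword and embedding search. If the top-ranked chunks turn out to be the five random code examples rather than the “Switch” sections, the model never sees the right material.

    import math

    def cosine(a, b):
        # Similarity between two toy bag-of-words vectors (Counters).
        dot = sum(count * b.get(token, 0) for token, count in a.items())
        norm_a = math.sqrt(sum(c * c for c in a.values()))
        norm_b = math.sqrt(sum(c * c for c in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def retrieve(query, top_k=3):
        # Rank every stored chunk against the query embedding; keep the best few.
        q = embed(query)
        ranked = sorted(datastore, key=lambda entry: cosine(q, entry[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]

    def call_llm(prompt):
        # Placeholder: plug in a hosted or in-house LLM API call here.
        raise NotImplementedError

    def answer(query):
        # Prepend the retrieved chunks as context for the unchanged foundation model.
        context = "\n\n".join(retrieve(query))
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
        return call_llm(prompt)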

It’s also somewhat dependent on the initial ingest of the RAG data. For example, if done well, the ingest logic and chunking could notice text versus code segments in the original content and then chunk or categorize it appropriately.
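
A crude version of that ingest-time categorization might look like the following sketch, which assumes the source documents mark code samples with ``` fences; a real pipeline would use a proper parser and would store the category as chunk metadata so the retriever can filter or boost by it later.

    def categorize_chunks(document):
        # Split on ``` fences and tag each piece as prose or code,
        # so the retriever can later filter or boost by category.
        chunks = []
        for i, piece in enumerate(document.split("```")):
            piece = piece.strip()
            if piece:
                category = "code" if i % 2 == 1 else "text"
                chunks.append({"category": category, "content": piece})
        return chunks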

There is some research on “continuous learning” or “continuous adaptation” or “incremental learning” on the horizon. Future RAG systems might retain some knowledge from the text chunks of documents (or conversations with users). Maybe some of the faster fine-tuning algorithms, such as Parameter-Efficient Fine-Tuning (PEFT) or Low-Rank Adapters (LoRA), will improve RAG’s learning capabilities in the future.
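
As a rough illustration of how lightweight those adapter methods are, here is a sketch using the Hugging Face peft library to wrap a base model with LoRA adapters. The model name and hyperparameters are illustrative choices, not recommendations; the point is that only the small adapter matrices are trained, which is what could make incremental updates cheap enough to pair with a RAG datastore.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative base model; substitute whatever foundation model you use.
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    lora = LoraConfig(
        r=8,                  # low-rank dimension of the adapter matrices
        lora_alpha=16,        # scaling factor for the adapter updates
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora)
    model.print_trainable_parameters()  # typically well under 1% of the weights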

Advantages of Fine-Tuning

The main advantages of a fine-tuning architecture over a RAG setup include:

  • Style and tone of responses can be trained — e.g., positivity, politeness.
  • Use of correct industry jargon and terminology in responses.
  • Brand voice can be adjusted.
  • No change to inference architecture — just an updated model.
  • Faster inference latency — no extra retriever search step.
  • Reduced inference cost — fewer input context tokens.
  • No extra costs from retriever and datastore components.
  • Smaller model can be used — further reduced inference cost.
  • You’re more likely to get free lunches and access to the company jet on weekends.

Fine-tuning is not an architectural change; it simply produces an updated version of a major model (e.g., GPT-3). RAG, by contrast, is a different architecture with an integration to a search component that accesses an external knowledge database or datastore and returns a set of documents or chunks/snippets of documents.

Disadvantages of Fine-Tuning

The disadvantages of a fine-tuning architecture without RAG include:

  • Training cost — up-front and ongoing scheduled fine-tuning is expensive.
  • Outdated information used in responses.
  • Needs a lot more proprietary data than RAG.
  • Training data must be in a paired input-output format, whereas RAG can use unstructured text (see the example after this list).
  • Complexity of preparing and formatting the data for training (e.g., categorizing and labeling).
  • No access to external or internal data sources (except what it’s been trained on).
  • Hallucinations more likely (if it hasn’t been trained on the answer).
  • Personalization features are difficult.
  • Lack of explainability (hard to know how the model derived its answers or from what sources).
  • Poor scalability (e.g., if too many documents to re-train with).
  • Fine-tuning (training) is regarded as one of the highest difficulty-level projects in terms of AI expertise.
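
Regarding the paired input-output format mentioned in the list above, fine-tuning data is usually a set of prompt-and-response records, often stored as JSONL. The field names and the example product below are purely illustrative (real platforms differ, e.g., chat-style “messages” records), but every pair has to be written or curated, which is where much of the data-preparation effort goes.

    import json

    # Hypothetical training pairs for a made-up "X200" product; field names
    # are illustrative and vary between fine-tuning platforms.
    examples = [
        {"prompt": "What is the maximum operating temperature of the X200 sensor?",
         "completion": "The X200 is rated for continuous operation up to 85 degrees Celsius."},
        {"prompt": "Does the X200 support Modbus?",
         "completion": "Yes, the X200 supports Modbus RTU over its RS-485 port."},
    ]

    with open("train.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")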

The main disadvantage of fine-tuning is the compute cost of GPUs for fine-tuning the model. This is a large up-front cost at a minimum, and there is also an ongoing cost to update the model with the latest information. The inherent disadvantage of scheduled fine-tuning is that the model is always out-of-date, since it only has information up to the most recent fine-tuning. This differs from RAG, where queries can respond quickly using information in a new document, even within seconds of its addition to the document datastore.

Cost Comparison of RAG and FT

The fine-tuning approach has an up-front training cost, but a lower inference cost profile. However, fine-tuning may be required on an ongoing schedule, so this is not a once-only cost. The lower ongoing inference cost for fine-tuning is because (a) there’s no extra “retrieval” component needed, and (b) a smaller model can be used.

RAG has the opposite cost profile. The RAG approach has a reduced initial cost because there is no big GPU load (i.e., no re-training of any models). However, a RAG project still has up-front costs in setting up the new architecture, but so does a fine-tuning project. In terms of ongoing costs, RAG increases the inference cost of every user query, because the retrieved document excerpts are sent as extra “context” tokens for the user query, so the engine has more input to process. This increases the inference cost whether the model is commercially hosted or running in-house, and the extra cost continues for the lifetime of the application.
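
To make this concrete, here is a back-of-envelope sketch comparing the ongoing RAG context cost with a periodic fine-tuning cost. Every number is an illustrative assumption, not a quote from any vendor; substitute your own query volume, token prices, and GPU costs (and note the sketch ignores the savings from the smaller model that fine-tuning may allow).

    # Back-of-envelope cost comparison (all numbers are illustrative assumptions).
    queries_per_day = 10_000
    extra_context_tokens = 2_000           # retrieved chunks prepended to each query
    input_token_price = 0.50 / 1_000_000   # dollars per input token (assumed)

    rag_extra_cost_per_month = (queries_per_day * 30
                                * extra_context_tokens * input_token_price)
    print(f"Extra RAG context cost: ${rag_extra_cost_per_month:,.0f} per month")

    # Suppose a LoRA fine-tuning run costs $500 in GPU time and is repeated monthly.
    finetune_cost_per_month = 500
    print(f"Fine-tuning cost:       ${finetune_cost_per_month:,.0f} per month")

With these made-up numbers the two approaches land in the same ballpark, which is exactly why you should run the calculation with your own figures rather than trusting anyone’s marketing.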

One final point: you need to double-check if RAG is cheaper than fine-tuning. I know, I know, sorry! I wrote that it was, and now I’m walking that claim back.

But commercial realities are what they are. There are a number of commercial vendors pushing the various components for RAG architecture and some are hyping that it’s cheaper than buying a truckload of GPUs. But fine-tuning isn’t that bad, especially with PEFT and LoRA available, and you need to be clear-eyed and compare the pricing of both approaches.

Staffing profiles are another issue in RAG versus fine-tuning. Fine-tuning likely requires some ML-savvy data scientists, whereas RAG needs only your average programmer without much ML experience.

Hybrid RAG + Fine-Tuning Methods

Fine-tuning and RAG are more like frenemies than real enemies: they’re usually against each other, but they can also work together. If you have the bucks for that, it’s often the best option. The RAG architecture uses a model, and there’s no reason that you can’t re-train that model every now and then, if you’ve got the compute budget for re-training.

In a hybrid architecture, the most up-to-date information is in the RAG datastore, and the retriever component accesses that in its normal manner. But we can also occasionally re-train the underlying model, on whatever schedule our budget allows, and this gets the benefit of innate knowledge about the proprietary data inside the model itself. Occasional re-training helps keep the model updated on industry jargon and terminology, and also reduces the risk of the model filling gaps in its knowledge with “hallucinations.”

Once-only fine-tuning. One hybrid approach is to use a single up-front fine-tuning cycle to focus the model on the domain area, and then use RAG as the method whereby new documents are added. The model is then not fine-tuned again.

The goal of this once-only fine-tuning is to adjust static issues in the model:

  • Style and tone of expression
  • Brand voice
  • Industry jargon and terminology

Note that I didn’t write “up-to-date product documents” on that list. Instead, we’re putting all the documents in a datastore for a RAG retriever component. The model doesn’t need to be re-trained on the technical materials, but will get fresh responses using the RAG document excerpts. The initial fine-tuning is focused on stylistic matters affecting the way the model answers questions, rather than on learning new facts.

Another strength of fine-tuning the LLM used in a RAG system is getting the model accustomed to answering in the form you want, and to accepting queries in the form you want. This can make subsequent prompting easier in the RAG system.

Occasional re-training might still be required for ongoing familiarity with jargon or tone, or if the model starts hallucinating in areas where it hasn’t been trained. However, this will be infrequent or possibly never required again.

Use Cases for FT vs RAG

I have to say that I think RAG should be at the top of the pile for most business projects. The first rule of fine-tuning is: do not talk about fine-tuning.

A typical business AI project involves the use of proprietary internal documents about the company’s products or services. This type of project is well-suited for RAG, with its quick updates and easy extraction of relevant document sections. Hence, RAG is my default recommendation for such use cases.

Fine-tuning is best for slow-changing or evergreen domain-specific content. RAG is more flexible in staying up-to-date with fast changing news or updated proprietary content. A hybrid combined approach can work well to use fine-tuning to adjust tone and style, whereas RAG keeps the underlying content fresh.

If the marketing department says you need to do fine-tuning for “brand voice” reasons, you simply ask them to define what exactly that means. That’ll keep them busy for at least six months.

Fine-tuning can also be preferred in some other specialist situations:

    a. Complex structured syntax, such as a chat bot that generates code for a proprietary language or schema.

    b. Keeping a model focused on a specific task. For example, if a coding copilot is generating a UI based on a text description using proprietary languages, you don’t want to get side-tracked by who won the 2023 US Open because the prompt somehow mentioned “tennis.”

    c. Fixing failures using the foundation model or better handling of edge cases.

    d. Giving the model new “skills”, or teaching the model how to understand some domain language and guide the results. For example, if you train the model using the works of Shakespeare and then ask it to output something in HTML, the model will fail. Even using RAG and providing HTML examples as context will likely fail, too. However, fine-tuning the model with HTML examples will succeed, and allow it to answer questions about the works of Shakespeare, and create new, improved works of Shakespeare that people other than English teachers can actually understand (and how about a HEA in R&J). After that, it’ll format the results very nicely in HTML thanks to your fine-tuning.

    e. Translation to/from an unfamiliar or proprietary language. Translating to a foreign language the model has never seen is a clear example where fine-tuning is needed. Proprietary computer languages are another area. For example, consider the task of creating a SQL schema by converting a given SAP schema. Fine-tuning would be required to provide knowledge of SQL, SAP, and the various mappings. Some models might already have some clue about SQL and SAP schemas from internet training data sets, but SAP schemas are also often company-specific.

    f. Post-optimization fine-tuning. This is another use of fine-tuning after certain types of optimizations that create smaller models, such as pruning or quantization. RAG cannot help here, because this is about basic model accuracy. These optimizations cause damage to the accuracy of the models, and it’s common practice to do a small amount of fine-tuning on the smaller model to fix these problems.

Obviously, we prefer RAG, and it is an optimal solution in many cases. However, fine-tuning has been making a comeback, driven by use cases like these and by optimizations such as LoRA architectures.

 
