Chapter 11. RAG Caching

  • Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
  • by David Spuler and Michael Sharpe


Several components in a RAG architecture can be optimized with a cache. The retrieval component can use any of the caching types applicable to its underlying database or datastore architecture, regardless of whether it uses keyword or vector lookup, and whether the data is stored on disk or held in memory. At the lowest level of the LLM, there are various KV caching techniques (see further below). At the topmost level, there can be an overall cache: an “inference cache” for exactly identical queries, or a “semantic cache” for similar queries.

In addition to RAG-specific caches, such as retrieval caches, there are various LLM caching methods. Several of the many types of KV caching optimization can speed up RAG architectures (and other LLM use cases). The main KV cache techniques involve precomputed caches for RAG chunks, such as prefix caching or session caching. Such techniques include:

  • Prefix KV cache
  • Session KV cache (multi-turn KV caching)
  • Substring KV cache (Lengthwise-fused KV caching)
  • Global KV cache (multi-query KV caching)

Other general types of caching apply to any LLM system and can also be used with RAG:

  • Inference cache
  • Global KV caching
  • Semantic cache
  • Context cache
  • Prompt cache

Text-to-Text Inference Caching

Any LLM architecture can put a text-to-text cache in front of the query processor. Whenever the user types in the same query, we can output an answer that we’ve previously cached. There are two main types:

  • Inference cache — exact text-to-text mapping (identical queries).
  • Semantic cache — detects similar-enough queries with the same meaning.

Text-to-text caching works for RAG applications and is perhaps even more effective than for other LLM applications, because a text-to-text cache removes the need to launch the RAG retriever. We don’t need chunks of text if we already have stored the answer.
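As a rough illustration, here is a minimal sketch of the two cache types layered in front of a RAG pipeline. The embed() and run_rag_pipeline() helpers, the similarity threshold, and the cache layout are all illustrative assumptions, not any particular library’s API:

    # Minimal sketch of an inference cache (exact match) plus a semantic cache
    # (similarity match) in front of a RAG pipeline. embed() and
    # run_rag_pipeline() are hypothetical placeholders for illustration only.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Placeholder embedding function (hypothetical)."""
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(384)

    def run_rag_pipeline(query: str) -> str:
        """Placeholder for the full RAG pipeline: retrieve chunks, then generate."""
        return f"Answer to: {query}"

    exact_cache: dict[str, str] = {}                   # inference cache
    semantic_cache: list[tuple[np.ndarray, str]] = []  # semantic cache

    def cached_answer(query: str, threshold: float = 0.92) -> str:
        # 1. Inference cache: identical query text seen before.
        if query in exact_cache:
            return exact_cache[query]
        # 2. Semantic cache: a previous query with a similar-enough embedding.
        q = embed(query)
        for vec, answer in semantic_cache:
            cos = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
            if cos >= threshold:
                return answer
        # 3. Cache miss: run retrieval + generation, then store the answer.
        answer = run_rag_pipeline(query)
        exact_cache[query] = answer
        semantic_cache.append((q, answer))
        return answer

On a hit in either cache, the retriever and the LLM are never invoked at all.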

There are a few problems with implementing a text-to-text inference cache. Some of the issues include:

  • Same answer every time
  • Not all queries can be cached

One practical issue is that your RAG application will output exactly the same word-for-word answer for the same input query. It’s great for regression testing, but users may not like it.

Furthermore, there are a lot of different types of queries that cannot be cached. Some examples include:

  • Time-dependent queries
  • Location-dependent queries
  • News-related queries
  • Any topic that changes rapidly

Hence, this type of cache may not be ideal. Instead, we’d like the speed of caching, but with some flexibility in the answers. The solution is to use the KV cache.

Global KV Caching

Instead of text-to-text caching, we can go halfway and cache the internal computations of the LLM, known as the “KV cache” data. The basic type of KV caching occurs within every query, one token at a time, but we can also store the whole of a query’s KV cache data in a “global KV cache.” We can do this for an “inference cache” of exact queries, but it doesn’t work for semantic caching, because the number of tokens is usually different (i.e., the user query used slightly different words), and KV cache data is very specific to a particular token sequence.

The idea is that for a known query, such as one in an inference cache, a huge part of the computation is already stored in this internal data. Most of the prompt tokens are effectively already processed, so there’s no need to do any of the “prefill phase” that usually delays the LLM’s first token. Only the decoding phase remains, and the LLM can output its first token very quickly, giving good user responsiveness.
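A minimal sketch of the idea, with hypothetical tokenize(), prefill(), and decode_from_cache() stubs standing in for the real model internals:

    # Sketch of a global KV cache: store the full prefill result keyed by the
    # exact token sequence, and reuse it to skip the prefill phase entirely.
    # tokenize(), prefill(), and decode_from_cache() are illustrative stubs.
    from typing import Dict, List, Tuple

    def tokenize(text: str) -> List[int]:
        return [hash(w) % 50000 for w in text.split()]   # placeholder tokenizer

    def prefill(tokens: Tuple[int, ...]) -> List[tuple]:
        return [("kv", t) for t in tokens]               # stand-in for real K/V tensors

    def decode_from_cache(tokens: Tuple[int, ...], kv: List[tuple]) -> str:
        return f"generated answer ({len(tokens)} prompt tokens)"

    global_kv_cache: Dict[Tuple[int, ...], List[tuple]] = {}

    def answer_with_global_kv(prompt: str) -> str:
        tokens = tuple(tokenize(prompt))      # KV data is tied to this exact token sequence
        kv = global_kv_cache.get(tokens)
        if kv is None:
            kv = prefill(tokens)              # cache miss: pay the full prefill cost once
            global_kv_cache[tokens] = kv
        return decode_from_cache(tokens, kv)  # only the decoding phase remains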

Global KV caching works especially well for RAG applications because:

  • Input data is large — multiple chunks of text.
  • Queries are small — usually questions of a few words.
  • Output is smaller than the input (it’s usually a summary extracted from a chunk).

Hence, this type of global KV caching works in any LLM and is a good candidate for RAG caching.

RAG-Specific Caching Optimizations

But what can we do only in a RAG architecture? How can we make the LLM run faster when we’re looking up most of the answer as text chunks in a vector database? The main answers are:

  • Prefix KV caching
  • Precomputed RAG KV cache

Prefix KV caching is a slight generalization of global KV caching, in that it will cache any prefix of tokens that it’s seen before. If you think about it, there are a lot of common prefixes in a RAG architecture:

  • Global prompt instructions (prepended)
  • RAG chunks (prepended)
  • User query (a few words)

We have a common prefix every time, consisting of the global instructions plus the first RAG chunk. The prefix can span more than one RAG chunk if our vector database deterministically returns the chunks in the same order.

Prefix KV caching is a very powerful technique, and it has been implemented in various inference frameworks and commercial platforms, such as vLLM, DeepSeek, and Anthropic. It’s so effective that some LLM API providers now offer “cached tokens” at a much lower price point.
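As a rough sketch of the lookup logic (real implementations such as vLLM’s automatic prefix caching use block-level hashing rather than this linear scan, and the prefill stub and cache layout here are illustrative assumptions):

    # Minimal sketch of prefix KV caching: reuse KV data for the longest
    # previously-seen token prefix, then prefill only the remaining suffix.
    from typing import Dict, List, Tuple

    prefix_cache: Dict[Tuple[int, ...], List[tuple]] = {}   # token prefix -> its KV data

    def prefill_suffix(tokens: List[int], start: int) -> List[tuple]:
        return [("kv", t) for t in tokens[start:]]           # placeholder KV computation

    def prefill_with_prefix_cache(tokens: List[int]) -> List[tuple]:
        # Find the longest cached prefix of this prompt.
        matched, kv = 0, []
        for n in range(len(tokens), 0, -1):
            hit = prefix_cache.get(tuple(tokens[:n]))
            if hit is not None:
                matched, kv = n, list(hit)
                break
        # Compute KV data only for the uncached suffix, then cache the whole prompt.
        kv += prefill_suffix(tokens, matched)
        prefix_cache[tuple(tokens)] = kv
        return kv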

Precomputed RAG Cache

The main problem with all of the above caching techniques is that they require a user’s query to have been seen already. We don’t have a cache entry the first time.

So why not precompute it?

In order to precompute our cache, we could try to guess all of our user’s likely queries. Then we can pre-populate the cache by running all those candidate user queries.

In fact, that would work for any of the above caching methods, but it’s somewhat hard to know all the variations of words that users might choose. And remember that semantic caching doesn’t work for KV data, so we can’t use vector embeddings to help with a KV cache.

A better idea is to precompute our cache for each RAG chunk. The idea here is to store the KV cache data in the vector database. For every text chunk, we have a precomputed KV cache. When the retriever returns a chunk, or more than one chunk, it also returns any KV cache data that it has for that chunk.

Note that if we don’t want to precompute for all of the text chunks, we can precompute for a subset of the most frequently used chunks, and then add more KV cache data to the RAG cache as other chunks appear in user queries.

However, our RAG chunk is not actually the prefix of the LLM’s input if there’s a prepended set of global instructions. The trick is to precompute the KV cache data for the combined text: the global instructions plus the RAG chunk.
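A minimal sketch of this indexing-time precomputation, assuming a hypothetical precompute_kv() helper and a simple in-memory chunk store (a real system would store the KV blob in the vector database alongside the chunk and its embedding):

    # Sketch of precomputing per-chunk KV cache data at indexing time, keyed so
    # the cache is valid when the chunk sits directly after the global instructions.
    # precompute_kv() and the record layout are illustrative assumptions.
    from typing import Dict, List

    GLOBAL_INSTRUCTIONS = "You are a helpful assistant. Answer using the supplied context."

    def precompute_kv(text: str) -> bytes:
        return text.encode("utf-8")[:64]      # stand-in for serialized KV tensors

    chunk_store: Dict[str, dict] = {}

    def index_chunk(chunk_id: str, chunk_text: str) -> None:
        # Precompute KV for (global instructions + chunk), so the cached data
        # is valid when this chunk lands directly after the instructions.
        kv_blob = precompute_kv(GLOBAL_INSTRUCTIONS + "\n" + chunk_text)
        chunk_store[chunk_id] = {"text": chunk_text, "kv": kv_blob}

    def retrieve(chunk_ids: List[str]) -> List[dict]:
        # The retriever returns each chunk together with its precomputed KV data.
        return [chunk_store[c] for c in chunk_ids]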

This means that if we change the global instructions, we have to discard any previously cached KV data. Similarly, it doesn’t work if there are per-user global instructions, because we don’t want to store a wholly different cache for every user!

With this approach, the only part of the prompt that isn’t precomputed is the user’s query, which is usually only a few words of a question.

Prefix KV Caching Takes Off

There have been two notable trends in relation to “prefix KV caching” in the last couple of months:

    1. Several industry implementations of prefix KV caching,

    2. Multiple research papers on the generalization to “non-prefix KV caching.”

There have been quite a few industry announcements of “caching” functionality, under names such as “prompt caching” or “context caching,” from platforms such as:

  • vLLM
  • Character.AI
  • DeepSeek
  • Google Gemini
  • Anthropic
  • OpenAI
  • OpenVINO

Such is the efficiency benefit of prefix KV caching that the pricing for “cached tokens” is much lower than normal inference pricing. It’s faster and cheaper!

Multiple RAG Chunks

You’ve probably noticed a problem with the precomputed RAG KV cache: what if the retriever returns multiple chunks? After all, it usually does!

Yes, there’s a problem here, because the retriever could return the chunks in different orders. It’s not always going to be the same prefix.

If we have 10 chunks and 10 precomputed KV caches, we can’t just munge them together. Furthermore, each of our precomputed KV caches might assume it’s the one just after the global instructions, whereas only 1 of the 10 chunks is actually the prefix chunk, so we’ve got a big mess.

There are several basic approaches used in the research literature to resolve this problem:

  • Get the retriever to favor returning the one RAG chunk that has a cache.
  • Store multiple caches for different RAG chunk orders.
  • Fuse the per-chunk KV caches together (“fused KV caching”).

A “cache-aware” RAG retriever tries to return a RAG chunk that has a cache as the first one. But it’s not a very general approach: the other nine RAG chunks either don’t have a cache, or if they do, we can’t use it, because they’re not in the prefix position.
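A minimal sketch of this cache-aware reordering step, with a hypothetical has_kv_cache() lookup into the RAG KV cache store:

    # Sketch of a cache-aware reranking step: promote the first retrieved chunk
    # that has a precomputed KV cache to the front so it can serve as the prefix.
    from typing import List

    CACHED_CHUNKS = {"doc-3", "doc-7"}        # assumed ids with precomputed KV data

    def has_kv_cache(chunk_id: str) -> bool:
        return chunk_id in CACHED_CHUNKS

    def cache_aware_order(ranked_chunk_ids: List[str]) -> List[str]:
        for i, cid in enumerate(ranked_chunk_ids):
            if has_kv_cache(cid):
                # Put the cached chunk first; keep the others in their ranked order.
                return [cid] + ranked_chunk_ids[:i] + ranked_chunk_ids[i + 1:]
        return ranked_chunk_ids               # no cached chunk retrieved; nothing to reuse

Note that promoting a lower-ranked chunk to first position changes the chunk ordering that the LLM sees, which may itself affect answer quality.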

Storing multiple KV caches for different chunk orders is possible, but it’s very expensive. If the chunks come out in different orders, we need to cache each unique ordering. With 10 chunks in a retrieved result, there are 10! = 3,628,800 possible orderings to cache. This can be done incrementally with prefix KV caching, but we certainly can’t precompute all of that!

The last idea is “fused KV caching.” I said above that we can’t just munge multiple caches for chunks all together. Sorry, I lied, because you actually can.

Fused KV Caching of RAG Chunks

This is definitely at the cutting edge of AI research. There are only a handful of papers that look at this idea of fusing KV caches together, and I’ve seen two new papers in this research area in just the last week. There are various names being used:

  • Substring KV cache
  • Fused KV cache
  • Position-independent KV cache
  • KV cache concatenation

These all amount to the same thing: just merge all the KV caches together! It seems like it shouldn’t be that simple, and yet there are now at least four research papers that suggest that it is.

If a query is sent to the LLM after the retriever has returned 10 chunks, then the LLM sees this:

  • Global instructions prompt
  • Chunk 1
  • … (chunks 2 through 9)
  • Chunk 10
  • User query text

Each of those parts has its own KV cache data precomputed. We have a general precomputation for the global instructions, and our retriever returns a separate precomputed KV cache for each of the 10 chunks. The only part without a KV cache is the user’s query text.

How do we merge multiple KV caches? It seems like it should be a fancy GPU computation, and yet, no.

Instead, we just put them one after the other, in sequence, and pretend that we’ve computed the cache for the full token sequence. This is “fused KV caching,” and it’s obviously a fast method that requires only memory copies and literally zero extra computation.
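A minimal sketch of the fusion step itself, using PyTorch tensor shapes as an illustrative assumption (real KV caches are stored per layer, and position encodings such as RoPE offsets need careful handling):

    # Sketch of fused KV caching: concatenate precomputed per-segment K and V
    # tensors along the sequence dimension and use the result as if it were the
    # cache for the whole prompt. Note there is no cross-segment attention in
    # the fused cache; the segments never "saw" each other during prefill.
    import torch

    def fuse_kv_caches(segments):
        """segments: list of (K, V) pairs, each shaped [num_heads, seq_len, head_dim]."""
        fused_k = torch.cat([k for k, _ in segments], dim=1)   # join along seq_len
        fused_v = torch.cat([v for _, v in segments], dim=1)
        return fused_k, fused_v

    # Example: an instructions cache plus two chunk caches (8 heads, head_dim 64).
    segments = [(torch.randn(8, n, 64), torch.randn(8, n, 64)) for n in (32, 120, 95)]
    K, V = fuse_kv_caches(segments)
    print(K.shape, V.shape)   # both are [8, 247, 64]: 32 + 120 + 95 cached positions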

I really don’t understand why this works at all, let alone why it works well. Nevertheless, several research papers assert that it seems to be effective.

Even so, I feel like it needs a lot more research. Also, there are literally zero research papers on using “fused KV caching” with other optimizations, such as linear attention algorithms or hybrid local-global attention, which would be an obvious improvement to make RAG run even faster.

Open Questions for Non-Prefix KV Caching

This area still needs more research! Even basic confirmation of the technique requires further validation. Some of the other areas needing examination include:

  • Why does this even work? Does the attention computation during the decoding phase quickly make up for the total lack of inter-chunk and instructions-chunk attention in the KV cache?
  • To what extent is this a lossy approximation of the KV cache, and how does it depend on the position of the RAG chunk in the ordering?
  • Does the lossiness of the approximation depend on the length of the chunk? On the number of chunks? On the distance between the chunk and the user query tokens? On the length of any prepended instructions? On the length of the user’s query afterwards?
  • Whether the KV cache for a RAG chunk should be computed with or without any prepended global instructions or meta-instructions (i.e., it’s faster if not, and using a prepended instruction sequence is like doing prefix KV caching if the chunk is in first position, so it may depend on its ranking position.)
  • To what extent does this fused KV caching method require that global attention is used in the decoding phase? Using local attention (e.g., sliding window) or hybrid local-global attention is a common industry backend optimization. Does this method of RAG chunk KV caching undermine this other optimization method?

Nevertheless, I am optimistic that we’ll see an industry implementation of fused KV caching for RAG chunks in the near future!

Cached Tokens Should Be Cheaper?

When you think about “cached tokens” at a 50% discount, shouldn’t they be even cheaper? I mean, caching takes a complex LLM computation away and replaces it with a disk load of some KV data (i.e., prefix KV caching). And why should you pay more for cached tokens on the bigger models, when the storage cost would be the same for all LLMs?

That all sounds like reasonable logic, and yet, no. The above reasoning about storage costs misunderstands how the caching of tokens works:

  • It’s not only a disk load, and
  • The cost depends on the cached token length, and
  • The cost does depend on the model.

The way that prefix KV caching works is that it does take the compute cost away for that prefix of tokens, replacing it with a load of cached data. Note that it’s not only a storage cost, but also a network bandwidth cost, because that data has to be sent to the GPUs, which points to the key issue:

    GPUs still use the KV cache data.

The GPUs still have to crunch away on that KV cache data, not for the cached tokens, but for:

    (a) all of the non-cached prompt tokens, and

    (b) all of the newly decoded tokens in the generated output.

Hence, token caching does alleviate the “context processing” part of the query, but only in the prefill phase, and not even for all of the prefill phase. And the longer the cached token sequence, the more computation is required for each non-cached token, because each one has to “pay attention” to every single cached token. Hence, there is still extra GPU computation cost in the attention module, with these features:

  • More computations for longer cached token sequences, and
  • Extra computations for every non-cached user query token, and
  • Extra attention computations for every output token, and
  • These are all more expensive computations when used with bigger models.

So it’s not clear whether a 50% discount is too high or too low, but we shouldn’t expect a fixed price either.
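As a rough back-of-the-envelope illustration of that cost model (the formula and constants here are illustrative assumptions, not any provider’s actual pricing):

    # Toy cost model: attention work for the non-cached prompt tokens and the
    # generated tokens still scales with the cached prefix length and model size.
    def attention_cost(cached_len: int, query_len: int, output_len: int,
                       layers: int, hidden: int) -> float:
        new_tokens = query_len + output_len            # tokens the GPU must still process
        context = cached_len + new_tokens              # everything they attend over
        return new_tokens * context * layers * hidden  # arbitrary cost units

    small = attention_cost(cached_len=4000, query_len=20, output_len=300, layers=32, hidden=4096)
    big   = attention_cost(cached_len=4000, query_len=20, output_len=300, layers=80, hidden=8192)
    print(big / small)   # the same cached prefix still costs more on a bigger model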

References on RAG Caching

Research papers on RAG cache architectures:

  1. Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457
  2. Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444
  3. Google, 2024, Context caching, https://ai.google.dev/gemini-api/docs/caching?lang=python (Pass in context tokens and reuse them without re-uploading, might be doing something like prefix KV caching underneath.)
  4. Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  5. Pere Martra, Aug 2024 (accessed), Implementing semantic cache to improve a RAG system with FAISS, https://huggingface.co/learn/cookbook/semantic_cache_chroma_vector_database
  6. Richmond Alake, Apoorva Joshi, Aug 14, 2024, Adding Semantic Caching and Memory to Your RAG Application Using MongoDB and LangChain, MongoDB, https://www.mongodb.com/developer/products/atlas/advanced-rag-langchain-mongodb/
  7. Anthropic, 20 Sept 2024, Introducing Contextual Retrieval, https://www.anthropic.com/news/contextual-retrieval
  8. Yihua Cheng, Kuntai Du, Jiayi Yao, Junchen Jiang, 16 Sep 2024, Do Large Language Models Need a Content Delivery Network? https://arxiv.org/abs/2409.13761 https://github.com/LMCache/LMCache (Managing the process of sharing KV cache data over a network.)
  9. David Spuler, September 26, 2024, RAG Optimization via Caching, https://www.aussieai.com/blog/rag-optimization-caching
  10. Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang, 10 Oct 2024, TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text, https://arxiv.org/abs/2410.07590 (Fusing precomputed KV caches for each RAG chunk.)
  11. David Spuler, October 24, 2024, Generalizing Prefix KV Caching to RAG Chunks, Aussie AI Blog, https://www.aussieai.com/blog/prefix-kv-rag

 
