
Chapter 10. Long RAG, Mini-RAG and Mega-RAG

  • Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
  • by David Spuler and Michael Sharpe


Long Context RAG

There’s been a lot of research over the last year on “long context” in LLM inference. There were two problems that needed solving:

  • Length generalization
  • Quadratic inference cost

In short, LLMs on long input documents were neither smart nor speedy.

Length generalization is about making LLMs smart enough to find the answer to your question anywhere in a long document. Historically, LLMs were good at finding answers near the start of a document, and also near the end, but not so much in the middle (the “lost in the middle” problem). This area has improved massively, to the point where many models now pass various types of “needle in a haystack” inference benchmarks.

The other problem with running an LLM over a long document was the computational cost, which is “quadratic” in the length of the input. Hence, running inference on a long document was prohibitively expensive. The main problems were:

  • Self-attention had a compute cost that was quadratic in the input length
  • KV cache memory grew linearly with the input length

Well, that sounds easy, since there are only two problems: GPU cost and memory cost. No, it was far from easy, but there have been several research breakthroughs in both areas, using techniques such as:

  • Memory-efficient attention algorithms (e.g., FlashAttention, PagedAttention).
  • KV cache compression (e.g., KV quantization, KV layer fusion, etc.).

Long context is a solved problem!

Overall, the good news is that we now have models with a very long “context window,” such as 128K or even 1M tokens, in both open-source and commercial offerings. Furthermore, the pricing on these models from commercial providers has been dropping fast, to the point where their use is cost-effective for many AI applications.

What has this to do with RAG?

The main way that RAG applications can leverage a long context window in a model is through the chunks. The size of the LLM context is the product of the number of chunks returned by the retriever (or reranker) and the token length of each chunk. Hence, having a fast and cheap way to process large documents, and an LLM smart enough to find the useful information in them, leads to ideas like:

  • Retrieve more chunks
  • Use bigger chunks

And taking all these to their logical conclusion, as programmers tend to do, we get the ideas:

  • Mini-RAG — combine all the chunks into one massive document (dispensing with the retriever completely), or
  • Mega-RAG — use a small number of very massive chunks.

Both of these ideas are now viable options for RAG architectures with a long context LLM as the superpower underneath. Whether to use them mostly depends on the use case.
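To make the chunk arithmetic concrete, here is a rough sketch in Python; the context window size, reserved tokens, and chunk sizes are illustrative assumptions, not recommendations.

    # Rough chunk-budget arithmetic (all numbers are illustrative assumptions).
    context_window = 128_000   # tokens available in a long-context model
    reserve = 2_000            # rough allowance for instructions, the query, and the answer
    budget = context_window - reserve

    # Classic RAG: many small chunks.
    print(budget // 512)       # ~246 chunks of 512 tokens each
    # Mega-RAG: a handful of very large chunks.
    print(budget // 16_000)    # ~7 chunks of 16K tokens each

Either way, the total context is the same; the question is whether you spend it on many small chunks or a few huge ones.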

Mini-RAG

The fullest generalization of this idea is that you can cancel the licensing fees on your vector database, and shut down your reranker. Just use one document!

The idea is that you can prepend anything you like, such as a full datasheet of information about your preferred product, to every query; the LLM then has to decide how much of that information it wants to use from the context. For example:

    The DeLorean is a famous sportscar that was used in the movie Back to the Future starring Michael J. Fox. It was literally a time machine in the movie and its sequels. This is why everybody loves this car and a DeLorean is the most visibly wonderful car you could possibly buy. The most notable and impressive feature of the DeLorean is that the doors open upwards like wings. Also, the engine is located in the rear, like all good sportscars.

You could write literally a thousand words about the specs of a DeLorean if you like. This is like a “mini-RAG” system where only one document is ever returned. But you don’t need to code any of the RAG architecture elements, like a datastore retrieval module, because there’s no datastore. Instead, the method is just a single string operation to prepend the long context in your prompt engineering module.

You do have a prompt engineering module, right?
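If you don’t, here is a minimal sketch of the mini-RAG version of one, in Python; the datasheet text, prompt wording, and function name are placeholders rather than a prescribed format.

    # Minimal mini-RAG sketch: one static document, no retriever, no vector database.
    # The datasheet text and prompt wording are illustrative placeholders.
    PRODUCT_SHEET = (
        "The DeLorean is a famous sportscar with gull-wing doors that open upwards "
        "and a rear-mounted engine. (Imagine the full thousand-word datasheet here.)"
    )

    def build_prompt(user_query: str) -> str:
        # A single string operation: prepend the full document to every query.
        return (
            "Answer the question using the product information below.\n\n"
            f"{PRODUCT_SHEET}\n\n"
            f"Question: {user_query}\nAnswer:"
        )

    print(build_prompt("Which way do the DeLorean's doors open?"))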

Too Expensive?

Both the mini-RAG and mega-RAG ideas involve sending more tokens to your inference provider, with the goal of giving users better answers. However, this also means more bucks on your LLM API bills, or more cost to you if you’re self-hosting. Is it too expensive, and can this be addressed in any way?

Consider, for example, converting an HR chatbot into a mini-RAG architecture by sending the entire HR manual with every query. What’s the word count of your HR manual, including all the articles on the intranet?

There are LLMs that can handle 1M tokens, which is about 750,000 words, which is about 1,500 printed pages on full-size paper (or about eight full-length mystery novels in 6x9 format on Amazon Kindle Unlimited). It’s amazing that LLMs can read that much stuff and still have something useful to say about your question, but here’s the thing: it’s not free.

Sending 1M tokens with every user query might be non-viable. If the cost is $15 per 1M input tokens (note that it’s not output tokens), then that’s $15 for every user question about pay and benefits. Not ideal.

There are a few ways to reduce the cost (a rough cost comparison is sketched after this list), such as:

  • Reduce the word length of the full HR manual used as input (i.e., via manual or automatic context compression)
  • Use “cached tokens” support via “prefix caching” in many commercial APIs, since the prepended document is a common prefix for every query (typically, a 50% discount).
  • Split your big HR chatbot into sub-areas, each with a shorter document, thereby creating multiple smaller chatbots, where the user can self-select which one they need.
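To get a feel for the numbers, here is a back-of-the-envelope cost sketch combining these ideas; the $15 price and the 50% caching discount are the illustrative figures from above, so check your provider’s current rates.

    # Back-of-the-envelope cost per query (prices are assumptions; check your provider).
    INPUT_PRICE_PER_M = 15.00   # dollars per 1M input tokens (the figure used above)
    CACHED_DISCOUNT = 0.50      # typical prefix-caching discount on the cached prefix

    def query_cost(prefix_tokens: int, query_tokens: int, cached: bool) -> float:
        prefix_rate = INPUT_PRICE_PER_M * (CACHED_DISCOUNT if cached else 1.0)
        return (prefix_tokens * prefix_rate + query_tokens * INPUT_PRICE_PER_M) / 1_000_000

    print(query_cost(1_000_000, 200, cached=False))  # ~$15.00: full HR manual, no caching
    print(query_cost(1_000_000, 200, cached=True))   # ~$7.50: same manual with a cached prefix
    print(query_cost(250_000, 200, cached=True))     # ~$1.88: a smaller sub-area manual, cached

Prefix caching alone roughly halves the bill under these assumptions, and splitting the manual into sub-areas multiplies the savings.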

Mega-RAG

The support for long context inference in LLMs, with both better understanding and more efficient inference, means that RAG applications can use much bigger chunks. This is the idea of “mega-RAG” architectures, and there are several ways to take advantage of these new capabilities:

  • Return fewer, much longer RAG chunks.
  • Return whole documents rather than chunks.
  • Use overlapping chunks rather than topic-specific paragraphs.
  • Base your entire RAG application on a single document (i.e., mini-RAG) or a handful of long documents.

If you’re using a small number of large documents, the need for a full vector database becomes questionable. Perhaps a keyword-based search would be adequate (and faster). Other heuristics are also possible to speed things up.
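As a sketch of how simple that could be, here is a naive keyword-scoring retriever over a handful of long documents; the document names, contents, and scoring are illustrative assumptions, and a real system would at least handle stemming and stop words.

    # Naive keyword-scoring retriever over a few long documents (illustrative only).
    from collections import Counter

    DOCS = {
        "pay": "salary payroll bonus overtime pay cycle deductions tax",
        "benefits": "health insurance annual leave parental leave superannuation benefits",
    }

    def retrieve(query: str) -> str:
        terms = query.lower().split()

        def score(text: str) -> int:
            counts = Counter(text.lower().split())
            return sum(counts[t] for t in terms)

        # Return the single best-matching long document (one very large "chunk").
        return max(DOCS, key=lambda name: score(DOCS[name]))

    print(retrieve("How much annual leave do I get?"))  # -> "benefits"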

Another idea is to have the user effectively “self-select” which large document they are interrogating via the query, thereby obviating the need for any search. For example, an internal HR Q&A app could offer an input selector that allows the user to ask a question about either “Pay” or “Benefits” or other subareas. This idea is more of a “multi-mini-RAG” architecture than a single mega-RAG application.

The latency and token cost of using larger RAG chunks is still non-negligible, but it’s declining faster than an ice cube at a monster truck rally. Speed of inference can be further optimized using prefix caching, since each large RAG chunk is a fixed prefix of unchanging text. If the retriever only ever returns one document, prefix caching is always possible. Even when more than one document can be returned, the concern about caching too many distinct orderings of RAG documents is much reduced when there is only a small set of documents. Hence, the latency and cost may be acceptable, and the main advantage of long context RAG is the opportunity to give much more accurate answers to users.
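One practical detail: prefix caching only helps if the long chunks appear as an identical prefix across queries, which argues for assembling them in a deterministic order. Here is a minimal sketch; the chunk IDs and prompt layout are assumptions.

    # Keep retrieved long chunks in a fixed, deterministic order so that the prompt
    # prefix is byte-identical across queries and prefix caching can apply.
    def assemble_prompt(retrieved_ids: list[str], chunk_store: dict[str, str], question: str) -> str:
        # Sort by stable chunk ID, not by relevance score, so repeat queries over the
        # same small document set share the same prefix.
        prefix = "\n\n".join(chunk_store[i] for i in sorted(retrieved_ids))
        return f"{prefix}\n\nQuestion: {question}\nAnswer:"

    chunks = {"hr-benefits": "Benefits policy text...", "hr-pay": "Pay policy text..."}
    print(assemble_prompt(["hr-pay", "hr-benefits"], chunks, "When is payday?"))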

Research on Long Context RAG

There is a lot of research on getting LLMs to run fast on long context inputs, and some of this is related to RAG architectures (i.e., processing big chunks!):

  1. Ziyan Jiang, Xueguang Ma, Wenhu Chen, June 2024, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs, arXiv preprint arXiv:2406.15319, https://arxiv.org/abs/2406.15319 (Improved accuracy performance of RAG methods when using a long context LLM and longer chunk sizes for the retriever.)
  2. Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang, 23 Oct 2024, LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering, https://arxiv.org/abs/2410.18050 https://github.com/QingFei1/LongRAG
  3. Tan Yu, Anbang Xu, Rama Akkiraju, 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666
  4. Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, Ngai Wong, 3 Oct 2024, UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation, https://arxiv.org/abs/2410.02719
  5. Bowen Jin, Jinsung Yoon, Jiawei Han, Sercan O. Arik, 8 Oct 2024, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983
  6. Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343
  7. Contextual AI Team, March 19, 2024, Introducing RAG 2.0, https://contextual.ai/introducing-rag2/
  8. Brian J Chan, Chao-Ting Chen, Jui-Hung Cheng, Hen-Hsen Huang, 20 Dec 2024, Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks, https://arxiv.org/abs/2412.15605 (Mini-RAG architecture preloading the entire knowledge into the LLM context and then using KV caching.)
  9. Xinze Li, Yixin Cao, Yubo Ma, Aixin Sun, 27 Dec 2024, Long Context vs. RAG for LLMs: An Evaluation and Revisits, https://arxiv.org/abs/2501.01880 (Long context, summarization-based RAG, and classic chunked RAG have different strengths and weaknesses for different types of query.)
  10. Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu, 15 Jan 2025, MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents, https://arxiv.org/abs/2501.08828
  11. Salvatore Raieli, Jan 2025, Do Not Flip a Coin: When to Use RAG or Long Context LLMs, Understanding the Trade-offs and Best Practices for Optimizing LLMs with External Knowledge Sources, https://levelup.gitconnected.com/do-not-flip-a-coin-when-to-use-rag-or-long-context-llms-6f51a39de98c (Analysis of several papers that compare LC to RAG)
  12. Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 16 May 2024 (v3), FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065
  13. Isuru Lakshan Ekanayaka, Jan 2025, Retrieval-Augmented Generation (RAG) vs. Cache-Augmented Generation (CAG): A Deep Dive into Faster, Smarter Knowledge Integration, https://pub.towardsai.net/retrieval-augmented-generation-rag-vs-0b4bc63c1653
  14. Dr. Ashish Bamania, Jan 10, 2025, Cache-Augmented Generation (CAG) Is Here To Replace RAG: A deep dive into how a novel technique called Cache-Augmented Generation (CAG) works and reduces/eliminates the need for Retrieval-Augmented Generation (RAG), https://levelup.gitconnected.com/cache-augmented-generation-cag-is-here-to-replace-rag-3d25c52360b2
  15. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela, 12 Apr 2021 (v4), Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, https://arxiv.org/abs/2005.11401
  16. Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, Yiqun Liu, 27 Jan 2025, Parametric Retrieval Augmented Generation, https://arxiv.org/abs/2501.15915 https://github.com/oneal2000/prag (Parametric RAG (PRAG) is training the RAG documents into model parameters, rather than prepending documents using long context RAG, and this means a shorter inference token length.)
  17. Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang, 3 Feb 2025, VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos, https://arxiv.org/abs/2502.01549 https://github.com/HKUDS/VideoRAG
  18. Cristian Leo, Feb 2025, Don’t Do RAG: Cache is the future: CAG or RAG? Let’s explore Cached Augmented Generation, its math, and trade-offs. Let’s dig into its research paper to see what it excels at, and how you could leverage it, https://levelup.gitconnected.com/dont-do-rag-cache-is-the-future-d1e995f0c76f
  19. Manpreet Singh, Feb 2025, Goodbye RAG? Gemini 2.0 Flash Have Just Killed It! https://ai.gopubby.com/goodbye-rag-gemini-2-0-flash-have-just-killed-it-96301113c01f
  20. Kun Luo, Zheng Liu, Peitian Zhang, Hongjin Qian, Jun Zhao, Kang Liu, 17 Feb 2025, Does RAG Really Perform Bad For Long-Context Processing? https://arxiv.org/abs/2502.11444 (Long context RAG processing based on the KV cache data is similar to fused/substring KV caching methods.)
  21. Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
  22. Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh, 27 Feb 2025, Long-Context Inference with Retrieval-Augmented Speculative Decoding, https://arxiv.org/abs/2502.20330
  23. Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou, 15 May 2025, Hierarchical Document Refinement for Long-context Retrieval-augmented Generation, https://arxiv.org/abs/2505.10413 https://github.com/ignorejjj/LongRefiner

Research on Mini-RAG

Research papers are starting to appear on making use of long context RAG by cramming all of the information into one LLM context, avoiding a retrieval database lookup entirely. Papers include:

  1. Brian J Chan, Chao-Ting Chen, Jui-Hung Cheng, Hen-Hsen Huang, 20 Dec 2024, Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks, https://arxiv.org/abs/2412.15605 (Mini-RAG architecture preloading the entire knowledge into the LLM context and using pre-computed prefix KV caching for efficiency.)
  2. Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky, 17 Oct 2024 (v2), Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach, https://arxiv.org/abs/2407.16833
  3. Jérôme DIAZ, Dec 2024, Why Retrieval-Augmented Generation Is Still Relevant in the Era of Long-Context Language Models, In this article we will explore why 128K tokens (and more) models can’t fully replace using RAG. https://towardsdatascience.com/why-retrieval-augmented-generation-is-still-relevant-in-the-era-of-long-context-language-models-e36f509abac5
  4. Tan Yu, Anbang Xu, Rama Akkiraju, 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666
  5. Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 20 Nov 2023 (v3), Lost in the Middle: How Language Models Use Long Contexts, https://arxiv.org/abs/2307.03172 (Information is best placed at the start, or otherwise at the end, of a long context.)
  6. Xinze Li, Yixin Cao, Yubo Ma, Aixin Sun, 27 Dec 2024, Long Context vs. RAG for LLMs: An Evaluation and Revisits, https://arxiv.org/abs/2501.01880 (Long context, summarization-based RAG, and classic chunked RAG have different strengths and weaknesses for different types of query.)
  7. Tianyu Fan, Jingyuan Wang, Xubin Ren, Chao Huang, 14 Jan 2025 (v2), MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation, https://arxiv.org/abs/2501.06713 https://github.com/HKUDS/MiniRAG (Uses the name “mini RAG” but is about knowledge graphs not long context RAG.)
  8. Isuru Lakshan Ekanayaka, Jan 2025, Retrieval-Augmented Generation (RAG) vs. Cache-Augmented Generation (CAG): A Deep Dive into Faster, Smarter Knowledge Integration, https://pub.towardsai.net/retrieval-augmented-generation-rag-vs-0b4bc63c1653
  9. Dr. Ashish Bamania, Jan 10, 2025, Cache-Augmented Generation (CAG) Is Here To Replace RAG: A deep dive into how a novel technique called Cache-Augmented Generation (CAG) works and reduces/eliminates the need for Retrieval-Augmented Generation (RAG), https://levelup.gitconnected.com/cache-augmented-generation-cag-is-here-to-replace-rag-3d25c52360b2
  10. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela, 12 Apr 2021 (v4), Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, https://arxiv.org/abs/2005.11401
  11. Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, Yiqun Liu, 27 Jan 2025, Parametric Retrieval Augmented Generation, https://arxiv.org/abs/2501.15915 https://github.com/oneal2000/prag (Parametric RAG (PRAG) is training the RAG documents into model parameters, rather than prepending documents using long context RAG, and this means a shorter inference token length.)
  12. Cristian Leo, Feb 2025, Don’t Do RAG: Cache is the future: CAG or RAG? Let’s explore Cached Augmented Generation, its math, and trade-offs. Let’s dig into its research paper to see what it excels at, and how you could leverage it, https://levelup.gitconnected.com/dont-do-rag-cache-is-the-future-d1e995f0c76f
  13. Manpreet Singh, Feb 2025, Goodbye RAG? Gemini 2.0 Flash Have Just Killed It! https://ai.gopubby.com/goodbye-rag-gemini-2-0-flash-have-just-killed-it-96301113c01f
  14. Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129

 
