Aussie AI
Fused KV Caching
Last Updated 19 September, 2025
by David Spuler, Ph.D.
What is Fused KV Caching?
Fused KV caching means merging two KV caches along the lengthwise token dimension. In other words, the KV cache for two adjacent pieces of text can be created by simply concatenating their two precomputed caches end-to-end. Surprisingly, this mostly just works, but it has accuracy issues that have needed further research.
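As a rough illustration of the core idea (a minimal sketch with assumed cache shapes and hypothetical function names, not code from any of the papers cited below), concatenating two per-layer KV caches along the sequence-length axis looks like this:

```python
import torch

def fuse_kv_caches(cache_a, cache_b):
    """Concatenate two per-layer KV caches along the token (sequence) dimension.

    Each cache is a list of (keys, values) pairs, one per layer, where keys and
    values have shape [batch, num_heads, seq_len, head_dim]. The fused cache
    simply places cache_b's tokens after cache_a's tokens.
    """
    fused = []
    for (k_a, v_a), (k_b, v_b) in zip(cache_a, cache_b):
        fused.append((torch.cat([k_a, k_b], dim=2),   # concatenate keys along seq_len
                      torch.cat([v_a, v_b], dim=2)))  # concatenate values along seq_len
    return fused

# Toy example: two cached text segments of 5 and 7 tokens (2 layers, 4 heads, head_dim 64).
layers, batch, heads, dim = 2, 1, 4, 64
cache_a = [(torch.randn(batch, heads, 5, dim), torch.randn(batch, heads, 5, dim))
           for _ in range(layers)]
cache_b = [(torch.randn(batch, heads, 7, dim), torch.randn(batch, heads, 7, dim))
           for _ in range(layers)]
fused = fuse_kv_caches(cache_a, cache_b)
print(fused[0][0].shape)  # torch.Size([1, 4, 12, 64])
```

The naive concatenation glosses over the two accuracy issues examined in the research below: the second segment's keys were encoded with position indices starting from zero rather than after the first segment, and its cached entries contain no attention to the first segment's tokens.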
The nomenclature in this subarea of LLM optimization research is not yet settled, and the papers have used various different names for this technique:
- Fused KV caching
- Substring KV caching
- Concatenated KV caching
- Position-independent caching (PIC)
This has been an emerging area of research as a generalization of "prefix KV caching." The idea is to handle the situation where the cached text is not a prefix of the new prompt, so prefix caching does not directly apply. How do we combine two KV caches into one?
One particular situation where this commonly occurs is RAG chunks. Ideally, we would precompute the KV cache for every RAG chunk in the datastore and store it alongside the chunk's text, then at query time simply combine the precomputed caches in whatever order the reranker puts the chunks. That speedup, available to every user query, is the promise of this research.
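A minimal sketch of that workflow, assuming a hypothetical per-chunk cache store (class and method names are illustrative, not from any particular paper):

```python
import torch

class ChunkKVStore:
    """Keeps a precomputed KV cache alongside each RAG chunk's text."""

    def __init__(self):
        self._store = {}  # chunk_id -> (text, kv_cache)

    def add_chunk(self, chunk_id, text, kv_cache):
        # kv_cache: list of per-layer (keys, values), each [batch, heads, tokens, head_dim]
        self._store[chunk_id] = (text, kv_cache)

    def assemble(self, ranked_chunk_ids):
        """Concatenate the precomputed caches in the order chosen by the reranker."""
        fused = None
        for chunk_id in ranked_chunk_ids:
            _, kv = self._store[chunk_id]
            if fused is None:
                fused = [(k.clone(), v.clone()) for k, v in kv]
            else:
                fused = [(torch.cat([fk, k], dim=2), torch.cat([fv, v], dim=2))
                         for (fk, fv), (k, v) in zip(fused, kv)]
        return fused
```

The assembled cache still needs the accuracy repairs studied in the papers below (positional re-encoding, selective recomputation, extra training, or modified attention) before it matches recomputing the combined prompt from scratch.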
This area of research, whatever you want to call it, just got a big boost from Meta research labs; see the "REFRAG" project by Lin et al. (2025). They didn't quite do the merging of two KV caches in the same way, but instead modified the attention algorithm to treat the chunks of text differently. The end result is very similar to precomputing and concatenating KV caches.
Research on Fused KV Caching
Research papers include:
- Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444 Code: https://github.com/YaoJiayi/CacheBlend.git (Generalizes prefix KV caching to KV cache fusion with selective recomputation of some KV cache data.)
- In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong, Nov 2023, Prompt Cache: Modular Attention Reuse for Low-Latency Inference, https://arxiv.org/abs/2311.04934 (Unique and insightful advance of generalizing KV caching to multiple prompts by computing a cache for short "segments" of prompts, including methods to adjust the different KV cache values for text segments that appear in different positions of the overall prompt.)
- Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457 (This paper briefly considers merging KV caches of multiple RAG chunks, but instead focuses on (a) caching of two or more chunks in one KV cache record, and (b) reordering the chunks in a cache-aware manner.)
- Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu, 11 Jun 2024, Effectively Compress KV Heads for LLM, https://arxiv.org/abs/2406.07056 (Examines KV cache head merging approaches for KV cache size reduction, and also examines RoPE encoding issues with relevance to fusing KV caches.)
- Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
- Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Streaming inference with selective retention of KV cache entries over infinite-length inputs.)
- David Spuler, September 26, 2024, RAG Optimization via Caching, Aussie AI Blog, https://www.aussieai.com/blog/rag-optimization-caching
- Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang, 10 Oct 2024, TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text, https://arxiv.org/abs/2410.07590 (Fusing precomputed KV caches for each RAG chunk.)
- Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie, 20 Oct 2024, EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models, https://arxiv.org/abs/2410.15332
- David Spuler, October 24, 2024, Generalizing Prefix KV Caching to RAG Chunks, Aussie AI Blog, https://www.aussieai.com/blog/prefix-kv-rag
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 14 Nov 2024, Squeezed Attention: Accelerating Long Context Length LLM Inference, https://arxiv.org/abs/2411.09688 https://github.com/SqueezeAILab/SqueezedAttention (This is like a combination of semantic caching and prefix KV caching, and close to fused KV caching.)
- East Sun, Yan Wang, Lan Tian, 17 Oct 2024 (v4), Block-Attention for Efficient RAG, https://arxiv.org/abs/2409.15355
- Zhisong Zhang, Yan Wang, Xinting Huang, Tianqing Fang, Hongming Zhang, Chenlong Deng, Shuaiyi Li, Dong Yu, 21 Dec 2024, Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models, https://arxiv.org/abs/2412.16545 (Parallel encoding of chunks of context is similar to fused KV caching.)
- Philhoon Oh, Jinwoo Shin, James Thorne, 13 Jan 2025, Parallel Key-Value Cache Fusion for Position Invariant RAG, https://arxiv.org/abs/2501.07523 (Generating the KV cache for each RAG chunk.)
- Longze Chen, Jan 2025 (accessed), Awesome-KV-Cache-Compression: Must-read papers on KV Cache Compression (constantly updating), https://github.com/October2001/Awesome-KV-Cache-Compression (KV cache reuse across multiple prompts via SwiftKV, sounds similar to prefix KV caching or fused KV caching, and also SingleInputKV does KV cache layer fusion in a single prompt.)
- Shiju Zhao, Junhao Hu, Rongxiao Huang, Jiaqi Zheng, Guihai Chen, 4 Feb 2025, MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving, https://arxiv.org/abs/2502.01960
- Kun Luo, Zheng Liu, Peitian Zhang, Hongjin Qian, Jun Zhao, Kang Liu, 17 Feb 2025, Does RAG Really Perform Bad For Long-Context Processing? https://arxiv.org/abs/2502.11444 (Long context RAG processing based on the KV cache data is similar to fused/substring KV caching methods.)
- S Agarwal, S Sundaresan, S Mitra, D Mahapatra, Feb 2025, Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation, https://skejriwal44.github.io/docs/CacheCraft_SIGMOD_2025.pdf (Managing pre-computed KV caches for RAG chunks as a generalization of prefix KV caching, addressing limitations in their position and ordering.)
- Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang, 21 Feb 2025, KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse, https://arxiv.org/abs/2502.16002 https://github.com/UCSB-NLP-Chang/KVLink (Computing a KV cache for each RAG chunk, and using techniques to fuse/merge/concatenate these KV caches, i.e., fused KV caching as a generalization of prefix KV caching, while restoring cross-chunk attention accuracy via three techniques: positional re-encoding (see the sketch after this list), "link tokens" between chunks processed during inference, and fine-tuning.)
- Shai Bergman, Zhang Ji, Anne-Marie Kermarrec, Diana Petrescu, Rafael Pires, Mathis Randl, Martijn de Vos, 7 Mar 2025, Leveraging Approximate Caching for Faster Retrieval-Augmented Generation, https://arxiv.org/abs/2503.05530
- J Hu, W Huang, W Wang, H Wang, H Feng, X Chen, 2025, EPIC: Efficient Position-Independent Caching for Serving Large Language Models, Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025, https://openreview.net/pdf?id=qjd3ZUiHRT
- Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan, 1 Sep 2025, REFRAG: Rethinking RAG based Decoding, https://www.arxiv.org/abs/2509.01092 https://www.alphaxiv.org/pdf/2509.01092 (Separates the attention computations across RAG chunks, which is effectively the same as "fused KV" or "concatenated KV" approaches with pre-computed per-chunk KV caches.)
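One recurring technical fix in these papers is positional re-encoding of cached keys. Because RoPE rotates each key by an angle proportional to its absolute position, a cached chunk placed at a new offset can have its keys re-rotated by the position delta instead of being recomputed. A rough sketch of that adjustment (the function name and shapes are assumptions, and it presumes the interleaved even/odd RoPE pairing; some implementations pair dimension i with i + head_dim/2):

```python
import torch

def rope_shift_keys(keys, old_start, new_start, base=10000.0):
    """Re-rotate RoPE-encoded cached keys when their segment moves to a new offset.

    keys: [batch, heads, seq_len, head_dim], rotated as if the segment began at
    position old_start. Returns keys rotated as if it began at new_start, by
    applying an extra rotation of (new_start - old_start) * theta_i to each
    (even, odd) dimension pair. Rotations compose additively, so this matches
    re-encoding the keys at their new positions.
    """
    dim = keys.shape[-1]
    delta = float(new_start - old_start)
    # Standard RoPE frequencies: theta_i = base^(-2i/d) for each dimension pair.
    theta = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angle = delta * theta
    cos, sin = torch.cos(angle), torch.sin(angle)
    k_even, k_odd = keys[..., 0::2], keys[..., 1::2]
    rotated = torch.empty_like(keys)
    rotated[..., 0::2] = k_even * cos - k_odd * sin
    rotated[..., 1::2] = k_even * sin + k_odd * cos
    return rotated

# Example: a chunk cached at offset 0 is placed after a 5-token chunk.
keys = torch.randn(1, 4, 7, 64)
shifted = rope_shift_keys(keys, old_start=0, new_start=5)
```

This only corrects the position encoding; the cached entries still contain no cross-attention to tokens in earlier chunks, which is why papers such as CacheBlend and KVLink also selectively recompute or repair part of the fused cache.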
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI (new book on AI intelligence theory). Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications (new book on RAG architectures). Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home