Aussie AI
Context Compression
Last Updated 5 April, 2026
by David Spuler, Ph.D.
What is Context Compression?
Context compression is an LLM inference optimization that reduces the number of tokens the model must process in the context of a query. It is a form of "prompt compression" that draws on techniques such as token pruning and token merging.
Related research areas include token pruning, token merging, KV cache compression, and long-context processing.
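To make the idea concrete, here is a minimal Python sketch of the token-pruning flavor of context compression: score each context token's importance and keep only the top fraction. The scoring here is a placeholder assumption; real systems derive importance from attention statistics, perplexity deltas, or a trained compressor, as in several of the papers below.

```python
def compress_context(tokens: list[str], scores: list[float],
                     keep_ratio: float = 0.5) -> list[str]:
    """Keep the highest-scoring fraction of context tokens, preserving order.

    A minimal sketch only: `scores` is a hypothetical input; production
    systems compute importance from attention weights or a learned scorer.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k most important tokens.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = set(top)
    # Emit surviving tokens in their original order.
    return [tok for i, tok in enumerate(tokens) if i in keep]

if __name__ == "__main__":
    tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
    # Hypothetical importance scores for illustration only.
    scores = [0.1, 0.8, 0.7, 0.9, 0.6, 0.2, 0.1, 0.5, 0.9]
    print(compress_context(tokens, scores, keep_ratio=0.5))
```

Token merging works analogously, except that adjacent low-importance tokens are combined into a single representative embedding rather than dropped outright.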
Research on Context Compression
Research papers include:
- S Ren, Q Jia, KQ Zhu, Oct 2023, Context Compression for Auto-regressive Transformers with Sentinel Tokens, arXiv preprint arXiv:2310.08152, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
- Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
- Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, Apr 2023, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu, 5 Dec 2024, NVILA: Efficient Frontier Visual Language Models, https://arxiv.org/abs/2412.04468
- Tianqiao Liu, Zui Chen, Zitao Liu, Mi Tian, Weiqi Luo, 13 Sep 2024, Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding, https://arxiv.org/abs/2409.08561 (Compressing the interim token sequences in Chain-of-Thought.)
- Jeffrey Cheng, Benjamin Van Durme, 17 Dec 2024, Compressed Chain of Thought: Efficient Reasoning Through Dense Representations, https://arxiv.org/abs/2412.13171 (Context compression applied to interim CoT reasoning steps.)
- Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou, 23 Dec 2024, A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression, https://arxiv.org/abs/2412.17483
- Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou, 11 Oct 2024 (v2), Compressing Lengthy Context With UltraGist, https://arxiv.org/abs/2405.16635 https://github.com/namespace-Pt/UltraGist
- J Köpke, A Safan, 2024, Efficient LLM-based Conversational Process Modeling, Business Process Management Workshops, https://isys.uni-klu.ac.at/PDF/BPM_2024_paper_1442.pdf (Examines and improves the token costs of prompt strategies in conversational sessions.)
- H Liao, S He, Y Xu, Y Zhang, S Liu, K Liu, J Zhao, Jan 2025, Awakening Augmented Generation: Learning to Awaken Internal Knowledge of Large Language Models for Question Answering, Proceedings of the 31st International Conference on Computational Linguistics, pages 1333–1352, January 19–24, 2025, https://aclanthology.org/2025.coling-main.89.pdf https://github.com/Xnhyacinth/IAG (Attempts to perform RALM based only on parametric knowledge, without any external sources, thereby optimizing away RAG steps.)
- Jeonghun Cho, Gary Geunbae Lee, 23 Jan 2025, K-COMP: Retrieval-Augmented Medical Domain Question Answering With Knowledge-Injected Compressor, https://arxiv.org/abs/2501.13567
- Nadezhda Chirkova, Thibault Formal, Vassilina Nikoulina, Stéphane Clinchant, 27 Jan 2025, Provence: efficient and robust context pruning for retrieval-augmented generation, https://arxiv.org/abs/2501.16214
- Maxime Louis, Hervé Déjean, Stéphane Clinchant, 7 Jan 2025, PISCO: Pretty Simple Compression for Retrieval-Augmented Generation, https://arxiv.org/abs/2501.16075
- Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong, 20 Feb 2025, ParallelComp: Parallel Long-Context Compressor for Length Extrapolation, https://arxiv.org/abs/2502.14317
- Chau Minh Pham, Yapei Chang, Mohit Iyyer, 20 Feb 2025, CLIPPER: Compression enables long-context synthetic data generation, https://arxiv.org/abs/2502.14854
- Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang, 21 Feb 2025, LightThinker: Thinking Step-by-Step Compression, https://arxiv.org/abs/2502.15589 https://github.com/zjunlp/LightThinker (Faster CoT by compressing the text of intermediate reasoning steps with gist tokens.)
- Zhanghao Hu, Hanqi Yan, Qingling Zhu, Zhenyi Shen, Yulan He, Lin Gui, 3 Mar 2025, Beyond Prompting: An Efficient Embedding Framework for Open-Domain Question Answering, https://arxiv.org/abs/2503.01606
- Michael List, July 2025, Distillation for Efficient History Compression in Reinforcement Learning, Master’s Thesis, https://epub.jku.at/obvulihs/content/titleinfo/12295461/full.pdf
- Huanxuan Liao, Wen Hu, Yao Xu, Shizhu He, Jun Zhao, Kang Liu, 21 May 2025, Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention, https://arxiv.org/abs/2505.15774
- Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim, 20 May 2025, Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning, https://arxiv.org/abs/2505.13866 https://github.com/jiwonsong-dev/ReasoningPathCompression
- Shuyu Guo, Zhaochun Ren, 24 Jul 2025, Enhancing RAG Efficiency with Adaptive Context Compression, https://arxiv.org/abs/2507.22931
- Manlai Liang, Mandi Liu, Jiangzhou Ji, Huaijun Li, Haobo Yang, Yaohan He, Jinlong Li, 25 Aug 2025, ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models, https://arxiv.org/abs/2508.17892
- Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov, 13 Aug 2025, RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression, https://arxiv.org/abs/2502.14051
- Chuanliu Fan, Zicheng Ma, Jun Gao, Nan Yu, Jun Zhang, Ziqiang Cao, Yi Qin Gao, Guohong Fu, 17 Aug 2025, ProtTeX-CC: Activating In-Context Learning in Protein LLM via Two-Stage Instruction Compression, https://arxiv.org/abs/2508.12212
- Guillermo Sarasa Durán, Ana Granados Fontecha, Francisco de Borja Rodríguez Ortíz, 20 Aug 2025, Context Steering: A New Paradigm for Compression-based Embeddings by Synthesizing Relevant Information Features, https://arxiv.org/abs/2508.14780
- Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan, 1 Oct 2025, ACON: Optimizing Context Compression for Long-horizon LLM Agents, https://arxiv.org/abs/2510.00615
- Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim, 28 Oct 2025, FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation, https://arxiv.org/abs/2502.01068
- Yair Feldman, Yoav Artzi, 23 Oct 2025, Simple Context Compression: Mean-Pooling and Multi-Ratio Training, https://arxiv.org/abs/2510.20797
- Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang, 20 Oct 2025, Glyph: Scaling Context Windows via Visual-Text Compression, https://arxiv.org/abs/2510.17800
- Billy Dickson, Zoran Tiganj, 25 Oct 2025, Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows, https://arxiv.org/abs/2510.22109
- Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen, 25 Sep 2025, OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule, https://arxiv.org/abs/2509.21623
- Xin Liu, Xudong Wang, Pei Liu, Guoming Tang, 5 Oct 2025, ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs, https://arxiv.org/abs/2503.10714
- Jiachen Jiang and Yuxin Dong and Jinxin Zhou and Zhihui Zhu, 4 Oct 2025, From Compression to Expression: A Layerwise Analysis of In-Context Learning, https://arxiv.org/abs/2505.17322
- Yihang Wang, Xu Huang, Bowen Tian, Yueyang Su, Lei Yu, Huaming Liao, Yixing Fan, Jiafeng Guo, Xueqi Cheng, 13 Oct 2025, QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory, https://arxiv.org/abs/2408.10497
- Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song, 30 Sep 2025, KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction, https://arxiv.org/abs/2505.23416
- Yun-Hao Cao, Yangsong Wang, Shuzheng Hao, Zhenxing Li, Chengjun Zhan, Sichao Liu, Yi-Qi Hu, 11 Mar 2025, EFPC: Towards Efficient and Flexible Prompt Compression, https://arxiv.org/abs/2503.07956
- Morph, March 27, 2026, LLM Inference Optimization: A Practical Guide to Cutting Cost and Latency (2026): Concrete techniques for optimizing LLM inference across model, system, and application layers. Quantization, KV cache compression, continuous batching, speculative decoding, and context compaction with real benchmarks, https://www.morphllm.com/llm-inference-optimization
- Sebastian Raschka, PhD, Apr 04, 2026, Components of A Coding Agent: How coding agents use tools, memory, and repo context to make LLMs work better in practice, https://magazine.sebastianraschka.com/p/components-of-a-coding-agent
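Several of the papers above (e.g., RocketKV, FastKV, ZSMerge, KVzip, OjaKV) compress the KV cache rather than the prompt text. Below is a minimal sketch of the simplest version of that idea, assuming per-token attention statistics are available at eviction time; the eviction policy and tensor shapes are illustrative assumptions, not any specific paper's method.

```python
import numpy as np

def evict_kv_cache(keys, values, attn_weights, keep_ratio=0.5):
    """Drop the KV cache entries that received the least attention.

    keys, values:  (seq_len, head_dim) cached tensors for one attention head.
    attn_weights:  (num_queries, seq_len) attention probabilities from recent
                   decoding steps (an assumed input; real systems track these
                   statistics during inference).
    """
    seq_len = keys.shape[0]
    k = max(1, int(seq_len * keep_ratio))
    importance = attn_weights.sum(axis=0)        # total attention per cached token
    keep = np.sort(np.argsort(importance)[-k:])  # top-k positions, original order
    return keys[keep], values[keep]

# Toy example with random tensors standing in for a real cache.
rng = np.random.default_rng(0)
seq_len, head_dim, num_queries = 8, 4, 3
keys = rng.standard_normal((seq_len, head_dim))
values = rng.standard_normal((seq_len, head_dim))
attn = rng.random((num_queries, seq_len))
attn /= attn.sum(axis=1, keepdims=True)          # normalize rows to probabilities
small_k, small_v = evict_kv_cache(keys, values, attn, keep_ratio=0.5)
print(small_k.shape, small_v.shape)              # (4, 4) (4, 4)
```

Real systems layer further refinements on top of this, such as two-stage selection (RocketKV), token-selective propagation (FastKV), or query-agnostic scoring with context reconstruction (KVzip).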
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory: Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures: Get your copy from Amazon: RAG Optimization
- Generative AI Applications book: Get your copy from Amazon: Generative AI Applications
- Generative AI programming book: Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book: Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book: Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home