Aussie AI
Prompt Compression
-
Last Updated 5 April, 2026
-
by David Spuler, Ph.D.
What is Prompt Compression?
Prompt compression is the automatic shortening of LLM prompts so that they contain fewer tokens and can be processed more efficiently. It isn't applied to short user queries, but to the longer parts of a prompt, such as the entire conversational history, retrieved documents, or other types of "context" for the prompt. For this reason, the technique is often called "context compression" or "token reduction".
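To make the idea concrete, here is a minimal extractive sketch in plain Python. It scores each word by a crude frequency-based "self-information" proxy and keeps only the highest-scoring fraction; the `compress_prompt` function, its `keep_ratio` parameter, and the scoring heuristic are illustrative assumptions, not any particular published method. Real systems such as LLMLingua instead score tokens with a small LLM's perplexity before dropping the low-information ones.

```python
import math
import re
from collections import Counter

def compress_prompt(text: str, keep_ratio: float = 0.5) -> str:
    """Crude extractive prompt compression (illustrative only).

    Scores each word by -log(relative frequency) so that rarer words,
    which carry more information, rank higher, then keeps the top
    keep_ratio fraction of words in their original order.
    """
    words = re.findall(r"\S+", text)
    if not words:
        return text
    # Normalize for counting: lowercase and strip trailing punctuation.
    norm = lambda w: w.lower().strip(".,!?;:")
    freq = Counter(norm(w) for w in words)
    total = sum(freq.values())
    score = lambda w: -math.log(freq[norm(w)] / total)
    # Indices of the k highest-scoring words (stable sort keeps ties in order).
    k = max(1, int(len(words) * keep_ratio))
    keep = set(sorted(range(len(words)),
                      key=lambda i: score(words[i]),
                      reverse=True)[:k])
    # Emit the surviving words in their original order.
    return " ".join(w for i, w in enumerate(words) if i in keep)

print(compress_prompt("the the the cat cat sat on the mat quickly"))
# Drops the high-frequency filler ("the") and keeps the rarer words.
```

The same kept-word skeleton generalizes: swap the frequency proxy for a language model's per-token log-probability and this becomes the self-information filtering approach used in the research below.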
Research on Prompt Compression
Research papers include:
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, Oct 2023, LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression https://arxiv.org/abs/2310.06839 Code: https://aka.ms/LLMLingua
- Yucheng Li, 24 Apr 2023, Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering. https://arxiv.org/abs/2304.12102
- Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen, 4 Nov 2023, Adapting Language Models to Compress Contexts (AutoCompressors method) https://arxiv.org/abs/2305.14788 Code: https://github.com/princeton-nlp/AutoCompressors
- S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
- Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
- Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Apr 2023, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
- Alessandro Baiocchi, Indro Spinelli, Alessandro Nicolosi, Simone Scardapane, 26 Jan 2024, Adaptive Point Transformer, https://arxiv.org/abs/2401.14845
- Can Wang, Bolin Zhang, Dianbo Sui, Zhiying Tu, Xiaoyu Liu, Jiabao Kang, 1 Mar 2024 (v2), A Survey on Effective Invocation Methods of Massive LLM Services, https://arxiv.org/abs/2402.03408 (Deployment of LLMs as LLM-as-a-Service or LLMaaS architectures including prompt compression, semantic caching and model selection based on scoring inputs.)
- Yichen Jiang, Marco Del Vecchio, Mohit Bansal, Anders Johannsen, March 17-22, 2024, Hierarchical and Dynamic Prompt Compression for Efficient Zero-shot API Usage, Findings of the Association for Computational Linguistics: EACL 2024, pages 2162–2174, https://aclanthology.org/2024.findings-eacl.143.pdf
- Amrit Nagarajan, Anand Raghunathan, Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks, PDF: https://www.researchgate.net/profile/Amrit_Nagarajan/publication/375911248_Input_Compression_with_Positional_Consistency_for_Efficient_Training_and_Inference_of_Transformer_Neural_Networks/links/656212a63fa26f66f4281d32/Input-Compression-with-Positional-Consistency-for-Efficient-Training-and-Inference-of-Transformer-Neural-Networks.pdf Code: https://github.com/amrnag/ICPC
- Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song, Dec 2023, Compressed Context Memory For Online Language Model Interaction, https://arxiv.org/abs/2312.03414 Code: https://github.com/snu-mllab/context-memory
- Jungmin Yun, Mihyeon Kim, Youngbin Kim, 2023, Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification https://aclanthology.org/2023.findings-emnlp.909.pdf
- Maged S. Al-Shaibani and Irfan Ahmad, 2023, Consonant is all you need: a compact representation of English text for efficient NLP, https://aclanthology.org/2023.findings-emnlp.775.pdf (Tokens in a text are reduced by using a consonant-only representation, removing vowels, which is effectively a type of prompt compression.)
- Jinyu Chen, Wenchao Xu, Zicong Hong, Song Guo, Haozhao Wang, Jie Zhang, Deze Zeng, 10 Jan 2024, OTAS: An Elastic Transformer Serving System via Token Adaptation, https://arxiv.org/abs/2401.05031
- Iulia Brezeanu, Jan 5, 2024, How to Cut RAG Costs by 80% Using Prompt Compression, Towards Data Science, https://towardsdatascience.com/how-to-cut-rag-costs-by-80-using-prompt-compression-877a07c6bedb
- T. Ge, J. Hu, X. Wang, S.-Q. Chen, and F. Wei, “In-context autoencoder for context compression in a large language model,” arXiv preprint arXiv:2307.06945, 2023. https://arxiv.org/abs/2307.06945
- J Mu, XL Li, N Goodman, 2023, Learning to compress prompts with gist tokens, https://arxiv.org/abs/2304.08467
- Quandong Wang, Yuxuan Yuan, Xiaoyu Yang, Ruike Zhang, Kang Zhao, Wei Liu, Jian Luan, Daniel Povey, Bin Wang, 17 Jun 2024 (v2), SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM, https://arxiv.org/abs/2406.06571
- Cangqing Wang, Yutian Yang, Ruisi Li, Dan Sun, Ruicong Cai, Yuzhu Zhang, Chengqian Fu, Lillian Floyd, 18 Apr 2024 (v2), Adapting LLMs for Efficient Context Processing through Soft Prompt Compression, https://arxiv.org/abs/2404.04997
- Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 1 Jul 2024, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219 Project: https://github.com/FudanDNN-NLP/RAG (Attempts to optimize the entire RAG system, including the various options for different RAG modules in the RAG pipeline, such as optimal methods for chunking, retrieval, embedding models, vector databases, prompt compression, reranking, repacking, summarizers, and other components.)
- Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami, 11 Jul 2024, Characterizing Prompt Compression Methods for Long Context Inference, https://arxiv.org/abs/2407.08892
- Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
- David Gu, July 18, 2024, Text Compression for Efficient Language Generation, Master’s Thesis, Distributed Computing Group, Computer Engineering and Networks Laboratory, ETH Zürich, https://pub.tik.ee.ethz.ch/students/2023-HS/MA-2023-19.pdf (Training and inference at the sentence level, including caching of embeddings per sentence, which also has the side-effect of compressing the input prompts and reducing computation analogously to token pruning.)
- James Groeneveld, Aug 1, 2024, Prompt Design at Character.AI, Character.AI blog, https://research.character.ai/prompt-design-at-character-ai/
- Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 8 Jul 2024 (v2), Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965
- Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
- Wei Chen, Zhiyuan Li, Shuo Xin, Yihao Wang, 28 Aug 2024, Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models, https://arxiv.org/abs/2408.15518 https://huggingface.co/NexaAIDev/Dolphin (Using vision transformer architecture to process longer text.)
- Bosheng Qin, Juncheng Li, Siliang Tang, Yueting Zhuang, 24 Nov 2022, DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention, https://arxiv.org/abs/2211.16368
- Barys Liskavets, Maxim Ushakov, Shuvendu Roy, Mark Klibanov, Ali Etemad, Shane Luke, 2 Sep 2024, Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference, https://arxiv.org/abs/2409.01227
- Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty, 25 Sep 2024, Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction, https://arxiv.org/abs/2409.17422 https://github.com/SalesforceAIResearch/GemFilter (Use the early layers of a model to choose the most relevant tokens, similar to early exiting, and then compress the input token sequences based on the importance of these tokens. Notably, this reduces latency and also increases accuracy on long contexts.)
- Tsz Ting Chung, Leyang Cui, Lemao Liu, Xinting Huang, Shuming Shi, Dit-Yan Yeung, 15 Oct 2024, Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability, https://arxiv.org/abs/2410.11786
- Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier, 17 Oct 2024 (v2), Prompt Compression for Large Language Models: A Survey, https://arxiv.org/abs/2410.12388
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu, 5 Dec 2024, NVILA: Efficient Frontier Visual Language Models, https://arxiv.org/abs/2412.04468
- Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang, 17 Dec 2024 (v2), SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator, https://arxiv.org/abs/2412.12094 http://sepllm.github.io/
- Tianqiao Liu, Zui Chen, Zitao Liu, Mi Tian, Weiqi Luo, 13 Sep 2024, Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding, https://arxiv.org/abs/2409.08561 (Compressing the interim token sequences in Chain-of-Thought.)
- Yu Kang, Xianghui Sun, Liangyu Chen, Wei Zou, 16 Dec 2024, C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, https://arxiv.org/abs/2412.11664 (Token pruning and prompt compression for Chain-of-Thought.)
- Tyler McDonald, Anthony Colosimo, Yifeng Li, Ali Emami, 2 Dec 2024, Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index, https://arxiv.org/abs/2412.01690
- Xinze Li, Zhenghao Liu, Chenyan Xiong, Shi Yu, Yukun Yan, Shuo Wang, Ge Yu, 25 Feb 2024, Say More with Less: Understanding Prompt Learning Behaviors through Gist Compression, https://arxiv.org/abs/2402.16058 https://github.com/OpenMatch/Gist-COCO
- Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou, 11 Oct 2024 (v2), Compressing Lengthy Context With UltraGist, https://arxiv.org/abs/2405.16635 https://github.com/namespace-Pt/UltraGist
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
- J Köpke, A Safan, 2024, Efficient llm-based conversational process modeling, Business Process Management Workshops, https://isys.uni-klu.ac.at/PDF/BPM_2024_paper_1442.pdf (Examines and improves the token costs of prompt strategies in conversational sessions.)
- Huanxuan Liao, Shizhu He, Yupu Hao, Xiang Li, Yuanzhe Zhang, Jun Zhao, Kang Liu, Jan 2025, SKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models, Proceedings of the 31st International Conference on Computational Linguistics, pages 3203–3221 January 19–24, 2025, https://aclanthology.org/2025.coling-main.215.pdf
- Weizhi Fei, Xueyan Niu, Guoqing Xie, Yingqing Liu, Bo Bai, Wei Han, 22 Jan 2025, Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference, https://arxiv.org/abs/2501.12959 (Efficient input token scanning using early exit during prefill to prune tokens for the decoding phase.)
- Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, Wenjie Li, 17 Feb 2025, TokenSkip: Controllable Chain-of-Thought Compression in LLMs, https://arxiv.org/abs/2502.12067
- Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng, 16 Feb 2025, Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention, https://arxiv.org/abs/2502.11089
- Pengfei He, Shaowei Wang, Tse-Hsun Chen, 19 Feb 2025, CODEPROMPTZIP: Code-specific Prompt Compression for Retrieval-Augmented Generation in Coding Tasks with LMs, https://arxiv.org/abs/2502.14925
- Weronika Łajewska, Momchil Hardalov, Laura Aina, Neha Anna John, Hang Su, Lluís Màrquez, 24 Mar 2025, Understanding and Improving Information Preservation in Prompt Compression for LLMs, https://arxiv.org/abs/2503.19114
- Q Liu, H Fan, J Zhang, X Li, C Li, B Li, 2025, DisComp: A Two-Stage Prompt Optimization Framework Combining Task-Agnostic and Task-Aware Compression, Findings of the Association for Computational Linguistics: NAACL 2025, pages 1033–1044, April 29- May 4, 2025, https://aclanthology.org/anthology-files/pdf/naacl/2025.naacl-findings.58.pdf (LLM summarization and sentence-level compression.)
- Tom Zehle, Moritz Schlager, Timo Heiß, Matthias Feurer, 17 Jun 2025 (v4), CAPO: Cost-Aware Prompt Optimization, https://arxiv.org/abs/2504.16005
- Ammar Ahmed, Sheng Di, Franck Cappello, Zirui Liu, Jingoo Han, Ali Anwar, 1 Aug 2025, Systematic Evaluation of Optimization Techniques for Long-Context Language Models, https://arxiv.org/abs/2508.00305
- Zhentao Xu, Fengyi Li, Albert Chen, Xiaofeng Wang, 4 Aug 2025, ProCut: LLM Prompt Compression via Attribution Estimation, https://arxiv.org/abs/2508.02053
- Wenhao Mao, Chengbin Hou, Tianyu Zhang, Xinyu Lin, Ke Tang, Hairong Lv, 6 Aug 2025, Parse Trees Guided LLM Prompt Compression, https://arxiv.org/abs/2409.15395
- Tinghui Zhang, Yifan Wang, Daisy Zhe Wang, 16 Aug 2025, SCOPE: A Generative Approach for LLM Prompt Compression, https://arxiv.org/abs/2508.15813
- Zesen Liu and Zhixiang Zhang and Yuchong Xie and Dongdong She, 27 Oct 2025, CompressionAttack: Exploiting Prompt Compression as a New Attack Surface in LLM-Powered Agents, https://arxiv.org/abs/2510.22963
- Joong Ho Choi, Jiayang Zhao, Jeel Shah, Ritvika Sonawane, Vedant Singh, Avani Appalla, Will Flanagan, Filipe Condessa, 20 Oct 2025, CompactPrompt: A Unified Pipeline for Prompt Data Compression in LLM Workflows, https://arxiv.org/abs/2510.18043
- Xiao Pu, Tianxing He, Xiaojun Wan, 17 Oct 2024, Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles, https://arxiv.org/abs/2410.14042
- Yun-Hao Cao, Yangsong Wang, Shuzheng Hao, Zhenxing Li, Chengjun Zhan, Sichao Liu, Yi-Qi Hu, 11 Mar 2025, EFPC: Towards Efficient and Flexible Prompt Compression, https://arxiv.org/abs/2503.07956
- Jinwu Hu, Wei Zhang, Yufeng Wang, Yu Hu, Bin Xiao, Mingkui Tan, Qing Du, 15 Apr 2025, Dynamic Compressing Prompts for Efficient Inference of Large Language Models, https://arxiv.org/abs/2504.11004 https://github.com/Fhujinwu/DCP
AI Books from Aussie AI
The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
Get your copy from Amazon: The Sweetest Lesson
RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
Get your copy from Amazon: RAG Optimization
Generative AI Applications book:
Get your copy from Amazon: Generative AI Applications
Generative AI programming book:
Get your copy from Amazon: Generative AI in C++
CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization
CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home