Aussie AI

Salient Tokens

Last Updated 1 January, 2026

by David Spuler, Ph.D.

Salient token methods are an LLM optimization strategy that concentrates inference computation on the important, or "salient", tokens. Unimportant (non-salient) tokens are pruned, which reduces the overall computation. Pruning can be applied to the tokens processed during regular inference, to the entries held in the KV cache, or to both.
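As a rough illustration of the KV cache variant, the following Python sketch scores each cached token by the total attention it has received and keeps only the top-scoring entries. The scoring rule (accumulated attention weight) is an assumption chosen for simplicity, not the method of any particular paper listed below, and the function names `salience_scores` and `prune_kv_cache` are hypothetical.

```python
from typing import List, Sequence, Tuple


def salience_scores(attention: Sequence[Sequence[float]]) -> List[float]:
    """Sum the attention weight each cached token receives over all query rows.

    attention[i][j] is the weight that query i places on cached token j.
    Using the accumulated weight as the salience score is an assumed heuristic.
    """
    num_tokens = len(attention[0])
    scores = [0.0] * num_tokens
    for row in attention:
        for j, weight in enumerate(row):
            scores[j] += weight
    return scores


def prune_kv_cache(
    keys: Sequence,
    values: Sequence,
    attention: Sequence[Sequence[float]],
    keep: int,
) -> Tuple[list, list, List[int]]:
    """Keep the `keep` most salient KV entries, preserving their original order."""
    scores = salience_scores(attention)
    # Rank token positions from most to least salient.
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    # Re-sort the survivors by position so the pruned cache stays in sequence order.
    kept = sorted(ranked[:keep])
    return [keys[j] for j in kept], [values[j] for j in kept], kept


# Toy example: 3 queries attending over 4 cached tokens.
attention = [
    [0.70, 0.10, 0.10, 0.10],
    [0.50, 0.05, 0.40, 0.05],
    [0.60, 0.10, 0.25, 0.05],
]
keys = ["k0", "k1", "k2", "k3"]
values = ["v0", "v1", "v2", "v3"]

pruned_keys, pruned_values, kept = prune_kv_cache(keys, values, attention, keep=2)
print(kept)         # positions of the retained tokens
print(pruned_keys)  # keys that remain in the cache
```

In this toy case tokens 0 and 2 receive the most attention, so they are the ones retained. A real implementation would work on per-head tensors and would usually protect the most recent tokens from pruning, but the selection logic follows the same pattern.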

Research on Salient Tokens

Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, Atlas Wang, 2024, Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, https://proceedings.mlsys.org/paper_files/paper/2024/hash/bbb7506579431a85861a05fff048d3e1-Abstract-Conference.html https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf https://github.com/VITA-Group/Q-Hitter
Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang, 23 May 2024, ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification, https://arxiv.org/abs/2405.14256
Hyesong Choi, Hyejin Park, Kwang Moo Yi, Sungmin Cha, Dongbo Min, 12 Apr 2024, Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training, https://arxiv.org/abs/2404.08327
Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, Trishul Chilimbi, 30 Oct 2021, Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning, https://arxiv.org/abs/2111.00230
Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang, 17 Feb 2025, Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More, https://arxiv.org/abs/2502.11494 https://github.com/ZichenWen1/DART
Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, Ramachandran Ramjee, 18 Dec 2025, Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference, https://arxiv.org/abs/2512.16391 (Chooses the likely output tokens exactly in early layers and then only computes with those, pruning less likely tokens.)