Aussie AI

Token Merging

Last Updated 10 May, 2026

by David Spuler, Ph.D.

What is Token Merging?

Token merging is an LLM inference optimization that merges two or more adjacent tokens into a single token. This reduces the number of tokens the GPU must process, which directly improves speed and reduces compute. Token merging is a type of token reduction and is closely related to token pruning, which simply removes tokens from processing. Tokens can be merged before the prompt is sent to the LLM, or dynamically during inference based on an analysis of which tokens are important.
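As a rough illustration of the idea, here is a minimal Python sketch. It assumes each token is represented by an embedding vector and that adjacent tokens are merged by averaging when their cosine similarity reaches a threshold. The function name `merge_adjacent_tokens`, the threshold value, and the toy vectors are all hypothetical choices for this example, not taken from any of the papers listed below, and real methods use more sophisticated matching and merging rules.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def merge_adjacent_tokens(embeddings, threshold=0.9):
    """Greedily merge runs of adjacent, similar token embeddings.

    Hypothetical helper for illustration only.
    Returns (merged, sizes), where sizes[i] counts how many
    original tokens were averaged into merged[i].
    """
    merged = []
    sizes = []
    for vec in embeddings:
        if merged and cosine_similarity(merged[-1], vec) >= threshold:
            n = sizes[-1]
            # Update the running average of the current merged token.
            merged[-1] = [(m * n + v) / (n + 1) for m, v in zip(merged[-1], vec)]
            sizes[-1] = n + 1
        else:
            # Not similar enough to the previous token: start a new one.
            merged.append(list(vec))
            sizes.append(1)
    return merged, sizes

# Toy 2-dimensional "embeddings" for five tokens (assumed values).
tokens = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [1.0, 1.0],
]
merged, sizes = merge_adjacent_tokens(tokens, threshold=0.9)
print(len(tokens), "->", len(merged), sizes)
```

With these toy inputs, five tokens are reduced to three, so later layers would process a shorter sequence. Keeping the `sizes` counts is a design choice that lets a merged token be weighted by how many originals it represents.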

Related research areas include token pruning and token reduction more generally.

Research on Token Merging

Research papers include:

Hoai-Chau Tran, Duy M. H. Nguyen, Duy M. Nguyen, Trung-Tin Nguyen, Ngan Le, Pengtao Xie, Daniel Sonntag, James Y. Zou, Binh T. Nguyen, Mathias Niepert, 25 May 2024, Accelerating Transformers with Spectrum-Preserving Token Merging, https://arxiv.org/abs/2405.16148
Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim, 20 Mar 2024, vid-TLDR: Training Free Token merging for Light-weight Video Transformer, https://arxiv.org/abs/2403.13347 Code: https://github.com/mlvlab/vid-TLDR (Token merging in video with a focus on the background of the image.)
Maxim Bonnaerens, Joni Dambre, 17 Aug 2023 (v2), Learned Thresholds Token Merging and Pruning for Vision Transformers, https://arxiv.org/abs/2307.10780
Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi, 27 May 2023, PuMer: Pruning and Merging Tokens for Efficient Vision Language Models, https://arxiv.org/abs/2305.17530
Mingliang Zhai, Yulin Li, Xiameng Qin, Chen Yi, Qunyi Xie, Chengquan Zhang, Kun Yao, Yuwei Wu, Yunde Jia, 19 May 2023, Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding, https://arxiv.org/abs/2305.11392
Daniel Bolya, Judy Hoffman, 30 Mar 2023, Token Merging for Fast Stable Diffusion, https://arxiv.org/abs/2303.17604
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman, 1 Mar 2023 (v3), Token Merging: Your ViT But Faster, https://arxiv.org/abs/2210.09461
Cedric Renggli, André Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, Carlos Riquelme, 24 Feb 2022, Learning to Merge Tokens in Vision Transformers, https://arxiv.org/abs/2202.12015
Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin, 2 Dec 2023, Token Fusion: Bridging the Gap between Token Pruning and Token Merging, https://arxiv.org/abs/2312.01026
Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang, 25 Apr 2024, TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning, https://arxiv.org/abs/2404.16635
Daniel Kienzle, Marco Kantonis, Robin Schön, Rainer Lienhart, 23 May 2024, Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation, https://arxiv.org/abs/2405.14467
Wenxuan Huang, Yunhang Shen, Jiao Xie, Baochang Zhang, Gaoqi He, Ke Li, Xing Sun, Shaohui Lin, 31 Mar 2024, A General and Efficient Training for Transformer via Token Expansion, https://arxiv.org/abs/2404.00672 Code: https://github.com/Osilly/TokenExpansion (Token merging to accelerate training.)
Xu, L., Wang, L. & Guo, Z., 2024, ATFTrans: attention-weighted token fusion transformer for robust and efficient object tracking. Neural Comput & Applic 36, 7043–7056 (2024). https://doi.org/10.1007/s00521-024-09444-0 https://link.springer.com/article/10.1007/s00521-024-09444-0
Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
Maxim Bonnaerens, Nov 2023, Resource-Efficient Deep Learning for Computer Vision, Ph.D. thesis, Ghent University, https://biblio.ugent.be/publication/01HEMGWENRT8C255N2RD9KAEJC/file/01HEMGZ9JYP8NXPSQJZM14ACT9 (Examines various vision Transformer optimizations, including a NAS approach based on building blocks and combined token pruning/merging for input compression.)
Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang, 18 Jun 2024, D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models, https://arxiv.org/abs/2406.13035 (Per-layer KV cache eviction strategies with token merging applied to the KV cache.)
Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
Yuqing Yang, Yuedong Xu, Lei Jiao, 7 Jul 2024, A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length, https://arxiv.org/abs/2407.05347
Yancheng Wang, Yingzhen Yang, 21 Jul 2024, Efficient Visual Transformer by Learnable Token Merging, https://arxiv.org/abs/2407.15219 Code: https://github.com/Statistical-Deep-Learning/LTM
Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li, 11 Aug 2024, Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators, https://arxiv.org/abs/2408.05710 (Reduce the attention cost in diffusion models by what is effectively token merging between the Q and K data.)
Kyle Wiggers, September 11, 2024, Mistral releases Pixtral 12B, its first multimodal model, https://techcrunch.com/2024/09/11/mistral-releases-pixtral-its-first-multimodal-model/
J Hong, G Lee, J Cho, Accelerating Multilingual Language Model for Excessively Tokenized Languages, Findings of the Association for Computational Linguistics: ACL 2024, pages 11095–11111 August 11-16, 2024, https://arxiv.org/abs/2401.10660 https://aclanthology.org/2024.findings-acl.660/ https://aclanthology.org/2024.findings-acl.660.pdf
Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang, 4 Dec 2024, AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning, https://arxiv.org/abs/2412.03248
Mingjia Shi, Yuhao Zhou, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Tanmay Rajpurohit, Shanmukha Ramakrishna Vedantam, Wangbo Zhao, Kai Wang, Yang You, 17 Dec 2024, Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training, https://arxiv.org/abs/2412.12496
Seungdong Yoa, Seungjun Lee, Hyeseung Cho, Bumsoo Kim, Woohyung Lim, 21 Dec 2024, ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition, https://arxiv.org/abs/2412.16491
Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
Karim Haroun, Thibault Allenet, Karim Ben Chehida, Jean Martinet, 2025, Dynamic hierarchical token merging for vision transformers, VISAPP 2025: 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Feb 2025, Porto, Portugal, hal-04885469, https://hal.science/hal-04885469/document
J. Shin, M. Kang, Y. Han, J. Park and L. -S. Kim, "AToM: Adaptive Token Merging for Efficient Acceleration of Vision Transformer," in IEEE Transactions on Computers, doi: 10.1109/TC.2025.3540638. https://ieeexplore.ieee.org/abstract/document/10880106/
Yu Yang, Yue Zhou, Xiaofang Hu, Shukai Duan, 2025, KFF: K-Feature Fusion Token Merging for Vision Transformer, Expert Systems with Applications, 128206, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2025.128206, https://www.sciencedirect.com/science/article/abs/pii/S0957417425018263
Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn, 5 Aug 2025, Efficient Time Series Processing for Transformers and State-Space Models through Token Merging, https://arxiv.org/abs/2405.17951
Dong Liu, Yanxuan Yu, 16 Aug 2025, QuickMerge++: Fast Token Merging with Autoregressive Prior, https://arxiv.org/abs/2508.13204
Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, Sami Muhaidat, 12 Sep 2025, Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge, https://arxiv.org/abs/2509.09955
Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, 11 Sep 2025, Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication, https://arxiv.org/abs/2509.09168
Inha Kang, Youngsun Lim, Seonho Lee, Jiho Choi, Junsuk Choe, Hyunjung Shim, 15 Oct 2025, What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging, https://arxiv.org/abs/2510.13232
Wenyi Gong, Mieszko Lis, 26 Sep 2025, CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones, https://arxiv.org/abs/2509.21764
Mandar Karhade, MD. PhD., Feb 2026, The End of Token Inflation with DeepSeek OCR-2: How “Context Optical Compression” Re-Engineers Document Processing from First Principles, https://ithinkbot.com/the-end-of-token-inflation-with-deepseek-ocr-2-8acdd653c2cf
Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang, 23 Mar 2026, Rethinking Token Reduction for Large Vision-Language Models, https://arxiv.org/abs/2603.21701 Code: https://github.com/MArSha1147/MetaCompress
Surendra Pathak, Bo Han, 11 Apr 2026 (v2), Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies, https://arxiv.org/abs/2603.27960