Aussie AI

Token Merging

  • Last Updated 10 May, 2026
  • by David Spuler, Ph.D.

What is Token Merging?

Token merging is an LLM inference optimization that merges two or more adjacent tokens into a single token. This reduces the number of tokens the GPU must process, which lowers compute cost and improves speed. Token merging is a type of token reduction and is closely related to token pruning, which removes tokens from processing entirely. Tokens can be merged before the prompt is sent to the LLM, or dynamically during inference based on analysis of which tokens are important.
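As a rough illustration of the idea of merging adjacent tokens, the sketch below repeatedly finds the most similar pair of neighbouring token embeddings and replaces it with a size-weighted average. This is a minimal toy example and not the method of any particular paper listed here. The function names (`cosine`, `merge_adjacent_tokens`), the cosine-similarity criterion, and the weighted-average rule are all assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def merge_adjacent_tokens(embeddings, num_merges):
    """Greedily merge the most similar adjacent pair, num_merges times.

    Hypothetical helper for illustration only. Returns the reduced list of
    embeddings and, for each one, how many original tokens it represents.
    """
    tokens = [list(e) for e in embeddings]   # copy so the input is not modified
    sizes = [1] * len(tokens)                # original tokens folded into each entry
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        # Index i of the adjacent pair (i, i+1) with the highest similarity.
        i = max(range(len(tokens) - 1),
                key=lambda k: cosine(tokens[k], tokens[k + 1]))
        total = sizes[i] + sizes[i + 1]
        # Size-weighted average, so earlier merges keep their proportional weight.
        merged = [(sizes[i] * x + sizes[i + 1] * y) / total
                  for x, y in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]
        sizes[i:i + 2] = [total]
    return tokens, sizes

# Four toy 2-d "token embeddings": two near-duplicates followed by two more.
reduced, counts = merge_adjacent_tokens(
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]], num_merges=2)
print(len(reduced), counts)
```

Each merge shortens the sequence by one token, so later layers process fewer positions. Real systems differ in how they score similarity, where in the model the merge happens, and whether merged tokens are later expanded again.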

Related research areas include:

Research on Token Merging

Research papers include:

  • Hoai-Chau Tran, Duy M. H. Nguyen, Duy M. Nguyen, Trung-Tin Nguyen, Ngan Le, Pengtao Xie, Daniel Sonntag, James Y. Zou, Binh T. Nguyen, Mathias Niepert, 25 May 2024, Accelerating Transformers with Spectrum-Preserving Token Merging, https://arxiv.org/abs/2405.16148
  • Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim, 20 Mar 2024, vid-TLDR: Training Free Token merging for Light-weight Video Transformer, https://arxiv.org/abs/2403.13347 Code: https://github.com/mlvlab/vid-TLDR (Token merging in video with a focus on the background of the image.)
  • Maxim Bonnaerens, Joni Dambre, 17 Aug 2023 (v2), Learned Thresholds Token Merging and Pruning for Vision Transformers, https://arxiv.org/abs/2307.10780
  • Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi, 27 May 2023, PuMer: Pruning and Merging Tokens for Efficient Vision Language Models, https://arxiv.org/abs/2305.17530
  • Mingliang Zhai, Yulin Li, Xiameng Qin, Chen Yi, Qunyi Xie, Chengquan Zhang, Kun Yao, Yuwei Wu, Yunde Jia, 19 May 2023, Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding, https://arxiv.org/abs/2305.11392
  • Daniel Bolya, Judy Hoffman, 30 Mar 2023, Token Merging for Fast Stable Diffusion, https://arxiv.org/abs/2303.17604
  • Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman, 1 Mar 2023 (v3), Token Merging: Your ViT But Faster, https://arxiv.org/abs/2210.09461
  • Cedric Renggli, André Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, Carlos Riquelme, 24 Feb 2022, Learning to Merge Tokens in Vision Transformers, https://arxiv.org/abs/2202.12015
  • Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin, 2 Dec 2023, Token Fusion: Bridging the Gap between Token Pruning and Token Merging, https://arxiv.org/abs/2312.01026
  • Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang, 25 Apr 2024, TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning, https://arxiv.org/abs/2404.16635
  • Daniel Kienzle, Marco Kantonis, Robin Schön, Rainer Lienhart, 23 May 2024, Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation, https://arxiv.org/abs/2405.14467
  • Wenxuan Huang, Yunhang Shen, Jiao Xie, Baochang Zhang, Gaoqi He, Ke Li, Xing Sun, Shaohui Lin, 31 Mar 2024, A General and Efficient Training for Transformer via Token Expansion, https://arxiv.org/abs/2404.00672 Code: https://github.com/Osilly/TokenExpansion (Token merging to accelerate training.)
  • Xu, L., Wang, L. & Guo, Z., 2024, ATFTrans: attention-weighted token fusion transformer for robust and efficient object tracking. Neural Comput & Applic 36, 7043–7056 (2024). https://doi.org/10.1007/s00521-024-09444-0 https://link.springer.com/article/10.1007/s00521-024-09444-0
  • Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
  • Maxim Bonnaerens, Nov 2023, Resource-Efficient Deep Learning for Computer Vision, Ph.D. thesis, Ghent University, https://biblio.ugent.be/publication/01HEMGWENRT8C255N2RD9KAEJC/file/01HEMGZ9JYP8NXPSQJZM14ACT9 (Examines various vision Transformer optimizations, including a NAS approach based on building blocks and combined token pruning/merging for input compression.)
  • Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang, 18 Jun 2024, D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models, https://arxiv.org/abs/2406.13035 (Per-layer KV cache eviction strategies with token merging applied to the KV cache.)
  • Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
  • Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
  • Yuqing Yang, Yuedong Xu, Lei Jiao, 7 Jul 2024, A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length, https://arxiv.org/abs/2407.05347
  • Yancheng Wang, Yingzhen Yang, 21 Jul 2024, Efficient Visual Transformer by Learnable Token Merging, https://arxiv.org/abs/2407.15219 Code: https://github.com/Statistical-Deep-Learning/LTM
  • Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li, 11 Aug 2024, Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators, https://arxiv.org/abs/2408.05710 (Reduce the attention cost in diffusion models by what is effectively token merging between the Q and K data.)
  • Kyle Wiggers, September 11, 2024, Mistral releases Pixtral 12B, its first multimodal model, https://techcrunch.com/2024/09/11/mistral-releases-pixtral-its-first-multimodal-model/
  • J Hong, G Lee, J Cho, Accelerating Multilingual Language Model for Excessively Tokenized Languages, Findings of the Association for Computational Linguistics: ACL 2024, pages 11095–11111 August 11-16, 2024, https://arxiv.org/abs/2401.10660 https://aclanthology.org/2024.findings-acl.660/ https://aclanthology.org/2024.findings-acl.660.pdf
  • Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang, 4 Dec 2024, AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning, https://arxiv.org/abs/2412.03248
  • Mingjia Shi, Yuhao Zhou, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Tanmay Rajpurohit, Shanmukha Ramakrishna Vedantam, Wangbo Zhao, Kai Wang, Yang You, 17 Dec 2024, Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training, https://arxiv.org/abs/2412.12496
  • Seungdong Yoa, Seungjun Lee, Hyeseung Cho, Bumsoo Kim, Woohyung Lim, 21 Dec 2024, ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition, https://arxiv.org/abs/2412.16491
  • Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
  • Karim Haroun, Thibault Allenet, Karim Ben Chehida, Jean Martinet, 2025, Dynamic hierarchical token merging for vision transformers, VISAPP 2025, 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Feb 2025, Porto, Portugal, hal-04885469, https://hal.science/hal-04885469/document
  • J. Shin, M. Kang, Y. Han, J. Park and L. -S. Kim, "AToM: Adaptive Token Merging for Efficient Acceleration of Vision Transformer," in IEEE Transactions on Computers, doi: 10.1109/TC.2025.3540638. https://ieeexplore.ieee.org/abstract/document/10880106/
  • Yu Yang, Yue Zhou, Xiaofang Hu, Shukai Duan, 2025, KFF: K-Feature Fusion Token Merging for Vision Transformer, Expert Systems with Applications, 128206, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2025.128206, https://www.sciencedirect.com/science/article/abs/pii/S0957417425018263
  • Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn, 5 Aug 2025, Efficient Time Series Processing for Transformers and State-Space Models through Token Merging, https://arxiv.org/abs/2405.17951
  • Dong Liu, Yanxuan Yu, 16 Aug 2025, QuickMerge++: Fast Token Merging with Autoregressive Prior, https://arxiv.org/abs/2508.13204
  • Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, Sami Muhaidat, 12 Sep 2025, Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge, https://arxiv.org/abs/2509.09955
  • Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, 11 Sep 2025, Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication, https://arxiv.org/abs/2509.09168
  • Inha Kang, Youngsun Lim, Seonho Lee, Jiho Choi, Junsuk Choe, Hyunjung Shim, 15 Oct 2025, What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging, https://arxiv.org/abs/2510.13232
  • Wenyi Gong, Mieszko Lis, 26 Sep 2025, CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones, https://arxiv.org/abs/2509.21764
  • Mandar Karhade, MD. PhD., Feb 2026, The End of Token Inflation with DeepSeek OCR-2: How “Context Optical Compression” Re-Engineers Document Processing from First Principles, https://ithinkbot.com/the-end-of-token-inflation-with-deepseek-ocr-2-8acdd653c2cf
  • Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang, 23 Mar 2026, Rethinking Token Reduction for Large Vision-Language Models, https://arxiv.org/abs/2603.21701 https://github.com/MArSha1147/MetaCompress
  • Surendra Pathak, Bo Han, 11 Apr 2026 (v2), Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies, https://arxiv.org/abs/2603.27960

