Aussie AI
Vocabulary Trimming
Last Updated 1 June, 2026
by David Spuler, Ph.D.
What is Vocabulary Trimming?
Vocabulary trimming in LLMs is the reduction of the token vocabulary size as an inference optimization. A smaller vocabulary shrinks the embedding and unembedding matrices along their vocabulary dimension (one row per token), thereby reducing both the computation cost and the memory size of the model weights.
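As a rough illustration of the weight saving, the sketch below counts the weights held in the embedding and unembedding matrices, and then drops rows from a toy embedding table while remapping token IDs. The function names, the vocabulary sizes (128,000 and 32,000), and the hidden size (4,096) are hypothetical values chosen only to make the arithmetic concrete; they are not taken from any particular model or paper.

```python
from typing import Dict, List, Sequence, Tuple


def vocab_matrix_params(vocab_size: int, hidden_dim: int, tied: bool = False) -> int:
    """Weights in the embedding matrix plus the unembedding (LM head) matrix.

    With tied weights the two matrices share storage, so only one is counted.
    """
    per_matrix = vocab_size * hidden_dim
    return per_matrix if tied else 2 * per_matrix


def trim_embedding(
    embedding: Sequence[Sequence[float]], keep_ids: Sequence[int]
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Keep only the rows listed in keep_ids and return an old-id -> new-id map."""
    kept = sorted(set(keep_ids))
    old_to_new = {old: new for new, old in enumerate(kept)}
    trimmed = [list(embedding[old]) for old in kept]
    return trimmed, old_to_new


# Hypothetical sizes (assumption): 128,000 -> 32,000 tokens, hidden size 4,096.
before = vocab_matrix_params(128_000, 4096)
after = vocab_matrix_params(32_000, 4096)
print(f"weights saved: {before - after:,}")

# Toy 5-token vocabulary with hidden size 2; keep tokens 0, 2 and 4.
toy = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
small, remap = trim_embedding(toy, [4, 0, 2])
print(small)
print(remap)
```

In a real model the tokenizer would also need to be rebuilt so that it emits the new token IDs, and the same row selection would be applied to the unembedding matrix when it is not tied to the input embeddings.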
On the downside, a smaller vocabulary generally means that some texts must be expressed in more tokens. The token sequence length therefore increases for some input prompts, so the sequence-length dimension of LLM layer processing gets worse, even as the vocabulary dimension of the embedding matrices improves. Hence, there are important tradeoffs in this approach.
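The sequence-length side of this tradeoff can be seen with a toy tokenizer. The sketch below uses a simple greedy longest-match scheme, which is an assumption made for brevity rather than how BPE or any production tokenizer works, and the two vocabularies are invented for the example. Removing two multi-character tokens forces the same text to be spelled out in single characters.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Longest-match-first tokenization (a toy scheme, not real BPE).

    Assumes every character of the text is itself in the vocabulary,
    so a single-character fallback always exists.
    """
    max_len = max(len(token) for token in vocab)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


# Hypothetical vocabularies: single characters plus a few merged tokens.
chars = set("trimngvocab ")
full_vocab = chars | {"trim", "ming", "vocab"}
trimmed_vocab = chars | {"trim"}  # "ming" and "vocab" were trimmed away

text = "trimming vocab"
full_tokens = greedy_tokenize(text, full_vocab)
trimmed_tokens = greedy_tokenize(text, trimmed_vocab)
print(len(full_tokens), full_tokens)
print(len(trimmed_tokens), trimmed_tokens)
```

Whether trimming pays off in practice depends on how often the removed tokens would have occurred in the target workload, which is why several of the papers below select the retained vocabulary by language or by token frequency.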
Vocabulary trimming and lexical shortlisting have been used in Neural Machine Translation (NMT) for the translation of foreign languages. This research predates much of the LLM research, with many NMT techniques using other types of AI models, rather than LLMs and Transformers. The use of vocabulary trimming in LLMs remains largely unexplored and is an area warranting further research.
Related areas of LLM inference optimization include:
- Embeddings
- Tokenization
- Vocabulary expansion
- Token pruning
- Embeddings pruning
- Shortlisting
- Funnel transformer
Research on Vocabulary Trimming
Research papers on reducing the size of an LLM vocabulary:
- Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, Nov 2023, Large Language Model Inference with Lexical Shortlisting, https://arxiv.org/abs/2311.09709 (Shortlisting the vocabulary to common words for reduced tokens and embedding matrix size.)
- Y Wang, K Chen, H Tan, K Guo, 2023, Tabi: An Efficient Multi-Level Inference System for Large Language Models, EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems, Rome, Italy May 2023, Pages 233–248, https://doi.org/10.1145/3552326.3587438 https://dl.acm.org/doi/10.1145/3552326.3587438 PDF: https://cse.hkust.edu.hk/~kaichen/papers/tabi-eurosys23.pdf (Dynamic routing to small or large LLMs based on the query.)
- Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, June 20, 2024, The Ups and Downs of Large Language Model Inference, with Vocabulary Trimming by Language Heuristics, School of Informatics, University of Edinburgh, Proceedings of the Fifth Workshop on Insights from Negative Results in NLP, pages 148–153 https://aclanthology.org/2024.insights-1.17.pdf
- Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec, 24 Oct 2024, Dynamic Vocabulary Pruning in Early-Exit LLMs, https://arxiv.org/abs/2410.18952
- J Hong, G Lee, J Cho, Accelerating Multilingual Language Model for Excessively Tokenized Languages, Findings of the Association for Computational Linguistics: ACL 2024, pages 11095–11111 August 11-16, 2024, https://arxiv.org/abs/2401.10660 https://aclanthology.org/2024.findings-acl.660/ https://aclanthology.org/2024.findings-acl.660.pdf
- Nikolay Bogoychev, Pinzhen Chen, 21 Sep 2021 (v3), The Highs and Lows of Simple Lexical Domain Adaptation Approaches for Neural Machine Translation, https://arxiv.org/abs/2101.00421 https://aclanthology.org/2021.insights-1.12/
- Sreeram Vennam, Anish Joishy, Ponnurangam Kumaraguru, 10 Nov 2024, LLM Vocabulary Compression for Low-Compute Environments, https://arxiv.org/abs/2411.06371
- Leonidas Gee, Andrea Zugarini, Leonardo Rigutini, Paolo Torroni, 15 Feb 2024, Fast Vocabulary Transfer for Language Model Compression, https://arxiv.org/abs/2402.09977
- Yuta Nozaki, Dai Nakashima, Ryo Sato, Naoki Asaba, Shintaro Kawamura, Efficient Vocabulary Reduction for Small Language Models, Jan 2025, Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 771–783, January 19–24, 2025, Association for Computational Linguistics, https://aclanthology.org/2025.coling-industry.64.pdf
- Miles Williams, Young D. Kwon, Rui Li, Alexandros Kouris, Stylianos I. Venieris, 14 Feb 2026, Speculative Decoding with a Speculative Vocabulary, https://arxiv.org/abs/2602.13836 (Adaptively per-token reducing the drafter's vocabulary for a smaller unembedding matrix.)
- Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, Mingu Lee, 3 Jul 2025 (v2), VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs https://arxiv.org/abs/2506.22694
- Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jie Zhou, Hao Zhou, Jianyong Wang, Zhiyuan Liu, and Maosong Sun, 2025, FR-spec: Accelerating large-vocabulary language models via frequency ranked speculative sampling. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3909–3921, Vienna, Austria. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.acl-long.198 https://aclanthology.org/2025.acl-long.198/ https://aclanthology.org/2025.acl-long.198.pdf
- Do, D.-T., Le, N.-K., & Nguyen, L.-M. (2026). AdaSpec: Adaptive Multilingual Speculative Decoding with Self-Synthesized Language-Aware Training and Vocabulary Simplification. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36), 30530-30538. https://doi.org/10.1609/aaai.v40i36.40307 https://ojs.aaai.org/index.php/AAAI/article/view/40307
- David Spuler, May 31st, 2026, Chapter 49. Eagle, Medusa, and FR-Spec, in book LLM Inference Optimization: State-of-the-Art Research, Table of Contents: https://www.aussieai.com/book/llm-inference-optimization https://www.amazon.com/dp/B0H3FKR39T
- Hanling Zhang, Yayu Zhou, Tongcheng Fang, Zhihang Yuan, Guohao Dai, Wanli Ouyang, Yu Wang, 18 Apr 2026 (v3), VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models, https://arxiv.org/abs/2508.15229
- Semin Kim, Aug 04, 2025, Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration, https://blog.squeezebits.com/vocabulary-trimming-methods
- Asahi Ushio, May 2026 (accessed), Vocabulary Trimming, https://github.com/asahi417/lm-vocab-trimmer
- Asahi Ushio, Yi Zhou, Jose Camacho-Collados, 19 Oct 2023 (v3), An Efficient Multilingual Language Model Compression through Vocabulary Trimming, https://arxiv.org/abs/2305.15020
- Asahi Ushio, Yi Zhou, and Jose Camacho-Collados, 2023, Efficient Multilingual Language Model Compression through Vocabulary Trimming, In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14725–14739, Singapore. Association for Computational Linguistics, https://aclanthology.org/2023.findings-emnlp.981/
- Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar, 3 Feb 2026 (v3), DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models, https://arxiv.org/abs/2510.13847
- Eugene Kwek, Wenpeng Yin, 10 Oct 2025 (v3), COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens, https://arxiv.org/abs/2509.06836
- M. Fuadi, A. D. Wibawa and S. Sumpeno, 2026, Efficient Transformer Models via Language-Aware Frequency-Based Vocabulary Pruning, IEEE Access, vol. 14, pp. 50993-51006, 2026, doi: 10.1109/ACCESS.2026.3679735, https://ieeexplore.ieee.org/abstract/document/11460136
- Zhiyang Chen, Daliang Xu, Yinyuan Zhang, Chenghua Wang, Mengwei Xu, Yun Ma, 8 Apr 2026, MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies, https://arxiv.org/abs/2605.26444
- Shuyu Zhang, Lingfeng Pan, Qicheng Wang, Yaqi Shi, Yueyang Tan, Ruyu Yan, Jiaqi Chen, Lixing Du, Lu Wang, 28 May 2026 (v2), EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation. https://arxiv.org/abs/2605.27390
- Ofir Ben Shoham, 5 Mar 2026, Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding, https://arxiv.org/abs/2603.05210
- Anton Plaksin, Sergei Krutikov, Sergei Skvortsov, Alexander Samarin, 11 May 2026, SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding, https://arxiv.org/abs/2605.10453
- Bohan Zhao, Zane Cao, Yongchao He, 30 Nov 2025, SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving, https://arxiv.org/abs/2512.00719
- Daniil Gurgurov, Michal Gregor, Josef van Genabith, Simon Ostermann, 6 Nov 2025 (v2), On Multilingual Encoder Language Model Compression for Low-Resource Languages, https://arxiv.org/abs/2505.16956
- Loïck Bourdois, Tom Aarsen, Bram Vanroy, Woojun Jung, Manuel Romero, Prithiv Sakthi, May 28, 2026, Introduction to Trimming ✂, https://huggingface.co/blog/lbourdois/introduction-to-trimming
- Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter, 30 Mar 2024, An Analysis of BPE Vocabulary Trimming in Neural Machine Translation, https://arxiv.org/abs/2404.00397
- Yingru Li, Jiawei Xu, Jiacai Liu, Yuxuan Tong, Ziniu Li, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang, 6 Feb 2026 (v2), Dynamic Vocabulary Pruning: Stable LLM-RL by Taming the Tail, https://arxiv.org/abs/2512.23087
- R. E. Madsen, S. Sigurdsson, L. K. Hansen and J. Larsen, "Pruning the vocabulary for better context recognition," Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., Cambridge, UK, 2004, pp. 483-488, doi: 10.1109/ICPR.2004.1334270, https://ieeexplore.ieee.org/abstract/document/1334270
- Suraj Nair, Eugene Yang, Dawn Lawrie, James Mayfield, and Douglas W. Oard. 2023. BLADE: Combining Vocabulary Pruning and Intermediate Pretraining for Scaleable Neural CLIR. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23). Association for Computing Machinery, New York, NY, USA, 1219–1229. https://doi.org/10.1145/3539618.3591644 https://dl.acm.org/doi/abs/10.1145/3539618.3591644
- Ziqing Yang, Yiming Cui, and Zhigang Chen. 2022. TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 35–43, Dublin, Ireland. Association for Computational Linguistics. https://aclanthology.org/2022.acl-demo.4/ (Token and vocabulary pruning in text.)
- Darshan Fofadiya, 11 Dec 2025 (v2), Idea-Gated Transformers: Enforcing Semantic Coherence via Differentiable Vocabulary Pruning, https://arxiv.org/abs/2512.03343
More Research on Pruning Types
- Depth pruning (overview)
  - Static layer pruning
  - Layer pruning
  - Early exit
  - Dynamic layer pruning
  - Layer skipping
  - Layer approximation
  - Shallow decoder architecture
  - Layer reordering
  - Layer Importance
- Width pruning (overview)
  - Attention head pruning
  - Slimmable networks (width pruning)
  - FFN pruning
  - Channel pruning
  - Filter pruning
- Length pruning (longitudinal/input/end-to-end)
  - Token pruning (input pruning)
  - Dynamic token pruning
  - Prompt compression
  - Context compression
  - Token merging
  - Token skipping
  - Token dropping
  - Zero padding removal
- Embedding-dimension pruning
  - Embedding pruning
  - Embedding matrix compression (embedding pruning)
  - Embedding low-rank matrix factorization
  - Unembedding matrix (output embeddings)
- Multi-dimensional pruning
  - Dual pruning
  - Triple pruning
  - Quadruple pruning
  - 3D CNN model pruning