Aussie AI
Multi-Token Decoding
-
Last Updated 26 August, 2025
-
by David Spuler, Ph.D.
Research on Multi-Token Decoding
Research papers include:
- Shikhar Tuli, Chi-Heng Lin, Yen-Chang Hsu, Niraj K. Jha, Yilin Shen, Hongxia Jin, 1 May 2024, DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling, https://arxiv.org/abs/2405.00888 (A model trained to predict multiple tokens ahead.)
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024 https://arxiv.org/abs/2401.10774
- Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737 Project: https://huggingface.co/facebook/multi-token-prediction
- Michael Nuñez, July 4, 2024, Meta drops AI bombshell: Multi-token prediction models now open for research, https://venturebeat.com/ai/meta-drops-ai-bombshell-multi-token-prediction-models-now-open-for-research/
- Zongyue Qin, Ziniu Hu, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 12 Jul 2024, Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference, https://arxiv.org/abs/2407.09722
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 2024, Exploring and Improving Drafts in Blockwise Parallel Decoding, https://openreview.net/pdf?id=KtnUTS1f91
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Tri Dao, September 11, 2023, Medusa: Simple framework for accelerating LLM generation with multiple decoding heads, https://www.together.ai/blog/medusa
- Wei Zhong, Manasa Bharadwaj, 1 Jun 2024 (v2), S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314
- Desh Raj, Gil Keren, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli, 12 Sep 2024, Faster Speech-LLaMA Inference with Multi-token Prediction, https://arxiv.org/abs/2409.08148
- Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu, 8 Oct 2024, ParallelSpec: Parallel Drafter for Efficient Speculative Decoding, https://arxiv.org/abs/2410.05589 (Multi-token prediction in draft models for speculative decoding.)
- Siru Ouyang, Shuohang Wang, Minhao Jiang, Ming Zhong, Donghan Yu, Jiawei Han, Yelong Shen, 14 Oct 2024, Temperature-Centric Investigation of Speculative Decoding with Knowledge Distillation, https://arxiv.org/abs/2410.10141 https://github.com/ozyyshr/TempSpec
- Tan Dat Nguyen, Ji-Hoon Kim, Jeongsoo Choi, Shukjae Choi, Jinseok Park, Younglo Lee, Joon Son Chung, 17 Oct 2024, Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding, https://arxiv.org/abs/2410.13839
- Anonymous Authors, Oct 2024, Optimized Multi-Token Joint Decoding With Auxiliary Model for LLM Inference, https://openreview.net/pdf?id=ZHhBawo3k5
- Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao, 27 Oct 2024, FIRP: Faster LLM inference via future intermediate representation prediction, https://arxiv.org/abs/2410.20488
- DP Ghosh, DA Team, Oct 29, 2024, Multi-Token Prediction with Extended Transformer Layers, https://www.researchgate.net/profile/Debiprasad-Ghosh/publication/385311204_Multi-Token_Prediction_with_Extended_Transformer_Layers/links/671fdd2c55a5271cdee28059/Multi-Token-Prediction-with-Extended-Transformer-Layers.pdf
- Yash Akhauri, Safeen Huda, Mohamed S. Abdelfattah, 26 Nov 2024, Attamba: Attending To Multi-Token States, https://arxiv.org/abs/2411.17685
- Shibaranjani Dasgupta, Chandan Maity, Somdip Mukherjee, Rohan Singh, Diptendu Dutta, Debasish Jana, 14 Dec 2024, HITgram: A Platform for Experimenting with n-gram Language Models, https://arxiv.org/abs/2412.10717
- Y Li, K Livescu, J Zhou, Dec 2024, Beyond Token Generation: Adaptive Chunk-Distilled Language Modeling, 38th Conference on Neural Information Processing Systems (NeurIPS 2024), https://neurips2024-enlsp.github.io/papers/paper_90.pdf (Generate multiple tokens in decoding by inserting RAG chunks directly into the decoding output.)
- Tim Urista, Dec 2024, Dramatically Reduce Inference Costs with DeepSeek-V3: A New Era in Open-Source LLMs, https://ai.gopubby.com/dramatically-reduce-inference-costs-with-deepseek-v3-a-new-era-in-open-source-llms-4f1adf760ee1
- Yanhong Li, Karen Livescu, Jiawei Zhou, 31 Dec 2024, Chunk-Distilled Language Modeling, https://arxiv.org/abs/2501.00343 (Multi-token decoding using retrieval.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 20 Nov 2024 (v2), From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838
- Minhajul Hoque, Jan 4, 2025, DeepSeek V3: How They Achieved Big Results with Small Compute, https://ai.plainenglish.io/deepseek-v3-how-they-achieved-big-results-with-small-compute-fb694606d59a (DeepSeek optimizations included FP8 quantization with outlier handling, attention and KV cache optimization via Multi-Head Latent Attention (MHLA), and multi-token decoding.)
- Nandini Lokesh Reddy, Jan 2025, DeepSeek: Bridging Performance and Efficiency in Modern AI, https://medium.com/@nandinilreddy/deepseek-bridging-performance-and-efficiency-in-modern-ai-106181a85693
- Qianhui Zhao, Li Zhang, Fang Liu, Xiaoli Lian, Qiaoyuanhe Meng, Ziqian Jiao, Zetong Zhou, Borui Zhang, Runlin Guo, Jia Li, 24 Feb 2025, CodeSwift: Accelerating LLM Inference for Efficient Code Generation, https://arxiv.org/abs/2502.17139 (Using draft sequences from a datastore of code, to achieve parallel inference, similar to prompt looking decoding or retrieval lookup decoding.)
- Yunhai Hu, Zining Liu, Zhenyuan Dong, Tianfan Peng, Bradley McDanel, Sai Qian Zhang, 27 Feb 2025, Speculative Decoding and Beyond: An In-Depth Review of Techniques, https://arxiv.org/abs/2502.19732
- Yijiong Yu, 26 Mar 2025, Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence, https://arxiv.org/abs/2503.20533 https://github.com/yuyijiong/parallel-decoding-in-one-sequence
- Chengen Wang, Murat Kantarcioglu, 14 Mar 2025, A Review of DeepSeek Models' Key Innovative Techniques, https://arxiv.org/abs/2503.11486
- L. Xiong et al., May 2025, DeepSeek: Paradigm Shifts and Technical Evolution in Large AI Models, IEEE/CAA Journal of Automatica Sinica, vol. 12, no. 5, pp. 841-858, May 2025, doi: 10.1109/JAS.2025.125495, https://ieeexplore.ieee.org/abstract/document/11005752
- Anastasios Gerontopoulos, Spyros Gidaris, Nikos Komodakis, 15 May 2025, Multi-Token Prediction Needs Registers, https://arxiv.org/abs/2505.10518
- Somesh Mehra, Javier Alonso Garcia, Lukas Mauch, 13 Feb 2025, On multi-token prediction for efficient LLM inference, https://arxiv.org/abs/2502.09419?
- Xiaohao Liu, Xiaobo Xia, Weixiang Zhao, Manyi Zhang, Xianzhi Yu, Xiu Su, Shuo Yang, See-Kiong Ng, Tat-Seng Chua, 23 May 2025, L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models, https://arxiv.org/abs/2505.17505
- Stephen Diehl, 2025, Attention Wasn't All We Needed, https://www.stephendiehl.com/posts/post_transformers/
- Anirudhan Badrinath, Prabhat Agarwal, Laksh Bhasin, Jaewon Yang, Jiajing Xu, Charles Rosenberg, 6 Aug 2025, PinRec: Outcome-Conditioned, Multi-Token Generative Retrieval for Industry-Scale Recommendation Systems, https://arxiv.org/abs/2504.10507
AI Books from Aussie AI
![]() |
The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
Get your copy from Amazon: The Sweetest Lesson |
![]() |
RAG Optimization: Accurate and Efficient LLM Applications:
new book on RAG architectures:
Get your copy from Amazon: RAG Optimization |
![]() |
Generative AI Applications book:
Get your copy from Amazon: Generative AI Applications |
![]() |
Generative AI programming book:
Get your copy from Amazon: Generative AI in C++ |
![]() |
CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization |
![]() |
CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging |
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home