Aussie AI
Early Exit Speculative Decoding
Last Updated 18 April, 2026
by David Spuler, Ph.D.
Early Exit Speculative Decoding: Book Excerpts and Blog Articles
Free online book excerpts, with full-text chapters and free PDF downloads, plus related articles from the Aussie AI blog:
- David Spuler, Ph.D., 25th August, 2024, Sequential Speculative Decoding, Aussie AI Blog, https://www.aussieai.com/blog/sequential-speculative-decoding
- David Spuler, 2024, Patent: Speculative Decoding with Early Exit for Optimized Transformer On-device Inference, Aussie AI Patent Filing, https://www.aussieai.com/research/patent-speculative-decoding-early-exit
- David Spuler, 2024, Edit Decoding with Early Exit for Optimized Transformer On-Device Inference, Aussie AI Patent Filing, https://www.aussieai.com/research/patent-edit-decoding-early-exit
Research on Early Exit Speculative Decoding
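In this approach, often called self-speculative decoding, the early layers of the model itself act as the draft model: a shallow sub-network cheaply proposes several tokens ahead, and the full-depth model then verifies them in a single parallel pass, keeping the longest correct prefix. This avoids maintaining a separate draft model, and lets the draft and verifier share computation.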
Research papers include:
- Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, 6 Jun 2024, Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism, https://arxiv.org/abs/2406.03853
- Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
- Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
- Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions, including training with early exit, and speculative decoding where the draft model is an early exit within the larger model, with two advantages: (a) the draft and verifier models thereby share KV cache data for the early layers, and (b) it avoids the stale KV cache problems normally caused by early exiting. A minimal code sketch of this self-drafting pattern appears after this list.)
- Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang, 29 Apr 2024, Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, https://arxiv.org/abs/2404.18911 Code: https://github.com/Equationliu/Kangaroo (Speculative decoding where the draft model is an early exit of layers in the verifier model, and the draft model is itself sped up further by confidence-based early exiting.)
- Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737 Project: https://huggingface.co/facebook/multi-token-prediction
- Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 2 Jul 2024, S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models, https://arxiv.org/abs/2407.01955 (Creating, managing and integrating multiple draft models as submodels in speculative decoding.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi, 16 Feb 2024, Speculative Streaming: Fast LLM Inference without Auxiliary Models, https://arxiv.org/abs/2402.11131
- Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti, 2024, Fed-EE: Federating Heterogeneous ASR Models using Early-Exit Architectures, PDF: https://cris.fbk.eu/bitstream/11582/343747/1/paper_49.pdf (Mentions early exit in relation to generalized speculative decoding.)
- Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun, 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, https://arxiv.org/abs/2310.05424 (Using early exits as the draft model in generalized speculative decoding.)
- Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (Survey paper has coverage of this type of speculative decoding.)
- Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev, 1 Oct 2024, Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity, https://arxiv.org/abs/2410.01028 (Self-speculative decoding that removes layers based on cosine similarity.)
- Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li, 9 Oct 2024, SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration, https://arxiv.org/abs/2410.06916 (Self-speculative decoding using layer skipping, rather than early exit.)
- Hyun Ryu, Eric Kim, 20 Nov 2024, Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding, https://arxiv.org/abs/2411.13157
- Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
- Aritra Roy Gosthipaty, Mostafa Elhoushi, Pedro Cuenca, Vaibhav Srivastav, November 20, 2024, Faster Text Generation with Self-Speculative Decoding, https://huggingface.co/blog/layerskip
- Divya Jyoti Bajpai, Manjesh Kumar Hanawal, 13 Jan 2025, A Survey of Early Exit Deep Neural Networks in NLP, https://arxiv.org/abs/2501.07670 (Good survey of early-exit classifier types.)
- Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 5 Feb 2025, QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache, https://arxiv.org/abs/2502.10424 (Combining self-speculative decoding with KV quantization.)
- Yuichiro Hoshino, Hideyuki Tachibana, Muneyoshi Inahara, Hiroto Takegawa, 28 May 2025, RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding, https://arxiv.org/abs/2505.22135
- Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda, 27 May 2025, Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits, https://arxiv.org/abs/2505.21594
- Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram, 7 Aug 2025, DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding, https://arxiv.org/abs/2504.05598
- Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li, 19 Sep 2025, Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding, https://arxiv.org/abs/2509.19368
- Divya Jyoti Bajpai and Manjesh Kumar Hanawal, 26 Oct 2025, FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference, https://arxiv.org/abs/2510.22641
- Linxiao Zeng, Haoyun Deng, Kangyuan Shu, Shizhen Wang, 26 Sep 2025, Self-Speculative Biased Decoding for Faster Live Translation, https://arxiv.org/abs/2509.21740
- Arnav Jalan, 17 Mar 2026, Speculative Decoding: 2-3x Faster LLM Inference (2026), https://blog.premai.io/speculative-decoding-2-3x-faster-llm-inference-2026/ (How speculative decoding works, draft model selection, EAGLE3 vs Medusa, acceptance-rate math, and vLLM and SGLang setup, with real benchmarks from Llama 3.1 on H100s.)
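For intuition, here is a minimal, self-contained sketch of the self-drafting pattern described in papers such as LayerSkip and Kangaroo above. Everything in it is illustrative rather than any one paper's algorithm: the "model" is random toy weights standing in for a Transformer, the names (forward, early_exit_speculative_step, EXIT_LAYER, n_draft) are invented for this example, and a real implementation would reuse the early layers' activations and KV cache rather than recomputing from scratch as this toy does.

```python
import numpy as np

# Toy stand-in for a Transformer: random weights, one matrix per "layer".
# Purely illustrative -- a real system uses a trained model and reuses
# the early layers' KV cache instead of recomputing every position.
rng = np.random.default_rng(0)
VOCAB, DIM, LAYERS, EXIT_LAYER = 50, 16, 8, 2
embed = rng.standard_normal((VOCAB, DIM))
layer_w = [rng.standard_normal((DIM, DIM)) / np.sqrt(DIM) for _ in range(LAYERS)]
head = rng.standard_normal((DIM, VOCAB))  # shared output head for all exits

def forward(token_ids, n_layers):
    """Run only the first n_layers; greedy next-token prediction per position."""
    h = embed[token_ids]                 # (seq, DIM)
    for w in layer_w[:n_layers]:
        h = np.tanh(h @ w)               # toy "layer"
    return np.argmax(h @ head, axis=-1)  # (seq,) greedy token ids

def early_exit_speculative_step(context, n_draft=4):
    """Draft n_draft tokens from the shallow exit, verify at full depth."""
    draft = list(context)
    for _ in range(n_draft):             # cheap autoregressive drafting
        draft.append(forward(np.array(draft), EXIT_LAYER)[-1])
    # One full-depth pass scores every drafted position in parallel.
    full_preds = forward(np.array(draft), LAYERS)
    accepted = list(context)
    for i in range(len(context), len(draft)):
        target = full_preds[i - 1]       # full model's choice for position i
        if draft[i] == target:
            accepted.append(target)      # draft token verified: keep it
        else:
            accepted.append(target)      # first mismatch: take the full
            break                        # model's token and stop accepting
    else:
        accepted.append(full_preds[-1])  # all drafts accepted: bonus token
    return accepted

seq = [1, 2, 3]
for _ in range(5):
    seq = early_exit_speculative_step(seq)
print(seq)  # grows by 1 to n_draft+1 tokens per step
```

With greedy decoding on both the draft and verify paths, every accepted token matches what the full model alone would have generated, so the scheme is lossless; the speedup comes from verifying several cheap draft tokens per expensive full-depth pass.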
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI (new book on AI intelligence theory): Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications (new book on RAG architectures): Get your copy from Amazon: RAG Optimization
- Generative AI Applications: Get your copy from Amazon: Generative AI Applications
- Generative AI in C++ (Generative AI programming book): Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization: Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging: Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research