Aussie AI
Early Exit Speculative Decoding
Last Updated 18 April, 2026
by David Spuler, Ph.D.
Early Exit Speculative Decoding: Book Excerpts and Blog Articles
Free online book excerpts, with full-text chapters and free PDF downloads, plus related articles from the Aussie AI blog:
- David Spuler, Ph.D., 25th August, 2024, Sequential Speculative Decoding, Aussie AI Blog, https://www.aussieai.com/blog/sequential-speculative-decoding
- David Spuler, 2024, Patent: Speculative Decoding with Early Exit for Optimized Transformer On-device Inference, Aussie AI Patent Filing, https://www.aussieai.com/research/patent-speculative-decoding-early-exit
- David Spuler, 2024, Edit Decoding with Early Exit for Optimized Transformer On-Device Inference, Aussie AI Patent Filing, https://www.aussieai.com/research/patent-edit-decoding-early-exit
Research on Early Exit Speculative Decoding
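In this approach, often called self-speculative decoding, the early layers of the model itself act as the draft model: a shallow sub-network cheaply proposes several tokens ahead, and the full-depth model then verifies them in a single parallel pass, keeping the longest correct prefix. This avoids maintaining a separate draft model, and lets the draft and verifier share computation.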
Research papers include:
- Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, 6 Jun 2024, Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism, https://arxiv.org/abs/2406.03853
- Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
- Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
- Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions, including training with early exit, and speculative decoding where the draft model is an early exit within the larger model, with two advantages: (a) the draft and verifier models thereby share KV cache data for the early layers, and (b) it avoids the stale KV cache problems normally caused by early exiting. A minimal code sketch of this self-drafting pattern appears after this list.)
- Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang, 29 Apr 2024, Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, https://arxiv.org/abs/2404.18911 Code: https://github.com/Equationliu/Kangaroo (Speculative decoding where the draft model is an early exit of layers in the verifier model, and the draft model is itself sped up further by confidence-based early exiting.)
- Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737 Project: https://huggingface.co/facebook/multi-token-prediction
- Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 2 Jul 2024, S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models, https://arxiv.org/abs/2407.01955 (Creating, managing and integrating multiple draft models as submodels in speculative decoding.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi, 16 Feb 2024, Speculative Streaming: Fast LLM Inference without Auxiliary Models, https://arxiv.org/abs/2402.11131
- Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti, 2024, Fed-EE: Federating Heterogeneous ASR Models using Early-Exit Architectures, PDF: https://cris.fbk.eu/bitstream/11582/343747/1/paper_49.pdf (Mentions early exit in relation to generalized speculative decoding.)
- Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun, 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, https://arxiv.org/abs/2310.05424 (Using early exits as the draft model in generalized speculative decoding.)
- Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (Survey paper has coverage of this type of speculative decoding.)
- Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev, 1 Oct 2024, Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity, https://arxiv.org/abs/2410.01028 (Self-speculative decoding that removes layers based on cosine similarity.)
- Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li, 9 Oct 2024, SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration, https://arxiv.org/abs/2410.06916 (Self-speculative decoding using layer skipping, rather than early exit.)
- Hyun Ryu, Eric Kim, 20 Nov 2024, Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding, https://arxiv.org/abs/2411.13157
- Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
- Aritra Roy Gosthipaty, Mostafa Elhoushi, Pedro Cuenca, Vaibhav Srivastav, November 20, 2024, Faster Text Generation with Self-Speculative Decoding, https://huggingface.co/blog/layerskip
- Divya Jyoti Bajpai, Manjesh Kumar Hanawal, 13 Jan 2025, A Survey of Early Exit Deep Neural Networks in NLP, https://arxiv.org/abs/2501.07670 (Good survey of early-exit classifier types.)
- Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 5 Feb 2025, QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache, https://arxiv.org/abs/2502.10424 (Combining self-speculative decoding with KV quantization.)
- Yuichiro Hoshino, Hideyuki Tachibana, Muneyoshi Inahara, Hiroto Takegawa, 28 May 2025, RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding, https://arxiv.org/abs/2505.22135
- Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda, 27 May 2025, Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits, https://arxiv.org/abs/2505.21594
- Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram, 7 Aug 2025, DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding, https://arxiv.org/abs/2504.05598
- Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li, 19 Sep 2025, Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding, https://arxiv.org/abs/2509.19368
- Divya Jyoti Bajpai and Manjesh Kumar Hanawal, 26 Oct 2025, FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference, https://arxiv.org/abs/2510.22641
- Linxiao Zeng, Haoyun Deng, Kangyuan Shu, Shizhen Wang, 26 Sep 2025, Self-Speculative Biased Decoding for Faster Live Translation, https://arxiv.org/abs/2509.21740
- Arnav Jalan, 17 Mar 2026, Speculative Decoding: 2-3x Faster LLM Inference (2026), https://blog.premai.io/speculative-decoding-2-3x-faster-llm-inference-2026/ (How speculative decoding works, draft model selection, EAGLE3 vs Medusa, acceptance-rate math, and vLLM and SGLang setup, with real benchmarks from Llama 3.1 on H100s.)
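For intuition, here is a minimal, self-contained sketch of the self-drafting pattern described in papers such as LayerSkip and Kangaroo above. Everything in it is illustrative rather than any one paper's algorithm: the "model" is random toy weights standing in for a Transformer, the names (forward, early_exit_speculative_step, EXIT_LAYER, n_draft) are invented for this example, and a real implementation would reuse the early layers' activations and KV cache rather than recomputing from scratch as this toy does.

```python
import numpy as np

# Toy stand-in for a Transformer: random weights, one matrix per "layer".
# Purely illustrative -- a real system uses a trained model and reuses
# the early layers' KV cache instead of recomputing every position.
rng = np.random.default_rng(0)
VOCAB, DIM, LAYERS, EXIT_LAYER = 50, 16, 8, 2
embed = rng.standard_normal((VOCAB, DIM))
layer_w = [rng.standard_normal((DIM, DIM)) / np.sqrt(DIM) for _ in range(LAYERS)]
head = rng.standard_normal((DIM, VOCAB))  # shared output head for all exits

def forward(token_ids, n_layers):
    """Run only the first n_layers; greedy next-token prediction per position."""
    h = embed[token_ids]                 # (seq, DIM)
    for w in layer_w[:n_layers]:
        h = np.tanh(h @ w)               # toy "layer"
    return np.argmax(h @ head, axis=-1)  # (seq,) greedy token ids

def early_exit_speculative_step(context, n_draft=4):
    """Draft n_draft tokens from the shallow exit, verify at full depth."""
    draft = list(context)
    for _ in range(n_draft):             # cheap autoregressive drafting
        draft.append(forward(np.array(draft), EXIT_LAYER)[-1])
    # One full-depth pass scores every drafted position in parallel.
    full_preds = forward(np.array(draft), LAYERS)
    accepted = list(context)
    for i in range(len(context), len(draft)):
        target = full_preds[i - 1]       # full model's choice for position i
        if draft[i] == target:
            accepted.append(target)      # draft token verified: keep it
        else:
            accepted.append(target)      # first mismatch: take the full
            break                        # model's token and stop accepting
    else:
        accepted.append(full_preds[-1])  # all drafts accepted: bonus token
    return accepted

seq = [1, 2, 3]
for _ in range(5):
    seq = early_exit_speculative_step(seq)
print(seq)  # grows by 1 to n_draft+1 tokens per step
```

With greedy decoding on both the draft and verify paths, every accepted token matches what the full model alone would have generated, so the scheme is lossless; the speedup comes from verifying several cheap draft tokens per expensive full-depth pass.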
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI (new book on AI intelligence theory): Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications (new book on RAG architectures): Get your copy from Amazon: RAG Optimization
- Generative AI Applications: Get your copy from Amazon: Generative AI Applications
- Generative AI in C++ (Generative AI programming book): Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization: Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging: Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research