Aussie AI

Early Exit Training

  • Last Updated 25 April, 2026
  • by David Spuler, Ph.D.

What is Early Exit Training?

Early exit training applies the well-known inference optimization of exiting layers early to the training path. Inference has been the primary area for early exiting of layers, but the idea has been applied to training in several ways:

  • Training optimization — exiting layers early in the forward or backward passes to reduce training compute.
  • Dataset pruning — discarding entire token sequences of training data.
  • Training early exit models — the early exit is actually used at inference time, but the model must be trained specially so that it remains accurate despite exiting early (see the sketch after this list).
  • Layer freezing in training — somewhat similar to exiting layers early in the backward pass.
  • LayerDrop — a structured dropout method used during training, aimed at accuracy improvement (avoiding overfitting, improving generalization) rather than training efficiency.
  • Knowledge distillation — early exit is one way to optimize the distillation training phase.
  • LoRA/QLoRA fine-tuning — this efficient fine-tuning approach can be further optimized with early exit.
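
To make the "training early exit models" idea concrete, here is a minimal PyTorch-style sketch (illustrative module names and dimensions, not any particular paper's architecture) that attaches an auxiliary classifier to every layer and trains with a weighted sum of per-exit losses, so that the shallow exits also learn to predict well:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EarlyExitEncoder(nn.Module):
        def __init__(self, num_layers=6, d_model=128, num_classes=10):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
                for _ in range(num_layers)
            ])
            # One auxiliary exit classifier per layer.
            self.exits = nn.ModuleList([
                nn.Linear(d_model, num_classes) for _ in range(num_layers)
            ])

        def forward(self, x):
            logits_per_exit = []
            for layer, exit_head in zip(self.layers, self.exits):
                x = layer(x)
                # Pool over the sequence and classify at this depth.
                logits_per_exit.append(exit_head(x.mean(dim=1)))
            return logits_per_exit

    model = EarlyExitEncoder()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def training_step(inputs, labels, exit_weights=None):
        logits_per_exit = model(inputs)
        if exit_weights is None:
            exit_weights = [1.0] * len(logits_per_exit)
        # Weighted sum of per-exit losses; deeper exits can be weighted more heavily.
        loss = sum(w * F.cross_entropy(logits, labels)
                   for w, logits in zip(exit_weights, logits_per_exit))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Example usage with random data: (batch, sequence, d_model) inputs.
    x = torch.randn(8, 16, 128)
    y = torch.randint(0, 10, (8,))
    print(training_step(x, y))

At inference time such a model can stop at any layer whose exit classifier is confident, and it is the auxiliary training losses that keep those shallow exits accurate.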

Research on Early Exit Training

Research papers on the various forms of early exit training are covered in the subsections below.

Dataset Pruning and Early Exit

Dataset pruning can use early exit to detect training data sequences that won't be useful. It is similar to deduplication of training data, but operates at a semantic level. It is also somewhat similar to early exiting of the training stack itself, especially if the weight updates from the layers already processed are retained.
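
As a rough sketch of how this could work (hypothetical model attributes such as model.embed, model.layers, and model.exit_head; not any specific paper's method), a shallow proxy forward pass scores each sequence, and sequences the early exit already predicts confidently are dropped as adding little new signal:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def shallow_exit_confidence(model, input_ids, exit_layer=2):
        """Run only the first `exit_layer` blocks and return the max softmax probability."""
        h = model.embed(input_ids)
        for layer in model.layers[:exit_layer]:
            h = layer(h)
        logits = model.exit_head(h[:, -1, :])   # predict from the last token position
        return F.softmax(logits, dim=-1).max(dim=-1).values

    def prune_dataset(model, sequences, keep_threshold=0.9):
        """Keep only sequences that the shallow exit is NOT already confident about."""
        kept = []
        for input_ids in sequences:
            conf = shallow_exit_confidence(model, input_ids.unsqueeze(0)).item()
            if conf < keep_threshold:
                kept.append(input_ids)
        return kept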

Layer Freezing in Training or Fine-Tuning

Layerwise freezing of model weights can be viewed as an "early exit" technique if you tilt your head sideways. Frozen layers cause the backward pass to stop updating weights below a particular depth, which is like doing "early exiting of layers" in the backpropagation phase.
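
Here is a minimal PyTorch-style sketch of the idea (assuming a model whose blocks live in a model.layers ModuleList): setting requires_grad=False on the lowest layers means there are no parameters to update below that depth, so the backward pass effectively stops there.

    import torch

    def freeze_lower_layers(model, num_frozen):
        """Freeze the first `num_frozen` blocks (and typically the embeddings too)."""
        for layer in model.layers[:num_frozen]:
            for param in layer.parameters():
                param.requires_grad = False

    # Hand only the unfrozen parameters to the optimizer.
    # optimizer = torch.optim.AdamW(
    #     (p for p in model.parameters() if p.requires_grad), lr=1e-4)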

  • Hwang, T., Seo, H., Jung, J., & Jung, S. (2025). Exploring Selective Layer Freezing Strategies in Transformer Fine-Tuning: NLI Classifiers with Sub-3B Parameter Models. Applied Sciences, 15(19), 10434. https://doi.org/10.3390/app151910434 https://neurips.cc/virtual/2025/loc/san-diego/poster/117825
  • Minhyuk Seo, Hyunseo Koh, Jonghyun Choi, 16 Mar 2025 (v2), Budgeted Online Continual Learning by Adaptive Layer Freezing and Frequency-based Sampling, https://arxiv.org/abs/2410.15143
  • Jian Ma, Xinchen Lyu, Jun Jiang, Qimei Cui, Haipeng Yao, Xiaofeng Tao, 23 Mar 2025, SplitFrozen: Split Learning with Device-side Model Frozen for Fine-Tuning LLM on Heterogeneous Resource-Constrained Devices, https://arxiv.org/abs/2503.18986
  • Andrew Brock, Theodore Lim, J.M. Ritchie, Nick Weston, 18 Jun 2017 (v2), FreezeOut: Accelerate Training by Progressively Freezing Layers, https://arxiv.org/abs/1706.04983 https://github.com/ajbrock/FreezeOut
  • Jaejun Lee, Raphael Tang, Jimmy Lin, 8 Nov 2019, What Would Elsa Do? Freezing Layers During Transformer Fine-Tuning, https://arxiv.org/abs/1911.03090
  • Yiding Wang, Decang Sun, Kai Chen, Fan Lai, Mosharaf Chowdhury, 11 Mar 2023 (v2), Egeria: Efficient DNN Training with Knowledge-Guided Layer Freezing, https://arxiv.org/abs/2201.06227
  • Li Yang, Sen Lin, Fan Zhang, Junshan Zhang, Deliang Fan, 13 Mar 2023, Efficient Self-supervised Continual Learning with Progressive Task-correlated Layer Freezing, https://arxiv.org/abs/2303.07477
  • Sheng Li, Geng Yuan, Yue Dai, Youtao Zhang, Yanzhi Wang, Xulong Tang, 30 Jan 2024, SmartFRZ: An Efficient Training Framework using Attention-Based Layer Freezing, https://arxiv.org/abs/2401.16720
  • Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang, 1 Jun 2025 (v3), A Semantic-Aware Layer-Freezing Approach to Computation-Efficient Fine-Tuning of Language Models, https://arxiv.org/abs/2406.11753
  • Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun, 3 Nov 2025 (v2), ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers, https://arxiv.org/abs/2504.00502
  • Chence Yang, Ci Zhang, Lei Lu, Qitao Tan, Sheng Li, Ao Li, Xulong Tang, Shaoyi Huang, Jinzhen Wang, Guoming Li, Jundong Li, Xiaoming Zhai, Jin Lu, Geng Yuan, 20 Aug 2025, Rethinking the Potential of Layer Freezing for Efficient DNN Training, https://arxiv.org/abs/2508.15033
  • Andrzej D. Dobrzycki, Ana M. Bernardos, José R. Casar, 5 Sep 2025, An Analysis of Layer-Freezing Strategies for Enhanced Transfer Learning in YOLO Architectures, https://arxiv.org/abs/2509.05490
  • Sybelle Goedicke-Fritz, Michelle Bous, Annika Engel, Matthias Flotho, Pascal Hirsch, Hannah Wittig, Dino Milanovic, Dominik Mohr, Mathias Kaspar, Sogand Nemat, Dorothea Kerner, Arno Bücker, Andreas Keller, Sascha Meyer, Michael Zemlin, Philipp Flotho, 10 Oct 2025, Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants, https://arxiv.org/abs/2507.12269

LayerDrop: Avoid Overfitting in Training

LayerDrop is a type of structured dropout during training or fine-tuning. The idea is not compute optimization in training, but that randomly skipping layers actually helps the accuracy of the model by introducing some "noise" into training. This helps avoid overfitting and leads to better generalization. There are various ways to do "dropout" in general, and LayerDrop is a form of layerwise dropout.
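
A minimal sketch of LayerDrop-style structured dropout (an illustrative wrapper, not the original paper's code): each layer is skipped with some probability during training, while all layers run at inference time.

    import torch
    import torch.nn as nn

    class LayerDropStack(nn.Module):
        def __init__(self, layers, drop_prob=0.2):
            super().__init__()
            self.layers = nn.ModuleList(layers)
            self.drop_prob = drop_prob

        def forward(self, x):
            for layer in self.layers:
                # Randomly skip whole layers, but only in training mode.
                if self.training and torch.rand(()).item() < self.drop_prob:
                    continue
                x = layer(x)
            return x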

LoRA, QLoRA, and PEFT with Early Exit

LoRA and QLoRA are types of Parameter-Efficient Fine-Tuning (PEFT), which is an efficient type of model fine-tuning. Early exit has been considered as one of the ways to speed up the training of the LoRA/QLoRA adapters.
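
As an illustrative sketch only (not the LoRAExit paper's implementation), the combination can be as simple as keeping the base weights frozen, training only the low-rank A/B matrices, and attaching an exit classifier at a shallow layer so that both fine-tuning and early exit stay cheap:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank=8, alpha=16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False          # pretrained weights stay frozen
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # Frozen base projection plus the trainable low-rank update.
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    # A hypothetical exit head attached after an early layer; only LoRA and exit
    # parameters are handed to the optimizer.
    # exit_head = nn.Linear(d_model, num_classes)
    # trainable = [p for p in model.parameters() if p.requires_grad]
    # optimizer = torch.optim.AdamW(trainable + list(exit_head.parameters()), lr=1e-4)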

  • Jiacheng Liu, Peng Tang, Xiaofeng Hou, Chao Li, and Pheng-Ann Heng. 2024. LoRAExit: Empowering Dynamic Modulation of LLMs in Resource-limited Settings using Low-rank Adapters. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9211–9225, Miami, Florida, USA. Association for Computational Linguistics. https://aclanthology.org/2024.findings-emnlp.539/
  • Mengyun Liu, Shanshan Huang, Jianan Jiang, 12 Jan 2026, EdgeNav-QE: QLoRA Quantization and Dynamic Early Exit for LAM-based Navigation on Edge Devices, https://arxiv.org/abs/2602.15836
  • Hossein Rajabzadeh, Maryam Dialameh, Chul B. Park, Il-Min Kim, Hyock Ju Kwon, 5 Jan 2026, LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference, https://arxiv.org/abs/2601.02569 https://github.com/hosseinbv/LoRA-Drop.git
  • D. Borsos, "Development and Optimisation of Exit Procedures for LoRaWanNodes," 2024 IEEE 22nd Jubilee International Symposium on Intelligent Systems and Informatics (SISY), Pula, Croatia, 2024, pp. 000523-000527, doi: 10.1109/SISY62279.2024.10737616, https://ieeexplore.ieee.org/document/10737616

Knowledge Distillation and Early Exit

Knowledge distillation can use early exit as a way to speed up distillation, which is a type of training.
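
A minimal sketch of one way to combine the two (illustrative, not a specific paper's loss): the student's early-exit logits are pulled toward the full-depth teacher's soft targets with a temperature-scaled KL term, alongside the usual hard-label loss.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_exit_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        # Soft-target loss between the student's early-exit logits and the teacher.
        soft = F.kl_div(
            F.log_softmax(student_exit_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)
        # Standard cross-entropy against the ground-truth labels.
        hard = F.cross_entropy(student_exit_logits, labels)
        return alpha * soft + (1.0 - alpha) * hard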
