Aussie AI Blog
Early Exit as a Training Optimization
April 27, 2026
by David Spuler, Ph.D.
Early Exit in Training
I have collected over 300 research papers on layerwise early exit in inference, but only 15 on using early exit of layers to speed up training. (There are other papers on how to train early-exit models, which is a different thing.) It feels like the universe is trying to tell us something. Admittedly, to my knowledge, none of the mainstream inference engines or model architectures in the industry actually uses early exit techniques, even for inference, which is also saying something. The few papers I have found apply early exit in training to do:
- Data set pruning
- Layer freezing
- Early exiting of layers
Early Exit Decisions
The approach to using early exit in the pre-training stack is similar to inference: add "exit heads" at whichever layers we want to consider. At each exit head, the computation adds an unembedding matrix multiplication that converts the hidden state into logits, from which we get the output token probabilities as seen at that layer. In pre-training, we already know the next token we want, so we can decide, based on that information, whether the layer is predicting the right token and whether it is doing so with a high enough confidence level.
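To make this concrete, here is a minimal sketch in PyTorch of what an exit-head decision could look like during pre-training. The names and sizes (exit_head, exit_threshold, d_model) are illustrative assumptions, not from any particular codebase or paper:

```python
# Minimal sketch of an early-exit decision at one layer during pre-training.
# All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000   # illustrative model dimensions
exit_threshold = 0.9               # confidence required to exit early

# An exit head is just an unembedding projection from hidden state to logits
# (a real implementation might weight-tie this with the model's final LM head).
exit_head = nn.Linear(d_model, vocab_size, bias=False)

def should_exit(hidden_state: torch.Tensor, target_token: int) -> bool:
    """Return True if this layer already predicts the known next token
    with enough confidence to skip the remaining layers."""
    logits = exit_head(hidden_state)        # [vocab_size]
    probs = torch.softmax(logits, dim=-1)   # token probabilities at this layer
    predicted = int(torch.argmax(probs))
    return predicted == target_token and probs[target_token].item() >= exit_threshold
```

In a real pre-training loop this check would run per token position; the sketch shows a single position for clarity.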
And then what?
If the wrong token is being predicted, or the confidence is not high enough, then we simply progress to the next layer without exiting. If we decide to exit at a layer, then there are several options (see the sketch after this list):
- Skip to backprop — compute the losses based on the current layer's probabilities, and do a backprop, or
- Skip backprop, too — Don't do backprop because the model is already predicting it accurately enough, or
- Skip this whole batch — discard this training data entirely and don't do backprop (i.e., dataset pruning).
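Continuing the sketch above, a hypothetical training step could dispatch between these three options as follows. The mode names, the loop structure, and the reuse of the final exit head as the LM head are all assumptions for illustration, not a real training framework:

```python
# Hypothetical training step combining the three exit options above.
# Reuses exit_threshold from the earlier sketch; all names here are
# placeholders, not a real API.
import torch
import torch.nn.functional as F

exit_threshold = 0.9  # confidence required to exit early (assumed, as above)

def train_step(layers, exit_heads, optimizer, hidden, target_token,
               mode="backprop_at_exit"):
    """Run layers until an exit head fires, then apply one of the three
    early-exit policies. Returns True if the example should be pruned
    from the dataset (option 3)."""
    for layer, head in zip(layers, exit_heads):
        hidden = layer(hidden)
        logits = head(hidden)
        probs = torch.softmax(logits, dim=-1)
        if int(torch.argmax(probs)) == target_token \
                and probs[target_token].item() >= exit_threshold:
            if mode == "backprop_at_exit":
                # Option 1: compute the loss at this layer and backprop now.
                loss = F.cross_entropy(logits.unsqueeze(0),
                                       torch.tensor([target_token]))
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
            elif mode == "skip_backprop":
                # Option 2: already accurate enough; skip backprop entirely.
                pass
            elif mode == "prune_batch":
                # Option 3: flag this example for removal (dataset pruning).
                return True
            return False
    # No exit fired: normal full backprop using the last layer's logits.
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_token]))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return False
```

Note the trade-off: option 2 saves the most compute per step (no backward pass at all), while option 3 compounds the savings across epochs by shrinking the dataset, as in the Görmez and Koyuncu dataset-pruning paper cited below.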
There's not really a right or wrong answer here, because almost nobody is using this technique, even in research labs.
References on Early Exit of Training Layers
- Jan Hůla, David Adamczyk, Tomáš Filip, Martin Pavlíček, Petr Sosík, 27 Mar 2026, Two-dimensional early exit optimisation of LLM inference, https://arxiv.org/abs/2604.18592 (Early exit during training along the layer dimension and the sequence's sentence dimension.)
- Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou, 16 Jun 2024 (v3), EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism, https://arxiv.org/abs/2312.04916 https://github.com/pan-x-c/EE-LLM
- Alperen Görmez and Erdem Koyuncu. 2026. Dataset Pruning Using Early Exit Networks. ACM Trans. Intell. Syst. Technol. 17, 2, Article 34 (April 2026), 25 pages. https://doi.org/10.1145/3785502 https://dl.acm.org/doi/full/10.1145/3785502 (Prunes/discards training data based on early exit.)
- Simone Scardapane, Michele Scarpiniti, Enzo Baccarelli, and Aurelio Uncini. 2020. Why should we add early exits to neural networks? Cognitive Computation 12, 5 (2020), 954–966. https://arxiv.org/abs/2004.12814
- Alperen Görmez, 2024, Efficient Neural Network Inference and Training Using Early Exit Strategies, PhD Thesis, Electrical and Computer Engineering, Graduate College, University of Illinois at Chicago, https://indigo.uic.edu/articles/thesis/Efficient_Neural_Network_Inference_and_Training_Using_Early_Exit_Strategies/29049200/1/files/54497495.pdf
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, 17 Sep 2014, Going Deeper with Convolutions, https://arxiv.org/abs/1409.4842 https://arxiv.org/pdf/1409.4842 (GoogLeNet paper; its auxiliary classifiers at intermediate layers during training are an early precursor of early exit.)
- Devdhar Patel, Hava Siegelmann, 25 Dec 2022, QuickNets: Saving Training and Preventing Overconfidence in Early-Exit Neural Architectures, https://arxiv.org/abs/2212.12866 (Uses early exiting in training to reduce training costs.)
- Ozan Sener and Silvio Savarese. 2017. Active learning for convolutional neural networks: A core-set approach. arXiv:1708.00489. https://arxiv.org/abs/1708.00489
- Yamin Sepehri, Pedram Pad, Ahmet Caner Yüzügüler, Pascal Frossard, L. Andrea Dunbar, 20 May 2024 (v4), Hierarchical Training of Deep Neural Networks Using Early Exiting, https://arxiv.org/abs/2303.02384 (Optimized training in edge-cloud architecture via early exiting.)
- Shixiang Song, He Li, Zitong Wang, Boyi Zeng, Feichen Song, Yixuan Wang, Zhiqin John Xu, Ziwei He, Zhouhan Lin, 11 Mar 2026 (v2), AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth, https://arxiv.org/abs/2603.01914
- Pengfei Guo, Warren Richard Morningstar, Raviteja Vemulapalli, Karan Singhal, Vishal M. Patel, Philip Andrew Mansfield, 11 Sep 2023, Towards Federated Learning Under Resource Constraints via Layer-wise Training and Depth Dropout, https://arxiv.org/abs/2309.05213
- Shiwen Ni, Min Yang, Ruifeng Xu, Chengming Li, Xiping Hu, 26 Feb 2024, Layer-wise Regularized Dropout for Neural Language Models, https://arxiv.org/abs/2402.16361
- Pouya Shaeri, Ariane Middel, 16 May 2025, MID-L: Matrix-Interpolated Dropout Layer with Layer-wise Neuron Selection, https://arxiv.org/abs/2505.11416
- Haseena Rahmath P, Vishal Srivastava, Kuldeep Chaurasia, Roberto G. Pacheco, and Rodrigo S. Couto. 2024. Early-Exit Deep Neural Network - A Comprehensive Survey. ACM Comput. Surv. 57, 3, Article 75 (March 2025), 37 pages. https://doi.org/10.1145/3698767 https://dl.acm.org/doi/full/10.1145/3698767 https://dl.acm.org/doi/pdf/10.1145/3698767
- Alperen Görmez, Erdem Koyuncu, 8 Jul 2022, Pruning Early Exit Networks, https://arxiv.org/abs/2207.03644