Aussie AI

Early Exit Training

  • Last Updated 25 April, 2026
  • by David Spuler, Ph.D.

What is Early Exit Training?

Early exit training applies the well-known inference optimization of exiting layers early to the training path. Inference has been the primary area for early exiting of layers, but the idea has been applied to training in several ways:

  • Training optimization — exiting layers early in the forward or backward passes to reduce training compute.
  • Dataset pruning — discarding entire token sequences of training data.
  • Training early exit models — the early exit is actually used at inference time, but the model must be trained specially so that it remains accurate despite exiting early (see the sketch after this list).
  • Layer freezing in training — somewhat similar to exiting layers early in the backward pass.
  • LayerDrop — a structured dropout method used during training, aimed at accuracy improvement (avoiding overfitting, improving generalization) rather than training efficiency.
  • Knowledge distillation — early exit is one way to optimize the distillation training phase.
  • LoRA/QLoRA fine-tuning — this efficient fine-tuning approach can be further optimized with early exit.
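
To make the "training early exit models" idea concrete, here is a minimal PyTorch-style sketch (illustrative module names and dimensions, not any particular paper's architecture) that attaches an auxiliary classifier to every layer and trains with a weighted sum of per-exit losses, so that the shallow exits also learn to predict well:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EarlyExitEncoder(nn.Module):
        def __init__(self, num_layers=6, d_model=128, num_classes=10):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
                for _ in range(num_layers)
            ])
            # One auxiliary exit classifier per layer.
            self.exits = nn.ModuleList([
                nn.Linear(d_model, num_classes) for _ in range(num_layers)
            ])

        def forward(self, x):
            logits_per_exit = []
            for layer, exit_head in zip(self.layers, self.exits):
                x = layer(x)
                # Pool over the sequence and classify at this depth.
                logits_per_exit.append(exit_head(x.mean(dim=1)))
            return logits_per_exit

    model = EarlyExitEncoder()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def training_step(inputs, labels, exit_weights=None):
        logits_per_exit = model(inputs)
        if exit_weights is None:
            exit_weights = [1.0] * len(logits_per_exit)
        # Weighted sum of per-exit losses; deeper exits can be weighted more heavily.
        loss = sum(w * F.cross_entropy(logits, labels)
                   for w, logits in zip(exit_weights, logits_per_exit))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Example usage with random data: (batch, sequence, d_model) inputs.
    x = torch.randn(8, 16, 128)
    y = torch.randint(0, 10, (8,))
    print(training_step(x, y))

At inference time such a model can stop at any layer whose exit classifier is confident, and it is the auxiliary training losses that keep those shallow exits accurate.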

Research on Early Exit Training

Research papers on the various forms of early exit training are covered in the subsections below.

Dataset Pruning and Early Exit

Dataset pruning can use early exit to detect training data sequences that won't be useful. It is similar to deduplication of training data, but operates at a semantic level. It is also somewhat similar to early exiting of the training stack itself, especially if the weight updates from the layers already processed are retained.
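
As a rough sketch of how this could work (hypothetical model attributes such as model.embed, model.layers, and model.exit_head; not any specific paper's method), a shallow proxy forward pass scores each sequence, and sequences the early exit already predicts confidently are dropped as adding little new signal:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def shallow_exit_confidence(model, input_ids, exit_layer=2):
        """Run only the first `exit_layer` blocks and return the max softmax probability."""
        h = model.embed(input_ids)
        for layer in model.layers[:exit_layer]:
            h = layer(h)
        logits = model.exit_head(h[:, -1, :])   # predict from the last token position
        return F.softmax(logits, dim=-1).max(dim=-1).values

    def prune_dataset(model, sequences, keep_threshold=0.9):
        """Keep only sequences that the shallow exit is NOT already confident about."""
        kept = []
        for input_ids in sequences:
            conf = shallow_exit_confidence(model, input_ids.unsqueeze(0)).item()
            if conf < keep_threshold:
                kept.append(input_ids)
        return kept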

Layer Freezing in Training or Fine-Tuning

Layerwise freezing of model weights can be viewed as an "early exit" technique if you tilt your head sideways. Frozen layers cause the backward pass to stop updating weights below a particular depth, which is like doing "early exiting of layers" in the backpropagation phase.
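
Here is a minimal PyTorch-style sketch of the idea (assuming a model whose blocks live in a model.layers ModuleList): setting requires_grad=False on the lowest layers means there are no parameters to update below that depth, so the backward pass effectively stops there.

    import torch

    def freeze_lower_layers(model, num_frozen):
        """Freeze the first `num_frozen` blocks (and typically the embeddings too)."""
        for layer in model.layers[:num_frozen]:
            for param in layer.parameters():
                param.requires_grad = False

    # Hand only the unfrozen parameters to the optimizer.
    # optimizer = torch.optim.AdamW(
    #     (p for p in model.parameters() if p.requires_grad), lr=1e-4)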

  • Hwang, T., Seo, H., Jung, J., & Jung, S. (2025). Exploring Selective Layer Freezing Strategies in Transformer Fine-Tuning: NLI Classifiers with Sub-3B Parameter Models. Applied Sciences, 15(19), 10434. https://doi.org/10.3390/app151910434 https://neurips.cc/virtual/2025/loc/san-diego/poster/117825
  • Minhyuk Seo, Hyunseo Koh, Jonghyun Choi, 16 Mar 2025 (v2), Budgeted Online Continual Learning by Adaptive Layer Freezing and Frequency-based Sampling, https://arxiv.org/abs/2410.15143
  • Jian Ma, Xinchen Lyu, Jun Jiang, Qimei Cui, Haipeng Yao, Xiaofeng Tao, 23 Mar 2025, SplitFrozen: Split Learning with Device-side Model Frozen for Fine-Tuning LLM on Heterogeneous Resource-Constrained Devices, https://arxiv.org/abs/2503.18986
  • Andrew Brock, Theodore Lim, J.M. Ritchie, Nick Weston, 18 Jun 2017 (v2), FreezeOut: Accelerate Training by Progressively Freezing Layers, https://arxiv.org/abs/1706.04983 https://github.com/ajbrock/FreezeOut
  • Jaejun Lee, Raphael Tang, Jimmy Lin, 8 Nov 2019, What Would Elsa Do? Freezing Layers During Transformer Fine-Tuning, https://arxiv.org/abs/1911.03090
  • Yiding Wang, Decang Sun, Kai Chen, Fan Lai, Mosharaf Chowdhury, 11 Mar 2023 (v2), Egeria: Efficient DNN Training with Knowledge-Guided Layer Freezing, https://arxiv.org/abs/2201.06227
  • Li Yang, Sen Lin, Fan Zhang, Junshan Zhang, Deliang Fan, 13 Mar 2023, Efficient Self-supervised Continual Learning with Progressive Task-correlated Layer Freezing, https://arxiv.org/abs/2303.07477
  • Sheng Li, Geng Yuan, Yue Dai, Youtao Zhang, Yanzhi Wang, Xulong Tang, 30 Jan 2024, SmartFRZ: An Efficient Training Framework using Attention-Based Layer Freezing, https://arxiv.org/abs/2401.16720
  • Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang, 1 Jun 2025 (v3), A Semantic-Aware Layer-Freezing Approach to Computation-Efficient Fine-Tuning of Language Models, https://arxiv.org/abs/2406.11753
  • Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun, 3 Nov 2025 (v2), ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers, https://arxiv.org/abs/2504.00502
  • Chence Yang, Ci Zhang, Lei Lu, Qitao Tan, Sheng Li, Ao Li, Xulong Tang, Shaoyi Huang, Jinzhen Wang, Guoming Li, Jundong Li, Xiaoming Zhai, Jin Lu, Geng Yuan, 20 Aug 2025, Rethinking the Potential of Layer Freezing for Efficient DNN Training, https://arxiv.org/abs/2508.15033
  • Andrzej D. Dobrzycki, Ana M. Bernardos, José R. Casar, 5 Sep 2025, An Analysis of Layer-Freezing Strategies for Enhanced Transfer Learning in YOLO Architectures, https://arxiv.org/abs/2509.05490
  • Sybelle Goedicke-Fritz, Michelle Bous, Annika Engel, Matthias Flotho, Pascal Hirsch, Hannah Wittig, Dino Milanovic, Dominik Mohr, Mathias Kaspar, Sogand Nemat, Dorothea Kerner, Arno Bücker, Andreas Keller, Sascha Meyer, Michael Zemlin, Philipp Flotho, 10 Oct 2025, Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants, https://arxiv.org/abs/2507.12269

LayerDrop: Avoid Overfitting in Training

LayerDrop is a type of structured dropout during training or fine-tuning. The idea is not compute optimization in training, but that randomly skipping layers actually helps the accuracy of the model by introducing some "noise" into training. This helps avoid overfitting and leads to better generalization. There are various ways to do "dropout" in general, and LayerDrop is a form of layerwise dropout.
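
A minimal sketch of LayerDrop-style structured dropout (an illustrative wrapper, not the original paper's code): each layer is skipped with some probability during training, while all layers run at inference time.

    import torch
    import torch.nn as nn

    class LayerDropStack(nn.Module):
        def __init__(self, layers, drop_prob=0.2):
            super().__init__()
            self.layers = nn.ModuleList(layers)
            self.drop_prob = drop_prob

        def forward(self, x):
            for layer in self.layers:
                # Randomly skip whole layers, but only in training mode.
                if self.training and torch.rand(()).item() < self.drop_prob:
                    continue
                x = layer(x)
            return x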

LoRA, QLoRA, and PEFT with Early Exit

LoRA and QLoRA are types of Parameter-Efficient Fine-Tuning (PEFT), which is an efficient type of model fine-tuning. Early exit has been considered as one of the ways to speed up the training of the LoRA/QLoRA adapters.
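
As an illustrative sketch only (not the LoRAExit paper's implementation), the combination can be as simple as keeping the base weights frozen, training only the low-rank A/B matrices, and attaching an exit classifier at a shallow layer so that both fine-tuning and early exit stay cheap:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank=8, alpha=16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False          # pretrained weights stay frozen
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # Frozen base projection plus the trainable low-rank update.
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    # A hypothetical exit head attached after an early layer; only LoRA and exit
    # parameters are handed to the optimizer.
    # exit_head = nn.Linear(d_model, num_classes)
    # trainable = [p for p in model.parameters() if p.requires_grad]
    # optimizer = torch.optim.AdamW(trainable + list(exit_head.parameters()), lr=1e-4)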

  • Jiacheng Liu, Peng Tang, Xiaofeng Hou, Chao Li, and Pheng-Ann Heng. 2024. LoRAExit: Empowering Dynamic Modulation of LLMs in Resource-limited Settings using Low-rank Adapters. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9211–9225, Miami, Florida, USA. Association for Computational Linguistics. https://aclanthology.org/2024.findings-emnlp.539/
  • Mengyun Liu, Shanshan Huang, Jianan Jiang, 12 Jan 2026, EdgeNav-QE: QLoRA Quantization and Dynamic Early Exit for LAM-based Navigation on Edge Devices, https://arxiv.org/abs/2602.15836
  • Hossein Rajabzadeh, Maryam Dialameh, Chul B. Park, Il-Min Kim, Hyock Ju Kwon, 5 Jan 2026, LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference, https://arxiv.org/abs/2601.02569 https://github.com/hosseinbv/LoRA-Drop.git
  • D. Borsos, "Development and Optimisation of Exit Procedures for LoRaWanNodes," 2024 IEEE 22nd Jubilee International Symposium on Intelligent Systems and Informatics (SISY), Pula, Croatia, 2024, pp. 000523-000527, doi: 10.1109/SISY62279.2024.10737616, https://ieeexplore.ieee.org/document/10737616

Knowledge Distillation and Early Exit

Knowledge distillation can use early exit as a way to speed up distillation, which is a type of training.
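
A minimal sketch of one way to combine the two (illustrative, not a specific paper's loss): the student's early-exit logits are pulled toward the full-depth teacher's soft targets with a temperature-scaled KL term, alongside the usual hard-label loss.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_exit_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        # Soft-target loss between the student's early-exit logits and the teacher.
        soft = F.kl_div(
            F.log_softmax(student_exit_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)
        # Standard cross-entropy against the ground-truth labels.
        hard = F.cross_entropy(student_exit_logits, labels)
        return alpha * soft + (1.0 - alpha) * hard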
