Aussie AI

Layer Skipping

  • Last Updated 15 August, 2025
  • by David Spuler, Ph.D.

What is Layer Skipping?

Layer skipping is an LLM inference optimization where some of the model's layers are simply skipped during processing. Every layer accepts and produces the same kind of data, a tensor of activations with the same dimensions, so any layer can be skipped without breaking the data flow, although skipping layers can reduce model accuracy. There are many related techniques that optimize layers (a minimal code sketch follows the list below):

  • Early exiting — skipping all remaining layers at the end.
  • Layer reordering — switching the order in which layers are executed.
  • Layer fusion — combining two or more layers into one.
  • Mixture-of-recursions — repeatedly reusing the same shared layers.
  • Hybrid local-global attention — using different attention modules in different layers.
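
As a minimal sketch of the basic idea, the following PyTorch-style Python loop runs a toy decoder and skips a chosen set of layers. The ToyDecoder class, the layer count, and the skip sets are illustrative assumptions rather than code from any of the papers listed below, and nn.TransformerEncoderLayer stands in for a real decoder layer. The sketch works only because every layer maps a (batch, sequence, hidden) activation tensor to another tensor of the same shape.

    # Minimal sketch of static layer skipping (illustrative, not from any cited paper).
    import torch
    import torch.nn as nn

    class ToyDecoder(nn.Module):
        def __init__(self, num_layers=12, hidden_dim=256, num_heads=4):
            super().__init__()
            # Every layer maps (batch, seq, hidden_dim) -> (batch, seq, hidden_dim),
            # which is the property that makes skipping any subset of layers possible.
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
                for _ in range(num_layers)
            )

        def forward(self, hidden, skip_layers=frozenset()):
            for i, layer in enumerate(self.layers):
                if i in skip_layers:
                    continue  # layer skipping: activations pass through unchanged
                hidden = layer(hidden)
            return hidden

    model = ToyDecoder().eval()
    x = torch.randn(1, 16, 256)  # (batch, seq_len, hidden_dim)
    with torch.no_grad():
        full = model(x)                                       # run all 12 layers
        skipped = model(x, skip_layers={3, 7, 11})            # skip three middle layers
        early_exit = model(x, skip_layers=set(range(6, 12)))  # skip all layers at the end
    print(full.shape, skipped.shape, early_exit.shape)        # all three have the same shape

Because every token skips the same layers in this static setting, the skipping sidesteps the stale KV cache problem that per-token dynamic skipping must handle (see the Unified Layer Skipping paper below); dynamic approaches instead choose the skipped layers at runtime based on the input.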

Research on Layer Skipping

Research papers include:

  • Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang, 3 Jun 2024, Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching, https://arxiv.org/abs/2406.01733 Code: https://github.com/horseee/learning-to-cache (Layer skipping in diffusion transformers via layer caching.)
  • Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
  • Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov, 19 May 2024, Your Transformer is Secretly Linear, https://arxiv.org/abs/2405.12250 (Replacing model layers in the decoder with linear approximations.)
  • Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions, including training with early exit, and self-speculative decoding where the draft model is an early exit of the larger model, with two advantages: (a) the draft and verifier models share KV cache data for the early layers, and (b) it avoids the outdated KV cache problem normally caused by early exiting.)
  • Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Master's Thesis, Data Science and Machine Learning, University College London, https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
  • Yijin Liu, Fandong Meng, Jie Zhou, 10 Apr 2024, Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy, https://arxiv.org/abs/2404.06954 Code: https://github.com/Adaxry/Unified_Layer_Skipping (Layer skipping with choosing globally which layers to skip in an orderly way for all tokens based on speedup required. All tokens skip the exact same layers, which avoids the problem with out-of-date KV caches.)
  • Jordan Dotzel, Yash Akhauri, Ahmed S. AbouElhamayed, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang, 7 Apr 2024, Radial Networks: Dynamic Layer Routing for High-Performance Large Language Models, https://arxiv.org/abs/2404.04900 (Token-specific layer routing is similar to layer skipping and dynamic depth pruning.)
  • Longwei Zou, Qingyang Wang, Han Zhao, Jiangang Kong, Yi Yang, Yangdong Deng, 10 Apr 2024, CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers, https://arxiv.org/abs/2404.06709 (Similar to layer skipping or layer fusion, but concurrently calculates some layers that seem to be less important, rather than running the layers sequentially.)
  • Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng, 25 Jan 2024, Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing, https://arxiv.org/abs/2312.14472 (Dynamic routing based on easy vs hard queries to optimize training.)
  • Yunqi Zhu, Xuebing Yang, Yuanyuan Wu, Wensheng Zhang, 22 Mar 2024, Hierarchical Skip Decoding for Efficient Autoregressive Text Generation, https://arxiv.org/abs/2403.14919 (A new decoding algorithm called Hierarchical Skip Decoding involving layer skipping.)
  • Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui, 26 Nov 2023, Learning to Skip for Language Modeling, https://arxiv.org/abs/2311.15436 (Generalizes token-based early exiting to skip entire layers.)
  • Haoyu Wang, Yaqing Wang, Tianci Liu, Tuo Zhao, and Jing Gao, 2023, HadSkip: Homotopic and Adaptive Layer Skipping of Pre-trained Language Models for Efficient Inference https://aclanthology.org/2023.findings-emnlp.283.pdf (Layer skipping during fine-tuning.)
  • Rafael Fão de Moura, Paulo C Santos, João Paulo C de Lima, Marco AZ Alves, Antonio CS Beck, and Luigi Carro. 2019. Skipping CNN convolutions through efficient memoization. In International Conference on Embedded Computer Systems. Springer, 65–76. https://link.springer.com/chapter/10.1007/978-3-030-27562-4_5
  • J. Jin, A. Dundar, and E. Culurciello. 2014, Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, https://arxiv.org/abs/1412.5474
  • David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-token layer skipping for a type of adaptive inference with conditional computation.)
  • Xiaodong Chen, Yuxuan Hu, Jing Zhang, 28 Mar 2024, Compressing Large Language Models by Streamlining the Unimportant Layer, https://arxiv.org/abs/2403.19135 (Finds the less important layers and either prunes them or replaces them with a faster approximate layer.)
  • Haoyi Wu, Kewei Tu, 4 Jun 2024 (v2), Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV
  • Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. CVPR, pages 8817–8826, 2018. https://arxiv.org/abs/1711.08393 Code: https://github.com/Tushar-N/blockdrop
  • Ofir Press, Noah A. Smith, Omer Levy, Apr 2020, Improving Transformer Models by Reordering their Sublayers, https://arxiv.org/abs/1911.03864
  • J Park, DY Kim, YH Moon, 2022, Lazy Net: Lazy Entry Neural Networks for Accelerated and Efficient Inference 2022 13th International Conference on Information and Communication Technology Convergence (ICTC), https://ieeexplore.ieee.org/abstract/document/9953031
  • David Spuler, March 2024, Chapter 47. Early Exit and Layer Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Zejian Liu, Fanrong Li, Gang Li, and Jian Cheng. 2021. EBERT: Efficient BERT Inference with Dynamic Structured Pruning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 4814–4823. https://aclanthology.org/2021.findings-acl.425/ PDF: https://aclanthology.org/2021.findings-acl.425.pdf
  • Tolga Bolukbasi, Joseph Wang, Ofer Dekel, Venkatesh Saligrama, 2017, Adaptive Neural Networks for Efficient Inference, Proceedings of the 34th International Conference on Machine Learning, PMLR 70:527-536, 2017. http://proceedings.mlr.press/v70/bolukbasi17a.html http://proceedings.mlr.press/v70/bolukbasi17a/bolukbasi17a.pdf
  • Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro, 19 Jun 2024 (v2), TroL: Traversal of Layers for Large Language and Vision Models, https://arxiv.org/abs/2406.12246 https://arxiv.org/pdf/2406.12246 (Re-traverses some of the layers, achieving higher accuracy from the same size model without using more memory.)
  • Vedang Lad, Wes Gurnee, Max Tegmark, 27 Jun 2024, The Remarkable Robustness of LLMs: Stages of Inference, https://arxiv.org/abs/2406.19384 (Deleting and swapping adjacent model layers. Hypothesizes that the first layer is effectively detokenization, the early layers focus on "features", the middle layers focus on "ensemble predictions" and the latter layers "sharpen" or finalize, with a lot of suppression happening near the end.)
  • Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Zhipeng Di, Weiyi Lu, Jiawei Chen, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang, 2 Jul 2024, SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules, https://arxiv.org/abs/2407.02031 (Efficient diffusion model serving with multi-LoRA, ControlNets, and other multi-module add-ons, including parallel execution of add-ons and faster LoRA loading via quicker "patching" of model weights, such as running some layers in parallel without LoRA weights while the LoRA adapters load.)
  • Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, Tianlong Chen, 3 Jul 2024, DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs. https://arxiv.org/abs/2407.11030
  • H Wang, 2024, Minimalism Yields Maximum Results: Deep Learning with Limited Resource, Ph.D. Thesis, Purdue University, PDF: https://hammer.purdue.edu/articles/thesis/Minimalism_Yields_Maximum_Results_Deep_Learning_with_Limited_Resource/26349415/1/files/47855029.pdf
  • Alessio Devoto, Federico Alvetreti, Jary Pomponi, Paolo Di Lorenzo, Pasquale Minervini, Simone Scardapane, 16 Aug 2024, Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning, https://arxiv.org/abs/2408.08670 (Faster fine-tuning by selecting layers, freezing layers, or slimming them to fewer fine-tuned parameters.)
  • Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang, 9 Jul 2024 (v3), Not All Layers of LLMs Are Necessary During Inference, https://arxiv.org/abs/2403.02181
  • Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim, 19 Jul 2024 (v5), SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks, https://arxiv.org/abs/2402.09025 https://github.com/jiwonsong-dev/SLEB
  • Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 16 Aug 2024 (v2), DyCE: Dynamically Configurable Exiting for Deep Learning Compression and Real-time Scaling, https://arxiv.org/abs/2403.01695
  • J. Li, Q. Li and P. Wang, 2024, From Static to Dynamic: A Deeper, Faster, and Adaptive Language Modeling Approach, 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024, pp. 1-8, doi: 10.1109/IJCNN60899.2024.10650050, https://ieeexplore.ieee.org/abstract/document/10650050 (Uses a preliminary "estimator module" to decide which layers to use.)
  • Wang, Z., Han, J. (2024). Improve Shallow Decoder Based Transformer with Structured Expert Prediction. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15022. Springer, Cham. https://doi.org/10.1007/978-3-031-72350-6_15 https://link.springer.com/chapter/10.1007/978-3-031-72350-6_15
  • Yejin Lee, Anna Sun, Basil Hosmer, Bilge Acun, Can Balioglu, Changhan Wang, Charles David Hernandez, Christian Puhrsch, Daniel Haziza, Driss Guessous, Francisco Massa, Jacob Kahn, Jeffrey Wan, Jeremy Reizenstein, Jiaqi Zhai, Joe Isaacson, Joel Schlosser, Juan Pino, Kaushik Ram Sadagopan, Leonid Shamis, Linjian Ma, Min-Jae Hwang, Mingda Chen, Mostafa Elhoushi, Pedro Rodriguez, Ram Pasunuru, Scott Yih, Sravya Popuri, Xing Liu, Carole-Jean Wu, 30 Sep 2024, Characterizing and Efficiently Accelerating Multimodal Generation Model Inference, https://arxiv.org/abs/2410.00215 (Analyzes inference bottlenecks, finding the usual problems of autoregression, along with less obvious issues such as expensive linear kernels and KV cache reordering as a bottleneck in beam search; layer skipping is also analyzed.)
  • Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev, 1 Oct 2024, Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity, https://arxiv.org/abs/2410.01028 (Self-speculative decoding that removes layers based on cosine similarity.)
  • Xia, Wenhan, Sep 2024, Methods for Efficient and Scalable Deep Learning, Ph.D. Thesis, Electrical and Computer Engineering Department, Princeton University, http://arks.princeton.edu/ark:/88435/dsp015q47rs12x (Covers PEFT/LoRA on training, and dual pruning with layer skipping and channel/width pruning for inference.)
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li, 9 Oct 2024, SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration, https://arxiv.org/abs/2410.06916 (Self-speculative decoding using layer skipping, rather than early exit.)
  • Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal, 16 Oct 2024, FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction, https://arxiv.org/abs/2410.12513
  • Theodore Glavas, Joud Chataoui, Florence Regol, Wassim Jabbour, Antonios Valkanas, Boris N. Oreshkin, Mark Coates, 26 Oct 2024, Dynamic layer selection in decoder-only transformers, https://arxiv.org/abs/2410.20022
  • Y Zhou, C Zhou, W Xie, X Wang, J Chen, Z Ni, J Li, 2024, The Benefits in Shallow: Merge Decoding Across Large Language Model Layers. In: Wong, D.F., Wei, Z., Yang, M. (eds) Natural Language Processing and Chinese Computing. NLPCC 2024. Lecture Notes in Computer Science, vol 15360. Springer, Singapore. https://doi.org/10.1007/978-981-97-9434-8_30 https://link.springer.com/chapter/10.1007/978-981-97-9434-8_30
  • Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680
  • Guanjie Chen, Xinyu Zhao, Yucheng Zhou, Tianlong Chen, Yu Cheng, 27 Nov 2024 (v2), Accelerating Vision Diffusion Transformers with Skip Branches, https://arxiv.org/abs/2411.17616 https://github.com/OpenSparseLLMs/Skip-DiT.git
  • Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, Fan Wu, 4 Jan 2025, AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference, https://arxiv.org/abs/2501.02336 (Optimally skipping sublayer components in FFN and attention during prefill and decoding phases.)
  • Jaehoon Heo, Adiwena Putra, Jieon Yoon, Sungwoong Yune, Hangyeol Lee, Ji-Hoon Kim, Joo-Young Kim, 10 Jan 2025, EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models, https://arxiv.org/abs/2501.05680
  • Boyao Wang, Rui Pan, Shizhe Diao, Xingyuan Pan, Jipeng Zhang, Renjie Pi, Tong Zhang, 5 Feb 2025, Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training, https://arxiv.org/abs/2502.03460
  • Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang, 19 Feb 2025, Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, https://arxiv.org/abs/2502.13842
  • Benyamin Jamialahmadi, Parsa Kavehzadeh, Mehdi Rezagholizadeh, Parsa Farinneya, Hossein Rajabzadeh, Aref Jafari, Boxing Chen, Marzieh S.Tahaei, 10 Mar 2025 (v2), Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models, https://arxiv.org/abs/2503.05005 (A form of PEFT where additional tokens are added at some layer points.)
  • Siqi Fan, Xuezhi Fang, Xingrun Xing, Peng Han, Shuo Shang, Yequan Wang, 1 Mar 2025, Position-Aware Depth Decay Decoding (D3): Boosting Large Language Model Inference Efficiency, https://arxiv.org/abs/2503.08524
  • Ning Yang, Fangxin Liu, Junjie Wang, Tao Yang, Kan Liu, Haibing Guan, Li Jiang, 23 May 2025, DASH: Input-Aware Dynamic Layer Skipping for Efficient LLM Inference with Markov Decision Policies, https://arxiv.org/abs/2505.17420
  • Ben Dickson, July 22, 2025, Mixture-of-recursions delivers 2x faster inference—Here’s how to implement it, https://venturebeat.com/ai/mixture-of-recursions-delivers-2x-faster-inference-heres-how-to-implement-it/
  • Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun, 21 Jul 2025 (v2), Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation, https://www.arxiv.org/abs/2507.10524 (MoR is an adaptive layer fusion or layer reuse method that repeatedly applies a shared stack of layers up to a per-token recursion depth, combined with related optimizations to KV cache management.)

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++ programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research Topics

Read more about: