Aussie AI
Layer Reordering
Last Updated 21 August, 2025
by David Spuler, Ph.D.
Research on Layer Reordering
Research papers include:
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations, September 2019. https://openreview.net/forum?id=H1eA7AEtvS
- Ofir Press, Noah A. Smith, Omer Levy, Apr 2020, Improving Transformer Models by Reordering their Sublayers, https://arxiv.org/abs/1911.03864
- David Spuler, March 2024, Chapter 47. Early Exit and Layer Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro, 19 Jun 2024 (v2), TroL: Traversal of Layers for Large Language and Vision Models, https://arxiv.org/abs/2406.12246 https://arxiv.org/pdf/2406.12246 (Re-traverses some of the layers during inference, achieving higher accuracy from the same size model without additional memory; see the re-traversal sketch after this list.)
- Vedang Lad, Wes Gurnee, Max Tegmark, 27 Jun 2024, The Remarkable Robustness of LLMs: Stages of Inference, https://arxiv.org/abs/2406.19384 (Examines deleting and swapping adjacent model layers. Hypothesizes that the first layer effectively performs detokenization, the early layers build "features", the middle layers form "ensemble predictions", and the later layers "sharpen" or finalize the output, with substantial suppression near the end.)
- Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan, 9 Jul 2024, Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules, https://arxiv.org/abs/2407.06677
- Matthias Freiberger, Peter Kun, Anders Sundnes Løvlie, Sebastian Risi, 5 Jul 2024, LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order, https://arxiv.org/abs/2407.04513
- Rohit Kumar Thakur, July 2025, Google DeepMind Just Dropped a ‘Transformers Killer’ Architecture, https://ninza7.medium.com/google-deepmind-just-dropped-a-transformers-killer-architecture-c6c1d9288922 (Covers the Mixture-of-Recursions algorithm.)
- Ben Dickson, July 22, 2025, Mixture-of-recursions delivers 2x faster inference—Here’s how to implement it, https://venturebeat.com/ai/mixture-of-recursions-delivers-2x-faster-inference-heres-how-to-implement-it/
- Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun, 21 Jul 2025 (v2), Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation, https://www.arxiv.org/abs/2507.10524 (MoR is a layer reuse method that recursively re-applies a shared stack of layers with adaptive per-token recursion depths, combined with related optimizations to KV cache management; see the recursion-depth sketch after this list.)
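To make these ideas concrete, here is a minimal PyTorch-style sketch of layer reordering and re-traversal, assuming a toy layer stack; the class names and the example orderings are illustrative, not taken from the LayerShuffle or TroL code:

```python
# Minimal sketch of layer reordering / re-traversal. Class and variable names are
# illustrative, not taken from the LayerShuffle or TroL papers.
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Simplified stand-in for one transformer layer (FFN sub-layer only, no attention)."""
    def __init__(self, dim):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.ffn(self.norm(x))  # residual connection

class ReorderedStack(nn.Module):
    """Runs a fixed set of layers in an arbitrary order, possibly revisiting some."""
    def __init__(self, dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(ToyLayer(dim) for _ in range(num_layers))

    def forward(self, x, order=None):
        # 'order' is a list of layer indices; default is the usual 0..N-1 sequence.
        order = order if order is not None else range(len(self.layers))
        for idx in order:
            x = self.layers[idx](x)  # indices may repeat (layer re-traversal)
        return x

# Usage: the same weights run under three different execution orders.
stack = ReorderedStack(dim=64, num_layers=4)
x = torch.randn(2, 10, 64)                         # (batch, tokens, dim)
y_normal     = stack(x)                            # layers 0,1,2,3
y_shuffled   = stack(x, order=[2, 0, 3, 1])        # permuted order (LayerShuffle-style)
y_retraverse = stack(x, order=[0, 1, 2, 3, 2, 3])  # later layers run twice (TroL-style)
```

The point of the sketch is that reordering and re-traversal change only the execution schedule, not the parameter count, which is why these techniques can trade extra compute for accuracy without extra memory.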
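Similarly, here is a hedged sketch of Mixture-of-Recursions-style adaptive recursion depth: a single shared block is re-applied, and a simple per-token router (an assumption for illustration, not the paper's routing scheme) decides when each token stops recursing:

```python
# Minimal sketch of token-level adaptive recursion depth in the spirit of
# Mixture-of-Recursions. The shared block, router, and stopping rule are all
# simplified assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    def __init__(self, dim, max_depth=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.LayerNorm(dim))
        self.router = nn.Linear(dim, 1)  # scores whether a token needs more depth
        self.max_depth = max_depth

    def forward(self, x):
        # active[b, t] stays True while token t should keep recursing.
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_depth):
            if not active.any():
                break
            updated = x + self.shared(x)                       # one more pass through the shared block
            x = torch.where(active.unsqueeze(-1), updated, x)  # only active tokens are updated
            keep = torch.sigmoid(self.router(x)).squeeze(-1) > 0.5
            active = active & keep                             # router retires finished tokens
        return x

# Usage: each token receives between 1 and max_depth passes through the shared block.
block = RecursiveBlock(dim=64, max_depth=3)
x = torch.randn(2, 10, 64)  # (batch, tokens, dim)
y = block(x)
```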
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home