Aussie AI

Normalization Pruning

  • Last Updated 2 August, 2025
  • by David Spuler, Ph.D.


Some research has suggested that the "normalization" layers in Transformers (e.g., LayerNorm) can be pruned without a major loss in model accuracy. This is similar to other Transformer optimization techniques involving architectural changes, such as FFN pruning and shallow decoder architectures.
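
To make the idea concrete, here is a minimal C++ sketch (not taken from any of the papers cited below) of a pre-norm residual sublayer in which the LayerNorm step can simply be skipped. The sublayer body is a placeholder and all names are hypothetical; a real engine would prune normalization for specific layers and validate the result against an accuracy benchmark.

    // Minimal sketch of "normalization pruning": a pre-norm residual sublayer
    // where the LayerNorm step can be switched off. The sublayer body is a
    // placeholder and all names here are hypothetical.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    using Vec = std::vector<float>;

    // Basic LayerNorm over one vector (gamma=1, beta=0 for brevity).
    Vec layer_norm(const Vec& x, float eps = 1e-5f) {
        float mean = 0.0f;
        for (float v : x) mean += v;
        mean /= x.size();
        float var = 0.0f;
        for (float v : x) var += (v - mean) * (v - mean);
        var /= x.size();
        Vec out(x.size());
        for (size_t i = 0; i < x.size(); ++i)
            out[i] = (x[i] - mean) / std::sqrt(var + eps);
        return out;
    }

    // Stand-in for an attention or FFN sublayer (identity, for illustration only).
    Vec sublayer(const Vec& x) { return x; }

    // Pre-norm residual block: y = x + Sublayer(Norm(x)).
    // With prune_norm=true the normalization is skipped: y = x + Sublayer(x).
    Vec residual_block(const Vec& x, bool prune_norm) {
        Vec h = prune_norm ? sublayer(x) : sublayer(layer_norm(x));
        Vec y(x.size());
        for (size_t i = 0; i < x.size(); ++i) y[i] = x[i] + h[i];
        return y;
    }

    int main() {
        Vec x = {1.0f, 2.0f, 3.0f, 4.0f};
        Vec with_norm = residual_block(x, /*prune_norm=*/false);
        Vec pruned = residual_block(x, /*prune_norm=*/true);
        std::printf("with norm: %.3f  pruned: %.3f\n", with_norm[0], pruned[0]);
        return 0;
    }

In practice, the pruning decision is made per layer (or per sublayer) rather than globally, since some normalization components matter more than others for model accuracy.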

A number of research papers have investigated the need for the "normalization" components in Transformers, and whether they can be removed or replaced.

  • Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. The best of both worlds: Combining recent advances in neural machine translation. In ACL, 2018, https://arxiv.org/abs/1804.09849
  • Ma, J. and Yarats, D. On the adequacy of untuned warmup for adaptive optimization. arXiv:1910.04209, 2019. https://arxiv.org/abs/1910.04209
  • Hongyi Zhang, Yann N. Dauphin, Tengyu Ma, Fixup Initialization: Residual Learning Without Normalization, Mar 2019, https://arxiv.org/abs/1901.09321
  • Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019. Augmenting Self-attention with Persistent Memory. arXiv:1907.01470 https://arxiv.org/abs/1907.01470
  • Xiao Shi Huang, Felipe Perez, Jimmy Ba, and Maksims Volkovs. 2020. Improving transformer optimization through better initialization. In Proc. Int. Conf. on Machine Learning (ICML), pages 4475-4483, https://proceedings.mlr.press/v119/huang20f.html Code: https://github.com/layer6ai-labs/T-Fixup
  • Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D., and Chao, L. Learning deep transformer models for machine translation. In ACL, 2019. https://arxiv.org/abs/1906.01787
  • Nguyen, T. and Salazar, J., Transformers without tears: Improving the normalization of self-attention. In arXiv:1910.05895, 2019. https://arxiv.org/abs/1910.05895
  • Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. In ICLR, 2020 https://arxiv.org/abs/1908.03265 Code: https://github.com/LiyuanLucasLiu/RAdam
  • Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 2022. https://arxiv.org/abs/2106.04554 (Survey paper with some analysis of the needs for various Transformer components including normalization; see "Section 5.2.3 Normalization-free Transformer".)
  • Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian J. McAuley. 2020. ReZero is All You Need: Fast Convergence at Large Depth. CoRR abs/2003.04887 (2020). arXiv:2003.04887 https://arxiv.org/abs/2003.04887
  • Sharath Nittur Sridhar, Anthony Sarah, and Sairam Sundaresan. TrimBERT: Tailoring BERT for Trade-offs. arXiv:2202.12411 [cs], February 2022. http://arxiv.org/abs/2202.12411 (Optimizations include softmax replacement and removing half of all LayerNorms.)
  • David Spuler, March 2024, Chapter 24. Normalization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Bobby He, Thomas Hofmann, 31 May 2024 (v2), Simplifying Transformer Blocks, https://arxiv.org/abs/2311.01906 (Examines the removal of various Transformer sublayer components including skip connections, projection/value parameters, and normalization.)
  • David Spuler, March 2024, Norm Pruning, in Generative AI in C++, https://www.aussieai.com/book/ch24-norm-pruning
  • James Martens, Andy Ballard, Guillaume Desjardins, Grzegorz Swirszcz, Valentin Dalibard, Jascha Sohl-Dickstein, Samuel S. Schoenholz, 5 Oct 2021, Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping, https://arxiv.org/abs/2110.01765
  • Nandan Kumar Jha, Brandon Reagen, 12 Oct 2024, ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models, https://arxiv.org/abs/2410.09637
  • Anonymous authors, Oct 2024, Dense Attention: No-Compromise Exact All NxN Interactions Algorithm with O(N) Space and Time Complexity, https://openreview.net/pdf?id=2bIQBDSfRk
  • Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu, 14 Jun 2025 (v2), Transformers without Normalization, https://arxiv.org/abs/2503.10622 https://jiachenzhu.github.io/DyT/ (Replaces LayerNorm-style normalization with a Dynamic Tanh (DyT) elementwise function; a minimal sketch of DyT appears after this reference list.)
  • Dr. Ashish Bamania, July 2025, You Don’t Need Normalization In Transformers Anymore, https://ai.gopubby.com/you-dont-need-normalization-in-transformers-anymore-0c737e846b91 (A deep dive into the internals of Layer Normalization, and how a simple function called Dynamic Tanh (DyT) can replace them entirely in the Transformer architecture without any loss in performance.)
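
For the DyT line of work cited above, the following is a minimal C++ sketch of the Dynamic Tanh function described by Zhu et al., DyT(x) = gamma * tanh(alpha * x) + beta, with a learnable scalar alpha and learnable per-channel gamma and beta; the parameter values used here are illustrative only, not tuned settings from the paper.

    // Minimal sketch of Dynamic Tanh (DyT) as a drop-in LayerNorm replacement:
    //   DyT(x) = gamma * tanh(alpha * x) + beta
    // alpha is a learnable scalar; gamma and beta are learnable per-channel vectors.
    // Values below are illustrative only.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    std::vector<float> dyt(const std::vector<float>& x,
                           float alpha,                      // learnable scalar
                           const std::vector<float>& gamma,  // per-channel scale
                           const std::vector<float>& beta) { // per-channel shift
        std::vector<float> out(x.size());
        for (size_t i = 0; i < x.size(); ++i)
            out[i] = gamma[i] * std::tanh(alpha * x[i]) + beta[i];
        return out;
    }

    int main() {
        std::vector<float> x = {-2.0f, 0.0f, 2.0f};
        std::vector<float> gamma(3, 1.0f), beta(3, 0.0f);
        std::vector<float> y = dyt(x, /*alpha=*/0.5f, gamma, beta);
        std::printf("DyT: %.3f %.3f %.3f\n", y[0], y[1], y[2]);
        return 0;
    }

Unlike LayerNorm, this function needs no mean or variance reduction over the channel dimension, which is part of its appeal as an inference-time simplification.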

More Research on Pruning Types

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research
