Aussie AI

Normalization Pruning

  • Last Updated 2 August, 2025
  • by David Spuler, Ph.D.


Some research has suggested that the "normalization" layers in Transformers (e.g., LayerNorm) can be pruned without a major loss in model accuracy. This is similar to other Transformer optimization techniques involving architectural changes, such as FFN pruning and shallow decoder architectures.
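
To make the idea concrete, here is a minimal C++ sketch (not taken from any of the papers cited below) of a pre-norm residual sublayer in which the LayerNorm step can simply be skipped. The sublayer body is a placeholder and all names are hypothetical; a real engine would prune normalization for specific layers and validate the result against an accuracy benchmark.

    // Minimal sketch of "normalization pruning": a pre-norm residual sublayer
    // where the LayerNorm step can be switched off. The sublayer body is a
    // placeholder and all names here are hypothetical.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    using Vec = std::vector<float>;

    // Basic LayerNorm over one vector (gamma=1, beta=0 for brevity).
    Vec layer_norm(const Vec& x, float eps = 1e-5f) {
        float mean = 0.0f;
        for (float v : x) mean += v;
        mean /= x.size();
        float var = 0.0f;
        for (float v : x) var += (v - mean) * (v - mean);
        var /= x.size();
        Vec out(x.size());
        for (size_t i = 0; i < x.size(); ++i)
            out[i] = (x[i] - mean) / std::sqrt(var + eps);
        return out;
    }

    // Stand-in for an attention or FFN sublayer (identity, for illustration only).
    Vec sublayer(const Vec& x) { return x; }

    // Pre-norm residual block: y = x + Sublayer(Norm(x)).
    // With prune_norm=true the normalization is skipped: y = x + Sublayer(x).
    Vec residual_block(const Vec& x, bool prune_norm) {
        Vec h = prune_norm ? sublayer(x) : sublayer(layer_norm(x));
        Vec y(x.size());
        for (size_t i = 0; i < x.size(); ++i) y[i] = x[i] + h[i];
        return y;
    }

    int main() {
        Vec x = {1.0f, 2.0f, 3.0f, 4.0f};
        Vec with_norm = residual_block(x, /*prune_norm=*/false);
        Vec pruned = residual_block(x, /*prune_norm=*/true);
        std::printf("with norm: %.3f  pruned: %.3f\n", with_norm[0], pruned[0]);
        return 0;
    }

In practice, the pruning decision is made per layer (or per sublayer) rather than globally, since some normalization components matter more than others for model accuracy.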

A number of research papers have investigated the need for the "normalization" components in Transformers, and whether they can be removed or replaced.

  • Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. The best of both worlds: Combining recent advances in neural machine translation. In ACL, 2018, https://arxiv.org/abs/1804.09849
  • Ma, J. and Yarats, D. On the adequacy of untuned warmup for adaptive optimization. arXiv:1910.04209, 2019. https://arxiv.org/abs/1910.04209
  • Hongyi Zhang, Yann N. Dauphin, Tengyu Ma, Fixup Initialization: Residual Learning Without Normalization, Mar 2019, https://arxiv.org/abs/1901.09321
  • Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019. Augmenting Self-attention with Persistent Memory. arXiv:1907.01470 https://arxiv.org/abs/1907.01470
  • Xiao Shi Huang, Felipe Perez, Jimmy Ba, and Maksims Volkovs. 2020. Improving transformer optimization through better initialization. In Proc. Int. Conf. on Machine Learning (ICML), pages 4475-4483, https://proceedings.mlr.press/v119/huang20f.html Code: https://github.com/layer6ai-labs/T-Fixup
  • Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D., and Chao, L. Learning deep transformer models for machine translation. In ACL, 2019. https://arxiv.org/abs/1906.01787
  • Nguyen, T. and Salazar, J., Transformers without tears: Improving the normalization of self-attention. In arXiv:1910.05895, 2019. https://arxiv.org/abs/1910.05895
  • Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. In ICLR, 2020 https://arxiv.org/abs/1908.03265 Code: https://github.com/LiyuanLucasLiu/RAdam
  • Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 2022. https://arxiv.org/abs/2106.04554 (Survey paper with some analysis of the needs for various Transformer components including normalization; see "Section 5.2.3 Normalization-free Transformer".)
  • Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian J. McAuley. 2020. ReZero is All You Need: Fast Convergence at Large Depth. CoRR abs/2003.04887 (2020). arXiv:2003.04887 https://arxiv.org/abs/2003.04887
  • Sharath Nittur Sridhar, Anthony Sarah, and Sairam Sundaresan. TrimBERT: Tailoring BERT for Trade-offs. arXiv:2202.12411 [cs], February 2022. http://arxiv.org/abs/2202.12411 (Optimizations include softmax replacement and removing half of all LayerNorms.)
  • David Spuler, March 2024, Chapter 24. Normalization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Bobby He, Thomas Hofmann, 31 May 2024 (v2), Simplifying Transformer Blocks, https://arxiv.org/abs/2311.01906 (Examines the removal of various Transformer sublayer components including skip connections, projection/value parameters, and normalization.)
  • David Spuler, March 2024, Norm Pruning, in Generative AI in C++, https://www.aussieai.com/book/ch24-norm-pruning
  • James Martens, Andy Ballard, Guillaume Desjardins, Grzegorz Swirszcz, Valentin Dalibard, Jascha Sohl-Dickstein, Samuel S. Schoenholz, 5 Oct 2021, Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping, https://arxiv.org/abs/2110.01765
  • Nandan Kumar Jha, Brandon Reagen, 12 Oct 2024, ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models, https://arxiv.org/abs/2410.09637
  • Anonymous authors, Oct 2024, Dense Attention: No-Compromise Exact All NxN Interactions Algorithm with O(N) Space and Time Complexity, https://openreview.net/pdf?id=2bIQBDSfRk
  • Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu, 14 Jun 2025 (v2), Transformers without Normalization, https://arxiv.org/abs/2503.10622 https://jiachenzhu.github.io/DyT/ (Replaces LayerNorm-style normalization with a Dynamic Tanh (DyT) elementwise function; a minimal sketch of DyT appears after this reference list.)
  • Dr. Ashish Bamania, July 2025, You Don’t Need Normalization In Transformers Anymore, https://ai.gopubby.com/you-dont-need-normalization-in-transformers-anymore-0c737e846b91 (A deep dive into the internals of Layer Normalization, and how a simple function called Dynamic Tanh (DyT) can replace them entirely in the Transformer architecture without any loss in performance.)
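
For the DyT line of work cited above, the following is a minimal C++ sketch of the Dynamic Tanh function described by Zhu et al., DyT(x) = gamma * tanh(alpha * x) + beta, with a learnable scalar alpha and learnable per-channel gamma and beta; the parameter values used here are illustrative only, not tuned settings from the paper.

    // Minimal sketch of Dynamic Tanh (DyT) as a drop-in LayerNorm replacement:
    //   DyT(x) = gamma * tanh(alpha * x) + beta
    // alpha is a learnable scalar; gamma and beta are learnable per-channel vectors.
    // Values below are illustrative only.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    std::vector<float> dyt(const std::vector<float>& x,
                           float alpha,                      // learnable scalar
                           const std::vector<float>& gamma,  // per-channel scale
                           const std::vector<float>& beta) { // per-channel shift
        std::vector<float> out(x.size());
        for (size_t i = 0; i < x.size(); ++i)
            out[i] = gamma[i] * std::tanh(alpha * x[i]) + beta[i];
        return out;
    }

    int main() {
        std::vector<float> x = {-2.0f, 0.0f, 2.0f};
        std::vector<float> gamma(3, 1.0f), beta(3, 0.0f);
        std::vector<float> y = dyt(x, /*alpha=*/0.5f, gamma, beta);
        std::printf("DyT: %.3f %.3f %.3f\n", y[0], y[1], y[2]);
        return 0;
    }

Unlike LayerNorm, this function needs no mean or variance reduction over the channel dimension, which is part of its appeal as an inference-time simplification.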

More Research on Pruning Types

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research
