Aussie AI

Normalization Optimizations

  • Last Updated 8 August, 2025
  • by David Spuler, Ph.D.

Research has suggested various ways to speed up the normalization component. Examples of normalization improvements include:

  • Normalization alternatives
  • Normalization approximations
  • Removing normalization ("norm pruning")
  • Placement of normalization blocks (i.e. "pre-norm" vs "post-norm")
  • Fused normalization (see kernel operator fusion)

Normalization Implementation and Optimization

Normalization functions are not usually as significant as MatMul in terms of time cost, but they can still be a non-trivial overhead. A typical normalization requires multiple scans over all of the elements of the output vector, and this is done several times per token throughout each inference phase.

Example: BatchNorm in C++: The batch normalization operation involves scanning the full vector, modifying each element so that it is re-centered to a zero mean and re-scaled to a standard magnitude. A naive, non-optimized C++ version of BatchNorm looks like this:

    void yapi_vector_batch_normalize_basic(    // Basic normalization (BatchNorm)
	float v[], int n, 
	float epsilon, // Smoothing term -- usually 1e-5 (0.00001)
	float lambda, // Scaling term hyper-parameter (multiplication)
	float beta    // Bias/shift term hyper-parameter (addition)
    ) 
    {
	float fmean = yapi_vector_mean(v, n);  // Calculate "mean" (aka average)
	float variance = yapi_vector_variance_of_mean(v, n, fmean);  // Variance (sum-of-diffs-squared)
	
	float denom = sqrtf(variance + epsilon);  // like std. deviation, but smoothed by epsilon
	for (int i = 0; i < n; i++) {
		v[i] = (v[i] - fmean) / denom; // Normalize all elements to re-center and scale
	}
	yapi_vector_multiply_scalar(v, n, lambda);  // Scale all values by lambda hyper-param
	yapi_vector_add_scalar(v, n, beta);  // Add beta hyper-param to all values 
    }

This version is very inefficient, making literally five scans of the entire vector. Loop fusion can obviously improve this, with the loops doing the multiplication by lambda and the addition of beta merged into the prior for loop. Another optimization is to replace the division by "denom" with a multiplication by its reciprocal, since division is often an order of magnitude slower than multiplication.
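
For illustration, here is a sketch of that fused version, assuming the same yapi_vector_mean and yapi_vector_variance_of_mean helpers as above (the function name is hypothetical, not the original library version):

    void yapi_vector_batch_normalize_fused(   // Hypothetical fused sketch
	float v[], int n,
	float epsilon, float lambda, float beta)
    {
	float fmean = yapi_vector_mean(v, n);                        // Mean (scan 1)
	float variance = yapi_vector_variance_of_mean(v, n, fmean);  // Variance (scan 2)
	float recip = 1.0f / sqrtf(variance + epsilon);  // Reciprocal computed once, outside the loop
	for (int i = 0; i < n; i++) {
		// Fused loop (scan 3): normalize, scale by lambda, and shift by beta in a single pass
		v[i] = (v[i] - fmean) * recip * lambda + beta;
	}
    }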

Further optimizations become clear once we notice that each element of the vector has four operations applied to it: subtracting the mean, dividing by the denominator, multiplying by lambda, and adding beta. We can use a loop fission optimization to split the first two operations into separate loops, since simpler operations are often faster with hardware acceleration. Then we notice that division and multiplication are two versions of the same operation, so we can use the loop fusion technique to merge the division-by-denom and the multiply-by-lambda into a single multiplication by a combined scaling factor. These changes yield faster C++ code that has one less loop than the fully split version, and that calls only atomic vector operations (easier to hardware accelerate):

	yapi_vector_add_scalar(v, n, -fmean);  // Subtract the mean
	float scalef = lambda / denom;  // Combined scale factor
	yapi_vector_multiply_scalar(v, n, scalef);  // Scale by both denom and lambda 
	yapi_vector_add_scalar(v, n, beta);  // Add beta hyper-param to all values 

Another way to optimize this code is simply to remove the lambda and beta parameters. Choosing lambda=1 and beta=0 means that the last two loops (scalar multiplication and scalar addition) can be avoided. However, once lambda is folded into the combined scale factor in the merged code above, there is little extra benefit in removing it, although the add-beta loop can still be removed. In any case, whether these parameters can be removed is not a speed decision; it depends on whether these two learned parameters are important to the overall model's capability. Note that there is also little value in trying to remove epsilon, as it is only used once in total.
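
For example, with beta removed (beta = 0), the merged version above loses its final loop, leaving just two vector scans after the mean and variance are computed (this is a sketch reusing the same vector helpers and variables as the code above):

	yapi_vector_add_scalar(v, n, -fmean);       // Subtract the mean
	float scalef = lambda / denom;              // Combined scale factor (lambda may simply be 1.0)
	yapi_vector_multiply_scalar(v, n, scalef);  // Scale by the combined factor; no add-beta loop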

Research on Optimizing Normalization: Research papers on fast versions of normalization functions:

Approximating Normalization

Research on approximate normalization functions:

LayerNorm Approximation

Research papers on LayerNorm approximations:

  • Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, Yongpan Liu, Nov 2023, SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference, 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), https://ieeexplore.ieee.org/abstract/document/10323725
  • S Peng, F Yang, N Sun, S Chen, Y Jiang, A Pan, Oct 2023, Exploring Post-Training Quantization of Protein Language Models, arXiv preprint arXiv:2310.19624, https://arxiv.org/abs/2310.19624
  • Y Liang, Z Wang, X Xu, Y Tang, Z Jie, J Lu, Oct 2023, MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory, arXiv preprint arXiv:2310.16898, https://arxiv.org/pdf/2310.16898.pdf
  • M Huang, J Luo, C Ding, Z Wei, S Huang, H Yu, Oct 2023, An Integer-Only and Group-Vector Systolic Accelerator for Efficiently Mapping Vision Transformer on Edge, IEEE Transactions on Circuits and Systems I: Regular Papers ( Early Access ), https://ieeexplore.ieee.org/abstract/document/10288182/
  • Y Wu, Z Wang, WD Lu, Oct 2023, PIM-GPT: A Hybrid Process-in-Memory Accelerator for Autoregressive Transformers https://arxiv.org/pdf/2310.09385.pdf
  • Wenjie Li, Dongxu Lyu, Gang Wang, Aokun Hu, Ningyi Xu, Guanghui He, October 2024, Hardware-oriented algorithms for softmax and layer normalization of large language models, Science China, Vol. 67, Iss. 10, 200404:1–200404:15, https://doi.org/10.1007/s11432-024-4137-4 http://scis.scichina.com/en/2024/200404.pdf
  • W. Wang, W. Sun and Y. Liu, "Improving Transformer Inference Through Optimized Non-Linear Operations With Quantization-Approximation-Based Strategy," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3488572. https://ieeexplore.ieee.org/abstract/document/10738457
  • ChangMin Ye, Yonguk Sim, Youngchae Kim, SeongMin Jin, Doo Seok Jeong, 6 Dec 2024, IterNorm: Fast Iterative Normalization, https://arxiv.org/abs/2412.04778
  • Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu, 13 Mar 2025, Transformers without Normalization, https://arxiv.org/abs/2503.10622 (Using a tanh variant to avoid normalization such as LayerNorm.)

Integer-Only Normalization

One approximation for normalization is to use integer-only arithmetic (see also the overview of integers in inference). Research on integer-only normalization algorithms (a rough sketch of the general idea follows this list):

  • Y. Lin, Y. Li, T. Liu, T. Xiao, T. Liu, and J. Zhu, “Towards fully 8-bit integer inference for the transformer model,” in Proc. of IJCAI-PRICAI 2020, pp. 3759–3765. https://arxiv.org/abs/2009.08034 (Integers for weights, but also for Softmax, layer normalization, and other components, by replacing or approximating non-linear functions such as exponential and square-root.)
  • Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680 (Integer-only quantized weights and activations with INT4 or INT8, but also uses integers for batch normalization and residual connection components, too.)
  • A. Rock, A. Untether, O. Khalil, O. Shai, and P. Grouchy, 2022, INT8 Transformers for Inference Acceleration, 36th Conference on Neural Information Processing Systems (NeurIPS), PDF: https://neurips2022-enlsp.github.io/papers/paper_52.pdf
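
The general flavor of these integer-only approaches can be sketched as follows. This is not taken from any of the cited papers: the function names, the Newton-iteration integer square root, and the fixed-point shift are illustrative assumptions only.

    #include <stdint.h>

    static uint32_t isqrt_newton(uint64_t x)   // Integer square root via Newton's method
    {
        if (x == 0) return 0;
        uint64_t r = x, prev;
        do {
            prev = r;
            r = (r + x / r) / 2;   // Newton iteration, integer arithmetic only
        } while (r < prev);
        return (uint32_t)prev;     // floor(sqrt(x))
    }

    void int_only_normalize_sketch(int32_t v[], int n, int shift /* fixed-point bits, e.g. 8 */)
    {
        int64_t sum = 0;
        for (int i = 0; i < n; i++) sum += v[i];
        int32_t mean = (int32_t)(sum / n);           // Integer mean

        int64_t sumsq = 0;
        for (int i = 0; i < n; i++) {
            int64_t d = (int64_t)v[i] - mean;
            sumsq += d * d;                          // Sum of squared differences
        }
        uint32_t stddev = isqrt_newton((uint64_t)(sumsq / n) + 1u);  // The "+1" plays the role of epsilon

        for (int i = 0; i < n; i++) {
            int64_t scaled = ((int64_t)v[i] - mean) * (int64_t)(1 << shift);  // Fixed-point scaling
            v[i] = (int32_t)(scaled / (int64_t)stddev);   // Normalized value with 'shift' fractional bits
        }
    }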

Layer Normalization Placement Reordering (Pre-Norm/Post-Norm)

The original 2017 vanilla Transformer architecture (Vaswani et al, 2017) had a "post-norm" architecture. Subsequently, researchers found that switching to a "pre-norm" architecture, instead of post-norm, could fix one of the major problems with the original Transformer, namely that it was initially unstable in training, requiring a "warmup" phase. Pre-norm was found to stabilize early training and remove the need for any special handling in the warmup.

Since then, several researchers have explored where to place the layer normalization submodule. The general consensus seems to be that placing it before the computations ("pre-norm") is better than after the calculations ("post-norm"). However, there are papers going either way, so there is still room for definitive research.
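
The structural difference is only in the ordering of the normalization relative to the residual connection. The C++ sketch below is purely illustrative (the Vec type and the sublayer and layer_norm placeholders are assumptions, not a real Transformer implementation):

    #include <cstddef>
    #include <functional>
    #include <vector>

    using Vec = std::vector<float>;
    using Fn = std::function<Vec(const Vec&)>;   // e.g., attention or FFN sublayer, or LayerNorm

    static Vec add(const Vec &a, const Vec &b)   // Residual connection: elementwise sum
    {
        Vec out(a.size());
        for (std::size_t i = 0; i < a.size(); i++) out[i] = a[i] + b[i];
        return out;
    }

    // Post-norm (original 2017 Transformer): normalize AFTER the residual addition.
    Vec post_norm_block(const Vec &x, const Fn &sublayer, const Fn &layer_norm)
    {
        return layer_norm(add(x, sublayer(x)));
    }

    // Pre-norm: normalize the sublayer INPUT; the residual path remains an identity.
    Vec pre_norm_block(const Vec &x, const Fn &sublayer, const Fn &layer_norm)
    {
        return add(x, sublayer(layer_norm(x)));
    }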

  • He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016, https://arxiv.org/abs/1603.05027 Code: https://github.com/KaimingHe/resnet-1k-layers (Only uses layer normalization on the input streams.)
  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I., Language models are unsupervised multitask learners. 2019, PDF: https://cs.brown.edu/courses/cs146/assets/papers/language_models_are_unsupervised_multitask_learners.pdf (Layer normalization is moved to layer inputs.)
  • Baevski, A. and Auli, M., Adaptive input representations for neural language modeling. Int. Conf. Learn. Represent., 2019. https://arxiv.org/abs/1809.10853 (Has layer normalization before the self-attention and FFN blocks.)
  • Emilio Parisotto, H. Francis Song, Jack W. Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant M. Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, Matthew M. Botvinick, Nicolas Heess, Raia Hadsell, 2020, Stabilizing Transformers for Reinforcement Learning, https://arxiv.org/abs/1910.06764, PDF: http://proceedings.mlr.press/v119/parisotto20a/parisotto20a.pdf (Has normalization at the inputs of layers.)
  • Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models that vary "normalization placement", i.e. as "pre-LN" or "post-LN". Also examines various alternatives and substitutes for normalization.)
  • Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Found evidence that pre-norm was better than post-norm.)
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of HLT-NAACL. Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423, https://arxiv.org/abs/1810.04805 (Post-norm.)
  • Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. 2020. Understanding the Difficulty of Training Transformers. In Proceedings of EMNLP. 5747–5763. https://doi.org/10.18653/v1/2020.emnlp-main.463, https://arxiv.org/abs/2004.08249 (Post-norm)
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proceedings of NeurIPS. 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.htm, https://arxiv.org/abs/1706.03762 (Post-norm was used in the original 2017 Transformer paper.)
  • Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for Neural Machine Translation. In Proceedings of AMTA. 193–199. https://www.aclweb.org/anthology/W18-1819, PDF: https://aclanthology.org/W18-1819.pdf, Code: https://github.com/tensorflow/tensor2tensor (Reference implementation of the vanilla Transformer; the original paper used post-norm, but this implementation is pre-norm.)
  • Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019. Learning Deep Transformer Models for Machine Translation. In Proceedings of ACL. 1810–1822. https://doi.org/10.18653/v1/p19-1176, https://arxiv.org/abs/1906.01787, Code: https://github.com/wangqiangneu/dlcl (Researching pre-norm vs post-norm for Transformers.)
  • Tobias Domhan. 2018. How much attention do you need? a granular analysis of neural machine translation architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1799–1808. PDF: https://aclanthology.org/P18-1167.pdf (Uses a pre-norm architecture, based on Tensor2Tensor.)
  • Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proceedings of ACL. 67–72. https://www.aclweb.org/anthology/P17-4012, https://arxiv.org/abs/1701.02810 (Pre-norm architecture.)
  • Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. arXiv:1904.10509 [cs.LG] https://arxiv.org/abs/1904.10509 (Pre-norm)
  • Alexei Baevski and Michael Auli. 2019. Adaptive Input Representations for Neural Language Modeling. In Proceedings of ICLR. https://openreview.net/forum?id=ByxZX20qFQ, https://arxiv.org/abs/1809.10853 (Uses pre-norm with normalization at the inputs.)
  • Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf (Tested both standard pre-norm and RMSNorm architectures.)
  • David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le, Jan 2022, Primer: Searching for Efficient Transformers for Language Modeling, https://arxiv.org/abs/2109.08668
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. OpenAI, 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  • An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
  • Seungrok Jung., 15, Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
  • Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu, 2020, On Layer Normalization in the Transformer Architecture. arXiv:2002.04745 [cs, stat], June 2020. http://arxiv.org/abs/2002.04745 (Pre-norm versus post-norm.)
  • Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han, 2020, Understanding the Difficulty of Training Transformers. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5747–5763, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.463. https://aclanthology.org/2020.emnlp-main.463 (Pre-norm versus post-norm.)
  • Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli, 19 Dec 2024 (v2), Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, https://arxiv.org/abs/2412.13663 (Encoder-only BERT model updated with modern optimizations including Flash Attention, bias removal, RoPE, pre-norm, GeGLU (a GELU variant), hybrid local-global attention, and zero padding removal.)
  • Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
  • Jeonghoon Kim, Byeongchan Lee, Cheonbok Park, Yeontaek Oh, Beomjun Kim, Taehwan Yoo, Seongjin Shin, Dongyoon Han, Jinwoo Shin, Kang Min Yoo, 4 Feb 2025, Peri-LN: Revisiting Layer Normalization in the Transformer Architecture, https://arxiv.org/abs/2502.02732 (Instead of pre-norm or post-norm, analyzes "layer normalization (LN) peripherally around sublayers").
  • Sebastian Raschka, Jul 19, 2025, The Big LLM Architecture Comparison: From DeepSeek-V3 to Kimi K2: A Look At Modern LLM Architecture Design, https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison

Research on Normalization Alternatives and Optimization

Research papers on different types of normalization or other alternatives:

General Research on Normalization

Research papers on normalization issues in general:

  • Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
  • Biao Zhang, Rico Sennrich, 2019, Root Mean Square Layer Normalization, https://arxiv.org/abs/1910.07467
  • Shaked Brody, Uri Alon, Eran Yahav, May 2023, On the Expressivity Role of LayerNorm in Transformers' Attention, https://arxiv.org/abs/2305.02582 Code: https://github.com/tech-srl/layer_norm_expressivity_role
  • Toan Q. Nguyen, Julian Salazar, 2019, Transformers without Tears: Improving the Normalization of Self-Attention, https://arxiv.org/abs/1910.05895
  • S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, ser. JMLR Workshop and Conference Proceedings, F. R. Bach and D. M. Blei, Eds., vol. 37. JMLR.org, 2015, pp. 448–456. [Online]. Available: http://proceedings.mlr.press/v37/ioffe15.html
  • David Spuler, March 2024, Chapter 24. Normalization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei, 1 Mar 2022, DeepNet: Scaling Transformers to 1,000 Layers, https://arxiv.org/abs/2203.00555 https://arxiv.org/pdf/2203.00555.pdf (New normalization function DeepNorm)
  • kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
  • Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)

RMSNorm Research

RMSNorm is based on the Root Mean Square (RMS) calculation: it re-scales the vector by the reciprocal of its root-mean-square, skipping the mean subtraction used in LayerNorm.
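
A minimal sketch of the basic computation, written in the same style as the BatchNorm example above, is shown here; the function name and the learned "gamma" weights are illustrative assumptions, not from any specific implementation:

    #include <math.h>

    void rmsnorm_basic_sketch(float v[], int n, const float gamma[], float epsilon)
    {
        float sum_squares = 0.0f;
        for (int i = 0; i < n; i++) sum_squares += v[i] * v[i];
        float rms = sqrtf(sum_squares / (float)n + epsilon);   // Root-mean-square, smoothed by epsilon
        float recip = 1.0f / rms;                              // Reciprocal avoids n divisions
        for (int i = 0; i < n; i++) {
            v[i] = v[i] * recip * gamma[i];   // Scale each element by 1/RMS and a learned weight
        }
    }

Research papers on RMSNorm include: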

LayerNorm Optimizations

Research papers on optimizing LayerNorm:
