Aussie AI

Normalization Optimizations

  • Last Updated 8 August, 2025
  • by David Spuler, Ph.D.

Research has suggested various ways to speed up the normalization component. Examples of normalization improvements include:

  • Normalization alternatives
  • Normalization approximations
  • Removing normalization ("norm pruning")
  • Placement of normalization blocks (i.e. "pre-norm" vs "post-norm")
  • Fused normalization (see kernel operator fusion)

Normalization Implementation and Optimization

Normalization functions are not usually as significant as MatMul in terms of time cost, but they can still be a non-trivial overhead. A typical normalization requires multiple scans over all of the elements of the output vector, and this is done several times per token throughout each inference phase.

Example: BatchNorm in C++: The batch normalization operation involves scanning the full vector, modifying each element so that it is re-centered to a zero mean and re-scaled to a standard magnitude. A naive, non-optimized C++ version of BatchNorm looks like this:

    void yapi_vector_batch_normalize_basic(    // Basic normalization (BatchNorm)
	float v[], int n, 
	float epsilon, // Smoothing term -- usually 1e-5 (0.00001)
	float lambda, // Scaling term hyper-parameter (multiplication)
	float beta    // Bias/shift term hyper-parameter (addition)
    ) 
    {
	float fmean = yapi_vector_mean(v, n);  // Calculate "mean" (aka average)
	float variance = yapi_vector_variance_of_mean(v, n, fmean);  // Variance (sum-of-diffs-squared)
	
	float denom = sqrtf(variance + epsilon);  // like std. deviation, but smoothed by epsilon
	for (int i = 0; i < n; i++) {
		v[i] = (v[i] - fmean) / denom; // Normalize all elements to re-center and scale
	}
	yapi_vector_multiply_scalar(v, n, lambda);  // Scale all values by lambda hyper-param
	yapi_vector_add_scalar(v, n, beta);  // Add beta hyper-param to all values 
    }

This version is very inefficient, making literally five scans of the entire vector. Loop fusion can obviously improve this, with the loops doing the multiplication by lambda and the addition of beta merged into the prior for loop. Another optimization is to replace the division by "denom" with a multiplication by its reciprocal, since division is often an order of magnitude slower than multiplication.
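
For illustration, here is a sketch of that fused version, assuming the same yapi_vector_mean and yapi_vector_variance_of_mean helpers as above (the function name is hypothetical, not the original library version):

    void yapi_vector_batch_normalize_fused(   // Hypothetical fused sketch
	float v[], int n,
	float epsilon, float lambda, float beta)
    {
	float fmean = yapi_vector_mean(v, n);                        // Mean (scan 1)
	float variance = yapi_vector_variance_of_mean(v, n, fmean);  // Variance (scan 2)
	float recip = 1.0f / sqrtf(variance + epsilon);  // Reciprocal computed once, outside the loop
	for (int i = 0; i < n; i++) {
		// Fused loop (scan 3): normalize, scale by lambda, and shift by beta in a single pass
		v[i] = (v[i] - fmean) * recip * lambda + beta;
	}
    }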

Further optimizations become clear once we notice that each element of the vector has four operations applied to it: subtracting the mean, dividing by the denominator, multiplying by lambda, and adding beta. We can use a loop fission optimization to split the first two operations into separate loops, since simpler operations are often faster with hardware acceleration. Then we notice that division and multiplication are two versions of the same operation, so we can use the loop fusion technique to merge the division-by-denom and the multiply-by-lambda into a single multiplication by a combined scaling factor. These changes yield faster C++ code that has one less loop than the fully split version, and that calls only atomic vector operations (easier to hardware accelerate):

	yapi_vector_add_scalar(v, n, -fmean);  // Subtract the mean
	float scalef = lambda / denom;  // Combined scale factor
	yapi_vector_multiply_scalar(v, n, scalef);  // Scale by both denom and lambda 
	yapi_vector_add_scalar(v, n, beta);  // Add beta hyper-param to all values 

Another way to optimize this code is simply to remove the lambda and beta parameters. Choosing lambda=1 and beta=0 means that the last two loops (scalar multiplication and scalar addition) can be avoided. However, once lambda is folded into the combined scale factor in the merged code above, there is little extra benefit in removing it, although the add-beta loop can still be removed. In any case, whether these parameters can be removed is not a speed decision; it depends on whether these two learned parameters are important to the overall model's capability. Note that there is also little value in trying to remove epsilon, as it is only used once in total.
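
For example, with beta removed (beta = 0), the merged version above loses its final loop, leaving just two vector scans after the mean and variance are computed (this is a sketch reusing the same vector helpers and variables as the code above):

	yapi_vector_add_scalar(v, n, -fmean);       // Subtract the mean
	float scalef = lambda / denom;              // Combined scale factor (lambda may simply be 1.0)
	yapi_vector_multiply_scalar(v, n, scalef);  // Scale by the combined factor; no add-beta loop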

Research on Optimizing Normalization: Research papers on fast versions of normalization functions:

Approximating Normalization

Research on approximate normalization functions:

LayerNorm Approximation

Research papers on LayerNorm approximations:

  • Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, Yongpan Liu, Nov 2023, SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference, 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), https://ieeexplore.ieee.org/abstract/document/10323725
  • S Peng, F Yang, N Sun, S Chen, Y Jiang, A Pan, Oct 2023, Exploring Post-Training Quantization of Protein Language Models, arXiv preprint arXiv:2310.19624, https://arxiv.org/abs/2310.19624
  • Y Liang, Z Wang, X Xu, Y Tang, Z Jie, J Lu, Oct 2023, MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory, arXiv preprint arXiv:2310.16898, https://arxiv.org/pdf/2310.16898.pdf
  • M Huang, J Luo, C Ding, Z Wei, S Huang, H Yu, Oct 2023, An Integer-Only and Group-Vector Systolic Accelerator for Efficiently Mapping Vision Transformer on Edge, IEEE Transactions on Circuits and Systems I: Regular Papers ( Early Access ), https://ieeexplore.ieee.org/abstract/document/10288182/
  • Y Wu, Z Wang, WD Lu, Oct 2023, PIM-GPT: A Hybrid Process-in-Memory Accelerator for Autoregressive Transformers https://arxiv.org/pdf/2310.09385.pdf
  • Wenjie Li, Dongxu Lyu, Gang Wang, Aokun Hu, Ningyi Xu, Guanghui He, October 2024, Hardware-oriented algorithms for softmax and layer normalization of large language models, Science China, Vol. 67, Iss. 10, 200404:1–200404:15, https://doi.org/10.1007/s11432-024-4137-4 http://scis.scichina.com/en/2024/200404.pdf
  • W. Wang, W. Sun and Y. Liu, "Improving Transformer Inference Through Optimized Non-Linear Operations With Quantization-Approximation-Based Strategy," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3488572. https://ieeexplore.ieee.org/abstract/document/10738457
  • ChangMin Ye, Yonguk Sim, Youngchae Kim, SeongMin Jin, Doo Seok Jeong, 6 Dec 2024, IterNorm: Fast Iterative Normalization, https://arxiv.org/abs/2412.04778
  • Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu, 13 Mar 2025, Transformers without Normalization, https://arxiv.org/abs/2503.10622 (Using a tanh variant to avoid normalization such as LayerNorm.)

Integer-Only Normalization

One approximation for normalization is to use integer-only arithmetic (see also the overview of integers in inference). Research on integer-only normalization algorithms (a rough sketch of the general idea follows this list):

  • Y. Lin, Y. Li, T. Liu, T. Xiao, T. Liu, and J. Zhu, “Towards fully 8-bit integer inference for the transformer model,” in Proc. of IJCAI-PRICAI 2020, pp. 3759–3765. https://arxiv.org/abs/2009.08034 (Integers for weights, but also for Softmax, layer normalization, and other components, by replacing or approximating non-linear functions such as exponential and square-root.)
  • Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680 (Integer-only quantized weights and activations with INT4 or INT8, but also uses integers for batch normalization and residual connection components, too.)
  • A. Rock, A. Untether, O. Khalil, O. Shai, and P. Grouchy, 2022, INT8 Transformers for Inference Acceleration, 36th Conference on Neural Information Processing Systems (NeurIPS), PDF: https://neurips2022-enlsp.github.io/papers/paper_52.pdf
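
The general flavor of these integer-only approaches can be sketched as follows. This is not taken from any of the cited papers: the function names, the Newton-iteration integer square root, and the fixed-point shift are illustrative assumptions only.

    #include <stdint.h>

    static uint32_t isqrt_newton(uint64_t x)   // Integer square root via Newton's method
    {
        if (x == 0) return 0;
        uint64_t r = x, prev;
        do {
            prev = r;
            r = (r + x / r) / 2;   // Newton iteration, integer arithmetic only
        } while (r < prev);
        return (uint32_t)prev;     // floor(sqrt(x))
    }

    void int_only_normalize_sketch(int32_t v[], int n, int shift /* fixed-point bits, e.g. 8 */)
    {
        int64_t sum = 0;
        for (int i = 0; i < n; i++) sum += v[i];
        int32_t mean = (int32_t)(sum / n);           // Integer mean

        int64_t sumsq = 0;
        for (int i = 0; i < n; i++) {
            int64_t d = (int64_t)v[i] - mean;
            sumsq += d * d;                          // Sum of squared differences
        }
        uint32_t stddev = isqrt_newton((uint64_t)(sumsq / n) + 1u);  // The "+1" plays the role of epsilon

        for (int i = 0; i < n; i++) {
            int64_t scaled = ((int64_t)v[i] - mean) * (int64_t)(1 << shift);  // Fixed-point scaling
            v[i] = (int32_t)(scaled / (int64_t)stddev);   // Normalized value with 'shift' fractional bits
        }
    }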

Layer Normalization Placement Reordering (Pre-Norm/Post-Norm)

The original 2017 vanilla Transformer architecture (Vaswani et al, 2017) had a "post-norm" architecture. Subsequently, researchers found that switching to a "pre-norm" architecture, instead of post-norm, could fix one of the major problems with the original Transformer, namely that it was initially unstable in training, requiring a "warmup" phase. Pre-norm was found to stabilize early training and remove the need for any special handling in the warmup.

Since then, several researchers have explored where to place the layer normalization submodule. The general consensus seems to be that placing it before the computations ("pre-norm") is better than after the calculations ("post-norm"). However, there are papers going either way, so there is still room for definitive research.
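
The structural difference is only in the ordering of the normalization relative to the residual connection. The C++ sketch below is purely illustrative (the Vec type and the sublayer and layer_norm placeholders are assumptions, not a real Transformer implementation):

    #include <cstddef>
    #include <functional>
    #include <vector>

    using Vec = std::vector<float>;
    using Fn = std::function<Vec(const Vec&)>;   // e.g., attention or FFN sublayer, or LayerNorm

    static Vec add(const Vec &a, const Vec &b)   // Residual connection: elementwise sum
    {
        Vec out(a.size());
        for (std::size_t i = 0; i < a.size(); i++) out[i] = a[i] + b[i];
        return out;
    }

    // Post-norm (original 2017 Transformer): normalize AFTER the residual addition.
    Vec post_norm_block(const Vec &x, const Fn &sublayer, const Fn &layer_norm)
    {
        return layer_norm(add(x, sublayer(x)));
    }

    // Pre-norm: normalize the sublayer INPUT; the residual path remains an identity.
    Vec pre_norm_block(const Vec &x, const Fn &sublayer, const Fn &layer_norm)
    {
        return add(x, sublayer(layer_norm(x)));
    }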

  • He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016, https://arxiv.org/abs/1603.05027 Code: https://github.com/KaimingHe/resnet-1k-layers (Only uses layer normalization on the input streams.)
  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I., Language models are unsupervised multitask learners. 2019, PDF: https://cs.brown.edu/courses/cs146/assets/papers/language_models_are_unsupervised_multitask_learners.pdf (Layer normalization is moved to layer inputs.)
  • Baevski, A. and Auli, M., Adaptive input representations for neural language modeling. Int. Conf. Learn. Represent., 2019. https://arxiv.org/abs/1809.10853 (Has layer normalization before the self-attention and FFN blocks.)
  • Emilio Parisotto, H. Francis Song, Jack W. Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant M. Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, Matthew M. Botvinick, Nicolas Heess, Raia Hadsell, 2020, Stabilizing Transformers for Reinforcement Learning, https://arxiv.org/abs/1910.06764, PDF: http://proceedings.mlr.press/v119/parisotto20a/parisotto20a.pdf (Has normalization at the inputs of layers.)
  • Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models that vary "normalization placement", i.e. as "pre-LN" or "post-LN". Also examines various alternatives and substitutes for normalization.)
  • Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Found evidence that pre-norm was better than post-norm.)
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of HLT-NAACL. Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423, https://arxiv.org/abs/1810.04805 (Post-norm.)
  • Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. 2020. Understanding the Difficulty of Training Transformers. In Proceedings of EMNLP. 5747–5763. https://doi.org/10.18653/v1/2020.emnlp-main.463, https://arxiv.org/abs/2004.08249 (Post-norm)
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proceedings of NeurIPS. 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.htm, https://arxiv.org/abs/1706.03762 (Post-norm was used in the original 2017 Transformer paper.)
  • Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for Neural Machine Translation. In Proceedings of AMTA. 193–199. https://www.aclweb.org/anthology/W18-1819, PDF: https://aclanthology.org/W18-1819.pdf, Code: https://github.com/tensorflow/tensor2tensor (Reference implementation of the vanilla Transformer; the original paper used post-norm, but this implementation is pre-norm.)
  • Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019. Learning Deep Transformer Models for Machine Translation. In Proceedings of ACL. 1810–1822. https://doi.org/10.18653/v1/p19-1176, https://arxiv.org/abs/1906.01787, Code: https://github.com/wangqiangneu/dlcl (Researching pre-norm vs post-norm for Transformers.)
  • Tobias Domhan. 2018. How much attention do you need? a granular analysis of neural machine translation architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1799–1808. PDF: https://aclanthology.org/P18-1167.pdf (Uses a pre-norm architecture, based on Tensor2Tensor.)
  • Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proceedings of ACL. 67–72. https://www.aclweb.org/anthology/P17-4012, https://arxiv.org/abs/1701.02810 (Pre-norm architecture.)
  • Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. arXiv:1904.10509 [cs.LG] https://arxiv.org/abs/1904.10509 (Pre-norm)
  • Alexei Baevski and Michael Auli. 2019. Adaptive Input Representations for Neural Language Modeling. In Proceedings of ICLR. https://openreview.net/forum?id=ByxZX20qFQ, https://arxiv.org/abs/1809.10853 (Uses pre-norm with normalization at the inputs.)
  • Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf (Tested both standard pre-norm and RMSNorm architectures.)
  • David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le, Jan 2022, Primer: Searching for Efficient Transformers for Language Modeling, https://arxiv.org/abs/2109.08668
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. OpenAI, 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  • An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
  • Seungrok Jung., 15, Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
  • Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu, 2020, On Layer Normalization in the Transformer Architecture. arXiv:2002.04745 [cs, stat], June 2020. http://arxiv.org/abs/2002.04745 (Pre-norm versus post-norm.)
  • Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han, 2020, Understanding the Difficulty of Training Transformers. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5747–5763, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.463. https://aclanthology.org/2020.emnlp-main.463 (Pre-norm versus post-norm.)
  • Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli, 19 Dec 2024 (v2), Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, https://arxiv.org/abs/2412.13663 (Encoder-only BERT model updated with modern optimizations including Flash Attention, bias removal, RoPE, pre-norm, GeGLU (a GELU variant), hybrid local-global attention, and zero padding removal.)
  • Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
  • Jeonghoon Kim, Byeongchan Lee, Cheonbok Park, Yeontaek Oh, Beomjun Kim, Taehwan Yoo, Seongjin Shin, Dongyoon Han, Jinwoo Shin, Kang Min Yoo, 4 Feb 2025, Peri-LN: Revisiting Layer Normalization in the Transformer Architecture, https://arxiv.org/abs/2502.02732 (Instead of pre-norm or post-norm, analyzes "layer normalization (LN) peripherally around sublayers").
  • Sebastian Raschka, Jul 19, 2025, The Big LLM Architecture Comparison: From DeepSeek-V3 to Kimi K2: A Look At Modern LLM Architecture Design, https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison

Research on Normalization Alternatives and Optimization

Research papers on different types of normalization or other alternatives:

General Research on Normalization

Research papers on normalization issues in general:

  • Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
  • Biao Zhang, Rico Sennrich, 2019, Root Mean Square Layer Normalization, https://arxiv.org/abs/1910.07467
  • Shaked Brody, Uri Alon, Eran Yahav, May 2023, On the Expressivity Role of LayerNorm in Transformers' Attention, https://arxiv.org/abs/2305.02582 Code: https://github.com/tech-srl/layer_norm_expressivity_role
  • Toan Q. Nguyen, Julian Salazar, 2019, Transformers without Tears: Improving the Normalization of Self-Attention, https://arxiv.org/abs/1910.05895
  • S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, ser. JMLR Workshop and Conference Proceedings, F. R. Bach and D. M. Blei, Eds., vol. 37. JMLR.org, 2015, pp. 448–456. [Online]. Available: http://proceedings.mlr.press/v37/ioffe15.html
  • David Spuler, March 2024, Chapter 24. Normalization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei, 1 Mar 2022, DeepNet: Scaling Transformers to 1,000 Layers, https://arxiv.org/abs/2203.00555 https://arxiv.org/pdf/2203.00555.pdf (New normalization function DeepNorm)
  • kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
  • Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)

RMSNorm Research

RMSNorm is based on the Root Mean Square (RMS) calculation: it re-scales the vector by the reciprocal of its root-mean-square, skipping the mean subtraction used in LayerNorm.
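
A minimal sketch of the basic computation, written in the same style as the BatchNorm example above, is shown here; the function name and the learned "gamma" weights are illustrative assumptions, not from any specific implementation:

    #include <math.h>

    void rmsnorm_basic_sketch(float v[], int n, const float gamma[], float epsilon)
    {
        float sum_squares = 0.0f;
        for (int i = 0; i < n; i++) sum_squares += v[i] * v[i];
        float rms = sqrtf(sum_squares / (float)n + epsilon);   // Root-mean-square, smoothed by epsilon
        float recip = 1.0f / rms;                              // Reciprocal avoids n divisions
        for (int i = 0; i < n; i++) {
            v[i] = v[i] * recip * gamma[i];   // Scale each element by 1/RMS and a learned weight
        }
    }

Research papers on RMSNorm include: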

LayerNorm Optimizations

Research papers on optimizing LayerNorm:
