Aussie AI
Fused and Shared Epilogues
Bonus Material for "CUDA C++ Optimization: Coding Faster GPU Kernels"
by David Spuler
What are Epilogues?
Epilogues are the operations that happen after a matrix multiplication. Especially in LLM kernels, the computations usually involve a matrix multiplication followed by some other post-processing. Here are some common examples of operations applied to the data that comes out of a matrix multiplication kernel:
- Activation functions — e.g. RELU changes all negatives to zero, and there are many other more complicated ones (GELU, SiLU, and SwiGLU, to name but a few).
- Bias vector addition — the bias vector is a set of extra trained LLM parameters that are added after the bigger matrix multiplication (it’s not as big as the weights used in the MatMul, because it’s only a vector not a matrix).
- Residual vector addition — the residual vector is not a set of trained parameters, but is a dynamic vector added in a “residual connection” (also called a “skip connection”), and it uses the activation vector that was computed prior to the big matrix multiplication (i.e., its values are only known at runtime, unlike a trained bias vector).
- Normalization (LayerNorm or BatchNorm) — creating a normalized set of data, usually after one stage and before it gets sent into the next stage of LLM computations.
- Softmax — a special type of normalization that smooths all elements of a vector into a probability distribution, where the total sum of all vector elements is one (which means 100% in probabilities), and each vector element is an individual probability between zero and one (inclusive).
- Quantization/Dequantization — e.g. converting back and forth between FP32 and FP4 or FP8, which is effectively done by applying a simple scaling factor.
As you can see, all of these things are less expensive than the big matrix multiplication. All of them are vector operations, such as vector addition or vector normalization, and some of them are “embarrassingly parallel” vector operations, where each element of the vector can be processed independently. In theory, vector operations are O(n) complexity, compared with O(n^2) complexity for matrix-vector multiplication (GEMV), or O(n^3) complexity for matrix-matrix multiplication (GEMM).
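To make this concrete, here is a minimal sketch of a standalone RELU epilogue running as its own kernel after a separate GEMV; the kernel name and launch configuration are illustrative, not from any particular library.

// Standalone RELU epilogue kernel (illustrative sketch).
// Running it as a separate kernel means the activation vector
// makes an extra round trip through global memory.
__global__ void relu_epilogue(float* v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        v[i] = fmaxf(v[i], 0.0f);  // RELU: clamp negatives to zero
    }
}

// Naive two-kernel sequence (hypothetical host-side calls):
//   gemv_kernel<<<gridG, blockG>>>(W, x, v, rows, cols);  // stores v to global memory
//   relu_epilogue<<<(n + 255) / 256, 256>>>(v, n);        // reloads v, stores it again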
But it’s not zero! Hence, there are multiple attempts to optimize these post-processing costs:
- Epilogue fusion
- Epilogue sharing
- Epilogue pruning
- Epilogue Visitor Tree (EVT) computations
- Kernel fission (un-fusing, rarely)
Let’s examine all these issues.
Fused Epilogues
Fused epilogues are a specific type of “kernel fusion” operation related to epilogues. Kernel fusion refers to the general optimization of combining any two operations into a single operation. ML graph compilers are a good example where kernel fusion is used more generally to combine two or more “nodes” representing operations in the model graph. However, in LLM kernels, the main use of kernel fusion is with epilogues.
Let’s consider a simplistic example: combining matrix-vector multiplication with the RELU activation function. A naive version uses two separate kernels:
- Matrix-vector (GEMV) kernel — outputs the vector (of activations).
- RELU vector kernel — outputs the modified vector.
You can see the inefficiency here:
- The GEMV kernel has to store the vector as output.
- The RELU kernel then loads that vector as input (and stores it back again).
The slug is the doubled memory operations. We aren’t going to reduce the computations by fusing the kernels, because we’ll still do the matrix multiplication and the RELU arithmetic. Instead, kernel fusion avoids the memory storage for the temporary vector data. In practice, the fused “GEMV-RELU” kernel is a very simple code modification:
Fused-GEMV-RELU computes RELU on each element before storing the output vector.
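As a rough sketch (assuming a simple one-thread-per-output-row GEMV rather than a production tiled kernel), the fused version just applies RELU to each dot product before the single store to global memory:

// Fused GEMV + RELU sketch: one thread per output element (row of W).
// Illustrative only; real GEMV kernels use tiling, shared memory, etc.
__global__ void gemv_relu_fused(const float* W,  // [rows x cols], row-major
                                const float* x,  // [cols]
                                float* y,        // [rows]
                                int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c) {
            sum += W[row * cols + c] * x[c];
        }
        y[row] = fmaxf(sum, 0.0f);  // RELU fused into the final store
    }
}

The only change from a plain GEMV is the fmaxf() on the final store, but it removes an entire kernel launch and a round trip through global memory for the activation vector.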
Note that fused epilogues are not the only type of kernel fusion. For example, if we have two kernels that compute the “min” and “max” of a vector, both are horizontal reductions, and we can fuse them into a “fused min-max” vector kernel, as sketched below. In that case, it’s kernel fusion of two parallel operations, rather than a preliminary computation and an epilogue. We could even fuse a third reduction, such as a “sum” computation.
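Here is a minimal sketch of that fused min-max idea, using integer data and global atomics to keep it short; a tuned version would do a block-level reduction in shared memory before touching global memory.

#include <climits>

// Fused min-max reduction sketch: one pass over the data computes both results.
__global__ void fused_min_max(const int* data, int n, int* d_min, int* d_max)
{
    int local_min = INT_MAX;
    int local_max = INT_MIN;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {            // grid-stride loop: read data only once
        local_min = min(local_min, data[i]);
        local_max = max(local_max, data[i]);
    }
    atomicMin(d_min, local_min);   // combine per-thread results
    atomicMax(d_max, local_max);
}
// Host side: initialize *d_min = INT_MAX and *d_max = INT_MIN before launching.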
Shared Epilogues
Shared epilogues are similar to fused epilogue kernels, but operate on multiple matrices. The idea is to create a kernel that just does the epilogue operation (or two), which can then be used on multiple matrices. The basic idea:
- Fused epilogues — one matrix is processed with a hard-coded epilogue.
- Shared epilogues — multiple matrices are processed with epilogues (and their metadata).
Obviously, it’s faster to specialize a kernel to a particular type of epilogue operation. However, then you’ve got to rewrite that kernel for every type of matrix or vector that you’re using it on.
Maybe we could share some code?
That’s how to get to shared epilogues. Comparing shared epilogues to fused epilogues, we get this trade-off:
- More flexible than fused epilogues
- Not quite as fast
The idea behind sharing the epilogue, generalized into an object representation, is that such a kernel involves:
- The operation (or two)
- Additional raw data (shared)
- Metadata about the operation
The two types of data vary with the type of operation, and are not always required. For example, a shared epilogue for a bias vector addition can contain the extra data for that bias vector in its object, because that’s known ahead of time and is the same when applied to every vector. Metadata is usually a smaller amount of data, but could include settings like the scaling parameters for a dequantization epilogue, or the alpha and beta parameters for a complex activation function (e.g., the “slope” parameter in leaky RELU).
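Here is a minimal sketch of what such a shared epilogue object might look like; the struct layout and device function are hypothetical, purely to illustrate the operation/raw-data/metadata split.

// Hypothetical shared epilogue descriptor: the same epilogue device function
// can be reused across many matrices by passing this object into the kernel.
enum class EpilogueOp { None, AddBias, LeakyRelu, Dequantize };

struct EpilogueParams {
    EpilogueOp   op;        // which epilogue operation to apply
    const float* bias;      // shared raw data (e.g., bias vector); may be nullptr
    float        alpha;     // metadata: e.g., leaky-RELU slope or dequant scale
};

__device__ inline float apply_epilogue(float v, int i, const EpilogueParams& p)
{
    switch (p.op) {
        case EpilogueOp::AddBias:    return v + p.bias[i];
        case EpilogueOp::LeakyRelu:  return v > 0.0f ? v : p.alpha * v;
        case EpilogueOp::Dequantize: return v * p.alpha;
        default:                     return v;
    }
}

// A GEMV/GEMM kernel then calls apply_epilogue() on each output element
// before its final store, for whatever matrix it is currently processing.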
Epilogue Pruning
Imagine if your epilogue operation cost zero. Well, there are a lot of researchers who’ve had that idea, and there are plenty of research papers on these subtopics. These are the versions that have progressed from research papers into real-world usage in some major models, although they are not always used:
- Bias vector pruning — lots of major models no longer use the extra trainable parameters in bias vectors, but just use matrices of weights.
- Residual vector pruning — residual connections are supposed to be needed to avoid the vanishing gradient problem, but sometimes they’re not necessary.
These pruning ideas are still mostly research, rather than being used in industry model architectures:
- Activation function pruning — removing the activation function step between the two MatMuls in FFNs (creates “bilinear layers”).
- Softmax pruning — instead of normalizing into probabilities, just keep going.
- Normalization pruning — the same idea where normalized values may not be necessary.
Your mileage may vary for these approaches.
Fused Prologues?
Yes, they do exist, and are used in real-world LLM backends, but they always feel sad that they don’t get much attention in press releases.
There are plenty of places where an extra operation is done before a major matrix multiplication operation. Some examples that are happening in real CUDA kernels include:
- “Pre-norm” normalization before a block.
- Transposing a matrix before MatMul.
- Adding padding to a matrix or tensor.
- Quantization scaling of inputs.
- Masking operations on data.
- Data layout transformations (e.g., tensor re-shaping).
A good example is block-scaled quantization from FP32 to FP4/FP8 on Blackwell. In this case, the input data needs to be scaled prior to the major computations. Some other specific uses include:
- TMA multicast loads setting up data in SMs.
- Grouped GEMM APIs with prologue steps.
- Cluster Launch Control (CLC) of persistent kernels with setup steps.
A lot of the coding issues for prologues are very similar to fused epilogues, and “shared prologues” are possible. After all, both epilogues and prologues are types of kernel fusion optimizations.
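As an example, here is a hedged sketch of a fused prologue, where dequantization scaling of quantized weights (stored as int8 for simplicity) happens on the fly inside the GEMV loop, instead of materializing a scaled copy of the matrix first; the layout and names are illustrative, not from any particular backend.

#include <cstdint>

// Fused prologue sketch: dequantize quantized weights as they are loaded,
// rather than running a separate "scale the whole matrix" kernel beforehand.
__global__ void gemv_dequant_fused(const int8_t* Wq,   // quantized weights [rows x cols]
                                   float scale,        // dequantization scale factor
                                   const float* x,     // input vector [cols]
                                   float* y,           // output vector [rows]
                                   int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c) {
            float w = scale * (float)Wq[row * cols + c];  // prologue fused into the load
            sum += w * x[c];
        }
        y[row] = sum;
    }
}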
Epilogue Visitor Tree (EVT)
These are advanced types of fused epilogues that are handled in the CUTLASS library. The operations to be performed in the epilogue are stored in a hierarchical tree, and CUTLASS has support for executing that tree. This has been available in CUTLASS since the Hopper architecture. Conceptually, this is like a tree subset of a graph, akin to the idea of a “compute graph” in CUDA, or the type of “operator fusion” done on nodes in the graph representations used in ML graph compilers.
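Conceptually (this is not the actual CUTLASS EVT API, just a simplified C++ illustration of the idea), an epilogue tree can be expressed by composing small functors, so the GEMM kernel only has to apply the root node to each output element:

// Simplified illustration of the visitor-tree idea: compose epilogue nodes
// as nested functors. Real CUTLASS EVT types and interfaces differ.
struct AddBias {
    const float* bias;
    __device__ float operator()(float v, int i) const { return v + bias[i]; }
};

struct Relu {
    __device__ float operator()(float v, int /*i*/) const { return v > 0.0f ? v : 0.0f; }
};

template <typename Outer, typename Inner>
struct Compose {              // tree node: applies Outer(Inner(v))
    Outer outer;
    Inner inner;
    __device__ float operator()(float v, int i) const { return outer(inner(v, i), i); }
};

// Example tree: RELU(bias + accumulator), applied per output element inside
// the GEMM epilogue:  Compose<Relu, AddBias> epilogue{Relu{}, AddBias{bias_ptr}};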
Epilogue Visitor Tree Research. Research papers include:
- Colfax Research, October 25, 2024, Epilogue Fusion in CUTLASS with Epilogue Visitor Trees, https://research.colfax-intl.com/epilogue_visitor_tree/
- Zhaodong Chen, Andrew Kerr, Richard Cai, Jack Kosaian, Haicheng Wu, Yufei Ding, and Yuan Xie, 2024, EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree, Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS '24), Vol. 3. Association for Computing Machinery, New York, NY, USA, 301–316, https://doi.org/10.1145/3620666.3651369, https://dl.acm.org/doi/10.1145/3620666.3651369
- Ganesh Bikshandi, Jay Shah, December 2023, Developing CUDA Kernels for Accelerated Matrix Multiplication on NVIDIA Hopper Architecture using the CUTLASS Library, https://research.colfax-intl.com/nvidia-hopper-gemm-cutlass/, PDF: https://research.colfax-intl.com/wp-content/uploads/2023/12/colfax-gemm-kernels-hopper.pdf
References
Fused Epilogues and CUDA Kernel Fusion. Research papers include:
- Szymon Karpiński, Nov 18, 2024, Fusing Epilog Operations with Matrix Multiplication Using nvmath-python, https://developer.nvidia.com/blog/fusing-epilog-operations-with-matrix-multiplication-using-nvmath-python/
- Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu, 23 Oct 2024 (v5), FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion, https://arxiv.org/abs/2406.06858
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918
- Ashwin Srinath and Andy Terrel, Jul 09, 2025, Delivering the Missing Building Blocks for NVIDIA CUDA Kernel Fusion in Python, https://developer.nvidia.com/blog/delivering-the-missing-building-blocks-for-nvidia-cuda-kernel-fusion-in-python/
- Vrushank Desai, Feb 18, 2024, Part VI - Kernel Fusion in CUDA: An implementation of a fused GPU kernel combining Group Normalization and Mish activation into a single kernel, https://www.vrushankdes.ai/diffusion-policy-inference-optimization/part-vi---kernel-fusion-in-cuda
- Jiří Filipovič, Matúš Madzin, Jan Fousek, and Ludĕk Matyska, 2015, Optimizing CUDA code by kernel fusion: application on BLAS, J. Supercomput. 71, 10 (October 2015), 3934–3957, https://doi.org/10.1007/s11227-015-1483-z, https://dl.acm.org/doi/10.1007/s11227-015-1483-z
- Han Zhao, Weihao Cui, Quan Chen, Youtao Zhang, Yanchao Lu, Chao Li, Jingwen Leng, Minyi Guo, 2022, Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS, IEEE International Symposium on High-Performance Computer Architecture (HPCA), Seoul, Korea, Republic of, 2022, pp. 800-813, doi: 10.1109/HPCA53966.2022.00064, https://mivenhan.github.io/publication/2022tacker/2022tacker.pdf, https://ieeexplore.ieee.org/document/9773253
- Mohamed Wahib, 2015, Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications, https://github.com/wahibium/KFF
Fused LayerNorm. Research papers are below:
- Moin Nadeem, 2022, Fused LayerNorm, MosaicML, https://docs.mosaicml.com/projects/composer/en/stable/method_cards/fused_layernorm.html (Fused LayerNorm in MosaicML Composer.)
- NVIDIA, apex.normalization.fused_layer_norm, Apex (A PyTorch Extension), 2018, https://nvidia.github.io/apex/layernorm.html (Fused LayerNorm in Apex.)
- Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, 2022, xFormers: A modular and hackable Transformer modelling library, Facebook Research, Code: https://github.com/facebookresearch/xformers (Supports several kernel fusion operations such as: fused softmax, fused linear layer, fused LayerNorm, fused dropout, fused SwiGLU.)
- J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient GPU serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578 (Turbotransformers uses various kernel fusions, such as fused LayerNorm, fused activation functions and fused transpose operations.)
- Y Zhai, C Jiang, L Wang, X Jia, S Zhang, 2023, ByteTransformer: A high-performance transformer boosted for variable-length inputs, https://ieeexplore.ieee.org/abstract/document/10177488/, https://arxiv.org/abs/2210.03052 (ByteTransformer uses fused MHA and kernel operator fusion such as fused LayerNorm, fused add-bias, fused GELU, and Softmax fusion.)
- B Hagedorn, B Fan, H Chen, C Cecka, 2023, Graphene: An IR for Optimized Tensor Computations on GPUs, ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada, PDF: https://dl.acm.org/doi/pdf/10.1145/3582016.3582018 (Includes various kernel fusions including fused MHA and fused LayerNorm.)
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (Extensive analysis of fusion opportunities in Transformers, such as fused LayerNorm.)
- DeepSpeed Team, Rangan Majumder, Andrey Proskurin, May 24, 2021, DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/ (DeepSpeed uses various kernel fusion methods including for Softmax, LayerNorm, transpose, and GEMM.)
- Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, Kimmo Yan, Sep 2022, SAMP: A Toolkit for Model Inference with Self-Adaptive Mixed-Precision, https://arxiv.org/pdf/2209.09130.pdf (Mixed-precision quantization combined with kernel fusion, including QKV tensor operation fusion and AddBias-LayerNorm fusion.)
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (“Section A.2 Fusion Implementation” lists various types of fusion, including: normalization/layernorm, Softmax, RELU, bias, and dropout.)
- Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory Arshinov, Andrey Bout, Irina Piontkovskaya, Jiansheng Wei, Xin Jiang, Teng Su, Qun Liu, Jun Yao, March 2023, PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing, Technical Report, https://arxiv.org/abs/2303.10845 (Method uses a fused layernorm.)
- Mahsa Salmani, Ilya Soloveychik, 24 Feb 2025, LLM Inference Acceleration via Efficient Operation Fusion, https://arxiv.org/abs/2502.17728
- Ofer Dekel, 29 Apr 2025, Blockbuster, Part 1: Block-level AI Operator Fusion, https://arxiv.org/abs/2505.07829
Fused BatchNorm. Research papers are below:
- S. Mehta and M. Rastegari, “MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer,” International Conference on Learning Representations, 2021. https://arxiv.org/abs/2110.02178 (Fusion of elements of the convolutional layers including batch normalization into convolutions.)
- Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713, 2018. https://arxiv.org/abs/1712.05877 (Fuses bias-addition into MatMul, and fuses activation functions and batch normalization with convolutional layers.)
- Francesco Ratto, Ángela Porras Máinez, Carlo Sau, Paolo Meloni, Gianfranco Deriu, Stefano Delucchi, Massimo Massa, Luigi Raffo, Francesca Palumbo, April 2023, An Automated Design Flow for Adaptive Neural Network Hardware Accelerators. Journal of Signal Processing Systems (2023): 1-23. https://link.springer.com/article/10.1007/s11265-023-01855-x (Fused batchnorm with batch normalization merged into convolutions.)
- Michael Anderson, Evangelos Georganas, Sasikanth Avancha, Alexander Heinecke, 2018, Tensorfolding: Improving convolutional neural network performance with fused microkernels, SC18, November 11-16, 2018, Dallas, Texas, USA, PDF: https://sc18.supercomputing.org/proceedings/tech_poster/poster_files/post155s2-file3.pdf (Includes fused batch norm and fused RELU, along with a process called “tensor folding”.)
- D. Jung, W. Jung, B. Kim, S. Lee, W. Rhee, and J. H. Ahn, Restructuring batch normalization to accelerate CNN training, 2018. PDF: https://mlsys.org/Conferences/2019/doc/2019/18.pdf, Code: https://github.com/scale-snu/caffe-bn-restructuring, Code: https://github.com/scale-snu/mkldnn-bn-restructuring (Coverage of batch normalization, merging into prior and subsequent layers, and a technique called “batch norm fission”.)
- E. Georganas, S. Avancha, K. Banerjee, D. Kalamkar, G. Henry, H. Pabst, and A. Heinecke, “Anatomy of high-performance deep learning convolutions on simd architectures,” in Accepted to Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’18. IEEE Press, 2018, https://arxiv.org/abs/1808.05567 (Investigates kernel fusion for RELU, bias, and normalization, although mostly calls it “layer fusion”.)
- D. Das, N. Mellempudi, D. Mudigere, D. D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, A. Heinecke, P. Dubey, J. Corbal, N. Shustrov, R. Dubtsov, E. Fomenko, and V. O. Pirogov, “Mixed precision training of convolutional neural networks using integer operations,” CoRR, vol. abs/1802.00930, 2018. http://arxiv.org/abs/1802.00930 (Fused element-wise layers with RELU and batch normalization.)
- Mathilde Guillemot, Catherine Heusele, Rodolphe Korichi, Sylvianne Schnebert, Liming Chen, Feb 2020, Breaking batch normalization for better explainability of deep neural networks through layer-wise relevance propagation, https://arxiv.org/abs/2002.11018 (Focuses on explainability propagations through batch norm, but is also a type of fused batch norm.)
- Pytorch, 2023, Fusing Convolution and Batch Norm Using Custom Function, https://pytorch.org/tutorials/intermediate/custom_function_conv_bn_tutorial.html
- J Zhang, 2023, Quantization for High-dimensional Data and Neural Networks: Theory and Algorithms, Ph.D. Thesis, University of California, San Diego, https://escholarship.org/content/qt9bd2k7gf/qt9bd2k7gf.pdf (See section 4.7: Fusing Convolution and Batch Normalization Layers.)
- Xiaoming (Jason) Cui, Ashraf Bhuiyan, 2023, Optimizing Transformer Model Inference on Intel® Processors, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html
- S. R. Bulo, L. Porzi, and P. Kontschieder, In-place activated batchnorm for memory-optimized training of DNNs, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5639–5647. https://arxiv.org/abs/1712.02616 Code: https://github.com/mapillary/inplace_abn (Fused BatchNorm with activations in a single layer.)
Fused Softmax. Fused Softmax is the optimization of kernel fusion whereby the Softmax normalization is fused into the prior operation. Softmax is not especially amenable to fusion because its computation requires the fully-computed vector for each element, rather than being a simple elementwise computation. Nevertheless, Softmax is an expensive operation that is worth the effort, and fusing Softmax is a standard LLM efficiency optimization. Research papers are below:
- Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, 2022, xFormers: A modular and hackable Transformer modelling library, Facebook Research, Code: https://github.com/facebookresearch/xformers (Supports several kernel fusion operations such as: fused softmax, fused linear layer, fused LayerNorm, fused dropout, fused SwiGLU.)
- Y Zhai, C Jiang, L Wang, X Jia, S Zhang, 2023, ByteTransformer: A high-performance transformer boosted for variable-length inputs, https://ieeexplore.ieee.org/abstract/document/10177488/, https://arxiv.org/abs/2210.03052 (ByteTransformer uses fused MHA and kernel operator fusion such as fused LayerNorm, fused add-bias, fused GELU, and Softmax fusion.)
- DeepSpeed Team, Rangan Majumder, Andrey Proskurin, May 24, 2021, DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/ (DeepSpeed uses various kernel fusion methods including for Softmax, LayerNorm, transpose, and GEMM.)
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (“Section A.2 Fusion Implementation” lists various types of fusion, including: normalization/layernorm, Softmax, RELU, bias, and dropout.)
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Mahsa Salmani, Ilya Soloveychik, 24 Feb 2025, LLM Inference Acceleration via Efficient Operation Fusion, https://arxiv.org/abs/2502.17728
- Zitong Li, Aparna Chandramowlishwaran, 12 May 2025, Fused3S: Fast Sparse Attention on Tensor Cores, https://arxiv.org/abs/2505.08098
Fused Activation Functions. Fused activation functions are LLM kernel optimizations from merging the computation of the activation function on the vector of activations with the prior operation, such as a matrix-vector multiplication kernel. Activation functions are simple, elementwise vector operations that are easily parallelizable and amenable to merging back into the end of the prior kernel. Hence, fused activation functions are a common LLM kernel optimization. Research papers are below:
- Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713, 2018. https://arxiv.org/abs/1712.05877 (Fuses bias-addition into MatMul, and fuses activation functions and batch normalization with convolutional layers.)
- J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient GPU serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578 (Turbotransformers uses various kernel fusions, such as fused LayerNorm, fused activation functions and fused transpose operations.)
Fused RELU. RELU is a very simple activation function that can be readily fused into another kernel. Research papers are below:
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (“Section A.2 Fusion Implementation” lists various types of fusion, including: normalization/layernorm, Softmax, RELU, bias, and dropout.)
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Many pseudo-code examples of kernel operator fusion, e.g. shows pseudo-code of fusing RELU into MatMul, and a fused MatMul-addbias-RELU.)
- D. Das, N. Mellempudi, D. Mudigere, D. D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, A. Heinecke, P. Dubey, J. Corbal, N. Shustrov, R. Dubtsov, E. Fomenko, and V. O. Pirogov, “Mixed precision training of convolutional neural networks using integer operations,” CoRR, vol. abs/1802.00930, 2018. http://arxiv.org/abs/1802.00930 (Fused element-wise layers with RELU and batch normalization.)
Fused GELU. Research papers are below:
- Wenxuan Zeng, Meng Li, Wenjie Xiong, Tong Tong, Wen-jie Lu, Jin Tan, Runsheng Wang, Ru Huang, Aug 2023, MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention, https://arxiv.org/abs/2211.13955, PDF: https://openaccess.thecvf.com/content/ICCV2023/papers/Zeng_MPCViT_Searching_for_Accurate_and_Efficient_MPC-Friendly_Vision_Transformer_with_ICCV_2023_paper.pdf, Code: https://github.com/PKU-SEC-Lab/mpcvit (Optimizes Softlayer, GELU, and MatMul. Fuses two linear layers with an approximated linear version of GELU.)
- Y Zhai, C Jiang, L Wang, X Jia, S Zhang, 2023, ByteTransformer: A high-performance transformer boosted for variable-length inputs, https://ieeexplore.ieee.org/abstract/document/10177488/, https://arxiv.org/abs/2210.03052 (ByteTransformer uses fused MHA and kernel operator fusion such as fused LayerNorm, fused add-bias, fused GELU, and Softmax fusion.)
- Ashraf Eassa, Bo Yang Hsueh, Brian Pharris, Zhihan Jiang and Ashwin Nanjappa, Sep 08, 2022, Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA, NVIDIA Technical Blog, https://developer.nvidia.com/blog/full-stack-innovation-fuels-highest-mlperf-inference-2-1-results-for-nvidia/ (The NVIDIA Bert submission included kernel fusions such as fused MHA, fused bias, and fused GELU.)
- chengzeyi, Oct 2023, Stable Fast, https://github.com/chengzeyi/stable-fast (Highly optimized inference engine with fused GroupNorm + GELU operator in NHWC tensor memory format)
Fused SwiGLU. Research papers are below:
- Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, 2022, xFormers: A modular and hackable Transformer modelling library, Facebook Research, Code: https://github.com/facebookresearch/xformers (Supports several kernel fusion operations such as: fused softmax, fused linear layer, fused LayerNorm, fused dropout, fused SwiGLU.)
- Byron (Pin-Lun) Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989, http://github.com/linkedin/Liger-Kernel
Fused Add-Bias. Fused add-bias is a kernel fusion optimization whereby adding bias vectors is fused into the previous operation. Adding bias vectors is an additional step often performed after matrix multiplications. Since vector addition is an easily vectorizable elementwise addition of two vectors, this can often be fused into the prior operator, usually a GEMM (matrix-matrix) or GEMV (matrix-vector) multiplication kernel. Note that some model architectures dispense with bias vectors completely (“bias pruning”), which is even faster than merging them! Research papers are below:
- Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713, 2018. https://arxiv.org/abs/1712.05877 (Fuses bias-addition into MatMul, and fuses activation functions and batch normalization with convolutional layers.)
- Y Zhai, C Jiang, L Wang, X Jia, S Zhang, 2023, ByteTransformer: A high-performance transformer boosted for variable-length inputs, https://ieeexplore.ieee.org/abstract/document/10177488/, https://arxiv.org/abs/2210.03052 (ByteTransformer uses fused MHA and kernel operator fusion such as fused LayerNorm, fused add-bias, fused GELU, and Softmax fusion.)
- Ashraf Eassa, Bo Yang Hsueh, Brian Pharris, Zhihan Jiang and Ashwin Nanjappa, Sep 08, 2022, Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA, NVIDIA Technical Blog, https://developer.nvidia.com/blog/full-stack-innovation-fuels-highest-mlperf-inference-2-1-results-for-nvidia/ (The NVIDIA Bert submission included kernel fusions such as fused MHA, fused bias, and fused GELU.)
- Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, Kimmo Yan, Sep 2022, SAMP: A Toolkit for Model Inference with Self-Adaptive Mixed-Precision, https://arxiv.org/pdf/2209.09130.pdf (Mixed-precision quantization combined with kernel fusion, including QKV tensor operation fusion and AddBias-LayerNorm fusion.)
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (Examines benefit from add-bias fusion, as well as many other fusion opportunities. “Section A.2 Fusion Implementation” lists various types of fusion of bias operations.)
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Many pseudo-code examples of kernel operator fusion, e.g. shows pseudo-code of fusing RELU into MatMul, and a fused MatMul-addbias-RELU.)
- David Spuler, March 2024, Example: Fused VMM-add-bias, in Generative AI in C++, https://www.aussieai.com/book/ch31-fused-vmm-addbias
- Y. Zhang, O. A. Kailani, B. Zhou and W. Zhao, “AdderNet 2.0: Optimal AdderNet Accelerator Designs With Activation-Oriented Quantization and Fused Bias Removal-Based Memory Optimization,” in IEEE Transactions on Circuits and Systems I: Regular Papers, doi: 10.1109/TCSI.2025.3539912. https://ieeexplore.ieee.org/abstract/document/10884535/
Fused Matrix Transpose (Fused Prologue!). One of the main optimizations to matrix-matrix multiplications is to use the transpose of the second matrix, because it is stored in column-major order. This means the data is stored in contiguous memory, which is better for coalesced memory accesses. Hence, the computation of the matrix transpose can be fused into the matrix multiplication kernel. Research papers are below:
- J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient GPU serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578 (Turbotransformers uses various kernel fusions, such as fused LayerNorm, fused activation functions and fused transpose operations.)
- DeepSpeed Team, Rangan Majumder, Andrey Proskurin, May 24, 2021, DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/ (DeepSpeed uses various kernel fusion methods including for Softmax, LayerNorm, transpose, and GEMM.)
- Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, Bin Ren, 2021, DNNFusion: accelerating deep neural networks execution with advanced operator fusion, PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, June 2021, Pages 883–898, https://doi.org/10.1145/3453483.3454083, https://dl.acm.org/doi/10.1145/3453483.3454083, https://arxiv.org/abs/2108.13342 (Includes some discussion of fusing matrix transposition.)
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Mentions of optimizations of the transpose operation, with numerous other optimizations.)