Aussie AI
FFN Fusion
-
Last Updated 10 October, 2025
-
by David Spuler, Ph.D.
What is FFN Fusion?
FFN fusion refers either to the merging of two whole FFN components (FFN merging) or to kernel fusion of the three sub-components within one FFN block (intra-FFN kernel fusion), which attempts to fuse the two matrix multiplications and the activation function between them. There are, in fact, a number of sub-methods of FFN fusion to consider:
- Inter-FFN fusion — merging two FFN blocks.
- Intra-FFN kernel fusion — fusing two or three internal FFN steps.
- Tiled FFN — end-to-end tiling of the three steps.
- Merged Matmuls — combining the two FFN matrices into one.
These FFN fusion approaches have been studied by different researchers. Let's examine each method in turn.
FFN Merging. Merging of two FFNs in adjacent layers has been researched by NVIDIA in Bercovich et al. (March 2025), but it is necessarily limited to situations where the attention blocks have first been removed from those layers via the optimization of "attention head pruning." If the attention blocks are removed from adjacent layers, then the two FFNs in those layers run immediately after each other, and can be combined into a single, larger FFN block. This is what is called "FFN Fusion" in this research area.
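As a rough illustration of the idea (not the exact formulation in the paper), here is a minimal NumPy sketch: two back-to-back FFN blocks, approximated as both reading the same input, can be computed as one wider FFN whose up-projection matrices are concatenated along the hidden dimension and whose down-projection matrices are stacked. The ReLU activation and all shapes below are illustrative assumptions.

```python
import numpy as np

def ffn(x, w1, w2):
    # Standard 2-layer FFN: up-project, elementwise activation, down-project.
    return np.maximum(x @ w1, 0.0) @ w2   # ReLU used for simplicity

def fused_ffn(x, w1_a, w2_a, w1_b, w2_b):
    # One wider FFN: concatenate the up-projections along the hidden
    # dimension and stack the down-projections, so both original blocks
    # read the same input x and their outputs are summed by one MatMul.
    w1 = np.concatenate([w1_a, w1_b], axis=1)   # (d, 2*d_ff)
    w2 = np.concatenate([w2_a, w2_b], axis=0)   # (2*d_ff, d)
    return np.maximum(x @ w1, 0.0) @ w2

# Sequential (exact):   z = y + FFN_b(y), where y = x + FFN_a(x)
# Fused (approximate):  z ~= x + FFN_a(x) + FFN_b(x), i.e., both blocks
# are fed x, which is the approximation that makes the fusion possible.
rng = np.random.default_rng(0)
d, d_ff = 8, 32
x = rng.standard_normal((4, d))
w1_a, w2_a = rng.standard_normal((d, d_ff)), rng.standard_normal((d_ff, d))
w1_b, w2_b = rng.standard_normal((d, d_ff)), rng.standard_normal((d_ff, d))

z_fused = x + fused_ffn(x, w1_a, w2_a, w1_b, w2_b)
z_parallel = x + ffn(x, w1_a, w2_a) + ffn(x, w1_b, w2_b)
assert np.allclose(z_fused, z_parallel)   # the wider FFN equals the parallel sum
```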
Layer Fusion. There is a lot of research on fusing or merging entire model layers, which is a subtype of "parameter sharing" research. Since each layer contains both attention components and FFN modules, this is similar to FFN fusion, but is not specific to the FFN components. There's also related research on "KV cache layer fusion" to add to the fusion options, not to mention computation avoidance via early exiting layers and layer skipping. But the FFNs are a big chunk of computation in themselves, so we're looking here specifically at just reducing FFN computation with fusion.
Intra-FFN Kernel Fusion. Rather than merging two FFNs entirely, the other type of fusion for FFN blocks works inside the sub-layers of a single FFN component. An FFN module typically involves three internal computation steps (sketched in code after this list):
- Matrix multiplication
- Activation function (e.g., GELU or SiLU, less commonly ReLU)
- Matrix multiplication
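To make those three steps concrete, here is a minimal NumPy sketch of a standard 2-layer FFN, using the common tanh approximation of GELU (the specific activation and shapes are illustrative assumptions):

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (an elementwise activation).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def ffn(x, w1, w2):
    h = x @ w1       # Step 1: up-projection matrix multiplication
    h = gelu(h)      # Step 2: elementwise activation function
    return h @ w2    # Step 3: down-projection matrix multiplication
```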
Fusing two components. Basic kernel fusion involves merging two of the three components. Well-known kernel fusion approaches can add the activation function as a "fused epilogue" of the first matrix multiplication kernel. Similarly, a "fused prologue" before the second matrix multiplication is another alternative. Both are ordinary techniques that are a staple of backend kernel development, rather than a subject of research. What is genuinely difficult is merging all three steps.
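The Python sketch below only models the data-movement pattern of a fused epilogue; in a real GPU kernel the activation is applied to the output tile while it is still in registers or shared memory, before it is written out. The tile size and the ReLU choice are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def matmul_then_activation(x, w1):
    # Unfused: materialize the full MatMul result, then re-read it to
    # apply the activation -- an extra round trip through memory.
    h = x @ w1
    return relu(h)

def matmul_fused_epilogue(x, w1, tile=64):
    # Fused epilogue (modelled in Python): apply the activation to each
    # output tile while it is still "hot", before storing it.
    n, d_ff = x.shape[0], w1.shape[1]
    out = np.empty((n, d_ff))
    for j in range(0, d_ff, tile):
        out[:, j:j + tile] = relu(x @ w1[:, j:j + tile])
    return out
```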
Fusing three components. Kernel fusion of all three steps of two matrix multiplications with an intervening activation function is very difficult. Two different approaches include:
- Tiling optimizations
- Merging the two matrices
Tiled FFN MatMuls. Each of the two MatMuls can obviously be tiled as an individual kernel, and the activation function can be fused onto the end of the first. However, running an end-to-end tiling optimization through both MatMuls is more complex. The idea is motivated by the way Flash Attention used end-to-end tiling to fuse the attention sequence of MatMul-Softmax-MatMul. The activation functions in an FFN are elementwise, and therefore much simpler than Softmax, so the idea has merit. The goal is to propagate tiled matrix multiplication computations through a MatMul-GELU-MatMul sequence (or the same with a different activation function).
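Here is a minimal NumPy sketch of that end-to-end tiling, assuming a GELU activation (tanh approximation) and tiling over the hidden dimension: each hidden tile is computed, activated, and immediately accumulated into the output, so the full intermediate matrix is never materialized. This illustrates the concept, not a production kernel.

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def tiled_ffn(x, w1, w2, tile=128):
    # End-to-end tiling of MatMul-GELU-MatMul: because GELU is elementwise,
    # each tile of the hidden layer can be computed, activated, and folded
    # into the output without materializing the full hidden matrix.
    n, d = x.shape
    d_ff = w1.shape[1]
    y = np.zeros((n, d))
    for j in range(0, d_ff, tile):
        h_tile = gelu(x @ w1[:, j:j + tile])   # one tile of the hidden layer
        y += h_tile @ w2[j:j + tile, :]        # accumulate its contribution
    return y

# Matches the unfused computation:
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 16))
w1 = rng.standard_normal((16, 512))
w2 = rng.standard_normal((512, 16))
assert np.allclose(tiled_ffn(x, w1, w2), gelu(x @ w1) @ w2)
```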
Bilinear layers are not FFNs. Note that the optimization known as "bilinear layers" is not applicable to plain FFNs. A bilinear layer removes the activation function between two matrix operations, but in a Gated Linear Unit (GLU), rather than in the FFN sequence. The GLU is more complex than a plain FFN, and "activation function pruning" from an FFN is problematic.
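A small NumPy sketch makes the distinction clear, using sigmoid gating for the GLU (an illustrative choice; real models often use SwiGLU or GEGLU): dropping the gate's activation leaves an elementwise product of two linear branches, which is still nonlinear in the input, so nothing collapses into a single MatMul.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu_ffn(x, w_gate, w_up, w_down):
    # Gated Linear Unit FFN: the activation sits on the gate branch.
    return (sigmoid(x @ w_gate) * (x @ w_up)) @ w_down

def bilinear_ffn(x, w_gate, w_up, w_down):
    # "Bilinear" variant: the gate activation is removed, but the
    # elementwise product of two linear branches is still nonlinear
    # in x, so the block does not reduce to one matrix multiplication.
    return ((x @ w_gate) * (x @ w_up)) @ w_down
```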
If you remove the activation function from the FFN's MatMul-Activation-MatMul sequence, you just have two matrix multiplications in a row, which can then be combined into a single MatMul using the associativity of matrix multiplication. Effectively, it becomes a single-layer FFN, reverting to older types of models from before some brainy boffin put the "multi-layer" into Multi-Layer Perceptron (MLP). An MLP is another name for the FFN, so if we remove the activation function from an FFN, we've got a single-layer FFN (or a non-multilayer MLP), and this creates a much less capable model overall.
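A quick NumPy check of the associativity point (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_ff = 16, 64
x = rng.standard_normal((4, d))
w1 = rng.standard_normal((d, d_ff))
w2 = rng.standard_normal((d_ff, d))

# With no activation, (x @ w1) @ w2 == x @ (w1 @ w2), so the two MatMuls
# collapse into one precomputed d-by-d matrix -- a single linear layer.
w_merged = w1 @ w2
assert np.allclose((x @ w1) @ w2, x @ w_merged)
```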
Unfortunately, we can't just remove the activation function from a normal 2-layer FFN and run faster. Nevertheless, if we could merge the two matrices in some other way that retained model accuracy, that would be a massive speed optimization, which motivates research on other ideas for merging the two MatMuls in an FFN.
Merging the two FFN MatMuls. Another optimization is to attempt to combine the two matrix multiplications into a single matrix, possibly a larger one. This is theoretically possible for a "linear" type of activation function, but the whole point of activation functions like GELU is to introduce non-linearity, which increases representational capability (i.e., so the model can learn complex things). There is some research on using a linear activation function, but it greatly decreases the accuracy of the model. The most advanced research, by Hu et al. (January 2025), uses a linear approximation of the activation function over some input ranges (with kernel fusion), and reverts to the exact non-linear computation for outlier values (without kernel fusion). This bifurcated computation shows some promise of improved speed with only limited accuracy degradation.
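As a hedged sketch of the general idea (not the exact algorithm of Hu et al.), suppose the activation is approximated by an elementwise affine function a*h + b over the range where most hidden values fall. Then the two FFN matrices fold into one precomputed matrix plus a bias vector, and inputs outside that range can fall back to the exact nonlinear path:

```python
import numpy as np

def merge_with_linear_activation(w1, w2, a, b):
    # If act(h) ~= a*h + b elementwise on the expected range, then
    #   (a*(x @ w1) + b) @ w2 == x @ (a * (w1 @ w2)) + b * w2.sum(axis=0)
    # so the FFN folds into one matrix plus a bias (precomputed offline).
    w_merged = a * (w1 @ w2)
    bias = b * w2.sum(axis=0)
    return w_merged, bias

def approx_ffn(x, w_merged, bias):
    # Fast path: a single MatMul replaces MatMul-Activation-MatMul.
    # Outlier activations would be routed to the exact nonlinear path.
    return x @ w_merged + bias
```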
Three-Layer FFNs. Note that there's actually research on 3-layer FFNs, and they appear better than standard 2-layer FFNs in terms of model representational capacity; see Gerber (May 2025). This extra FFN layer would run slower, unless three smaller linear layers outperformed two larger linear layers (with fewer overall parameters), or if using 3-layer FFNs meant that each layer was smarter and fewer model layers were required in the overall model. This research area on 3-layer FFNs is still a work-in-progress.
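For comparison, a 3-layer FFN simply adds a third weight matrix and a second activation; the GELU choice and structure below are illustrative, not the exact configuration examined in that paper:

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def ffn_3layer(x, w1, w2, w3):
    # Three linear layers with two intervening activations: more
    # representational capacity, but an extra MatMul per FFN block.
    return gelu(gelu(x @ w1) @ w2) @ w3
```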
Research on FFN Fusion
Research papers include:
- Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
- Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes "shared layers" with shared decoder FFN weights.)
- Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama (Shared FFN layers, similar to pruning several FFNs, for on-mobile small model execution.)
- Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
- Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, the optimization merges the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., ReLU or GELU) with linear functions and merging the two matrices using matrix-multiplication associativity.)
- Asif Razzaq, March 29, 2025, NVIDIA AI Researchers Introduce FFN Fusion: A Novel Optimization Technique that Demonstrates How Sequential Computation in Large Language Models LLMs can be Effectively Parallelized, https://www.marktechpost.com/2025/03/29/nvidia-ai-researchers-introduce-ffn-fusion-a-novel-optimization-technique-that-demonstrates-how-sequential-computation-in-large-language-models-llms-can-be-effectively-parallelized/
- Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv, 24 Mar 2025, FFN Fusion: Rethinking Sequential Computation in Large Language Models, https://arxiv.org/abs/2503.18908
- Richie Li, 31 May 2025 (v3), Dataflow & Tiling Strategies in Edge-AI FPGA Accelerators: A Comprehensive Literature Review, https://arxiv.org/abs/2505.08992
- Farley, J., Gerstlauer, A., 2023, MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated Edge Inference. In: Henkler, S., Kreutz, M., Wehrmeister, M.A., Götz, M., Rettberg, A. (eds) Designing Modern Embedded Systems: Software, Hardware, and Applications. IESS 2022. IFIP Advances in Information and Communication Technology, vol 669. Springer, Cham. https://doi.org/10.1007/978-3-031-34214-1_7 https://link.springer.com/chapter/10.1007/978-3-031-34214-1_7 https://arxiv.org/abs/2107.06960 https://github.com/JacksonFarley/MAFAT
- Jackson Farley, Andreas Gerstlauer 14 Jul 2021 (v1), Memory-Aware Fusing and Tiling of Neural Networks for Accelerated Edge Inference, https://arxiv.org/abs/2107.06960v1 https://github.com/JacksonFarley/MAFAT
- Isaac Gerber, 10 May 2025, Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models, https://arxiv.org/abs/2505.06633 (Examines 3-layer FFN components.)
- Ruediger Ehlers, 2 Aug 2017 (v3), Formal Verification of Piece-Wise Linear Feed-Forward Neural Networks, https://arxiv.org/abs/1705.01320
- Razvan Pascanu, Guido Montufar, Yoshua Bengio, 14 Feb 2014 (v5), On the number of response regions of deep feed forward networks with piece-wise linear activations, https://arxiv.org/abs/1312.6098
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home