Aussie AI Blog
LLM Attention and FFN Optimization are Opposites
March 22nd, 2026
by David Spuler, Ph.D.
Attention versus FFN Optimization
The two main types of compute components in Transformer engines for LLM inference are attention modules and Feed-Forward Networks (FFNs). The two have quite different compute bottleneck profiles:
- Attention: compute is quadratic in input length; KV cache memory grows linearly with input length.
- FFN: compute is linear in input length; memory usage does not grow with input length.
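As a rough illustration of these scaling profiles, here is a back-of-the-envelope cost model. It is only a sketch under stated assumptions: the function names, the convention of 2 FLOPs per multiply-add, and the 2-byte cache values are illustrative choices, and the model ignores the QKV projections, normalization, and variants such as grouped-query attention.

```python
def attention_flops(n_tokens: int, d_model: int) -> int:
    """Approximate prefill FLOPs for QK^T and (softmax scores)V in one layer."""
    # Two products of roughly n x n x d multiply-adds, 2 FLOPs each (assumed convention).
    return 2 * (2 * n_tokens * n_tokens * d_model)


def ffn_flops(n_tokens: int, d_model: int, d_ff: int) -> int:
    """Approximate FLOPs for the two FFN MatMuls in one layer."""
    # Up-projection (d_model -> d_ff) and down-projection (d_ff -> d_model) per token.
    return 2 * (2 * n_tokens * d_model * d_ff)


def kv_cache_bytes(n_tokens: int, d_model: int, n_layers: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one K and one V vector per token per layer."""
    return 2 * n_tokens * d_model * n_layers * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model dimensions, chosen only for illustration.
    for n in (1024, 2048, 4096):
        print(n, attention_flops(n, 4096), ffn_flops(n, 4096, 16384),
              kv_cache_bytes(n, 4096, 32))
```

Under this model, doubling the input length quadruples the attention cost but only doubles the FFN cost and the KV cache size.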
Because of these different characteristics, the optimization methods are distinct.
- Attention optimizations: linear attention, hybrid Mamba attention, memory-efficient attention (e.g., Flash, Paged), KV cache compression, and many more methods.
- FFN optimizations: Mixture-of-Experts (MoE), matrix compute optimizations, kernel fusion.
There are also many higher-level techniques that apply to both of these: quantization (of weights, activations, and/or KV cache data), multi-token prediction, prefill-decoding disaggregation, and many more.
But what's missing?
Matching Research Optimizations
Ironically, there are two cases where a mainstream optimization for one component is only a research area for the other, and vice-versa:
- Attention Mixture-of-Experts (MoE)
- FFN fusion (analogous to Flash attention)
Both of these areas seem on the cusp of becoming worthy of more research.
MoE Attention
The MoE architecture applies particularly to the FFN matrices, and has become a mainstream optimization for huge trillion-parameter models. The idea very successfully reduces the number of active parameters during the computation, effectively doing dynamic sparsification of the FFN matrices, by splitting the weights into sub-parts called "experts" that focus on different elements. The adaptive choice of which experts to activate for a given input is made by an MoE "gating" component (also called a router).
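The gating idea can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the names `moe_ffn`, `gate_W`, and `experts`, the top-k selection rule, and the choice to renormalize the gate scores over only the chosen experts are all assumptions made for the example.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def expert_ffn(x, W1, W2):
    """One expert: MatMul, ReLU activation, MatMul."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def moe_ffn(x, gate_W, experts, top_k=2):
    """Route x to the top_k experts and mix their outputs by gate weight."""
    logits = matvec(gate_W, x)  # one gate score per expert
    chosen = sorted(range(len(experts)), key=lambda i: logits[i],
                    reverse=True)[:top_k]
    weights = softmax([logits[i] for i in chosen])  # renormalize over chosen experts
    outputs = [expert_ffn(x, *experts[i]) for i in chosen]  # only top_k experts run
    mixed = [sum(w * y[j] for w, y in zip(weights, outputs))
             for j in range(len(outputs[0]))]
    return mixed, chosen
```

Only the selected experts are evaluated, which is where the saving in active parameters comes from; the unselected experts' weights are never touched for this input.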
The same concept seems like it should apply to LLM self-attention, with "attention experts" each using a subset of the weights. However, early attempts to make MoE concepts work with attention modules haven't been very successful. Recent research has been showing some good results, such as Yang et al. (2025). The attention algorithm needs to be modified somewhat to make the MoE optimizations effective, but this latest research seems to be working.
References on MoE Attention
- Yuanhang Yang, Chaozheng Wang, Jing Li, 23 Oct 2025 (v2), UMoE: Unifying Attention and FFN with Shared Experts, https://arxiv.org/abs/2505.07260
- Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, and Zhang Xiong, 2022, Mixture of Attention Heads: Selecting Attention Heads per Token, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 4150–4162, Association for Computational Linguistics, https://doi.org/10.18653/v1/2022.emnlp-main.278, https://arxiv.org/abs/2210.05144
- Róbert Csordás, Piotr Piekos, Kazuki Irie, and Jürgen Schmidhuber, 30 Sep 2024, SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention, The Thirty-eighth Annual Conference on Neural Information Processing Systems, https://arxiv.org/abs/2312.07987
- Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu, Weigao Sun, and Yu Cheng, 24 November, 2024, LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training, https://arxiv.org/abs/2411.15708
- G. Blecher and S. Fine, 2023, MoEAtt: A Deep Mixture of Experts Model using Attention-based Routing Gate, 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 2023, pp. 1018-1024, doi: 10.1109/ICMLA58977.2023.00151, https://ieeexplore.ieee.org/abstract/document/10459810
FFN Fusion
FFN optimization seems to have received less research focus because it's not the major bottleneck for long-context inference. FFN compute is only linear in input length, whereas attention compute is quadratic, and FFNs don't have the memory growth problems of the KV cache. However, some good solutions for linear attention are starting to be used in major models, notably the hybrid Transformer-Mamba architecture. Hence, once both attention and FFNs are linear, it makes sense to further optimize the FFNs as well.
Flash attention to the rescue!
An important similarity of attention and FFNs is that both require two sequential matrix multiplications. If that were all, we could just merge the two matrices. But both attention and FFNs have an intervening non-linear operation:
- Attention: Softmax normalization
- FFN: activation functions (e.g., ReLU, GELU)
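A tiny numeric example shows why the non-linearity matters. The matrices below are arbitrary values chosen for illustration, and `matmul` and `relu` are small helper functions written for this sketch. Without an activation, matrix-multiplication associativity lets the two weight matrices be merged ahead of time; with ReLU in between, the merged matrix no longer gives the same answer.

```python
def matmul(A, B):
    """Plain-Python matrix multiply for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def relu(M):
    return [[max(0.0, v) for v in row] for row in M]


# Arbitrary illustrative values (not taken from any real model).
X = [[1.0, -2.0]]
W1 = [[1.0, -1.0], [0.5, 2.0]]
W2 = [[2.0, 0.0], [1.0, 1.0]]

# Purely linear case: (X W1) W2 equals X (W1 W2), so W1 W2 can be precomputed.
W_merged = matmul(W1, W2)
linear_two_step = matmul(matmul(X, W1), W2)
linear_merged = matmul(X, W_merged)

# With ReLU between the MatMuls, the precomputed W_merged gives a different result.
with_relu = matmul(relu(matmul(X, W1)), W2)

print(linear_two_step, linear_merged, with_relu)
```

Here the two linear results agree, while the ReLU version differs, because the activation zeroes out the negative intermediate value before the second multiplication.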
In theory, that non-linearity between the two operations stops any efficient merging of the two matrix operations. But that was the genius of Flash Attention: the insight was that by storing some extra information about the Softmax results (a running maximum and a running sum for each row), it is possible to carry the computation through both matrix multiplications one tile at a time. This allows "tiled MatMul" kernels that span both matrix multiplication operations.
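The running-statistics trick can be shown for a single query vector in plain Python. This is a simplified sketch of the online-Softmax idea rather than the actual Flash Attention kernel: it omits the usual 1/sqrt(d) scaling, masking, and all GPU memory-hierarchy details, and the function names and `tile` parameter are assumptions for the example.

```python
import math


def attention_reference(q, K, V):
    """Standard attention for one query: full Softmax, then weighted sum of V."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, V)) / total
            for j in range(len(V[0]))]


def attention_tiled(q, K, V, tile=2):
    """Same result, computed tile by tile with a running max and running sum."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(V[0])  # unnormalized output accumulator
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first tile).
        scale = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(exps)
        acc = [a * scale + sum(e * v[j] for e, v in zip(exps, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The tiled version never materializes the full row of Softmax scores, yet matches the reference up to floating-point rounding, which is what lets a kernel process both MatMuls in one pass over the tiles.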
Can this be done for FFNs?
Several researchers have tried to merge the two MatMul computations in FFNs. There are two basic approaches: merge two FFN modules completely (see Bercovich et al., 2025), or merge the two linear matrix operations inside a single FFN (see Hu et al., 2025). Both of these ideas are at the cutting edge of research.
References on FFN Fusion
- Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv, 24 Mar 2025, FFN Fusion: Rethinking Sequential Computation in Large Language Models, https://arxiv.org/abs/2503.18908
- Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, the optimization merges the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., ReLU or GELU) with linear functions, then merging the two matrices using matrix-multiplication associativity.)
- Minshen Zhang, Xiang Hu, Jianguo Li, Wei Wu, Kewei Tu, 7 Dec 2025, Flash Multi-Head Feed-Forward Network, https://arxiv.org/abs/2512.06989