Aussie AI Blog

LLM Attention and FFN Optimization are Opposites

  • March 22nd, 2026
  • by David Spuler, Ph.D.

Attention versus FFN Optimization

The two main types of compute components in Transformer engines for LLM inference are attention modules and Feed-Forward Networks (FFNs). The two have quite different compute bottleneck profiles:

  • Attention: compute is quadratic in input length; KV cache memory grows linearly with input length.
  • FFN: compute is linear in input length; memory usage does not grow with input length.
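The scaling difference can be made concrete with a rough cost model. The sketch below is only a back-of-the-envelope estimate for one layer: the function names and the default of 2 bytes per value are assumptions for illustration, and it ignores the Q/K/V/output projections (which are linear in sequence length), multi-head details, and KV cache sharing schemes.

```python
def attention_cost(seq_len, d_model, bytes_per_value=2):
    """Rough per-layer cost of full self-attention over seq_len tokens."""
    # QK^T and (scores x V) each need about seq_len * seq_len * d_model
    # multiply-adds, counted here as 2 FLOPs each.
    flops = 2 * 2 * seq_len * seq_len * d_model
    # The KV cache keeps one key vector and one value vector per token.
    kv_cache_bytes = 2 * seq_len * d_model * bytes_per_value
    return flops, kv_cache_bytes


def ffn_cost(seq_len, d_model, d_ff):
    """Rough per-layer cost of a two-matrix FFN over seq_len tokens."""
    # Each token passes through d_model -> d_ff and d_ff -> d_model.
    flops = 2 * 2 * seq_len * d_model * d_ff
    # The FFN keeps no per-token state between decoding steps.
    return flops, 0


if __name__ == "__main__":
    for n in (1024, 2048, 4096):
        print(n, attention_cost(n, 4096), ffn_cost(n, 4096, 16384))
```

Doubling the sequence length quadruples the attention FLOPs and doubles the KV cache size, but only doubles the FFN FLOPs.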

Because of these different characteristics, the optimization methods are distinct.

  • Attention optimizations: linear attention, hybrid Mamba attention, memory-efficient attention (e.g., Flash Attention, Paged Attention), KV cache compression, and many more methods.
  • FFN optimizations: Mixture-of-Experts (MoE), matrix compute optimizations, kernel fusion.

There are also many higher-level techniques that work above both of these: quantization (of weights, activations, and/or KV cache data), multi-token prediction, prefill-decode disaggregation, and many more.

But what's missing?

Matching Research Optimizations

Ironically, there are two cases where a mainstream optimization for one component is still only a research area for the other:

  • Attention Mixture-of-Experts (MoE), analogous to MoE for FFNs.
  • FFN fusion, analogous to Flash Attention.

Both of these areas seem on the cusp of becoming worthy of more research.

MoE Attention

The MoE architecture applies primarily to the FFN matrices, and has become a mainstream optimization for huge trillion-parameter models. The idea very successfully reduces the number of active parameters during computation, effectively performing dynamic sparsification of the FFN matrices by splitting the weights into sub-parts called "experts" that focus on different elements. The adaptive choice of which experts to activate for a given input is made by an MoE "gating" component.
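As a minimal sketch of the gating idea, the following pure-Python code routes one input vector to its top-k experts and mixes their outputs. Everything here is an illustrative assumption rather than any particular model's design: the function names, the ReLU experts, the linear gate, and the choice to renormalize the gate scores over only the selected experts.

```python
import math


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relu_ffn(expert, x):
    """One expert: a small FFN given as a (w1, w2) pair of weight matrices."""
    w1, w2 = expert
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden)


def moe_ffn(x, gate_weights, experts, top_k=2):
    """Run only the top_k experts chosen by a linear gate; others are skipped."""
    scores = matvec(gate_weights, x)  # one gate logit per expert
    ranked = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]
    probs = softmax([scores[i] for i in chosen])  # renormalize over chosen experts
    out = [0.0] * len(x)
    for p, i in zip(probs, chosen):
        y = relu_ffn(experts[i], x)  # only active experts cost any compute
        out = [o + p * v for o, v in zip(out, y)]
    return out, chosen
```

With top_k much smaller than the number of experts, most expert weights are never touched for a given input, which is where the savings come from.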

The same concept seems like it should apply to LLM self-attention, with "attention experts" each using a subset of the attention weights. However, earlier attempts to make MoE work with attention modules were not very successful. Recent research is showing better results, such as Yang et al. (2025). The attention algorithm needs to be modified somewhat to make the MoE optimizations effective, but this latest research seems to be working.

References on MoE Attention

  1. Yuanhang Yang, Chaozheng Wang, Jing Li, 23 Oct 2025 (v2), UMoE: Unifying Attention and FFN with Shared Experts, https://arxiv.org/abs/2505.07260
  2. Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, and Zhang Xiong, 2022, Mixture of Attention Heads: Selecting Attention Heads per Token, In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 4150–4162, Association for Computational Linguistics, https://doi.org/10.18653/v1/2022.emnlp-main.278, https://arxiv.org/abs/2210.05144
  3. Róbert Csordás, Piotr Piekos, Kazuki Irie, and Jürgen Schmidhuber, 30 Sep 2024, SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention, In The Thirty-eighth Annual Conference on Neural Information Processing Systems, https://arxiv.org/abs/2312.07987
  4. Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu, Weigao Sun, and Yu Cheng, 24 November, 2024, LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training, https://arxiv.org/abs/2411.15708
  5. G. Blecher and S. Fine, 2023, MoEAtt: A Deep Mixture of Experts Model using Attention-based Routing Gate, 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 2023, pp. 1018-1024, doi: 10.1109/ICMLA58977.2023.00151, https://ieeexplore.ieee.org/abstract/document/10459810

FFN Fusion

FFN optimization seems to have received less research focus because it's not a major bottleneck for long-context inference. FFN compute is only linear in sequence length, whereas attention compute is quadratic, and FFNs don't have the memory growth problems of the KV cache. However, some good solutions for linear attention are starting to be used in major models, notably the hybrid Transformer-Mamba architecture. Hence, if both attention and FFNs are linear, it makes sense to further optimize the FFNs as well.

Flash attention to the rescue!

An important similarity between attention and FFNs is that both require two sequential matrix multiplications. If that were all, we could just merge the two into a single operation, because matrix multiplication is associative. But attention and FFNs both have an intervening non-linear operation:

  • Attention: Softmax normalization
  • FFN: activation functions (e.g., ReLU, GELU, etc.)
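A tiny numeric example shows both halves of this argument. The 2x2 weights below are made-up values for illustration only: without an activation, the two matrices can be pre-multiplied into one; with a ReLU in between, the merged matrix gives a different answer.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(v):
    return [max(0.0, x) for x in v]


# Hypothetical weights and input, chosen so the arithmetic is easy to follow.
W1 = [[1.0, -2.0], [3.0, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 1.0]]
x = [1.0, 1.0]

# No activation: W2 (W1 x) equals (W2 W1) x, so one merged matrix suffices.
two_step = matvec(W2, matvec(W1, x))   # [2.0, 5.0]
merged = matvec(matmul(W2, W1), x)     # [2.0, 5.0]

# With ReLU in between, the negative hidden value is zeroed first,
# so the pre-multiplied matrix no longer reproduces the result.
with_relu = matvec(W2, relu(matvec(W1, x)))  # [4.0, 4.0]
```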

In theory, the non-linearity between the two operations prevents any efficient merging of the two matrix operations. The genius of Flash Attention was the insight that, by tracking a small amount of extra information about the Softmax (a running maximum and a running sum for each row), partial results can be rescaled and carried through both matrix multiplications. This allows "tiled MatMul" kernels that span both matrix multiplication operations without storing the full intermediate score matrix.
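The running-statistics idea can be sketched for a single query vector. This is a minimal pure-Python illustration of the "online Softmax" trick, assuming one query, no 1/sqrt(d) scaling, no masking, and no batching; the function names are invented for this example, and it is not the actual Flash Attention GPU kernel.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_naive(q, keys, values):
    """Reference version: computes all scores, then Softmax, then the weighted sum."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def attention_tiled(q, keys, values, tile=2):
    """Processes keys/values one tile at a time, keeping only running statistics."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized output accumulator
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        # On the first tile this is exp(-inf) == 0.0, which is harmless.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The tiled version never holds more than one tile of scores at a time, yet matches the reference result up to floating-point rounding.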

Can this be done for FFNs?

Several researchers have tried to merge the two MatMul computations in FFNs. There are two basic approaches: merge two FFN modules completely (see Bercovich et al., 2025), or merge the two linear matrix operations inside a single FFN (see Hu et al., 2025). Both of these ideas are at the cutting edge of research.
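To give a feel for why tiling across both FFN MatMuls is plausible, here is a generic sketch, not the method of any of the cited papers. It assumes a plain two-matrix ReLU FFN with no bias or gating, and the function names are hypothetical. Because the activation is element-wise, each tile of hidden units can be computed, activated, and immediately folded into the output, so the full hidden vector is never stored and no Softmax-style rescaling is needed.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ffn_reference(w1, w2, x):
    """Standard FFN: full hidden vector, then the second MatMul."""
    hidden = [max(0.0, dot(row, x)) for row in w1]
    return [dot(row, hidden) for row in w2]


def ffn_tiled(w1, w2, x, tile=2):
    """Same result, computed one tile of hidden units at a time."""
    out = [0.0] * len(w2)
    for start in range(0, len(w1), tile):
        # First MatMul plus activation, for this tile of hidden units only.
        h_tile = [max(0.0, dot(row, x)) for row in w1[start:start + tile]]
        # Second MatMul: add this tile's contribution to every output element.
        for i, row in enumerate(w2):
            out[i] += dot(row[start:start + tile], h_tile)
    return out
```

Here w1 has one row per hidden unit and w2 has one row per output element, so the tile index runs over rows of w1 and columns of w2.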

References on FFN Fusion

  1. Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv, 24 Mar 2025, FFN Fusion: Rethinking Sequential Computation in Large Language Models, https://arxiv.org/abs/2503.18908
  2. Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, the optimization merges the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., ReLU or GELU) with linear functions, and then merging the two matrices using matrix-multiplication associativity.)
  3. Minshen Zhang, Xiang Hu, Jianguo Li, Wei Wu, Kewei Tu, 7 Dec 2025, Flash Multi-Head Feed-Forward Network, https://arxiv.org/abs/2512.06989
