Aussie AI Blog

FFN Fusion with Piecewise Linear Approximations

  • March 3rd, 2026
  • by David Spuler, Ph.D.

What is FFN Fusion?

FFN fusion is the merging of two Feed-Forward Network (FFN) components in an LLM engine's Transformer architecture. The two main approaches in the literature are: (a) merging two FFNs together as in Bercovich et al (2025), and (b) merging the two matrix multiplication operations inside a single FFN as in Hu et al (2025).

It's the second approach of "intra-FFN fusion" that we would like to optimize with piecewise linear approximations. The FFN has three major computations in sequence:

  1. Matrix multiplication
  2. Activation function (e.g., GELU, RELU)
  3. Matrix multiplication

The two matrix multiplications are linear projections of the input activations, and we're not talking about changing them. However, the activation functions are non-linear, and we'd like to replace them with something more linear that is faster to compute. It's not just about optimizing the calculation of these activation functions, which are elementwise vector operations. The bigger issue is that their non-linearity prevents us from fully parallelizing the tiling across the two matrix operations inside the FFN.
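To make the three steps concrete, here is a minimal sketch of an FFN in plain Python. The helper names (`matvec`, `ffn`) and the tiny weight matrices are made up for illustration, and bias terms are left out to keep it short.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    """Elementwise RELU."""
    return [max(0.0, a) for a in v]

def gelu(v):
    """Elementwise exact GELU: x * Phi(x), with the Gaussian CDF via erf."""
    return [0.5 * a * (1.0 + math.erf(a / math.sqrt(2.0))) for a in v]

def ffn(W1, W2, x, activation=relu):
    hidden = matvec(W1, x)        # 1. first matrix multiplication
    hidden = activation(hidden)   # 2. elementwise non-linear activation
    return matvec(W2, hidden)     # 3. second matrix multiplication

# Hypothetical toy weights: 2 inputs, 3 hidden units, 2 outputs.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 1.0]]
W2 = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
x = [2.0, 1.0]
y = ffn(W1, W2, x)
```

Note that step 2 needs the complete result of step 1 for each hidden unit before it can run, which is the dependency the rest of this discussion is about.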

Non-Linear is Good and Bad

There are multiple non-linear components in the Transformer architecture including Softmax normalization in attention modules, LayerNorm normalizations, and the activation functions in FFNs. This discussion is about getting rid of non-linear functions for improved efficiency, but we need to first acknowledge their power. The basic idea of why there are non-linear components is to make the LLM smarter. If we used only linear functions, the LLMs would be too simplistic. Hence, the trade-off is basically:

  • Linear — faster, easy to parallelize.
  • Non-linear — greater representational capability (i.e., intelligence)

If you remove the non-linear components from an LLM, it won't be as accurate. However, the question arises whether we can trade off some of the non-linear capabilities, so that it can go a bit faster (albeit with less smartness).

Why Pipelined Tiling?

The idea of this optimization is to "pipeline" a single "tile" in a matrix multiplication. Now, tiled matrix multiplication is a very well-known optimization, which you know if you've ever been to a GPU party. However, that's for just one operation, and it is a great speedup. The real trick is that we want to do tiling across two sequential matrix multiplications. There are two components of the Transformer layers where there is a paired sequence of GEMM or GEMV operations:

  • Attention modules: MatMul — Softmax — MatMul
  • FFNs: MatMul — activation function (RELU/GELU) — MatMul

This structural similarity has been noted in Zhang et al (2025). Both have three inputs and two matrix multiplication operations.

The genius of Flash Attention has fixed this bottleneck for attention modules. The key innovation was a way to store some extra information about Softmax, so that it would be used at the end of a pipelined sequence for a tile, rather than in the middle.

Why do we care?

The problem with a naive computation of a paired matrix multiplication sequence is two-fold:

  • Store the intermediate matrix results
  • Delayed parallelization bottleneck

Each tile in the second matrix multiplication has to wait for the data from not just its single tile, but for all the other computations. We can't get started on the second one until the first one completes and does its non-linear thing.

Tiled Pipelined Attention Modules

Before Flash Attention, it used to be that the attention module had to perform its first MatMul, then compute Softmax on that intermediate result, and then send it to the second MatMul. That means we need extra memory for the intermediate matrix, and also that the second MatMul is blocked waiting for the first one to finish (and for Softmax). The idea with Flash Attention is that for each rectangular tile of the matrices, we pass it through both matrix computations for just that one tile, and then fix up the Softmax stuff at the end. This improves efficiency because:

  • No intermediate tile is stored
  • Each tile is fully parallelizable (without any dependencies).

This allows us to parallelize the whole computation by pipelining the compute for each tile through to the end.
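The "fix up the Softmax stuff at the end" idea can be sketched for a single query vector. This is a simplified illustration of the running-maximum and running-sum bookkeeping, not the actual Flash Attention kernel: the function names are hypothetical, the usual scaling of the scores is omitted, and everything runs sequentially in plain Python.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def attention_reference(q, K, V):
    """Unfused version: all scores, then Softmax, then the second product."""
    scores = [dot(q, k) for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]

def attention_tiled(q, K, V, tile=2):
    """Push each key/value tile through both products, then normalize once."""
    run_max = float("-inf")   # running maximum score seen so far
    run_sum = 0.0             # running Softmax denominator
    acc = [0.0] * len(V[0])   # running unnormalized output
    for start in range(0, len(K), tile):
        scores = [dot(q, k) for k in K[start:start + tile]]
        new_max = max(run_max, max(scores))
        scale = math.exp(run_max - new_max)   # rescales the earlier tiles
        weights = [math.exp(s - new_max) for s in scores]
        run_sum = run_sum * scale + sum(weights)
        for j in range(len(acc)):
            acc[j] = acc[j] * scale + sum(
                w * v[j] for w, v in zip(weights, V[start:start + tile]))
        run_max = new_max
    return [a / run_sum for a in acc]   # the Softmax fix-up at the very end

# Hypothetical toy data: 5 keys/values of dimension 2.
q = [1.0, 0.5]
K = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 3.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
```

The full score vector is never stored here, only one tile of scores at a time plus two running scalars and the output accumulator.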

Tiled Pipelined FFN?

FFNs are a pair of sequential matrix multiplications with an intervening non-linear operation. Hence, they have the same problems as the attention module. Each tile must complete the computation of the first matrix multiplication operation. After this, it's possible to run the non-linear activation function, and then the second matrix computation.

There are various avenues of parallelization, so it's not a disaster, and in fact, it's how LLMs run today. The question is whether we can do better by somehow linearizing the activation functions. There are two papers that have attempted this, Hu et al (2025) and Zhang et al (2025), using different approaches for a "Flash FFN" concept.

Simple Linear Approximations

Simple linear approximations are possible for RELU and GELU. Each is a single linear equation over the entire range of input values. In practice, this can be used in quantized kernels. It does introduce some inaccuracy, but quantization is already trading off some accuracy anyway.

This situation is fortuitous and allows us to pipeline the FFN computations. Each tile can compute the MatMul-Activation Function-MatMul triple sequence all the way through to the end, where it is accumulated in an "output tile" that must have other computations added. Nevertheless, this sequence avoids the need to store intermediate values after the first MatMul, and means that each tile can complete the second matrix multiplication without waiting on any other tile. Less memory and more parallelizability.
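Here is a small sketch of that pipelining, under the assumption that the activation has been replaced by a pure scaling (`slope * h`, with no additive constant). Each tile of the input dimension goes through the first product, the linear "activation", and the second product, and is then added into the output. The function names and the integer toy data are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn_linear_reference(W1, W2, x, slope):
    """Unfused version: the whole hidden vector is computed and stored first."""
    hidden = [slope * h for h in matvec(W1, x)]
    return matvec(W2, hidden)

def ffn_linear_pipelined(W1, W2, x, slope, tile=2):
    """Each input tile flows through both products with no waiting on other tiles."""
    out = [0] * len(W2)
    for start in range(0, len(x), tile):
        x_tile = x[start:start + tile]
        w1_tile = [row[start:start + tile] for row in W1]
        partial = matvec(w1_tile, x_tile)           # partial first MatMul
        partial = [slope * h for h in partial]      # linear "activation"
        contribution = matvec(W2, partial)          # second MatMul on the partial
        out = [o + c for o, c in zip(out, contribution)]  # accumulate output tile
    return out

# Hypothetical toy data: 4 inputs, 3 hidden units, 2 outputs.
W1 = [[1, -2, 3, 0], [2, 1, -1, 4], [-3, 0, 2, 1]]
W2 = [[1, 2, 3], [0, -1, 1]]
x = [2, 1, -1, 3]
```

This only works because scaling distributes over the partial sums. If the linear approximation had an additive constant, it would need to be added exactly once rather than once per tile, which this sketch leaves out.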

In fact, as noted in Hu et al (2025), a linear activation function makes things even simpler. You can merge the two matrices together using associativity of matrix multiplication to get a single pre-computed matrix, and it's even more efficient. In practice, this only works if you use a single global linear approximation of GELU, and this loses a lot of model accuracy. However, it motivates further research into using piecewise linear approximation of GELU.
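As a rough illustration of the merging idea, suppose the activation is replaced by a single global linear function `slope * h + offset`. Then W2 (slope * W1 x + offset) can be rearranged into one pre-computed matrix plus one pre-computed bias vector. The sketch below uses hypothetical names and toy integer data, and is not the method of Hu et al (2025) in any detail.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def ffn_two_step(W1, W2, x, slope, offset):
    """Unmerged version: MatMul, linear activation, MatMul."""
    hidden = [slope * h + offset for h in matvec(W1, x)]
    return matvec(W2, hidden)

def merge_ffn(W1, W2, slope, offset):
    """Pre-compute a single matrix and bias, done once ahead of inference."""
    merged = [[slope * v for v in row] for row in matmul(W2, W1)]
    bias = [offset * sum(row) for row in W2]   # W2 applied to the constant vector
    return merged, bias

def ffn_merged(merged, bias, x):
    """Inference now needs only one matrix-vector product."""
    return [m + b for m, b in zip(matvec(merged, x), bias)]

# Hypothetical toy data: 4 inputs, 3 hidden units, 2 outputs.
W1 = [[1, -2, 3, 0], [2, 1, -1, 4], [-3, 0, 2, 1]]
W2 = [[1, 2, 3], [0, -1, 1]]
x = [2, 1, -1, 3]
merged, bias = merge_ffn(W1, W2, slope=2, offset=5)
```

The merged matrix has the shape of outputs by inputs, so the hidden dimension disappears from the inference-time computation entirely.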

Additive Offsets Don't Work

Could we ensure linearity by making sure the function only used its linear computation? For example, with RELU, it's a linear function for non-negative values.

Since we know that we need RELU to never go negative for linearity, can we ensure that all RELU computations are on a non-negative number? In fact, yes, we can, but it doesn't work well. We could find the minimum value in the activations and the weights matrix, and add a large enough number to all the numbers in both, thereby creating a positive-only vector and matrix. We could then follow through and compute GEMM-RELU-GEMM, and reverse the additive offsets at the end.

So, technically, we have ensured that the RELU computation is always linear, and we can therefore further optimize the whole FFN on this assumption (i.e., to pipeline the tiles of the GEMM-RELU-GEMM sequence). But the mathematics of this trick is not correct: the output results are not identical to the original FFN, and we've literally lost all of the power of the non-linearity of RELU. In effect, we've never executed the second path of RELU (on negatives), and this makes our LLM much less accurate.
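A tiny numeric sketch shows what goes wrong. It assumes one particular way of doing the trick: shift the input vector and the first weight matrix by a constant so that everything is non-negative, run GEMM-RELU-GEMM, and subtract the known offset terms at the end. The names and toy integers are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0, a) for a in v]

def ffn_true(W1, W2, x):
    """The real FFN: RELU zeroes out the negative hidden values."""
    return matvec(W2, relu(matvec(W1, x)))

def ffn_offset_trick(W1, W2, x):
    """Shift inputs and weights to be non-negative, then undo the shift at the end."""
    lowest = min(min(x), min(min(row) for row in W1))
    c = max(0, -lowest)
    x_shift = [xi + c for xi in x]
    W1_shift = [[w + c for w in row] for row in W1]
    hidden = relu(matvec(W1_shift, x_shift))   # RELU never sees a negative here
    shifted_out = matvec(W2, hidden)
    # Offset terms that the shift added to each hidden value.
    n = len(x)
    extra = [c * sum(row) + c * sum(x) + n * c * c for row in W1]
    correction = matvec(W2, extra)
    return [s - k for s, k in zip(shifted_out, correction)]

def ffn_no_activation(W1, W2, x):
    """A purely linear FFN with the activation deleted."""
    return matvec(W2, matvec(W1, x))

# Hypothetical toy data: the third hidden value is negative before RELU.
W1 = [[1, -2], [1, 1], [-1, 1]]
W2 = [[1, 2, 3], [0, -1, 1]]
x = [2, 1]
```

On this data the offset version returns the same output as the FFN with no activation at all, and a different output from the real FFN, because RELU has been reduced to the identity function.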

Piecewise Linear Approximations in FFNs

There has been some research into using linear approximations of the activation function (e.g., RELU or GELU), to speed up the two matrix multiplications in the FFN. Somewhat amusingly, we should first note that RELU is itself a piecewise linear function with two segments, so it isn't really an "approximation" at all. GELU, however, is far more non-linear, and there are various research papers on linear approximations of it. The mathematics gets much more complicated with two or more segments in a piecewise linear function, but it has been attempted, such as in Hu et al (2025).
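For a feel of what a piecewise linear GELU looks like, here is a minimal sketch that interpolates linearly between a handful of breakpoints. The breakpoints are an arbitrary choice for illustration and are not taken from any of the referenced papers, which pick their segments far more carefully.

```python
import math
from bisect import bisect_right

def gelu(x):
    """Exact GELU: x * Phi(x), with the Gaussian CDF via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Hypothetical breakpoints, denser near zero where GELU curves the most.
KNOTS = [-4.0, -3.0, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0, 4.0]
VALUES = [gelu(k) for k in KNOTS]

def gelu_piecewise(x):
    """Piecewise linear approximation: one linear segment per pair of knots."""
    if x <= KNOTS[0]:
        return 0.0            # far-left tail behaves like the constant zero
    if x >= KNOTS[-1]:
        return x              # far-right tail behaves like the identity
    i = bisect_right(KNOTS, x) - 1
    x0, x1 = KNOTS[i], KNOTS[i + 1]
    y0, y1 = VALUES[i], VALUES[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Measure the worst-case error on a sample grid from -6 to 6.
grid = [i / 100.0 for i in range(-600, 601)]
max_error = max(abs(gelu(x) - gelu_piecewise(x)) for x in grid)
```

Every extra segment reduces `max_error`, but also adds another domain whose slope and intercept must be selected per element, which is where the trouble starts for tiled computation.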

The problem is that we have two or more domains for the linear approximations, with different computations. As we're doing each tile, we can't tell which domain the final result will fall into, so we don't know which computation to use. If we just use whatever value has been accumulated so far, then that's incorrect. There would be many situations where the accumulated value would increase (or decrease) across the domain boundaries, meaning that we've used the wrong domain for the activation function's piecewise linear approximation. The results would be unpredictable.
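The failure is easy to reproduce with the simplest piecewise linear function, RELU, and just two tiles. In this hypothetical sketch, `tile_sums` holds each tile's partial contribution to a single hidden value, and the "streamed" version picks the segment from whatever has been accumulated so far.

```python
def slope_for(value):
    """Two-segment piecewise linear function (RELU): slope 0 below zero, else 1."""
    return 0.0 if value < 0 else 1.0

def activation_exact(tile_sums):
    """Correct answer: choose the segment only after the full sum is known."""
    return max(0.0, sum(tile_sums))

def activation_streamed(tile_sums):
    """Pipelined attempt: choose the segment from the running partial sum."""
    running = 0.0
    out = 0.0
    for partial in tile_sums:
        running += partial
        out += slope_for(running) * partial   # commits to a segment too early
    return out

crossing = [3.0, -5.0]   # running sum goes 3, then -2: crosses the boundary
same_side = [1.0, 2.0]   # running sum stays positive throughout
```

When the running sum never crosses a boundary, the two versions agree. When it does cross, the first tile has already been pushed through the wrong segment, and there is no cheap way to take it back.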

We can consider some workarounds that try to correct the errors, such as:

    (a) trying to keep two sets of the tiled computations (for two domains of a piecewise linear approximation), or

    (b) detecting the situation and redoing the computation later.

None of these ideas seem to be worthwhile. But it's interesting!

References

  1. Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv, 24 Mar 2025, FFN Fusion: Rethinking Sequential Computation in Large Language Models, https://arxiv.org/abs/2503.18908
  2. Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, the optimization is merging the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., RELU or GELU), with linear functions and merging the two matrices using matrix-multiplication associativity.)
  3. Tom Hubrecht, Orégane Desrentes, Florent de Dinechin, 2024, Activations in Low Precision with High Accuracy, ⟨hal-04776745⟩ https://inria.hal.science/hal-04776745v1/document
  4. Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan, 25 Mar 2026 (v2), QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations, https://arxiv.org/abs/2511.06767
  5. Chirag Ahuja, Aug 6, 2025, The Integer Trick - How Hardware Optimally Calculates GELU, https://cahuja1992.github.io/20gelu
  6. Mohammad Erfan Sadeghi, Arash Fayyazi, Seyedarmin Azizi, and Massoud Pedram. 2024. PEANO-ViT: Power-Efficient Approximations of Non-Linearities in Vision Transformers. In Proceedings of the 29th ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED '24). Association for Computing Machinery, New York, NY, USA, 1–6. https://doi.org/10.1145/3665314.3670843 https://dl.acm.org/doi/abs/10.1145/3665314.3670843
  7. Huang, J., Wu, Y., Zhuang, M., & Zhou, J. (2025). High-Precision and Efficiency Hardware Implementation for GELU via Its Internal Symmetry. Electronics, 14(9), 1825. https://doi.org/10.3390/electronics14091825, https://www.mdpi.com/2079-9292/14/9/1825
  8. T. Mohaidat, M. R. K. Khan and K. Khalil, "Curvature-Based Piecewise Linear Approximation Method of GELU Activation Function in Neural Networks," 2024 2nd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings), Mt Pleasant, MI, USA, 2024, pp. 1-5, doi: 10.1109/AIBThings63359.2024.10863181, https://ieeexplore.ieee.org/document/10863181
  9. James Robert Golden 11 Oct 2025, Equivalent Linear Mappings of Large Language Models, Transactions on Machine Learning Research (10/2025), https://openreview.net/forum?id=oDWbJsIuEp, PDF: https://openreview.net/pdf?id=oDWbJsIuEp, Project: https://github.com/jamesgolden1/equivalent-linear-LLMs/
  10. Minshen Zhang, Xiang Hu, Jianguo Li, Wei Wu, Kewei Tu, 7 Dec 2025, Flash Multi-Head Feed-Forward Network, https://arxiv.org/abs/2512.06989
