Aussie AI Blog

FFN Fusion with Tiled Pipelined RELU

  • March 31st, 2026
  • by David Spuler, Ph.D.

FFN Fusion with RELU

We've already examined the idea of using a piecewise linear approximation of the activation function in an FFN to improve efficiency (see FFN Fusion Optimizations with Piecewise Linear Approximations). This article explores whether RELU can be used in such a way that the two matrix multiplications inside an FFN can be:

  • Tiled in both MatMuls
  • Pipelined across two MatMuls (with intervening RELU activation function)

Note that RELU is already itself a piecewise linear function, so we don't need to create an approximation for it. For negative inputs, RELU is the linear function y=0 (horizontal line), and for non-negatives it's the linear function y=x (diagonal line).

Prefill vs Decode Phases

The compute bottleneck in FFNs differs in the prefill versus decoding phase. Prefill is also called "prompt processing" and involves analyzing the input tokens without outputting any tokens, but generating the KV cache. The decoding phase follows prefill and outputs all the resulting tokens, one at a time. The basic compute characteristics of an FFN are:

  • Prefill — compute-bound
  • Decode — memory-bound

Much of the analysis below is more applicable to the memory-bound decoding phase of an FFN computation. The main memory bottleneck in decode mode MatMul kernels is the data transfer costs from moving LLM weights from the High Bandwidth Memory or HBM (i.e., from GPU VRAM) to the faster shared memory in the GPU.
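The compute-bound versus memory-bound split can be illustrated with a rough arithmetic intensity calculation. This is a minimal sketch, assuming 2-byte weights (e.g., FP16) and counting only weight traffic; the function name and the matrix dimensions are hypothetical and not taken from any specific model.

```python
def matmul_arithmetic_intensity(batch_tokens, d_in, d_out, bytes_per_weight=2):
    """FLOPs per byte of weight traffic for a (batch_tokens x d_in) @ (d_in x d_out) MatMul.

    Assumption: only weight bytes are counted; activation traffic is ignored,
    which is reasonable when the weight matrix dominates the data movement.
    """
    flops = 2 * batch_tokens * d_in * d_out         # one multiply and one add per weight per token
    weight_bytes = bytes_per_weight * d_in * d_out  # each weight is loaded once
    return flops / weight_bytes

# Hypothetical FFN up-projection shape (illustrative only).
D_MODEL, D_FF = 4096, 16384

decode_intensity = matmul_arithmetic_intensity(1, D_MODEL, D_FF)     # single-token GEMV
prefill_intensity = matmul_arithmetic_intensity(512, D_MODEL, D_FF)  # 512-token prompt GEMM

print(decode_intensity)   # 1.0 FLOP per weight byte
print(prefill_intensity)  # 512.0 FLOPs per weight byte
```

Under these assumptions, single-token decode performs only about one FLOP per weight byte loaded, so the kernel spends most of its time waiting on weight transfers, whereas prefill reuses every loaded weight across all of the prompt tokens.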

FFNs vs GLUs

Note that this discussion is about "classic FFNs" (also called a basic "MLP") that consist of three steps:

  • Up projection — MatMul into a larger dimension space.
  • RELU activation function — elementwise computation.
  • Down projection — MatMul to compress back down to the model's dimension.

So, there are two MatMuls in there.
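The three steps can be sketched in plain Python. This is a minimal illustration, assuming a row-vector convention (vector times matrix) and no bias terms; the names gemv, relu, classic_ffn, W_up and W_down are hypothetical, and the tiny integer weights are made up for the example.

```python
def gemv(v, M):
    """Row vector v (length d) times matrix M (d rows, n columns)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def relu(xs):
    """Elementwise RELU: negatives become zero, non-negatives pass through."""
    return [max(0, x) for x in xs]

def classic_ffn(v, W_up, W_down):
    hidden = gemv(v, W_up)          # up projection into the larger dimension
    activated = relu(hidden)        # elementwise activation
    return gemv(activated, W_down)  # down projection back to the model dimension

# Toy example: model dimension 2, hidden dimension 4.
v = [1, -2]
W_up = [[1, 0, -1, 2],
        [0, 1, 1, 1]]
W_down = [[1, 0],
          [0, 1],
          [1, 1],
          [-1, 2]]

hidden = gemv(v, W_up)
output = classic_ffn(v, W_up, W_down)
print(hidden)  # [1, -2, -3, 0]
print(output)  # [1, 0]
```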

The Gated Linear Unit (GLU) or "Gated MLP" is an improvement over the classic FFN. GLU variants, such as SwiGLU (Swish GLU), are now used in production by most frontier models. The sequence of a GLU looks like:

  • Up projection
  • Gate projection
  • Activation function on the gate projection (Swish in SwiGLU)
  • Gating (Hadamard product of the activated gate and the up projection)
  • Down projection

That's three matrices and three MatMul operations. There's also a fourth matrix operation called a "Hadamard product" and that's an elementwise multiply operation, which isn't the same thing as a matrix multiplication (and it's faster).
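For comparison, here is a minimal sketch of a SwiGLU-style gated FFN, assuming the usual formulation in which Swish is applied to the gate projection before the elementwise product with the up projection. Bias terms are omitted, and the function names and toy weights are hypothetical.

```python
import math

def gemv(v, M):
    """Row vector v (length d) times matrix M (d rows, n columns)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def swish(x):
    """Swish (also called SiLU): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu_ffn(v, W_gate, W_up, W_down):
    gate = [swish(g) for g in gemv(v, W_gate)]  # MatMul 1, then activation
    up = gemv(v, W_up)                          # MatMul 2
    gated = [g * u for g, u in zip(gate, up)]   # Hadamard (elementwise) product
    return gemv(gated, W_down)                  # MatMul 3

# Toy example with made-up weights.
v = [1.0, 2.0]
W_gate = [[1.0, 0.0], [0.0, 1.0]]
W_up = [[0.0, 1.0], [1.0, 0.0]]
W_down = [[1.0], [1.0]]

output = swiglu_ffn(v, W_gate, W_up, W_down)
print(output)
```

The three gemv calls correspond to the three matrices, while the Hadamard product is a single pass over the hidden vector.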

Anyway, the point is that I'm attempting to optimize the old-school FFNs, rather than the more advanced GLUs.

Simple Linear Approximation

If RELU were a single linear function, then the whole FFN computation would simplify. We could compact the two matrices into a single matrix, and only need one MatMul operation. The idea was mentioned in Hu et al (2025) and the idea applies to any linear activation function.
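The compaction relies on matrix multiplication being associative. Here is a small sketch, assuming the identity activation y=x, a row-vector convention and no bias terms; the names and toy weights are hypothetical.

```python
def gemv(v, M):
    """Row vector v (length d) times matrix M (d rows, n columns)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def matmul(A, B):
    """Plain matrix product of A (r x k) and B (k x c)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def relu(xs):
    return [max(0, x) for x in xs]

v = [1, -2]
W_up = [[1, 0, -1, 2],
        [0, 1, 1, 1]]
W_down = [[1, 0],
          [0, 1],
          [1, 1],
          [-1, 2]]

# Precompute the single compacted matrix offline (2 x 2 here).
W_combined = matmul(W_up, W_down)

two_step_linear = gemv(gemv(v, W_up), W_down)  # two MatMuls, identity activation
one_step_linear = gemv(v, W_combined)          # one MatMul with the compacted matrix
with_relu = gemv(relu(gemv(v, W_up)), W_down)  # the real FFN with RELU

print(two_step_linear)  # [-2, -5]
print(one_step_linear)  # [-2, -5]
print(with_relu)        # [1, 0]
```

The first two results agree, but the RELU result differs, which is why the compaction cannot be applied directly to a RELU-based FFN.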

Alternatively, we could optimize the FFN to use a tiled MatMul that crosses both matrices. In other words, we could keep one tile in memory, process it with the first MatMul, then the activation function, and then the same tile into the second MatMul. This would be faster by: (a) avoiding storing the intermediate results after the first MatMul back to memory, and (b) improved parallelization because the second MatMul's tile operations would not depend on the first matrix being completed first.

This "pipelined tiling" MatMul optimization is exactly what Flash Attention did for the attention module, by propagating a single tiled operation through the MatMul-Softmax-MatMul sequence. Note that the non-linear function in attention is Softmax normalization, rather than an activation function, and Flash Attention used a clever idea of storing extra information (this is different to piecewise linear approximations). For more on the analogous architectures of the attention and FFN modules, see Zhang et al (2025).

Unfortunately, RELU is not linear, so none of this works directly for a RELU-based FFN. Each of its two domain segments is individually linear, but RELU itself is only piecewise-linear. We cannot ignore this and just apply RELU to the tile output from the first matrix operation, because this is an intermediate result (a partial sum), and the final value could cross over into the other part of the domain. In other words, we could be using the wrong part of the piecewise function.
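The domain crossover problem can be seen in a tiny example. This sketch assumes that tiles split the reduction (inner) dimension of the first GEMV, so each tile produces a partial sum of the hidden vector; the tile size, names and weights are hypothetical.

```python
def gemv(v, M):
    """Row vector v (length d) times matrix M (d rows, n columns)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def relu(xs):
    return [max(0, x) for x in xs]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

v = [1, 2, -1, 1]
W_up = [[1, -1, 2], [0, 1, -1], [2, 1, 1], [1, 0, -2]]
W_down = [[1, 2], [3, 4], [5, 6]]
TILE_ROWS = 2  # each tile covers two elements of v and two rows of W_up

# Partial sums of the first GEMV, one per tile.
partials = [gemv(v[s:s + TILE_ROWS], W_up[s:s + TILE_ROWS])
            for s in range(0, len(v), TILE_ROWS)]

hidden = [0, 0, 0]
for p in partials:
    hidden = add(hidden, p)

# With a linear activation (y=x), pushing each tile straight through is exact.
linear_pipelined = [0, 0]
for p in partials:
    linear_pipelined = add(linear_pipelined, gemv(p, W_down))
linear_reference = gemv(hidden, W_down)

# With RELU applied per tile, the pipelined result is wrong.
relu_pipelined = [0, 0]
for p in partials:
    relu_pipelined = add(relu_pipelined, gemv(relu(p), W_down))
relu_reference = gemv(relu(hidden), W_down)

print(partials)                            # [[1, 1, 0], [-1, -1, -3]]
print(linear_pipelined, linear_reference)  # [-15, -18] [-15, -18]
print(relu_pipelined, relu_reference)      # [4, 6] [0, 0]
```

The first tile's partial sums are positive, so RELU lets them through, even though the completed sums are zero or negative and should have been zeroed.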

If we could guarantee that only one of the RELU domains were used, then it would effectively be linear. This shows that if all the computations were non-negative, then only the y=x line of RELU would be used, and we could compact the FFN matrices together. However, in the real world of an LLM engine, both the activations vector and the LLM weight matrix may contain both negatives and positives.

Exploring an Idea: Post-FFN Correction of RELU

So, here's the new idea:

  • Apply RELU on the interim tile outputs (intentionally allowing mistakes).
  • Compute all the tiles in a pipeline (MatMul-RELU-MatMul) or maybe even compact the matrices together.
  • Correct the domain crossover mistakes at the end.

How can we do this?

Let's examine the possible mistakes. There are two ways that the RELU computation on an interim result can be using an incorrect domain (i.e., negative versus non-negatives as inputs to RELU). The effect is to make an error with the inputs to the second MatMul computation. The two possible errors:

  • Input element was zeroed by RELU, but should have used y=x.
  • Input element was allowed via y=x but should ultimately have been zeroed.

These cause the element in the resulting vector (after RELU) to have an error. This error then propagates into the second MatMul, where it affects many output elements, because the erroneous element is multiplied against an entire row of the second matrix (in the vM row-vector convention used in this article). In other words, every element of the final vector output from the FFN can contain an error.

To fix this at the end, we need to:

  • Detect these error cases
  • Apply a correction

What are the errors? There are four cases to consider depending on the sign of the interim activations computed for a tile (from the first MatMul, before RELU) and whether the final resulting vector element is zero or positive (after the first MatMul and RELU, but before the second MatMul). These are the cases enumerated:

  • Interim tile output element positive, final result should be positive — not an error (correctly accumulated).
  • Interim tile output element negative, final result should be zero — also not an error (correctly zeroed).
  • Interim tile output element positive, final result should be zero — error because the optimized value is positive (should be zero, but this accumulates an erroneous positive value).
  • Interim tile output element negative, final result should be positive — error because the optimized tile value is zero, so we haven't accumulated this negative value (which should reduce the final positive value).
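The four cases can be written as a small classification function. This is only an illustrative sketch: the function name and labels are hypothetical, "final" means the completed sum from the first MatMul before RELU, and a zero interim value is treated as harmless because it contributes nothing either way.

```python
def classify(interim, final_preactivation):
    """Classify one tile's interim element against the completed pre-RELU sum."""
    should_be_positive = final_preactivation > 0
    if interim == 0:
        return "ok: no contribution"
    if interim > 0 and should_be_positive:
        return "ok: correctly accumulated"
    if interim < 0 and not should_be_positive:
        return "ok: correctly zeroed"
    if interim > 0 and not should_be_positive:
        return "error: spurious positive accumulated"
    return "error: negative contribution missing"

# Two tiles of interim values for four hidden elements (made-up numbers).
tile_0 = [3, -2, 4, -1]
tile_1 = [1, -1, -5, 2]
final = [a + b for a, b in zip(tile_0, tile_1)]  # [4, -3, -1, 1]

for name, tile in (("tile_0", tile_0), ("tile_1", tile_1)):
    for interim, total in zip(tile, final):
        print(name, interim, total, classify(interim, total))
```

In this example, tile_0 hits all four cases in order, while every element of tile_1 happens to be handled correctly.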

Note that it's more difficult to discern errors from the results of the second MatMul (whether tiled or final results), because it can be positive, zero or negative. Although the inputs from RELU are non-negative, the weights in the second matrix could be negative, and a negative output value is possible.

Detecting this situation doesn't seem easy. How do we track whether the interim tile elements are positive or negative, across multiple tile computations, and then compare all of those to the final output value? This seems like we have to store all of the tiles that were interim values, and then go back to compare against them, which is starting to sound like a de-optimization.

Making the correction doesn't seem that easy, either, even if we resolve the difficulty with detecting the errors. If the erroneous element from the first MatMul and RELU was incorrectly used, because it was only an interim result from a tile, then we have the wrong output from the first MatMul/RELU block. This means we've sent the wrong input value to the second MatMul in the pipelined-tiling optimization. If one of the input elements in the input vector is incorrect for GEMV (the second one), then this causes an error in all of the output vector elements. Doesn't sound very easy to fix!
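To see how a single wrong input element spreads through the second GEMV, consider this small sketch. It assumes the row-vector convention, where input element i multiplies row i of the second matrix; the names and numbers are hypothetical.

```python
def gemv(v, M):
    """Row vector v (length d) times matrix M (d rows, n columns)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

W_down = [[1, -1, 2, 0],
          [2, 1, -1, 3],
          [0, 1, 1, -2]]

correct_input = [2, 0, 5]  # element 1 should have been zeroed by RELU
wrong_input = [2, 3, 5]    # but an interim positive value of 3 slipped through

correct_output = gemv(correct_input, W_down)
wrong_output = gemv(wrong_input, W_down)
output_error = [w - c for w, c in zip(wrong_output, correct_output)]

print(correct_output)  # [2, 3, 9, -10]
print(wrong_output)    # [8, 6, 6, -1]
print(output_error)    # [6, 3, -3, 9]
```

Here every output element is wrong, and the error vector is exactly the input error (3) times row 1 of W_down. Because the second MatMul is linear, each bad input element contributes a scaled copy of one weight row to the output error.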

Splitting Positives and Negatives

The trick I'm going to explore here is to separate the computations of positives and negatives. For simplicity, we'll limit the discussion to the GEMV computation in the decoding phase of a single token (not prefill, which is a GEMM over lots of tokens) and also only for a single LLM query (not multi-user batching, which also becomes GEMM, even in decode mode). Let's consider the inputs to the first GEMV computation in the FFN:

  • Activations vector v — may contain positives and negatives (after LayerNorm).
  • Weights matrix M — can have positive and negative weights.

The final output is a vector, which can contain negatives and positives. This then goes into RELU, which will zero out the negatives, leaving only zero and positive values.

Here's the key idea: separate the inputs by sign. We do this for both the input vector and the weight matrix:

  • v — split to vpos and vneg
  • M — split to Mpos and Mneg

where:

    v = vpos + vneg

    M = Mpos + Mneg

The computation becomes:

    vM = (vpos + vneg) (Mpos + Mneg)

    = vpos Mpos + vneg Mpos + vpos Mneg + vneg Mneg

Yikes! That's four calls to GEMV instead of just one. Not to mention the extra computation and extra storage to split them into positive and negative versions. But we're going to keep going...

The result is four GEMV operations, where we know the sign of the elements of the resulting output vector:

    vpos Mpos — positive

    vneg Mpos — negative

    vpos Mneg — negative

    vneg Mneg — positive

All of these vectors could contain zeroes, but that's fine. A zero element doesn't cause any problems, and in fact, there'll be a lot of them, because each split operand is roughly 50% zeroes on average (assuming the signs are about evenly balanced).
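The split and the four sign-pure GEMVs can be checked on a toy example. This is a minimal sketch with hypothetical helper names, where the "pos" part keeps positives (zeroing the rest) and the "neg" part keeps negatives.

```python
def gemv(v, M):
    """Row vector v (length d) times matrix M (d rows, n columns)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def pos_part(xs):
    return [max(x, 0) for x in xs]

def neg_part(xs):
    return [min(x, 0) for x in xs]

v = [1, -2, 3]
M = [[1, -1],
     [2, 3],
     [-4, 5]]

v_pos, v_neg = pos_part(v), neg_part(v)
M_pos = [pos_part(row) for row in M]
M_neg = [neg_part(row) for row in M]

pp = gemv(v_pos, M_pos)  # positive times positive: non-negative
np_ = gemv(v_neg, M_pos)  # negative times positive: non-positive
pn = gemv(v_pos, M_neg)  # positive times negative: non-positive
nn = gemv(v_neg, M_neg)  # negative times negative: non-negative

recombined = [a + b + c + d for a, b, c, d in zip(pp, np_, pn, nn)]

print(pp, np_, pn, nn)         # [1, 15] [-4, -6] [-12, -1] [0, 0]
print(recombined, gemv(v, M))  # [-15, 8] [-15, 8]
```

The four partial results have known signs, and their sum reproduces the ordinary GEMV result.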

Now, we can compute the final vector to send into RELU by computing these four GEMV kernels, adding them, and sending the final result. But that's not the point here, and gains nothing. Rather, we want to compute these four results for a single tile only, and then propagate that tile into RELU and then to the second MatMul.

We have a vague idea that by storing the interim results as negative and positive, we can accumulate the totals of negatives or positives that need to be corrected. In addition to the accumulator "output tile" that has the final FFN results (albeit, with errors we've introduced), we also need:

  • Negative values accumulator tile
  • Positive values accumulator tile

At the end of computation of all the tiles, we have the total positive and negative values in these stored tiles. In this way, we can identify whether the final result (of the first MatMul) would have been positive or negative, and know whether an error needs correction.

Hence, by having two or three accumulator tiles, we have the information to go back and correct errors. In theory, the final output is correct, and this is a lossless optimization of an FFN.
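The detection half of this idea can be sketched as follows. This is an illustration under stated assumptions, not a tuned kernel: tiles split the reduction dimension of the first GEMV, the function and variable names are hypothetical, and only the detection step is shown (the correction step is left open, as in the discussion above).

```python
def gemv(v, M):
    """Row vector v (length d) times matrix M (d rows, n columns)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def pos_part(xs):
    return [max(x, 0) for x in xs]

def neg_part(xs):
    return [min(x, 0) for x in xs]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def tiled_sign_accumulators(v, M, tile_rows):
    """Return (pos_acc, neg_acc, relu_acc) after processing all tiles."""
    n = len(M[0])
    pos_acc, neg_acc, relu_acc = [0] * n, [0] * n, [0] * n
    for start in range(0, len(v), tile_rows):
        v_t, M_t = v[start:start + tile_rows], M[start:start + tile_rows]
        v_p, v_n = pos_part(v_t), neg_part(v_t)
        M_p = [pos_part(r) for r in M_t]
        M_n = [neg_part(r) for r in M_t]
        pos_tile = add(gemv(v_p, M_p), gemv(v_n, M_n))  # non-negative terms
        neg_tile = add(gemv(v_n, M_p), gemv(v_p, M_n))  # non-positive terms
        pos_acc = add(pos_acc, pos_tile)
        neg_acc = add(neg_acc, neg_tile)
        # RELU applied to the interim tile output, as in the pipelined scheme.
        relu_acc = add(relu_acc, pos_part(add(pos_tile, neg_tile)))
    return pos_acc, neg_acc, relu_acc

v = [1, 2, -1, 1]
W_up = [[1, -1, 2], [0, 1, -1], [2, 1, 1], [1, 0, -2]]

pos_acc, neg_acc, relu_acc = tiled_sign_accumulators(v, W_up, tile_rows=2)
final_preactivation = add(pos_acc, neg_acc)  # what the full first MatMul would give
exact_relu = pos_part(final_preactivation)
needs_correction = [a != e for a, e in zip(relu_acc, exact_relu)]

print(pos_acc, neg_acc)     # [2, 2, 2] [-2, -2, -5]
print(final_preactivation)  # [0, 0, -3]
print(relu_acc)             # [1, 1, 0]
print(needs_correction)     # [True, True, False]
```

The two accumulators recover the true sign of each completed element, which flags the elements whose pipelined RELU output was wrong.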

Discussion

This is an attempt at a trick of splitting positives and negatives to gain a pipelined-tiled FFN. I called this an "exploration" of a possible optimization, because maybe there's something here, but at first sight, it seems like it would be a de-optimization.

There is extra compute in this approach at both the start and end. Each kernel has to split the input tiles (segments of the input vector, tiles of the first matrix) into negatives and positives. We could pre-compute that for the matrix, but we don't want to have to transfer two copies of the entire weights matrix, so let's do it dynamically inside the kernel. There's also the extra compute to detect and correct the errors at the end. Do the extra gains from pipelined-tiling outweigh this extra cost?

This optimization actually adds extra compute in an attempt to allow pipelined-tiling, which reduces the need for storing interim results. The method can only be advantageous for:

  • Memory-bound GPU computations (where tensor core compute is faster than HBM-to-shared memory transfers)
  • Decoding phase (not prefill, which is compute-bound)

Note that the issue is not just the amount of memory bytes, but the speed of data transfer being slower than the blazingly fast tensor core computations (although transfer costs are often proportional to the number of bytes being transferred). There are GPU matrix multiplication kernels where the compute is effectively "free" because the tensor cores are waiting for data transfers from HBM into shared memory. That's when it might make sense to double the amount of computation to get memory transfer cost reductions.

Compacted Matrices Further Optimization

If the activation function were just y=x, then the two matrix multiplications could be combined into a single matrix multiplication kernel (see: Hu et al, 2025). This is a much greater optimization than the pipelined-tiling optimization of an FFN. We'd get rid of an entire MatMul!

Is this RELU optimization workable via splitting negatives and positives for this super-optimized case? It seems not, because we can't see the interim results after the first MatMul, because they've been blended into the combined matrix.

Extensions and Future Work

One wonders what the impact of simply ignoring domain crossover errors in a piecewise-linear approximation would be on the final results. In other words, making no attempt to correct things at the end. The assumption here is that it completely invalidates the resulting computation, but I haven't seen anyone analyze that in detail. Maybe such crossovers would be rare enough that the LLM still works?

The above discussion looks at pipelined-tiled GEMV kernels. The same approach should work for GEMM, in the situation where GEMM is still memory-bound (e.g., not prefill), but the details are not discussed here.

Arithmetic overflow problems may arise from splitting negatives and positives into separate arithmetic streams. When accumulated together, positive and negative terms partly offset each other, whereas the separate streams grow to large positive and large negative totals before they are later combined. Hence, numerical stability is worsened, especially in quantized arithmetic. Further analysis of these overflow difficulties is warranted.
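A toy calculation shows how much larger the separated accumulators can get. The product values below are made up for illustration, and the helper name is hypothetical.

```python
def peak_running_magnitude(values):
    """Largest absolute value reached by a running sum over the values."""
    total, peak = 0, 0
    for x in values:
        total += x
        peak = max(peak, abs(total))
    return peak

# Made-up per-element products v[i] * M[i][j] that mostly cancel out.
products = [50, -45] * 4

combined_peak = peak_running_magnitude(products)
positive_peak = peak_running_magnitude([p for p in products if p > 0])
negative_peak = peak_running_magnitude([p for p in products if p < 0])

print(combined_peak)  # 65
print(positive_peak)  # 200
print(negative_peak)  # 180
```

In this example, the combined stream never exceeds 65 in magnitude, but the separated streams reach 200 and 180, so a narrow accumulator that is safe for the combined sum could overflow on the split sums.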

Another open problem is to find a way to make this work when the two matrices are compacted into only one matrix. Perhaps there's a way to store extra information about the first matrix, which is used later, even though we no longer do a MatMul with this first matrix.

This idea only applies to RELU as a special case. Can this be generalized to piecewise linear approximations of GELU or other classic activation functions?

References

  1. Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv, 24 Mar 2025, FFN Fusion: Rethinking Sequential Computation in Large Language Models, https://arxiv.org/abs/2503.18908
  2. Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, the optimization is merging the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., RELU or GELU), with linear functions and merging the two matrices using matrix-multiplication associativity.)
  3. Tom Hubrecht, Orégane Desrentes, Florent de Dinechin, 2024, Activations in Low Precision with High Accuracy, ⟨hal-04776745⟩ https://inria.hal.science/hal-04776745v1/document
  4. Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan, 25 Mar 2026 (v2), QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations, https://arxiv.org/abs/2511.06767
  5. Chirag Ahuja, Aug 6, 2025, The Integer Trick - How Hardware Optimally Calculates GELU, https://cahuja1992.github.io/20gelu
  6. Mohammad Erfan Sadeghi, Arash Fayyazi, Seyedarmin Azizi, and Massoud Pedram. 2024. PEANO-ViT: Power-Efficient Approximations of Non-Linearities in Vision Transformers. In Proceedings of the 29th ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED '24). Association for Computing Machinery, New York, NY, USA, 1–6. https://doi.org/10.1145/3665314.3670843 https://dl.acm.org/doi/abs/10.1145/3665314.3670843
  7. Huang, J., Wu, Y., Zhuang, M., & Zhou, J. (2025). High-Precision and Efficiency Hardware Implementation for GELU via Its Internal Symmetry. Electronics, 14(9), 1825. https://doi.org/10.3390/electronics14091825, https://www.mdpi.com/2079-9292/14/9/1825
  8. T. Mohaidat, M. R. K. Khan and K. Khalil, "Curvature-Based Piecewise Linear Approximation Method of GELU Activation Function in Neural Networks," 2024 2nd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings), Mt Pleasant, MI, USA, 2024, pp. 1-5, doi: 10.1109/AIBThings63359.2024.10863181, https://ieeexplore.ieee.org/document/10863181
  9. James Robert Golden 11 Oct 2025, Equivalent Linear Mappings of Large Language Models, Transactions on Machine Learning Research (10/2025), https://openreview.net/forum?id=oDWbJsIuEp, PDF: https://openreview.net/pdf?id=oDWbJsIuEp, Project: https://github.com/jamesgolden1/equivalent-linear-LLMs/
  10. Minshen Zhang, Xiang Hu, Jianguo Li, Wei Wu, Kewei Tu, 7 Dec 2025, Flash Multi-Head Feed-Forward Network, https://arxiv.org/abs/2512.06989
