Aussie AI
FFN Optimization
Last Updated 18 April, 2026
by David Spuler, Ph.D.
FFN Optimization: Book Excerpts and Blog Articles
Free online book excerpts, with full-text chapters and free PDF downloads, plus related articles from the Aussie AI blog:
- David Spuler, Ph.D., March 31st, 2026, FFN Fusion with Tiled Pipelined RELU, Aussie AI Blog, https://www.aussieai.com/blog/ffn-tiled-pipelined-relu
- David Spuler, Ph.D., March 22nd, 2026, LLM Attention and FFN Optimization are Opposites, Aussie AI Blog, https://www.aussieai.com/blog/attention-ffn-llm-optimize
- David Spuler, Ph.D., September 29, 2025, Promising LLM Inference Optimization Research, Aussie AI Blog, https://www.aussieai.com/blog/promising-llm-inference-optimization
- David Spuler, Ph.D., Feb 6th, 2026 (updated), 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
- David Spuler, Ph.D., April 18th, 2026, What is Prefill?, Aussie AI Blog, https://www.aussieai.com/blog/what-is-prefill
Research on FFN Optimization
Research papers include:
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
- Jie Tang; Shuai Wang; Song Chen; Yi Kang, May 2024, DP-FFN: Block-Based Dynamic Pooling for Accelerating Feed-Forward Layers in Transformers, 2024 IEEE International Symposium on Circuits and Systems (ISCAS), https://ieeexplore.ieee.org/abstract/document/10558119
- Amin Aminifar, Baichuan Huang, Azra Abtahi, Amir Aminifar, May 2024, Lightweight Inference for Forward-Forward Algorithm, https://whubaichuan.github.io/data/LightFF.pdf
- Nils Graef, Matthew Clapp, Andrew Wasielewski, 12 Jul 2024, Flash normalization: fast RMSNorm for LLMs, https://arxiv.org/abs/2407.09577 Code: https://huggingface.co/open-machine/FlashNorm
- Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai, 16 Oct 2024, EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference, https://arxiv.org/abs/2410.12247
- Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Itay Levy, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv, 28 Nov 2024, Puzzle: Distillation-Based NAS for Inference-Optimized LLMs, NVIDIA Research, https://arxiv.org/abs/2411.19146 (This is dynamic NAS on a vast scale, with a search space of size 10^138, because the optimization is applied at low granularity to each block in the attention and FFN subcomponents of every layer.)
- Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
- Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, the optimization merges the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., RELU or GELU) with linear functions, then merging the two matrices via matrix-multiplication associativity.)
- Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, et al. (additional authors not shown), 7 May 2024 (v1), last revised 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, DeepSeek-AI, https://arxiv.org/abs/2405.04434 (Introduces MLA (multi-head latent attention) and FFN optimizations, amongst other advances.)
- Asif Razzaq, March 29, 2025, NVIDIA AI Researchers Introduce FFN Fusion: A Novel Optimization Technique that Demonstrates How Sequential Computation in Large Language Models LLMs can be Effectively Parallelized, https://www.marktechpost.com/2025/03/29/nvidia-ai-researchers-introduce-ffn-fusion-a-novel-optimization-technique-that-demonstrates-how-sequential-computation-in-large-language-models-llms-can-be-effectively-parallelized/
- Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv, 24 Mar 2025, FFN Fusion: Rethinking Sequential Computation in Large Language Models, https://arxiv.org/abs/2503.18908
- Isaac Gerber, 10 May 2025, Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models, https://arxiv.org/abs/2505.06633
- Ruediger Ehlers, 2 Aug 2017 (v3), Formal Verification of Piece-Wise Linear Feed-Forward Neural Networks, https://arxiv.org/abs/1705.01320
- Razvan Pascanu, Guido Montufar, Yoshua Bengio, 14 Feb 2014 (v5), On the number of response regions of deep feed forward networks with piece-wise linear activations, https://arxiv.org/abs/1312.6098
- Shashank Sonkar, Richard G. Baraniuk, 25 May 2023 (v2), Investigating the Role of Feed-Forward Networks in Transformers Using Parallel Attention and Feed-Forward Net Design, https://arxiv.org/abs/2305.13297
- James Pan, Guoliang Li, 27 Jun 2025, A Survey of LLM Inference Systems, https://arxiv.org/abs/2506.21901
- Yao Xu, Mingyu Xu, Fangyu Lei, Wangtao Sun, Xiangrong Zeng, Bingning Wang, Guang Liu, Shizhu He, Jun Zhao, Kang Liu, 22 May 2025, Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN, https://arxiv.org/abs/2505.17153 https://anonymous.4open.science/r/Shift-FFN (FFN computations based on the diff between the current and prior token's embeddings.)
- Samyak Jha, Junho Kim, 30 Jan 2026, CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models, https://arxiv.org/abs/2602.00247 (FFN linear approximations for visual tokens; also using Hadamard products.)
- Aayush Gautam, Mukul Gagrani, Junyoung Park, Mingu Lee, Chris Lott, Narasimha Reddy, 30 Jan 2026, Fast Forward: Accelerating LLM Prefill with Predictive FFN Sparsity, https://arxiv.org/abs/2602.00397
- Ojasva Nema, Kaustubh Sharma, Aditya Chauhan, Parikshit Pareek, 5 Feb 2026, Structural Disentanglement in Bilinear MLPs via Architectural Inductive Bias, https://arxiv.org/abs/2602.05635
- Yuanhang Yang, Chaozheng Wang, Jing Li, 23 Oct 2025 (v2), UMoE: Unifying Attention and FFN with Shared Experts, https://arxiv.org/abs/2505.07260 (Using MoE experts across both attention and FFN modules.)
- Zekai Li, Jintu Zheng, Ji Liu, Han Liu, Haowei Zhu, Zeping Li, Fuwei Yang, Haiduo Huang, Jinzhang Peng, Dong Li, Lu Tian, Emad Barsoum, 16 Dec 2024, FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing, https://arxiv.org/abs/2412.11494 https://arxiv.org/pdf/2412.11494
- Guo-Hao Xu, Jingzhen Ding, Huping Ding, Zhao Xu, Kaifu Zhang, Sep 2024, FTP: Efficient Prefilling for Long-Context LLM Inference via FFN Token Pruning, https://openreview.net/forum?id=fL8Zp8o6RL PDF: https://openreview.net/pdf?id=fL8Zp8o6RL
- Minshen Zhang, Xiang Hu, Jianguo Li, Wei Wu, Kewei Tu, 7 Dec 2025, Flash Multi-Head Feed-Forward Network, https://arxiv.org/abs/2512.06989
- M Shojaei, Jul 24, 2025, SwiGLU: The FFN Upgrade I Use to Get Free Performance, https://dev.to/mshojaei77/swiglu-the-ffn-upgrade-i-use-to-get-free-performance-33jc
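The matrix-merging trick described in the Hu et al. entry above (Partially Linear Feed-Forward Network) can be illustrated with a toy NumPy sketch. This is not the paper's implementation: it simply assumes the activation between the two FFN MatMuls has been approximated by a linear function act(z) = a*z + c over the relevant region, at which point associativity folds both MatMuls into one merged matrix plus a constant bias:

```python
import numpy as np

# Toy sketch of folding an FFN's two MatMuls into one, assuming the
# activation has been linearized: act(z) = a*z + c. Then
#   y = act(x @ W1) @ W2 = x @ (a * W1 @ W2) + c * colsum(W2)
rng = np.random.default_rng(0)
d, h = 4, 8                        # model dim, FFN hidden dim
x = rng.standard_normal((1, d))    # one token embedding
W1 = rng.standard_normal((d, h))   # up-projection
W2 = rng.standard_normal((h, d))   # down-projection
a, c = 0.5, 0.1                    # linearized "activation" act(z) = a*z + c

# Reference: two MatMuls with the linear activation in between
y_ref = (a * (x @ W1) + c) @ W2

# Folded: one precomputed merged matrix plus a constant bias vector
W_merged = a * (W1 @ W2)           # shape (d, d), computed once offline
bias = c * W2.sum(axis=0)          # the constant c flows through W2's rows
y_fast = x @ W_merged + bias

assert np.allclose(y_ref, y_fast)  # identical result, one MatMul at inference
```

The point of the fold is that `W_merged` and `bias` are computed once offline, so inference does a single (d, d) MatMul instead of two MatMuls through the wider hidden dimension h.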
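The SwiGLU FFN variant discussed in the dev.to article above replaces the classic two-MatMul FFN with a gated three-matrix form. A minimal NumPy sketch of the standard formulation (dimensions and names here are illustrative, not from any particular model):

```python
import numpy as np

def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU FFN block: SiLU-gated up-projection, then down-projection."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d, h = 4, 8                        # model dim, FFN hidden dim
x = rng.standard_normal((2, d))    # two token embeddings
out = swiglu_ffn(x,
                 rng.standard_normal((d, h)),   # W_gate
                 rng.standard_normal((d, h)),   # W_up
                 rng.standard_normal((h, d)))   # W_down
print(out.shape)  # prints (2, 4)
```

Note that SwiGLU uses three weight matrices where a classic RELU/GELU FFN uses two, which is why models adopting it often shrink the hidden dimension h to keep the parameter count comparable.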
AI Books from Aussie AI
The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory.
Get your copy from Amazon: The Sweetest Lesson
RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures.
Get your copy from Amazon: RAG Optimization
Generative AI Applications book:
Get your copy from Amazon: Generative AI Applications
Generative AI programming book:
Get your copy from Amazon: Generative AI in C++
CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization
CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home