Aussie AI
FFN Optimization
Last Updated 18 April, 2026
by David Spuler, Ph.D.
FFN Optimization: Book Excerpts and Blog Articles
Free online book excerpts, with full-text chapters and free PDF downloads, plus related articles from the Aussie AI blog:
- David Spuler, Ph.D., March 31st, 2026, FFN Fusion with Tiled Pipelined RELU, Aussie AI Blog, https://www.aussieai.com/blog/ffn-tiled-pipelined-relu
- David Spuler, Ph.D., March 22nd, 2026, LLM Attention and FFN Optimization are Opposites, Aussie AI Blog, https://www.aussieai.com/blog/attention-ffn-llm-optimize
- David Spuler, Ph.D., September 29, 2025, Promising LLM Inference Optimization Research, Aussie AI Blog, https://www.aussieai.com/blog/promising-llm-inference-optimization
- David Spuler, Ph.D., Feb 6th, 2026 (updated), 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
- David Spuler, Ph.D., April 18th, 2026, What is Prefill?, Aussie AI Blog, https://www.aussieai.com/blog/what-is-prefill
Research on FFN Optimization
Research papers include:
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
- Jie Tang; Shuai Wang; Song Chen; Yi Kang, May 2024, DP-FFN: Block-Based Dynamic Pooling for Accelerating Feed-Forward Layers in Transformers, 2024 IEEE International Symposium on Circuits and Systems (ISCAS), https://ieeexplore.ieee.org/abstract/document/10558119
- Amin Aminifar, Baichuan Huang, Azra Abtahi, Amir Aminifar, May 2024, Lightweight Inference for Forward-Forward Algorithm, https://whubaichuan.github.io/data/LightFF.pdf
- Nils Graef, Matthew Clapp, Andrew Wasielewski, 12 Jul 2024, Flash normalization: fast RMSNorm for LLMs, https://arxiv.org/abs/2407.09577 Code: https://huggingface.co/open-machine/FlashNorm
- Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai, 16 Oct 2024, EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference, https://arxiv.org/abs/2410.12247
- Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Itay Levy, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv, 28 Nov 2024, Puzzle: Distillation-Based NAS for Inference-Optimized LLMs, NVIDIA Research, https://arxiv.org/abs/2411.19146 (This is dynamic NAS on a vast scale, with a search space of size 10^138, because the optimization is applied at low granularity to each block in the attention and FFN subcomponents of every layer.)
- Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
- Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, the optimization merges the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., RELU or GELU) with linear functions, then merging the two matrices via matrix-multiplication associativity.)
- Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, et al. (additional authors not shown), 7 May 2024 (v1), last revised 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, DeepSeek-AI, https://arxiv.org/abs/2405.04434 (Introduces MLA (multi-head latent attention) and FFN optimizations, amongst other advances.)
- Asif Razzaq, March 29, 2025, NVIDIA AI Researchers Introduce FFN Fusion: A Novel Optimization Technique that Demonstrates How Sequential Computation in Large Language Models LLMs can be Effectively Parallelized, https://www.marktechpost.com/2025/03/29/nvidia-ai-researchers-introduce-ffn-fusion-a-novel-optimization-technique-that-demonstrates-how-sequential-computation-in-large-language-models-llms-can-be-effectively-parallelized/
- Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv, 24 Mar 2025, FFN Fusion: Rethinking Sequential Computation in Large Language Models, https://arxiv.org/abs/2503.18908
- Isaac Gerber, 10 May 2025, Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models, https://arxiv.org/abs/2505.06633
- Ruediger Ehlers, 2 Aug 2017 (v3), Formal Verification of Piece-Wise Linear Feed-Forward Neural Networks, https://arxiv.org/abs/1705.01320
- Razvan Pascanu, Guido Montufar, Yoshua Bengio, 14 Feb 2014 (v5), On the number of response regions of deep feed forward networks with piece-wise linear activations, https://arxiv.org/abs/1312.6098
- Shashank Sonkar, Richard G. Baraniuk, 25 May 2023 (v2), Investigating the Role of Feed-Forward Networks in Transformers Using Parallel Attention and Feed-Forward Net Design, https://arxiv.org/abs/2305.13297
- James Pan, Guoliang Li, 27 Jun 2025, A Survey of LLM Inference Systems, https://arxiv.org/abs/2506.21901
- Yao Xu, Mingyu Xu, Fangyu Lei, Wangtao Sun, Xiangrong Zeng, Bingning Wang, Guang Liu, Shizhu He, Jun Zhao, Kang Liu, 22 May 2025, Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN, https://arxiv.org/abs/2505.17153 https://anonymous.4open.science/r/Shift-FFN (FFN computations based on the diff between the current and prior token's embeddings.)
- Samyak Jha, Junho Kim, 30 Jan 2026, CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models, https://arxiv.org/abs/2602.00247 (FFN linear approximations for visual tokens; also using Hadamard products.)
- Aayush Gautam, Mukul Gagrani, Junyoung Park, Mingu Lee, Chris Lott, Narasimha Reddy, 30 Jan 2026, Fast Forward: Accelerating LLM Prefill with Predictive FFN Sparsity, https://arxiv.org/abs/2602.00397
- Ojasva Nema, Kaustubh Sharma, Aditya Chauhan, Parikshit Pareek, 5 Feb 2026, Structural Disentanglement in Bilinear MLPs via Architectural Inductive Bias, https://arxiv.org/abs/2602.05635
- Yuanhang Yang, Chaozheng Wang, Jing Li, 23 Oct 2025 (v2), UMoE: Unifying Attention and FFN with Shared Experts, https://arxiv.org/abs/2505.07260 (Using MoE experts across both attention and FFN modules.)
- Zekai Li, Jintu Zheng, Ji Liu, Han Liu, Haowei Zhu, Zeping Li, Fuwei Yang, Haiduo Huang, Jinzhang Peng, Dong Li, Lu Tian, Emad Barsoum, 16 Dec 2024, FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing, https://arxiv.org/abs/2412.11494 https://arxiv.org/pdf/2412.11494
- Guo-Hao Xu, Jingzhen Ding, Huping Ding, Zhao Xu, Kaifu Zhang, Sep 2024, FTP: Efficient Prefilling for Long-Context LLM Inference via FFN Token Pruning, https://openreview.net/forum?id=fL8Zp8o6RL PDF: https://openreview.net/pdf?id=fL8Zp8o6RL
- Minshen Zhang, Xiang Hu, Jianguo Li, Wei Wu, Kewei Tu, 7 Dec 2025, Flash Multi-Head Feed-Forward Network, https://arxiv.org/abs/2512.06989
- M Shojaei, Jul 24, 2025, SwiGLU: The FFN Upgrade I Use to Get Free Performance, https://dev.to/mshojaei77/swiglu-the-ffn-upgrade-i-use-to-get-free-performance-33jc
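The matrix-merging trick described in the Hu et al. entry above (Partially Linear Feed-Forward Network) can be illustrated with a toy NumPy sketch. This is not the paper's implementation: it simply assumes the activation between the two FFN MatMuls has been approximated by a linear function act(z) = a*z + c over the relevant region, at which point associativity folds both MatMuls into one merged matrix plus a constant bias:

```python
import numpy as np

# Toy sketch of folding an FFN's two MatMuls into one, assuming the
# activation has been linearized: act(z) = a*z + c. Then
#   y = act(x @ W1) @ W2 = x @ (a * W1 @ W2) + c * colsum(W2)
rng = np.random.default_rng(0)
d, h = 4, 8                        # model dim, FFN hidden dim
x = rng.standard_normal((1, d))    # one token embedding
W1 = rng.standard_normal((d, h))   # up-projection
W2 = rng.standard_normal((h, d))   # down-projection
a, c = 0.5, 0.1                    # linearized "activation" act(z) = a*z + c

# Reference: two MatMuls with the linear activation in between
y_ref = (a * (x @ W1) + c) @ W2

# Folded: one precomputed merged matrix plus a constant bias vector
W_merged = a * (W1 @ W2)           # shape (d, d), computed once offline
bias = c * W2.sum(axis=0)          # the constant c flows through W2's rows
y_fast = x @ W_merged + bias

assert np.allclose(y_ref, y_fast)  # identical result, one MatMul at inference
```

The point of the fold is that `W_merged` and `bias` are computed once offline, so inference does a single (d, d) MatMul instead of two MatMuls through the wider hidden dimension h.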
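The SwiGLU FFN variant discussed in the dev.to article above replaces the classic two-MatMul FFN with a gated three-matrix form. A minimal NumPy sketch of the standard formulation (dimensions and names here are illustrative, not from any particular model):

```python
import numpy as np

def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU FFN block: SiLU-gated up-projection, then down-projection."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d, h = 4, 8                        # model dim, FFN hidden dim
x = rng.standard_normal((2, d))    # two token embeddings
out = swiglu_ffn(x,
                 rng.standard_normal((d, h)),   # W_gate
                 rng.standard_normal((d, h)),   # W_up
                 rng.standard_normal((h, d)))   # W_down
print(out.shape)  # prints (2, 4)
```

Note that SwiGLU uses three weight matrices where a classic RELU/GELU FFN uses two, which is why models adopting it often shrink the hidden dimension h to keep the parameter count comparable.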
AI Books from Aussie AI
The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory.
Get your copy from Amazon: The Sweetest Lesson
RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures.
Get your copy from Amazon: RAG Optimization
Generative AI Applications book:
Get your copy from Amazon: Generative AI Applications
Generative AI programming book:
Get your copy from Amazon: Generative AI in C++
CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization
CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home