Aussie AI
Weight Precomputations
Last Updated 15 October, 2025
by David Spuler, Ph.D.
Weights are static during inference, so why not fiddle with them before we start? Of course, that's exactly the underlying idea of quantization and static pruning. Quantization precomputes new versions of the weights, converted to integers or lower-precision floating-point. Pruning removes weights by setting some of them to zero.
However, this section looks at other precomputation ideas. What useful information can we discover by preprocessing the weights? Since the weight data is available after training, we can make these changes "offline" without affecting inference speed, and then use the precomputed data to speed up inference thereafter.
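As a simple illustration of this offline-then-online pattern (a generic sketch only, not the method of any particular paper below), the C++ code below precomputes per-row weight statistics once, and the inference loop then uses them to skip rows whose weights are all negligible. The function names and the skipping threshold are illustrative assumptions.

    // Generic sketch: precompute per-row weight statistics offline, use at inference.
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Offline pass: precompute the maximum absolute weight in each row.
    std::vector<float> precompute_row_max(const std::vector<std::vector<float>>& W) {
        std::vector<float> row_max(W.size(), 0.0f);
        for (size_t i = 0; i < W.size(); ++i) {
            for (float w : W[i]) {
                row_max[i] = std::max(row_max[i], std::fabs(w));
            }
        }
        return row_max;
    }

    // Inference pass: skip the dot product for rows flagged as negligible offline.
    std::vector<float> matvec_with_skipping(const std::vector<std::vector<float>>& W,
                                            const std::vector<float>& x,
                                            const std::vector<float>& row_max,
                                            float threshold) {
        std::vector<float> y(W.size(), 0.0f);
        for (size_t i = 0; i < W.size(); ++i) {
            if (row_max[i] < threshold) continue;  // decision data was computed offline
            for (size_t j = 0; j < x.size(); ++j) {
                y[i] += W[i][j] * x[j];
            }
        }
        return y;
    }

The point of the pattern is that the expensive scan over the weights happens once, offline, while the per-token inference cost is only a cheap comparison per row.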
Research on Weight Precomputations
Some of the papers with generalized ideas about pre-examining weights to speed up inference include:
- T. J. Ham, S. J. Jung, S. Kim et al., “A3: Accelerating attention mechanisms in neural networks with approximation,” in Proc. of HPCA. IEEE, 2020, pp. 328–341. https://arxiv.org/abs/2002.10941 (Preprocessing of the key matrix in attention, with focus on large positive and negative values.)
- Q. Chen, C. Sun, Z. Lu, and C. Gao, “Enabling energy-efficient inference for self-attention mechanisms in neural networks,” in IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2022, pp. 25–28, https://ieeexplore.ieee.org/document/9869924
- Tae Jun Ham, Yejin Lee, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, Jae W. Lee, 2021, ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), https://ieeexplore.ieee.org/abstract/document/9499860/, https://taejunham.github.io/data/elsa_isca21.pdf (Precomputations involve the key and value matrices, including dot products, hashing, and similarity checking.)
- J. Rae, J. J. Hunt, I. Danihelka, T. Harley, A. W. Senior, G. Wayne, A. Graves, and T. Lillicrap, “Scaling memory-augmented neural networks with sparse reads and writes,” in International Conference on Neural Information Processing Systems, NIPS, 2016. https://arxiv.org/abs/1610.09027
- Z Qu, L Liu, F Tu, Z Chen, Y Ding, Y Xie, 2022, Dota: detect and omit weak attentions for scalable transformer acceleration, https://dl.acm.org/doi/pdf/10.1145/3503222.3507738
- David Spuler, March 2024, Chapter 50. Adaptive Inference, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Nils Graef, 12 Mar 2024 (v3), Transformer tricks: Precomputing the first layer, https://arxiv.org/abs/2402.13388 Code: https://github.com/OpenMachine-ai/transformer-tricks (Because the first layer only depends on the embeddings, it can be precomputed.)
- SZ Lin, YC Chen, YH Chang, TW Kuo, HP Li, 2024, LUTIN: Efficient Neural Network Inference with Table Lookup, ISLPED ’24, August 5-7, 2024, Newport Beach, CA, USA, https://dl.acm.org/doi/pdf/10.1145/3665314.3670804
- Sean MacAvaney, Craig Macdonald, 14 Apr 2025, On Precomputation and Caching in Information Retrieval Experiments with Pipeline Architectures, https://arxiv.org/abs/2504.09984
- S.-J. Lee, T.-H. Kim, 15 January 2024, Latency and accuracy optimization for binary neural network inference with locality-aware operation skipping, https://doi.org/10.1049/ell2.13090 https://ietresearch.onlinelibrary.wiley.com/doi/pdf/10.1049/ell2.13090 https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/ell2.13090
- Jiho Shin, Hoeseok Yang, Youngmin Yi, 19 Nov 2024, SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference, https://arxiv.org/abs/2411.12692
- Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
- Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama (Shared FFN layers, similar to pruning several FFNs, for on-mobile small model execution.)
- Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
- Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, the optimization merges the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., RELU or GELU) with linear functions, then combining the two matrices using matrix-multiplication associativity; see the sketch after this list.)
- Asif Razzaq, March 29, 2025, NVIDIA AI Researchers Introduce FFN Fusion: A Novel Optimization Technique that Demonstrates How Sequential Computation in Large Language Models LLMs can be Effectively Parallelized, https://www.marktechpost.com/2025/03/29/nvidia-ai-researchers-introduce-ffn-fusion-a-novel-optimization-technique-that-demonstrates-how-sequential-computation-in-large-language-models-llms-can-be-effectively-parallelized/
- Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv, 24 Mar 2025, FFN Fusion: Rethinking Sequential Computation in Large Language Models, https://arxiv.org/abs/2503.18908
- Richie Li, 31 May 2025 (v3), Dataflow & Tiling Strategies in Edge-AI FPGA Accelerators: A Comprehensive Literature Review, https://arxiv.org/abs/2505.08992
- Yingyan Lin, Charbel Sakr, Yongjune Kim, and Naresh Shanbhag, 2017, PredictiveNet: An energy-efficient convolutional neural network via zero prediction, IEEE Int'l Symposium on Circuits and Systems (ISCAS), pp. 1–4, https://shanbhag.ece.illinois.edu/publications/yingyan-ISCAS-2017.pdf
- V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta and H. Esmaeilzadeh, "SnaPEA: Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks," 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, USA, 2018, pp. 662-673, doi: 10.1109/ISCA.2018.00061. https://ieeexplore.ieee.org/document/8416863 https://cseweb.ucsd.edu/~vakhlagh/ISCA18-SnaPEA.pdf
- Vikas Natesh, H.T. Kung, 12 Apr 2025, PQS (Prune, Quantize, and Sort): Low-Bitwidth Accumulation of Dot Products in Neural Network Computations, https://arxiv.org/abs/2504.09064 (Split vectors into positive and negatives to avoid overflow in vector dot product accumulators.)
- Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, 14 Feb 2024, HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference, https://arxiv.org/abs/2402.09360
- Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang, 3 Oct 2023 (v2), Alternating Updates for Efficient Transformers, https://arxiv.org/abs/2301.13310
- Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, 21 Nov 2023 (v3), Approximating Two-Layer Feedforward Networks for Efficient Transformers, https://arxiv.org/abs/2310.10837
- Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar, 30 Jul 2024 (v3), LLM in a flash: Efficient Large Language Model Inference with Limited Memory, https://arxiv.org/abs/2312.11514
- Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen, 26 Oct 2023, Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time, https://arxiv.org/abs/2310.17157 https://github.com/FMInference/DejaVu
- Yukito Tajima, Nakamasa Inoue, Yusuke Sekikawa, Ikuro Sato, Rio Yokota, 29 Jun 2025, Masked Gated Linear Unit, https://arxiv.org/abs/2506.23225 (Merging two GLU computations, similar to merging two FFN matrices.)
- Mengting Ai, Tianxin Wei, Yifan Chen, Zeming Guo, Jingrui He, 6 Jan 2025 (v3), MLP Fusion: Towards Efficient Fine-tuning of Dense and Mixture-of-Experts Language Models, https://arxiv.org/abs/2307.08941 https://github.com/weitianxin/MLP_Fusion
- Keshigeyan Chandrasegaran, Michael Poli, Daniel Y. Fu, Dongjun Kim, Lea M. Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, Stefano Ermon, Li Fei-Fei, 6 Jun 2025 (v2), Exploring Diffusion Transformer Designs via Grafting, https://arxiv.org/abs/2506.05340 https://grafting.stanford.edu/ (Architectural editing of model components.)
- Yuxin Ren, Benyou Wang, Lifeng Shang, Xin Jiang, Qun Liu, 20 May 2022, Exploring Extreme Parameter Compression for Pre-trained Language Models, https://arxiv.org/abs/2205.10036 https://github.com/twinkle0331/Xcompression (Splitting FFNs into sub-FFNs added together, and also using matrix/tensor decomposition.)
- T. Kobayashi and K. Watanabe, "Matrix Decomposition By Additive and Subtractive Factors," ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025, pp. 1-5, doi: 10.1109/ICASSP49660.2025.10890799. https://ieeexplore.ieee.org/abstract/document/10890799
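As a concrete example of the matrix-merging idea noted above for Hu et al. (2025), the sketch below precomputes the product of the two FFN weight matrices offline, under the deliberately crude assumption that the intervening activation is treated as the identity (the papers above use more careful piecewise-linear approximations); all names and types here are illustrative, not from any library.

    // Illustrative sketch of merging two FFN weight matrices offline via
    // associativity: y = W2 * f(W1 * x) ≈ (W2 * W1) * x when the activation f
    // is approximated as linear (here, crudely, as the identity).
    #include <cstddef>
    #include <vector>

    using Matrix = std::vector<std::vector<float>>;  // row-major: Matrix[row][col]

    // Offline precomputation: W_merged = W2 * W1, paid once before inference.
    Matrix merge_ffn_weights(const Matrix& W2, const Matrix& W1) {
        size_t rows = W2.size();        // output dimension
        size_t inner = W1.size();       // hidden dimension (W2 columns == W1 rows)
        size_t cols = W1[0].size();     // input dimension
        Matrix merged(rows, std::vector<float>(cols, 0.0f));
        for (size_t i = 0; i < rows; ++i)
            for (size_t k = 0; k < inner; ++k)
                for (size_t j = 0; j < cols; ++j)
                    merged[i][j] += W2[i][k] * W1[k][j];
        return merged;
    }

    // Inference: a single matrix-vector product with the precomputed merged matrix.
    std::vector<float> ffn_merged(const Matrix& W_merged, const std::vector<float>& x) {
        std::vector<float> y(W_merged.size(), 0.0f);
        for (size_t i = 0; i < W_merged.size(); ++i)
            for (size_t j = 0; j < x.size(); ++j)
                y[i] += W_merged[i][j] * x[j];
        return y;
    }

The merged matrix trades some accuracy (from linearizing the activation) for replacing two matrix-vector products with one at inference time, with the merge cost incurred only once offline.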