Aussie AI

Weight Precomputations

  • Last Updated 22 May, 2025
  • by David Spuler, Ph.D.

Weights are static during inference, so why not fiddle with them before we start? Of course, that's exactly the underlying idea of quantization and static pruning. Quantization precomputes new versions of the weights that are quantized to integers or lower precision floating-point. Pruning removes weights by changing some of them to zero.

However, this section looks at other precomputation ideas. What useful information can we discern by preprocessing the weights? Since the weight data is available after training, we can analyze and transform it "offline," at no cost to inference speed, and then use the precomputed data to speed up inference thereafter.
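As a concrete illustration, here is a minimal C++ sketch of the pattern: scan the frozen weights once offline, store a small summary, and use it at inference to skip work. The names, shapes, and the deliberately crude skipping bound are all hypothetical; the row-omission heuristic is only an approximation, in the spirit of the "detect and omit weak attentions" line of work cited below.

// Minimal sketch (hypothetical names/shapes): precompute per-row statistics
// from a static weight matrix offline, then reuse them at inference time.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct PrecomputedWeights {
    std::vector<float> weights;   // row-major [rows x cols], frozen after training
    std::vector<float> rowMaxAbs; // per-row max |w|, computed once offline
    size_t rows = 0, cols = 0;
};

// Offline step: run once after loading the model, before serving any requests.
void precompute_row_stats(PrecomputedWeights &pw) {
    pw.rowMaxAbs.assign(pw.rows, 0.0f);
    for (size_t r = 0; r < pw.rows; ++r)
        for (size_t c = 0; c < pw.cols; ++c)
            pw.rowMaxAbs[r] = std::max(pw.rowMaxAbs[r],
                                       std::fabs(pw.weights[r * pw.cols + c]));
}

// Inference step: skip rows whose precomputed bound says the dot product
// cannot reach a threshold (an approximation: small outputs are zeroed).
void matvec_with_skipping(const PrecomputedWeights &pw,
                          const float *x, float xMaxAbs,
                          float threshold, float *out) {
    for (size_t r = 0; r < pw.rows; ++r) {
        // Upper bound on |row . x| via precomputed stats; cheap but loose.
        if (pw.rowMaxAbs[r] * xMaxAbs * static_cast<float>(pw.cols) < threshold) {
            out[r] = 0.0f;   // omit a provably "weak" row entirely
            continue;
        }
        float sum = 0.0f;
        for (size_t c = 0; c < pw.cols; ++c)
            sum += pw.weights[r * pw.cols + c] * x[c];
        out[r] = sum;
    }
}

The offline pass costs one scan over the weights, paid once; whether the inference-time skipping helps depends entirely on how often the bound actually triggers for real activations.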

Research on Weight Precomputations

Papers with general ideas about preprocessing or pre-examining weights to speed up inference include:

  • T. J. Ham, S. J. Jung, S. Kim et al., “A3: Accelerating attention mechanisms in neural networks with approximation,” in Proc. of HPCA. IEEE, 2020, pp. 328–341. https://arxiv.org/abs/2002.10941 (Preprocessing of the key matrix in attention, with focus on large positive and negative values.)
  • Q. Chen, C. Sun, Z. Lu, and C. Gao, “Enabling energy-efficient inference for self-attention mechanisms in neural networks,” in IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2022, pp. 25–28, https://ieeexplore.ieee.org/document/9869924
  • Tae Jun Ham, Yejin Lee, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, Jae W. Lee, 2021, ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), https://ieeexplore.ieee.org/abstract/document/9499860/, https://taejunham.github.io/data/elsa_isca21.pdf (Precomputations involve the key and value matrices, including dot products, hashing, and similarity checking.)
  • J. Rae, J. J. Hunt, I. Danihelka, T. Harley, A. W. Senior, G. Wayne, A. Graves, and T. Lillicrap, “Scaling memory-augmented neural networks with sparse reads and writes,” in International Conference on Neural Information Processing Systems, NIPS, 2016. https://arxiv.org/abs/1610.09027
  • Z. Qu, L. Liu, F. Tu, Z. Chen, Y. Ding, Y. Xie, 2022, DOTA: Detect and omit weak attentions for scalable transformer acceleration, https://dl.acm.org/doi/pdf/10.1145/3503222.3507738
  • David Spuler, March 2024, Chapter 50. Adaptive Inference, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Nils Graef, 12 Mar 2024 (v3), Transformer tricks: Precomputing the first layer, https://arxiv.org/abs/2402.13388 Code: https://github.com/OpenMachine-ai/transformer-tricks (Because the first layer depends only on the embeddings, it can be precomputed; see the sketch after this list.)
  • SZ Lin, YC Chen, YH Chang, TW Kuo, HP Li, 2024, LUTIN: Efficient Neural Network Inference with Table Lookup, ISLPED ’24, August 5-7, 2024, Newport Beach, CA, USA, https://dl.acm.org/doi/pdf/10.1145/3665314.3670804
  • Sean MacAvaney, Craig Macdonald, 14 Apr 2025, On Precomputation and Caching in Information Retrieval Experiments with Pipeline Architectures, https://arxiv.org/abs/2504.09984
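
The first-layer precomputation idea in the Graef paper above can be sketched as a lookup table: if the first layer's Q/K/V projections see only the token's embedding (which assumes position information is applied after the projections, as with rotary embeddings, rather than added to the input), then the projection of every vocabulary entry can be computed once offline. The following C++ sketch uses hypothetical names and square matrices and is not the paper's implementation.

// Rough sketch (hypothetical sizes/names) of precomputing the first layer:
// the Q/K/V projections of every vocabulary id are computed once offline,
// turning the first layer's matrix multiplies into memory lookups.
#include <cstddef>
#include <vector>

struct FirstLayerTable {
    size_t vocab = 0, dim = 0;
    std::vector<float> q, k, v;   // each [vocab x dim], filled offline
};

// Offline: multiply every embedding row by the frozen projection matrices.
void build_first_layer_table(const std::vector<float> &emb,  // [vocab x dim]
                             const std::vector<float> &Wq,   // [dim x dim]
                             const std::vector<float> &Wk,
                             const std::vector<float> &Wv,
                             FirstLayerTable &t) {
    t.q.assign(t.vocab * t.dim, 0.0f);
    t.k.assign(t.vocab * t.dim, 0.0f);
    t.v.assign(t.vocab * t.dim, 0.0f);
    for (size_t tok = 0; tok < t.vocab; ++tok)
        for (size_t o = 0; o < t.dim; ++o)
            for (size_t i = 0; i < t.dim; ++i) {
                float e = emb[tok * t.dim + i];
                t.q[tok * t.dim + o] += e * Wq[i * t.dim + o];
                t.k[tok * t.dim + o] += e * Wk[i * t.dim + o];
                t.v[tok * t.dim + o] += e * Wv[i * t.dim + o];
            }
}

// Inference: the first layer's Q projection becomes a table lookup per token.
inline const float *lookup_q(const FirstLayerTable &t, int tokenId) {
    return &t.q[static_cast<size_t>(tokenId) * t.dim];
}

The trade-off is memory: each projection table stores vocab-times-dim floats, which is why this trick is normally applied only to the first layer, where the inputs are limited to the vocabulary of embeddings.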
