Aussie AI

Optimizing Matrix Multiplication Algorithms and MatMul/GEMM Kernels

  • Last Updated 30 August, 2025
  • by David Spuler, Ph.D.

Matrix Multiplication Research

With papers on this topic dating back to the 1970s, you'd think we'd be done optimizing matrix multiplication code by now. Although the earliest work covered sequential algorithms, even parallel algorithms for matrix multiplication have decades of theory behind them. And yet there is a continual stream of research papers on MatMul/GEMM optimizations.

Feel free to blame the AI bubble for this. AI engines are built on matrix multiplication, and it occurs in all of the standard LLM components, including the attention heads, the FFN blocks, and the embedding and unembedding layers.

Note that attention and FFN blocks occur in every layer of an LLM.

There's quite a lot of work to do to optimize attention and FFNs. The reason for the ongoing research is that there are many nuances and special cases. Some examples:

  • Sparse versus dense matrices (i.e. SpMM versus GEMM algorithms).
  • Matrix-matrix versus matrix-vector multiplications.
  • Rectangular versus square matrices, including tall or flat rectangular matrices.
  • Special shapes of matrices (e.g. triangular).

Optimizations to the matrix processing code must also deal with:

  • Compute-bound versus memory-bound algorithms.
  • Different computational versus memory costs in different hardware architectures (e.g., CPUs, GPUs, NPUs).
  • Generalization from 2D matrices to 3D tensors (i.e., "tensor slices").
  • Integer versus floating-point arithmetic.
  • Quantization to fewer bits, such as 8-bit integer or 4-bit kernels (also reduced-bit floating-point FP8 and even FP4 kernels).
  • Permutation-based vector/matrix/tensor algorithms.
  • Sparse matrix storage compression formats (there are various in-memory methods).
  • Non-negative matrices (e.g., activations arising after ReLU, or weight matrices made non-negative by rescaling).

There are also some features specific to AI engine algorithms and their use of matrices:

  • Prefill phase (prompt processing) is compute-bound, but in the decoding phase, FFNs are compute-bound because their weight matrices are static, whereas the QKV attention algorithms are memory-bound because the KV cache is not static data.
  • FFNs involve two matrix multiplications separated by an element-wise activation function (e.g., kernel fusion is possible); see the sketch after this list.
  • Attention kernels have a particular QKV algorithm, for which there are many specialty kernels: e.g., Group Query Attention (GQA), Multi-Query Attention (MQA), Flash Attention, Paged Attention, etc.
  • Transposed matrices are used in the QKV algorithm, adding another wrinkle.
  • Embedding and unembedding matrices used to convert between tokens and embedding vectors.
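To make the FFN structure concrete, here is a minimal C++ sketch for a single token vector, with ReLU assumed as the activation purely for illustration; in a fused kernel, the activation would be applied in the first matmul's epilogue rather than as a separate pass.

    #include <vector>
    #include <cstddef>
    #include <algorithm>

    // Minimal FFN sketch: y = W2 * act(W1 * x), with ReLU as the example activation.
    // W1 is [hidden x dim], W2 is [dim x hidden], x is [dim]; all row-major.
    // Illustrative only -- real kernels tile, vectorize, and fuse the activation
    // into the first matmul's epilogue.
    std::vector<float> ffn(const std::vector<float>& W1,
                           const std::vector<float>& W2,
                           const std::vector<float>& x,
                           std::size_t dim, std::size_t hidden) {
        std::vector<float> h(hidden, 0.0f);
        for (std::size_t i = 0; i < hidden; ++i) {       // first matmul (up projection)
            float sum = 0.0f;
            for (std::size_t j = 0; j < dim; ++j)
                sum += W1[i * dim + j] * x[j];
            h[i] = std::max(sum, 0.0f);                   // element-wise ReLU activation
        }
        std::vector<float> y(dim, 0.0f);
        for (std::size_t i = 0; i < dim; ++i) {           // second matmul (down projection)
            float sum = 0.0f;
            for (std::size_t j = 0; j < hidden; ++j)
                sum += W2[i * hidden + j] * h[j];
            y[i] = sum;
        }
        return y;
    }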

And there are some theoretical alternative methods of handling matrix multiplications:

  • Low-rank matrix factorization algorithms (e.g., SVD).
  • Approximate matrix multiplication.
  • Element-wise multiplication (Hadamard product).

Since many of these techniques are orthogonal, you can see the combinatorial growth of the various sub-cases. We might be reading GEMM papers for a while!

Optimizing Matrix Multiplication

Matrix multiplications are the basis of neural network processing. The main bottleneck is the multiplication operation deep inside the nested loops of the matrix multiplier. The main optimization in modern times is hardware acceleration, running many of those computations in parallel.

The main techniques to use in faster MatMul/GEMM kernels include:

  • GPU versions of kernels
  • Advanced theoretical matrix multiplication algorithms (e.g. Strassen's)
  • Approximate matrix multiplication

Researchers have been optimizing matrix multiplication for decades, so you might think it's all done and dusted. Well, firstly, a lot of that old research concerned sequential algorithms rather than parallel ones. Furthermore, most of that research focused on reducing the number of multiplications (i.e., FLOPs), whereas the cost of memory access is a bigger issue for very large matrices. Hence, there's a steady stream of matrix multiplication optimization papers, and no end in sight. Some areas of focus for this AI-related matrix multiplication research include:

Several of the components of a basic AI Transformer engine (for LLMs) use matrix multiplications in a specialized way. See also the more specific optimizations of these components:

Naive Matrix Multiplication Algorithm

The simplest algorithm to compute matrix multiplication is a cubic-complexity algorithm, which involves three nested loops. There are various papers with examples:
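For reference, a minimal C++ version of the triple-loop algorithm is sketched below (row-major storage assumed):

    #include <cstddef>

    // Naive O(n^3) matrix multiplication: C = A * B.
    // A is [m x k], B is [k x n], C is [m x n], all row-major.
    void matmul_naive(const float* A, const float* B, float* C,
                      std::size_t m, std::size_t k, std::size_t n) {
        for (std::size_t i = 0; i < m; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                float sum = 0.0f;
                for (std::size_t p = 0; p < k; ++p)
                    sum += A[i * k + p] * B[p * n + j];   // innermost multiply-accumulate
                C[i * n + j] = sum;
            }
        }
    }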

Survey Papers on Matrix Multiplication Kernels (MatMul/GEMM)

Survey papers on matrix multiplication include:

  • Valentin Isaac–Chassande, Adrian Evans, Yves Durand, and Frédéric Rousseau. 2024. Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey. ACM Trans. Archit. Code Optim. 21, 2, Article 27 (June 2024), 26 pages. https://doi.org/10.1145/3640542 https://dl.acm.org/doi/full/10.1145/3640542
  • Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, Yizhuo Wang, 11 Jul 2023 (v3), A Systematic Survey of General Sparse Matrix-Matrix Multiplication, https://arxiv.org/abs/2002.11273 https://dl.acm.org/doi/abs/10.1145/3571157
  • Max Grossman, Christopher Thiele, Mauricio Araya-Polo, Florian Frank, Faruk O. Alpak, Vivek Sarkar, 1 Aug 2016, A survey of sparse matrix-vector multiplication performance on large matrices, https://arxiv.org/abs/1608.00636
  • S. Afroz, M. Tahaseen, F. Ahmed, K. S. Farshee and M. N. Huda, "Survey on matrix multiplication algorithms," 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV), Dhaka, Bangladesh, 2016, pp. 151-155, doi: 10.1109/ICIEV.2016.7759985. https://ieeexplore.ieee.org/abstract/document/7759985
  • Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
  • Héctor Martínez, Adrián Castelló, Francisco D. Igual, Enrique S. Quintana-Ortí, 13 Jun 2025, The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference, https://arxiv.org/abs/2506.11728 (Full coverage of GEMM kernel implementations on CPU SIMD instruction sets, covering AVX, Arm Neon/SVE2, AMX, MMX, and more.)

Surveys on Sparse Matrix Multiplication

Review papers on sparse matrix multiplication (SpMM):

Sparse Matrix Kernels

Sparse matrices are those with lots of zeros and few non-zero values. There are various ways to optimize MatMul/GEMM for sparsity.
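As one illustration, here is a minimal C++ sketch of a sparse matrix-vector multiply in the common Compressed Sparse Row (CSR) layout, which stores and processes only the non-zero entries; the best format and kernel depend on the sparsity pattern and the hardware.

    #include <vector>
    #include <cstddef>

    // Sparse matrix-vector multiply (SpMV) in CSR format: y = A * x.
    // row_ptr has (rows+1) entries; col_idx and values hold only the non-zeros.
    void spmv_csr(const std::vector<std::size_t>& row_ptr,
                  const std::vector<std::size_t>& col_idx,
                  const std::vector<float>& values,
                  const std::vector<float>& x,
                  std::vector<float>& y) {
        std::size_t rows = row_ptr.size() - 1;
        for (std::size_t i = 0; i < rows; ++i) {
            float sum = 0.0f;
            for (std::size_t p = row_ptr[i]; p < row_ptr[i + 1]; ++p)
                sum += values[p] * x[col_idx[p]];   // only non-zero entries are touched
            y[i] = sum;
        }
    }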

Tiled MatMul/GEMM

Research papers on tiled MatMul/GEMM optimizations:
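The idea behind tiling (also called cache blocking) is to partition the matrices into small blocks that fit in fast memory, so each block is reused many times before being evicted. A simplified C++ sketch is below; real kernels pick tile sizes to match cache or GPU shared-memory sizes, and handle edge tiles and data packing more carefully.

    #include <algorithm>
    #include <cstddef>

    // Tiled (cache-blocked) matmul: C += A * B, with A [m x k], B [k x n], row-major.
    // TILE is an illustrative block size; C must be zero-initialized by the caller.
    constexpr std::size_t TILE = 64;

    void matmul_tiled(const float* A, const float* B, float* C,
                      std::size_t m, std::size_t k, std::size_t n) {
        for (std::size_t i0 = 0; i0 < m; i0 += TILE)
            for (std::size_t p0 = 0; p0 < k; p0 += TILE)
                for (std::size_t j0 = 0; j0 < n; j0 += TILE)
                    // Multiply one small block that fits in cache.
                    for (std::size_t i = i0; i < std::min(i0 + TILE, m); ++i)
                        for (std::size_t p = p0; p < std::min(p0 + TILE, k); ++p) {
                            float a = A[i * k + p];
                            for (std::size_t j = j0; j < std::min(j0 + TILE, n); ++j)
                                C[i * n + j] += a * B[p * n + j];
                        }
    }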

CUDA Matrix Multiplication

There are various papers with examples of coding MatMul/GEMM in CUDA C++:
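As a minimal illustration (not a tuned kernel), here is a naive CUDA C++ MatMul where each thread computes one output element; production kernels add shared-memory tiling, vectorized loads, and tensor-core instructions.

    #include <cuda_runtime.h>

    // Naive CUDA matmul kernel: C = A * B, A [m x k], B [k x n], row-major.
    // One thread computes one element of C; illustrative only, no tiling.
    __global__ void matmul_kernel(const float* A, const float* B, float* C,
                                  int m, int k, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < m && col < n) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += A[row * k + p] * B[p * n + col];
            C[row * n + col] = sum;
        }
    }

    // Example launch configuration (16x16 threads per block); dA, dB, dC are device pointers.
    void matmul_cuda(const float* dA, const float* dB, float* dC,
                     int m, int k, int n) {
        dim3 block(16, 16);
        dim3 grid((n + block.x - 1) / block.x, (m + block.y - 1) / block.y);
        matmul_kernel<<<grid, block>>>(dA, dB, dC, m, k, n);
    }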

Lookup Table (LUT) for MatMul/GEMM

The use of table lookup can sometimes reduce or remove multiplication operations. This method has been used in various MatMul/GEMM kernels, usually when the operands are quantized to a smaller range of values than full floating-point.
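One simple flavor of the idea is sketched below, assuming 4-bit weight codes and 8-bit activation codes (an illustrative choice, not any particular paper's scheme): all possible products are precomputed into a small table so the inner dot-product loop performs only lookups and additions. Published kernels use more elaborate variants.

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Illustrative LUT-based dot product for quantized operands:
    // 4-bit weights (16 codes) and 8-bit activations (256 codes).
    // All 16*256 integer products are precomputed once, so the inner loop
    // does table lookups and additions instead of multiplications.
    struct LutDot {
        int32_t table[16][256];

        LutDot() {
            for (int w = 0; w < 16; ++w)
                for (int a = 0; a < 256; ++a)
                    table[w][a] = (w - 8) * (a - 128);   // example signed decodings
        }

        int32_t dot(const std::vector<uint8_t>& w4,          // 4-bit weight codes (0..15)
                    const std::vector<uint8_t>& a8) const {  // 8-bit activation codes
            int32_t sum = 0;
            for (std::size_t i = 0; i < w4.size(); ++i)
                sum += table[w4[i]][a8[i]];                  // lookup replaces multiply
            return sum;                                      // scale factors applied later
        }
    };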

Research on LUT MatMul/GEMM algorithms:

Matrix-Vector Multiplication

Matrix-vector multiplication is also called Vector Matrix Multiplication (VMM) or General Matrix-Vector multiplication (GEMV).
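For reference, a minimal C++ matrix-vector multiply is sketched below; each output element is a single dot product over one row, and with only one input vector there is little data reuse, which is why GEMV workloads tend to be memory-bound.

    #include <cstddef>

    // Matrix-vector multiply (GEMV): y = A * x, A is [m x n] row-major.
    // Each output element is one dot product over a contiguous row of A.
    void gemv(const float* A, const float* x, float* y,
              std::size_t m, std::size_t n) {
        for (std::size_t i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (std::size_t j = 0; j < n; ++j)
                sum += A[i * n + j] * x[j];
            y[i] = sum;
        }
    }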

Transpose Optimizations in Matrix Multiplication

One of the basic methods to optimize matrix multiplication is to operate on a matrix's transpose, because this leads to accessing contiguous memory blocks, which is faster (i.e., it's a memory access speed optimization, not a reduction in computations). The underlying issue is that matrices can be stored in row-major or column-major order, and the two matrices in a matrix-matrix computation want opposite storage layouts. It can even be beneficial in some cases to compute the transpose on-the-fly (which has quadratic complexity), storing it as a temporary matrix, just to speed up the full matrix-matrix operation (which has cubic complexity). Hence, there are various optimizations, including (see the sketch after this list):

  • Fast computation of the transpose.
  • Caching a pre-computed transpose.
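Below is a minimal C++ sketch of the second idea: B is transposed into a temporary buffer (quadratic work) so that the inner dot-product loop reads both operands contiguously during the cubic-cost multiplication.

    #include <vector>
    #include <cstddef>

    // Matmul via an explicit transpose of B: C = A * B, all row-major.
    // Transposing B (O(n^2) work) makes both inner-loop accesses contiguous,
    // which can pay off against the O(n^3) multiplication cost.
    void matmul_transposed_b(const float* A, const float* B, float* C,
                             std::size_t m, std::size_t k, std::size_t n) {
        std::vector<float> Bt(n * k);
        for (std::size_t p = 0; p < k; ++p)          // build the temporary transpose
            for (std::size_t j = 0; j < n; ++j)
                Bt[j * k + p] = B[p * n + j];

        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                float sum = 0.0f;
                for (std::size_t p = 0; p < k; ++p)   // both rows are contiguous now
                    sum += A[i * k + p] * Bt[j * k + p];
                C[i * n + j] = sum;
            }
    }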

Research papers on handling a transpose, or creating one, for faster MatMul include:

Matrix Multiplication Optimization in AI Models

Most AI inference algorithms are based on matrix multiplications or, similarly, on vector dot products. The primary optimization is to parallelize matrix or vector operations using hardware acceleration in GPUs or other chips. However, several improvements to matrix and vector algorithms have also been found. Research papers specific to matrix multiplication include:

Fast Matrix Multiplication Computation Theory

The main techniques for faster multiplication of general matrices include:

  • Strassen's algorithm
  • Winograd's algorithm
  • Fast Fourier Transform (FFT) methods

Matrix multiplications can also be sped up by using specialized matrices:

General Matrix Multiplication Research

Papers on the low-level algorithms for optimizing matrix multiplication for general matrices (i.e. all matrices, not just a subtype):

Strassen Matrix Multiplication

The Strassen method is based on a way to multiply 2x2 matrices using seven arithmetic multiplications, rather than the usual eight. Generalizing this idea leads to a faster way to multiply general matrices. This reduces the asymptotic complexity of matrix-matrix multiplication to O(n^2.8) (where 2.8 is log base 2 of 7), rather than the O(n^3) of the naive cubic algorithm (where 3 is log base 2 of 8).
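For the 2x2 base case, the seven products and the additions that recombine them into the four result entries can be written out directly, as in the C++ sketch below; the full algorithm applies the same formulas recursively to matrix blocks.

    // Strassen's 2x2 block multiply: seven products instead of eight.
    // Scalars here for clarity; in the full algorithm each entry is a sub-matrix
    // and the same formulas are applied recursively.
    struct Mat2 { double a11, a12, a21, a22; };

    Mat2 strassen2x2(const Mat2& A, const Mat2& B) {
        double m1 = (A.a11 + A.a22) * (B.a11 + B.a22);
        double m2 = (A.a21 + A.a22) * B.a11;
        double m3 = A.a11 * (B.a12 - B.a22);
        double m4 = A.a22 * (B.a21 - B.a11);
        double m5 = (A.a11 + A.a12) * B.a22;
        double m6 = (A.a21 - A.a11) * (B.a11 + B.a12);
        double m7 = (A.a12 - A.a22) * (B.a21 + B.a22);
        return Mat2{
            m1 + m4 - m5 + m7,   // c11
            m3 + m5,             // c12
            m2 + m4,             // c21
            m1 - m2 + m3 + m6    // c22
        };
    }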

  • Strassen, V. The asymptotic spectrum of tensors and the exponent of matrix multiplication. In 27th Annual Symposium on Foundations of Computer Science 49–54 (IEEE, 1986), https://ieeexplore.ieee.org/document/4568194
  • Huang, J., Yu, C. D. & Geijn, R. A. V. D. Strassen’s algorithm reloaded on GPUs. ACM Trans. Math. Softw. 46, 1–22 (2020). https://dl.acm.org/doi/10.1145/3372419
  • Tschannen, M., Khanna, A. & Anandkumar, A, StrassenNets: deep learning with a multiplication budget. In International Conference on Machine Learning 4985–4994 (PMLR, 2018). https://arxiv.org/abs/1712.03942
  • Huang, J., Smith, T. M., Henry, G. M. & Van De Geijn, R. A., Strassen’s algorithm reloaded. In International Conference for High Performance Computing, Networking, Storage and Analysis 690–701 (IEEE, 2016). https://ieeexplore.ieee.org/document/7877137
  • Burichenko, V. P., 2014, On symmetries of the Strassen algorithm. Preprint at https://arxiv.org/abs/1408.6273 (2014).
  • Grochow, J. A. & Moore, C. 2017, Designing Strassen’s algorithm. Preprint at https://arxiv.org/abs/1708.09398 (2017). https://arxiv.org/abs/1708.09398
  • Elser, V., 2016, A network that learns Strassen multiplication. J. Mach. Learn. Res. 17, 3964–3976 (2016). https://arxiv.org/abs/1601.07227
  • Jianyu Huang, Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn, 3 May 2016, Implementing Strassen's Algorithm with BLIS, https://arxiv.org/abs/1605.01078
  • Gianfranco Bilardi, Lorenzo De Stefani, 7 May 2016, The I/O complexity of Strassen's matrix multiplication with recomputation, https://arxiv.org/abs/1605.02224
  • Christian Ikenmeyer, Vladimir Lysikov, 13 Feb 2019 (v2), Strassen's 2x2 matrix multiplication algorithm: A conceptual perspective, https://arxiv.org/abs/1708.08083
  • Jianyu Huang, Chenhan D. Yu, Robert A. van de Geijn, 24 Aug 2018, Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs, https://arxiv.org/abs/1808.07984
  • Chandan Misra, Sourangshu Bhattacharya, Soumya K. Ghosh, 22 Nov 2018 (v2), Stark: Fast and Scalable Strassen's Matrix Multiplication using Apache Spark, https://arxiv.org/abs/1811.07325
  • Viviana Arrigoni, Annalisa Massini, 6 Feb 2019, Fast Strassen-based A^t A Parallel Multiplication, https://arxiv.org/abs/1902.02104
  • Viviana Arrigoni, Filippo Maggioli, Annalisa Massini, Emanuele Rodolà, 25 Oct 2021, Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose, https://arxiv.org/abs/2110.13042
  • Osman B. Guney, Suayb S. Arslan, 10 Oct 2022, Fault-Tolerant Strassen-Like Matrix Multiplication, https://arxiv.org/abs/2210.04441
  • Paolo D'Alberto, 20 Dec 2023, Strassen's Matrix Multiplication Algorithm Is Still Faster, https://arxiv.org/abs/2312.12732
  • Jean-Guillaume Dumas, Clément Pernet, Alexandre Sedoglavic, 28 Jun 2024 (v2), Strassen's algorithm is not optimally accurate, https://arxiv.org/abs/2402.05630
  • Afzal Ahmad, Linfeng Du, Wei Zhang, 4 Jun 2024, Fast and Practical Strassen's Matrix Multiplication using FPGAs, https://arxiv.org/abs/2406.02088
  • Tomonori Kouya, 7 Oct 2014, Accelerated Multiple Precision Matrix Multiplication using Strassen's Algorithm and Winograd's Variant, https://arxiv.org/abs/1410.1599
  • Tomonori Kouya, 29 Oct 2015, Performance evaluation of multiple precision matrix multiplications using parallelized Strassen and Winograd algorithms https://arxiv.org/abs/1510.08642
  • Brice Boyer, Jean-Guillaume Dumas, Clément Pernet, Wei Zhou, 18 May 2009 (v5), Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm, https://arxiv.org/abs/0707.2347
  • Jean-Guillaume Dumas, Victor Pan, 17 Dec 2016, Fast Matrix Multiplication and Symbolic Computation, https://arxiv.org/abs/1612.05766
  • Alexander Kozachinskiy, Felipe Urrutia, Hector Jimenez, Tomasz Steifer, Germán Pizarro, Matías Fuentes, Francisco Meza, Cristian B. Calderon, Cristóbal Rojas, 6 Feb 2025 (v2), Strassen Attention: Unlocking Compositional Abilities in Transformers Based on a New Lower Bound Method, https://arxiv.org/abs/2501.19215
  • K. Deepthi, B. Mounika and Y. C. Rao, "Modular and Power-Efficient Matrix Multiplication: Implementation of Strassen's Algorithm," 2025 6th International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 2025, pp. 485-489, doi: 10.1109/ICIRCA65293.2025.11089773, https://ieeexplore.ieee.org/abstract/document/11089773
  • Trevor E. Pogue, Nicola Nicolici, 14 Feb 2025, Strassen Multisystolic Array Hardware Architectures, https://arxiv.org/abs/2502.10063
  • Cornelius Brand, Radu Curticapean, Baitian Li, Kevin Pratt, 28 May 2025, Faster Convolutions: Yates and Strassen Revisited, https://arxiv.org/abs/2505.22410
  • Benoit Jacob, 17 Jul 2025, Strassen Matrix Multiplication from a 3-dimensional Volume Form, https://arxiv.org/abs/2507.13510

Winograd Matrix Multiplication

Research papers on using the Winograd or the Coppersmith-Winograd algorithm:

FFT Matrix Multiplication

Research papers using the Fast Fourier Transform (FFT) for matrix multiplication:

  • Zlateski, A., Jia, Z., Li, K., Durand, F., The anatomy of efficient fft and winograd convolutions on modern cpus. Proceedings of the ACM International Conference on Supercomputing (New York, NY, USA, 2019), ICS ’19, Association for Computing Machinery, p. 414–424. PDF: https://dl.acm.org/doi/pdf/10.1145/3330345.3330382
  • Alexandre Siron, Sean Molesky, 25 Jun 2024, A Split Fast Fourier Transform Algorithm for Block Toeplitz Matrix-Vector Multiplication, https://arxiv.org/abs/2406.17981
  • Andrew Lavin, Scott Gray, 10 Nov 2015 (v2), Fast Algorithms for Convolutional Neural Networks, https://arxiv.org/abs/1509.09308
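The FFT route mainly helps for structured matrices (such as the Toeplitz case above) and for convolutions, where the matrix-vector product becomes a pointwise product in the frequency domain. As a minimal illustration, the sketch below assumes a circulant matrix whose size is a power of two and uses a simple recursive radix-2 FFT to compute y = C*x in O(n log n) instead of O(n^2).

    #include <complex>
    #include <vector>
    #include <cmath>
    #include <cstddef>

    using cd = std::complex<double>;

    // Recursive radix-2 Cooley-Tukey FFT (n must be a power of two).
    void fft(std::vector<cd>& a, bool invert) {
        std::size_t n = a.size();
        if (n == 1) return;
        std::vector<cd> even(n / 2), odd(n / 2);
        for (std::size_t i = 0; i < n / 2; ++i) {
            even[i] = a[2 * i];
            odd[i]  = a[2 * i + 1];
        }
        fft(even, invert);
        fft(odd, invert);
        const double PI = std::acos(-1.0);
        double ang = 2.0 * PI / n * (invert ? -1.0 : 1.0);
        cd w(1.0), wn(std::cos(ang), std::sin(ang));
        for (std::size_t i = 0; i < n / 2; ++i) {
            a[i]         = even[i] + w * odd[i];
            a[i + n / 2] = even[i] - w * odd[i];
            if (invert) { a[i] /= 2.0; a[i + n / 2] /= 2.0; }
            w *= wn;
        }
    }

    // Multiply a circulant matrix (defined by its first column c) by x:
    // y = C * x is the circular convolution of c and x, computed via FFT.
    std::vector<double> circulant_matvec(const std::vector<double>& c,
                                         const std::vector<double>& x) {
        std::size_t n = c.size();            // assumed to be a power of two
        std::vector<cd> fc(c.begin(), c.end()), fx(x.begin(), x.end());
        fft(fc, false);
        fft(fx, false);
        for (std::size_t i = 0; i < n; ++i) fc[i] *= fx[i];   // pointwise product
        fft(fc, true);
        std::vector<double> y(n);
        for (std::size_t i = 0; i < n; ++i) y[i] = fc[i].real();
        return y;
    }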

Approximate Matrix Multiplication

Approximate Matrix Multiplication (AMM) covers a variety of model optimization techniques that replace exact matrix multiplications with approximations that avoid much of the cost of arithmetic multiplication, trading off some accuracy. These methods are usually distinct from quantization, are not specific to particular subclasses of matrices, and invoke more advanced mathematics from matrix theory.

Note that these algorithms apply at the high level of how matrices are multiplied with other matrices or with vectors (e.g., avoiding some vector dot products), whereas there are also low-level optimizations of the arithmetic operation of multiplying two numbers (see advanced mathematics). These two classes of research are not the same, and are in fact orthogonal to each other.
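As one concrete, classical flavor of AMM, the sketch below approximates A*B by randomly sampling a few of the rank-one column-times-row terms that sum to the exact product, rescaling them to keep the estimate unbiased. This is just one of many approaches in the AMM literature, shown purely to illustrate the accuracy-versus-work trade-off.

    #include <random>
    #include <cstddef>

    // Approximate matmul by sampled column-row outer products:
    // A*B equals the sum over p of A[:,p] * B[p,:]; we sample only 'samples' of the
    // K terms uniformly (with replacement) and rescale by K/samples to stay unbiased.
    // A is [m x K], B is [K x n], row-major; C is overwritten with the estimate.
    void approx_matmul_sampled(const float* A, const float* B, float* C,
                               std::size_t m, std::size_t K, std::size_t n,
                               std::size_t samples, unsigned seed = 42) {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<std::size_t> pick(0, K - 1);
        for (std::size_t i = 0; i < m * n; ++i) C[i] = 0.0f;

        float scale = static_cast<float>(K) / static_cast<float>(samples);
        for (std::size_t s = 0; s < samples; ++s) {
            std::size_t p = pick(rng);                   // sampled column/row index
            for (std::size_t i = 0; i < m; ++i) {
                float a = A[i * K + p] * scale;
                for (std::size_t j = 0; j < n; ++j)
                    C[i * n + j] += a * B[p * n + j];    // rank-one update
            }
        }
    }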

Research papers on approximate matrix multiplication include:

