Aussie AI
SIMD CPU Parallelization
Last Updated 28 September, 2025
by David Spuler, Ph.D.
Research on SIMD CPU Parallelization
Research papers include:
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Rocke, F. (2023), Evaluation of C++ SIMD Libraries, Bachelor's Thesis, Institut für Informatik, Ludwig-Maximilians-Universität München, https://www.mnm-team.org/pub/Fopras/rock23/ PDF: https://www.mnm-team.org/pub/Fopras/rock23/PDF-Version/rock23.pdf (Reviewed six SIMD libraries: Highway, Vc, Libsimdpp, NSIMD, SIMD Everywhere, Pure SIMD.)
- C Zhou, Z Hassman, R Xu, D Shah, V Richard, Y Li, Oct 2023, SIMD Dataflow Co-optimization for Efficient Neural Networks Inferences on CPUs, arXiv preprint arXiv:2310.00574, https://arxiv.org/pdf/2310.00574.pdf
- David Spuler, March 2024, Chapter 17. AVX Intrinsics, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava, 2 Mar 2024, NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention, https://arxiv.org/abs/2403.01273 Code: https://github.com/tonyzhang617/nomad-dist (Converts 4-bit vector dot products to using SIMD registers as lookup tables on CPUs.)
- Longhao Chen, Yina Zhao, Qiangjun Xie, Qinghua Sheng, 6 Jun 2024, Optimization of Armv9 architecture general large language model inference performance based on Llama.cpp, https://arxiv.org/abs/2406.10816
- Hyungyo Kim, Gaohan Ye, Nachuan Wang, Amir Yazdanbakhsh, Nam Sung Kim, 2024, Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference, IEEE Computer Architecture Letters, vol. 23, pp. 117-120, Jan.-Jun. 2024, DOI: 10.1109/LCA.2024.3397747, https://www.computer.org/csdl/journal/ca/2024/01/10538369/1XcOWKoKwfe
- Z Zhong, January 22nd, 2024, Enhancing SIMD Assembly Language Development with Visualization Techniques, Master's Thesis, Department of Computer Science and Communications Engineering, Waseda University, Japan, https://waseda.repo.nii.ac.jp/record/2001309/files/t5121F099.pdf
- Mike H.B. Gray, 2024, Implementation of Floating-Point Arithmetic Coding Using x86-64 AVX-256 Assembly Language, https://www.opastpublishers.com/open-access-articles/implementation-of-floatingpoint-arithmetic-coding-using-x8664-avx256-assembly-language.pdf
- Z. Zhang, Y. Chen, B. He and Z. Zhang, June 2023, NIOT: A Novel Inference Optimization of Transformers on Modern CPUs, IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 6, pp. 1982-1995, June 2023, doi: 10.1109/TPDS.2023.3269530, https://ieeexplore.ieee.org/abstract/document/10107474
- NVIDIA, Sep 2024, SIMD Intrinsics, https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html
- Justin Luitjens, Dec 04, 2013, CUDA Pro Tip: Increase Performance with Vectorized Memory Access, https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
- Mukul Lokhande, Gopal Raut, Santosh Kumar Vishvakarma, 16 Dec 2024, Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads, https://arxiv.org/abs/2412.11702
- Andrew Chan, Dec 12, 2024, Fast LLM Inference From Scratch: Pushing single-GPU inference throughput to the edge without libraries, https://andrewkchan.dev/posts/yalm.html
- Sarah Butcher & Alex McMurray, Jan 2025, The C++ techniques you need for $600k hedge fund jobs, https://www.efinancialcareers.com/news/low-latency-c
- C Zhang, X Zhu, L Chen, T Yang, E Pan, G Yu, Y Zhao, 2025, Enhancing LLM Inference Performance on ARM CPUs through Software and Hardware Co-optimization Strategies, DOI: 10.23919/ICS.2025.3568404, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10994252
- huizhou92 Oct 3, 2024, SwissTable: A High-Performance Hash Table Implementation, https://dev.to/huizhou92/swisstable-a-high-performance-hash-table-implementation-1knc (Hash tables via a type of linear probing, unlike std::unordered_map, but with a parallel array of extra metadata, allowing fast SIMD optimizations of hashing, with the downside of more frequent resizes than chaining.)
- Abseil, 2017, Swiss Tables Design Notes, https://abseil.io/about/design/swisstables https://github.com/abseil/abseil-cpp
- Héctor Martínez, Adrián Castelló, Francisco D. Igual, Enrique S. Quintana-Ortí, 13 Jun 2025, The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference, https://arxiv.org/abs/2506.11728 (Full coverage of GEMM kernel implementations on CPU SIMD instruction sets, covering AVX, Arm Neon/SVE2, AMX, MMX, and more.)
- Jonathan Bentz, Tony Scudiero, Jon Waxman and Rob Armstrong, Aug 06, 2025, What's New and Important in CUDA Toolkit 13.0, https://developer.nvidia.com/blog/whats-new-and-important-in-cuda-toolkit-13-0/
- Yibo He, Shuoran Zhao, Jiaming Huang, Yingjie Fu, Hao Yu, Cunjian Huang, Tao Xie, 21 Jul 2025, SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation, https://arxiv.org/abs/2507.15224
- Tejas Chaudhari, Akarsh J., Tanushree Dewangan, Mukul Lokhande, and Santosh Kumar Vishvakarma, 18 Aug 2025, XR-NPE: High-Throughput Mixed-precision SIMD Neural Processing Engine for Extended Reality Perception Workloads, https://arxiv.org/abs/2508.13049
- tpoisonooo, May 2025, Matrix Multiplication Optimization, https://deepwiki.com/tpoisonooo/how-to-optimize-gemm/2-matrix-multiplication-optimization https://github.com/tpoisonooo/how-to-optimize-gemm/tree/master
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book: Get your copy from Amazon: Generative AI Applications
- Generative AI programming book: Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book: Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book: Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home