Aussie AI
SIMD CPU Parallelization
Last Updated 28 September, 2025
by David Spuler, Ph.D.
Research on SIMD CPU Parallelization
Research papers include:
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Rocke, F. (2023), Evaluation of C++ SIMD Libraries, Bachelor's Thesis, Institut für Informatik, Ludwig-Maximilians-Universität München, https://www.mnm-team.org/pub/Fopras/rock23/ PDF: https://www.mnm-team.org/pub/Fopras/rock23/PDF-Version/rock23.pdf (Reviewed six SIMD libraries: Highway, Vc, Libsimdpp, NSIMD, SIMD Everywhere, Pure SIMD.)
- C Zhou, Z Hassman, R Xu, D Shah, V Richard, Y Li, Oct 2023, SIMD Dataflow Co-optimization for Efficient Neural Networks Inferences on CPUs, arXiv preprint arXiv:2310.00574, https://arxiv.org/pdf/2310.00574.pdf
- David Spuler, March 2024, Chapter 17. AVX Intrinsics, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava, 2 Mar 2024, NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention, https://arxiv.org/abs/2403.01273 Code: https://github.com/tonyzhang617/nomad-dist (Converts 4-bit vector dot products to using SIMD registers as lookup tables on CPUs.)
- Longhao Chen, Yina Zhao, Qiangjun Xie, Qinghua Sheng, 6 Jun 2024, Optimization of Armv9 architecture general large language model inference performance based on Llama.cpp, https://arxiv.org/abs/2406.10816
- Hyungyo Kim, Gaohan Ye, Nachuan Wang, Amir Yazdanbakhsh, Nam Sung Kim, 2024, Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference, IEEE Computer Architecture Letters, vol. 23, pp. 117-120, Jan.-Jun. 2024, DOI: 10.1109/LCA.2024.3397747, https://www.computer.org/csdl/journal/ca/2024/01/10538369/1XcOWKoKwfe
- Z Zhong, January 22nd, 2024, Enhancing SIMD Assembly Language Development with Visualization Techniques, Master's Thesis, Department of Computer Science and Communications Engineering, Waseda University, Japan, https://waseda.repo.nii.ac.jp/record/2001309/files/t5121F099.pdf
- Mike H.B. Gray, 2024, Implementation of Floating-Point Arithmetic Coding Using x86-64 AVX-256 Assembly Language, https://www.opastpublishers.com/open-access-articles/implementation-of-floatingpoint-arithmetic-coding-using-x8664-avx256-assembly-language.pdf
- Z. Zhang, Y. Chen, B. He and Z. Zhang, June 2023, NIOT: A Novel Inference Optimization of Transformers on Modern CPUs, IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 6, pp. 1982-1995, June 2023, doi: 10.1109/TPDS.2023.3269530, https://ieeexplore.ieee.org/abstract/document/10107474
- NVIDIA, Sep 2024, SIMD Intrinsics, https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html
- Justin Luitjens, Dec 04, 2013, CUDA Pro Tip: Increase Performance with Vectorized Memory Access, https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
- Mukul Lokhande, Gopal Raut, Santosh Kumar Vishvakarma, 16 Dec 2024, Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads, https://arxiv.org/abs/2412.11702
- Andrew Chan, Dec 12, 2024, Fast LLM Inference From Scratch: Pushing single-GPU inference throughput to the edge without libraries, https://andrewkchan.dev/posts/yalm.html
- Sarah Butcher & Alex McMurray, Jan 2025, The C++ techniques you need for $600k hedge fund jobs, https://www.efinancialcareers.com/news/low-latency-c
- C Zhang, X Zhu, L Chen, T Yang, E Pan, G Yu, Y Zhao, 2025, Enhancing LLM Inference Performance on ARM CPUs through Software and Hardware Co-optimization Strategies, DOI: 10.23919/ICS.2025.3568404, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10994252
- huizhou92 Oct 3, 2024, SwissTable: A High-Performance Hash Table Implementation, https://dev.to/huizhou92/swisstable-a-high-performance-hash-table-implementation-1knc (Hash tables via a type of linear probing, unlike std::unordered_map, but with a parallel array of extra metadata, allowing fast SIMD optimizations of hashing, with the downside of more frequent resizes than chaining.)
- Abseil, 2017, Swiss Tables Design Notes, https://abseil.io/about/design/swisstables https://github.com/abseil/abseil-cpp
- Héctor Martínez, Adrián Castelló, Francisco D. Igual, Enrique S. Quintana-Ortí, 13 Jun 2025, The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference, https://arxiv.org/abs/2506.11728 (Full coverage of GEMM kernel implementations on CPU SIMD instruction sets, covering AVX, Arm Neon/SVE2, AMX, MMX, and more.)
- Jonathan Bentz, Tony Scudiero, Jon Waxman and Rob Armstrong, Aug 06, 2025, What's New and Important in CUDA Toolkit 13.0, https://developer.nvidia.com/blog/whats-new-and-important-in-cuda-toolkit-13-0/
- Yibo He, Shuoran Zhao, Jiaming Huang, Yingjie Fu, Hao Yu, Cunjian Huang, Tao Xie, 21 Jul 2025, SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation, https://arxiv.org/abs/2507.15224
- Tejas Chaudhari, Akarsh J., Tanushree Dewangan, Mukul Lokhande, and Santosh Kumar Vishvakarma, 18 Aug 2025, XR-NPE: High-Throughput Mixed-precision SIMD Neural Processing Engine for Extended Reality Perception Workloads, https://arxiv.org/abs/2508.13049
- tpoisonooo, May 2025, Matrix Multiplication Optimization, https://deepwiki.com/tpoisonooo/how-to-optimize-gemm/2-matrix-multiplication-optimization https://github.com/tpoisonooo/how-to-optimize-gemm/tree/master
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book: Get your copy from Amazon: Generative AI Applications
- Generative AI programming book: Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book: Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book: Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home