Aussie AI
FP64 Arithmetic
Last Updated 15 October, 2025
by David Spuler, Ph.D.
Research on FP64 Arithmetic
FP64 is the 64-bit IEEE 754 double-precision floating-point format, the standard in scientific computing where accuracy matters more than throughput. Much of the recent research focuses on emulating FP64 arithmetic using lower-precision hardware units, such as INT8 and FP8 tensor cores, often via the Ozaki scheme. Research papers include:
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
- Chenhui Xu, Dancheng Liu, Amir Nassereldine, Jinjun Xiong, 16 May 2025, FP64 is All You Need: Rethinking Failure Modes in Physics-Informed Neural Networks, https://arxiv.org/abs/2505.10949
- Josh Covington, February 2, 2024, FP64 vs FP32 vs FP16 and Multi-Precision: Understanding Precision in Computing, https://www.velocitymicro.com/blog/fp64-vs-fp32-vs-fp16-and-multi-precision-understanding-precision-in-computing/
- NVIDIA, Feb 2023, Train With Mixed Precision, https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html https://docs.nvidia.com/deeplearning/performance/pdf/Training-Mixed-Precision-User-Guide.pdf
- Piotr Luszczek, Vijay Gadepally, LaToya Anderson, William Arcand, David Bestor, William Bergeron, Alex Bonn, Daniel J. Burrill, Chansup Byun, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Peter Michaleas, Guillermo Morales, Julia Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee, Jeremy Kepner, 28 Sep 2025, Performance and Numerical Aspects of Decompositional Factorizations with FP64 Floating-Point Emulation in INT8, https://arxiv.org/abs/2509.23565 https://ieee-hpec.org/wp-content/uploads/2025/09/127.pdf
- Rohail T., October 3, 2025, Decompositional Factorizations with FP64 Emulation in INT8 Demonstrate Performance and Numerical Profiles on Hopper GPUs, https://quantumzeitgeist.com/performance-decompositional-factorizations-fp64-emulation-int8-numerical-profiles-hopper/
- Daichi Mukunoki, Katsuhisa Ozaki, Takeshi Ogita, Toshiyuki Imamura, 2020, DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions, High Performance Computing (ISC High Performance 2020), Springer, pages 230-248, https://pmc.ncbi.nlm.nih.gov/articles/PMC7295351/ https://pmc.ncbi.nlm.nih.gov/articles/PMC7295351/pdf/978-3-030-50743-5_Chapter_12.pdf
- Hiroyuki Ootomo, Katsuhisa Ozaki, Rio Yokota, 2024, DGEMM on Integer Matrix Multiplication Unit, The International Journal of High Performance Computing Applications, 38(4):297-313, https://dl.acm.org/doi/abs/10.1177/10943420241239588 https://arxiv.org/abs/2306.11975
- Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura, 8 Aug 2025 (v2), High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines https://arxiv.org/abs/2508.03984
- Fabio Banchelli, Marta Garcia-Gasulla, Filippo Mantovani, 10 Jan 2025, Batched DGEMMs for scientific codes running on long vector architectures, https://arxiv.org/abs/2501.06175
- Tom Cornebize, Arnaud Legrand, 11 Dec 2019, DGEMM performance is data-dependent, https://arxiv.org/abs/1912.05381
- Dominik Ernst, Georg Hager, Jonas Thies, Gerhard Wellein, 18 Feb 2020 (v2), Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs, https://arxiv.org/abs/1905.03136
- Cole Brower, Samuel Rodriguez Bernabeu, Jeff Hammond, John Gunnels, Sotiris S. Xantheas, Martin Ganahl, Andor Menczer, Örs Legeza, 6 Oct 2025, Mixed-precision ab initio tensor network state methods adapted for NVIDIA Blackwell technology via emulated FP64 arithmetic, https://arxiv.org/abs/2510.04795
- M. D. Lepinzan, G. Lacopo, D. Goz, G. Taffoni, P. Monaco, P. J. Elahi, U. Varetto, M. Cytowski, 3 Oct 2025, Accelerating cosmological simulations on GPUs: a portable approach using OpenMP, https://arxiv.org/abs/2510.02873
- Brian Curless, Michael Gowanlock, 28 Aug 2025, Fast and Scalable Mixed Precision Euclidean Distance Calculations Using GPU Tensor Cores, https://arxiv.org/abs/2508.21230
- Daichi Mukunoki, 25 Sep 2025 (v3), DGEMM without FP64 Arithmetic - Using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme, https://arxiv.org/abs/2508.00441
- Katsuhisa Ozaki, Yuki Uchino, Toshiyuki Imamura, 27 Apr 2025 (v3), Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique, https://arxiv.org/abs/2504.08009
- Paul Hübner, Andong Hu, Ivy Peng, Stefano Markidis, 25 Mar 2025 (v2), Apple vs. Oranges: Evaluating the Apple Silicon M-Series SoCs for HPC Performance and Efficiency, https://arxiv.org/abs/2502.05317
- Vivek Bharadwaj, Austin Glover, Aydin Buluc, James Demmel, 8 May 2025 (v4), An Efficient Sparse Kernel Generator for O(3)-Equivariant Deep Networks, https://arxiv.org/abs/2501.13986
- Anwar Hossain Zahid, Ignacio Laguna, Wei Le, 11 Oct 2024, Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs, https://arxiv.org/abs/2410.09172
- Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura, 20 Sep 2024, Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit, https://arxiv.org/abs/2409.13313
- Samuel Rodriguez, 1-2 July 2025, Floating Point Emulation in NVIDIA Math Libraries: Optimizing Floating Point Precision, CERN, Geneva, Switzerland, https://indico.cern.ch/event/1538409/contributions/6521976/attachments/3096181/5485165/cern-talk.pdf
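The emulation papers above share a common core idea: split each high-precision value into a sum of lower-precision pieces, multiply the pieces on fast low-precision units, and accumulate the partial products at higher precision. Below is a minimal Python/NumPy sketch of the splitting step, illustrative only: real schemes such as Ozaki's split the mantissa so that the low-precision products are exact on tensor cores or integer units, whereas this toy version stores the data in FP32 pairs and carries the four partial products in FP64 to isolate the storage error.

```python
import numpy as np

def split32(x):
    """Represent each FP64 value as an unevaluated sum of two FP32 values."""
    hi = x.astype(np.float32)
    lo = (x - hi.astype(np.float64)).astype(np.float32)
    return hi, lo

rng = np.random.default_rng(42)
a = rng.standard_normal(1000)   # FP64 inputs
b = rng.standard_normal(1000)

ref = float(np.dot(a, b))       # native FP64 dot product (reference)

# Plain FP32: storage AND arithmetic at single precision.
plain32 = float(np.dot(a.astype(np.float32), b.astype(np.float32)))

# Split form: FP32 storage, four partial products accumulated in FP64.
ah, al = split32(a)
bh, bl = split32(b)
emul = sum(float(np.dot(p.astype(np.float64), q.astype(np.float64)))
           for p, q in [(ah, bh), (ah, bl), (al, bh), (al, bl)])

err_plain = abs(plain32 - ref)
err_emul = abs(emul - ref)
print(f"plain FP32 error: {err_plain:.3e}, split-pair error: {err_emul:.3e}")
assert err_emul < err_plain   # pair representation recovers most FP64 accuracy
```

The hi/lo pair carries roughly 48 mantissa bits (2x24), so the emulated result lands many orders of magnitude closer to the FP64 reference than the plain FP32 computation, which is the accuracy/throughput trade-off these papers engineer on GPU matrix engines.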
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI (new book on AI intelligence theory). Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications (new book on RAG architectures). Get your copy from Amazon: RAG Optimization
- Generative AI Applications. Get your copy from Amazon: Generative AI Applications
- Generative AI programming. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging. Get your copy from Amazon: CUDA C++ Debugging
- C++ AVX Optimization: CPU SIMD Vectorization. Get your copy from Amazon: C++ AVX Optimization
- C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations. Get your copy from Amazon: C++ Ultra-Low Latency
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- Research Home