Aussie AI Blog

AGI might require higher precision FP32 or FP64

  • Feb 2nd, 2026
  • by David Spuler, Ph.D.

AGI and FP32 or FP64

I have a weird theory about getting to AGI: it might require the use of higher-precision arithmetic such as FP32 and FP64. There are no specific computational results that I can point to; it's just a feeling, based on this:

  • AI failings are (now) often subtle and nuanced.
  • Subtlety requires accuracy in choosing the right words of an answer.
  • The later model layers often choose between two or more tokens that are both reasonably good.

Certainly, the industry is currently based on quantization to avoid the extensive costs of GPU computation. BF16 has become the "de facto standard" for LLM training (see Lee et al., Mar 2025, https://arxiv.org/abs/2405.18710). Most inference compute is heading to lower bit-count quantization, such as INT8 or FP4/FP8. There's no palpable industry demand for FP32, let alone FP64. INT4 quantization is commonly used in inference and offers a reasonable trade-off between accuracy and cost. Indeed, the Blackwell and Rubin generations of NVIDIA GPUs are adding features like native FP4/FP8 tensor cores, without much mention of enhancements to FP32 or FP64.
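To make the precision trade-off concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization. The function names and the simple round-to-nearest scheme are illustrative assumptions for this post, not any particular library's API:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0          # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 weights from INT8 codes."""
    return q.astype(np.float32) * scale
```

Every weight is recovered to within half a quantization step (scale/2), which shows exactly where the lost precision goes: the subtler the distinction between two weights, the more likely INT8 collapses them to the same code.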

Is FP64 required? How much precision might be needed to encode the nuances of AGI? FP64 is widely used in scientific computing, but hardly at all in AI. However, papers on FP64 compute are starting to appear. Another wrinkle is the variety of papers looking at using lower-end data types such as INT8 to perform "FP64 emulation". Native FP64 capability is available in data center GPUs, but these emulations may be desirable for on-device AI. There's also the analogous CUDA C++ optimization known as "BF16x9 FP32 emulation" on Blackwell GPUs, which emulates FP32 matrix multiplication on BF16 tensor cores: each FP32 operand is split into three BF16 components, so each FP32 product becomes nine BF16 products.
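As a rough illustration of the splitting idea behind such emulation schemes, here is a NumPy sketch that mimics BF16 by truncating the low 16 bits of an FP32 value, splits each operand into three BF16 pieces, and reconstructs the FP32 product from the nine partial products. The helper names are my own, and real hardware uses round-to-nearest and tensor-core matrix units rather than scalar arithmetic:

```python
import numpy as np

def to_bf16(x):
    """Truncate FP32 to BF16 precision (keep the top 16 bits of the word)."""
    a = np.asarray(x, dtype=np.float32)
    return (a.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def split3(x):
    """Split an FP32 value into three BF16 components: x ~= b0 + b1 + b2."""
    x = np.float32(x)
    b0 = to_bf16(x)                       # top 8 significand bits
    r1 = np.float32(x - b0)               # residual is exact in FP32
    b1 = to_bf16(r1)                      # next 8 bits
    b2 = to_bf16(np.float32(r1 - b1))     # last 8 bits of the 24-bit mantissa
    return b0, b1, b2

def bf16x9_mul(x, y):
    """Emulate an FP32 product via 9 BF16-precision products, FP32-accumulated."""
    xs, ys = split3(x), split3(y)
    # each 8-bit x 8-bit product is exactly representable in FP32
    return np.float32(sum(np.float32(a) * np.float32(b) for a in xs for b in ys))
```

A single truncated BF16 product loses several decimal digits, while the nine-product reconstruction lands within a few FP32 rounding errors of the true product, which is the whole appeal of these emulation tricks.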

Of course, AGI is going to require a lot of improvements, not just more bits in compute. Improvements to training data, tool integrations, and training algorithms are all ongoing. But I'm just wondering if maybe they need a few more bits for that!

References

  1. Joonhyung Lee, Jeongin Bae, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee, 25 Mar 2025 (v2), To FP8 and Back Again: Quantifying Reduced Precision Effects on LLM Training Stability, https://arxiv.org/abs/2405.18710 ("BrainFloat16 (BF16) precision has become the de facto standard for LLM training")
  2. Cole Brower, Samuel Rodriguez Bernabeu, Jeff Hammond, John Gunnels, Sotiris S. Xantheas, Martin Ganahl, Andor Menczer, Örs Legeza, 6 Oct 2025, Mixed-precision ab initio tensor network state methods adapted for NVIDIA Blackwell technology via emulated FP64 arithmetic, https://arxiv.org/abs/2510.04795
  3. Daichi Mukunoki, 25 Sep 2025 (v3), DGEMM without FP64 Arithmetic - Using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme, https://arxiv.org/abs/2508.00441
  4. Samuel Rodriguez, July 2025, Floating Point Emulation in NVIDIA Math Libraries: Optimizing Floating Point Precision, CERN, July 1-2, 2025, Geneva, Switzerland, https://indico.cern.ch/event/1538409/contributions/6521976/attachments/3096181/5485165/cern-talk.pdf

More AI Research Topics


AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging