Aussie AI
CUDA C++ BF16x9 Emulation in Blackwell
-
Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
-
by David Spuler
What is BF16x9?
BF16x9 is a new floating-point emulation trick that only became viable with the Blackwell architecture. It is built on BF16, which stands for “Brain Float” in 16 bits, a format with the same exponent range as FP32 but much less mantissa precision. Using the specialized BF16 Tensor Cores, it is actually faster to “emulate” standard FP32 arithmetic than to compute it natively: each FP32 value is split into three BF16 values, and multiplying two such triples requires nine BF16 products (3x3), which is where the “x9” in the name comes from. Yes, that’s not a typo; I really did mean to write nine. But, no, it doesn’t mean that nine BF16 operations run faster than one FP32 operation done sequentially; the gain comes from the BF16 Tensor Cores executing those nine operations in parallel at much higher throughput. Overall, the BF16x9 path is about 1.25-1.75 times faster than native FP32, so there is a real gain in speed.
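To make the arithmetic concrete, here is a minimal scalar sketch of the idea in CUDA C++, assuming the conversion intrinsics from <cuda_bf16.h>. It splits each FP32 value into three BF16 parts and sums the nine partial products in FP32. The helper names BF16x3, split3, and mul9 are hypothetical, and this is only an illustration of the decomposition; the actual cuBLAS BF16x9 path runs the nine products as BF16 Tensor Core matrix operations, not as scalar ops.

```cuda
// Conceptual sketch of BF16x9-style FP32 emulation (illustration only).
// Assumes CUDA 12+ and <cuda_bf16.h>; BF16x3, split3, and mul9 are
// hypothetical helper names, not cuBLAS APIs.
#include <cstdio>
#include <cuda_bf16.h>

struct BF16x3 {                 // three-way BF16 decomposition of one float
    __nv_bfloat16 hi, mid, lo;
};

__device__ BF16x3 split3(float x) {
    BF16x3 p;
    p.hi  = __float2bfloat16(x);             // most significant bits
    float r = x - __bfloat162float(p.hi);    // remainder after hi
    p.mid = __float2bfloat16(r);             // next chunk of mantissa bits
    r    -= __bfloat162float(p.mid);         // remainder after mid
    p.lo  = __float2bfloat16(r);             // final correction term
    return p;
}

__device__ float mul9(const BF16x3 &a, const BF16x3 &b) {
    // Nine partial products (3 x 3), accumulated in FP32.
    // A production version would order and accumulate these more carefully,
    // and the real cuBLAS path forms them on BF16 Tensor Cores.
    const __nv_bfloat16 av[3] = {a.hi, a.mid, a.lo};
    const __nv_bfloat16 bv[3] = {b.hi, b.mid, b.lo};
    float sum = 0.0f;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            sum += __bfloat162float(av[i]) * __bfloat162float(bv[j]);
    return sum;
}

__global__ void demo(float x, float y, float *out) {
    *out = mul9(split3(x), split3(y));
}

int main() {
    float *d_out = nullptr, h_out = 0.0f;
    cudaMalloc(&d_out, sizeof(float));
    demo<<<1, 1>>>(1.2345678f, 9.8765432f, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Emulated: %.9f  Native FP32: %.9f\n",
           h_out, 1.2345678f * 9.8765432f);
    cudaFree(d_out);
    return 0;
}
```

Because BF16 keeps roughly 8 bits of significand and FP32 keeps 24, the three parts together capture essentially the full FP32 mantissa, which is why the emulated product lands very close to the native FP32 result.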
References
- Cole Brower, John Gunnels, and Graham Lopez, Oct 24, 2025, Unlocking Tensor Core Performance with Floating Point Emulation in cuBLAS, https://developer.nvidia.com/blog/unlocking-tensor-core-performance-with-floating-point-emulation-in-cublas/
- Samuel Rodriguez, July 1-2, 2025, Floating Point Emulation in NVIDIA Math Libraries, CERN, Geneva, Switzerland, https://indico.cern.ch/event/1538409/contributions/6521976/attachments/3096181/5485165/cern-talk.pdf
- Jonathan Bentz, Tony Scudiero, Jon Waxman, and Rob Armstrong, Aug 06, 2025, What’s New and Important in CUDA Toolkit 13.0, https://developer.nvidia.com/blog/whats-new-and-important-in-cuda-toolkit-13-0/