Aussie AI
CUDA C++ BF16x9 Emulation in Blackwell
-
Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
-
by David Spuler
What is BF16x9?
BF16x9 is a new floating-point emulation trick that only became viable with the Blackwell architecture. It is built on BF16, which stands for “Brain Float” in 16 bits, a format with the same exponent range as FP32 but much less mantissa precision. Using the specialized BF16 Tensor Cores, it is actually faster to “emulate” standard FP32 arithmetic than to compute it natively: each FP32 value is split into three BF16 values, and multiplying two such triples requires nine BF16 products (3x3), which is where the “x9” in the name comes from. Yes, that’s not a typo; I really did mean to write nine. But, no, it doesn’t mean that nine BF16 operations run faster than one FP32 operation done sequentially; the gain comes from the BF16 Tensor Cores executing those nine operations in parallel at much higher throughput. Overall, the BF16x9 path is about 1.25-1.75 times faster than native FP32, so there is a real gain in speed.
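To make the arithmetic concrete, here is a minimal scalar sketch of the idea in CUDA C++, assuming the conversion intrinsics from <cuda_bf16.h>. It splits each FP32 value into three BF16 parts and sums the nine partial products in FP32. The helper names BF16x3, split3, and mul9 are hypothetical, and this is only an illustration of the decomposition; the actual cuBLAS BF16x9 path runs the nine products as BF16 Tensor Core matrix operations, not as scalar ops.

```cuda
// Conceptual sketch of BF16x9-style FP32 emulation (illustration only).
// Assumes CUDA 12+ and <cuda_bf16.h>; BF16x3, split3, and mul9 are
// hypothetical helper names, not cuBLAS APIs.
#include <cstdio>
#include <cuda_bf16.h>

struct BF16x3 {                 // three-way BF16 decomposition of one float
    __nv_bfloat16 hi, mid, lo;
};

__device__ BF16x3 split3(float x) {
    BF16x3 p;
    p.hi  = __float2bfloat16(x);             // most significant bits
    float r = x - __bfloat162float(p.hi);    // remainder after hi
    p.mid = __float2bfloat16(r);             // next chunk of mantissa bits
    r    -= __bfloat162float(p.mid);         // remainder after mid
    p.lo  = __float2bfloat16(r);             // final correction term
    return p;
}

__device__ float mul9(const BF16x3 &a, const BF16x3 &b) {
    // Nine partial products (3 x 3), accumulated in FP32.
    // A production version would order and accumulate these more carefully,
    // and the real cuBLAS path forms them on BF16 Tensor Cores.
    const __nv_bfloat16 av[3] = {a.hi, a.mid, a.lo};
    const __nv_bfloat16 bv[3] = {b.hi, b.mid, b.lo};
    float sum = 0.0f;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            sum += __bfloat162float(av[i]) * __bfloat162float(bv[j]);
    return sum;
}

__global__ void demo(float x, float y, float *out) {
    *out = mul9(split3(x), split3(y));
}

int main() {
    float *d_out = nullptr, h_out = 0.0f;
    cudaMalloc(&d_out, sizeof(float));
    demo<<<1, 1>>>(1.2345678f, 9.8765432f, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Emulated: %.9f  Native FP32: %.9f\n",
           h_out, 1.2345678f * 9.8765432f);
    cudaFree(d_out);
    return 0;
}
```

Because BF16 keeps roughly 8 bits of significand and FP32 keeps 24, the three parts together capture essentially the full FP32 mantissa, which is why the emulated product lands very close to the native FP32 result.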
References
- Cole Brower, John Gunnels, and Graham Lopez, Oct 24, 2025, Unlocking Tensor Core Performance with Floating Point Emulation in cuBLAS, https://developer.nvidia.com/blog/unlocking-tensor-core-performance-with-floating-point-emulation-in-cublas/
- Samuel Rodriguez, July 1-2, 2025, Floating Point Emulation in NVIDIA Math Libraries, CERN, Geneva, Switzerland, https://indico.cern.ch/event/1538409/contributions/6521976/attachments/3096181/5485165/cern-talk.pdf
- Jonathan Bentz, Tony Scudiero, Jon Waxman, and Rob Armstrong, Aug 06, 2025, What’s New and Important in CUDA Toolkit 13.0, https://developer.nvidia.com/blog/whats-new-and-important-in-cuda-toolkit-13-0/