Aussie AI

Hopper and Blackwell optimizations

  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler

New and Upcoming NVIDIA GPU Architectures

NVIDIA has an extensive pipeline of new GPU and CPU architectures, and also combines the two into superchips and rack-scale systems for very high compute levels. NVIDIA has recently moved to an annual cadence of major hardware upgrades, with interim releases rolling out at various times through the year.

What’s new?

First, let’s look at the latest systems that are already available for you to put on your credit card. As of writing in July 2025, these GPU and CPU chips, superchips, and rack systems are available:

  • Hopper H100 GPU (September 2022)
  • Grace CPU (September 2024)
  • Hopper H200 GPU (November 2024)
  • Grace Hopper superchip (2024)
  • Blackwell B100 GPU (November 2024)
  • Blackwell B200 GPU (2025)
  • DGX B200 rack (2025) with 8 B200 GPUs and 2 Intel Xeon CPUs
  • Grace Blackwell superchip (2025) with B200 GPU and Grace CPU
  • HGX B200 rack (2025) with 8 B200 GPUs
  • DGX Spark “Project Digits” desktop system (January 2025)
  • GB200 NVL72 rack (2025) with 72 B200 GPUs and 36 Grace CPUs

Note that there wasn’t a new Blackwell CPU version, so the Grace Blackwell superchip still uses the Grace CPU like its preceding Grace Hopper superchip.

Looking further ahead, using a crystal ball and some NVIDIA press releases, here are NVIDIA’s announced plans for additional Blackwell releases:

  • Blackwell Ultra B300 GPU (late 2025)
  • DGX GB300 rack (late 2025) with 8 B300 GPUs and 2 Intel Xeon 6776P CPUs
  • GB300 NVL72 Grace Blackwell rack (late 2025) with 72 B300 GPUs and 36 Grace CPUs

Hopper GPUs

The Hopper architecture series launched in September 2022, with the release of the “H100 Tensor Core GPU.” It was soon followed by the H200 GPU and a new line of CPUs called the “Grace CPU” (based on the ARM CPU architecture).

New features of the Hopper GPU architecture allowed major advances in AI and scientific computing workloads. Some of the main features include:

  • Massive throughput of over 1,000 TFLOPS of FP8 compute.
  • Memory of 80GB using HBM2e (PCIe) or HBM3.
  • HBM3 memory at up to 3.35 TB/second bandwidth.
  • NVLink 4th-generation interconnect support at 900 GB/second.
  • Manufactured on a 5nm process.

Lower-level specs for the compute capabilities (these figures are for the PCIe variant of the H100; they can be confirmed at runtime with the device query sketch after this list):

  • CUDA Cores: 14,592
  • Tensor Cores: 456
  • Streaming Multiprocessors: 114
  • L1 cache: 256 KB (per SM)
  • L2 cache: 50MB
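
These figures can be checked on your own hardware with a standard CUDA runtime device query. Below is a minimal sketch; all of the printed fields are standard cudaDeviceProp members (note that the runtime reports the maximum shared memory per SM, which is a portion of the combined L1/shared array, rather than a separate L1 figure):

    // Minimal device query sketch: print SM count and cache sizes for device 0.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);    // query the first GPU
        printf("Device: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
        printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
        printf("Max shared memory per SM: %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
        printf("L2 cache size: %d MB\n", prop.l2CacheSize / (1024 * 1024));
        printf("Global memory: %.1f GB\n", prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }

On an H100 PCIe card this should report 114 SMs and a 50 MB L2 cache, matching the list above.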

The Hopper H200 GPU debuted in November 2024, with these upgraded specs:

  • Memory: 141 GB of HBM3e RAM
  • Memory bandwidth: 4.8 TB/second
  • Superchip: GH200 superchip with Grace CPU

Rack system configurations included:

  • DGX H100 rack: 8 H100s, NVLink 4.0, NVSwitch.
  • DGX SuperPODs (extending DGX H100 racks).
  • HGX H100 rack: 4/8 H100 GPUs, SXM5 or PCIe, NVLink, NVSwitch.
  • HGX H200 rack: adds 141 GB RAM per GPU and 4.8 TB/second bandwidth.
  • HGX H200 PODs: further extending the racks.
  • GH200 Superchip rack: H100 GPUs and Grace CPUs, connected via NVLink-C2C at 900 GB/second.
  • GH200 NVL rack systems: powered by the GH200 superchip.

More detailed programming features for CUDA C++ on Hopper H100/H200 GPUs:

  • FP8 Tensor Cores — use FP8 quantization for training or inference (see the conversion sketch after this list).
  • Mixed precision support — FP8, FP16, BF16, TF32, INT8, FP32, and FP64.
  • Transformer Engine with FP8 support.
  • Multi-GPU optimizations and connectivity upgrades.
  • Confidential computing hardware-native privacy support.
  • Dynamic Programming eXtension (DPX) GPU hardware instructions.
  • Multi-Instance GPU (MIG) partitioning into 7 logical instances.
  • Unified Cache Hierarchy merging L1/texture/shared memory caches.
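
As a small illustration of the FP8 and mixed-precision items above, here is a hedged sketch of converting float data to the FP8 E4M3 format and back, using the conversion intrinsics from the cuda_fp8.h header (CUDA 11.8 or later). The kernel names and the dequantization path through half precision are illustrative; FP8 Tensor Core matrix math itself is normally accessed via libraries such as cuBLASLt or the Transformer Engine rather than these scalar conversions.

    // Sketch: quantize floats to FP8 (E4M3) and dequantize back via half precision.
    #include <cuda_fp8.h>    // __nv_fp8_storage_t and __nv_cvt_* conversion intrinsics
    #include <cuda_fp16.h>   // __half, __half2float

    __global__ void quantize_to_fp8(const float* in, __nv_fp8_storage_t* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Saturating conversion into E4M3 (dynamic range roughly +/-448).
            out[i] = __nv_cvt_float_to_fp8(in[i], __NV_SATFINITE, __NV_E4M3);
        }
    }

    __global__ void dequantize_from_fp8(const __nv_fp8_storage_t* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            __half h(__nv_cvt_fp8_to_halfraw(in[i], __NV_E4M3));  // FP8 -> half
            out[i] = __half2float(h);                             // half -> float
        }
    }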

These new features in Hopper H100/H200 GPUs are also further enhanced in Blackwell architectures.

Blackwell Specs

The Blackwell architecture brings a whole lot more performance, extending well beyond Hopper H100. Blackwell B200 specs include:

  • Memory: 192GB (2x96) of HBM3e GPU RAM (up from 80GB in H100)
  • Bandwidth: 8TB/second memory bandwidth (up from 3.35 TB/second in the Hopper H100)
  • Manufacturing: 4nm process

GB200 NVL72 Rack System. With the “Grace Blackwell” GPU-CPU combination, the NVL72 rack system becomes the GB200 NVL72 version, with these features:

  • 72 Blackwell B200 GPUs
  • 36 Grace CPUs
  • NVLink-C2C (Chip-to-Chip) 900 GB/second connectivity

Blackwell New CPU? Note that there’s no new CPU in the Blackwell series, either available or announced at this stage. Hence, there’s no new “David CPU” (the architectures are named after Grace Hopper and David Blackwell), and the Blackwell GPUs work in superchips and racks as a “Grace Blackwell” combination, which uses the Grace CPU from the previous Grace Hopper series. There is a planned “Vera CPU” for the Vera Rubin architecture series in 2026.

Blackwell Ultra Specs

The Blackwell Ultra B300 in late 2025 improves upon the Blackwell B200 GPUs, and also extends the NVL72 rack system. Feature enhancements include:

  • 1.5x the FLOPS of the base Blackwell B200
  • FP4 Tensor Cores
  • Support for FP4, MXFP4 and MXFP6 formats
  • 288GB of HBM3e GPU RAM (up from 192GB)

When combined into the latest NVL72 rack system, the GB300 NVL72 rack system includes:

  • “Grace Blackwell” superchip architecture
  • 72 Blackwell Ultra B300 GPUs
  • 36 Grace CPUs
  • NVLink connectivity with 130 TB/second bandwidth.

This leads to some very high benchmarks for the whole rack:

  • Memory: total of 40 TB of coherent memory — GPU HBM3e plus Grace CPU memory, shared across the rack via NVLink connectivity.
  • Compute: a total of 1.1 exaflops of dense FP4 inference compute — 1.5 times the GB200 rack.

This is the same number of CPUs and GPUs as the GB200 NVL72 rack, but with the newer Blackwell Ultra GPUs and faster connectivity.

New Blackwell CUDA C++ Optimizations

The Blackwell B200 and Blackwell Ultra B300 GPUs add new capabilities that are available via C++ APIs and intrinsics. Some of the major CUDA C++ optimization methods include:

  • Block-scaled quantization — per-block scaling of quantization from FP32 to FP4/FP8, and dequantization.
  • Block Floating-Point (BFP) — the underpinning floating-point data format in block-scaled quantization.
  • Grouped GEMM APIs — batching multiple independent matrix multiplications into a single grouped operation.
  • FP4/FP8 computation capabilities — use FP4 or FP8 quantization for inference or training.
  • Shared Epilogues — similar to “fused epilogues” but across multiple matrices.
  • BF16x9 FP32 Emulation — a faster way to emulate FP32 matrix multiplication using nine BF16 operations (each FP32 operand is split into three BF16 pieces).
  • Unified Cache Hierarchy — combined cache for L1 cache, texture memory cache, and shared memory cache.
  • Persistent L2 cache — programmatic marking of L2 cache memory to persist across kernel launches (see the sketch after this list).
  • UMMA (Universal Matrix Multiply Accumulate) instructions for mixed-precision GEMM.

These optimizations are “new to Blackwell” or at least significantly extended with new features in Blackwell’s architecture. However, that’s not the full list.

What’s Enhanced by Blackwell?

Well, everything. The optimizations are not just the new ones. With all the old CUDA C++ strategies, you can run more threads, use more memory, and it all still runs faster on Blackwell. There’s also faster intra-GPU connectivity between SMs, and faster inter-GPU network connectivity in multi-GPU rack systems.

Maybe I could even say that the CUDA C++ optimizations are not enhanced by Blackwell, because now you need them less.

Some of the CUDA C++ programming techniques that are not exactly “new to Blackwell,” but are certainly relevant for further code optimizations, include the following (a cluster launch sketch follows this list):

  • Thread block clusters — significantly enhanced in Blackwell with faster SM-to-SM connectivity.
  • CUDA cluster size maximum — 16 blocks per thread block cluster in Blackwell.
  • Dynamic cluster launching — the thread block cluster size chosen at runtime.
  • Multi-GPU optimization methods — faster connectivity enhances these on NVL72 racks.
  • Distributed Shared Memory — based on inter-GPU shared memory in thread block clusters.
  • Multicast TMA loads — faster Tensor Memory Accelerator (TMA) features.
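
To illustrate dynamic cluster launching and distributed shared memory from the list above, here is a hedged sketch that picks the thread block cluster size at runtime via cudaLaunchKernelEx, and has each block read a peer block’s shared memory through the cooperative groups cluster API (requires compute capability 9.0 or later, i.e., Hopper or Blackwell; the kernel and sizes are illustrative):

    // Sketch: runtime-chosen thread block cluster size plus distributed shared memory.
    #include <cooperative_groups.h>
    #include <cuda_runtime.h>
    namespace cg = cooperative_groups;

    __global__ void cluster_kernel(float* out) {
        __shared__ float local[256];
        cg::cluster_group cluster = cg::this_cluster();

        local[threadIdx.x] = (float)threadIdx.x;   // fill this block's shared memory
        cluster.sync();                            // every block in the cluster has written

        // Distributed shared memory: map the shared buffer of the next block in the cluster.
        unsigned int peer = (cluster.block_rank() + 1) % cluster.num_blocks();
        float* peer_local = cluster.map_shared_rank(local, peer);
        out[blockIdx.x * blockDim.x + threadIdx.x] = peer_local[threadIdx.x];

        cluster.sync();                            // don't exit while peers may still be reading
    }

    void launch_with_cluster(float* d_out) {
        cudaLaunchConfig_t config = {};
        config.gridDim  = dim3(8);                 // total blocks; must be a multiple of the cluster size
        config.blockDim = dim3(256);

        cudaLaunchAttribute attr;
        attr.id = cudaLaunchAttributeClusterDimension;
        attr.val.clusterDim.x = 4;                 // cluster size chosen at runtime (not compile time)
        attr.val.clusterDim.y = 1;
        attr.val.clusterDim.z = 1;
        config.attrs    = &attr;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, d_out);
    }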

The overall effect of all this:

  • Faster LLM attention and FFN kernels
  • Faster MatMul/GEMM for matrix multiplication
  • CUTLASS kernel enhancements implementing these primitives.

For further improvements, upgrade from Blackwell to the NVIDIA Vera Rubin architecture.
