Aussie AI

Rubin and Feynman GPU Optimizations

  • Bonus Material from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler


Future NVIDIA GPU Architectures

These are the upcoming releases in the Vera Rubin architecture, including the Rubin GPUs and Vera CPUs:

  • Rubin R100 GPU (2026 or maybe late 2025)
  • Rubin R200 GPU (2026)
  • Vera CPU (2026)
  • Vera Rubin superchip (2026)
  • Vera Rubin NVL144 rack (2026) with R200 GPUs and Vera CPUs
  • Rubin Ultra R300 GPU (2027)
  • Vera Rubin Ultra NVL576 rack (2027) with R300 GPUs and Vera CPUs

There is even less certainty about the dates for the subsequent Richard Feynman architecture releases:

  • Richard Feynman architecture GPUs (F100/F200?) (2028)
  • Feynman rack (2028) with Feynman GPU and Vera CPU (i.e., not a new Feynman CPU).

Rubin Specs and CUDA Optimizations

Some of the technical specs about the Rubin GPU:

  • Memory: 288 GB of HBM4 memory (versus Blackwell’s 192 GB).
  • Compute: 50 PFLOPS of FP4 compute (compared to Blackwell’s 20 PFLOPS).
  • NVL144 rack: total of 3.6 exaflops of FP4 compute.
  • Connectivity: increased bandwidth with 8-Hi HBM4 stacks.
  • Manufacturing: 3nm chip with increased transistor density and lower power consumption.

Although we don’t know all the new capabilities in the hardware, and there will also be new software coming between now and then, let’s make some reasonable assumptions:

  • Native FP4 tensor cores (so as to achieve the above-mentioned FP4 rates).
  • Hardware-native FP4 computations.
  • FP4 natively packed, with two 4-bit FP4 values in each 8-bit byte.
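As a concrete illustration of that packing, here is a host-side C++ sketch of storing two 4-bit FP4 bit patterns in one byte. The low-nibble-first layout is an assumption; it simply mirrors what a paired type like __nv_fp4x2_e2m1 presumably stores, and the actual hardware layout may differ:

```cpp
#include <cstdint>

// Pack two 4-bit FP4 bit patterns into one byte (low nibble = first value).
// Host-side sketch; the hardware layout of __nv_fp4x2_e2m1 is an assumption.
uint8_t pack_fp4x2(uint8_t lo, uint8_t hi) {
    return (uint8_t)((lo & 0x0F) | ((hi & 0x0F) << 4));
}

uint8_t unpack_lo(uint8_t packed) { return packed & 0x0F; }
uint8_t unpack_hi(uint8_t packed) { return (packed >> 4) & 0x0F; }
```

The practical upshot is that a weight array of N FP4 values occupies only (N + 1) / 2 bytes, halving memory traffic relative to FP8.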

Note that the CUDA Math API Reference, version 12.9, already has these:

  • FP4 data types: __nv_fp4_e2m1, __nv_fp4x2_e2m1, __nv_fp4x4_e2m1
  • FP4 conversion functions: __nv_cvt_float_to_fp4() and __nv_cvt_fp4_to_halfraw()
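Since the exact Rubin-side behavior of these intrinsics isn't documented yet, here is a plain C++ model of the E2M1 format itself (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1), which has only eight non-negative magnitudes. The round-to-nearest and saturate-to-6.0 choices below are assumptions; the real __nv_cvt_float_to_fp4() rounding mode may differ:

```cpp
#include <cmath>
#include <cstdint>

// The 8 non-negative magnitudes representable in FP4 E2M1
// (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1).
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Decode a 4-bit E2M1 pattern (bit 3 = sign) to float.
float fp4_to_float(uint8_t bits) {
    float mag = kE2M1[bits & 0x7];
    return (bits & 0x8) ? -mag : mag;
}

// Encode a float to the nearest E2M1 pattern, saturating at +/-6.0
// (ties resolve toward the smaller magnitude; an assumption).
uint8_t float_to_fp4(float x) {
    uint8_t sign = (x < 0.0f) ? 0x8 : 0x0;
    float mag = std::fabs(x);
    uint8_t best = 0;
    for (uint8_t i = 1; i < 8; ++i) {
        if (std::fabs(kE2M1[i] - mag) < std::fabs(kE2M1[best] - mag)) best = i;
    }
    return (uint8_t)(sign | best);
}
```

Note how coarse the grid is: everything above 6.0 saturates, which is why FP4 pipelines lean so heavily on per-block scaling factors.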

The CUDA C++ optimizations likely to be unique to Rubin include:

  • CUDA C++ hardware-native FP4 intrinsics (e.g., __nv_fp4 intrinsics).
  • Hardware-supported full-speed FP4 matrix math in tensor cores.
  • Use FP4 natively for attention modules and FFN blocks (and with fused kernels).
  • Train directly in FP4 with Quantization Aware Training (QAT).

CUDA C++ optimizations that wouldn’t be new to Rubin, but that would nevertheless be enhanced, may include:

  • Even larger LLMs fit into GPU RAM (esp. when Rubin Ultra arrives with 1 TB of HBM4e memory).
  • Fused epilogues for FP4 GEMM kernels (activations, bias, quantization/dequantization).
  • Grouped GEMMs with native FP4 for both training and inference.
  • Shared Epilogues in Grouped GEMM kernels.
  • Thread Block Clusters (multiple block features).
  • Multicast Tensor Memory Accelerator (TMA) loads (multi-block shared memory enhancements).
  • Distributed Shared Memory (via multicast TMA).
  • Persistent L2 cache (save L2 cache data between kernel launches).
  • Unified Cache Hierarchy (merging the L1/texture/shared memory caches).
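The grouped-GEMM-with-shared-epilogue pattern from the list above can be sketched in plain C++: a batch of independent small GEMM problems, each followed by a bias-add and ReLU applied in the same pass rather than in a second kernel launch. This is only the computational shape; a real Rubin-era CUTLASS grouped kernel would tile and schedule these across thread block clusters:

```cpp
#include <algorithm>
#include <vector>

// One "group": an independent small GEMM problem C = A(MxK) * B(KxN).
struct GemmGroup {
    int M, N, K;
    std::vector<float> A, B, bias;  // row-major; bias has N entries
    std::vector<float> C;           // output, M x N
};

// Grouped GEMM with a shared fused epilogue: bias add + ReLU applied
// in the same pass as the accumulation, avoiding a second kernel.
void grouped_gemm_fused(std::vector<GemmGroup>& groups) {
    for (auto& g : groups) {
        g.C.assign((size_t)g.M * g.N, 0.0f);
        for (int i = 0; i < g.M; ++i)
            for (int j = 0; j < g.N; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < g.K; ++k)
                    acc += g.A[i * g.K + k] * g.B[k * g.N + j];
                // Shared epilogue: bias + ReLU, fused with the GEMM.
                g.C[i * g.N + j] = std::max(0.0f, acc + g.bias[j]);
            }
    }
}
```

Sharing one epilogue across all groups is what lets a mixture-of-experts FFN, for example, run its per-expert matrices in a single grouped launch.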

This release would likely involve optimizations and features such as:

  • CUTLASS FP4 kernels based on native FP4 intrinsics.
  • Transformer Engine v3 (maybe, since we’re at v2.4.0 now).
  • CX9 SuperNIC in NVL144/NVL576 rack configurations for faster CUDA graphs and “tail launches.”
  • FP4 quantization pipelines with 4-bit aware inference or training using custom FP4 kernels or CUTLASS FP4 kernels.
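To make the "FP4 quantization pipeline" bullet concrete, here is a hedged C++ sketch of per-block quantization: pick the block's maximum magnitude, map it onto the largest FP4 value (6.0), and store 4-bit codes plus one FP32 scale per block. The block size, scale format, and rounding policy are all assumptions; real pipelines vary on each:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// E2M1 magnitude table (1 sign, 2 exponent, 1 mantissa bits).
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Nearest E2M1 code for a float (assumed round-to-nearest).
uint8_t nearest_fp4(float x) {
    uint8_t sign = (x < 0.0f) ? 0x8 : 0x0;
    float mag = std::fabs(x);
    uint8_t best = 0;
    for (uint8_t i = 1; i < 8; ++i)
        if (std::fabs(kE2M1[i] - mag) < std::fabs(kE2M1[best] - mag)) best = i;
    return (uint8_t)(sign | best);
}

// One quantized block: 4-bit codes plus a single FP32 scale.
struct Fp4Block { std::vector<uint8_t> codes; float scale; };

// Quantize a block so its max magnitude lands on FP4's largest value, 6.0.
Fp4Block quantize_block(const std::vector<float>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    Fp4Block b;
    b.scale = (amax > 0.0f) ? amax / 6.0f : 1.0f;
    for (float v : x) b.codes.push_back(nearest_fp4(v / b.scale));
    return b;
}

float dequantize(const Fp4Block& b, size_t i) {
    float m = kE2M1[b.codes[i] & 0x7];
    float v = (b.codes[i] & 0x8) ? -m : m;
    return v * b.scale;
}
```

Smaller blocks mean tighter scales and less quantization error, at the cost of more scale metadata; that trade-off is the central tuning knob in any 4-bit pipeline.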

Some of this might be correct. Your mileage may vary.

Feynman Fantasy Specs and Optimizations

Well, the Rubin ideas above were based on announced specs and compute thresholds, but the thoughts on Feynman are closer to guesswork. Here is a set of specs, some known and some estimated:

  • Memory: 1 or 2 TB with HBM5 support (estimated).
  • Compute: 100 PFLOPS of FP4 compute (doubling Rubin R300’s).
  • Connectivity: increased bandwidth beyond 8-Hi HBM4 stacks.
  • Manufacturing: 3nm chip or maybe even a 2nm version (high density, low power).
  • Feynman rack: with Feynman GPU and a Vera CPU, extending NVL576 (exact details unknown).
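Some quick capacity arithmetic on those memory estimates: at FP4, a parameter costs half a byte, so the parameter counts below follow directly (using decimal TB/GB, and ignoring KV cache, activations, and framework overhead, so these are upper bounds, not practical model sizes):

```cpp
#include <cstdint>

// How many FP4 parameters fit in a given memory budget:
// two 4-bit parameters per byte (0.5 bytes per parameter).
// Ignores KV cache, activations, and overhead, so this is an upper bound.
uint64_t max_fp4_params(uint64_t bytes) {
    return bytes * 2;
}
// Rubin's 288 GB  -> ~576 billion parameters
// Feynman at 2 TB -> ~4 trillion parameters
```

So even the low-end 1 TB estimate would hold a ~2-trillion-parameter model entirely in one GPU's memory at FP4, which is the real significance of the HBM5 numbers.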

Some hypothetical possibilities for CUDA C++ coding optimizations in 2028 include:

  • FP2 native tensor cores?
  • Quantum acceleration with quantum-aware scheduling?
  • Neural cache prefetching?
  • Multi-Reticle Warp Fusion?
  • Cluster-of-Clusters Kernels?
  • Dynamic shared epilogues?
  • CX10 Interconnect Optimization (after CX9 in Rubin)?
  • Adaptive routing and congestion-aware scheduling?

There’s no harm in dreaming.

References

  1. Sean Hollister Mar 19, 2025, NVIDIA announces Blackwell Ultra GB300 and Vera Rubin, its next AI ‘superchips’, The Verge, https://www.theverge.com/news/631835/nvidia-blackwell-ultra-ai-chip-gb300
  2. Wikipedia, July 2025 (accessed), Rubin microarchitecture, https://en.wikipedia.org/wiki/Rubin_%28microarchitecture%29
  3. Glenn’s Digital Garden, Jun 19, 2025, NVIDIA R200, https://glennklockwood.com/garden/processors/R200
  4. Glenn’s Digital Garden, Mar 21, 2025, NVIDIA R300, https://glennklockwood.com/garden/processors/R300
  5. Sebastian Moss, March 18, 2025, NVIDIA’s Rubin Ultra NVL576 rack expected to be 600kW, coming second half of 2027: And you just started getting ready for 130kW, https://www.datacenterdynamics.com/en/news/nvidias-rubin-ultra-nvl576-rack-expected-to-be-600kw-coming-second-half-of-2027/
  6. Jarred Walton, March 19, 2025, NVIDIA announces Rubin GPUs in 2026, Rubin Ultra in 2027, Feynman also added to roadmap, https://www.tomshardware.com/pc-components/gpus/nvidia-announces-rubin-gpus-in-2026-rubin-ultra-in-2027-feynam-after
  7. Anthony Garreffa, Mar 18, 2025, NVIDIA unveils next-gen Feynman GPU in GTC 2025 roadmap, should use HBM5 memory in 2028, https://www.tweaktown.com/news/104025/nvidia-unveils-next-gen-feynman-gpu-in-gtc-2025-roadmap-should-use-hbm5-memory-2028/index.html
  8. Brian Wang, March 20, 2025, NVIDIA Ultra, Rubin and Feynman Chips and Data Center Roadmap, https://www.nextbigfuture.com/2025/03/nvidia-ultra-rubin-and-feynman-chips-and-data-center-roadmap.html
  9. Benj Edwards, Mar 19, 2025, NVIDIA announces “Rubin Ultra” and “Feynman” AI chips for 2027 and 2028: CEO Jensen Huang says new chips will power robots and billions of AI agents, https://arstechnica.com/ai/2025/03/nvidia-announces-rubin-ultra-and-feynman-ai-chips-for-2027-and-2028/
  10. Hassan Mujtaba, Mar 18, 2025, NVIDIA Rubin & Rubin Ultra With Next-Gen Vera CPUs Start Arriving Next Year: Up To 1 TB HBM4 Memory, 4-Reticle Sized GPUs, 100PF FP4 & 88 CPU Cores, https://wccftech.com/nvidia-rubin-rubin-ultra-next-gen-vera-cpus-next-year-1-tb-hbm4-memory-4-reticle-sized-gpus-100pf-fp4-88-cpu-cores/
  11. Daniel Sims, December 6, 2024, Next-gen NVIDIA GPU “Rubin” is ahead of schedule, uses 3nm manufacturing and HBM4: Availability now expected in the second half of 2025, https://www.techspot.com/news/105852-nvidia-blackwell-ai-successor-rubin-moves-forward-six.html

 
