
  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler

Grace CPU Optimizations

Remember CPUs? They used to be important before all this fast GPU stuff.

And still are, actually. NVIDIA has its own line of CPUs based on the Arm architecture, which can be used in combined CPU-GPU systems. The CPUs from NVIDIA are:

  • Grace CPU (2023) — Arm-based CPU architecture.
  • Vera CPU (2026) — the successor to Grace CPUs.

Combined CPU and GPU systems from NVIDIA include:

  • Grace Hopper superchip (2023) — Grace CPU + Hopper GPU
  • Grace Blackwell systems (2024) — Grace CPU + Blackwell GPU
  • Vera Rubin superchip (2026) — Vera CPU + Rubin GPU

Rack and desktop products combining CPUs and GPUs include:

  • GB200 NVL72 rack (2024) — combines 72 Blackwell B200 GPUs and 36 Grace CPUs.
  • GB300 NVL72 rack (2025) — with 72 Blackwell Ultra B300 GPUs and 36 Grace CPUs.
  • DGX Spark “Project Digits” (January 2025) — high-end desktop AI system based on a Grace Blackwell superchip.

Each Grace CPU chip has the following specifications relevant to host-side C++ programming:

  • SIMD — SVE2 with 4x128-bit vector units per core.
  • Cores — 72 Arm Neoverse V2 cores (144 for the superchip).
  • L1 cache size — 64KB instruction cache and 64KB data cache per core.
  • L2 cache size — 1MB per core.
  • L3 cache size — 114MB (228MB for the superchip).
  • RAM — various options from 120GB to 960GB.

Arm CPU SIMD Vectorization

CPUs actually had parallelism before GPUs, via Single Instruction, Multiple Data (SIMD) operations. The Grace CPU is based on the Arm platform, which has two different instruction sets for SIMD vectorization:

  • Arm Neon
  • Scalable Vector Extensions (SVE)

The Grace CPU actually has SVE2 instructions, which are more advanced. Even so, if you want to count how many floating-point computations the CPU and GPU can do in parallel per clock cycle, you need this:

  • CPU — fingers and toes.
  • GPU — a calculator.

Personally, I recommend one of those old-school HP postfix calculators, rather than the one on Windows.

The CPU does have a higher clock speed to go with these SIMD instructions, which helps it compete against the GPU's massive throughput, but it's still not a fair fight. And just to be doubly unfair, note that the GPU also has its own vectorized memory instructions, which you can access using the float2, float3, and float4 types in CUDA C++. Both CPU and GPU get extra grunt from Instruction-Level Parallelism (ILP), and the CPU adds out-of-order execution on top, but now you've distracted me from the SIMD discussion.
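As a minimal sketch (my own illustrative code, with assumed names and sizes), here's what float4 vectorized access looks like in a CUDA C++ kernel, where each thread handles four floats with one 128-bit load per operand and one 128-bit store:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Illustrative kernel: add two arrays using float4 so each thread issues
    // one 128-bit load per operand and one 128-bit store for the result.
    // Assumes the element count is a multiple of 4 and pointers are 16-byte aligned.
    __global__ void vec_add_float4(const float4* a, const float4* b, float4* c, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 x = a[i], y = b[i];
            c[i] = make_float4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
        }
    }

    int main()
    {
        const int n = 1 << 20, n4 = n / 4;
        float4 *a, *b, *c;
        cudaMallocManaged(&a, n4 * sizeof(float4));
        cudaMallocManaged(&b, n4 * sizeof(float4));
        cudaMallocManaged(&c, n4 * sizeof(float4));
        for (int i = 0; i < n4; ++i) {
            a[i] = make_float4(1, 2, 3, 4);
            b[i] = make_float4(4, 3, 2, 1);
        }
        vec_add_float4<<<(n4 + 255) / 256, 256>>>(a, b, c, n4);
        cudaDeviceSynchronize();
        printf("c[0].x = %f\n", c[0].x);   // expect 5.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

The same idea applies to float2 for 64-bit accesses; the main requirement is that the pointers are suitably aligned for the wider load.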

These SIMD instructions in the Grace CPU are actual hardware opcodes, executed by four 128-bit SIMD units in each core. However, you can access them more easily via C++ intrinsic functions (in host code), so you don't even need to learn assembly language. For bonus fun, you can also use "inline assembly" in C++ to run longer sequences of Arm assembly code, which is often somewhat faster. I used to say that assembly code was less readable, but, really, have you seen the latest updates to modern C++ syntax?
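To make that concrete, here's a hedged host-code sketch using the Arm Neon intrinsics from <arm_neon.h> (my own example, not Grace-specific; the SVE2 instruction set has its own analogous intrinsics in <arm_sve.h>):

    #include <arm_neon.h>   // Arm Neon intrinsics; compile for an Arm target such as Grace
    #include <cstdio>

    // Add two float arrays with 128-bit Neon vectors, 4 lanes at a time.
    // Assumes n is a multiple of 4.
    void neon_vec_add(const float* a, const float* b, float* c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);    // load 4 floats
            float32x4_t vb = vld1q_f32(b + i);
            vst1q_f32(c + i, vaddq_f32(va, vb));  // 4 parallel additions, then store 4 floats
        }
    }

    int main()
    {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        float c[8];
        neon_vec_add(a, b, c, 8);
        printf("c[0] = %f\n", c[0]);   // expect 9.0
        return 0;
    }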

AI CPU-GPU Optimizations

A number of AI workloads run best with the CPU and GPU working together. The CPU has always been responsible for things like:

  • Overall AI algorithms (top-level of training or inference).
  • Keeping the GPU on the straight-and-narrow.
  • Sending data to the GPU and receiving results back.
  • Communicating with other servers and scheduling transfers.
  • Overlapping communications with GPU compute (see the sketch after this list).
  • Multi-GPU algorithm management and synchronization.

However, the CPU can be more actively involved in running the actual kernels. Some of the optimization methods include:

  • GPU to CPU "offloading" of activations or KV caches.
  • Combined GPU-CPU compute kernels.

But it's not like the GPU needs any help.

Offloading Optimizations

The traditional use of the term "offloading" is to refer to transferring computation to a higher-powered system, which is what the term usually means in "edge computing" (offloading to cloud servers). However, offloading in CUDA C++ kernel vernacular usually refers to GPU-to-CPU downloading of data for the CPU to handle (i.e., the opposite meaning). The efficiency goals of GPU-to-CPU offloading may include:

  • Freeing up GPU RAM
  • Sharing the compute load

Values computed by the GPU can be offloaded to the CPU. The benefit can be simply that the CPU has much greater RAM availability, since the GPU's VRAM is critical to hold LLM weights and related computations. Data that is offloaded from GPU to CPU may include:

  • KV cache computations
  • Activation computations

Generally, the KV cache is bigger than activations, so KV cache offloading is a more commonly used technique. The KV cache grows linearly with the number of input tokens, so it can become a memory hog on the GPU.

Indeed, there's a great deal of research on "KV cache compression" to make them smaller, but these techniques can only do so much. The GPU has a continual stream of new KV caches as it processes multiple inference queries from users. After a while, the GPU gets sick of storing all that garbage, and sends it to the CPU to deal with.

KV cache offloading to CPU is not only about memory size. There's also the need for the KV caches to be shared across multiple GPUs for optimizations such as "prefix KV caching" or "global KV caches." The CPU may be better placed to manage this data sharing, leaving the GPU to run more compute-bound kernels.
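Here is a simplified sketch of the memory-size side of KV cache offloading (the KVCacheSlot structure and function names are hypothetical, and a real serving engine would manage buffers and synchronization far more carefully): the KV block is copied to pinned host RAM, its VRAM is released, and the block is copied back when that request becomes active again.

    #include <cuda_runtime.h>
    #include <cstddef>

    // Hypothetical per-request KV cache slot.
    struct KVCacheSlot {
        float* d_kv = nullptr;   // device-side KV cache block
        float* h_kv = nullptr;   // pinned host copy (offloaded)
        size_t bytes = 0;
    };

    // Offload an idle request's KV cache from GPU VRAM to pinned host RAM,
    // then free the device buffer so other requests can use the VRAM.
    void offload_to_cpu(KVCacheSlot& slot, cudaStream_t stream)
    {
        if (!slot.h_kv) cudaMallocHost(&slot.h_kv, slot.bytes);   // pinned allocation
        cudaMemcpyAsync(slot.h_kv, slot.d_kv, slot.bytes,
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);   // ensure the copy finished before freeing
        cudaFree(slot.d_kv);
        slot.d_kv = nullptr;
    }

    // Reload the KV cache onto the GPU when the request resumes decoding.
    void reload_to_gpu(KVCacheSlot& slot, cudaStream_t stream)
    {
        cudaMalloc(&slot.d_kv, slot.bytes);
        cudaMemcpyAsync(slot.d_kv, slot.h_kv, slot.bytes,
                        cudaMemcpyHostToDevice, stream);
    }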

Unified memory management can be helpful in facilitating data offloading or mixed CPU-GPU algorithms. However, the data still has to move, so you need a high-bandwidth interconnect between CPU and GPU (NVLink-C2C in the Grace superchips). Fortunately, this type of interconnect technology is available in both the superchip and rack versions.
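As a minimal unified memory sketch (illustrative only), a single cudaMallocManaged allocation is touched by both host and device code, and the driver migrates pages across the CPU-GPU interconnect on demand:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float* x, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main()
    {
        const int n = 1 << 20;
        float* x = nullptr;
        cudaMallocManaged(&x, n * sizeof(float));    // one pointer, visible to CPU and GPU
        for (int i = 0; i < n; ++i) x[i] = 1.0f;     // CPU writes (pages resident on host)
        scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f); // GPU reads/writes (pages migrate)
        cudaDeviceSynchronize();
        printf("x[0] = %f\n", x[0]);                 // CPU reads the result back
        cudaFree(x);
        return 0;
    }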

Shared compute is a less common reason for offloading to CPU. If the CPU is not doing much, there's some extra processing power that can be used to take partial load off the GPU. There are various ways for the CPU to participate in a joint computation with the GPU. However, don't get too excited, because the GPU looks down its nose at the limited parallelization in the CPU, and the compute load is not evenly shared. Nevertheless, using both processors is "optimal" and there are numerous research papers on this.
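One simple pattern, sketched below under the assumption of a static and deliberately uneven split (not a production scheduler), is to launch the GPU kernel asynchronously on most of the data and let the CPU chew through a small leftover slice while the kernel runs:

    #include <cuda_runtime.h>

    __global__ void saxpy_gpu(float* y, const float* x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Split a SAXPY between GPU and CPU: the GPU handles its portion
    // asynchronously while the CPU processes a (much smaller) host slice.
    void saxpy_split(float* d_y, const float* d_x,    // device portion
                     float* h_y, const float* h_x,    // host portion
                     float a, int n_gpu, int n_cpu, cudaStream_t stream)
    {
        saxpy_gpu<<<(n_gpu + 255) / 256, 256, 0, stream>>>(d_y, d_x, a, n_gpu);
        for (int i = 0; i < n_cpu; ++i)    // CPU works while the kernel runs
            h_y[i] = a * h_x[i] + h_y[i];
        cudaStreamSynchronize(stream);     // both halves done here
    }

In practice the split ratio has to be tuned (or measured at runtime), since the GPU will finish its much larger share in roughly the time the CPU takes to finish its small one only if the ratio matches their relative throughput.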

References

Grace CPU. General references on the Grace CPU architecture include:

  1. Karin Sevegnani and Giuseppe Fiameni, May 27, 2025, Advanced Optimization Strategies for LLM Training on NVIDIA Grace Hopper, https://developer.nvidia.com/blog/advanced-optimization-strategies-for-llm-training-on-nvidia-grace-hopper/ (Covers CPU offloading, unified memory, automatic mixed-precision and FP8 training.)
  2. Graham Lopez, Robert Jensen, Arthy Sundaram and Barton Fiske, Nov 16, 2023, Unlock the Power of NVIDIA Grace and NVIDIA Hopper Architectures with Foundational HPC Software, https://developer.nvidia.com/blog/unlock-the-power-of-nvidia-grace-and-nvidia-hopper-architectures-with-foundational-hpc-software/
  3. NVIDIA, July 2025 (accessed), Grace Performance Tuning Guide, https://docs.nvidia.com/grace-perf-tuning-guide/index.html (Examines SIMD Arm Neon and SVE2 vectorization and atomics, amongst other tuning topics.)
  4. Ricardo Jesus and Michèle Weiland, 2024, Evaluating and optimising compiler code generation for NVIDIA Grace, In Proceedings of the 53rd International Conference on Parallel Processing (ICPP '24), Association for Computing Machinery, New York, NY, USA, 691–700, https://doi.org/10.1145/3673038.3673104, https://dl.acm.org/doi/10.1145/3673038.3673104, PDF: https://dl.acm.org/doi/pdf/10.1145/3673038.3673104 (Compares compilers including Arm Compiler for Linux (ACFL), GNU GCC, LLVM, and NVIDIA HPC Compiler (NVHPC) on Grace CPUs.)
  5. Greg Glockner, Jul 12, 2024, Boosting Mathematical Optimization Performance and Energy Efficiency on the NVIDIA Grace CPU, https://developer.nvidia.com/blog/boosting-mathematical-optimization-performance-and-energy-efficiency-on-the-nvidia-grace-cpu/
  6. NVIDIA, July 2025 (accessed), NVIDIA Grace CPU Benchmarking Guide, https://nvidia.github.io/grace-cpu-benchmarking-guide/, https://github.com/NVIDIA/grace-cpu-benchmarking-guide
  7. Akshay Subramaniam, March 2025, Get the Most Performance From Grace Hopper, https://www.nvidia.com/en-us/on-demand/session/gtc25-s72687/
  8. Ben Bajarin, Austin Lyons, March 17, 2025, A Deeper Look at Grace – NVIDIA’s Custom Arm-based Super Chip, https://creativestrategies.com/research/a-deeper-look-at-grace-nvidias-custom-arm-based-super-chip/
  9. Ashen Wolve DEFI, Jul 14, 2024, NVIDIA Grace CPU Outperforms AMD EPYC in Optimization and Energy Efficiency, https://medium.com/@AshenWolveDEFI/nvidia-grace-cpu-outperforms-amd-epyc-in-optimization-and-energy-efficiency-e1243bf019c3

GPU-to-CPU offloading. Research papers include:

  1. Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen, 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Uses a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference on the width dimension.)
  2. Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You, 2 Mar 2024, HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices, https://arxiv.org/abs/2403.01164
  3. Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 25 Jan 2024, ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models, https://arxiv.org/abs/2401.14351 Code: https://github.com/ServerlessLLM/ServerlessLLM
  4. Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu, Dec 2023, Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models, https://arxiv.org/abs/2311.03687 (Benchmarks model speed for training, fine-tuning and inference with various optimizations such as ZeRO, quantization, offloading/recomputation, and Flash Attention.)
  5. Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
  6. Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin, 4 Jun 2024, SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices, https://arxiv.org/abs/2406.02532 (Speculative decoding with draft trees on low-resource consumer hardware with offloading.)
  7. Donghyeon Joo, Ramyad Hadidi, Soheil Feizi, Bahar Asgari, 17 Jun 2024, Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference, https://arxiv.org/abs/2406.11674
  8. Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
  9. Felippe Vieira Zacarias, Kiran Palli, Sudharshan Vazhkudai, Evelyn Grevelink, July 2024, Analyzing LLM performance: The impact of high-bandwidth memory on model inference, https://www.micron.com/content/dam/micron/global/public/documents/products/product-flyer/llm-inference-engineering-report.pdf
  10. Xunyi Zhao, Lionel Eyraud-Dubois, Théotime Le Hellard, Julia Gusak, Olivier Beaumont, 24 July, 2024, OFFMATE: full fine-tuning of LLMs on a single GPU by re-materialization and offloading, https://hal.science/hal-04660745/document
  11. Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
  12. R. Narmeen, P. Mach, Z. Becvar and I. Ahmad, 16 August 2024, Joint Exit Selection and Offloading Decision for Applications Based on Deep Neural Networks, IEEE Internet of Things Journal, doi: 10.1109/JIOT.2024.3444898, https://doi.org/10.1109/JIOT.2024.3444898 https://ieeexplore.ieee.org/abstract/document/10638073
  13. Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  14. Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
  15. Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
  16. Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang, 8 Sep 2024, InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference, https://arxiv.org/abs/2409.04992
  17. Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Offloading the memory-bound processing of KV caches in attention kernels during decoding to bandwidth-focused GPUs, while reserving compute-bound computations like FFNs and prefill for powerful GPUs.)
  18. Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
  19. J. Niu, W. Zhang, C. J. Xue and N. Guan, 2024, "RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices," 2024 IEEE 30th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Sokcho, Korea, Republic of, 2024, pp. 21-30, doi: 10.1109/RTCSA62462.2024.00013. https://ieeexplore.ieee.org/abstract/document/10695719
  20. Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Cao, Tianlong Chen, 23 Oct 2024 (v2), Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching, https://arxiv.org/abs/2410.14740
  21. Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon, 23 Oct 2024, ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference, https://arxiv.org/abs/2410.17954
  22. Xiaoniu Song, Zihang Zhong, Rong Chen, 29 Oct 2024, ProMoE: Fast MoE-based LLM Serving using Proactive Caching, https://arxiv.org/abs/2410.22134
  23. Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu, 2 Nov 2024, NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference, https://arxiv.org/abs/2411.01142
  24. Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica, 18 Nov 2024, MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs, https://arxiv.org/abs/2411.11217
  25. Rongxiang Wang and Felix Xiaozhu Lin. 2024. Turbocharge Speech Understanding with Pilot Inference. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24). Association for Computing Machinery, New York, NY, USA, 1299–1313. https://doi.org/10.1145/3636534.3690694 https://dl.acm.org/doi/abs/10.1145/3636534.3690694 https://dl.acm.org/doi/pdf/10.1145/3636534.3690694 ("Pilot inference" is a specialized mix of caching, computation reuse, and backtracking in beam search for speech understanding, and is somewhat related to speculative decoding, and similar to continual inference for processing a stream.)
  26. Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
  27. Y Xiao, Dec 2024, Optimizing the Serving System for Large Language Model Inference, https://charlie-xiao.github.io/assets/pdf/projects/fluidinfer.pdf (Concatenated or splits batches for higher throughput.)
  28. Dingyan Zhang, Haotian Wang, Yang Liu, Xingda Wei, Yizhou Shan, Rong Chen, Haibo Chen, 23 Dec 2024, Fast and Live Model Auto Scaling with O(1) Host Caching, https://arxiv.org/abs/2412.17246
  29. Sanghyeon Lee, Hongbeen Kim, Soojin Hwang, Guseul Heo, Minwoo Noh, Jaehyuk Huh. 3 Jan 2025, Efficient LLM Inference with Activation Checkpointing and Hybrid Caching, https://arxiv.org/abs/2501.01792 (Recomputation of the KV cache from stored activations.)
  30. Xunyi Zhao, 2024, Optimizing Memory Usage when Training Deep Neural Networks, Computer Science, Université de Bordeaux, France, https://theses.hal.science/tel-04890912/file/ZHAO_XUNYI_2024.pdf
  31. Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, Dongsheng Li, 21 Jan 2025, A Survey on Memory-Efficient Large-Scale Model Training in AI for Science, https://arxiv.org/abs/2501.11847
  32. Hanfei Yu, Xingqi Cui, Hong Zhang, Hao Wang, Hao Wang, 7 Feb 2025, fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving, https://arxiv.org/abs/2502.05370
  33. Hongsun Jang, Siung Noh, Changmin Shin, Jaewon Jung, Jaeyong Song, Jinho Lee, 14 Feb 2025, INF^2: High-Throughput Generative Inference of Large Language Models using Near-Storage Processing, https://arxiv.org/abs/2502.09921
  34. Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, Anima Anandkumar, 18 Feb 2025, HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading, https://arxiv.org/abs/2502.12574
  35. Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
  36. Hongchao Du, Shangyu Wu, Arina Kharlamova, Nan Guan, Chun Jason Xue, 4 Mar 2025, FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference, https://arxiv.org/abs/2503.03777
  37. Yunkai Liang, Zhangyu Chen, Pengfei Zuo, Zhi Zhou, Xu Chen, Zhou Yu, 26 Mar 2025, Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation https://arxiv.org/abs/2503.20552
  38. Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han, 20 Mar 2025, SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs, https://arxiv.org/abs/2503.16163
  39. Masahiro Tanaka, Du Li, Umesh Chand, Ali Zafar, Haiying Shen, Olatunji Ruwase, 14 Apr 2025, DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training, https://arxiv.org/abs/2504.09983
  40. Xiangwen Zhuge, Xu Shen, Zeyu Wang, Fan Dang, Xuan Ding, Danyang Li, Yahui Han, Tianxiang Hao, Zheng Yang, 21 May 2025 (v3), SpecOffload: Unlocking Latent GPU Capacity for LLM Inference on Resource-Constrained Devices, https://arxiv.org/abs/2505.10259 https://github.com/MobiSense/SpecOffload-public

 
