Aussie AI

Hardware Acceleration

  • Last Updated 1 January, 2026
  • by David Spuler, Ph.D.

It all started with the "math coprocessor" chips of the 1980s and 1990s. The modern-day successor is the Graphics Processing Unit (GPU). As the name suggests, GPUs were originally designed to handle graphics calculations, and they are certainly still used for the floating-point work behind the amazingly fast 3D first-person views in games such as Fortnite and Minecraft. However, the role of the GPU has broadened into that of a general mathematical calculation engine, which has found extensive use in two other massive trends: cryptographic calculations (e.g., Bitcoin mining) and the matrix calculations inherent to neural networks and Transformer engines for AI. Such chips are more accurately called "General-Purpose GPUs" (GPGPUs), but lately they are all simply called GPUs.

Hardware acceleration is by far the most successful method of optimization for AI engines to date. As the number of floating point operations used by AI models has grown into the billions, the fastest GPU chips have kept pace through numerous hardware advances. The primary improvements have included raw on-chip speed increases to reduce response time, increased on-chip memory size and bandwidth, and the use of parallelization and pipelining for improved throughput.

Types of AI Hardware Acceleration

There are various types of hardware acceleration that can make a model run faster.

  • Graphics Processing Unit (GPU)
  • Application-Specific Integrated Circuit (ASIC)
  • Field-Programmable Gate Array (FPGA)
  • Central Processing Unit (CPU)
  • Neural Processing Unit (NPU)

Specific hardware acceleration architectural techniques include:

  • General Purpose GPUs (GPGPUs)
  • Caches (on-chip memory caching)
  • Multi-core CPUs
  • Multi-threaded CPUs
  • Single-Instruction Multiple Data (SIMD)
  • Non-Uniform Memory Access (NUMA)

Software Integrations to Hardware Accelerators

Software interfaces to hardware acceleration:

  • BLAS (Basic Linear Algebra Subroutines)
  • CUDA (NVIDIA's proprietary Compute Unified Device Architecture)
  • AVX (Advanced Vector Extensions; also AVX2, AVX-512 and AVX10)
  • OpenCL
  • cuBLAS (NVIDIA GPU BLAS version in CUDA)

Software Strategies for Hardware Acceleration

General software acceleration strategies for maximizing the benefits from hardware-accelerated computation:

  • Pipelining. This refers to keeping the GPU busy with a steady stream of data to chomp through, avoiding "bubbles" in the pipeline: stretches of time when the GPU has nothing to do.
  • Partitioning and dataflow management. This is the software technique of organizing data so it is ready to send quickly to the GPU, usually in contiguous memory.
  • Cache management. Judicious use of the various levels of cache memory can improve pipelining efficiency.
  • Parallelizing. It's all parallel, isn't it? This point refers to writing the overarching algorithms in a parallelism-friendly manner, ensuring that no computation sits idle waiting on another.
  • Deep learning compilers. The full software stack that compiles a model down to code that maximally exploits the hardware.

For many other optimization strategies that are orthogonal to hardware acceleration, and can be used to further optimize a model, see the complete list of AI acceleration techniques.

Survey Papers on AI Hardware Accelerators

Papers that review hardware acceleration frameworks:

AI Announcements from Hardware Vendors

Hardware-Acceleration Research

Various papers on hardware acceleration, out of thousands, include:

GPU Research

Research papers on various GPU issues:

  • Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
  • Jiamin Li, Le Xu, Hong Xu, Aditya Akella, 28 Apr 2024, BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models, https://arxiv.org/abs/2404.18322 (Partitioning inference over blocks for GPU.)
  • Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
  • Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica, 22 Apr 2024, Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity, https://arxiv.org/abs/2404.14527 Code: https://github.com/tyler-griggs/melange-release
  • Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865, 2023. https://arxiv.org/abs/2303.06865
  • Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, Xiaowen Chu, 21 Feb 2024, Benchmarking and Dissecting the Nvidia Hopper GPU Architecture, https://arxiv.org/abs/2402.13499
  • David Spuler, March 2024, Chapter 16. Hardware Acceleration, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
  • Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
  • Seungrok Jung. 15, Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm Blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
  • Dina Genkina, Aug 29, 2024, AI Inference Competition Heats Up: First MLPerf benchmarks for Nvidia Blackwell, AMD, Google, Untether AI, IEEE Spectrum, https://spectrum.ieee.org/new-inference-chips
  • David Spuler, March 2024, GPU Hardware Acceleration, in Generative AI in C++, https://www.aussieai.com/book/ch16-gpu-hardware-acceleration
  • Latent Space, Sep 03, 2024 Efficiency is Coming: 3000x Faster, Cheaper, Better AI Inference from Hardware Improvements, Quantization, and Synthetic Data Distillation, https://www.latent.space/p/nyla
  • Florian Douetteau, September 7, 2024, Get ready for a tumultuous era of GPU cost volatility, https://venturebeat.com/ai/get-ready-for-a-tumultuous-era-of-gpu-cost-volitivity/
  • M Davies, I McDougall, S Anandaraj, D Machchhar, April 2024, A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 20–36, https://doi.org/10.1145/3620665.3640367 https://dl.acm.org/doi/abs/10.1145/3620665.3640367 (Benchmarking analysis of GPU execution extending MLPerf.)
  • Peter Guest, Oct 6, 2023, Graphcore Was the UK's AI Champion—Now It’s Scrambling to Survive, https://www.wired.com/story/graphcore-uk-ai-champion-scrambling-to-stay-afloat/ (An article about GraphCore's struggles against NVIDIA and GPUs with its IPUs.)
  • Etched, June 25, 2024 Etched is Making the Biggest Bet in AI, https://www.etched.com/announcing-etched
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Kif Leswing, Oct 10 2024, AMD launches AI chip to rival Nvidia’s Blackwell, https://www.cnbc.com/2024/10/10/amd-launches-mi325x-ai-chip-to-rival-nvidias-blackwell-.html
  • Paul Delestrac. 2024, Advanced Profiling Techniques For Evaluating GPU Computing Efficiency Executing ML Applications. Ph.D. Thesis, Micro and nanotechnologies/Microelectronics. Université de Montpellier, 2024. English. NNT: 2024UMONS014 https://theses.hal.science/tel-04742193/file/DELESTRAC_2024_archivage.pdf
  • Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan, Stanislav Kamenev, Han Bao, Ting Hu, Mingkai Wang, Qixin Chang, Siyue Sui, Weihao Sun, Jiaxin Hu, Jun Yao, Zekun Yin, Cheng Qian, Ying Zhang, Yinfei Pan, Yu Yang, Weiguo Liu, 22 Oct 2024, FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs, https://arxiv.org/abs/2410.16663
  • Mahernaija, Sep 28, 2024, Update 2024 : The Best NVIDIA GPUs for LLM Inference: A Comprehensive Guide. Comparative Study of All NVIDIA GPU, https://medium.com/@mahernaija/the-best-nvidia-gpus-for-llm-inference-a-comprehensive-guide-56ff5b3e3b1f
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Bagus Hanindhito and Lizy K. John. 2024. Accelerating ML Workloads using GPU Tensor Cores: The Good, the Bad, and the Ugly. In Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering (ICPE '24). Association for Computing Machinery, New York, NY, USA, 178–189. https://doi.org/10.1145/3629526.3653835 https://dl.acm.org/doi/abs/10.1145/3629526.3653835 PDF: https://lca.ece.utexas.edu/pubs/Hanindhito_AcceleratingMLWorkloads.pdf
  • C. Wang, P. Song, H. Zhao, F. Zhang, J. Wang and L. Zhang, "High-Utilization GPGPU Design for Accelerating GEMM Workloads: An Incremental Approach," 2024 IEEE International Symposium on Circuits and Systems (ISCAS), Singapore, Singapore, 2024, pp. 1-5, doi: 10.1109/ISCAS58744.2024.10558334. https://ieeexplore.ieee.org/abstract/document/10558334
  • Wei Zhao, Anand Jayarajan, Gennady Pekhimenko, 9 Oct 2024, Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads, https://arxiv.org/abs/2410.07381 (Interleaved scheduling layer for GPU workloads.)
  • Vasily Volkov, August 12, 2016, Understanding Latency Hiding on GPUs, Ph.D. Thesis, Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2016-143, http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf
  • Z. Chen et al., "An Empirical Study on the Power Consumption of LLMs with Different GPU Platforms," 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 2024, pp. 8640-8642, doi: 10.1109/BigData62323.2024.10825662. https://ieeexplore.ieee.org/abstract/document/10825662
  • Burcu Canakci, Junyi Liu, Xingbo Wu, Nathanaël Cheriere, Paolo Costa, Sergey Legtchenko, Dushyanth Narayanan, Ant Rowstron, 17 Jan 2025, Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure? https://arxiv.org/abs/2501.10187
  • Sama Bali, Jan 15, 2025 GPU Memory Essentials for AI Performance, https://developer.nvidia.com/blog/gpu-memory-essentials-for-ai-performance/
  • W. Choi, J. Jeong, H. Jang and J. Ahn, "GPU-centric Memory Tiering for LLM Serving with NVIDIA Grace Hopper Superchip," in IEEE Computer Architecture Letters, doi: 10.1109/LCA.2025.3533588. https://ieeexplore.ieee.org/abstract/document/10852027
  • Rohan Yadav, Michael Garland, Alex Aiken, Michael Bauer, 9 Apr 2025, Task-Based Tensor Computations on Modern GPUs, https://arxiv.org/abs/2504.07004
  • Burkhard Ringlein, Thomas Parnell, Radu Stoica, 15 May 2025 (v2), GPU Performance Portability needs Autotuning, https://arxiv.org/abs/2505.03780
  • Zixiao Huang, Junhao Hu, Hao Lin, Chunyang Zhu, Yueran Tang, Quanlu Zhang, Zhen Guo, Zhenhua Li, Shengen Yan, Zhenhua Zhu, Guohao Dai, Yu Wang, 22 Jul 2025, Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training, https://arxiv.org/abs/2507.16274
  • Enrico Santi and Fabio Tardivo and Agostino Dovier and Andrea Formisano, 24 Jul 2025, GPU Accelerated Compact-Table Propagation, https://arxiv.org/abs/2507.18413
  • Sina Baghal, 6 Aug 2025, Solving Pasur Using GPU-Accelerated Counterfactual Regret Minimization, https://arxiv.org/abs/2508.06559
  • Tuo Zhang, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang, 10 Aug 2025, Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative, https://arxiv.org/abs/2508.07329
  • Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Vitaly Lavrukhin, Boris Ginsburg, 9 Aug 2025, TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree, https://arxiv.org/abs/2508.07014
  • Lilit Grigoryan, Vladimir Bataev, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Boris Ginsburg, 10 Aug 2025, FlexCTC: GPU-powered CTC Beam Decoding with advanced Contextual Abilities, https://arxiv.org/abs/2508.07315
  • Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu, 28 Jul 2025, LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems, https://arxiv.org/abs/2507.21276
  • Martin Böckling, Heiko Paulheim, 1 Aug 2025, gpuRDF2vec -- Scalable GPU-based RDF2vec, https://arxiv.org/abs/2508.01073
  • Zicong Ye, Kunming Zhang, Guoming Tang, 3 Aug 2025, AGFT: An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization, https://arxiv.org/abs/2508.01744
  • Xiaoxiang Shi, Colin Cai, Junjia Du, Zhihao Jia, 7 Aug 2025, Nexus:Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving, https://arxiv.org/abs/2507.06608
  • Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral, 11 Aug 2025, Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving, https://arxiv.org/abs/2508.08343
  • Iman Khadir, Shane Stevenson, Henry Li, Kyle Krick, Abram Burrows, David Hall, Stan Posey, Samuel S.P. Shen, 12 Aug 2025, Democracy of AI Numerical Weather Models: An Example of Global Forecasting with FourCastNetv2 Made by a University Research Lab Using GPU, https://arxiv.org/abs/2504.17028
  • Yashasvi Makin and Rahul Maliakkal, 28 Jul 2025, Sustainable AI Training via Hardware-Software Co-Design on NVIDIA, AMD, and Emerging GPU Architectures, https://arxiv.org/abs/2508.13163
  • Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun, 19 Aug 2025, Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU, https://arxiv.org/abs/2506.06095
  • Lun Ai, 19 Aug 2025, Boolean Matrix Logic Programming on the GPU, https://arxiv.org/abs/2408.10369
  • Jacob Aguirre, Diego Cifuentes, Vincent Guigues, Renato D.C. Monteiro, Victor Hugo Nascimento, Arnesh Sujanani, 21 Aug 2025, A User Manual for cuHALLaR: A GPU Accelerated Low-Rank Semidefinite Programming Solver, https://arxiv.org/abs/2508.15951
  • Martin Andrews, Sam Witteveen, 22 Aug 2025, GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization, https://arxiv.org/abs/2506.20807
  • Aashaka Shah, Abhinav Jangda, Binyang Li, Caio Rocha, Changho Hwang, Jithin Jose, Madan Musuvathi, Olli Saarikivi, Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, Ziyue Yang, 21 Aug 2025, MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications, https://arxiv.org/abs/2504.09014
  • Mohsen Sheibanian, Pouya Shaeri, Alimohammad Beigi, Ryan T. Woo, Aryan Keluskar, 23 Aug 2025, Tri-Accel: Curvature-Aware Precision-Adaptive and Memory-Elastic Optimization for Efficient GPU Usage, https://arxiv.org/abs/2508.16905
  • Ritvik Chaturvedi, 25 Aug 2025, Practical GPU Choices for Earth Observation: ResNet-50 Training Throughput on Integrated, Laptop, and Cloud Accelerators, https://arxiv.org/abs/2508.18206
  • Haolin Jin, Mengbai Xiao, Yuan Yuan, Xiao Zhang, Dongxiao Yu, Guanghui Zhang, Haoliang Wang, 23 Jul 2025, DistrAttention: An Efficient and Flexible Self-Attention Mechanism on Modern GPUs, https://arxiv.org/abs/2507.17245
  • Murat Temiz and Vemund Bakken, 14 Aug 2025, Electromagnetic Simulations of Antennas on GPUs for Machine Learning Applications, https://arxiv.org/abs/2508.10713
  • Sanjif Shanmugavelu, Mathieu Taillefumier, Christopher Culver, Vijay Ganesh, Oscar Hernandez, Ada Sedova, 22 Aug 2025, Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability, https://arxiv.org/abs/2503.17173
  • Yuebo Luo, Shiyang Li, Junran Tao, Kiran Thorat, Xi Xie, Hongwu Peng, Nuo Xu, Caiwen Ding, Shaoyi Huang, 22 Aug 2025, DR-CircuitGNN: Training Acceleration of Heterogeneous Circuit Graph Neural Network on GPUs, https://arxiv.org/abs/2508.16769
  • Trinayan Baruah, Kaustubh Shivdikar, Sara Prescott, and David Kaeli, 25 Aug 2025, Characterizing the Behavior of Training Mamba-based State Space Models on GPUs, https://arxiv.org/abs/2508.17679
  • Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Shuo Yang, Yang Wang, Miryung Kim, Yongji Wu, Yang Zhou, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, Harry Xu, 3 Sep 2025, ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving, https://arxiv.org/abs/2410.01228
  • Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Bulat Ibragimov, Florina M. Ciorba, Pınar Tözün, 26 Aug 2025, CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator, https://arxiv.org/abs/2508.19073
  • Arya Tschand, Muhammad Awad, Ryan Swann, Kesavan Ramakrishnan, Jeffrey Ma, Keith Lowery, Ganesh Dasika, Vijay Janapa Reddi, 27 Aug 2025, SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization, https://arxiv.org/abs/2508.20258
  • Avinash Maurya, M. Mustafa Rafique, Franck Cappello, and Bogdan Nicolae, 2 Sep 2025, MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall, https://arxiv.org/abs/2509.02480
  • David Cortes, Carlos Juiz, Belen Bermejo, 3 Sep 2025, Estudio de la eficiencia en la escalabilidad de GPUs para el entrenamiento de Inteligencia Artificial (A study of efficiency in GPU scalability for Artificial Intelligence training), https://arxiv.org/abs/2509.03263
  • Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, Alex Aiken, 9 Sep 2025, Astra: A Multi-Agent System for GPU Kernel Performance Optimization, https://arxiv.org/abs/2509.07506
  • Mahmudul Islam Masum, Miad Islam, Arif I. Sarwat, 9 Sep 2025, Accelerating Local AI on Consumer GPUs: A Hardware-Aware Dynamic Strategy for YOLOv10s, https://arxiv.org/abs/2509.07928
  • MSR Avinash, 7 Sep 2025, Profiling LoRA/QLoRA Fine-Tuning Efficiency on Consumer GPUs: An RTX 4060 Case Study, https://arxiv.org/abs/2509.12229
  • Amir Taherin, Juyi Lin, Arash Akbari, Arman Akbari, Pu Zhao, Weiwei Chen, David Kaeli, Yanzhi Wang, 15 Sep 2025, Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs, https://arxiv.org/abs/2509.11480
  • Daniil Shmelev, Cristopher Salvi, 12 Sep 2025, pySigLib - Fast Signature-Based Computations on CPU and GPU, https://arxiv.org/abs/2509.10613
  • Guy Tel-Zur, 15 Sep 2025, A GPU-Accelerated RAG-Based Telegram Assistant for Supporting Parallel Processing Students, https://arxiv.org/abs/2509.11947
  • Ziqi Zhao and Vivek Sarin, 14 Oct 2025, nuGPR: GPU-Accelerated Gaussian Process Regression with Iterative Algorithms and Low-Rank Approximations, https://arxiv.org/abs/2510.12128
  • Marcin Spoczynski, Marcela S. Melara, 27 Oct 2025, Scalable GPU-Based Integrity Verification for Large Machine Learning Models, https://arxiv.org/abs/2510.23938
  • Udit Saxena, 23 Oct 2025, Scalable GPU-Accelerated Euler Characteristic Curves: Optimization and Differentiable Learning for PyTorch, https://arxiv.org/abs/2510.20271
  • Min Si and Pavan Balaji and Yongzhou Chen and Ching-Hsiang Chu and Adi Gangidi and Saif Hasan and Subodh Iyengar and Dan Johnson and Bingzhe Liu and Jingliang Ren and Ashmitha Jeevaraj Shetty and Greg Steinbrecher and Xinfeng Xie and Yulun Wang and Bruce Wu and Jingyi Yang and Mingran Yang and Minlan Yu and Cen Zhao and Wes Bland and Denis Boyda and Suman Gumudavelli and Cristian Lumezanu and Rui Miao and Zhe Qu and Venkat Ramesh and Maxim Samoylov and Jan Seidel and Feng Tian and Qiye Tan and Shuqiang Zhang and Yimeng Zhao and Shengbao Zheng and Art Zhu and Hongyi Zeng, 23 Oct 2025, Collective Communication for 100k+ GPUs, https://arxiv.org/abs/2510.20171
  • Tushar Nayan (1), Ziqi Zhang (2), Ruimin Sun (1) ((1) Florida International University, (2) University of Illinois Urbana-Champaign), 22 Oct 2025, SecureInfer: Heterogeneous TEE-GPU Architecture for Privacy-Critical Tensors for Large Language Model Deployment, https://arxiv.org/abs/2510.19979
  • Mohammad Firas Sada, John J. Graham, Elham E Khoda, Mahidhar Tatineni, Dmitry Mishin, Rajesh K. Gupta, Rick Wagner, Larry Smarr, Thomas A. DeFanti, Frank Würthwein, 22 Oct 2025, Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and NVIDIA Data Center GPUs, https://arxiv.org/abs/2507.00418
  • Aleksandra Franz, Hao Wei, Luca Guastoni, Nils Thuerey, 20 Oct 2025, PICT -- A Differentiable, GPU-Accelerated Multi-Block PISO Solver for Simulation-Coupled Learning Tasks in Fluid Dynamics, https://arxiv.org/abs/2505.16992
  • Palak (Microsoft Research India), Tella Rajashekhar Reddy (Microsoft Research India), Bhaskar Kataria (Cornell University USA), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India), Venkata N. Padmanabhan (Microsoft Research India), 18 Oct 2025, Improving training time and GPU utilization in geo-distributed language model training, https://arxiv.org/abs/2411.14458
  • Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava, 19 Sep 2025, 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float, https://arxiv.org/abs/2504.11651
  • Armin Gerami, Ramani Duraiswami, 24 Oct 2025, Transformer Based Linear Attention with Optimized GPU Kernel Implementation, https://arxiv.org/abs/2510.21956
  • Minh Nguyen, 14 Oct 2025, SpareCodeSearch: Searching for Code Context When You Have No Spare GPU, https://arxiv.org/abs/2510.12948
  • Robert Parker, Oscar Dowson, Nicole LoGiudice, Manuel Garcia, and Russell Bent, 26 Sep 2025, Nonlinear Optimization with GPU-Accelerated Neural Network Constraints, https://arxiv.org/abs/2509.22462
  • Yinglong Zou, Juan Zhai, Chunrong Fang, Zhenyu Chen, 26 Sep 2025, GPU Temperature Simulation-Based Testing for In-Vehicle Deep Learning Frameworks, https://arxiv.org/abs/2509.15815
  • Ankur Lahiry, Ayush Pokharel, Banooqa Banday, Seth Ockerman, Amal Gueroudji, Mohammad Zaeed, Tanzima Z. Islam, Line Pouchard, 21 Oct 2025, A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces, https://arxiv.org/abs/2510.18300
  • Nir Ailon, Akhiad Bercovich, Yahel Uffenheimer, Omri Weinstein, 21 Oct 2025, Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs, https://arxiv.org/abs/2503.12211
  • Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng, 25 Sep 2025, Toward Robust and Efficient ML-Based GPU Caching for Modern Inference, https://arxiv.org/abs/2509.20979
  • Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang, 26 Sep 2025, Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM, https://arxiv.org/abs/2509.22832
  • Honghui Du, QiZhi He, 27 Sep 2025, JAX-MPM: A Learning-Augmented Differentiable Meshfree Framework for GPU-Accelerated Lagrangian Simulation and Geophysical Inverse Modeling, https://arxiv.org/abs/2507.04192
  • Ahmad Raeisi, Mahdi Dolati, Sina Darabi, Sadegh Talebi, Patrick Eugster, and Ahmad Khonsari, 17 Oct 2025, GOGH: Correlation-Guided Orchestration of GPUs in Heterogeneous Clusters, https://arxiv.org/abs/2510.15652
  • Xinyuan Song, Guangji Bai, Liang Zhao, 25 Sep 2025, StructPrune: Structured Global Pruning asymptotics with O(√N) GPU Memory, https://arxiv.org/abs/2510.03246
  • Zerui Wang, Qinghao Hu, Ana Klimovic, Tianwei Zhang, Yonggang Wen, Peng Sun, Dahua Lin, 2 Oct 2025, Semantic-Aware Scheduling for GPU Clusters with Large Language Models, https://arxiv.org/abs/2510.03334
  • Alireza Nik, Michael A. Riegler, Pål Halvorsen, 6 Oct 2025, Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption, https://arxiv.org/abs/2502.11723
  • Yifan Zhao, Egan Johnson, Prasanth Chatarasi, Vikram Adve, Sasa Misailovic, 9 Oct 2025, Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs, https://arxiv.org/abs/2510.08726
  • Zhihong Wu, Lishuang Wang, Kebin Sun, Zhuozhao Li, Ran Cheng, 10 Oct 2025, Enabling Population-Level Parallelism in Tree-Based Genetic Programming for GPU Acceleration, https://arxiv.org/abs/2501.17168
  • Chao Wang, Zhizhao Wen, Ruoxin Zhang, Puyang Xu, Yifan Jiang, 23 Oct 2025, GPU Memory Requirement Prediction for Deep Learning Task Based on Bidirectional Gated Recurrent Unit Optimization Transformer, https://arxiv.org/abs/2510.20985
  • Zhuojin Li, Marco Paolieri, Leana Golubchik, 24 Oct 2025, Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution, https://arxiv.org/abs/2510.21081
  • Jiabo Shi and Dimitrios Pezaros and Yehia Elkhatib, 23 Oct 2025, xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads, https://arxiv.org/abs/2510.21048
  • Javed I. Khan and Henry Uwabor Moye, 9 Sep 2025, A Study of Skews, Imbalances, and Pathological Conditions in LLM Inference Deployment on GPU Clusters detectable from DPU, https://arxiv.org/abs/2509.18114
  • Marcin Chrapek, Marcin Copik, Etienne Mettaz, Torsten Hoefler, 23 Sep 2025, Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs, https://arxiv.org/abs/2509.18886
  • Guilin Zhang, Wulan Guo, Ziqi Tan, Srinivas Vippagunta, Suchitra Raman, Shreeshankar Chatterjee, Ju Lin, Shang Liu, Mary Schladenhauffen, Jeffrey Luo, Hailong Jiang, 22 Oct 2025, Serverless GPU Architecture for Enterprise HR Analytics: A Production-Scale BDaaS Implementation, https://arxiv.org/abs/2510.19689
  • Paul Biberstein, Ziyang Li, Joseph Devietti, Mayur Naik, 29 Sep 2025, Lobster: A GPU-Accelerated Framework for Neurosymbolic Programming, https://arxiv.org/abs/2503.21937
  • Adam Filipek, 7 Oct 2025, TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation, https://arxiv.org/abs/2510.05485
  • Emre Adabag, Marcus Greiff, John Subosits, Thomas Lew, 7 Oct 2025, Differentiable Model Predictive Control on the GPU, https://arxiv.org/abs/2510.06179
  • Hongzheng Chen, Bin Fan, Alexander Collins, Bastian Hagedorn, Evghenii Gaburov, Masahiro Masuda, Matthew Brookhart, Chris Sullivan, Jason Knight, Zhiru Zhang, Vinod Grover, 16 Oct 2025, Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References, https://arxiv.org/abs/2510.14719

Multi-GPU Research

Research papers on various multi-GPU inference and scheduling issues:

GPU Software Platforms

The main GPU software acceleration frameworks include:

  • CUDA (NVIDIA)
  • ROCm (AMD)
  • Triton (open source, originally by OpenAI)
  • oneAPI (Intel)
  • Vulkan
  • SYCL

CPU Execution of AI Workloads

Although GPUs are the mainstay of LLM execution, there is increasing focus on using CPUs for inference. This arises from the need to run on-device inference on AI phones and AI PCs, some of which have an NPU, while others have only limited SIMD capabilities such as the x86 AVX intrinsics.

Research on CPU execution of LLMs:

Neural Processing Unit (NPU)

An NPU is a hardware component designed specifically for AI workloads. It is typically either integrated into the CPU or provided as an add-on component, and it is inherently much less capable than a full GPU. Nevertheless, the NPU is the basis for hardware acceleration on AI phones and some AI PCs.

FPGA

Research papers on FPGA hardware:

  • Beom Jin Kang, Hae In Lee, Seok Kyu Yoon, Young Chan Kim, Sang Beom Jeong, Seong Jun O, Hyun Kim, October 2024, A survey of FPGA and ASIC designs for transformer inference acceleration and optimization, Journal of Systems Architecture, Volume 155, 103247, https://www.sciencedirect.com/science/article/abs/pii/S138376212400184X
  • Han Xu, Yutong Li, Shihao Ji, 12 Sep 2024, LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs, https://arxiv.org/abs/2409.11424 (Matrix multiplications are 97% of computations, which are optimized with a pipelined matrix-vector operation.)
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
  • D. Gupta, A. Purohit and R. Naresh, "FPGA for High-Frequency Trading: Reducing Latency in Financial Systems," 2024 3rd International Conference on Automation, Computing and Renewable Systems (ICACRS), Pudukkottai, India, 2024, pp. 19-25, doi: 10.1109/ICACRS62842.2024.10841781. https://ieeexplore.ieee.org/abstract/document/10841781
  • Jindong Li, Tenglong Li, Guobin Shen, Dongcheng Zhao, Qian Zhang, Yi Zeng, 15 Feb 2025, Pushing up to the Limit of Memory Bandwidth and Capacity Utilization for Efficient LLM Decoding on Embedded FPGA, https://arxiv.org/abs/2502.10659
  • Chenyang Yin, Zhenyu Bai, Pranav Venkatram, Shivam Aggarwal, Zhaoying Li, Tulika Mitra, 23 Feb 2025, TerEffic: Highly Efficient Ternary LLM Inference on FPGA, https://arxiv.org/abs/2502.16473
  • Shaibal Saha, Lanyu Xu, 26 Feb 2025, Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies, https://arxiv.org/abs/2503.02891
  • 24 Apr 2025 (v2), TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs, https://arxiv.org/abs/2504.16266
  • Richie Li, Sicheng Chen, 20 May 2025 (v3), Design and Implementation of an FPGA-Based Hardware Accelerator for Transformer, https://arxiv.org/abs/2503.16731
  • Yaman Umuroglu, Davide Conficconi, Lahiru Rasnayake, Thomas B. Preusser, Magnus Sjalander, 11 Jun 2019 (v2), Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing, https://arxiv.org/abs/1901.00370 (Use of bitserial MatMul with FPGA chips.)

ASIC

Research papers on ASIC hardware:

  • Beom Jin Kang, Hae In Lee, Seok Kyu Yoon, Young Chan Kim, Sang Beom Jeong, Seong Jun O, Hyun Kim, October 2024, A survey of FPGA and ASIC designs for transformer inference acceleration and optimization, Journal of Systems Architecture, Volume 155, 103247, https://www.sciencedirect.com/science/article/abs/pii/S138376212400184X
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++ programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: