Aussie AI

Hardware Acceleration

  • Last Updated 1 January, 2026
  • by David Spuler, Ph.D.

It all started with the "math coprocessor" chips of the 1980s and 1990s. The modern-day successor is the Graphics Processing Unit (GPU). As the name suggests, GPUs were originally designed to handle graphics calculations, and they are certainly still used for the floating-point work behind the amazingly fast 3D first-person views in games such as Fortnite and Minecraft. However, the role of the GPU has broadened into that of a general mathematical calculation engine, which has found extensive use in two other massive trends: cryptographic calculations (e.g., Bitcoin mining) and the matrix calculations inherent to neural networks and Transformer engines for AI. Such chips are more accurately called "General-Purpose GPUs" (GPGPUs), but lately they are all simply called GPUs.

Hardware acceleration is by far the most successful method of optimization for AI engines to date. As the number of floating point operations used by AI models has grown into the billions, the fastest GPU chips have kept pace through numerous hardware advances. The primary improvements have included raw on-chip speed increases to reduce response time, increased on-chip memory size and bandwidth, and the use of parallelization and pipelining for improved throughput.

Types of AI Hardware Acceleration

There are various types of hardware acceleration that can make a model run faster.

  • Graphics Processing Unit (GPU)
  • Application-Specific Integrated Circuit (ASIC)
  • Field-Programmable Gate Array (FPGA)
  • Central Processing Unit (CPU)
  • Neural Processing Unit (NPU)

Specific hardware acceleration architectural techniques include:

  • General Purpose GPUs (GPGPUs)
  • Caches (on-chip memory caching)
  • Multi-core CPUs
  • Multi-threaded CPUs
  • Single-Instruction Multiple Data (SIMD)
  • Non-Uniform Memory Access (NUMA)

Software Integrations to Hardware Accelerators

Software interfaces to hardware acceleration:

  • BLAS (Basic Linear Algebra Subroutines)
  • CUDA (NVIDIA's proprietary Compute Unified Device Architecture)
  • AVX (Advanced Vector Extensions; also AVX2, AVX-512 and AVX10)
  • OpenCL
  • cuBLAS (NVIDIA GPU BLAS version in CUDA)

Software Strategies for Hardware Acceleration

General software acceleration strategies for maximizing the benefits from hardware-accelerated computation:

  • Pipelining. This refers to keeping the GPU busy with a steady stream of data to chomp through, avoiding "bubbles" in the pipeline: stretches of time when the GPU has nothing to do.
  • Partitioning and dataflow management. This is the software technique of organizing data so it is ready to send quickly to the GPU, usually in contiguous memory.
  • Cache management. Judicious use of the various levels of cache memory can improve pipelining efficiency.
  • Parallelizing. It's all parallel, isn't it? This point refers to writing the overarching algorithms in a parallelism-friendly manner, ensuring that no computation sits idle waiting on another.
  • Deep learning compilers. The full software stack that compiles a model down to code that maximally exploits the hardware.

For many other optimization strategies that are orthogonal to hardware acceleration, and can be used to further optimize a model, see the complete list of AI acceleration techniques.

Survey Papers on AI Hardware Accelerators

Papers that review hardware acceleration frameworks:

AI Announcements from Hardware Vendors

Hardware-Acceleration Research

Various papers on hardware acceleration, out of thousands, include:

GPU Research

Research papers on various GPU issues:

  • Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
  • Jiamin Li, Le Xu, Hong Xu, Aditya Akella, 28 Apr 2024, BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models, https://arxiv.org/abs/2404.18322 (Partitioning inference over blocks for GPU.)
  • Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
  • Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica, 22 Apr 2024, Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity, https://arxiv.org/abs/2404.14527 Code: https://github.com/tyler-griggs/melange-release
  • Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865, 2023. https://arxiv.org/abs/2303.06865
  • Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, Xiaowen Chu, 21 Feb 2024, Benchmarking and Dissecting the Nvidia Hopper GPU Architecture, https://arxiv.org/abs/2402.13499
  • David Spuler, March 2024, Chapter 16. Hardware Acceleration, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
  • Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
  • Seungrok Jung. 15, Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm Blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
  • Dina Genkina, Aug 29, 2024, AI Inference Competition Heats Up: First MLPerf benchmarks for Nvidia Blackwell, AMD, Google, Untether AI, IEEE Spectrum, https://spectrum.ieee.org/new-inference-chips
  • David Spuler, March 2024, GPU Hardware Acceleration, in Generative AI in C++, https://www.aussieai.com/book/ch16-gpu-hardware-acceleration
  • Latent Space, Sep 03, 2024 Efficiency is Coming: 3000x Faster, Cheaper, Better AI Inference from Hardware Improvements, Quantization, and Synthetic Data Distillation, https://www.latent.space/p/nyla
  • Florian Douetteau, September 7, 2024, Get ready for a tumultuous era of GPU cost volatility, https://venturebeat.com/ai/get-ready-for-a-tumultuous-era-of-gpu-cost-volitivity/
  • M Davies, I McDougall, S Anandaraj, D Machchhar, April 2024, A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 20–36, https://doi.org/10.1145/3620665.3640367 https://dl.acm.org/doi/abs/10.1145/3620665.3640367 (Benchmarking analysis of GPU execution extending MLPerf.)
  • Peter Guest, Oct 6, 2023, Graphcore Was the UK's AI Champion—Now It’s Scrambling to Survive, https://www.wired.com/story/graphcore-uk-ai-champion-scrambling-to-stay-afloat/ (An article about GraphCore's struggles against NVIDIA and GPUs with its IPUs.)
  • Etched, June 25, 2024 Etched is Making the Biggest Bet in AI, https://www.etched.com/announcing-etched
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Kif Leswing, Oct 10 2024, AMD launches AI chip to rival Nvidia’s Blackwell, https://www.cnbc.com/2024/10/10/amd-launches-mi325x-ai-chip-to-rival-nvidias-blackwell-.html
  • Paul Delestrac. 2024, Advanced Profiling Techniques For Evaluating GPU Computing Efficiency Executing ML Applications. Ph.D. Thesis, Micro and nanotechnologies/Microelectronics. Université de Montpellier, 2024. English. NNT: 2024UMONS014 https://theses.hal.science/tel-04742193/file/DELESTRAC_2024_archivage.pdf
  • Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan, Stanislav Kamenev, Han Bao, Ting Hu, Mingkai Wang, Qixin Chang, Siyue Sui, Weihao Sun, Jiaxin Hu, Jun Yao, Zekun Yin, Cheng Qian, Ying Zhang, Yinfei Pan, Yu Yang, Weiguo Liu, 22 Oct 2024, FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs, https://arxiv.org/abs/2410.16663
  • Mahernaija, Sep 28, 2024, Update 2024 : The Best NVIDIA GPUs for LLM Inference: A Comprehensive Guide. Comparative Study of All NVIDIA GPU, https://medium.com/@mahernaija/the-best-nvidia-gpus-for-llm-inference-a-comprehensive-guide-56ff5b3e3b1f
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Bagus Hanindhito and Lizy K. John. 2024. Accelerating ML Workloads using GPU Tensor Cores: The Good, the Bad, and the Ugly. In Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering (ICPE '24). Association for Computing Machinery, New York, NY, USA, 178–189. https://doi.org/10.1145/3629526.3653835 https://dl.acm.org/doi/abs/10.1145/3629526.3653835 PDF: https://lca.ece.utexas.edu/pubs/Hanindhito_AcceleratingMLWorkloads.pdf
  • C. Wang, P. Song, H. Zhao, F. Zhang, J. Wang and L. Zhang, "High-Utilization GPGPU Design for Accelerating GEMM Workloads: An Incremental Approach," 2024 IEEE International Symposium on Circuits and Systems (ISCAS), Singapore, Singapore, 2024, pp. 1-5, doi: 10.1109/ISCAS58744.2024.10558334. https://ieeexplore.ieee.org/abstract/document/10558334
  • Wei Zhao, Anand Jayarajan, Gennady Pekhimenko, 9 Oct 2024, Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads, https://arxiv.org/abs/2410.07381 (Interleaved scheduling layer for GPU workloads.)
  • Vasily Volkov, August 12, 2016, Understanding Latency Hiding on GPUs, Ph.D. Thesis, Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2016-143, http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf
  • Z. Chen et al., "An Empirical Study on the Power Consumption of LLMs with Different GPU Platforms," 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 2024, pp. 8640-8642, doi: 10.1109/BigData62323.2024.10825662. https://ieeexplore.ieee.org/abstract/document/10825662
  • Burcu Canakci, Junyi Liu, Xingbo Wu, Nathanaël Cheriere, Paolo Costa, Sergey Legtchenko, Dushyanth Narayanan, Ant Rowstron, 17 Jan 2025, Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure? https://arxiv.org/abs/2501.10187
  • Sama Bali, Jan 15, 2025 GPU Memory Essentials for AI Performance, https://developer.nvidia.com/blog/gpu-memory-essentials-for-ai-performance/
  • W. Choi, J. Jeong, H. Jang and J. Ahn, "GPU-centric Memory Tiering for LLM Serving with NVIDIA Grace Hopper Superchip," in IEEE Computer Architecture Letters, doi: 10.1109/LCA.2025.3533588. https://ieeexplore.ieee.org/abstract/document/10852027
  • Rohan Yadav, Michael Garland, Alex Aiken, Michael Bauer, 9 Apr 2025, Task-Based Tensor Computations on Modern GPUs, https://arxiv.org/abs/2504.07004
  • Burkhard Ringlein, Thomas Parnell, Radu Stoica, 15 May 2025 (v2), GPU Performance Portability needs Autotuning, https://arxiv.org/abs/2505.03780
  • Zixiao Huang, Junhao Hu, Hao Lin, Chunyang Zhu, Yueran Tang, Quanlu Zhang, Zhen Guo, Zhenhua Li, Shengen Yan, Zhenhua Zhu, Guohao Dai, Yu Wang, 22 Jul 2025, Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training, https://arxiv.org/abs/2507.16274
  • Enrico Santi and Fabio Tardivo and Agostino Dovier and Andrea Formisano, 24 Jul 2025, GPU Accelerated Compact-Table Propagation, https://arxiv.org/abs/2507.18413
  • Sina Baghal, 6 Aug 2025, Solving Pasur Using GPU-Accelerated Counterfactual Regret Minimization, https://arxiv.org/abs/2508.06559
  • Tuo Zhang, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang, 10 Aug 2025, Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative, https://arxiv.org/abs/2508.07329
  • Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Vitaly Lavrukhin, Boris Ginsburg, 9 Aug 2025, TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree, https://arxiv.org/abs/2508.07014
  • Lilit Grigoryan, Vladimir Bataev, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Boris Ginsburg, 10 Aug 2025, FlexCTC: GPU-powered CTC Beam Decoding with advanced Contextual Abilities, https://arxiv.org/abs/2508.07315
  • Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu, 28 Jul 2025, LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems, https://arxiv.org/abs/2507.21276
  • Martin Böckling, Heiko Paulheim, 1 Aug 2025, gpuRDF2vec -- Scalable GPU-based RDF2vec, https://arxiv.org/abs/2508.01073
  • Zicong Ye, Kunming Zhang, Guoming Tang, 3 Aug 2025, AGFT: An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization, https://arxiv.org/abs/2508.01744
  • Xiaoxiang Shi, Colin Cai, Junjia Du, Zhihao Jia, 7 Aug 2025, Nexus:Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving, https://arxiv.org/abs/2507.06608
  • Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral, 11 Aug 2025, Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving, https://arxiv.org/abs/2508.08343
  • Iman Khadir, Shane Stevenson, Henry Li, Kyle Krick, Abram Burrows, David Hall, Stan Posey, Samuel S.P. Shen, 12 Aug 2025, Democracy of AI Numerical Weather Models: An Example of Global Forecasting with FourCastNetv2 Made by a University Research Lab Using GPU, https://arxiv.org/abs/2504.17028
  • Yashasvi Makin and Rahul Maliakkal, 28 Jul 2025, Sustainable AI Training via Hardware-Software Co-Design on NVIDIA, AMD, and Emerging GPU Architectures, https://arxiv.org/abs/2508.13163
  • Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun, 19 Aug 2025, Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU, https://arxiv.org/abs/2506.06095
  • Lun Ai, 19 Aug 2025, Boolean Matrix Logic Programming on the GPU, https://arxiv.org/abs/2408.10369
  • Jacob Aguirre, Diego Cifuentes, Vincent Guigues, Renato D.C. Monteiro, Victor Hugo Nascimento, Arnesh Sujanani, 21 Aug 2025, A User Manual for cuHALLaR: A GPU Accelerated Low-Rank Semidefinite Programming Solver, https://arxiv.org/abs/2508.15951
  • Martin Andrews, Sam Witteveen, 22 Aug 2025, GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization, https://arxiv.org/abs/2506.20807
  • Aashaka Shah, Abhinav Jangda, Binyang Li, Caio Rocha, Changho Hwang, Jithin Jose, Madan Musuvathi, Olli Saarikivi, Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, Ziyue Yang, 21 Aug 2025, MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications, https://arxiv.org/abs/2504.09014
  • Mohsen Sheibanian, Pouya Shaeri, Alimohammad Beigi, Ryan T. Woo, Aryan Keluskar, 23 Aug 2025, Tri-Accel: Curvature-Aware Precision-Adaptive and Memory-Elastic Optimization for Efficient GPU Usage, https://arxiv.org/abs/2508.16905
  • Ritvik Chaturvedi, 25 Aug 2025, Practical GPU Choices for Earth Observation: ResNet-50 Training Throughput on Integrated, Laptop, and Cloud Accelerators, https://arxiv.org/abs/2508.18206
  • Haolin Jin, Mengbai Xiao, Yuan Yuan, Xiao Zhang, Dongxiao Yu, Guanghui Zhang, Haoliang Wang, 23 Jul 2025, DistrAttention: An Efficient and Flexible Self-Attention Mechanism on Modern GPUs, https://arxiv.org/abs/2507.17245
  • Murat Temiz and Vemund Bakken, 14 Aug 2025, Electromagnetic Simulations of Antennas on GPUs for Machine Learning Applications, https://arxiv.org/abs/2508.10713
  • Sanjif Shanmugavelu, Mathieu Taillefumier, Christopher Culver, Vijay Ganesh, Oscar Hernandez, Ada Sedova, 22 Aug 2025, Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability, https://arxiv.org/abs/2503.17173
  • Yuebo Luo, Shiyang Li, Junran Tao, Kiran Thorat, Xi Xie, Hongwu Peng, Nuo Xu, Caiwen Ding, Shaoyi Huang, 22 Aug 2025, DR-CircuitGNN: Training Acceleration of Heterogeneous Circuit Graph Neural Network on GPUs, https://arxiv.org/abs/2508.16769
  • Trinayan Baruah, Kaustubh Shivdikar, Sara Prescott, and David Kaeli, 25 Aug 2025, Characterizing the Behavior of Training Mamba-based State Space Models on GPUs, https://arxiv.org/abs/2508.17679
  • Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Shuo Yang, Yang Wang, Miryung Kim, Yongji Wu, Yang Zhou, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, Harry Xu, 3 Sep 2025, ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving, https://arxiv.org/abs/2410.01228
  • Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Bulat Ibragimov, Florina M. Ciorba, Pınar Tözün, 26 Aug 2025, CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator, https://arxiv.org/abs/2508.19073
  • Arya Tschand, Muhammad Awad, Ryan Swann, Kesavan Ramakrishnan, Jeffrey Ma, Keith Lowery, Ganesh Dasika, Vijay Janapa Reddi, 27 Aug 2025, SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization, https://arxiv.org/abs/2508.20258
  • Avinash Maurya, M. Mustafa Rafique, Franck Cappello, and Bogdan Nicolae, 2 Sep 2025, MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall, https://arxiv.org/abs/2509.02480
  • David Cortes, Carlos Juiz, Belen Bermejo, 3 Sep 2025, Estudio de la eficiencia en la escalabilidad de GPUs para el entrenamiento de Inteligencia Artificial (A study of efficiency in GPU scalability for Artificial Intelligence training), https://arxiv.org/abs/2509.03263
  • Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, Alex Aiken, 9 Sep 2025, Astra: A Multi-Agent System for GPU Kernel Performance Optimization, https://arxiv.org/abs/2509.07506
  • Mahmudul Islam Masum, Miad Islam, Arif I. Sarwat, 9 Sep 2025, Accelerating Local AI on Consumer GPUs: A Hardware-Aware Dynamic Strategy for YOLOv10s, https://arxiv.org/abs/2509.07928
  • MSR Avinash, 7 Sep 2025, Profiling LoRA/QLoRA Fine-Tuning Efficiency on Consumer GPUs: An RTX 4060 Case Study, https://arxiv.org/abs/2509.12229
  • Amir Taherin, Juyi Lin, Arash Akbari, Arman Akbari, Pu Zhao, Weiwei Chen, David Kaeli, Yanzhi Wang, 15 Sep 2025, Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs, https://arxiv.org/abs/2509.11480
  • Daniil Shmelev, Cristopher Salvi, 12 Sep 2025, pySigLib - Fast Signature-Based Computations on CPU and GPU, https://arxiv.org/abs/2509.10613
  • Guy Tel-Zur, 15 Sep 2025, A GPU-Accelerated RAG-Based Telegram Assistant for Supporting Parallel Processing Students, https://arxiv.org/abs/2509.11947
  • Ziqi Zhao and Vivek Sarin, 14 Oct 2025, nuGPR: GPU-Accelerated Gaussian Process Regression with Iterative Algorithms and Low-Rank Approximations, https://arxiv.org/abs/2510.12128
  • Marcin Spoczynski, Marcela S. Melara, 27 Oct 2025, Scalable GPU-Based Integrity Verification for Large Machine Learning Models, https://arxiv.org/abs/2510.23938
  • Udit Saxena, 23 Oct 2025, Scalable GPU-Accelerated Euler Characteristic Curves: Optimization and Differentiable Learning for PyTorch, https://arxiv.org/abs/2510.20271
  • Min Si and Pavan Balaji and Yongzhou Chen and Ching-Hsiang Chu and Adi Gangidi and Saif Hasan and Subodh Iyengar and Dan Johnson and Bingzhe Liu and Jingliang Ren and Ashmitha Jeevaraj Shetty and Greg Steinbrecher and Xinfeng Xie and Yulun Wang and Bruce Wu and Jingyi Yang and Mingran Yang and Minlan Yu and Cen Zhao and Wes Bland and Denis Boyda and Suman Gumudavelli and Cristian Lumezanu and Rui Miao and Zhe Qu and Venkat Ramesh and Maxim Samoylov and Jan Seidel and Feng Tian and Qiye Tan and Shuqiang Zhang and Yimeng Zhao and Shengbao Zheng and Art Zhu and Hongyi Zeng, 23 Oct 2025, Collective Communication for 100k+ GPUs, https://arxiv.org/abs/2510.20171
  • Tushar Nayan (1), Ziqi Zhang (2), Ruimin Sun (1) ((1) Florida International University, (2) University of Illinois Urbana-Champaign), 22 Oct 2025, SecureInfer: Heterogeneous TEE-GPU Architecture for Privacy-Critical Tensors for Large Language Model Deployment, https://arxiv.org/abs/2510.19979
  • Mohammad Firas Sada, John J. Graham, Elham E Khoda, Mahidhar Tatineni, Dmitry Mishin, Rajesh K. Gupta, Rick Wagner, Larry Smarr, Thomas A. DeFanti, Frank Würthwein, 22 Oct 2025, Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and NVIDIA Data Center GPUs, https://arxiv.org/abs/2507.00418
  • Aleksandra Franz, Hao Wei, Luca Guastoni, Nils Thuerey, 20 Oct 2025, PICT -- A Differentiable, GPU-Accelerated Multi-Block PISO Solver for Simulation-Coupled Learning Tasks in Fluid Dynamics, https://arxiv.org/abs/2505.16992
  • Palak (Microsoft Research India), Tella Rajashekhar Reddy (Microsoft Research India), Bhaskar Kataria (Cornell University USA), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India), Venkata N. Padmanabhan (Microsoft Research India), 18 Oct 2025, Improving training time and GPU utilization in geo-distributed language model training, https://arxiv.org/abs/2411.14458
  • Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava, 19 Sep 2025, 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float, https://arxiv.org/abs/2504.11651
  • Armin Gerami, Ramani Duraiswami, 24 Oct 2025, Transformer Based Linear Attention with Optimized GPU Kernel Implementation, https://arxiv.org/abs/2510.21956
  • Minh Nguyen, 14 Oct 2025, SpareCodeSearch: Searching for Code Context When You Have No Spare GPU, https://arxiv.org/abs/2510.12948
  • Robert Parker, Oscar Dowson, Nicole LoGiudice, Manuel Garcia, and Russell Bent, 26 Sep 2025, Nonlinear Optimization with GPU-Accelerated Neural Network Constraints, https://arxiv.org/abs/2509.22462
  • Yinglong Zou, Juan Zhai, Chunrong Fang, Zhenyu Chen, 26 Sep 2025, GPU Temperature Simulation-Based Testing for In-Vehicle Deep Learning Frameworks, https://arxiv.org/abs/2509.15815
  • Ankur Lahiry, Ayush Pokharel, Banooqa Banday, Seth Ockerman, Amal Gueroudji, Mohammad Zaeed, Tanzima Z. Islam, Line Pouchard, 21 Oct 2025, A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces, https://arxiv.org/abs/2510.18300
  • Nir Ailon, Akhiad Bercovich, Yahel Uffenheimer, Omri Weinstein, 21 Oct 2025, Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs, https://arxiv.org/abs/2503.12211
  • Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng, 25 Sep 2025, Toward Robust and Efficient ML-Based GPU Caching for Modern Inference, https://arxiv.org/abs/2509.20979
  • Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang, 26 Sep 2025, Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM, https://arxiv.org/abs/2509.22832
  • Honghui Du, QiZhi He, 27 Sep 2025, JAX-MPM: A Learning-Augmented Differentiable Meshfree Framework for GPU-Accelerated Lagrangian Simulation and Geophysical Inverse Modeling, https://arxiv.org/abs/2507.04192
  • Ahmad Raeisi, Mahdi Dolati, Sina Darabi, Sadegh Talebi, Patrick Eugster, and Ahmad Khonsari, 17 Oct 2025, GOGH: Correlation-Guided Orchestration of GPUs in Heterogeneous Clusters, https://arxiv.org/abs/2510.15652
  • Xinyuan Song, Guangji Bai, Liang Zhao, 25 Sep 2025, StructPrune: Structured Global Pruning asymptotics with O(√N) GPU Memory, https://arxiv.org/abs/2510.03246
  • Zerui Wang, Qinghao Hu, Ana Klimovic, Tianwei Zhang, Yonggang Wen, Peng Sun, Dahua Lin, 2 Oct 2025, Semantic-Aware Scheduling for GPU Clusters with Large Language Models, https://arxiv.org/abs/2510.03334
  • Alireza Nik, Michael A. Riegler, Pål Halvorsen, 6 Oct 2025, Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption, https://arxiv.org/abs/2502.11723
  • Yifan Zhao, Egan Johnson, Prasanth Chatarasi, Vikram Adve, Sasa Misailovic, 9 Oct 2025, Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs, https://arxiv.org/abs/2510.08726
  • Zhihong Wu, Lishuang Wang, Kebin Sun, Zhuozhao Li, Ran Cheng, 10 Oct 2025, Enabling Population-Level Parallelism in Tree-Based Genetic Programming for GPU Acceleration, https://arxiv.org/abs/2501.17168
  • Chao Wang, Zhizhao Wen, Ruoxin Zhang, Puyang Xu, Yifan Jiang, 23 Oct 2025, GPU Memory Requirement Prediction for Deep Learning Task Based on Bidirectional Gated Recurrent Unit Optimization Transformer, https://arxiv.org/abs/2510.20985
  • Zhuojin Li, Marco Paolieri, Leana Golubchik, 24 Oct 2025, Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution, https://arxiv.org/abs/2510.21081
  • Jiabo Shi and Dimitrios Pezaros and Yehia Elkhatib, 23 Oct 2025, xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads, https://arxiv.org/abs/2510.21048
  • Javed I. Khan and Henry Uwabor Moye, 9 Sep 2025, A Study of Skews, Imbalances, and Pathological Conditions in LLM Inference Deployment on GPU Clusters detectable from DPU, https://arxiv.org/abs/2509.18114
  • Marcin Chrapek, Marcin Copik, Etienne Mettaz, Torsten Hoefler, 23 Sep 2025, Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs, https://arxiv.org/abs/2509.18886
  • Guilin Zhang, Wulan Guo, Ziqi Tan, Srinivas Vippagunta, Suchitra Raman, Shreeshankar Chatterjee, Ju Lin, Shang Liu, Mary Schladenhauffen, Jeffrey Luo, Hailong Jiang, 22 Oct 2025, Serverless GPU Architecture for Enterprise HR Analytics: A Production-Scale BDaaS Implementation, https://arxiv.org/abs/2510.19689
  • Paul Biberstein, Ziyang Li, Joseph Devietti, Mayur Naik, 29 Sep 2025, Lobster: A GPU-Accelerated Framework for Neurosymbolic Programming, https://arxiv.org/abs/2503.21937
  • Adam Filipek, 7 Oct 2025, TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation, https://arxiv.org/abs/2510.05485
  • Emre Adabag, Marcus Greiff, John Subosits, Thomas Lew, 7 Oct 2025, Differentiable Model Predictive Control on the GPU, https://arxiv.org/abs/2510.06179
  • Hongzheng Chen, Bin Fan, Alexander Collins, Bastian Hagedorn, Evghenii Gaburov, Masahiro Masuda, Matthew Brookhart, Chris Sullivan, Jason Knight, Zhiru Zhang, Vinod Grover, 16 Oct 2025, Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References, https://arxiv.org/abs/2510.14719

Multi-GPU Research

Research papers on various multi-GPU inference and scheduling issues:

GPU Software Platforms

The main GPU software acceleration frameworks include:

  • CUDA (NVIDIA)
  • ROCm (AMD)
  • Triton (open source, originally by OpenAI)
  • oneAPI (Intel)
  • Vulkan
  • SYCL

CPU Execution of AI Workloads

Although GPUs are the mainstay of LLM execution, there is increasing focus on using CPUs for inference. This arises from the need to run on-device inference on AI phones and AI PCs, some of which have an NPU, while others have only limited SIMD capabilities such as the x86 AVX intrinsics.

Research on CPU execution of LLMs:

Neural Processing Unit (NPU)

An NPU is a hardware component designed specifically for AI workloads. It is typically either integrated into the CPU or provided as an add-on component, and it is inherently much less capable than a full GPU. Nevertheless, the NPU is the basis for hardware acceleration on AI phones and some AI PCs.

FPGA

Research papers on FPGA hardware:

  • Beom Jin Kang, Hae In Lee, Seok Kyu Yoon, Young Chan Kim, Sang Beom Jeong, Seong Jun O, Hyun Kim, October 2024, A survey of FPGA and ASIC designs for transformer inference acceleration and optimization, Journal of Systems Architecture, Volume 155, 103247, https://www.sciencedirect.com/science/article/abs/pii/S138376212400184X
  • Han Xu, Yutong Li, Shihao Ji, 12 Sep 2024, LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs, https://arxiv.org/abs/2409.11424 (Matrix multiplications are 97% of computations, which are optimized with a pipelined matrix-vector operation.)
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
  • D. Gupta, A. Purohit and R. Naresh, "FPGA for High-Frequency Trading: Reducing Latency in Financial Systems," 2024 3rd International Conference on Automation, Computing and Renewable Systems (ICACRS), Pudukkottai, India, 2024, pp. 19-25, doi: 10.1109/ICACRS62842.2024.10841781. https://ieeexplore.ieee.org/abstract/document/10841781
  • Jindong Li, Tenglong Li, Guobin Shen, Dongcheng Zhao, Qian Zhang, Yi Zeng, 15 Feb 2025, Pushing up to the Limit of Memory Bandwidth and Capacity Utilization for Efficient LLM Decoding on Embedded FPGA, https://arxiv.org/abs/2502.10659
  • Chenyang Yin, Zhenyu Bai, Pranav Venkatram, Shivam Aggarwal, Zhaoying Li, Tulika Mitra, 23 Feb 2025, TerEffic: Highly Efficient Ternary LLM Inference on FPGA, https://arxiv.org/abs/2502.16473
  • Shaibal Saha, Lanyu Xu, 26 Feb 2025, Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies, https://arxiv.org/abs/2503.02891
  • 24 Apr 2025 (v2), TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs, https://arxiv.org/abs/2504.16266
  • Richie Li, Sicheng Chen, 20 May 2025 (v3), Design and Implementation of an FPGA-Based Hardware Accelerator for Transformer, https://arxiv.org/abs/2503.16731
  • Yaman Umuroglu, Davide Conficconi, Lahiru Rasnayake, Thomas B. Preusser, Magnus Sjalander, 11 Jun 2019 (v2), Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing, https://arxiv.org/abs/1901.00370 (Use of bitserial MatMul with FPGA chips.)

ASIC

Research papers on ASIC hardware:

  • Beom Jin Kang, Hae In Lee, Seok Kyu Yoon, Young Chan Kim, Sang Beom Jeong, Seong Jun O, Hyun Kim, October 2024, A survey of FPGA and ASIC designs for transformer inference acceleration and optimization, Journal of Systems Architecture, Volume 155, 103247, https://www.sciencedirect.com/science/article/abs/pii/S138376212400184X
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++ programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: