Aussie AI

Serving and Deployment

Last Updated 25 April, 2026

by David Spuler, Ph.D.

Serving

Serving is the practical matter of how to architect the full production application around the LLM. Other components may include a web server, application server, RAG datastore, retriever, load balancer, and more. Furthermore, several techniques affect the speed of inference:

Batching
Prefill versus decoding phase
Scheduling
Load balancing
Frameworks (backend)
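As a rough illustration of two items in the list above (batching and load balancing), here is a minimal Python sketch. It is not taken from any particular serving framework: the names Request, form_batches, and pick_least_loaded, and the limits max_batch_size and max_batch_tokens, are hypothetical and chosen only for this example. It assumes requests are batched greedily in arrival order, and that the load balancer simply picks the replica with the fewest active requests.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    req_id: int
    prompt_tokens: int  # tokens that the prefill phase must process


def form_batches(queue: List[Request],
                 max_batch_size: int,
                 max_batch_tokens: int) -> List[List[Request]]:
    """Greedily group queued requests, in arrival order, into batches.

    A batch is closed when adding the next request would exceed either
    the request-count limit or the total prompt-token budget.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    current_tokens = 0
    for req in queue:
        too_many = len(current) >= max_batch_size
        too_big = current_tokens + req.prompt_tokens > max_batch_tokens
        if current and (too_many or too_big):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(req)
        current_tokens += req.prompt_tokens
    if current:
        batches.append(current)
    return batches


def pick_least_loaded(active_requests: List[int]) -> int:
    """Return the index of the replica with the fewest active requests."""
    return min(range(len(active_requests)), key=lambda i: active_requests[i])


if __name__ == "__main__":
    queue = [Request(0, 100), Request(1, 300), Request(2, 700),
             Request(3, 50), Request(4, 50)]
    for batch in form_batches(queue, max_batch_size=2, max_batch_tokens=800):
        print([r.req_id for r in batch])
    print(pick_least_loaded([3, 1, 2]))
```

Note that in this sketch a single request larger than the token budget still gets a batch of its own rather than being rejected. Production schedulers typically go further, for example by treating the prefill and decoding phases separately, as discussed in several of the papers below.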

LLM Serving: Book Excerpts and Blog Articles

Free online book excerpts (full-text chapters and free PDF downloads) and related Aussie AI blog articles:

David Spuler, March 2024, AI Tech Stack, in Generative AI in C++, https://www.aussieai.com/book/ch5-ai-tech-stack
David Spuler, March 2024, Load Balancing, in Generative AI in C++, https://www.aussieai.com/book/ch7-load-balancer
David Spuler, 26th August, 2024, State-of-the-Art LLM Backends, Aussie AI Blog, https://www.aussieai.com/blog/state-of-the-art-llm-backends
David Spuler, Ph.D., March 12th, 2026, Scaling Your AI Wrapper Architecture, Aussie AI Blog, https://www.aussieai.com/blog/scaling-ai-wrapper-architectures
David Spuler, Ph.D., March 1st 2026 (updated), List of 600+ Low-Latency C++ Techniques, Aussie AI Blog, https://www.aussieai.com/blog/list-low-latency-techniques
David Spuler, Ph.D., Feb 6th, 2026 (updated), 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
David Spuler, Ph.D., September 22, 2025, List of CUDA C++ Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/list-cuda-optimization-techniques
David Spuler, Michael Sharpe, June 2025, RAG Deployment, Chapter 12, "RAG Optimization: Accurate and Efficient LLM Applications", https://www.aussieai.com/book/rag-book-12-rag-deployment
David Spuler, Ph.D., January 23, 2025, Low Latency Programming, Aussie AI Blog, https://www.aussieai.com/blog/low-latency-programming
David Spuler, Ph.D., December 9th, 2024, Humans are the Top Layer of the AI Stack, Aussie AI Blog, https://www.aussieai.com/blog/human-top-layer
David Spuler, Ph.D., December 9th, 2024, Reasoning is the New AI Middleware, Aussie AI Blog, https://www.aussieai.com/blog/reasoning-middleware
David Spuler, Ph.D., December 9th, 2024, The AI Application Layer, Aussie AI Blog, https://www.aussieai.com/blog/application-layer
David Spuler, Michael Sharpe, 2025, Architectures of AI Projects, Chapter 12, in "Generative AI Applications", https://www.aussieai.com/book/ai-apps-book-12-architectures-project
David Spuler, Michael Sharpe, 2025, Deployment Architecture, Chapter 18, in "Generative AI Applications", https://www.aussieai.com/book/ai-apps-book-18-deployment-architecture
David Spuler, March 2024, Chapter 5. Design Choices & Architectures, in book "Generative AI in C++", https://www.aussieai.com/book/ch5-design-architectures
David Spuler, March 2024, Chapter 7. Deployment Architecture, in book "Generative AI in C++", https://www.aussieai.com/book/ch7-deployment-architecture
David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, in book "Generative AI in C++", https://www.aussieai.com/book/ch54-ensemble-research
David Spuler, March 2024, Generative AI in C++: Coding Transformers and LLMs, https://www.aussieai.com/book/toc PDF: https://www.aussieai.com/pdf/BOOK-Generative-AI-CPP-Spuler-2024.pdf

Research on LLM Serving

Recently, there has been an explosion of papers about the practical aspects of deployment, orchestration, and serving of LLM inference. Here are some of the papers:

Zao Zhang, 23 May 2024, Design Efficient Deep Neural Networks with System Optimization, Ph.D. Thesis, School of Electrical and Information Engineering, Faculty of Engineering, The University of Sydney, Australia, PDF: https://ses.library.usyd.edu.au/bitstream/handle/2123/32642/zhang_z_thesis.pdf?sequence=1&isAllowed=y https://ses.library.usyd.edu.au/handle/2123/32642 https://hdl.handle.net/2123/32642
Sohaib Ahmad, Hui Guan, Ramesh K. Sitaraman, 2024, Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling, https://guanh01.github.io/files/2024hpdc-loki.pdf
Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
Jeon, Byungsoo, May 2024, Automated and Portable Machine Learning Systems, Ph.D. Thesis, Carnegie Mellon University, https://doi.org/10.1184/R1/25746708.v1 https://kilthub.cmu.edu/articles/thesis/Automated_and_Portable_Machine_Learning_Systems/25746708/1 PDF: https://kilthub.cmu.edu/ndownloader/files/46074087 Code: https://github.com/cmu-catalyst/collage (Portability layer to integrate the various kernels and low-level backends more easily. Also covers pipeline parallelism in graph models, and KV cache parallelism similar to FlashDecode.)
Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 18 May 2024, The CAP Principle for LLM Serving, https://arxiv.org/abs/2405.11299
Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evaluates prompt sharing including full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
Paula Rooney, 14 May 2024, Private cloud makes its comeback, thanks to AI, CIO, https://www.cio.com/article/2104613/private-cloud-makes-its-comeback-thanks-to-ai.html
Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu, 11 May 2024, Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving, https://arxiv.org/abs/2405.06856
Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
Xue Geng, Zhe Wang, Chunyun Chen, Qing Xu, Kaixin Xu, Chao Jin, Manas Gupta, Xulei Yang, Zhenghua Chen, Mohamed M. Sabry Aly, Jie Lin, Min Wu, Xiaoli Li, 9 May 2024, From Algorithm to Hardware: A Survey on Efficient and Safe Deployment of Deep Neural Networks, https://arxiv.org/abs/2405.06038
Vinod Vijay Nigade, Latency-Critical Inference Serving for Deep Learning, Ph.D. Thesis, VRIJE UNIVERSITEIT, Netherlands, https://research.vu.nl/ws/portalfiles/portal/258499994/phdthesis-vinodvufinal+4+-+65043c3f62dc9.pdf
Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan, 2024, HEXGEN: Generative Inference of Large Language Model over Heterogeneous Environment. https://openreview.net/pdf?id=9ANyvRtFGa Code: https://github.com/Relaxed-System-Lab/HexGen
Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
Grant Wilkins, 3 June 202, Online Workload Allocation and Energy Optimization in Large Language Model Inference Systems, Master of Philosophy in Advanced Computer Science, Churchill College, University of Cambridge, https://grantwilkins.github.io/gfw27_project.pdf
David Spuler, March 2024, Chapter 7. Deployment Architecture, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Ke Cheng, Wen Hu, Zhi Wang, Peng Du, Jianguo Li, Sheng Zhang, 7 Jun 2024, Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction, https://arxiv.org/abs/2406.04785
Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin, 5 Jun 2024, Llumnix: Dynamic Scheduling for Large Language Model Serving, https://arxiv.org/abs/2406.03243 Code: https://github.com/AlibabaPAI/llumnix
Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu, 30 May 2024, Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, https://arxiv.org/abs/2405.19888 (Uses prefix KV caching and a combined flash attention and paged attention module.)
Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng ZHANG, Dahua Lin, Ion Stoica, Hao Zhang, 02 May 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, ICML 2024, https://openreview.net/forum?id=R0SoZvqXyQ PDF: https://openreview.net/pdf?id=R0SoZvqXyQ Code: https://github.com/hao-ai-lab/MuxServe (Separates the prefill and decoding phases when serving, and also manages the LLM weights and KV cache data in blocks for memory efficiency.)
Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Shengkun Cui, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer, 2024, One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving, https://haoran-qiu.com/pdf/qlm-preprint.pdf
Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf (Faster inference serving via iterative scheduling, separating prefill and decoding phase computations for batching, using priority-based schedulers with preemption, and controlling transfer of KV caches from prefill to decoders.)
Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang, 19 Jun 2024, Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving, https://arxiv.org/abs/2406.13511 (Improved batched scheduling by splitting queries into fixed-size token generation slices.)
Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
Isaac Ong, May 16, 2024, Efficient Distributed LLM Inference with Dynamic Partitioning, Masters Thesis, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-108, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.pdf
Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
Yu, Lingfan, 2024, Improve Language Model Serving Efficiency With Fine-Grained and Stateful Scheduling, Ph.D. Thesis, Department of Computer Science, New York University, ProQuest Dissertations & Theses, 31139782, https://www.proquest.com/openview/7200cdfc0906f1d4edb8008b4368bcf9 PDF: https://cs.nyu.edu/media/publications/lingfan_yu_phd_thesis.pdf (Examines efficiency of batching methods and how to create a "stateful" version with cached multi-turn conversation history using session-based KV caching.)
Xin Tan, Jingzong Li, Jiamin Li, Yitao Yang, Hong Xu, August 2024, Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths, ICPP ’24, August 12–15, 2024, Gotland, Sweden, https://doi.org/10.1145/3673038.3673124 https://kanonjz.github.io/academic/share/xin-icpp24.pdf
Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
Chen, Lequn, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/items/13e14599-b4ee-4fbb-86bf-e58a4118d0f9
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse, 1 Aug 2024, DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency, https://arxiv.org/abs/2408.00741
Ke Cheng, Zhi Wang, Wen Hu, Tiannuo Yang, Jianguo Li, Sheng Zhang, 8 Aug 2024, Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning, https://arxiv.org/abs/2408.04323
Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris, 5 Aug 2024, SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving, https://arxiv.org/abs/2408.05235
Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park, 10 Aug 2024, LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale, https://arxiv.org/abs/2408.05499
Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang, July 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:11905-11917, 2024, https://proceedings.mlr.press/v235/duan24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/duan24a/duan24a.pdf Code: https://github.com/hao-ai-lab/MuxServe
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
The SGLang Team, Jul 25, 2024, Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM), https://lmsys.org/blog/2024-07-25-sglang-llama3/
Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo, 17 Jan 2024, Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native, https://arxiv.org/abs/2401.12230
Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang, 28 Aug 2024, Efficient LLM Scheduling by Learning to Rank, https://arxiv.org/abs/2408.15792 https://github.com/hao-ai-lab/vllm-ltr.git
Lightning AI, 2024, Serve LLMs, https://lightning.ai/docs/litserve/features/serve-llms
Y. Peng, W. Gao and H. Peng, "Serving DNN Inference With Fine-Grained Spatio-Temporal Sharing of GPU Servers," in IEEE Transactions on Services Computing, doi: 10.1109/TSC.2024.3463429. https://ieeexplore.ieee.org/document/10684028 https://www.computer.org/csdl/journal/sc/5555/01/10684028/20lm4PEVn9u
Aparna Dhinakaran, Sep 2024, Choosing Between LLM Agent Frameworks. The tradeoffs between building bespoke code-based agents and the major agent frameworks. https://towardsdatascience.com/choosing-between-llm-agent-frameworks-69019493b259
Yihua Cheng, Kuntai Du, Jiayi Yao, Junchen Jiang, 16 Sep 2024, Do Large Language Models Need a Content Delivery Network? https://arxiv.org/abs/2409.13761 https://github.com/LMCache/LMCache (Managing the process of sharing KV cache data over a network.)
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu, 24 Sep 2024 (v2), UELLM: A Unified and Efficient Approach for LLM Inference Serving, https://arxiv.org/abs/2409.14961
Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Yang Wang, Miryung Kim, Harry Xu, 2 Oct 2024, ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving, https://arxiv.org/abs/2410.01228
Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, Rui Hou, 30 Sep 2024, The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems, https://arxiv.org/abs/2409.20002
Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, Fahad Dogar, 4 Oct 2024, LLMProxy: Reducing Cost to Access Large Language Models, https://arxiv.org/abs/2410.11857 (Deploying a proxy between user and LLM, with handling of conversational history context and caching.)
OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
Baolin Li, April 2024, Making Machine Learning on HPC Systems Cost-Effective and Carbon-Friendly, Ph.D. Thesis, The Department of Electrical and Computer Engineering, Computer Engineering, Northeastern University, Boston, Massachusetts, https://repository.library.northeastern.edu/files/neu:4f248m902/fulltext.pdf
Can Wang, Bolin Zhang, Dianbo Sui, Zhiying Tu, Xiaoyu Liu, Jiabao Kang, 1 Mar 2024 (v2), A Survey on Effective Invocation Methods of Massive LLM Services, https://arxiv.org/abs/2402.03408
Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer, 15 Jul 2024 (v2), Learned Best-Effort LLM Serving, https://arxiv.org/abs/2401.07886
Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He, 10 Sep 2024 (v2), ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency, https://arxiv.org/abs/2408.00008
Grant Wilkins, Srinivasan Keshav, Richard Mortier, 4 Jul 2024, Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems, https://arxiv.org/abs/2407.04014
Mastering LLM, Aug 17, 2024, How Much GPU Memory is Needed to Serve a Large Language Model (LLM)? https://masteringllm.medium.com/how-much-gpu-memory-is-needed-to-serve-a-large-languagemodel-llm-b1899bb2ab5d
Youpeng Zhao, Jun Wang, 31 Oct 2024, ALISE: Accelerating Large Language Model Serving with Speculative Scheduling, https://arxiv.org/abs/2410.23537
Yan Zhuang, Zhenzhe Zheng, Fan Wu, and Guihai Chen. 2024. LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (SenSys '24). Association for Computing Machinery, New York, NY, USA, 521–534. https://doi.org/10.1145/3666025.3699355 https://dl.acm.org/doi/abs/10.1145/3666025.3699355
Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, Ion Stoica, 3 Nov 2024, SkyServe: Serving AI Models across Regions and Clouds with Spot Instances, https://arxiv.org/abs/2411.01438
R Mendoza, I Cruz, P Singh, A Martinez, N Kim, S Patel, Nov 2024, Dynamic Resource Management for Efficient Fast Device Placement https://www.researchgate.net/profile/Priya-Singh-103/publication/385528236_Dynamic_Resource_Management_for_Efficient_Fast_Device_Placement/links/672983c3ecbbde716b584acc/Dynamic-Resource-Management-for-Efficient-Fast-Device-Placement.pdf
H Zhang, Z Chen, XLY Liu, J Wu, L Wang, Nov 2024, Dynamic Fast Device Placement Strategies for Real-Time Resource Allocation, https://www.researchgate.net/profile/Haoran-Zhang-111/publication/385589353_Dynamic_Fast_Device_Placement_Strategies_for_Real-Time_Resource_Allocation/links/672b9ca977f274616d60a5e6/Dynamic-Fast-Device-Placement-Strategies-for-Real-Time-Resource-Allocation.pdf
OpenVINO™ toolkit, Sep 26, 2024, How To Efficiently Serve Today’s Large Language Models, https://medium.com/openvino-toolkit/how-to-efficiently-serve-todays-large-language-models-b3f1e8d33fdf
Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2024. Queue Management for SLO-Oriented Large Language Model Serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 18–35. https://doi.org/10.1145/3698038.3698523 https://dl.acm.org/doi/abs/10.1145/3698038.3698523
Haiying Shen, Tanmoy Sen, 10 Nov 2024, EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving, https://arxiv.org/abs/2411.06364
Kyoungmin Kim, Kijae Hong, Caglar Gulcehre, Anastasia Ailamaki, 12 Nov 2024, The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving, https://arxiv.org/abs/2411.07447
Redwan Ibne Seraj Khan, Kunal Jain, Haiying Shen, Ankur Mallick, Anjaly Parayil, Anoop Kulkarni, Steve Kofsky, Pankhuri Choudhary, Renèe St. Amant, Rujia Wang, Yue Cheng, Ali R. Butt, Victor Rühle, Chetan Bansal, Saravan Rajmohan, 24 Nov 2024, Ensuring Fair LLM Serving Amid Diverse Applications, https://arxiv.org/abs/2411.15997
Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica, 25 Nov 2024, BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching, https://arxiv.org/abs/2411.16102
Ao Shen, Zhiyao Li, Mingyu Gao, 27 Nov 2024, FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving, https://arxiv.org/abs/2411.18424
Yanyu Chen, Ganhong Huang, 6 Dec 2024, GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments, https://arxiv.org/abs/2412.04788
He, Y., Xu, M., Wu, J., Zheng, W., Ye, K., Xu, C. (2025). UELLM: A Unified and Efficient Approach for Large Language Model Inference Serving. In: Gaaloul, W., Sheng, M., Yu, Q., Yangui, S. (eds) Service-Oriented Computing. ICSOC 2024. Lecture Notes in Computer Science, vol 15404. Springer, Singapore. https://doi.org/10.1007/978-981-96-0805-8_16 https://link.springer.com/chapter/10.1007/978-981-96-0805-8_16
Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd C. Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen, 17 Dec 2024, A System for Microserving of LLMs, https://arxiv.org/abs/2412.12488 (Disaggregated prefill and decoding combined with context cache migration for sending the KV cache over the network.)
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
Mingcong Song, Xinru Tang, Fengfan Hou, Jing Li, Wei Wei, Yipeng Ma, Runqiu Xiao, Hongjie Si, Dingcheng Jiang, Shouyi Yin, Yang Hu, Guoping Long, 24 Dec 2024, Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels, https://arxiv.org/abs/2412.18106
Y Xiao, Dec 2024, Optimizing the Serving System for Large Language Model Inference, https://charlie-xiao.github.io/assets/pdf/projects/fluidinfer.pdf (Concatenates or splits batches for higher throughput.)
Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli, 14 Jan 2025, PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving, https://arxiv.org/abs/2501.08192
Desen Sun, Zepeng Zhao, Yuke Wang, 16 Jan 2025, PATCHEDSERVE: A Patch Management Framework for SLO-Optimized Hybrid Resolution Diffusion Serving, https://arxiv.org/abs/2501.09253
Can Wang, Dianbo Sui, Bolin Zhang, Xiaoyu Liu, Jiabao Kang, Zhidong Qiao, Zhiying Tu, Jan 2025, A Framework for Effective Invocation Methods of Various LLM Services, Proceedings of the 31st International Conference on Computational Linguistics, pages 6953–6965, January 19–24, 2025, Association for Computational Linguistics, https://aclanthology.org/2025.coling-main.464.pdf
Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, Neeraja J. Yadwadkar, 8 Jan 2025, iServe: An Intent-based Serving System for LLMs, https://arxiv.org/abs/2501.13111 (Flexible LLM serving based on prioritizing latency versus cost.)
Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Jie Meng, Chao He, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, Yizhou Shan, 27 Jan 2025 (v2), DeepFlow: Serverless Large Language Model Serving at Scale, https://arxiv.org/abs/2501.14417
Ting Sun, Penghan Wang, Fan Lai, 15 Jan 2025, HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location, https://arxiv.org/abs/2501.14808
Xiaozhe Yao, Qinghao Hu, Ana Klimovic, 1 Nov 2024 (v2), DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs, https://arxiv.org/abs/2312.05215 (Serve multiple fine-tuned models with full parameters by using deltas/diffs, rather than PEFT or multi-LoRA.)
Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, Ramachandran Ramjee, Rodrigo Fonseca, 2 Feb 2025, Towards Efficient Large Multimodal Model Serving, https://arxiv.org/abs/2502.00937 (Disaggregating or "decoupling" the different stages of multimodal LLM inference, not only prefill and decoding, but also the multimodal-specific bottlenecks in cross-attention and image encoding.)
Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2025. Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (ASPLOS '25). Association for Computing Machinery, New York, NY, USA, 586–602. https://doi.org/10.1145/3669940.3707215 https://dl.acm.org/doi/abs/10.1145/3669940.3707215
Gregory Dexter, Shao Tang, Ata Fatahi Baarzi, Qingquan Song, Tejas Dharamsi, Aman Gupta, 7 Feb 2025. LLM Query Scheduling with Prefix Reuse and Latency Constraints, https://arxiv.org/abs/2502.04677
Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, Eiko Yoneki, 13 Feb 2025, ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments, https://arxiv.org/abs/2502.09334
Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, Anjaly Parayil, Ankur Mallick, Rujia Wang, Renee St. Amant, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan, 20 Feb 2025, Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale, https://arxiv.org/abs/2502.14617
Alex Fazio, Feb 2025, How to Build an LLM Chat App: The New Litmus Test for Junior Devs, https://x.com/alxfazio/status/1893242657331101976 (How to build a wrapper chat app that scales by taking care of message queueing, API rate limits, history database management, caching, and other real-world deployment issues.)
Junsoo Kim, Hunjong Lee, Geonwoo Ko, Gyubin Choi, Seri Ham, Seongmin Hong, Joo-Young Kim, 6 Mar 2025, ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput, https://arxiv.org/abs/2503.04253
Chen Zhang, Kuntai Du, Shu Liu, Woosuk Kwon, Xiangxi Mo, Yufeng Wang, Xiaoxuan Liu, Kaichao You, Zhuohan Li, Mingsheng Long, Jidong Zhai, Joseph Gonzalez, Ion Stoica, 24 Mar 2025, Jenga: Effective Memory Management for Serving LLM with Heterogeneity, https://arxiv.org/abs/2503.18292
AK Kakolyris, D Masouros, P Vavaroutsos, S Xydis, April 2025, throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving, https://microlab.ntua.gr/wp-content/uploads/2025/03/throttLLeM_HPCA25.pdf https://github.com/WilliamBlaskowicz/throttLL-eM
Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, Xin Jin, 15 May 2025, ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production, https://arxiv.org/abs/2505.09999
Shaoyu Wang, Guangrong He, Geon-Woo Kim, Yanqi Zhou, Seo Jin Park, 13 May 2025, Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony, https://arxiv.org/abs/2505.08944
Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, Minyi Guo, 19 Apr 2025, Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management, https://arxiv.org/abs/2505.03756
Azam Ikram, Xiang Li, Sameh Elnikety, Saurabh Bagchi, 30 Apr 2025 (v2), Ascendra: Dynamic Request Prioritization for Efficient LLM Serving, https://arxiv.org/abs/2504.20828
Wei Zhang, Zhiyu Wu, Yi Mu, Banruo Liu, Myungjin Lee, Fan Lai, 24 Apr 2025, Tempo: Application-aware LLM Serving with Mixed SLO Requirements, https://arxiv.org/abs/2504.20068
Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Nicholas D. Lane, Binhang Yuan, 4 Jun 2025, Cascadia: A Cascade Serving System for Large Language Models, https://arxiv.org/abs/2506.04203
Xiannan Hu, Tianyou Zeng, Xiaoming Yuan, Liwei Song, Guangyuan Zhang, Bangzheng He, 6 Jun 2025, BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures, https://arxiv.org/abs/2506.05871
Jingfeng Wu, Yiyuan He, Minxian Xu, Xitong Gao, Kejiang Ye, Chengzhong Xu, 24 Jul 2025, Unlock the Potential of Fine-grained LLM Serving via Dynamic Module Scaling, https://arxiv.org/abs/2507.18006
Minxian Xu, Junhan Liao, Jingfeng Wu, Yiyuan He, Kejiang Ye, Chengzhong Xu, 24 Jul 2025, Cloud Native System for LLM Inference Serving, https://arxiv.org/abs/2507.18007
Wanyi Zheng, Minxian Xu, Shengye Song, Kejiang Ye, 23 Jul 2025, BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving, https://arxiv.org/abs/2507.17120
Jianmin Hu, Minxian Xu, Kejiang Ye, Chengzhong Xu, 23 Jul 2025, BrownoutServe: SLO-Aware Inference Serving under Bursty Workloads for MoE-based LLMs, https://arxiv.org/abs/2507.17133 (MoE serving optimization.)
Bodun Hu, Shuozhe Li, Saurabh Agarwal, Myungjin Lee, Akshay Jajoo, Jiamin Li, Le Xu, Geon-Woo Kim, Donghyun Kim, Hong Xu, Amy Zhang, Aditya Akella, Aug 2025, StitchLLM: Serving LLMs, One Block at a Time, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26887–26903, July 27–August 1, 2025, https://aclanthology.org/2025.acl-long.1305.pdf
Shiwei Gao, Qing Wang, Shaoxun Zeng, Youyou Lu, and Jiwu Shu, July 2025, Weaver: Efficient Multi-LLM Serving with Attention Offloading, 2025 USENIX Annual Technical Conference. July 7–9, 2025, Boston, MA, USA, https://www.usenix.org/conference/atc25/presentation/gao https://www.usenix.org/system/files/atc25-gao.pdf
Wenxin Zhang, Yueying Li, Tianyi Peng, Ciamac C. Moallemi, July 2025, Tail-Optimized Caching for LLM Inference, https://openreview.net/pdf?id=R3DICTGOkJ
Xiaoxiang Shi, Colin Cai, Junjia Du, 16 Jul 2025 (v4), Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving, https://arxiv.org/abs/2507.06608
Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, Ang Li, 2 Jul 2025, EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices, https://arxiv.org/abs/2507.01438
Jiangsu Du, Hongbin Zhang, Taosheng Wei, Zhenyi Zheng, Kaiyi Wu, Zhiguang Chen, Yutong Lu, 25 Apr 2025, EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration, https://arxiv.org/abs/2504.18154
Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang, 28 Apr 2025, Taming the Titans: A Survey of Efficient LLM Inference Serving, https://arxiv.org/abs/2504.19720 (Survey of various inference and serving optimizations, such as parallelism, offloading, scheduling, length prediction, KV cache compression, and prefill-decode phase disaggregation.)
Wei Da, Evangelia Kalyvianaki, 5 Aug 2025, Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling, https://arxiv.org/abs/2508.03611
Yicheng Feng, Xin Tan, Kin Hang Sew, Yimin Jiang, Yibo Zhu, Hong Xu, 5 Aug 2025, Frontier: Simulating the Next Generation of LLM Inference Systems, https://arxiv.org/abs/2508.03148
Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, Supriya Rao, Thien Tran, Aleksandar Samardžić, 21 Jul 2025, TorchAO: PyTorch-Native Training-to-Serving Model Optimization, https://arxiv.org/abs/2507.16099
Juntao Zhao, Jiuru Li, Chuan Wu, 19 May 2025, Sandwich: Separating Prefill-Decode Compilation for Efficient CPU LLM Serving, https://arxiv.org/abs/2507.18454
Kan Zhu, Haiyang Shi, Le Xu, Jiaxin Shan, Arvind Krishnamurthy, Baris Kasikci, Liguang Xie, 17 Jul 2025, PolyServe: Efficient Multi-SLO Serving at Scale, https://arxiv.org/abs/2507.17769
Chang Xiao, Brenda Yang, 23 Jul 2025, Streaming, Fast and Slow: Cognitive Load-Aware Streaming for Efficient LLM Serving, https://arxiv.org/abs/2504.17999
Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C.S. Lui, Wei Chen, Carlee Joe-Wong, 11 Aug 2025, Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation, https://arxiv.org/abs/2508.07675
Xiaoxuan Liu, Jongseok Park, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Chen Zhang, Kuntai Du, Xiangxi Mo, Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 27 Jul 2025, TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput, https://arxiv.org/abs/2406.14066
Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, Xin Liu, 26 Jul 2025, MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism, https://arxiv.org/abs/2504.02263
Francisco Durán, Matias Martinez, Patricia Lago, Silverio Martínez-Fernández, 30 Jul 2025, Insights into resource utilization of code small language models serving with runtime engines and execution providers, https://arxiv.org/abs/2412.15441
Lingyu Jiang, Yuping Wang, Yao Su, Shuo Xing, Wenjing Chen, Xin Zhang, Zhengzhong Tu, Ziming Zhang, Fangzhou Lin, Michael Zielewski, Kazunori D Yamada, 3 Aug 2025, KANMixer: Can KAN Serve as a New Modeling Core for Long-term Time Series Forecasting?, https://arxiv.org/abs/2508.01575
Wonung Kim, Yubin Lee, Yoonsung Kim, Jinwoo Hwang, Seongryong Oh, Jiyong Jung, Aziz Huseynov, Woong Gyu Park, Chang Hyun Park, Divya Mahajan, Jongse Park, 4 Aug 2025, Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving, https://arxiv.org/abs/2507.10178
Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He, 6 Aug 2025, FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design, https://arxiv.org/abs/2508.04405
Meixuan Wang, Yinyu Ye, Zijie Zhou, 8 Aug 2025, LLM Serving Optimization with Variable Prefill and Decode Lengths, https://arxiv.org/abs/2508.06133
Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral, 11 Aug 2025, Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving, https://arxiv.org/abs/2508.08343
Mohammed Saqr, Kamila Misiejuk, Sonsoles López-Pernas, 3 Aug 2025, Human-AI collaboration or obedient and often clueless AI in instruct, serve, repeat dynamics?, https://arxiv.org/abs/2508.10919
Zedong Liu, Shenggan Cheng, Guangming Tan, Yang You, and Dingwen Tao, 15 Aug 2025, ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism, https://arxiv.org/abs/2507.10069
Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang, Zhenan Fan, 21 Aug 2025, HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling, https://arxiv.org/abs/2508.15919
Zhixiang Wei, James Yen, Jingyi Chen, Ziyang Zhang, Zhibai Huang, Chen Chen, Xingzi Yu, Yicheng Gu, Chenggang Wu, Yun Wang, Mingyuan Xia, Jie Wu, Hao Wang, Zhengwei Qi, 19 Aug 2025, Equinox: Holistic Fair Scheduling in Serving Large Language Models, https://arxiv.org/abs/2508.16646
Bingyang Wu, Zili Zhang, Yinmin Zhong, Guanzhe Huang, Yibo Zhu, Xuanzhe Liu, Xin Jin, 24 Aug 2025, TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving, https://arxiv.org/abs/2508.17219
Wenbo Sun, Qiming Guo, Wenlu Wang, Rihan Hai, 25 Aug 2025, TranSQL+: Serving Large Language Models with SQL on Low-Resource Hardware, https://arxiv.org/abs/2502.02818
Yifan Yu, Yu Gan, Nikhil Sarda, Lillian Tsai, Jiaming Shen, Yanqi Zhou, Arvind Krishnamurthy, Fan Lai, Henry M. Levy, David Culler, 4 Sep 2025, IC-Cache: Efficient Large Language Model Serving via In-context Caching, https://arxiv.org/abs/2501.12689
Fangzhou Wu, Sandeep Silwal, 2 Sep 2025, Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving, https://arxiv.org/abs/2509.02718
Jungwoo Kim, Minsang Kim, Jaeheon Lee, Chanwoo Moon, Heejin Kim, Taeho Hwang, Woosuk Chung, Yeseong Kim, Sungjin Lee, 26 Aug 2025, Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics, https://arxiv.org/abs/2508.18736
Mingyu Yang, Jae-Young Choi, Kihyo Moon, Minsung Jang, and Eunjoo Joen, 1 Sep 2025, DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving, https://arxiv.org/abs/2509.01083
Huanqi Hu, Bowen Xiao, Shixuan Sun, Jianian Yin, Zhexi Zhang, Xiang Luo, Chengquan Jiang, Weiqi Xu, Xiaoying Jia, Xin Liu, Minyi Guo, 1 Sep 2025, LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving, https://arxiv.org/abs/2509.01229
Xiaoniu Song, Zihang Zhong, Rong Chen, Haibo Chen, 1 Sep 2025, ProMoE: Fast MoE-based LLM Serving using Proactive Caching, https://arxiv.org/abs/2410.22134
Fei Fang, Yifan Hua, Shengze Wang, Ruilin Zhou, Yi Liu, Chen Qian, Xiaoxue Zhang, 30 Aug 2025, GenTorrent: Scaling Large Language Model Serving with An Overlay Network, https://arxiv.org/abs/2504.20101
Kyungmin Bin, Seungbeom Choi, Jimyoung Son, Jieun Choi, Daseul Bae, Daehyeon Baek, Kihyo Moon, Minsung Jang, Hyojung Lee, 8 Sep 2025, FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving, https://arxiv.org/abs/2509.06261
Aleksa Gordić, August 29, 2025, Inside vLLM: Anatomy of a High-Throughput LLM Inference System: From paged attention, continuous batching, prefix caching, specdec, etc. to multi-GPU, multi-node dynamic serving at scale, https://www.aleksagordic.com/blog/vllm
Hamid Ahmad, Heiko Paulheim, Rita T. Sousa, 9 Sep 2025, Bio-KGvec2go: Serving up-to-date Dynamic Biomedical Knowledge Graph Embeddings, https://arxiv.org/abs/2509.07905
Jiahuan Yu, Aryan Taneja, Junfeng Lin, Minjia Zhang, 5 Sep 2025, VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving, https://arxiv.org/abs/2509.04827
Ira Ceka, Feitong Qiao, Anik Dey, Aastha Valecha, Gail Kaiser, Baishakhi Ray, 11 Sep 2025, Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection, https://arxiv.org/abs/2412.12039
Dong Liu, Yanxuan Yu, 28 Aug 2025, TinyServe: Query-Aware Cache Selection for Efficient LLM Serving, https://arxiv.org/abs/2509.12211
Runyu Lu, Shiqi He, Wenxuan Tan, Shenggui Li, Ruofan Wu, Jeff J. Ma, Ang Chen, Mosharaf Chowdhury, 2 Oct 2025, TetriServe: Efficient DiT Serving for Heterogeneous Image Generation, https://arxiv.org/abs/2510.01565
Kevin Kuo, Chhavi Yadav, Virginia Smith, 14 Oct 2025, Research in Collaborative Learning Does Not Serve Cross-Silo Federated Learning in Practice, https://arxiv.org/abs/2510.12595
Sujun Tang, Christopher Priebe, Rohan Mahapatra, Lianhui Qin, Hadi Esmaeilzadeh, 27 Oct 2025, REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving, https://arxiv.org/abs/2506.01374
Mohammad Firas Sada, John J. Graham, Elham E Khoda, Mahidhar Tatineni, Dmitry Mishin, Rajesh K. Gupta, Rick Wagner, Larry Smarr, Thomas A. DeFanti, Frank Würthwein, 22 Oct 2025, Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and NVIDIA Data Center GPUs, https://arxiv.org/abs/2507.00418
Tianhua Xia, Sai Qian Zhang, 16 Oct 2025, Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing, https://arxiv.org/abs/2510.16040
Xingyu Fan, Feifei Li, Wenhui Que, Hailong Li, 22 Sep 2025, One Agent to Serve All: a Lite-Adaptive Stylized AI Assistant for Millions of Multi-Style Official Accounts, https://arxiv.org/abs/2509.17788
Shiju Zhao, Junhao Hu, Rongxiao Huang, Jiaqi Zheng, Guihai Chen, 20 Sep 2025, MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving, https://arxiv.org/abs/2502.01960
Xinyu Wang, Jonas M. Kübler, Kailash Budhathoki, Yida Wang, Matthäus Kleindessner, 27 Oct 2025, Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving, https://arxiv.org/abs/2510.23346
Kayhan Behdin, Qingquan Song, Sriram Vasudevan, Jian Sheng, Xiaojing Ma, Z Zhou, Chuanrui Zhu, Guoyao Li, Chanh Nguyen, Sayan Ghosh, Hejian Sang, Ata Fatahi Baarzi, Sundara Raman Ramachandran, Xiaoqing Wang, Qing Lan, Vinay Y S, Qi Guo, Caleb Johnson, Zhipeng Wang, Fedor Borisyuk, 25 Oct 2025, Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search, https://arxiv.org/abs/2510.22101
Kayhan Behdin, Ata Fatahibaarzi, Qingquan Song, Yun Dai, Aman Gupta, Zhipeng Wang, Shao Tang, Hejian Sang, Gregory Dexter, Sirou Zhu, Siyu Zhu, Tejas Dharamsi, Vignesh Kothapalli, Zhoutong Fu, Yihan Cao, Pin-Lun Hsu, Fedor Borisyuk, Natesh Pillai, Luke Simon, Rahul Mazumder, 26 Oct 2025, Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems, https://arxiv.org/abs/2502.14305
Rongxin Cheng, Yuxin Lai, Xingda Wei, Rong Chen, Haibo Chen, 8 Oct 2025, KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving, https://arxiv.org/abs/2412.18169
Junyi Chen, Chuheng Du, Renyuan Liu, Shuochao Yao, Dingtian Yan, Jiang Liao, Shengzhong Liu, Fan Wu, Guihai Chen, 3 Oct 2025, TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling, https://arxiv.org/abs/2510.02758
Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang, Yuhan Chen, Shaohuai Shi, Xiaowen Chu, 21 Oct 2025, Reasoning Language Model Inference Serving Unveiled: An Empirical Study, https://arxiv.org/abs/2510.18672
Yue Duan, Lei Qi, Yinghuan Shi, Yang Gao, 25 Sep 2025, An Adaptor for Triggering Semi-Supervised Learning to Out-of-Box Serve Deep Image Clustering, https://arxiv.org/abs/2509.20976
Yuanyuan Yang, Ruimin Zhang, Jamie Morgenstern, Haifeng Xu, 26 Sep 2025, T-TAMER: Provably Taming Trade-offs in ML Serving, https://arxiv.org/abs/2509.22992
Yiheng Tao, Yihe Zhang, Matthew T. Dearing, Xin Wang, Yuping Fan, Zhiling Lan, 25 Sep 2025, PARS: Low-Latency LLM Serving via Pairwise Learning-to-Rank, https://arxiv.org/abs/2510.03243
Yufei Li, Yu Fu, Yue Dong, Cong Liu, 28 Sep 2025, MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment, https://arxiv.org/abs/2510.03283
Hanfei Yu, Xingqi Cui, Hong Zhang, Hao Wang, Hao Wang, 4 Oct 2025, Taming Latency-Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offloading, https://arxiv.org/abs/2502.05370
Gabriele Oliaro, Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, Ruohan Gao, Yingyi Huang, Remi Delacourt, April Yang, Yingcheng Wang, Colin Unger, Zhihao Jia, 23 Oct 2025, FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees, https://arxiv.org/abs/2402.18789
Sayan Mandal, Hua Jiang, 11 Oct 2025, Grounded AI for Code Review: Resource-Efficient Large-Model Serving in Enterprise Pipelines, https://arxiv.org/abs/2510.10290
Gunjun Lee, Jiwon Kim, Jaiyoung Park, Younjoo Lee, Jung Ho Ahn, 9 Oct 2025, From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill, https://arxiv.org/abs/2510.08055
Shaoting Feng, Hanchen Li, Kuntai Du, Zhuohan Gu, Yuhan Liu, Jiayi Yao, Siddhant Ray, Samuel Shen, Yihua Cheng, Ganesh Ananthanarayanan, Junchen Jiang, 28 Aug 2025, AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving, https://arxiv.org/abs/2509.00105
Zhongkai Yu, Yue Guan, Zihao Yu, Chenyang Zhou, Shuyi Pei, Yangwook Kang, Yufei Ding, Po-An Tsai, 7 Oct 2025, Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting, https://arxiv.org/abs/2510.05497
Yue Pan, Zihan Xia, Po-Kai Hsu, Lanxiang Hu, Hyungyo Kim, Janak Sharda, Minxuan Zhou, Nam Sung Kim, Shimeng Yu, Tajana Rosing, Mingu Kang, 6 Oct 2025, Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving, https://arxiv.org/abs/2510.05245
Tianhao Zhu, Dahu Feng, Erhu Feng, Yubin Xia, 7 Oct 2025, From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs, https://arxiv.org/abs/2510.05632
Jungi Lee, Junyong Park, Soohyun Cha, Jaehoon Cho, Jaewoong Sim, 16 Oct 2025, MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving, https://arxiv.org/abs/2510.14557
Jaehong Cho, Hyunmin Choi, Guseul Heo, Jongse Park, 26 Feb 2026, LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure, https://arxiv.org/abs/2602.23036
Jiawei Xu, Chia Xin Liang, Ziqian Bi, Xiaoming Li, Danyang Zhang, Zhenyu Yu, Dec 2025, A Comprehensive Survey on Large Language Models: From Pre-training to Autonomous Agents, https://www.researchgate.net/profile/Ziqian_Bi/publication/399059225_A_Comprehensive_Survey_on_Large_Language_Models_From_Pre-training_to_Autonomous_Agents/links/694c94a07e61d05b5312836f/A-Comprehensive-Survey-on-Large-Language-Models-From-Pre-training-to-Autonomous-Agents.pdf
Hanfei Yu, Bei Ouyang, Shwai He, Ang Li, Hao Wang, 6 Mar 2026, MoEless: Efficient MoE LLM Serving via Serverless Computing, https://arxiv.org/abs/2603.06350
Zizhao Mo, Junlin Chen, Huanle Xu, Chengzhong Xu, 17 Mar 2026 (v2), Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking, https://arxiv.org/abs/2603.12831
David Spuler, Ph.D., March 12th, 2026, Scaling Your AI Wrapper Architecture, Aussie AI Blog, https://www.aussieai.com/blog/scaling-ai-wrapper-architectures
David Spuler, Ph.D., Feb 6th, 2026 (updated), 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
David Spuler, Michael Sharpe, June 2025, RAG Deployment, Chapter 12, "RAG Optimization: Accurate and Efficient LLM Applications", https://www.aussieai.com/book/rag-book-12-rag-deployment
David Spuler, Ph.D., January 23, 2025, Low Latency Programming, Aussie AI Blog, https://www.aussieai.com/blog/low-latency-programming
Youhe Jiang, Ran Yan, You Peng, Wenshuang Li, Taiyi Wang, Fangcheng Fu, Binhang Yuan, 8 Apr 2026, Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics, https://arxiv.org/abs/2604.07144
David Spuler, Michael Sharpe, 2025, Deployment Architecture, Chapter 18, in "Generative AI Applications", https://www.aussieai.com/book/ai-apps-book-18-deployment-architecture
David Spuler, March 2024, Chapter 7. Deployment Architecture, in book "Generative AI in C++", https://www.aussieai.com/book/ch7-deployment-architecture
David Spuler, March 2024, Generative AI in C++: Coding Transformers and LLMs, https://www.aussieai.com/book/toc PDF: https://www.aussieai.com/pdf/BOOK-Generative-AI-CPP-Spuler-2024.pdf

Deployment

Research on LLM deployment:

Sohaib Ahmad, Hui Guan, Ramesh K. Sitaraman, 2024, Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling, https://guanh01.github.io/files/2024hpdc-loki.pdf
Byungsoo Jeon, May 2024, Automated and Portable Machine Learning Systems, Ph.D. Thesis, Carnegie Mellon University, https://doi.org/10.1184/R1/25746708.v1 https://kilthub.cmu.edu/articles/thesis/Automated_and_Portable_Machine_Learning_Systems/25746708/1 PDF: https://kilthub.cmu.edu/ndownloader/files/46074087 Code: https://github.com/cmu-catalyst/collage (Portability layer to integrate the various kernels and low-level backends more easily. Also covers pipeline parallelism in graph models, and KV cache parallelism similar to FlashDecode.)
Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evaluates prompt sharing, including a full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944
Paula Rooney, 14 May 2024, Private cloud makes its comeback, thanks to AI, CIO, https://www.cio.com/article/2104613/private-cloud-makes-its-comeback-thanks-to-ai.html
Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu, 11 May 2024, Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving, https://arxiv.org/abs/2405.06856
JH Jones, May 2024, A Quantitative Comparison of Pre-Trained Model Registries to Traditional Software Package Registries, Masters Thesis, Electrical and Computer Engineering, Purdue University, https://hammer.purdue.edu/articles/thesis/A_Quantitative_Comparison_of_Pre-Trained_Model_Registries_to_Traditional_Software_Package_Registries/25686447/1 PDF: https://hammer.purdue.edu/ndownloader/files/46096152
Jiamin Li, Le Xu, Hong Xu, Aditya Akella, 28 Apr 2024, BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models, https://arxiv.org/abs/2404.18322 (Partitioning inference over blocks for GPU.)
Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
Cohere Toolkit, https://github.com/cohere-ai/cohere-toolkit (A set of open source components for RAG architectures.)
Ahmed Menshawy, Zeeshan Nawaz, Mahmoud Fahmy, April 2024, Navigating Challenges and Technical Debt in Large Language Models Deployment, EuroMLSys '24: Proceedings of the 4th Workshop on Machine Learning and Systems, Pages 192–199, https://doi.org/10.1145/3642970.3655840 https://dl.acm.org/doi/abs/10.1145/3642970.3655840 PDF Slides: https://www.cl.cam.ac.uk/research/srg/netos/euromlsys2024/slides/P_5_27.pdf
Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, Mosharaf Chowdhury, 25 Apr 2024, Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services, https://arxiv.org/abs/2404.16283 (Scheduling GPU activity for multiple queries to ensure good UI experience for text-streaming outputs like chatbots.)
Xue Geng, Zhe Wang, Chunyun Chen, Qing Xu, Kaixin Xu, Chao Jin, Manas Gupta, Xulei Yang, Zhenghua Chen, Mohamed M. Sabry Aly, Jie Lin, Min Wu, Xiaoli Li, 9 May 2024, From Algorithm to Hardware: A Survey on Efficient and Safe Deployment of Deep Neural Networks, https://arxiv.org/abs/2405.06038
Josef Pichlmeier, Philipp Ross, Andre Luckow, 22 Apr 2024, Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification, https://arxiv.org/abs/2404.15153
Konstantinos Papaioannou, Thaleia Dimitra Doudali, April 2024, The Importance of Workload Choice in Evaluating LLM Inference Systems, EuroMLSys '24: Proceedings of the 4th Workshop on Machine Learning and Systems, April 2024, Pages 39–46, https://doi.org/10.1145/3642970.3655823 https://dl.acm.org/doi/abs/10.1145/3642970.3655823
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181 (Separating the prefill and decoding phases for optimization.)
Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive layer skipping scheme that drops about 30% of FFN calculations. Avoids the KV caching problems by only skipping FFN computations in layers.)
Stan Gibson, 03 Jun 2024, Getting infrastructure right for generative AI, CIO, https://www.cio.com/article/2128440/getting-infrastructure-right-for-generative-ai.html
Vinod Vijay Nigade, Latency-Critical Inference Serving for Deep Learning, Ph.D. Thesis, Vrije Universiteit, Netherlands, https://research.vu.nl/ws/portalfiles/portal/258499994/phdthesis-vinodvufinal+4+-+65043c3f62dc9.pdf
Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
LMDeploy Contributors, 2023, LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM, Apache 2.0 License, Code: https://github.com/InternLM/lmdeploy
Mengke Ge, Junpeng Wang, Binhan Chen, Yingjian Zhong, Haitao Du, Song Chen, Yi Kang, 22 Mar 2024, Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems, https://arxiv.org/abs/2403.15069
Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo, T Zhang, 2023, Deep Learning Workload Scheduling in GPU Datacenters: A Survey, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3638757
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models, https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
Meenu Mary John, Helena Holmström Olsson, Jan Bosch, 2020, AI Deployment Architecture: Multi-Case Study for Key Factor Identification, 2020 27th Asia-Pacific Software Engineering Conference (APSEC), https://ieeexplore.ieee.org/abstract/document/9359253
Meenu Mary John, Helena Holmström Olsson, Jan Bosch, 2020, Architecting AI Deployment: A Systematic Review of State-of-the-Art and State-of-Practice Literature, ICSOB 2020: Software Business, pp 14–29, https://link.springer.com/chapter/10.1007/978-3-030-67292-8_2
Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. 2020. InferLine: latency-aware provisioning and scaling for prediction serving pipelines. Proceedings of the 11th ACM Symposium on Cloud Computing. 477–491, https://arxiv.org/abs/1812.01776
Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan, 2024, HEXGEN: Generative Inference of Large Language Model over Heterogeneous Environment. https://openreview.net/pdf?id=9ANyvRtFGa Code: https://github.com/Relaxed-System-Lab/HexGen
Ali Rahmanian, Doctoral Thesis, April 2024, Edge Orchestration for Latency-Sensitive Applications, Department of Computing Science, Umeå University, Sweden, https://www.diva-portal.org/smash/get/diva2:1849510/FULLTEXT02.pdf
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 19 Mar 2024 (v2), DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving, https://arxiv.org/abs/2401.09670 (Optimizing LLMs differently in the prefill and decoding phases.)
Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu, 2 Mar 2024, LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization, https://arxiv.org/abs/2403.01136 (Deployment of LLMs on heterogeneous GPUs and also differences between the two phases of decoder-only Transformers: prefill and decoding computations.)
Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang, 2 Apr 2024, MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving, https://arxiv.org/abs/2404.02015
Can Wang, Bolin Zhang, Dianbo Sui, Zhiying Tu, Xiaoyu Liu, Jiabao Kang, 1 Mar 2024 (v2), A Survey on Effective Invocation Methods of Massive LLM Services, https://arxiv.org/abs/2402.03408 (Deployment of LLMs as LLM-as-a-Service or LLMaaS architectures including prompt compression, semantic caching and model selection based on scoring inputs.)
Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang, 12 Mar 2024, Characterization of Large Language Model Development in the Datacenter, https://arxiv.org/abs/2403.07648 (Analysis of deployment and LLMOps issues in a 6-month production deployment.)
Apple, June 2022, Deploying Transformers on the Apple Neural Engine, https://machinelearning.apple.com/research/neural-engine-transformers Code: https://github.com/apple/ml-ane-transformers
Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo, 17 Jan 2024, Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native, https://arxiv.org/abs/2401.12230
Xiangyu Chang, Sk Miraj Ahmed, Srikanth V. Krishnamurthy, Basak Guler, Ananthram Swami, Samet Oymak, Amit K. Roy-Chowdhury, Jan 2024, Plug-and-Play Transformer Modules for Test-Time Adaptation, https://arxiv.org/abs/2401.04130 https://ui.adsabs.harvard.edu/abs/2024arXiv240104130C/abstract
Tal Peretz, 15 Nov 2023, The Developer's Guide to Production-Grade LLM Apps: Advanced Techniques for Maximizing LLM Performance, https://buildingaistuff.com/p/the-developers-guide-to-production
Andrew Starc, Feb 22 2024, Mantel Group survey reveals AI challenges of large Australian businesses, CRN, https://www.crn.com.au/news/mantel-group-survey-reveals-ai-challenges-of-large-australian-businesses-605376
Chris Parnin, Gustavo Soares, Rahul Pandita, Sumit Gulwani, Jessica Rich, Austin Z. Henley, 21 Dec 2023, Building Your Own Product Copilot: Challenges, Opportunities, and Needs, https://arxiv.org/abs/2312.14231
Jacob Robbins, January 4, 2024, Why generative AI orchestration startups are poised for growth in 2024, Pitch Book, https://pitchbook.com/news/articles/generative-ai-orchestration-startups-venture-capital-unicorns
Eberhard Hechler, Martin Oberhofer, Thomas Schaeck, 2020, Deploying AI in the Enterprise, Book, https://link.springer.com/book/10.1007/978-1-4842-6206-1
Teresa Tung, June 2023, 7 architecture considerations for generative AI, Accenture, https://www.accenture.com/us-en/blogs/cloud-computing/7-generative-ai-architecture-considerations
Hayden Wolff, Jun 02, 2024, A Simple Guide to Deploying Generative AI with NVIDIA NIM, NVIDIA Technical Blog, https://developer.nvidia.com/blog/a-simple-guide-to-deploying-generative-ai-with-nvidia-nim/
Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787–808. https://www.usenix.org/conference/nsdi23/presentation/zhang-hong
David Spuler, March 2024, Chapter 7. Deployment Architecture, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Kirill Kolodiazhnyi, May 15, 2020, Hands-On Machine Learning with C++: Build, train, and deploy end-to-end machine learning and deep learning pipelines, https://www.amazon.com/Hands-Machine-Learning-end-end/dp/1789955335/
Deci Engineering Team, September 28, 2021, 5 Factors that Impact the Inference Pipeline in Production + Hardware Usage Metrics, https://deci.ai/blog/optimize-inference-pipeline-production/
Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
Adva Nakash Peleg, May 30, 2024, An LLM Journey: From POC to Production, https://medium.com/cyberark-engineering/an-llm-journey-from-poc-to-production-6c5ec6a172fb
Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin, 5 Jun 2024, Llumnix: Dynamic Scheduling for Large Language Model Serving, https://arxiv.org/abs/2406.03243 Code: https://github.com/AlibabaPAI/llumnix
Fabian Both, June 2024, Why we no longer use LangChain for building our AI agents, https://www.octomind.dev/blog/why-we-no-longer-use-langchain-for-building-our-ai-agents (Replaces LangChain with their own more-focused internal tool sets.)
Waleed Kadous, August 23, 2023, Llama 2 is about as factually accurate as GPT-4 for summaries and is 30X cheaper, https://www.anyscale.com/blog/llama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper Code: https://github.com/anyscale/factuality-eval
Louis-François Bouchard, Louie Peters, May 2024, Chapter 11: Deployment, Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG, https://www.amazon.com/Building-LLMs-Production-Reliability-Fine-Tuning/dp/B0D4FFPFW8/
Aarushi Kansal, Chapter 7: Monitoring, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, Jae W. Lee, 21 Jun 2024 (v4), Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs, https://arxiv.org/abs/2402.10517 Code: https://github.com/SNU-ARC/any-precision-llm
Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
Intel, Jul 24, 2024, Generative AI Fundamentals: Deploying LLMs with OpenVINO™, OpenVINO™ toolkit, https://medium.com/openvino-toolkit/generative-ai-fundamentals-deploying-llms-with-openvino-3057861f6feb
Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
Abhinand, Aug 20, 2024, Self-Hosting LLaMA 3.1 70B (or any ~70B LLM) Affordably, https://abhinand05.medium.com/self-hosting-llama-3-1-70b-or-any-70b-llm-affordably-2bd323d72f8d
Dom Couldwell, Sep 03, 2024, Dealing with ‘day two’ issues in generative AI deployments, https://www.infoworld.com/article/3493255/dealing-with-day-two-issues-in-generative-ai-deployments.html
Lightning AI, 2024, Serve LLMs, https://lightning.ai/docs/litserve/features/serve-llms
Evan Schuman, 01 May 2024, LLM deployment flaws that catch IT by surprise, https://www.computerworld.com/article/2095216/llm-deployment-flaws-that-catch-it-by-surprise.html
Michael Nuñez, September 10, 2024, Is Anthropic’s new ‘Workspaces’ feature the future of enterprise AI management? https://venturebeat.com/ai/is-anthropics-new-workspaces-feature-the-future-of-enterprise-ai-management/
Andrei Paleyes, Raoul-Gabriel Urma, Neil D. Lawrence, 19 May 2022 (v3), Challenges in Deploying Machine Learning: a Survey of Case Studies, ACM Comput. Surv., Vol. 55, No. 6, Article 114, December 2022. https://doi.org/10.1145/3533378 https://arxiv.org/abs/2011.09926 https://dl.acm.org/doi/fullHtml/10.1145/3533378#Bib0005
Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu, 24 Sep 2024 (v2), UELLM: A Unified and Efficient Approach for LLM Inference Serving, https://arxiv.org/abs/2409.14961
Dylan Patel and Daniel Nishball, Oct 03, 2024, AI Neocloud Playbook and Anatomy, https://www.semianalysis.com/p/ai-neocloud-playbook-and-anatomy
Michael J. Zellinger, Matt Thomson, 3 Oct 2024, Efficiently Deploying LLMs with Controlled Risk, https://arxiv.org/abs/2410.02173
Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, Fahad Dogar, 4 Oct 2024, LLMProxy: Reducing Cost to Access Large Language Models, https://arxiv.org/abs/2410.11857 (Deploying a proxy between user and LLM, with handling of conversational history context and caching.)
Mastering LLM, Aug 17, 2024, How Much GPU Memory is Needed to Serve a Large Language Model (LLM)? https://masteringllm.medium.com/how-much-gpu-memory-is-needed-to-serve-a-large-languagemodel-llm-b1899bb2ab5d
Fan Yang, Zehao Wang, Haoyu Zhang, Zhenhua Zhu, Xinhao Yang, Guohao Dai, Yu Wang, Oct 2024, Efficient Deployment of Large Language Model across Cloud-Device Systems, https://nicsefc.ee.tsinghua.edu.cn/nics_file/pdf/f06a14c1-4d6d-441d-b4e4-82545ac5781b.pdf
Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh, 4 Nov 2024, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
Alina Mailach, Sebastian Simon, Johannes Dorn, Norbert Siegmund, 13 Nov 2024, Practitioners' Discussions on Building LLM-based Applications for Production, https://arxiv.org/abs/2411.08574
Sonal Prabhune, Donald J. Berndt, 7 Nov 2024, Deploying Large Language Models With Retrieval Augmented Generation, https://arxiv.org/abs/2411.11895
Narcisa Guran, Florian Knauf, Man Ngo, Stefan Petrescu, Jan S. Rellermeyer, 21 Nov 2024, Towards a Middleware for Large Language Models, https://arxiv.org/abs/2411.14513
Yanyu Chen, Ganhong Huang, 6 Dec 2024, GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments, https://arxiv.org/abs/2412.04788
Leigh Engel and Anthony Larijani, Dec 11, 2024, Deploying NVIDIA H200 NVL at Scale with New Enterprise Reference Architecture, https://developer.nvidia.com/blog/deploying-nvidia-h200-nvl-at-scale-with-new-enterprise-reference-architecture/
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
Kailai Sun, Xinwei Wang, Xi Miao, and Qianchuan Zhao. 2025. A review of AI edge devices and lightweight CNN and LLM deployment. Neurocomput. 614, C (Jan 2025). https://doi.org/10.1016/j.neucom.2024.128791 https://dl.acm.org/doi/abs/10.1016/j.neucom.2024.128791
Alex Fazio, Feb 2025, How to Build an LLM Chat App: The New Litmus Test for Junior Devs, https://x.com/alxfazio/status/1893242657331101976 (How to build a wrapper chat app that scales by taking care of message queueing, API rate limits, history database management, caching, and other real-world deployment issues.)
The Latency Gambler, May 10, 2025, Scaling to 1 Million Users: The Architecture I Wish I Knew Sooner, https://medium.com/@kanishks772/scaling-to-1-million-users-the-architecture-i-wish-i-knew-sooner-39c688ded2f1
John Edwards, Jul 22, 2025, 7 things you need to know about AI and the data center, https://www.cio.com/article/222623/7-things-to-know-about-ai-in-the-data-center.html
Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang, 28 Apr 2025, Taming the Titans: A Survey of Efficient LLM Inference Serving, https://arxiv.org/abs/2504.19720 (Survey of various inference and serving optimizations, such as parallelism, offloading, scheduling, length prediction, KV cache compression, and prefill-decode phase disaggregation.)
Annie S. Chen, Govind Chada, Laura Smith, Archit Sharma, Zipeng Fu, Sergey Levine, Chelsea Finn, 22 Jul 2025, Adapt On-the-Go: Behavior Modulation for Single-Life Robot Deployment, https://arxiv.org/abs/2311.01059
Rodrigo Moreira and Larissa F. Rodrigues Moreira and Flávio de Oliveira Silva, 23 Jul 2025, Performance Evaluation and Threat Mitigation in Large-scale 5G Core Deployment, https://arxiv.org/abs/2507.17850
Aidan Furlong, Xingang Zhao, Robert Salko, Xu Wu, 18 Jul 2025, Development and Deployment of Hybrid ML Models for Critical Heat Flux Prediction in Annulus Geometries, https://arxiv.org/abs/2507.14332
Anton Abilov, Ke Zhang, Hemank Lamba, Elizabeth M. Olson, Joel R. Tetreault, Alejandro Jaimes, 21 Jul 2025, Operationalizing AI for Good: Spotlight on Deployment and Integration of AI Models in Humanitarian Work, https://arxiv.org/abs/2507.15823
Tuo Zhang, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang, 10 Aug 2025, Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative, https://arxiv.org/abs/2508.07329
Stephan Rabanser, 11 Aug 2025, Uncertainty-Driven Reliability: Selective Prediction and Trustworthy Deployment in Modern Machine Learning, https://arxiv.org/abs/2508.07556
Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, 28 Jul 2025, Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition, https://arxiv.org/abs/2507.20526
Yixin Song, Zhenliang Xue, Dongliang Wei, Feiyang Chen, Jianxiang Gao, Junchen Liu, Hangyu Liang, Guangshuo Qin, Chengrong Tian, Bo Wen, Longyu Zhao, Xinrui Zheng, Zeyu Mi, Haibo Chen, 28 Jul 2025, SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment, https://arxiv.org/abs/2507.20984
Hugo Retief, Kayathri Vigneswaran, Surajit Ghosh, Mariangel Garcia Andarcia, Chris Dickens, 28 Jul 2025, Satellite-Surface-Area Machine-Learning Models for Reservoir Storage Estimation: Regime-Sensitive Evaluation and Operational Deployment at Loskop Dam, South Africa, https://arxiv.org/abs/2502.19989
Arianna Stropeni, Francesco Borsatti, Manuel Barusco, Davide Dalle Pezze, Marco Fabris, Gian Antonio Susto, 28 Jul 2025, Towards Scalable IoT Deployment for Visual Anomaly Detection via Efficient Compression, https://arxiv.org/abs/2505.07119
Alex Durkin, Jasper Stolte, Matthew Jones, Raghuraman Pitchumani, Bei Li, Christian Michler, Mehmet Mercangöz, 30 Jul 2025, Safe Deployment of Offline Reinforcement Learning via Input Convex Action Correction, https://arxiv.org/abs/2507.22640
Wentao Zhang, Yilei Zhao, Chuqiao Zong, Xinrun Wang, Bo An, 4 Aug 2025, FinWorld: An All-in-One Open-Source Platform for End-to-End Financial AI Research and Deployment, https://arxiv.org/abs/2508.02292
Dai Li, Kevin Course, Wei Li, Hongwei Li, Jie Hua, Yiqi Chen, Zhao Zhu, Rui Jian, Xuan Cao, Bi Xue, Yu Shi, Jing Qian, Kai Ren, Matt Ma, Qunshu Zhang, Rui Li, 4 Aug 2025, Realizing Scaling Laws in Recommender Systems: A Foundation-Expert Paradigm for Hyperscale Model Deployment, https://arxiv.org/abs/2508.02929
Zakariya Ba Alawi, 6 Aug 2025, A Comparative Survey of PyTorch vs TensorFlow for Deep Learning: Usability, Performance, and Deployment Trade-offs, https://arxiv.org/abs/2508.04035
Mengyu Li, Guoyao Shen, Chad W. Farris, Xin Zhang, 7 Aug 2025, Few-Shot Deployment of Pretrained MRI Transformers in Brain Imaging Tasks, https://arxiv.org/abs/2508.05783
Bill Tang, Çağıl Koçyiğit, Eric Rice, Phebe Vayanos, 11 Aug 2025, Learning Optimal and Fair Policies for Online Allocation of Scarce Societal Resources from Data Collected in Deployment, https://arxiv.org/abs/2311.13765
Nikola Pižurica, Nikola Milović, Igor Jovančević, Conor Heins, and Miguel de Prado, 12 Aug 2025, A Hardware-oriented Approach for Efficient Active Inference Computation and Deployment, https://arxiv.org/abs/2508.13177
Tianheng Ling, Vipin Singh, Chao Qian, Felix Biessmann and Gregor Schiele, 19 Aug 2025, Automated Energy-Aware Time-Series Model Deployment on Embedded FPGAs for Resilient Combined Sewer Overflow Management, https://arxiv.org/abs/2508.13905
Sukheon Kang, Youngkwon Kim, Jinkyu Yang, Seunghwa Ryu, 19 Aug 2025, Physics-Informed Neural Networks for Programmable Origami Metamaterials with Controlled Deployment, https://arxiv.org/abs/2508.13559
Mackenzie Jorgensen, Kendall Brogle, Katherine M. Collins, Lujain Ibrahim, Arina Shah, Petra Ivanovic, Noah Broestl, Gabriel Piles, Paul Dongha, Hatim Abdulhussein, Adrian Weller, Jillian Powers, Umang Bhatt, 18 Aug 2025, Documenting Deployment with Fabric: A Repository of Real-World AI Governance, https://arxiv.org/abs/2508.14119
Mauro Belgiovine, Chris Dick, Kaushik Chowdhury, 17 Aug 2025, Better Together: Leveraging Multiple Digital Twins for Deployment Optimization of Airborne Base Stations, https://arxiv.org/abs/2508.15816
Shayan Vassef, Soorya Ram Shimegekar, Abhay Goyal, Koustuv Saha, Pi Zonooz, and Navin Kumar, 22 Aug 2025, Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment, https://arxiv.org/abs/2508.16839
Deepak Kumar, Divakar Yadav, Yash Patel, 22 Aug 2025, GPT-OSS-20B: A Comprehensive Deployment-Centric Analysis of OpenAI's Open-Weight Mixture of Experts Model, https://arxiv.org/abs/2508.16700
Drew Prinster, Xing Han, Anqi Liu, Suchi Saria, 25 Aug 2025, WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales, https://arxiv.org/abs/2505.04608
Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu, 5 Sep 2025, Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment, https://arxiv.org/abs/2503.15937
Xinyi Hou, Jiahao Han, Yanjie Zhao, Haoyu Wang, 26 Aug 2025, Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study, https://arxiv.org/abs/2505.02502
Can Cui, Zilong Fu, Penghe Huang, Yuanyuan Li, Wu Deng, Dongyan Li, 30 Aug 2025, An Efficient GNNs-to-KANs Distillation via Self-Attention Dynamic Sampling with Potential for Consumer Electronics Edge Deployment, https://arxiv.org/abs/2509.00560
Wei Huang, Anda Cheng, Zhao Zhang, Yinggui Wang, 1 Sep 2025, DPF-CM: A Data Processing Framework with Privacy-Preserving Vector Databases for Chinese Medical LLMs Training and Deployment, https://arxiv.org/abs/2509.01354
Victor Guyomard and Mathis Mauvisseau and Marie Paindavoine, 8 Sep 2025, Breaking SafetyCore: Exploring the Risks of On-Device AI Deployment, https://arxiv.org/abs/2509.06371
Juan D. Gil, Ehecatl Antonio Del Rio Chanona, José L. Guzmán, Manuel Berenguel, 8 Sep 2025, Reinforcement learning meets bioprocess control through behaviour cloning: Real-world deployment in an industrial photobioreactor, https://arxiv.org/abs/2509.06853
Luxi He, Xiangyu Qi, Michel Liao, Inyoung Cheong, Prateek Mittal, Danqi Chen, Peter Henderson, 9 Sep 2025, The Model Hears You: Audio Language Model Deployments Should Consider the Principle of Least Privilege, https://arxiv.org/abs/2503.16833
Vera Pavlova and Mohammed Makhlouf, 18 Sep 2025, Efficient and Versatile Model for Multilingual Information Retrieval of Islamic Text: Development and Deployment in Real-World Scenarios, https://arxiv.org/abs/2509.15380
Xianyuan Liu, Jiayang Zhang, Shuo Zhou, Thijs L. van der Plas, Avish Vijayaraghavan, Anastasiia Grishina, Mengdie Zhuang, Daniel Schofield, Christopher Tomlinson, Yuhan Wang, Ruizhe Li, Louisa van Zeeland, Sina Tabakhi, Cyndie Demeocq, Xiang Li, Arunav Das, Orlando Timmerman, Thomas Baldwin-McDonald, Jinge Wu, Peizhen Bai, Zahraa Al Sahili, Omnia Alwazzan, Thao N. Do, Mohammod N.I. Suvon, Angeline Wang, Lucia Cipolina-Kun, Luigi A. Moretti, Lucas Farndale, Nitisha Jain, Natalia Efremova, Yan Ge, Marta Varela, Hak-Keung Lam, Oya Celiktutan, Ben R. Evans, Alejandro Coca-Castro, Honghan Wu, Zahraa S. Abdallah, Chen Chen, Valentin Danchev, Nataliya Tkachenko, Lei Lu, Tingting Zhu, Gregory G. Slabaugh, Roger K. Moore, William K. Cheung, Peter H. Charlton, Haiping Lu, 19 Sep 2025, Towards deployment-centric multimodal AI beyond vision and language, https://arxiv.org/abs/2504.03603
Qinghui Liu, Jon E. Nesvold, Hanna Raaum, Elakkyen Murugesu, Martin Røvang, Bradley J MacIntosh, Atle Bjørnerud, Karoline Skogen, 19 Sep 2025, Examining Deployment and Refinement of the VIOLA-AI Intracranial Hemorrhage Model Using an Interactive NeoMedSys Platform, https://arxiv.org/abs/2505.09380
Ozan Baris Mulayim, Elias N. Pergantis, Levi D. Reyes Premer, Bingqing Chen, Guannan Qu, Kevin J. Kircher, Mario Bergés, 1 Oct 2025, Comparative Field Deployment of Reinforcement Learning and Model Predictive Control for Residential HVAC, https://arxiv.org/abs/2510.01475
Ali Mekky, Omar El Herraoui, Preslav Nakov, Yuxia Wang, 14 Oct 2025, HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment, https://arxiv.org/abs/2510.12217
Deokjae Lee and Hyun Oh Song, 24 Sep 2025, Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment, https://arxiv.org/abs/2509.20214
Saif Ur Rehman Khan, Muhammad Nabeel Asim, Sebastian Vollmer, Andreas Dengel, 23 Oct 2025, Dynamic Weight Adjustment for Knowledge Distillation: Leveraging Vision Transformer for High-Accuracy Lung Cancer Detection and Real-Time Deployment, https://arxiv.org/abs/2510.20438
Tushar Nayan (Florida International University), Ziqi Zhang (University of Illinois Urbana-Champaign), Ruimin Sun (Florida International University), 22 Oct 2025, SecureInfer: Heterogeneous TEE-GPU Architecture for Privacy-Critical Tensors for Large Language Model Deployment, https://arxiv.org/abs/2510.19979
Derda Kaymak, Gyuhak Kim, Tomoya Kaichi, Tatsuya Konishi, Bing Liu, 20 Oct 2025, Learning After Model Deployment, https://arxiv.org/abs/2510.17160
Kayhan Behdin, Qingquan Song, Sriram Vasudevan, Jian Sheng, Xiaojing Ma, Z Zhou, Chuanrui Zhu, Guoyao Li, Chanh Nguyen, Sayan Ghosh, Hejian Sang, Ata Fatahi Baarzi, Sundara Raman Ramachandran, Xiaoqing Wang, Qing Lan, Vinay Y S, Qi Guo, Caleb Johnson, Zhipeng Wang, Fedor Borisyuk, 25 Oct 2025, Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search, https://arxiv.org/abs/2510.22101
Pavel Dolin, Weizhi Li, Gautam Dasarathy, Visar Berisha, 24 Oct 2025, Statistically Valid Post-Deployment Monitoring Should Be Standard for AI-Based Digital Health, https://arxiv.org/abs/2506.05701
Ryan Marinelli and Angelica Chowdhury, 30 Sep 2025, Scaling Homomorphic Applications in Deployment, https://arxiv.org/abs/2510.02376
Kunyu Wu, Qiushi Zhao, Zihan Feng, Yunxi Mu, Hao Qin, Xinyu Zhang, and Xingqi Zhang, 29 Sep 2025, Intelligent Optimization of Wireless Access Point Deployment for Communication-Based Train Control Systems Using Deep Reinforcement Learning, https://arxiv.org/abs/2509.24819
Ivan Kartashov, Mariia Pushkareva, Iakov Karandashev, 17 Oct 2025, SpikeFit: Towards Optimal Deployment of Spiking Networks on Neuromorphic Hardware, https://arxiv.org/abs/2510.15542
Raghav Sharma and Manan Mehta, 4 Oct 2025, Small Language Models for Agentic Systems: A Survey of Architectures, Capabilities, and Deployment Trade offs, https://arxiv.org/abs/2510.03847
Lyes Saad Saoud and Irfan Hussain, 6 Oct 2025, Bio-Inspired Robotic Houbara: From Development to Field Deployment for Behavioral Studies, https://arxiv.org/abs/2510.04692
Viet Nguyen, Changjian Shui, Vijay Giri, Siddarth Arya, Michael Cooper, Amol Verma, Fahad Razak, Rahul G. Krishnan, 3 Oct 2025, Reliably Detecting Model Failures in Deployment Without Labels, https://arxiv.org/abs/2506.05047
Deven Panchal, 12 Oct 2025, Simpliflow: A Lightweight Open-Source Framework for Rapid Creation and Deployment of Generative Agentic AI Workflows, https://arxiv.org/abs/2510.10675
Lion Mueller, Alberto Garcia-Ortiz, Ardalan Najafi, Adam Fuks, Lennart Bamberg, 13 Oct 2025, Rescaling-Aware Training for Efficient Deployment of Deep Learning Models on Full-Integer Hardware, https://arxiv.org/abs/2510.11484
Shaokai Wu, Yanbiao Ji, Qiuchang Li, Zhiyi Zhang, Qichen He, Wenyuan Xie, Guodong Zhang, Bayram Bayramli, Yue Ding, Hongtao Lu, 11 Oct 2025, Dejavu: Post-Deployment Learning for Embodied Agents via Experience Feedback, https://arxiv.org/abs/2510.10181
Jiecheng Zhou, Qinghao Hu, Yuyang Jin, Zerui Wang, Peng Sun, Yuzhe Gu, Wenwei Zhang, Mingshu Zhai, Xingcheng Zhang, Weiming Zhang, 13 Oct 2025, RL in the Wild: Characterizing RLVR Training in LLM Deployment, https://arxiv.org/abs/2509.25279
Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno, Takuma Udagawa, 9 Oct 2025, Safely Exploring Novel Actions in Recommender Systems via Deployment-Efficient Policy Learning, https://arxiv.org/abs/2510.07635
Guanzhong Pan, Haibo Wang, 30 Aug 2025, A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services, https://arxiv.org/abs/2509.18101
Xingkun Yin, Kaibin Huang, Dong In Kim, Hongyang Du, 23 Sep 2025, Experience Scaling: Post-Deployment Evolution For Large Language Models, https://arxiv.org/abs/2509.18771
Javed I. Khan and Henry Uwabor Moye, 9 Sep 2025, A Study of Skews, Imbalances, and Pathological Conditions in LLM Inference Deployment on GPU Clusters detectable from DPU, https://arxiv.org/abs/2509.18114
Yixiao Chen, Yanyue Xie, Ruining Yang, Wei Jiang, Wei Wang, Yong He, Yue Chen, Pu Zhao and Yanzhi Wang, 30 Sep 2025, Collaborative Compression for Large-Scale MoE Deployment on Edge, https://arxiv.org/abs/2509.25689
Ibrahim Salihu Yusuf, Iffanice Houndayi, Rym Oualha, Mohamed Aziz Cherif, Kobby Panford-Quainoo, Arnu Pretorius, 7 Oct 2025, InstaGeo: Compute-Efficient Geospatial Machine Learning from Data to Deployment, https://arxiv.org/abs/2510.05617
Hanbo Huang, Yihan Li, Bowen Jiang, Bo Jiang, Lin Liu, Ruoyu Sun, Zhuotao Liu, Shiyu Liang, 7 Oct 2025, A Middle Path for On-Premises LLM Deployment: Preserving Privacy Without Sacrificing Model Confidentiality, https://arxiv.org/abs/2410.11182
Brandon Hill and Kma Solaiman, 16 Oct 2025, BoardVision: Deployment-ready and Robust Motherboard Defect Detection with YOLO+Faster-RCNN Ensemble, https://arxiv.org/abs/2510.14389
Qinfeng Li, Tianyue Luo, Xuhong Zhang, Yangfan Xie, Zhiqiang Shen, Lijun Zhang, Yier Jin, Hao Peng, Xinkui Zhao, Xianwei Zhu, and Jianwei Yin, 16 Oct 2025, CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment, https://arxiv.org/abs/2410.13903
Xu Cheng, Liang Yao, Feng He, Yukuo Cen, Yufei He, Chenhui Zhang, Wenzheng Feng, Hongyun Cai, Jie Tang, 19 Jul 2025, LPS-GNN: Deploying Graph Neural Networks on Graphs with 100-Billion Edges, https://arxiv.org/abs/2507.14570
Christina Butsko, Kristof Van Tricht, Gabriel Tseng, Giorgia Milli, David Rolnick, Ruben Cartuyvels, Inbal Becker Reshef, Zoltan Szantoi, Hannah Kerner, 16 Jul 2025, Deploying Geospatial Foundation Models in the Real World: Lessons from WorldCereal, https://arxiv.org/abs/2508.00858
Yanjie Dong, Haijun Zhang, Chengming Li, Song Guo, Victor C. M. Leung, Xiping Hu, 6 Aug 2025, Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches, https://arxiv.org/abs/2408.10691
Yuhao Zhou, Jindi Lv, Yuxin Tian, Dan Si, Qing Ye, Jiancheng Lv, 18 Aug 2025, Deploying Models to Non-participating Clients in Federated Learning without Fine-tuning: A Hypernetwork-based Approach, https://arxiv.org/abs/2508.12673
Jack Gallifant, Katherine C. Kellogg, Matt Butler, Amanda Centi, Shan Chen, Patrick F. Doyle, Sayon Dutta, Joyce Guo, Matthew J. Hadfield, Esther H. Kim, David E. Kozono, Hugo JWL Aerts, Adam B. Landman, Raymond H. Mak, Rebecca G. Mishuris, Tanna L. Nelson, Guergana K. Savova, Elad Sharon, Benjamin C. Silverman, Umit Topaloglu, Jeremy L. Warner, and Danielle S. Bitterman, 1 Oct 2025, Beyond the Algorithm: A Field Guide to Deploying AI Agents in Clinical Practice, https://arxiv.org/abs/2509.26153
Segev Shlomov, Alon Oved, Sami Marreed, Ido Levy, Offer Akrabi, Avi Yaeli, Łukasz Strąk, Elizabeth Koumpan, Yinon Goldshtein, Eilam Shapira, Nir Mashkif, Asaf Adi, 27 Oct 2025, From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production, https://arxiv.org/abs/2510.23856
Haochen Su, Cristian Meo, Francesco Stella, Andrea Peirone, Kai Junge, Josie Hughes, 20 Oct 2025, Bridging Embodiment Gaps: Deploying Vision-Language-Action Models on Soft Robots, https://arxiv.org/abs/2510.17369
Kayhan Behdin, Ata Fatahibaarzi, Qingquan Song, Yun Dai, Aman Gupta, Zhipeng Wang, Shao Tang, Hejian Sang, Gregory Dexter, Sirou Zhu, Siyu Zhu, Tejas Dharamsi, Vignesh Kothapalli, Zhoutong Fu, Yihan Cao, Pin-Lun Hsu, Fedor Borisyuk, Natesh Pillai, Luke Simon, Rahul Mazumder, 26 Oct 2025, Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems, https://arxiv.org/abs/2502.14305
Balázs Mészáros, James C. Knight, Jonathan Timcheck and Thomas Nowotny, 15 Oct 2025, A Complete Pipeline for deploying SNNs with Synaptic Delays on Loihi 2, https://arxiv.org/abs/2510.13757
Yuze Sun and Wentao Luo and Yanfei Xiang and Jiancheng Pan and Jiahao Li and Quan Zhang and Xiaomeng Huang, 14 Oct 2025, Deploying Atmospheric and Oceanic AI Models on Chinese Hardware and Framework: Migration Strategies, Performance Optimization and Analysis, https://arxiv.org/abs/2510.17852
Angel M. Beltre, Jeff Ogden, Kevin Pedretti, 24 Sep 2025, Experience Deploying Containerized GenAI Services at an HPC Center, https://arxiv.org/abs/2509.20603
Adam Swanda, Amy Chang, Alexander Chen, Fraser Burch, Paul Kassianik, Konstantin Berlin, 25 Sep 2025, A Framework for Rapidly Developing and Deploying Protection Against Large Language Model Attacks, https://arxiv.org/abs/2509.20639
Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Mizanur Rahman, Amran Bhuiyan, Israt Jahan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang, 10 Oct 2025, Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices, https://arxiv.org/abs/2510.07545
Xin Nie, Liang Dong, HaiCheng Zhang, JiaWang Xiao, G. Sun, 22 Oct 2025, ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices, https://arxiv.org/abs/2510.19482
Paula Reyero Lobo, Kevin Johnson, Bill Buchanan, Matthew Shardlow, Ashley Williams, Samuel Attwood, 30 Sep 2025, TVS Sidekick: Challenges and Practical Insights from Deploying Large Language Models in the Enterprise, https://arxiv.org/abs/2509.26482
Arjan Blankestijn, Uraz Odyurt, Amirreza Yousefzadeh, 30 Sep 2025, TrackCore-F: Deploying Transformer-Based Subatomic Particle Tracking on FPGAs, https://arxiv.org/abs/2509.26335
Michael R Smith, Joe Ingram, 27 Aug 2025, Surveying the Operational Cybersecurity and Supply Chain Threat Landscape when Developing and Deploying AI Systems, https://arxiv.org/abs/2508.20307
Thanh Tung Khuat, Johnny Peng, Robert Bassett, Ellen Otte, Bogdan Gabrys, 29 Aug 2025, Lessons Learned from Deploying Adaptive Machine Learning Agents with Limited Data for Real-time Cell Culture Process Monitoring, https://arxiv.org/abs/2509.02606
Jarvis Haupt, Qin Lu, Yanning Shen, Jia Chen, Yue Dong, Dan McCreary, Mehmet Akçakaya, Georgios B. Giannakis, 10 Sep 2025, Deploying AI for Signal Processing education: Selected challenges and intriguing opportunities, https://arxiv.org/abs/2509.08950
Weimin Wu, Xuefeng Song, Yibo Wen, Qinjie Lin, Zhihan Zhou, Jerry Yao-Chieh Hu, Zhong Wang, Han Liu, 13 Sep 2025, Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models, https://arxiv.org/abs/2509.12266
Eric Zhang, Li Wei, Sarah Chen, Michael Wang (SSHealth Team, AI for Healthcare Laboratory), 17 Sep 2025, Deploying UDM Series in Real-Life Stuttered Speech Applications: A Clinical Evaluation Framework, https://arxiv.org/abs/2509.14304
Philip Kiely, Inference Engineering, March 2026, https://www.baseten.co/inference-engineering/
David Spuler, Ph.D., March 1st 2026 (updated), List of 600+ Low-Latency C++ Techniques, Aussie AI Blog, https://www.aussieai.com/blog/list-low-latency-techniques
David Spuler, Ph.D., Feb 6th, 2026 (updated), 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
David Spuler, Michael Sharpe, June 2025, RAG Deployment, Chapter 12, "RAG Optimization: Accurate and Efficient LLM Applications", https://www.aussieai.com/book/rag-book-12-rag-deployment
David Spuler, Ph.D., January 23, 2025 Low Latency Programming, Aussie AI Blog, https://www.aussieai.com/blog/low-latency-programming
David Spuler, Michael Sharpe, 2025, Deployment Architecture, Chapter 18, in "Generative AI Applications", https://www.aussieai.com/book/ai-apps-book-18-deployment-architecture
David Spuler, March 2024, Chapter 7. Deployment Architecture, in book "Generative AI in C++", https://www.aussieai.com/book/ch7-deployment-architecture
David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, in book "Generative AI in C++", https://www.aussieai.com/book/ch54-ensemble-research
David Spuler, March 2024, Generative AI in C++: Coding Transformers and LLMs, https://www.aussieai.com/book/toc PDF: https://www.aussieai.com/pdf/BOOK-Generative-AI-CPP-Spuler-2024.pdf

Batching

Research papers on batching:

Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu, 11 May 2024, Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving, https://arxiv.org/abs/2405.06856
D Shin, May 8, 2024, Multi-User Language Model Resource Allocation Using Contextual Pause Token Aware Transformers, Technical Disclosure Commons, https://www.tdcommons.org/dpubs_series/6981/ PDF: https://www.tdcommons.org/cgi/viewcontent.cgi?article=8121&context=dpubs_series (Interesting idea of training a model how and when to pause during inference, so it can be pre-empted if needed, and thus the overall system can schedule batching of multiple queries more optimally.)
Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
Shashank Verma and Neal Vaidya, Nov 17, 2023, Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
Qidong Su, Christina Giannoula, Gennady Pekhimenko, Oct 2023, The Synergy of Speculative Decoding and Batching in Serving Large Language Models, https://arxiv.org/abs/2310.18813 (Optimizing by adapting dynamically the length of the speculated sequence in batches.)
Jaehyun Park, Jaewan Choi, Kwanhee Kyung, Michael Jaemin Kim, Yongsuk Kwon, Nam Sung Kim, Jung Ho Ahn, April 2024, AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, Pages 103–119, https://doi.org/10.1145/3620665.3640422
Gyeongin Yu, Geon-Woo Kim, Joo Seong Jeong, Soo Jeong Kim, Byung-Gon Chun, 2022, Selective Batching for Inference System for Transformer-Based Generation Tasks, U.S. Patent, US20230177401A1 https://patents.google.com/patent/US20230177401A1/en
Ke Cheng, Wen Hu, Zhi Wang, Peng Du, Jianguo Li, Sheng Zhang, 7 Jun 2024, Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction, https://arxiv.org/abs/2406.04785
Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover, 15 Apr 2024, Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, https://arxiv.org/abs/2404.09529 Code: https://github.com/siyan-zhao/prepacking (Optimizes prefill KV cache computations by batching multiple query prefill phases together via packing, since prefill token sequence lengths are fully known, and further combined with simple modifications to positional encoding and masking to avoid cross-query attention.)
Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang, 19 Jun 2024, Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving, https://arxiv.org/abs/2406.13511 (Improved batched scheduling by splitting queries into fixed-size token generation slices.)
Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Peng Zhao, 23 Jun 2024 (v2), Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Disaggregates the KV cache between prefill and decoding tokens, since the KV cache size is known for prefill, thereby reducing memory fragmentation, and also applies kernel fusion to several modules including the scaled dot product attention.)
Kartik Talamadupula, March 4, 2024, A Guide to LLM Inference Performance Monitoring, https://symbl.ai/developers/blog/a-guide-to-llm-inference-performance-monitoring/
Yuqing Yang, Yuedong Xu, Lei Jiao, 7 Jul 2024, A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length, https://arxiv.org/abs/2407.05347
Myung Beom Her, Jisu Jeong, Hojoon Song, Ji-Hyeong Han, 5 Jul 2024, Batch Transformer: Look for Attention in Batch, https://arxiv.org/abs/2407.04218
Isaac Ong, May 16, 2024, Efficient Distributed LLM Inference with Dynamic Partitioning, Masters Thesis, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-108, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.pdf
Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
Yu, Lingfan, 2024, Improve Language Model Serving Efficiency With Fine-Grained and Stateful Scheduling, Ph.D. Thesis, Department of Computer Science, New York University, ProQuest Dissertations & Theses, 31139782, https://www.proquest.com/openview/7200cdfc0906f1d4edb8008b4368bcf9 PDF: https://cs.nyu.edu/media/publications/lingfan_yu_phd_thesis.pdf (Examines efficiency of batching methods and how to create a "stateful" version with cached multi-turn conversation history using session-based KV caching.)
Felippe Vieira Zacarias, Kiran Palli, Sudharshan Vazhkudai, Evelyn Grevelink, July 2024, Analyzing LLM performance: The impact of high-bandwidth memory on model inference, https://www.micron.com/content/dam/micron/global/public/documents/products/product-flyer/llm-inference-engineering-report.pdf
Ruijie Miao, Yihan Yan, Xinshuo Yao, Tong Yang, 25 Jul 2024, An Efficient Inference Framework for Early-exit Large Language Models, https://arxiv.org/abs/2407.20272 (Faster early exit using batching and KV cache resolution.)
Amr Elmeleegy, Shivam Raj, Brian Slechta and Vishal Mehta, Jun 12, 2024, Demystifying AI Inference Deployments for Trillion Parameter Large Language Models, NVIDIA Technical Blog, https://developer.nvidia.com/blog/demystifying-ai-inference-deployments-for-trillion-parameter-large-language-models/
Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
S. Selvam, A. Nagarajan, A. Raghunathan, 16 August 2024, Efficient Batched Inference in Conditional Neural Networks, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3445263, https://ieeexplore.ieee.org/abstract/document/10638141 Code: https://github.com/surya00060/BatchCond
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
Sergei Savvov, Jun 27, 2023, 7 Ways To Speed Up Inference of Your Hosted LLMs, https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
O. Khan, J. Yu, Y. Kim and E. Seo, 2024, Efficient Adaptive Batching of DNN Inference Services for Improved Latency, 2024 International Conference on Information Networking (ICOIN), Ho Chi Minh City, Vietnam, 2024, pp. 197-200, doi: 10.1109/ICOIN59985.2024.10572152, https://ieeexplore.ieee.org/document/10572152
Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, Satoshi Matsuoka, 2018, Accelerating Deep Learning Frameworks with Micro-Batches, 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 402-412, DOI: 10.1109/CLUSTER.2018.00058, https://www.computer.org/csdl/proceedings-article/cluster/2018/831900a402/17D45WHONl5
Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
Cade Daniel, Chen Shen, Eric Liang and Richard Liaw, June 22, 2023, How continuous batching enables 23x throughput in LLM inference while reducing p50 latency, https://www.anyscale.com/blog/continuous-batching-llm-inference
Zihao Ye, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
Zihao Ye, Ruihang Lai, Bo-Ru Lu, Chien-Yu Lin, Size Zheng, Lequn Chen, Tianqi Chen, Luis Ceze, Feb 2, 2024, Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding, https://flashinfer.ai/2024/02/02/cascade-inference.html
Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo, 17 Jan 2024, Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native, https://arxiv.org/abs/2401.12230
Jiayi Liu, Tinghan Yang, Jennifer Neville, 17 Feb 2024, CliqueParcel: An Approach For Batching LLM Prompts That Jointly Optimizes Efficiency And Faithfulness, https://arxiv.org/abs/2402.14833
Lightning AI, 2024, Batching, https://lightning.ai/docs/litserve/features/batching
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu, 24 Sep 2024 (v2), UELLM: A Unified and Efficient Approach for LLM Inference Serving, https://arxiv.org/abs/2409.14961
Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4, https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
Shixiaowei02, Oct 2024, TensorRT-LLM 0.13.0 Release, https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.13.0
Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Yang Wang, Miryung Kim, Harry Xu, 2 Oct 2024, ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving, https://arxiv.org/abs/2410.01228
Shi, J., Shi, C. (2025). Improve LLM Inference Performance with Matrix Decomposition Strategies. In: Shi, Z., Witbrock, M., Tian, Q. (eds) Intelligence Science V. ICIS 2024. IFIP Advances in Information and Communication Technology, vol 720. Springer, Cham. https://doi.org/10.1007/978-3-031-71253-1_12 https://link.springer.com/chapter/10.1007/978-3-031-71253-1_12 (Speed up matrix operations with SVD and NMF via adaptive block sizing based on batching.)
Michael Nuñez, October 8, 2024, Anthropic challenges OpenAI with affordable batch processing, https://venturebeat.com/ai/anthropic-challenges-openai-with-affordable-batch-processing/
Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
Y Cong, 2024, Research for Enhancing Processing and Computational Efficiency in LLM, 2024 2nd International Conference on Image, https://www.atlantis-press.com/article/126004157.pdf
Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon, 23 Oct 2024, ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference, https://arxiv.org/abs/2410.17954
Peizhuang Cong, Qizhi Chen, Haochen Zhao, Tong Yang, 24 Oct 2024, BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching, https://arxiv.org/abs/2410.18701
Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques.)
Don Moon, Aug 28, 2024, Chunked Prefill and Decode-Maximal Batching, https://medium.com/byte-sized-ai/llm-inference-optimizations-2-chunked-prefill-764407b3a67a
Ming Yin, Minshuo Chen, Kaixuan Huang, Mengdi Wang, 30 Oct 2024, A Theoretical Perspective for Speculative Decoding Algorithm, https://arxiv.org/abs/2411.00841
Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica, 25 Nov 2024, BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching, https://arxiv.org/abs/2411.16102
Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, Gang Peng, 29 Nov 2024, BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching, https://arxiv.org/abs/2412.03594
Yanyu Chen, Ganhong Huang, 6 Dec 2024, GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments, https://arxiv.org/abs/2412.04788
Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh, 7 Dec 2024, Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression, https://arxiv.org/abs/2412.05693 (KV cache compression in prefill or prompt processing phase.)
Anjali Shah, Kshitiz Gupta, Jiahong Liu and Haohang Huang, Dec 11, 2024, NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching, https://developer.nvidia.com/blog/nvidia-tensorrt-llm-now-accelerates-encoder-decoder-models-with-in-flight-batching/
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli, 19 Dec 2024 (v2), Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, https://arxiv.org/abs/2412.13663 (Encoder-only BERT model updated with modern optimizations including Flash attention, bias removal, RoPE, pre-norm, GeGLU (a GELU variant), hybrid local-global attention, and zero padding removal.)
Waleed Kadous, May 17, 2023, Numbers every LLM Developer should know, https://www.anyscale.com/blog/num-every-llm-developer-should-know (Includes discussion of "be concise" prompting.)
Y Xiao, Dec 2024, Optimizing the Serving System for Large Language Model Inference, https://charlie-xiao.github.io/assets/pdf/projects/fluidinfer.pdf (Concatenates or splits batches for higher throughput.)
Qwen Team, January 21, 2025, Global-batch load balance almost free lunch to improve your MoE LLM training, https://qwenlm.github.io/blog/global-load-balance/
Zhiyuan Fang, Yuegui Huang, Zicong Hong, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng, 9 Feb 2025, Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline, https://arxiv.org/abs/2502.06888
Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, Josep Ll. Berral, 11 Mar 2025, Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference, https://arxiv.org/abs/2503.08311
Bilel Bensaid, Gaël Poëtte, Rodolphe Turpault. Numerical splitting schemes as the cornerstone for mini-batch optimization. 2025. hal-04991621 https://hal.science/hal-04991621/document
Raja Gond, Nipun Kwatra, Ramachandran Ramjee, 16 May 2025, TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference, https://arxiv.org/abs/2505.11329
Abhimanyu Rajeshkumar Bambhaniya, Hanjiang Wu, Suvinay Subramanian, Sudarshan Srinivasan, Souvik Kundu, Amir Yazdanbakhsh, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, April 2025, Understanding and Optimizing Multi-Stage AI Inference Pipelines, https://ui.adsabs.harvard.edu/abs/2025arXiv250409775R/abstract https://arxiv.org/abs/2504.09775
CC Hu, HY Huang, LL Xu, XS Chen, C Wang, J Xu, 2025, ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads, ACM Trans. Arch. Code Optim., https://dl.acm.org/doi/pdf/10.1145/3732941 https://doi.org/10.1145/3732941
Aniruddha Nrusimha, William Brandon, Mayank Mishra, Yikang Shen, Rameswar Panda, Jonathan Ragan-Kelley, Yoon Kim, 28 May 2025, FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference, https://arxiv.org/abs/2505.22758 https://github.com/aninrusimha/flashformer (Optimizing kernels for low latency in a single isolated query, not a batch, via kernel fusion and running all components in one kernel, along with programming techniques like metaprogramming.)
Wanyi Zheng, Minxian Xu, Shengye Song, Kejiang Ye, 23 Jul 2025, BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving, https://arxiv.org/abs/2507.17120
Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, Ang Li, 2 Jul 2025, EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices, https://arxiv.org/abs/2507.01438
Meixuan Wang, Yinyu Ye, Zijie Zhou, 8 Aug 2025, LLM Serving Optimization with Variable Prefill and Decode Lengths, https://arxiv.org/abs/2508.06133
Orlenys López-Pintado, Jannis Rosenbaum, Marlon Dumas, 21 Jul 2025, Optimization of Activity Batching Policies in Business Processes, https://arxiv.org/abs/2507.15457
Seohyeon Cha, Kevin Chan, Gustavo de Veciana, Haris Vikalo, 18 Aug 2025, Batching-Aware Joint Model Onloading and Offloading for Hierarchical Multi-Task Inference, https://arxiv.org/abs/2508.13380
Zixi Chen, Yinyu Ye, Zijie Zhou, 20 Aug 2025, Adaptively Robust LLM Inference Optimization under Prediction Uncertainty, https://arxiv.org/abs/2508.14544 (Using length prediction and optimistic scheduling for inference.)
Nikolay Kutuzov, Makar Baderko, Stepan Kulibaba, Artem Dzhalilov, Daniel Bobrov, Maxim Mashtaler, Alexander Gasnikov, 25 Aug 2025, AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models, https://arxiv.org/abs/2508.18182
Sangmin Bae, 7 Sep 2025, Accelerating Large Language Model Inference via Early-Exiting Algorithms, https://arxiv.org/abs/2509.05915 (Impact of early-exiting on batch performance with mitigation by co-design and layer sharing.)
Devansh, Sep 2025, The Chocolate Milk Cult’s Guide to Inference Scaling for AI Models: How to Reduce the costs of Running LLMs, https://machine-learning-made-simple.medium.com/the-chocolate-milk-cults-guide-to-inference-scaling-for-ai-models-50aa2290eb50 (Deep analysis of applying many progressive optimizations to real-life LLM inference.)
Carl Franzen, September 24, 2025, Chinese food delivery app Meituan's open source AI model LongCat-Flash-Thinking rivals GPT-5, https://venturebeat.com/ai/chinese-food-delivery-firm-meituans-open-source-ai-model-longcat-flash
James Pan, Guoliang Li, 27 Jun 2025, A Survey of LLM Inference Systems, https://arxiv.org/abs/2506.21901
Jaehong Cho, Hyunmin Choi, Guseul Heo, Jongse Park, 26 Feb 2026, LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure, https://arxiv.org/abs/2602.23036
Sean Goedecke, February 15, 2026, Two different tricks for fast LLM inference, https://www.seangoedecke.com/fast-llm-inference/ (Anthropic uses "low batch size" inference whereas OpenAI uses various approaches.)
Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yifan Qiao, Yang Zhou, Jiarong Xing, and Ion Stoica. 2026. BlendServe: Optimizing Offline Inference with Resource-Aware Batching. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '26). Association for Computing Machinery, New York, NY, USA, 255–273. https://doi.org/10.1145/3779212.3790133 https://dl.acm.org/doi/abs/10.1145/3779212.3790133
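
Several entries above (for example, the Prepacking paper) describe batching prefill work by packing several prompts together, restarting positional encodings per prompt, and masking out cross-query attention. The following is a minimal illustrative sketch of that general idea in plain Python, not the authors' implementation. The function names `pack_prompts` and `padded_slots` are hypothetical, and the sketch packs every prompt into a single row for simplicity, whereas a real system would distribute prompts across several rows.

```python
def pack_prompts(prompts):
    """Concatenate prompts into one row; return tokens, position ids, and mask.

    Hypothetical helper for illustration only.
    """
    tokens, positions, segment = [], [], []
    for seg_id, prompt in enumerate(prompts):
        tokens.extend(prompt)
        positions.extend(range(len(prompt)))    # positions restart for each prompt
        segment.extend([seg_id] * len(prompt))  # remember which prompt owns each token
    n = len(tokens)
    # mask[i][j] is True when token i may attend to token j:
    # both tokens belong to the same prompt (no cross-query attention)
    # and j is not in the future of i (causal attention).
    mask = [[segment[i] == segment[j] and j <= i for j in range(n)]
            for i in range(n)]
    return tokens, positions, mask


def padded_slots(prompts):
    """Token slots used when every prompt is padded to the longest prompt."""
    return len(prompts) * max(len(p) for p in prompts)


# Three toy prompts of different lengths (token ids are arbitrary).
prompts = [[11, 12, 13], [21], [31, 32]]
tokens, positions, mask = pack_prompts(prompts)
print("padded slots:", padded_slots(prompts))
print("packed slots:", len(tokens))
print("position ids:", positions)
```

In this toy case the padded batch occupies more token slots than the packed row, which is the waste that packing is intended to avoid. The trade-off is the extra bookkeeping for position ids and the attention mask.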

Continuous Batching

Research papers on continuous batching:

Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 20 Jun 2024, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput, https://arxiv.org/abs/2406.14066 (Estimation of the draft length for increased acceptance to improve overall performance.)
Yuqing Yang, Yuedong Xu, Lei Jiao, 7 Jul 2024, A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length, https://arxiv.org/abs/2407.05347
Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
Sergei Savvov, Jun 27, 2023, 7 Ways To Speed Up Inference of Your Hosted LLMs, https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 6 Jun 2024 (v2), SGLang: Efficient Execution of Structured Language Model Programs, https://arxiv.org/abs/2312.07104 https://github.com/sgl-project/sglang
Sharada Yeluri, Feb 20, 2024, LLM Inference: HW/SW Optimizations, Juniper Networks Community Blog, https://community.juniper.net/blogs/sharada-yeluri/2024/02/20/llm-inference-hw-sw-optimizations
Cade Daniel, Chen Shen, Eric Liang and Richard Liaw, June 22, 2023, How continuous batching enables 23x throughput in LLM inference while reducing p50 latency, https://www.anyscale.com/blog/continuous-batching-llm-inference
Hugging Face, 2024, Text Generation Inference, https://huggingface.co/docs/text-generation-inference/index
Sungmin Yun, Kwanhee Kyung, Juhwan Cho, Jaewan Choi, Jongmin Kim, Byeongho Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn, 2 Sep 2024, Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching, https://arxiv.org/abs/2409.01141
OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He, 10 Sep 2024 (v2), ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency, https://arxiv.org/abs/2408.00008
Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques.)
OpenVINO™ toolkit, Sep 26, 2024, How To Efficiently Serve Today’s Large Language Models, https://medium.com/openvino-toolkit/how-to-efficiently-serve-todays-large-language-models-b3f1e8d33fdf
Anjali Shah, Kshitiz Gupta, Jiahong Liu and Haohang Huang, Dec 11, 2024, NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching, https://developer.nvidia.com/blog/nvidia-tensorrt-llm-now-accelerates-encoder-decoder-models-with-in-flight-batching/
NVIDIA, Dec 2024, Multi-Head, Multi-Query, and Group-Query Attention, https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#kv-cache
Wei Da, Evangelia Kalyvianaki, 5 Aug 2025, Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling, https://arxiv.org/abs/2508.03611
Devansh, Sep 2025, The Chocolate Milk Cult’s Guide to Inference Scaling for AI Models: How to Reduce the costs of Running LLMs, https://machine-learning-made-simple.medium.com/the-chocolate-milk-cults-guide-to-inference-scaling-for-ai-models-50aa2290eb50 (Deep analysis of applying many progressive optimizations to real-life LLM inference.)
Aleksa Gordić, August 29, 2025, Inside vLLM: Anatomy of a High-Throughput LLM Inference System: From paged attention, continuous batching, prefix caching, specdec, etc. to multi-GPU, multi-node dynamic serving at scale, https://www.aleksagordic.com/blog/vllm
Jiawei Xu, Chia Xin Liang, Ziqian Bi, Xiaoming Li, Danyang Zhang, Zhenyu Yu, Dec 2025, A Comprehensive Survey on Large Language Models: From Pre-training to Autonomous Agents, https://www.researchgate.net/profile/Ziqian_Bi/publication/399059225_A_Comprehensive_Survey_on_Large_Language_Models_From_Pre-training_to_Autonomous_Agents/links/694c94a07e61d05b5312836f/A-Comprehensive-Survey-on-Large-Language-Models-From-Pre-training-to-Autonomous-Agents.pdf
Zylos, 15 Jan 2026, LLM Inference Optimization and Quantization 2026, https://zylos.ai/research/2026-01-15-llm-inference-optimization
Morph, March 27, 2026, LLM Inference Optimization: A Practical Guide to Cutting Cost and Latency (2026): Concrete techniques for optimizing LLM inference across model, system, and application layers. Quantization, KV cache compression, continuous batching, speculative decoding, and context compaction with real benchmarks, https://www.morphllm.com/llm-inference-optimization
Dev Patel, Sep 30, 2025, Inside Real-Time LLM Inference: From Prefill to Decode, Explained, https://medium.com/@devsp0703/inside-real-time-llm-inference-from-prefill-to-decode-explained-72a1c9b1d85a
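
As a rough illustration of why continuous batching (also called in-flight batching) can raise throughput, the toy simulation below counts decode iterations under two policies. It is a simplified sketch under stated assumptions, not the scheduler of any framework listed above: the function names are hypothetical, each request is reduced to the number of tokens it will generate (assumed positive), prefill cost is ignored, and every decode iteration is assumed to take the same time regardless of how many slots are occupied.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled at the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every active request emits one token,
        # and requests with no tokens left release their slot.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for six queued requests.
lengths = [2, 8, 3, 7, 1, 6]
print("static batching steps:    ", static_batching_steps(lengths, 2))
print("continuous batching steps:", continuous_batching_steps(lengths, 2))
```

With these example lengths and two slots, the static policy needs 21 decode iterations while the continuous policy needs 15, because short requests no longer hold a slot idle while waiting for the longest request in their batch to finish.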

Frameworks

Research on inference frameworks as part of serving:

Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
Jeon, Byungsoo, May 2024, Automated and Portable Machine Learning Systems, Ph.D. Thesis, Carnegie Mellon University, https://doi.org/10.1184/R1/25746708.v1 https://kilthub.cmu.edu/articles/thesis/Automated_and_Portable_Machine_Learning_Systems/25746708/1 PDF: https://kilthub.cmu.edu/ndownloader/files/46074087 Code: https://github.com/cmu-catalyst/collage (Portability layer to integrate the various kernels and low-level backends more easily. Also covers pipeline parallelism in graph models, and KV cache parallelism similar to FlashDecode.)
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
Martin Thissen, April 20, 2024, Llama 3 on Your Local Computer | Free GPT-4 Alternative, https://medium.com/@martin-thissen/llama-3-on-your-local-computer-free-gpt-4-alternative-1f533e9abff7 (Llama3-70B with 4-bit quantization using vLLM for inference on NVIDIA RTX 6000 Ada GPU.)
Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen, 17 Apr 2024, Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models, https://arxiv.org/abs/2404.11502 (Benchmarks the performance of various Transformer inference frameworks: Transformers, vLLM, DeepSpeed-MII, TGI, TensorRT-LLM, llama.cpp, LightLLM, LMDeploy, StreamingLLM.)
Pierrick Pochelu, 9 Oct 2022, Deep Learning Inference Frameworks Benchmark, https://arxiv.org/abs/2210.04323 (Benchmarking study in 2022 of various frameworks.)
Max A. Cherney, March 26, 2024, Exclusive: Behind the plot to break Nvidia's grip on AI by targeting software, https://www.reuters.com/technology/behind-plot-break-nvidias-grip-ai-by-targeting-software-2024-03-25/
Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang, Sep 2023, Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations, https://arxiv.org/pdf/2309.08978.pdf
Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. 2020. InferLine: latency-aware provisioning and scaling for prediction serving pipelines. Proceedings of the 11th ACM Symposium on Cloud Computing. 477–491, https://arxiv.org/abs/1812.01776
Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique, 2020, Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead, https://ieeexplore.ieee.org/iel7/6287639/6514899/09269334.pdf, https://arxiv.org/abs/2012.11233 (Analysis of optimizations for DNNs and SNNs.)
Suresh G, Sep 25, 2023, 7 Frameworks for Serving LLMs, Medium, https://medium.com/@gsuresh957/7-frameworks-for-serving-llms-5044b533ee88
Doug Eadline, October 5, 2023, How AMD May Get Across the CUDA Moat, HPC Wire, https://www.hpcwire.com/2023/10/05/how-amd-may-get-across-the-cuda-moat/
Hayden Wolff, Jun 02, 2024, A Simple Guide to Deploying Generative AI with NVIDIA NIM, NVIDIA Technical Blog, https://developer.nvidia.com/blog/a-simple-guide-to-deploying-generative-ai-with-nvidia-nim/
K Dinghofer, F Hartung, 2020, Analysis of criteria for the selection of machine learning frameworks, 2020 International Conference on Computing, Networking and Communications (ICNC), https://ieeexplore.ieee.org/document/9049650
H Dai, X Peng, X Shi, L He, Q Xiong, H Jin, 2022, Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment, Science China Information Sciences volume 65, Article number: 112103 (2022), https://link.springer.com/article/10.1007/s11432-020-3182-1 http://scis.scichina.com/en/2022/112103.pdf
C Luo, X He, J Zhan, L Wang, W Gao, J Dai, 2020, Comparison and benchmarking of AI models and frameworks on mobile devices, https://arxiv.org/abs/2005.05085
Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele, July 2022, A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks, https://arxiv.org/abs/2111.04949 PDF: https://pssg.cs.umd.edu/assets/papers/2022-07-dl-survey-arxiv.pdf (Survey of frameworks from the theoretical perspective of parallelism.)
R. Sanchez-Iborra and A. F. Skarmeta, Tinyml-enabled frugal smart objects: Challenges and opportunities, IEEE Circuits and Systems Magazine, vol. 20, no. 3, pp. 4–18, 2020. https://ieeexplore.ieee.org/document/9166461 PDF: https://sci-hub.se/10.1109/MCAS.2020.3005467
R. Immonen, T. Hämäläinen et al., Tiny machine learning for resource-constrained microcontrollers, Journal of Sensors, vol. 2022, 2022, https://www.hindawi.com/journals/js/2022/7437023/
M. Giordano, L. Piccinelli, and M. Magno, Survey and comparison of milliwatts micro controllers for tiny machine learning at the edge, in 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 2022, pp. 94–97. https://ieeexplore.ieee.org/document/9870017
Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787–808. https://www.usenix.org/conference/nsdi23/presentation/zhang-hong
Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
Fabian Both, June 2024, Why we no longer use LangChain for building our AI agents, https://www.octomind.dev/blog/why-we-no-longer-use-langchain-for-building-our-ai-agents (Replaces LangChain with their own more-focused internal tool sets.)
LiLMod, Aug 27, 2024, Haystack: the new LLM framework that is shaking its competitors, https://ai.plainenglish.io/haystack-the-new-llm-framework-that-is-shaking-its-competitors-1a083a153fd9
Aparna Dhinakaran, Sep 2024, Choosing Between LLM Agent Frameworks. The tradeoffs between building bespoke code-based agents and the major agent frameworks. https://towardsdatascience.com/choosing-between-llm-agent-frameworks-69019493b259
Nicola Sessions, Oct 15, 2024, DataStax Announces New AI Development Platform, Built with NVIDIA AI, https://developer.nvidia.com/blog/datastax-announces-new-ai-development-platform-built-with-nvidia-ai/
Anurag Guda and Shruthii Sathyanarayanan, Oct 16, 2024, Simplify AI Application Development with NVIDIA Cloud Native Stack, https://developer.nvidia.com/blog/simplify-ai-application-development-with-nvidia-cloud-native-stack/
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo, 17 Oct 2024, Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation, https://arxiv.org/abs/2410.13848 https://github.com/deepseek-ai/Janus?tab=readme-ov-file
Robert Corwin, Nov 2024, Running Large Language Models Privately: A comparison of frameworks, models, and costs, https://towardsdatascience.com/running-large-language-models-privately-a-comparison-of-frameworks-models-and-costs-ac33cfe3a462
Kristian McCann, November 13, 2024, Top 10 AI Frameworks, https://aimagazine.com/articles/top-10-ai-frameworks
Meta, Jan 2025 (accessed), Llama Stack: Composable building blocks to build Llama Apps, https://github.com/meta-llama/llama-stack

Serverless

Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo, 17 Jan 2024, Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native, https://arxiv.org/abs/2401.12230
Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 25 Jan 2024, ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models, https://arxiv.org/abs/2401.14351 Code: https://github.com/ServerlessLLM/ServerlessLLM
David Linthicum, July 2, 2024, Serverless cloud technology fades away, InfoWorld, https://www.infoworld.com/article/3715605/serverless-cloud-technology-fades-away.html
Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
Joe Oakley, Hakan Ferhatosmanoglu, 22 Mar 2024, FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication, https://arxiv.org/abs/2403.15195
Hao Wu, Yue Yu, Junxiao Deng, Shadi Ibrahim, Song Wu, Hao Fan, Ziyue Cheng, Hai Jin, 2024, StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow, USENIX ATC 2024, https://www.usenix.org/conference/atc24/presentation/wu-hao PDF: https://www.usenix.org/system/files/atc24-wu-hao.pdf
Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, ServerlessLLM: Low-Latency Serverless Inference for Large Language Models, 2024, OSDI 2024, https://www.usenix.org/conference/osdi24/presentation/fu
Google, 2024, L’Oréal: Launching Gen AI as a Service in 3 months with Cloud Run and LangChain, https://services.google.com/fh/files/misc/google_loreal_with_langchain_case_study.pdf
Jianfeng Gu, Yichao Zhu, Puxuan Wang, Mohak Chadha, Michael Gerndt, 1 Sep 2023, FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference, https://arxiv.org/abs/2309.00558
C. Lu et al., "SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing," in 2024 SC24: International Conference for High Performance Computing, Networking, Storage and Analysis SC, Atlanta, GA, United States, 2024, pp. 590-606, doi: 10.1109/SC41406.2024.00044. https://www.computer.org/csdl/proceedings-article/sc/2024/529100a590/21HUVxvcnoA
Dingyan Zhang, Haotian Wang, Yang Liu, Xingda Wei, Yizhou Shan, Rong Chen, Haibo Chen, 23 Dec 2024, Fast and Live Model Auto Scaling with O(1) Host Caching, https://arxiv.org/abs/2412.17246
HuggingFace, Jan 28, 2025, Welcome to Inference Providers on the Hub, https://huggingface.co/blog/inference-providers (Announcing "integration of four awesome serverless Inference Providers – fal, Replicate, Sambanova, Together AI")
Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Jie Meng, Chao He, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, Yizhou Shan, 27 Jan 2025 (v2), DeepFlow: Serverless Large Language Model Serving at Scale, https://arxiv.org/abs/2501.14417
Z. Wu, Y. Deng, J. Hu, L. Cui, Z. Zhang, L. Zeng, G. Min, 28 Feb 2025, Cicada: A Pipeline-Efficient Approach to Serverless Inference with Decoupled Management, https://arxiv.org/abs/2502.20959
Chuhao Xu, Zijun Li, Quan Chen, Han Zhao, Minyi Guo, 1 Jul 2025, LLM-Mesh: Enabling Elastic Sharing for Serverless LLM Inference, https://arxiv.org/abs/2507.00507
https://arxiv.org/abs/2505.14468
Guilin Zhang, Wulan Guo, Ziqi Tan, Srinivas Vippagunta, Suchitra Raman, Shreeshankar Chatterjee, Ju Lin, Shang Liu, Mary Schladenhauffen, Jeffrey Luo, Hailong Jiang, 22 Oct 2025, Serverless GPU Architecture for Enterprise HR Analytics: A Production-Scale BDaaS Implementation, https://arxiv.org/abs/2510.19689
Hanfei Yu, Bei Ouyang, Shwai He, Ang Li, Hao Wang, 6 Mar 2026, MoEless: Efficient MoE LLM Serving via Serverless Computing, https://arxiv.org/abs/2603.06350
Minchen Yu, Ao Wang, Bohui Wu, Yuxuan Liu, Dong Chen, Haoxuan Yu, Wei Wang, Ruichuan Chen, Dapeng Nie, Haoran Yang, and Yu Ding. 2026. Enabling Low-Latency, GPU-Efficient Serverless Inference with Model Swapping. ACM Trans. Archit. Code Optim. Just Accepted (April 2026). https://doi.org/10.1145/3800690 https://dl.acm.org/doi/abs/10.1145/3800690 https://dl.acm.org/doi/pdf/10.1145/3800690

Scheduling

Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica, 22 Apr 2024, Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity, https://arxiv.org/abs/2404.14527 Code: https://github.com/tyler-griggs/melange-release
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng ZHANG, Dahua Lin, Ion Stoica, Hao Zhang, 02 May 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, ICML 2024, https://openreview.net/forum?id=R0SoZvqXyQ PDF: https://openreview.net/pdf?id=R0SoZvqXyQ Code: https://github.com/hao-ai-lab/MuxServe (Separates the prefill and decoding phases when serving, and also manages the LLM weights and KV cache data in blocks for memory efficiency.)
Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf (Faster inference serving via iterative scheduling, separating prefill and decoding phase computations for batching, using priority-based schedulers with preemption, and controlling transfer of KV caches from prefill to decoders.)
Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang, 19 Jun 2024, Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving, https://arxiv.org/abs/2406.13511 (Improved batched scheduling by splitting queries into fixed-size token generation slices.)
Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
Grant Wilkins, Srinivasan Keshav, Richard Mortier, 4 Jul 2024, Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems, https://arxiv.org/abs/2407.04014
Xin Tan, Jingzong Li, Jiamin Li, Yitao Yang, Hong Xu, August 2024, Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths, ICPP ’24, August 12–15, 2024, Gotland, Sweden, https://doi.org/10.1145/3673038.3673124 https://kanonjz.github.io/academic/share/xin-icpp24.pdf
Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse, 1 Aug 2024, DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency, https://arxiv.org/abs/2408.00741
Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
vLLM, 2024, Performance and Tuning, https://docs.vllm.ai/en/latest/models/performance.html
Mingjin Zhang, 2024, High-performance scheduling of deep learning tasks in collaborative edge computing, Ph.D. Thesis, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, https://theses.lib.polyu.edu.hk/bitstream/200/13080/3/7528.pdf (Scheduling of inference and training tasks on edge devices with techniques such as model splitting/partitioning.)
Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan, 24 Aug 2024, Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling, https://arxiv.org/abs/2408.13510
Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181
Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang, 28 Aug 2024, Efficient LLM Scheduling by Learning to Rank, https://arxiv.org/abs/2408.15792 https://github.com/hao-ai-lab/vllm-ltr.git
Eric Samikwa, 2024, Resource-Aware Distributed Machine Learning for Artificial Intelligence of Things, Ph.D. thesis, Faculty of Science, University of Bern, Switzerland, https://boristheses.unibe.ch/5378/1/24samikwa_e_1_.pdf https://doi.org/10.48549/5378 (Multi-edge device with early exit, "micro-split" scheduling, split/federated learning, and distributed inference.)
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
T Zhao, 2024, Acceleration of Deep Learning Algorithms with Transformers, https://escholarship.org/uc/item/3419t2z6
Y. Peng, W. Gao and H. Peng, "Serving DNN Inference With Fine-Grained Spatio-Temporal Sharing of GPU Servers," in IEEE Transactions on Services Computing, doi: 10.1109/TSC.2024.3463429. https://ieeexplore.ieee.org/document/10684028 https://www.computer.org/csdl/journal/sc/5555/01/10684028/20lm4PEVn9u
Jianfeng Gu, Yichao Zhu, Puxuan Wang, Mohak Chadha, Michael Gerndt, 1 Sep 2023, FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference, https://arxiv.org/abs/2309.00558
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu, 24 Sep 2024 (v2), UELLM: A Unified and Efficient Approach for LLM Inference Serving, https://arxiv.org/abs/2409.14961
Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Yang Wang, Miryung Kim, Harry Xu, 2 Oct 2024, ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving, https://arxiv.org/abs/2410.01228
Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, Michael Mitzenmacher, 1 Oct 2024, Don't Stop Me Now: Embedding Based Scheduling for LLMs, https://arxiv.org/abs/2410.01035
Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
Wei Zhao, Anand Jayarajan, Gennady Pekhimenko, 9 Oct 2024, Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads, https://arxiv.org/abs/2410.07381 (Interleaved scheduling layer for GPU workloads.)
S Durvasula, A Zhao, R Kiguru, Y Guan, Z Chen, Oct 2024, ACE: Efficient GPU Kernel Concurrency for Input-Dependent Irregular Computational Graphs, PACT ’24, October 14–16, 2024, Southern California, CA, USA, https://www.embarclab.com/static/media/ace.1c73b44bc2ad143f7b9f.pdf (Identify parallel kernels at runtime.)
Ferdi Kossmann, Bruce Fontaine, Daya Khudia, Michael Cafarella, Samuel Madden, 23 Oct 2024, Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs, https://arxiv.org/abs/2410.17840
Zebin Yang, Renze Chen, Taiqiang Wu, Ngai Wong, Yun Liang, Runsheng Wang, Ru Huang, Meng Li, 23 Oct 2024, MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers, https://arxiv.org/abs/2410.17957
Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher, 23 Oct 2024, Efficient Inference for Augmented Large Language Models, https://arxiv.org/abs/2410.18248
Youpeng Zhao, Jun Wang, 31 Oct 2024, ALISE: Accelerating Large Language Model Serving with Speculative Scheduling, https://arxiv.org/abs/2410.23537
R Mendoza, I Cruz, P Singh, A Martinez, N Kim, S Patel, Nov 2024, Dynamic Resource Management for Efficient Fast Device Placement, https://www.researchgate.net/profile/Priya-Singh-103/publication/385528236_Dynamic_Resource_Management_for_Efficient_Fast_Device_Placement/links/672983c3ecbbde716b584acc/Dynamic-Resource-Management-for-Efficient-Fast-Device-Placement.pdf
H Zhang, Z Chen, XLY Liu, J Wu, L Wang, Nov 2024, Dynamic Fast Device Placement Strategies for Real-Time Resource Allocation, https://www.researchgate.net/profile/Haoran-Zhang-111/publication/385589353_Dynamic_Fast_Device_Placement_Strategies_for_Real-Time_Resource_Allocation/links/672b9ca977f274616d60a5e6/Dynamic-Fast-Device-Placement-Strategies-for-Real-Time-Resource-Allocation.pdf
Zhiqiang Xie, Hao Kang, Ying Sheng, Tushar Krishna, Kayvon Fatahalian, Christos Kozyrakis, 5 Nov 2024, AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution, https://arxiv.org/abs/2411.03519 (Scheduling multiple agents.)
Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2024. Queue Management for SLO-Oriented Large Language Model Serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 18–35. https://doi.org/10.1145/3698038.3698523 https://dl.acm.org/doi/abs/10.1145/3698038.3698523
Yuka Ikarashi, Kevin Qian, Samir Droubi, Alex Reinking, Gilbert Bernstein, Jonathan Ragan-Kelley, 14 Nov 2024 (v2), Exo 2: Growing a Scheduling Language, https://arxiv.org/abs/2411.07211
M. Gil et al., "TLP Balancer: Predictive Thread Allocation for Multi-Tenant Inference in Embedded GPUs," in IEEE Embedded Systems Letters, doi: 10.1109/LES.2024.3497587. https://ieeexplore.ieee.org/abstract/document/10753458/
Kyoungmin Kim, Kijae Hong, Caglar Gulcehre, Anastasia Ailamaki, 12 Nov 2024, The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving, https://arxiv.org/abs/2411.07447
Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica, 25 Nov 2024, BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching, https://arxiv.org/abs/2411.16102
Wenxiang Lin, Xinglin Pan, Shaohuai Shi, Xuan Wang, Xiaowen Chu, 24 Nov 2024, Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems, https://arxiv.org/abs/2411.15715
He, Y., Xu, M., Wu, J., Zheng, W., Ye, K., Xu, C. (2025). UELLM: A Unified and Efficient Approach for Large Language Model Inference Serving. In: Gaaloul, W., Sheng, M., Yu, Q., Yangui, S. (eds) Service-Oriented Computing. ICSOC 2024. Lecture Notes in Computer Science, vol 15404. Springer, Singapore. https://doi.org/10.1007/978-981-96-0805-8_16 https://link.springer.com/chapter/10.1007/978-981-96-0805-8_16
Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd C. Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen, 17 Dec 2024, A System for Microserving of LLMs, https://arxiv.org/abs/2412.12488 (Disaggregated prefill and decoding combined with context cache migration for sending the KV cache over the network.)
Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze, 2 Jan 2025, FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving, https://arxiv.org/abs/2501.01005
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Esha Choukse, Haoran Qiu, Rodrigo Fonseca, Josep Torrellas, Ricardo Bianchini, 5 Jan 2025, TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms, https://arxiv.org/abs/2501.02600
Liu Qianli, Hong Zicong, Chen Fahao, Li Peng, Guo Song, 12 Jan 2025, Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management, https://arxiv.org/abs/2501.06709
Archit Patke, Dhemath Reddy, Saurabh Jha, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer, 14 Jan 2025, Hierarchical Autoscaling for Large Language Model Serving with Chiron, https://arxiv.org/abs/2501.08090 (Scheduling for interactive versus batch queries.)
Shiyi Cao, Yichuan Wang, Ziming Mao, Pin-Lun Hsu, Liangsheng Yin, Tian Xia, Dacheng Li, Shu Liu, Yineng Zhang, Yang Zhou, Ying Sheng, Joseph Gonzalez, Ion Stoica, 24 Jan 2025, Locality-aware Fair Scheduling in LLM Serving, https://arxiv.org/abs/2501.14312 (Scheduling taking into account prefix cache availability based on locality.)
Gohar Irfan Chaudhry, Esha Choukse, Íñigo Goiri, Rodrigo Fonseca, Adam Belay, Ricardo Bianchini, 29 Jan 2025 (v2), Towards Resource-Efficient Compound AI Systems, https://arxiv.org/abs/2501.16634
Patrick Jaillet, Jiashuo Jiang, Chara Podimata, Zijie Zhou, 13 Feb 2025 (v2), Online Scheduling for LLM Inference with KV Cache Constraints, https://arxiv.org/abs/2502.07115
Youhe Jiang, Ran Yan, Binhang Yuan, 11 Feb 2025, HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment, https://arxiv.org/abs/2502.07903
Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
Bowen Pang, Kai Li, Ruifeng She, Feifan Wang, 14 Feb 2025, Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization, https://arxiv.org/abs/2502.15763
Kai Mei, Wujiang Xu, Shuhang Lin, Yongfeng Zhang, 27 Feb 2025, ECCOS: Efficient Capability and Cost Coordinated Scheduling for Multi-LLM Serving, https://arxiv.org/abs/2502.20576 https://github.com/agiresearch/ECCOS
Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, Xin Liu, 28 Feb 2025, ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs, https://arxiv.org/abs/2502.21231 (Addressing training inefficiencies when training data ranges from short to very long queries, including via hybrid data parallelism and communications optimizations.)
Amr Elmeleegy, Harry Kim, David Zier, Kyle Kranen, Neelay Shah, Ryan Olson and Omri Kahalon, Mar 18, 2025, Introducing NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models, https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/
Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, Xin Jin, 15 May 2025, ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production, https://arxiv.org/abs/2505.09999
CC Hu, HY Huang, LL Xu, XS Chen, C Wang, J Xu, 2025, ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads, ACM Trans. Archit. Code Optim., https://dl.acm.org/doi/pdf/10.1145/3732941 https://doi.org/10.1145/3732941
Azam Ikram, Xiang Li, Sameh Elnikety, Saurabh Bagchi, 30 Apr 2025 (v2), Ascendra: Dynamic Request Prioritization for Efficient LLM Serving, https://arxiv.org/abs/2504.20828
Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang, 28 Apr 2025, Taming the Titans: A Survey of Efficient LLM Inference Serving, https://arxiv.org/abs/2504.19720 (Survey of various inference and serving optimizations, such as parallelism, offloading, scheduling, length prediction, KV cache compression, and prefill-decode phase disaggregation.)
Wei Da, Evangelia Kalyvianaki, 5 Aug 2025, Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling, https://arxiv.org/abs/2508.03611
Meixuan Wang, Yinyu Ye, Zijie Zhou, 8 Aug 2025, LLM Serving Optimization with Variable Prefill and Decode Lengths, https://arxiv.org/abs/2508.06133
Michal Sutter, August 26, 2025, Your LLM is 5x Slower Than It Should Be. The Reason? Pessimism—and Stanford Researchers Just Showed How to Fix It, https://www.marktechpost.com/2025/08/26/your-llm-is-5x-slower-than-it-should-be-the-reason-pessimism-and-stanford-researchers-just-showed-how-to-fix-it/
Zixi Chen, Yinyu Ye, Zijie Zhou, 20 Aug 2025, Adaptively Robust LLM Inference Optimization under Prediction Uncertainty, https://arxiv.org/abs/2508.14544 (Using length prediction and optimistic scheduling for inference.)
Carl Franzen, September 24, 2025, Chinese food delivery app Meituan's open source AI model LongCat-Flash-Thinking rivals GPT-5, https://venturebeat.com/ai/chinese-food-delivery-firm-meituans-open-source-ai-model-longcat-flash
James Pan, Guoliang Li, 27 Jun 2025, A Survey of LLM Inference Systems, https://arxiv.org/abs/2506.21901
Xinglin Pan, Shaohuai Shi, Wenxiang Lin, Yuxin Wang, Zhenheng Tang, Wei Wang, Xiaowen Chu, 25 Dec 2025, Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism, https://arxiv.org/abs/2512.21487
Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, Jiawei Jiang, 1 Apr 2026, Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions, https://arxiv.org/abs/2604.00499
David Spuler, Ph.D., Feb 6th, 2026 (updated), 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
Ishan Dhanani and Matej Kosec, Apr 17, 2026, Full-Stack Optimizations for Agentic Inference with NVIDIA Dynamo, https://developer.nvidia.com/blog/full-stack-optimizations-for-agentic-inference-with-nvidia-dynamo/
Boskey Savla, Ekin Karabulut and Roman Iurkov, Feb 18, 2026, Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai, https://developer.nvidia.com/blog/unlock-massive-token-throughput-with-gpu-fractioning-in-nvidia-runai/

Load Balancing

Research papers on AI load balancing:

Grant Wilkins, 3 June 202, Online Workload Allocation and Energy Optimization in Large Language Model Inference Systems, Master of Philosophy in Advanced Computer Science, Churchill College, University of Cambridge, https://grantwilkins.github.io/gfw27_project.pdf
David Spuler, March 2024, Chapter 7. Deployment Architecture, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
J Liu, 2024, Data-driven Performance Optimization for Data-intensive Applications, Ph.D. Thesis, Electrical Engineering and Computer Science, University of California, Merced, https://escholarship.org/content/qt6gn2p8mn/qt6gn2p8mn.pdf (Optimization of data movement intensive algorithms, mostly non-AI applications.)
Muhammad Shahir Abdurrahman, An Efficient Network Orchestrator for Distributed Compound Language Model Systems, Stanford University, Stanford, California, USA, https://www.scs.stanford.edu/24sp-cs244b/projects/An_Efficient_Network_Orchestrator_for_Distributed_Compound_Language_Model_Systems.pdf
David Spuler, March 2024, Load Balancing, in Generative AI in C++, https://www.aussieai.com/book/ch7-load-balancer
N Kim, P Singh, S Patel, A Martinez, I Cruz, R Mendoza, Oct 2024, Dynamic Load Balancing Techniques for Efficient Fast Device Placement, https://www.researchgate.net/profile/Priya-Singh-103/publication/385103680_Dynamic_Load_Balancing_Techniques_for_Efficient_Fast_Device_Placement/links/6716aa9a68ac304149aa2fa6/Dynamic-Load-Balancing-Techniques-for-Efficient-Fast-Device-Placement.pdf
Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He, 10 Sep 2024 (v2), ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency, https://arxiv.org/abs/2408.00008
Ilias Bournias, Lukas Cavigelli, Georgios Zacharopoulos, 8 Nov 2024, AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality, https://arxiv.org/abs/2411.05555
Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
GeeksforGeeks, 25 Jun, 2024, What is Session Affinity in Load Balancing? https://www.geeksforgeeks.org/what-is-session-affinity-in-load-balancing/
Qwen Team, January 21, 2025, Global-batch load balance almost free lunch to improve your MoE LLM training, https://qwenlm.github.io/blog/global-load-balance/
Alex Fazio, Feb 2025, How to Build an LLM Chat App: The New Litmus Test for Junior Devs, https://x.com/alxfazio/status/1893242657331101976 (How to build a wrapper chat app that scales by taking care of message queueing with RabbitMQ or Kafka, API rate limits, history database management, in-memory caching with Redis, load balancing, and other real-world deployment issues.)
Jiangsu Du, Hongbin Zhang, Taosheng Wei, Zhenyi Zheng, Kaiyi Wu, Zhiguang Chen, Yutong Lu, 25 Apr 2025, EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration, https://arxiv.org/abs/2504.18154
Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang, 28 Apr 2025, Taming the Titans: A Survey of Efficient LLM Inference Serving, https://arxiv.org/abs/2504.19720 (Survey of various inference and serving optimizations, such as parallelism, offloading, scheduling, length prediction, KV cache compression, and prefill-decode phase disaggregation.)
Sarah McClure, Sylvia Ratnasamy, Scott Shenker, 28 Jul 2025, Load Balancing for AI Training Workloads, https://arxiv.org/abs/2507.21372
Wei Da and Evangelia Kalyvianaki, 5 Aug 2025, Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling, https://arxiv.org/abs/2508.03611
Han Ji, Xiping Wu, Zhihong Zeng, Chen Chen, 8 Sep 2025, Learning Load Balancing with GNN in MPTCP-Enabled Heterogeneous Networks, https://arxiv.org/abs/2410.17118
Yiyuan He, Minxian Xu, Jingfeng Wu, Jianmin Hu, Chong Ma, Min Shen, Le Chen, Chengzhong Xu, Lin Qu, Kejiang Ye, 15 Oct 2025, BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure, https://arxiv.org/abs/2510.13223
Leszek Sliwko, 22 Sep 2025, Intelligent Load Balancing in Cloud Computer Systems, https://arxiv.org/abs/2509.22704
Nabil Omi, Siddhartha Sen, Ali Farhadi, 11 Oct 2025, Load Balancing Mixture of Experts with Similarity Preserving Routers, https://arxiv.org/abs/2506.14038
James Pan, Guoliang Li, 27 Jun 2025, A Survey of LLM Inference Systems, https://arxiv.org/abs/2506.21901
Zylos, 15 Jan 2026, LLM Inference Optimization and Quantization 2026, https://zylos.ai/research/2026-01-15-llm-inference-optimization
David Spuler, Ph.D., March 12th, 2026, Scaling Your AI Wrapper Architecture, Aussie AI Blog, https://www.aussieai.com/blog/scaling-ai-wrapper-architectures
David Spuler, Michael Sharpe, June 2025, RAG Deployment, Chapter 12, "RAG Optimization: Accurate and Efficient LLM Applications", https://www.aussieai.com/book/rag-book-12-rag-deployment
Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen, 9 Apr 2026, Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving, https://arxiv.org/abs/2604.08075
David Spuler, Michael Sharpe, 2025, Deployment Architecture, Chapter 18, in "Generative AI Applications", https://www.aussieai.com/book/ai-apps-book-18-deployment-architecture
David Spuler, March 2024, Chapter 7. Deployment Architecture, in book "Generative AI in C++", https://www.aussieai.com/book/ch7-deployment-architecture
David Spuler, March 2024, Generative AI in C++: Coding Transformers and LLMs, https://www.aussieai.com/book/toc PDF: https://www.aussieai.com/pdf/BOOK-Generative-AI-CPP-Spuler-2024.pdf

Networking

Research papers on networking optimizations for LLMs:

Ari Lotter, Jeffrey Quesnelle, Umer H. Adil, Dillon Rolnick, Esteban La Rocca, 2024, A Preliminary Report on DisTrO, https://github.com/NousResearch/DisTrO/blob/main/A_Preliminary_Report_on_DisTrO.pdf https://venturebeat.com/wp-content/uploads/2024/08/A_Preliminary_Report_on_DisTrO.pdf (Reducing the inter-GPU networking bandwidth cost during training.)
Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou, 31 Aug 2024 (v2), Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, DeepSeek AI, https://www.arxiv.org/abs/2408.14158
David Spuler, 26th August, 2024, State-of-the-Art LLM Backends, Aussie AI Blog, https://www.aussieai.com/blog/state-of-the-art-llm-backends
Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
Dylan Patel and Daniel Nishball, Oct 03, 2024, AI Neocloud Playbook and Anatomy, https://www.semianalysis.com/p/ai-neocloud-playbook-and-anatomy
Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He, 10 Sep 2024 (v2), ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency, https://arxiv.org/abs/2408.00008
Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training: A review of the challenges in synchronous distributed training and best solutions for stragglers and high latency, https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz, 21 Mar 2017 (v3), Revisiting Distributed Synchronous SGD, https://arxiv.org/abs/1604.00981
Palak (Microsoft Research India), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India), Venkata N. Padmanabhan (Microsoft Research India), 16 Nov 2024, Improving training time and GPU utilization in geo-distributed language model training, https://arxiv.org/abs/2411.14458
Haiquan Wang, Chaoyi Ruan, Jia He, Jiaqi Ruan, Chengjie Tang, Xiaosong Ma, Cheng Li, 24 Nov 2024, Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution, https://arxiv.org/abs/2411.15871
Greg Gutmann, Sep 2020, Peer-to-peer Memory Copy with NVLink: CUDA Feature Testing, https://codingbyexample.com/2020/09/14/p2p-memcpy-with-nvlink/
Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram, 26 Nov 2024, Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation, https://arxiv.org/abs/2411.17089 (Overlapping/optimizing CPU-GPU network bandwidth for KV cache with some recomputation.)
Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma, 29 Nov 2024, DeMo: Decoupled Momentum Optimization, https://arxiv.org/abs/2411.19870 https://github.com/bloc97/DeMo (Extension to ADAM optimizer that greatly reduces network communication in training.)
Markus Rabe, Carl Case, November 14, 2024, Rethinking LLM Inference: Why Developer AI Needs a Different Approach, https://www.augmentcode.com/blog/rethinking-llm-inference-why-developer-ai-needs-a-different-approach
Y Tang, R Cheng, P Zhou, T Liu, F Liu, W Tang, K Bae, Nov 2024, Exploring CXL-based KV Cache Storage for LLM Serving, https://mlforsystems.org/assets/papers/neurips2024/paper17.pdf
Carl Franzen, August 27, 2024, ‘This could change everything!’ Nous Research unveils new tool to train powerful AI models with 10,000x efficiency, https://venturebeat.com/ai/this-could-change-everything-nous-research-unveils-new-tool-to-train-powerful-ai-models-with-10000x-efficiency/
Carl Franzen, December 2, 2024, Nous Research is training an AI model using machines distributed across the internet, https://venturebeat.com/ai/nous-research-is-training-an-ai-model-using-machines-distributed-across-the-internet/
Leigh Engel and Anthony Larijani, Dec 11, 2024, Deploying NVIDIA H200 NVL at Scale with New Enterprise Reference Architecture, https://developer.nvidia.com/blog/deploying-nvidia-h200-nvl-at-scale-with-new-enterprise-reference-architecture/
Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd C. Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen, 17 Dec 2024, A System for Microserving of LLMs, https://arxiv.org/abs/2412.12488 (Disaggregated prefill and decoding combined with context cache migration for sending the KV cache over the network.)
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
Anirudha Agrawal, Shaizeen Aga, Suchita Pati, Mahzabeen Islam, 18 Dec 2024, Optimizing ML Concurrent Computation and Communication with GPU DMA Engines, https://arxiv.org/abs/2412.14335
J. Du et al., "Co-designing Transformer Architectures for Distributed Inference with Low Communication," in IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2024.3521582. https://ieeexplore.ieee.org/abstract/document/10812976/ (Distributed inference with sub-block parallelism capabilities and a planning phase that optimizes both compute and communications.)
Zongwu Wang, Fangxin Liu, Mingshuai Li, Li Jiang, 29 Dec 2024, TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication, https://arxiv.org/abs/2412.20501 https://github.com/ACA-Lab-SJTU/token-ring (Ring attention with inter-GPU network transmission optimizations.)
Zongbiao Li, Xiezhao Li, Yinghao Cui, Yijun Chen, Zhixuan Gu, Yuxuan Liu, Wenbo Zhu, Fei Jia, Ke Liu, Qifeng Li, Junyao Zhan, Jiangtao Zhou, Chenxi Zhang, Qike Liu, 31 Dec 2024, Automatically Planning Optimal Parallel Strategy for Large Language Models, https://arxiv.org/abs/2501.00254
You Zhou, Changsheng You, Kaibin Huang, 1 Jan 2025, Communication Efficient Cooperative Edge AI via Event-Triggered Computation Offloading, https://arxiv.org/abs/2501.02001
Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
NVIDIA, 2024, nvbandwidth: A tool for bandwidth measurements on NVIDIA GPUs. https://github.com/NVIDIA/nvbandwidth
NVIDIA, 2024, DCGM Diagnostics, https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
Liu Qianli, Hong Zicong, Chen Fahao, Li Peng, Guo Song, 12 Jan 2025, Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management, https://arxiv.org/abs/2501.06709
Daniel S. Berger, Yuhong Zhong, Pantea Zardoshti, Shuwei Teng, Fiodar Kazhamiaka, Rodrigo Fonseca, 15 Jan 2025, Octopus: Scalable Low-Cost CXL Memory Pooling, https://arxiv.org/abs/2501.09020
Nandini Lokesh Reddy, Jan 2025, DeepSeek: Bridging Performance and Efficiency in Modern AI, https://medium.com/@nandinilreddy/deepseek-bridging-performance-and-efficiency-in-modern-ai-106181a85693
Y Wang, B Li, MTI Ziad, L Eeckhout, J Yang, A Jaleel, Jan 2025, OASIS: Object-Aware Page Management for Multi-GPU Systems https://users.elis.ugent.be/~leeckhou/papers/HPCA2025-OASIS.pdf
Sylvain Jeaugey, Giuseppe Congiu, Thomas Gillis, Ben Williams and Fred Oh, Jan 31, 2025, New Scaling Algorithm and Initialization with NVIDIA Collective Communications Library 2.23, https://developer.nvidia.com/blog/new-scaling-algorithm-and-initialization-with-nvidia-collective-communications-library-2-23/
Nick Comly, Joe DeLaere, Ashraf Eassa, Ivan Goldwasser, Brian Pharris and Brian Slechta, Sep 26, 2024, Low Latency Inference Chapter 2: Blackwell is Coming. NVIDIA GH200 NVL32 with NVLink Switch Gives Signs of Big Leap in Time to First Token Performance, https://developer.nvidia.com/blog/low-latency-inference-chapter-2-blackwell-is-coming-nvidia-gh200-nvl32-with-nvlink-switch-gives-signs-of-big-leap-in-time-to-first-token-performance/
Anton Korzh, Brian Pharris, Nick Comly, Ashraf Eassa and Amr Elmeleegy, Nov 01, 2024, 3x Faster AllReduce with NVSwitch and TensorRT-LLM MultiShot, https://developer.nvidia.com/blog/3x-faster-allreduce-with-nvswitch-and-tensorrt-llm-multishot/
Andy Patrizio, Feb 05, 2025, Nvidia claims near 50% boost in AI storage speed, https://www.networkworld.com/article/3817927/nvidia-claims-near-50-boost-in-ai-storage-speed.html
Shenggan Cheng, Shengjie Lin, Lansong Diao, Hao Wu, Siyu Wang, Chang Si, Ziming Liu, Xuanlei Zhao, Jiangsu Du, Wei Lin, and Yang You. 2025. Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (ASPLOS '25). Association for Computing Machinery, New York, NY, USA, 198–213. https://doi.org/10.1145/3669940.3707223 https://dl.acm.org/doi/abs/10.1145/3669940.3707223
Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, Eiko Yoneki, 13 Feb 2025, ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments, https://arxiv.org/abs/2502.09334
Hongsun Jang, Siung Noh, Changmin Shin, Jaewon Jung, Jaeyong Song, Jinho Lee, 14 Feb 2025, INF^2: High-Throughput Generative Inference of Large Language Models using Near-Storage Processing, https://arxiv.org/abs/2502.09921
Hulin Wang, Yaqi Xia, Donglin Yang, Xiaobo Zhou, and Dazhao Cheng. 2025. Harnessing Inter-GPU Shared Memory for Seamless MoE Communication-Computation Fusion. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP '25). Association for Computing Machinery, New York, NY, USA, 170–182. https://doi.org/10.1145/3710848.3710868 https://dl.acm.org/doi/abs/10.1145/3710848.3710868
Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, Xin Liu, 28 Feb 2025, ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs, https://arxiv.org/abs/2502.21231 (Addressing training inefficiencies when training data ranges from short to very long queries, including via hybrid data parallelism and communications optimizations.)
Dylan Butts, May 19 2025, Nvidia announces new tech to keep it at the center of AI development, https://www.cnbc.com/2025/05/19/nvidia-announces-new-tech-to-keep-it-at-the-center-of-ai-development-.html
Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, Xin Liu, 19 May 2025 (v2), MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production, https://arxiv.org/abs/2505.11432
M. Kim, P. Pinyoanuntapong, B. Kim, W. Saad and D. Calin, "Edge vs Cloud: How Do We Balance Cost, Latency, and Quality for Large Language Models Over 5G Networks?," 2025 IEEE Wireless Communications and Networking Conference (WCNC), Milan, Italy, 2025, pp. 1-6, doi: 10.1109/WCNC61545.2025.10978177, https://ieeexplore.ieee.org/abstract/document/10978177/
John Edwards, Jul 22, 2025, 7 things you need to know about AI and the data center, https://www.cio.com/article/222623/7-things-to-know-about-ai-in-the-data-center.html
MXI Circuits, Z Zhong, July 2025, Enabling Efficient GPU Communication over Multiple NICs with FuseLink, Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation. July 7–9, 2025, Boston, MA, USA, https://www.usenix.org/system/files/osdi25-ren.pdf
Nouamane Tazi, Ferdinand Mom, Haojun Zhao, Phuc Nguyen, Mohamed Mekkouri, Leandro Werra, Thomas Wolf, Feb 19, 2025, The Ultra-Scale Playbook: Training LLMs on GPU Clusters, Hugging Face, https://huggingface.co/spaces/nanotron/ultrascale-playbook https://huggingface.co/spaces/nanotron/ultrascale-playbook/resolve/main/The_Ultra-Scale_Playbook_Training_LLMs_on_GPU_Clusters.pdf
Daniel Commey, Kamel Abbad, Garth V. Crosby and Lyes Khoukhi, 18 Jul 2025, FedSkipTwin: Digital-Twin-Guided Client Skipping for Communication-Efficient Federated Learning, https://arxiv.org/abs/2507.13624
Zhiyong Jin, Runhua Xu, Chao Li, Yizhong Liu, Jianxin Li, 18 Jul 2025, Sparsification Under Siege: Defending Against Poisoning Attacks in Communication-Efficient Federated Learning, https://arxiv.org/abs/2505.01454
Mohamad Assaad, Zeinab Nehme, Merouane Debbah, 11 Aug 2025, Communication-Efficient Zero-Order and First-Order Federated Learning Methods over Wireless Networks, https://arxiv.org/abs/2508.08013
Abdelrhman Gaber, Hassan Abd-Eltawab, John Elgallab, Youssif Abuzied, Dineo Mpanya, Turgay Celik, Swarun Kumar, Tamer ElBatt, 30 Jul 2025, FedCVD++: Communication-Efficient Federated Learning for Cardiovascular Risk Prediction with Parametric and Non-Parametric Model Optimization, https://arxiv.org/abs/2507.22963
Chuan Li, Qianyi Zhao, Fengran Mo, Cen Chen, 7 Aug 2025, FedCoT: Communication-Efficient Federated Reasoning Enhancement for Large Language Models, https://arxiv.org/abs/2508.10020
Tolga Dimlioglu, Anna Choromanska, 27 Jul 2025, Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning, https://arxiv.org/abs/2507.20424
Sudip K. Seal, Maksudul Alam, Jorge Ramirez, Sajal Dash and Hao Lu, 1 Aug 2025, Compression-Induced Communication-Efficient Large Model Training and Inferencing, https://arxiv.org/abs/2508.00960
Sagar Shrestha, 17 Aug 2025, Communication-Efficient Distributed Asynchronous ADMM, https://arxiv.org/abs/2508.12233
Yue Xia, Tayyebeh Jahani-Nezhad and Rawad Bitar, 18 Aug 2025, Fed-DPRoC: Communication-Efficient Differentially Private and Robust Federated Learning, https://arxiv.org/abs/2508.12978
Zehang Lin, Zheng Lin, Miao Yang, Jianhao Huang, Yuxin Zhang, Zihan Fang, Xia Du, Zhe Chen, Shunzhi Zhu, Wei Ni, 18 Aug 2025, SL-ACC: A Communication-Efficient Split Learning Framework with Adaptive Channel-wise Compression, https://arxiv.org/abs/2508.12984
Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Shijie Xu, Weihong Luo, Yuhua Li, Xiuqiang He, Ruixuan Li, 17 Aug 2025, The Panaceas for Improving Low-Rank Decomposition in Communication-Efficient Federated Learning, https://arxiv.org/abs/2505.23176
Sergey Skorik, Vladislav Dorofeev, Gleb Molodtsov, Aram Avetisyan, Dmitry Bylinkin, Daniil Medyakov, Aleksandr Beznosikov, 19 Aug 2025, Communication-Efficient Federated Learning with Adaptive Number of Participants, https://arxiv.org/abs/2508.13803
Xiaoxing Ren, Nicola Bastianello, Karl H. Johansson, Thomas Parisini, 21 Aug 2025, Jointly Computation- and Communication-Efficient Distributed Learning, https://arxiv.org/abs/2508.15509
Arefin Niam, Tevfik Kosar and M S Q Zulkar Nine, 5 Sep 2025, RapidGNN: Energy and Communication-Efficient Distributed Training on Large-Scale Graph Neural Networks, https://arxiv.org/abs/2509.05207
Ferdous Pervej, Richeng Jin, Md Moin Uddin Chowdhury, Simran Singh, İsmail Güvenç, Huaiyu Dai, 27 Aug 2025, Computation- and Communication-Efficient Online FL for Resource-Constrained Aerial Vehicles, https://arxiv.org/abs/2506.02972
Kaoru Otsuka, Yuki Takezawa, Makoto Yamada, 3 Sep 2025, Delayed Momentum Aggregation: Communication-efficient Byzantine-robust Federated Learning with Partial Participation, https://arxiv.org/abs/2509.02970
Simon Raviv and Markus Fischer, Sep 10, 2025, Maximizing Low-Latency Networking Performance for Financial Services with NVIDIA Rivermax and NEIO FastSocket, https://developer.nvidia.com/blog/maximizing-low-latency-networking-performance-for-financial-services-with-nvidia-rivermax-and-neio-fastsocket/
Mohammad Hasan Narimani and Mostafa Tavassolipour, 12 Sep 2025, FedRP: A Communication-Efficient Approach for Differentially Private Federated Learning Using Random Projection, https://arxiv.org/abs/2509.10041
Shiwei Li, Qunwei Li, Haozhao Wang, Ruixuan Li, Jianbin Lin, Wenliang Zhong, 12 Sep 2025, FedBiF: Communication-Efficient Federated Learning via Bits Freezing, https://arxiv.org/abs/2509.10161
Xiumei Deng, Jun Li, Kang Wei, Long Shi, Zehui Xiong, Ming Ding, Wen Chen, Shi Jin, and H. Vincent Poor, 19 Sep 2025, Towards Communication-efficient Federated Learning via Sparse and Aligned Adaptive Optimization, https://arxiv.org/abs/2405.17932
Jaiprakash Nagar, Zheng Chen, Marios Kountouris, Photios A. Stavrou, 23 Jul 2025, Information Entropy-Based Scheduling for Communication-Efficient Decentralized Learning, https://arxiv.org/abs/2507.17426
Chih Wei Ling, Chun Hei Michael Shiu, Youqi Wu, Jiande Sun, Cheuk Ting Li, Linqi Song, Weitao Xu, 18 Sep 2025, Communication-Efficient and Privacy-Adaptable Mechanism for Federated Learning, https://arxiv.org/abs/2501.12046
Kai Yi, Georg Meinhardt, Laurent Condat, Peter Richtárik, 10 Sep 2025, FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized Models, https://arxiv.org/abs/2403.09904
Le-Tuan Nguyen, Minh-Duong Nguyen, Seon-Geun Jeong, Dung D. Le, Quoc-Viet Pham, 2 Oct 2025, Communication-Efficient and Accurate Approach for Aggregation in Federated Low-Rank Adaptation, https://arxiv.org/abs/2509.26399
Ludi Li, Junbin Mao, Hanhe Lin, Xu Tian, Fang-Xiang Wu, Jin Liu, 20 Oct 2025, CEPerFed: Communication-Efficient Personalized Federated Learning for Multi-Pulse MRI Classification, https://arxiv.org/abs/2510.17584
Yijia Fan, Jusheng Zhang, Jing Yang, Keze Wang, 26 Oct 2025, Agent-GSPO: Communication-Efficient Multi-Agent Systems via Group Sequence Policy Optimization, https://arxiv.org/abs/2510.22477
Mounssif Krouka and Mehdi Bennis, 26 Sep 2025, Communication-Efficient and Interoperable Distributed Learning, https://arxiv.org/abs/2509.22823
Yuanfei Wang, Xinju Huang, Fangwei Zhong, Yaodong Yang, Yizhou Wang, Yuanpei Chen, Hao Dong, 27 Sep 2025, Communication-Efficient Desire Alignment for Embodied Agent-Human Adaptation, https://arxiv.org/abs/2505.22503
Kunyun Wang, Bohan Li, Kai Yu, Minyi Guo, Jieru Zhao, 13 Oct 2025, Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism, https://arxiv.org/abs/2505.14741
Xiaoxing Ren, Nicola Bastianello, Thomas Parisini, Andreas A. Malikopoulos, 22 Oct 2025, A Communication-Efficient Decentralized Actor-Critic Algorithm, https://arxiv.org/abs/2510.19199
Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, Panpan Huang, 26 Feb 2026 (v2), DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference, https://arxiv.org/abs/2602.21548
Gyana Swain, Mar 18, 2026, Microsoft’s laser-free cable tech promises to slash AI data center networking power bills, https://www.networkworld.com/article/4146960/microsofts-laser-free-cable-tech-promises-to-slash-ai-data-center-networking-power-bills.html
Siyuan Mu, Sen Lin, 24 Jan 2026 (v4), A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications https://arxiv.org/abs/2503.07137
David Spuler, Ph.D., March 1st 2026 (updated), List of 600+ Low-Latency C++ Techniques, Aussie AI Blog, https://www.aussieai.com/blog/list-low-latency-techniques
David Spuler, Ph.D., September 22, 2025, List of CUDA C++ Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/list-cuda-optimization-techniques
David Spuler, Michael Sharpe, June 2025, RAG Deployment, Chapter 12, "RAG Optimization: Accurate and Efficient LLM Applications", https://www.aussieai.com/book/rag-book-12-rag-deployment
Seonghee Lee, Moein Khazraee, Timothy Stamler, Adit Ranadive and Chris Hoge, Mar 09, 2026, Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library, https://developer.nvidia.com/blog/enhancing-distributed-inference-performance-with-the-nvidia-inference-transfer-library/
Mohammad Siavashi, Mariano Scazzariello, Gerald Q. Maguire Jr., Dejan Kostić, Marco Chiesa, 8 Apr 2026, Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC, https://arxiv.org/abs/2604.07609
David Spuler, Michael Sharpe, 2025, Deployment Architecture, Chapter 18, in "Generative AI Applications", https://www.aussieai.com/book/ai-apps-book-18-deployment-architecture

AI Tech Stack

Research on AI tech stacks:

Stan Gibson, 03 Jun 2024, Getting infrastructure right for generative AI, CIO, https://www.cio.com/article/2128440/getting-infrastructure-right-for-generative-ai.html
Matt Murphy, Tim Tully, Grace Ge, Derek Xiao, Katie Keller, January 18, 2024, The Modern AI Stack: Design Principles for the Future of Enterprise AI Architectures, https://menlovc.com/perspective/the-modern-ai-stack-design-principles-for-the-future-of-enterprise-ai-architectures/?tpcc=NL_Marketing
David Spuler, March 2024, Chapter 5. Design Choices & Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
MongoDB, Jun 20, 2024, Understanding the AI Stack In the Era of Generative AI: Exploring the Layers and Components of Today’s AI Applications https://medium.com/mongodb/understanding-the-ai-stack-in-the-era-of-generative-ai-f1fcd66e1393
Akash Bajwa and Chia Jeng Yang, May 27, 2024, The RAG Stack: Featuring Knowledge Graphs: Reducing Hallucinations To Make LLMs Production-Grade With Complex RAG, https://akashbajwa.substack.com/p/the-rag-stack-featuring-knowledge
Melissa Malec, June 5, 2024, AI Orchestration Explained: The What, Why & How for 2024, https://hatchworks.com/blog/gen-ai/ai-orchestration/
Artem Shelamanov, Jun 30, 2024, Tech Stack For Production-Ready LLM Applications In 2024, https://python.plainenglish.io/tech-stack-for-production-ready-llm-applications-in-2024-5eb14105d1b4
David Spuler, March 2024, AI Tech Stack, in Generative AI in C++, https://www.aussieai.com/book/ch5-ai-tech-stack
Cobus Greyling, Sep 2024, An AI Agent Architecture & Framework Is Emerging, https://cobusgreyling.medium.com/an-ai-agent-architecture-framework-is-emerging-addae3804f23
Brandon Royal, Sam Stoelinga, 2024, Scaling and Optimizing Your LLM Pipeline for End-to-End Efficiency, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62006/
Michael Nuñez, September 25, 2024, AI for all: Meta’s ‘Llama Stack’ promises to simplify enterprise adoption, https://venturebeat.com/ai/ai-for-all-meta-llama-stack-promises-to-simplify-enterprise-ai-adoption/
Matt Marshall, October 24, 2024, The enterprise verdict on AI models: Why open source will win, https://venturebeat.com/ai/the-enterprise-verdict-on-ai-models-why-open-source-will-win/
Letta, November 14, 2024, The AI agents stack, https://www.letta.com/blog/ai-agents-stack
Narcisa Guran, Florian Knauf, Man Ngo, Stefan Petrescu, Jan S. Rellermeyer, 21 Nov 2024, Towards a Middleware for Large Language Models, https://arxiv.org/abs/2411.14513
Meta, July 2024, RFC-0001 - Llama Stack #6, https://github.com/meta-llama/llama-toolchain/issues/6 (Meta's request for comment on its "Llama stack" for AI.)
Tiernan Ray, Dec. 3, 2024, Enterprises are struggling with what to do with Gen AI, say venture capitalists: Despite some uncertainty, enterprise investments in applications soared eight-fold in 2024, with spending on AI-generated code leading the way, https://www.zdnet.com/article/enterprises-are-struggling-with-what-to-do-with-gen-ai-say-venture-capitalists/ (Growing usage but some confusion. Dominant use cases are coding, support chatbots, enterprise search, and meeting summaries.)
Phoebe Lee and Kristina Joos, Jan 25, 2024, Advancing Production AI with NVIDIA AI Enterprise, https://developer.nvidia.com/blog/advancing-production-ai-with-nvidia-ai-enterprise/ ("... advances in NVIDIA AI software deliver up to 54% performance gains without a hardware upgrade...")
Sehoon Kim, Oct 2024, Full Stack Approach for Efficient Deep Learning Inference, Doctor of Philosophy, Computer Science, University of California, Berkeley, https://escholarship.org/content/qt4wf834q8/qt4wf834q8.pdf
Tanay Jaipuria, May 21, 2025, Infrastructure in the Age of AI Gatekeepers: What happens when AI agents choose your stack, https://www.tanayj.com/p/infrastructure-in-the-age-of-ai-gatekeepers
Apple, June 2025, Updates to Apple's On-Device and Server Foundation Language Models, https://machinelearning.apple.com/research/apple-foundation-models-2025-updates (Apple's 3B on-device model with cloud server alternative. The on-device architecture includes 2-bit quantization, 4-bit embeddings quantization, 8-bit KV quantization, a unique KV cache compression, interleaved local-global attention and multi-LoRA.)
Tomasz Tunguz, Jul 17, 2025, Hidden Technical Debt in AI, https://tomtunguz.com/hidden-technical-debt-in-ai/
Pengcheng Hou, Tao Wang, Daniel Cerkoney, Xiansheng Cai, Zhiyi Li, Youjin Deng, Lei Wang, and Kun Chen, 18 Jul 2025, An AI-powered Technology Stack for Solving Many-Electron Field Theory, https://arxiv.org/abs/2403.18840
Character.AI, Jan 13, 2026, Technical Deep Dive: How DigitalOcean and AMD Delivered a 2x Production Inference Performance Increase for Character.ai, https://blog.character.ai/technical-deep-dive-how-digitalocean-and-amd-delivered-a-2x-production-inference-performance-increase-for-character-ai/
David Spuler, Ph.D., December 9th, 2024, Humans are the Top Layer of the AI Stack, Aussie AI Blog, https://www.aussieai.com/blog/human-top-layer
David Spuler, Ph.D., December 9th, 2024, Reasoning is the New AI Middleware, Aussie AI Blog, https://www.aussieai.com/blog/reasoning-middleware
David Spuler, Ph.D., December 9th, 2024, The AI Application Layer, Aussie AI Blog, https://www.aussieai.com/blog/application-layer
Ishan Dhanani and Matej Kosec, Apr 17, 2026, Full-Stack Optimizations for Agentic Inference with NVIDIA Dynamo, https://developer.nvidia.com/blog/full-stack-optimizations-for-agentic-inference-with-nvidia-dynamo/
David Spuler, Michael Sharpe, 2025, Architectures of AI Projects, Chapter 12, in "Generative AI Applications", https://www.aussieai.com/book/ai-apps-book-12-architectures-project
David Spuler, Michael Sharpe, 2025, Deployment Architecture, Chapter 18, in "Generative AI Applications", https://www.aussieai.com/book/ai-apps-book-18-deployment-architecture
David Spuler, March 2024, Chapter 5. Design Choices & Architectures, in book "Generative AI in C++", https://www.aussieai.com/book/ch5-design-architectures
David Spuler, March 2024, Generative AI in C++: Coding Transformers and LLMs, https://www.aussieai.com/book/toc PDF: https://www.aussieai.com/pdf/BOOK-Generative-AI-CPP-Spuler-2024.pdf