Aussie AI

Ensemble Multi-Model AI

  • Last Updated 30 August, 2025
  • by David Spuler, Ph.D.

If one AI engine is amazing, imagine what two could do. Or ten. Or a hundred.

The idea of using two or more AI engines together to complete a task is not new. This area of research is known as "ensemble learning," and the resulting systems are sometimes called "multi-model" engines.

There are many ways that AI engines could cooperate to achieve more than one would alone. This is an area ripe for exploration, where we have only scratched the surface of possibilities. On the other hand, with today's high cost of GPUs limiting what can be done in both AI inference and training, the full realization of ensemble AI algorithms is still in the distant future.

One way in which two AI models work together has become common in practice: using the output of one model as text for the training data set of a new model. This has been an effective technique for improving downstream models, but it isn't usually classed as an ensemble algorithm, although it has been examined in the research literature (e.g., Honovich et al., 2022). The idea is similar to Knowledge Distillation, but differs in that the goal isn't to create a cut-down smaller model, but usually to improve the accuracy of a large model.

Existing Ensemble AI Algorithms

Despite the costs, several types of ensemble models already have a substantial body of research behind them, and some have a significant number of practical use cases.

  • Generative Adversarial Networks (GANs). These use two AI models against each other, one being critical of the output of the other, to improve overall results. This is an architecture that has proven very successful in practice.
  • Knowledge Distillation (KD). This is an optimization method whereby a large model is built first, and then it is used to "distill" its knowledge into a smaller model, as a "teacher" model to a "student" model. Two models are used in training, but only the smaller model is used for inference. Note that there are also various more advanced forms of "ensemble distillation" (multi-teacher) that involve two or more teachers, plus the student model. See knowledge distillation.
  • Cascade inference optimizations. Cascade optimizations involve the selection of different models, or paths through multiple models, as a type of inference optimization. Two or more models are used in the inference phase, and there are various methods for deciding at runtime which to choose. See cascades.
  • Big-small models. In a specific subtype of the cascade method called big-small models, two models are trained differently. During inference, a heuristic decides which model to invoke: a faster "small" model handles the common cases, and the slower "big" model is invoked only for the rarer, harder cases. This can improve inference latency and total throughput.
  • Speculative decoding. This method is similar to the big-little architecture, but differs because all queries initially go to the small model. The small, faster model does its best to suggest output tokens (i.e., it "speculates"), and then a bigger model is used to "verify" their correctness. If the bigger model has to override the smaller model, the process is slower, but usually the smaller model is accurate enough for the whole process to be faster on average, with accuracy close to using the larger model alone. Read more about speculative decoding.
  • Collaborative inference. This is where two or more models combine to complete inference. They could be on separate servers, or together on one server. See collaborative inference research.
  • Multi-model training. There are various other methods that use an ensemble technique to train better models, including bagging, boosting, stacking, and many variants. There is also the simple training method of using data sets derived from the output of some other model.
  • Multiple parallel models inference. In some architectures, multiple models process the same input data in parallel. Each model produces its own results, and the ensemble then has to choose an overall result. The algorithm for deciding amongst the multiple options could be maximum (or minimum), majority (counting) voting, weighted averages, or many other combinations (see the sketch after this list).
  • Hybrid dual transformer architectures. Rather than duplicating entire Transformer models, there has been research on adding extra components to the basic Transformer architecture, such as two heads or two encoders merged together. See Transformer architectures.
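As a concrete illustration of the parallel multi-model idea, here is a minimal Python sketch of running several models on the same input and combining their answers by majority or weighted voting. The model objects and their classify() method are hypothetical placeholders, not a real library API; a production system would call real model endpoints.

    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def ensemble_classify(models, input_text, weights=None):
        """Run every model on the same input in parallel and combine their answers."""
        with ThreadPoolExecutor() as pool:
            predictions = list(pool.map(lambda m: m.classify(input_text), models))

        if weights is None:
            # Majority (counting) vote: the most common prediction wins.
            return Counter(predictions).most_common(1)[0][0]

        # Weighted vote: accumulate each model's weight behind its prediction.
        scores = {}
        for pred, weight in zip(predictions, weights):
            scores[pred] = scores.get(pred, 0.0) + weight
        return max(scores, key=scores.get)

The aggregation rule is the main design choice here: simple counting treats all models equally, whereas weights can reflect each model's known accuracy on the task.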

Model Selection Algorithms

Model selection algorithms are dynamic inference optimizations where a choice is made between two or more models for execution. One example is "big-little" architectures (see below), where a heuristic attempts to send "easy" queries to a faster "little" model. Various other ensemble architectures are possible with multiple models. Another practical example is model selection at the deployment level, where the decision is which server to send a request to, and each server may host different models or instances of the same model. Other areas of research with similar aims include cascades and collaborative inference.

Model selection architectures are a general class of ensemble architectures where one of two or more models is "selected" to process a query. The general idea is that only one model actually processes the query, with a decision mechanism beforehand (which can be model-based or heuristic-based). Example sub-classes of model selection include big-little architectures and model routing, covered in the sections below.
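A minimal Python sketch of such a decision mechanism is shown below: a lightweight scorer estimates query difficulty and one model from a tiered set is selected to run. The scorer, the tier thresholds, and the model handles (with a generate() method) are hypothetical placeholders; a simple heuristic such as query length could stand in for the scorer.

    def select_and_run(query, scorer, tiers):
        # tiers is a list of (max_difficulty, model) pairs in ascending order,
        # e.g. [(0.3, tiny_model), (0.7, small_model), (1.0, big_model)].
        difficulty = scorer(query)   # estimated difficulty in [0, 1]
        for max_difficulty, model in tiers:
            if difficulty <= max_difficulty:
                return model.generate(query)   # only the selected model runs
        return tiers[-1][1].generate(query)    # fallback: most capable model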

Research on model selection architectures:

  • Bodun Hu, Le Xu, Jeongyoon Moon, Neeraja J. Yadwadkar, Aditya Akella, 27 Oct 2023, MOSEL: Inference Serving Using Dynamic Modality Selection, https://arxiv.org/abs/2310.18481 (Multi-modal model with dynamic selection of modality.)
  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
  • Can Wang, Bolin Zhang, Dianbo Sui, Zhiying Tu, Xiaoyu Liu, Jiabao Kang, 1 Mar 2024 (v2), A Survey on Effective Invocation Methods of Massive LLM Services, https://arxiv.org/abs/2402.03408 (Deployment of LLMs as LLM-as-a-Service or LLMaaS architectures including prompt compression, semantic caching and model selection based on scoring inputs.)
  • Yuyi Mao, Xianghao Yu, Kaibin Huang, Ying-Jun Angela Zhang, Jun Zhang, Dec 2023, Green Edge AI: A Contemporary Survey, https://arxiv.org/abs/2312.00333
  • David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith, 2 Jul 2024, Revisiting Cascaded Ensembles for Efficient Inference https://arxiv.org/abs/2407.02348
  • Ziheng Wang, Pedro Reviriego, Farzad Niknia, Javier Conde, Shanshan Liu, Fabrizio Lombardi, 26 Aug 2024, Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things, https://arxiv.org/abs/2408.14528 (Running a small quantized model and then determining whether to run the full non-quantized model.)
  • Sean Michael Kerner, September 17, 2024, Model routing: The secret weapon for maximizing AI efficiency in enterprises, https://venturebeat.com/ai/why-accenture-and-martian-see-model-routing-as-key-to-enterprise-ai-success/
  • Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica, 21 Jul 2024 (v3), RouteLLM: Learning to Route LLMs with Preference Data, https://arxiv.org/abs/2406.18665
  • Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah, 22 Apr 2024, Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, ICLR 2024, https://arxiv.org/abs/2404.14618
  • Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, Fahad Dogar, 4 Oct 2024, LLMProxy: Reducing Cost to Access Large Language Models, https://arxiv.org/abs/2410.11857 (Deploying a proxy between user and LLM, with handling of conversational history context and caching.)
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, Chaoyang He, 23 Oct 2024 (v3), TensorOpera Router: A Multi-Model Router for Efficient LLM Inference, https://arxiv.org/abs/2408.12320
  • Zesen Zhao, Shuowei Jin, Z. Morley Mao, 23 Sep 2024, Eagle: Efficient Training-Free Router for Multi-LLM Inference, https://arxiv.org/abs/2409.15518
  • Tao Feng, Yanzhen Shen, Jiaxuan You, 4 Oct 2024, GraphRouter: A Graph-based Router for LLM Selections, https://arxiv.org/abs/2410.03834 https://github.com/ulab-uiuc/GraphRouter
  • Kaushal Kumar Maurya, KV Aditya Srivatsa, Ekaterina Kochmar, 16 Aug 2024, SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models, https://arxiv.org/abs/2408.08545
  • Quang H. Nguyen, Duy C. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V. Chawla, Khoa D. Doan, 24 Jul 2024 (v2), MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs, https://arxiv.org/abs/2407.10834
  • Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou, 15 Nov 2023, Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models, https://arxiv.org/abs/2311.08692
  • Małgorzata Łazuka, Andreea Anghel, Thomas Parnell, 3 Oct 2024, LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services, https://arxiv.org/abs/2410.02425
  • Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam, 28 Jun 2024 (v4), AutoMix: Automatically Mixing Language Models, https://arxiv.org/abs/2310.12963
  • Josef Pichlmeier, Philipp Ross, Andre Luckow, 8 Oct 2024 (v2), Performance Characterization of Expert Router for Scalable LLM Inference, https://arxiv.org/abs/2404.15153
  • Ou, Anthony C., Feb 2024, Large Language Model Routing with Benchmark Datasets, Master's Thesis, Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, https://dspace.mit.edu/handle/1721.1/153846
  • KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar, 1 May 2024, Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing, https://arxiv.org/abs/2405.00467
  • David Farr, Nico Manzonelli, Iain Cruickshank, Kate Starbird, Jevin West, 16 Oct 2024, LLM Chain Ensembles for Scalable and Accurate Data Annotation, https://arxiv.org/abs/2410.13006
  • Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C.S. Lui, 2 Oct 2024 (v2), Cost-Effective Online Multi-LLM Selection with Versatile Reward Models, https://arxiv.org/abs/2405.16587
  • Grant Wilkins, Srinivasan Keshav, Richard Mortier, 4 Jul 2024, Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems, https://arxiv.org/abs/2407.04014
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
  • Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica, 20 Feb 2025, Optimizing Model Selection for Compound AI Systems, https://arxiv.org/abs/2502.14815
  • Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, Philip S. Yu, 25 Feb 2025, Harnessing Multiple Large Language Models: A Survey on LLM Ensemble, https://arxiv.org/abs/2502.18036 https://github.com/junchenzhi/Awesome-LLM-Ensemble
  • Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen, 9 Feb 2025, MixLLM: Dynamic Routing in Mixed Large Language Models, https://arxiv.org/abs/2502.18482
  • Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng, 27 Mar 2025, A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond, https://arxiv.org/abs/2503.21614
  • Avinash Kumar, Shashank Nag, Jason Clemons, Lizy John, Poulami Das, 14 Apr 2025, HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving, https://arxiv.org/abs/2504.10724
  • Jianfei Li, Kevin Kam Fung Yuen, 18 Jul 2025, CPC-CMS: Cognitive Pairwise Comparison Classification Model Selection Framework for Document-level Sentiment Analysis, https://arxiv.org/abs/2507.14022
  • Judy Long, Tao Liu, Sean Alexander Woznicki, Miljana Marković, Oskar Marko, Molly Sears, 10 Aug 2025, From Time-series Generation, Model Selection to Transfer Learning: A Comparative Review of Pixel-wise Approaches for Large-scale Crop Mapping, https://arxiv.org/abs/2507.12590
  • Justin Kay, Grant Van Horn, Subhransu Maji, Daniel Sheldon, and Sara Beery, 31 Jul 2025, Consensus-Driven Active Model Selection, https://arxiv.org/abs/2507.23771
  • Lorenzo Volpi, Alejandro Moreo, Fabrizio Sebastiani, 30 Jul 2025, Transductive Model Selection under Prior Probability Shift, https://arxiv.org/abs/2507.22647
  • Basile Lewandowski, Robert Birke, Lydia Y. Chen, 14 Aug 2025, Match & Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models, https://arxiv.org/abs/2508.10993
  • Andrea Napoli, Paul White, 17 Aug 2025, Clustering-Based Validation Splits for Model Selection under Domain Shift, https://arxiv.org/abs/2405.19461
  • Chongyu Qu, Allen J. Luna, Thomas Z. Li, Junchao Zhu, Junlin Guo, Juming Xiong, Kim L. Sandler, Bennett A. Landman, Yuankai Huo, 20 Aug 2025, Cohort-Aware Agents for Individualized Lung Cancer Risk Prediction Using a Retrieval-Augmented Model Selection Framework, https://arxiv.org/abs/2508.14940
  • Jialiang Wang, Hanmo Liu, Shimin Di, Zhili Wang, Jiachuan Wang, Lei Chen, Xiaofang Zhou, 21 Jul 2025, Beyond Model Base Selection: Weaving Knowledge to Master Fine-grained Neural Network Design, https://arxiv.org/abs/2507.15336
  • Prateek Chanda, Saral Sureka, Parth Pratim Chatterjee, Krishnateja Killamsetty, Nikhil Shivakumar Nayak, Ganesh Ramakrishnan, 7 Aug 2025, Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning, https://arxiv.org/abs/2507.12612
  • Bohan Yang, Gang Liu, Yang Zhong, Rirao Dao, Yujia Qian, Ke Shi, Anke Tang, Yong Luo, Qi Kong, Jingnan Liu, 7 Aug 2025, Unsupervised deep learning model for fast energy layer pre-selection of delivery-efficient proton arc therapy plan optimization of nasopharyngeal carcinoma, https://arxiv.org/abs/2506.15803
  • Chenghui Zheng, Garvesh Raskutti, 19 Aug 2025, Comparing Model-agnostic Feature Selection Methods through Relative Efficiency, https://arxiv.org/abs/2508.14268

Model Routing

Model routing is a generalization of model selection that also accounts for issues such as serving and network costs. The idea is to select a model from a set (i.e., model selection), and then route the query over the network to wherever that model is being served. This could include choosing between commercial and open-source models, and many variations therein.
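A minimal sketch of a cost-aware routing rule is shown below, assuming a hypothetical table of served model endpoints with illustrative cost and quality numbers; real routers (such as those in the papers below) typically learn the routing decision from data rather than using a fixed rule.

    ENDPOINTS = [
        # (name, URL, relative cost, expected quality) -- illustrative values only
        ("small-open-model", "https://edge.example.com/v1/generate", 0.1, 0.7),
        ("large-commercial", "https://api.example.com/v1/generate", 1.0, 0.95),
    ]

    def choose_endpoint(min_quality):
        # Cheapest endpoint whose expected quality meets the threshold.
        good_enough = [e for e in ENDPOINTS if e[3] >= min_quality]
        if not good_enough:
            # Nothing meets the bar: fall back to the highest-quality endpoint.
            return max(ENDPOINTS, key=lambda e: e[3])
        return min(good_enough, key=lambda e: e[2])

    # Usage: an easy query tolerates lower quality and is routed to the cheap model.
    name, url, cost, quality = choose_endpoint(min_quality=0.6)
    # ...the query is then sent over the network to `url`...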

Research papers on model routing algorithms:

  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
  • Sean Michael Kerner, September 17, 2024, Model routing: The secret weapon for maximizing AI efficiency in enterprises, https://venturebeat.com/ai/why-accenture-and-martian-see-model-routing-as-key-to-enterprise-ai-success/
  • Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica, 21 Jul 2024 (v3), RouteLLM: Learning to Route LLMs with Preference Data, https://arxiv.org/abs/2406.18665
  • Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah, 22 Apr 2024, Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, ICLR 2024, https://arxiv.org/abs/2404.14618
  • Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay, 12 May 2024, MARS: A Benchmark for Multi-LLM Algorithmic Routing System, ICLR 2024, https://openreview.net/forum?id=C0rs3wM0N8 PDF: https://openreview.net/pdf?id=C0rs3wM0N8
  • Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay, 28 Mar 2024 (v2), RouterBench: A Benchmark for Multi-LLM Routing System, https://arxiv.org/abs/2403.12031 https://github.com/withmartian/routerbench
  • Rinon Gal, Adi Haviv, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Gal Chechik, 2 Oct 2024, ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation, https://arxiv.org/abs/2410.01731 https://comfygen-paper.github.io/
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, Chaoyang He, 23 Oct 2024 (v3), TensorOpera Router: A Multi-Model Router for Efficient LLM Inference, https://arxiv.org/abs/2408.12320
  • Zesen Zhao, Shuowei Jin, Z. Morley Mao, 23 Sep 2024, Eagle: Efficient Training-Free Router for Multi-LLM Inference, https://arxiv.org/abs/2409.15518
  • Tao Feng, Yanzhen Shen, Jiaxuan You, 4 Oct 2024, GraphRouter: A Graph-based Router for LLM Selections, https://arxiv.org/abs/2410.03834 https://github.com/ulab-uiuc/GraphRouter
  • Kaushal Kumar Maurya, KV Aditya Srivatsa, Ekaterina Kochmar, 16 Aug 2024, SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models, https://arxiv.org/abs/2408.08545
  • Quang H. Nguyen, Duy C. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V. Chawla, Khoa D. Doan, 24 Jul 2024 (v2), MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs, https://arxiv.org/abs/2407.10834
  • Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou, 15 Nov 2023, Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models, https://arxiv.org/abs/2311.08692
  • Małgorzata Łazuka, Andreea Anghel, Thomas Parnell, 3 Oct 2024, LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services, https://arxiv.org/abs/2410.02425
  • Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam, 28 Jun 2024 (v4), AutoMix: Automatically Mixing Language Models, https://arxiv.org/abs/2310.12963
  • Josef Pichlmeier, Philipp Ross, Andre Luckow, 8 Oct 2024 (v2), Performance Characterization of Expert Router for Scalable LLM Inference, https://arxiv.org/abs/2404.15153
  • Ou, Anthony C., Feb 2024, Large Language Model Routing with Benchmark Datasets, Master's Thesis, Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, https://dspace.mit.edu/handle/1721.1/153846
  • KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar, 1 May 2024, Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing, https://arxiv.org/abs/2405.00467
  • David Farr, Nico Manzonelli, Iain Cruickshank, Kate Starbird, Jevin West, 16 Oct 2024, LLM Chain Ensembles for Scalable and Accurate Data Annotation, https://arxiv.org/abs/2410.13006
  • Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C.S. Lui, 2 Oct 2024 (v2), Cost-Effective Online Multi-LLM Selection with Versatile Reward Models, https://arxiv.org/abs/2405.16587
  • Arun Shankar, Oct 2024, Designing Cognitive Architectures: Agentic Workflow Patterns from Scratch, https://medium.com/google-cloud/designing-cognitive-architectures-agentic-workflow-patterns-from-scratch-63baa74c54bc
  • Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu, 30 Oct 2024, Accelerated AI Inference via Dynamic Execution Methods, https://arxiv.org/abs/2411.00853
  • Kirill Vasilevski, Dayi Lin, Ahmed Hassan, 14 Nov 2024, Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models, https://arxiv.org/abs/2411.09837
  • AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
  • Yuanshuai Wang, Xingjian Zhang, Jinkun Zhao, Siwei Wen, Peilin Feng, Shuhao Liao, Lei Huang, Wenjun Wu, 5 Dec 2024, Bench-CoE: a Framework for Collaboration of Experts from Benchmark, https://arxiv.org/abs/2412.04167 https://github.com/ZhangXJ199/Bench-CoE
  • Dimitrios Sikeridis, Dennis Ramdass, Pranay Pareek, 12 Dec 2024, PickLLM: Context-Aware RL-Assisted Large Language Model Routing, https://arxiv.org/abs/2412.12170
  • Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
  • Avital Shafran, Roei Schuster, Thomas Ristenpart, Vitaly Shmatikov, 3 Jan 2025, Rerouting LLM Routers, https://arxiv.org/abs/2501.01818
  • Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
  • J. Pichlmeier, P. Ross and A. Luckow, "Performance Characterization of Expert Router for Scalable LLM Inference," 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 2024, pp. 1686-1693, doi: 10.1109/BigData62323.2024.10826121. https://ieeexplore.ieee.org/abstract/document/10826121
  • Clovis Varangot-Reille, Christophe Bouvard, Antoine Gourru, Mathieu Ciancone, Marion Schaeffer, François Jacquenet, 4 Feb 2025 (v2), Doing More with Less -- Implementing Routing Strategies in Large Language Model-Based Systems: An Extended Survey, https://arxiv.org/abs/2502.00409
  • Seamus Somerstep, Felipe Maia Polo, Allysson Flavio Melo de Oliveira, Prattyush Mangal, Mírian Silva, Onkar Bhardwaj, Mikhail Yurochkin, Subha Maity, 5 Feb 2025, CARROT: A Cost Aware Rate Optimal Router, https://arxiv.org/abs/2502.03261
  • Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica, 20 Feb 2025, Optimizing Model Selection for Compound AI Systems, https://arxiv.org/abs/2502.14815
  • Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, Philip S. Yu, 25 Feb 2025, Harnessing Multiple Large Language Models: A Survey on LLM Ensemble, https://arxiv.org/abs/2502.18036 https://github.com/junchenzhi/Awesome-LLM-Ensemble
  • Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, Federico Tombari, 26 Feb 2025, I Know What I Don't Know: Improving Model Cascades Through Confidence Tuning, https://arxiv.org/abs/2502.19335
  • Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen, 9 Feb 2025, MixLLM: Dynamic Routing in Mixed Large Language Models, https://arxiv.org/abs/2502.18482
  • W Zhang, X Ren, Mar 2025, ReM: Sparsify and MoEfy Models with Post-Hoc ReLU Modulation, ICLR 2025 review, https://openreview.net/pdf?id=cizhOu3CZa (Induce activation sparsity for MoE choice in the model router.)
  • Tella Rajashekhar Reddy, Palak, Rohan Gandhi, Anjaly Parayil, Chaojie Zhang, Mike Shepperd, Liangcheng Yu, Jayashree Mohan, Srinivasan Iyengar, Shivkumar Kalyanaraman, Debopam Bhattacherjee, 15 May 2025, AI Greenferencing: Routing AI Inferencing to Green Modular Data Centers with Heron, https://arxiv.org/abs/2505.09989
  • Ben Dickson, July 7, 2025, New 1.5B router model achieves 93% accuracy without costly retraining, https://venturebeat.com/ai/new-1-5b-router-model-achieves-93-accuracy-without-costly-retraining/
  • Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, Sanjiv Kumar, 22 Jul 2025, Universal Model Routing for Efficient LLM Inference, https://arxiv.org/abs/2502.08773
  • Linjiang Cao, Maonan Wang and Xi Xiong, 21 Jul 2025, A Large Language Model-Enhanced Q-learning for Capacitated Vehicle Routing Problem with Time Windows, https://arxiv.org/abs/2505.06178
  • Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Alessandro Sordoni, 9 Aug 2025, A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning, https://arxiv.org/abs/2408.07057
  • Xin He, Junxi Shen, Zhenheng Tang, Xiaowen Chu, Bo Li, Ivor W. Tsang, Yew-Soon Ong, 3 Aug 2025, RouteMark: A Fingerprint for Intellectual Property Attribution in Routing-based Model Merging, https://arxiv.org/abs/2508.01784
  • Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux, and Daniele Vigo, 12 Aug 2025, Hybrid Node-Destroyer Model with Large Neighborhood Search for Solving the Capacitated Vehicle Routing Problem, https://arxiv.org/abs/2508.08659
  • Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux, and Daniele Vigo, 12 Aug 2025, Edge-Selector Model Applied for Local Search Neighborhood for Solving Vehicle Routing Problems, https://arxiv.org/abs/2508.14071
  • Zewei Xin, Qinya Li, Chaoyue Niu, Fan Wu, Guihai Chen, 21 Aug 2025, Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model, https://arxiv.org/abs/2411.13787

Big-Little Transformer Models

Although many ensemble architectures are about doing even more computations to achieve even more advanced capabilities, the idea of big-little or big-small architectures is to improve inference speed and throughput by sending common queries to a smaller model. The larger model is reserved for more difficult or rarer queries which take longer. As such, it's an AI version of the "common case first" code optimization technique.

Note that "collaborative inference" (e.g. "parallel decoding" or "speculative decoding") is also conceptually a similar architecture, but differs because multiple models work together for inference, whereas pure big-little architectures choose the model at the start, and only one model does the inference. Also related are the various non-autoregressive architectures.

Research papers on big-little architectures:

  • Kim, S., Mangalam, K., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K., Big little transformer decoder, arXiv preprint arXiv:2302.07863, May 2023, https://arxiv.org/abs/2302.07863
  • Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J., Feb 2023, Accelerating large language model decoding with speculative sampling, arXiv preprint arXiv:2302.01318, https://arxiv.org/abs/2302.01318
  • Leviathan, Y., Kalman, M., and Matias, Y., Fast inference from transformers via speculative decoding, May 2023, https://arxiv.org/abs/2211.17192
  • Stern, M., Shazeer, N., and Uszkoreit, J., Nov 2018, Blockwise parallel decoding for deep autoregressive models, Advances in Neural Information Processing Systems, 31, https://arxiv.org/abs/1811.03115
  • Z. Peng et al. 2018. AXNet: ApproXimate computing using an end-to-end trainable neural network. 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) https://ieeexplore.ieee.org/document/8605388 (Ensemble dual-model method where one model is a fast approximation of the other.)
  • Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y (See Chapter 10, "Input Hardness Adaptive Models" for methods of running faster on easy image classification problems.)
  • Nan, F. and Saligrama, V., 2017. Dynamic model selection for prediction under a budget. arXiv preprint arXiv:1704.07505. https://arxiv.org/abs/1704.07505
  • Park, E., Kim, D., Kim, S., Kim, Y.-D., Kim, G., Yoon, S., and Yoo, S. (2015). Big/little deep neural network for ultra low power inference. In 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 124–132. https://ieeexplore.ieee.org/document/7331375
  • D Xu, W Yin, X Jin, Y Zhang, S Wei, M Xu, X Liu, Sep 2023, LLMCad: Fast and Scalable On-device Large Language Model Inference, arXiv preprint arXiv:2309.04255, https://arxiv.org/pdf/2309.04255.pdf (Keeps a smaller model in memory, improving speed and reducing memory utilization.)
  • Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. Tabi: An efficient multi-level inference system for large language models. In Proceedings of the Eighteenth European Conference on Computer Systems, pages 233–248, 2023. https://dl.acm.org/doi/10.1145/3552326.3587438, PDF: https://yidingwang.xyz/public/files/tabi_eurosys23.pdf (Has multiple models, some big, some small, with characteristics similar to ensembles, big-little, and cascades.)
  • H Malard, S Zaiem, R Algayres, 2023, Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences, arXiv preprint arXiv:2309.12712, https://arxiv.org/pdf/2309.12712.pdf (Big-little architecture for audio models.)
  • S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a "shallow-deep module" and parallel decoding.)
  • Kaya Y., Hong S., Dumitras T., Shallow-deep networks: Understanding and mitigating network overthinking Proceedings of the international conference on machine learning, ICML (2019), pp. 3301-3310, https://arxiv.org/abs/1810.07052 (Shallow-deep method in a single model.)
  • Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 26 Mar 2024, Tiny Models are the Computational Saver for Large Models, https://arxiv.org/abs/2403.17726v1 (Choose tiny or small models after an initial layer of the larger model, combining early exit with easy-hard queries for multi-model inference.)
  • Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi, 26 Feb 2024, Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding, https://arxiv.org/abs/2402.16844 (Using a large model to train parallel decoding for a small language model.)
  • Chia-Hsuan Lee, Hao Cheng, Mari Ostendorf, Nov 2023, OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking, https://arxiv.org/abs/2311.09758
  • Zichao Shen, Neil Howard and Jose Nunez-Yanez, 2022, Big–Little Adaptive Neural Networks on Low-Power Near-Subthreshold Processors, J. Low Power Electron. Appl. 2022, 12(2), 28, https://doi.org/10.3390/jlpea12020028 https://www.mdpi.com/2079-9268/12/2/28 Code: https://github.com/DarkSZChao/Big-Little_NN_Strategies
  • David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Kaiyan Zhang, Jianyu Wang, Ning Ding, Biqing Qi, Ermo Hua, Xingtai Lv, Bowen Zhou, 18 Jun 2024, Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding, https://arxiv.org/abs/2406.12295 Code: https://github.com/TsinghuaC3I/FS-GEN
  • Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, Ting Cao, June 2024, Hybrid SLM and LLM for Edge-Cloud Collaborative Inference, EdgeFM ’24, June 3–7, 2024, Minato-ku, Tokyo, Japan, https://dl.acm.org/doi/pdf/10.1145/3662006.3662067 (Small model on edge devices with large model in the cloud, performing collaborative inference.)
  • Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, Mike Lewis, 10 Jul 2023 (v2), Contrastive Decoding: Open-ended Text Generation as Optimization, https://arxiv.org/abs/2210.15097
  • Hyunjong Ok, Jegwang Ryu, Jaeho Lee, 26 Jun 2024, Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher, https://arxiv.org/abs/2406.18002 (Examines the idea of not using the larger model to always verify, and when to trust either the smaller or larger models, which is an idea that generalized beyond speculative decoding.)
  • Aishwarya P S, Pranav Ajit Nair, Yashas Samaga B L, Toby James Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, July 2024, Tandem Transformers for Inference Efficient LLMs, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42906-42917, 2024, https://proceedings.mlr.press/v235/s24a.html
  • Ziheng Wang, Pedro Reviriego, Farzad Niknia, Javier Conde, Shanshan Liu, Fabrizio Lombardi, 26 Aug 2024, Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things, https://arxiv.org/abs/2408.14528 (Running a small quantized model and then determining whether to run the full non-quantized model.)
  • Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah, 22 Apr 2024, Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, ICLR 2024, https://arxiv.org/abs/2404.14618
  • J. Niu, W. Zhang, C. J. Xue and N. Guan, 2024, "RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices," 2024 IEEE 30th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Sokcho, Korea, Republic of, 2024, pp. 21-30, doi: 10.1109/RTCSA62462.2024.00013. https://ieeexplore.ieee.org/abstract/document/10695719
  • Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang, 4 Oct 2024, Mixture of Attentions For Speculative Decoding, https://arxiv.org/abs/2410.03804
  • He Guo, Yulong Wang, Zixuan Ye, Jifeng Dai, Yuwen Xiong, 14 Oct 2024, big.LITTLE Vision Transformer for Efficient Visual Recognition, https://arxiv.org/abs/2410.10267
  • Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, Moin Nabi, 10 Oct 2024, KV Prediction for Improved Time to First Token, https://arxiv.org/abs/2410.08391 https://github.com/apple/corenet/tree/main/projects/kv-prediction (Small model creates an approximation of the KV cache for use by a larger model.)
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Sehoon Kim, Oct 2024, Full Stack Approach for Efficient Deep Learning Inference, Doctor of Philosophy, Computer Science, University of California, Berkeley, https://escholarship.org/content/qt4wf834q8/qt4wf834q8.pdf
  • Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, Federico Tombari, 26 Feb 2025, I Know What I Don't Know: Improving Model Cascades Through Confidence Tuning, https://arxiv.org/abs/2502.19335
  • Yang Liu, Bingjie Yan, Tianyuan Zou, Jianqing Zhang, Zixuan Gu, Jianbing Ding, Xidong Wang, Jingyi Li, Xiaozhou Ye, Ye Ouyang, Qiang Yang, Ya-Qin Zhang, 24 Apr 2025, Towards Harnessing the Collaborative Power of Large and Small Models for Domain Tasks, https://arxiv.org/abs/2504.17421

General Research on Ensemble Models

  • Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems. Yoshitomo Matsubara, Luca Soldaini, Eric Lind, Alessandro Moschitti, Dec 2022, https://arxiv.org/abs/2201.05767
  • Yungeng Zhang, Yuru Pei & Hongbin Zha, Learning Dual Transformer Network for Diffeomorphic Registration, Sep 2021, Medical Image Computing and Computer Assisted Intervention, MICCAI 2021, https://link.springer.com/chapter/10.1007/978-3-030-87202-1_13
  • Xian-Feng Han, Yi-Fei Jin, Hui-Xian Cheng, Guo-Qiang Xiao, Dual Transformer for Point Cloud Analysis, Apr 2021, https://arxiv.org/abs/2104.13044
  • Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, Tao Mei, 2023, Dual Vision Transformer, https://arxiv.org/pdf/2207.04976, Code: https://github.com/YehLi/ImageNetModel
  • Mohammed Alhamid, Ensemble Models, March 2021, https://towardsdatascience.com/ensemble-models-5a62d4f4cb0c
  • Oliver R. A. Dunbar, Andrew B. Duncan, Andrew M. Stuart, Marie-Therese Wolfram, Jan 2022, Ensemble Inference Methods for Models With Noisy and Expensive Likelihoods, https://arxiv.org/abs/2104.03384
  • Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou, Revisiting Vision Transformer from the View of Path Ensemble, https://arxiv.org/abs/2308.06548, PDF: https://arxiv.org/pdf/2308.06548.pdf (Treating the internal components of a Transformer as if they are an ensemble model.)
  • T. G. Dietterich. Ensemble methods in machine learning. In Multiple classifier systems, pages 1–15. Springer, 2000, Lecture Notes in Computer Science book series LNCS, volume 1857, https://link.springer.com/chapter/10.1007/3-540-45014-9_1, PDF: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e3b09a777c71a4b88888509ab9bfa12d8bf295ba (Early paper with ensemble idea applied to classifiers, rather than multi-model.)
  • Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q.: Multi-scale dense networks for resource efficient image classification. In: 6th International Conference on Learning Representations, ICLR 2018 (2018). https://doi.org/10.48550/arXiv.1703.09844, https://arxiv.org/abs/1703.09844 (Has multiple models combined in an early-exit configuration.)
  • Y. Matsubara, M. Levorato, and F. Restuccia, “Split computing and early exiting for deep learning applications: Survey and research challenges,” ACM Comput. Surveys, Mar 2022, https://arxiv.org/abs/2103.04505 (Split computing is splitting the inference between server and edge machines.)
  • L. Li, K. Ota and M. Dong, "Deep learning for smart industry: Efficient manufacture inspection system with fog computing", IEEE Trans. Ind. Informat., vol. 14, no. 10, pp. 4665-4673, Oct. 2018. https://ieeexplore.ieee.org/document/8370640 ("Fog computing" is like cloud computing but on servers "nearer" to the ground.)
  • C. Lo, Y.-Y. Su, C.-Y. Lee and S.-C. Chang, "A dynamic deep neural network design for efficient workload allocation in edge computing", Proc. IEEE Int. Conf. Comput. Design (ICCD), pp. 273-280, Nov. 2017. https://ieeexplore.ieee.org/document/8119222
  • G Xu, Z Hao, Y Luo, H Hu, J An, S Mao, 2023, DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices, arXiv preprint arXiv:2309.05015, https://arxiv.org/abs/2309.05015
  • Schwartz, R., Stanovsky, G., Swayamdipta, S., Dodge, J., and Smith, N. A. The right tool for the job: Matching model and instance complexities. In Annual Meeting of the Association for Computational Linguistics, 2020. https://arxiv.org/abs/2004.07453 (Early exit with "wisdom of committees" decisions.)
  • Naftaly, U., N. Intrator, and D. Horn. "Optimal ensemble averaging of neural networks." Network: Computation in Neural Systems 8, no. 3 (1997): 283–296. https://www.tau.ac.il/~horn/publications/optimal.pdf
  • Y. Liu and X. Yao, Ensemble Learning via Negative Correlation, Neural Networks, Volume 12, Issue 10, December 1999, pp. 1399-1404. doi:10.1016/S0893-6080(99)00073-8, https://www.sciencedirect.com/science/article/abs/pii/S0893608099000738
  • Z.S.H. Chan; N. Kasabov, 2005, Fast neural network ensemble learning via negative-correlation data correction, IEEE Transactions on Neural Networks, Volume 16, Issue 6, November 2005, https://ieeexplore.ieee.org/document/1528547
  • E Diao, 2023, Efficient and Collaborative Methods for Distributed Machine Learning, Ph.D. thesis, Department of Electrical and Computer Engineering Duke University, https://www.proquest.com/openview/410ea5eb4275fded25890f04c96a902e/1?pq-origsite=gscholar&cbl=18750&diss=y
  • X Xu, K Yan, S Han, B Wang, X Tao, P Zhang, 2023, Learning-Based Edge-Device Collaborative DNN Inference in IoVT Networks IEEE Internet of Things Journal, https://ieeexplore.ieee.org/abstract/document/10258387
  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, Oct 2023, MatFormer: Nested Transformer for Elastic Inference, https://arxiv.org/abs/2310.07707 (Multiple submodels inside a large model.)
  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 2024, MatFormer: Nested Transformer for Elastic Inference https://openreview.net/pdf?id=93BaEweoRg (A method of training one large model, and then extracting many smaller sub-models from that model, using FFNs with a subset of parameters, which if done statically can be similar to a form of model compression, while elastic inference done dynamically is a type of adaptive inference.)
  • NVIDIA, Aug 2023, Triton Architecture, NVIDIA Triton Inference Server user guide documentation, https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/architecture.html
  • Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, and Haizhou Li, “LightHuBERT: Lightweight and configurable speech representation learning with once-for-all hidden-unit BERT,” in Interspeech, June 2022, https://arxiv.org/abs/2203.15610 2022.
  • Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, Emanuele Rodolà, May 2023, Accelerating Transformer Inference for Translation via Parallel Decoding, https://arxiv.org/abs/2305.10427
  • Meng Wang; Liang Qian; Na Meng; Yusong Cheng; Weiwei Fang, Nov 2023, Model Parallelism Optimization for Distributed DNN Inference on Edge Devices, 2023 IEEE 14th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), https://ieeexplore.ieee.org/abstract/document/10391646 (Distributes inference across multiple edge devices at the layer level, with further optimization using layer fusion.)
  • Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 25 Jan 2024, ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models, https://arxiv.org/abs/2401.14351 Code: https://github.com/ServerlessLLM/ServerlessLLM
  • Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan, and Z. Liu, “Mobile-Former: Bridging mobilenet and transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5270–5279. https://arxiv.org/abs/2108.05895
  • S Latifi, 2023, Efficient and Dependable Deep Learning Systems Ph.D. Thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/176548/salar_1.pdf?sequence=1
  • David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Y Wang, K Chen, H Tan, K Guo, 2023, Tabi: An Efficient Multi-Level Inference System for Large Language Models, https://cse.hkust.edu.hk/~kaichen/papers/tabi-eurosys23.pdf
  • Li Yang, Zhezhi He, Yu Cao, Deliang Fan, Sep 2020, A Progressive Sub-Network Searching Framework for Dynamic Inference, https://arxiv.org/abs/2009.05681
  • Kah Phooi Seng, Li-Minn Ang, 2022, "Embedded Intelligence: State-of-the-Art and Research Challenges", IEEE Access, vol.10, pp.59236-59258, 2022. https://ieeexplore.ieee.org/document/9775683 PDF: https://research.usc.edu.au/esploro/outputs/99640278002621
  • Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Zou, 7 Jun 2024, Mixture-of-Agents Enhances Large Language Model Capabilities, https://arxiv.org/abs/2406.04692
  • Q. Sun, Z. Yin, X. Li, Z. Wu, X. Qiu, and L. Kong, “Corex: Pushing the boundaries of complex reasoning through multi model collaboration,” arXiv preprint arXiv:2310.00280, 2023. https://arxiv.org/abs/2310.00280
  • Matt Murphy, Tim Tully, Derek Xiao, January 18, 2024, The Modern AI Stack: Design Principles for the Future of Enterprise AI Architectures, Menlo Ventures, https://menlovc.com/perspective/the-modern-ai-stack-design-principles-for-the-future-of-enterprise-ai-architectures/ (Various details about the AI tech stack, organizational AI maturity levels, and several interesting facts: inference is 95% of AI cost now, 60% of organizations are using multi-model methods, RAG is the dominant architecture currently, and AI application development teams are primarily made up of non-ML software engineers leveraging on top of AI models.)
  • Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith, 2 Jul 2024, Revisiting Cascaded Ensembles for Efficient Inference https://arxiv.org/abs/2407.02348
  • Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan, 9 Jul 2024, Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules, https://arxiv.org/abs/2407.06677
  • Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, John Langford, Furong Huang, 6 Oct 2024, EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM? https://arxiv.org/abs/2410.04571
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Do Xuan Long, Duong Ngoc Yen, Anh Tuan Luu, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen, 1 Nov 2024, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models, https://arxiv.org/abs/2411.00492
  • Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin, 7 Nov 2024, Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models. https://arxiv.org/abs/2411.04996
  • Yingxuan Yang, Qiuying Peng, Jun Wang, Weinan Zhang, 21 Nov 2024, Multi-LLM-Agent Systems: Techniques and Business Perspectives, https://arxiv.org/abs/2411.14033
  • Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
  • Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, Philip S. Yu, 25 Feb 2025, Harnessing Multiple Large Language Models: A Survey on LLM Ensemble, https://arxiv.org/abs/2502.18036 https://github.com/junchenzhi/Awesome-LLM-Ensemble
  • Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen, 9 Feb 2025, MixLLM: Dynamic Routing in Mixed Large Language Models, https://arxiv.org/abs/2502.18482
  • Chan-Jan Hsu, Davide Buffelli, Jamie McGowan, Feng-Ting Liao, Yi-Chang Chen, Sattar Vakili, Da-shan Shiu, 16 May 2025, Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity, https://arxiv.org/abs/2505.11107
  • Yang Liu, Bingjie Yan, Tianyuan Zou, Jianqing Zhang, Zixuan Gu, Jianbing Ding, Xidong Wang, Jingyi Li, Xiaozhou Ye, Ye Ouyang, Qiang Yang, Ya-Qin Zhang, 24 Apr 2025, Towards Harnessing the Collaborative Power of Large and Small Models for Domain Tasks, https://arxiv.org/abs/2504.17421
  • Juvenal Bassa, Vidya Manian, Sudhir Malik, Arghya Chattopadhyay, 9 Aug 2025, Jet Image Tagging Using Deep Learning: An Ensemble Model, https://arxiv.org/abs/2508.10034
  • Adithya Mohan, Dominik Rößle, Daniel Cremers and Torsten Schön, 22 Jul 2025, Advancing Robustness in Deep Reinforcement Learning with an Ensemble Defense Approach, https://arxiv.org/abs/2507.17070
  • Moncef Garouani, Ayah Barhrhouj, Olivier Teste, 23 Jul 2025, XStacking: Explanation-Guided Stacked Ensemble Learning, https://arxiv.org/abs/2507.17650
  • Md Min-Ha-Zul Abedin and Tazqia Mehrub, 22 Jul 2025, Evaluating Ensemble and Deep Learning Models for Static Malware Detection with Dimensionality Reduction Using the EMBER Dataset, https://arxiv.org/abs/2507.16952
  • Fabian Akkerman, Julien Ferry, Christian Artigues, Emmanuel Hebrard, Thibaut Vidal, 24 Jul 2025, Boosting Revisited: Benchmarking and Advancing LP-Based Ensemble Methods, https://arxiv.org/abs/2507.18242
  • Mizuki Funato and Yohei Sawada, 24 Jul 2025, Multi-Model Ensemble and Reservoir Computing for River Discharge Prediction in Ungauged Basins, https://arxiv.org/abs/2507.18423
  • Hemanth Kumar M, Karthika M, Saianiruth M, Vasanthakumar Venugopal, Anandakumar D, Revathi Ezhumalai, Charulatha K, Kishore Kumar J, Dayana G, Kalyan Sivasailam, Bargava Subramanian, 17 Jul 2025, A Deep Learning-Based Ensemble System for Automated Shoulder Fracture Detection in Clinical Radiographs, https://arxiv.org/abs/2507.13408
  • Kondrup Emma, 17 Jul 2025, Base3: a simple interpolation-based ensemble method for robust dynamic link prediction, https://arxiv.org/abs/2506.12764
  • Satyankar Chandra, Ashutosh Gupta, Kaushik Mallik, Krishna Shankaranarayanan, Namrita Varshney, 19 Jul 2025, Glitches in Decision Tree Ensemble Models, https://arxiv.org/abs/2507.14492
  • Jingwei Huang, Kuroush Nezafati, Ismael Villanueva-Miranda, Zifan Gu, Yueshuang Xu, Ann Marie Navar, Tingyi Wanyan, Qin Zhou, Bo Yao, Ruichen Rong, Xiaowei Zhan, Guanghua Xiao, Eric D. Peterson, Donghan M. Yang, Wenqi Shi, Yang Xie, 18 Jul 2025, Large Language Models Powered Multiagent Ensemble for Mitigating Hallucination and Efficient Atrial Fibrillation Annotation of ECG Reports, https://arxiv.org/abs/2410.16543
  • I. Bentley, J. Tedder, M. Gebran, and A. Paul, 21 Jul 2025, Further exploration of binding energy residuals using machine learning and the development of a composite ensemble model, https://arxiv.org/abs/2503.11066
  • Angel Felipe Magnossão de Paula, Imene Bensalem, Paolo Rosso, Wajdi Zaghouani, 20 Jul 2025, Transformers and Ensemble methods: A solution for Hate Speech Detection in Arabic languages, https://arxiv.org/abs/2303.09823
  • Yuya Kawakami, Daniel Cayan, Dongyu Liu, and Kwan-Liu Ma, 8 Aug 2025, ClimateSOM: A Visual Analysis Workflow for Climate Ensemble Datasets, https://arxiv.org/abs/2508.06732
  • Yu Pan, Yuguang Yang, Jixun Yao, Lei Ma, Jianjun Zhao, 10 Aug 2025, Zero-Shot Voice Conversion via Content-Aware Timbre Ensemble and Conditional Flow Matching, https://arxiv.org/abs/2411.02026
  • Md Basit Azam and Sarangthem Ibotombi Singh, 21 Jul 2025, Clinical-Grade Blood Pressure Prediction in ICU Settings: An Ensemble Framework with Uncertainty Quantification and Cross-Institutional Validation, https://arxiv.org/abs/2507.19530
  • Uzzal Saha, Surya Prakash, 27 Jul 2025, Multi-Attention Stacked Ensemble for Lung Cancer Detection in CT Scans, https://arxiv.org/abs/2507.20221
  • Nicklas Werge, Yi-Shan Wu, Bahareh Tasdighi, Melih Kandemir, 31 Jul 2025, Directional Ensemble Aggregation for Actor-Critics, https://arxiv.org/abs/2507.23501
  • Navid Yazdanjue, Morteza Rakhshaninejad, Hossein Yazdanjouei, Mohammad Sadegh Khorshidi, Mikko S. Niemela, Fang Chen, Amir H. Gandomi, 19 Jul 2025, A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms, https://arxiv.org/abs/2507.22912
  • Jizhou Guo, 31 Jul 2025, LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration, https://arxiv.org/abs/2507.23167
  • Galadrielle Humblot-Renaux, Gianni Franchi, Sergio Escalera and Thomas B. Moeslund, 30 Jul 2025, COOkeD: Ensemble-based OOD detection in the era of zero-shot CLIP, https://arxiv.org/abs/2507.22576
  • Vinicius L. S. Silva, Gabriel S. Seabra, Alexandre A. Emerick, 30 Jul 2025, Mitigating loss of variance in ensemble data assimilation: machine learning-based and distance-free localization, https://arxiv.org/abs/2506.13362
  • Çağatay Demirel, 30 Jul 2025, RocketStack: Level-aware deep recursive ensemble learning framework with adaptive feature fusion and model pruning dynamics, https://arxiv.org/abs/2506.16965
  • Md. Ehsanul Haque, S. M. Jahidul Islam, Shakil Mia, Rumana Sharmin, Ashikuzzaman, Md Samir Morshed, Md. Tahmidul Huque, 31 Jul 2025, StackLiverNet: A Novel Stacked Ensemble Model for Accurate and Interpretable Liver Disease Detection, https://arxiv.org/abs/2508.00117
  • Maxime Bouscary, Saurabh Amin, 4 Aug 2025, OptiHive: Ensemble Selection for LLM-Based Optimization via Statistical Modeling, https://arxiv.org/abs/2508.02503
  • Georgia Papacharalampous, Hristos Tyralis, Nikolaos Doulamis, Anastasios Doulamis, 2 Aug 2025, Ensemble learning for uncertainty estimation with application to the correction of satellite precipitation products, https://arxiv.org/abs/2403.10567
  • Trinh Quoc Nguyen, Oky Dicky Ardiansyah Prima and Katsuyoshi Hotta, 5 Aug 2025, CORE-ReID: Comprehensive Optimization and Refinement through Ensemble fusion in Domain Adaptation for person re-identification, https://arxiv.org/abs/2508.03064
  • Mari Ashiga, Wei Jie, Fan Wu, Vardan Voskanyan, Fateme Dinmohammadi, Paul Brookes, Jingzhi Gong, and Zheng Wang, 5 Aug 2025, Ensemble Learning for Large Language Models in Text and Code Generation: A Survey, https://arxiv.org/abs/2503.13505
  • Trinh Quoc Nguyen, Oky Dicky Ardiansyah Prima, Syahid Al Irfan, Hindriyanto Dwi Purnomo and Radius Tanone, 6 Aug 2025, CORE-ReID V2: Advancing the Domain Adaptation for Object Re-Identification with Optimized Training and Ensemble Fusion, https://arxiv.org/abs/2508.04036
  • Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, and Yanzhi Wang, 5 Aug 2025, VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting, https://arxiv.org/abs/2507.05116
  • Rui Zou, 7 Aug 2025, Self-Error Adjustment: Theory and Practice of Balancing Individual Performance and Diversity in Ensemble Learning, https://arxiv.org/abs/2508.04948
  • Beicheng Xu, Wei Liu, Keyao Ding, Yupeng Lu, Bin Cui, 7 Aug 2025, PSEO: Optimizing Post-hoc Stacking Ensemble Through Hyperparameter Tuning, https://arxiv.org/abs/2508.05144
  • MD Shaikh Rahman, Feiroz Humayara, Syed Maudud E Rabbi, Muhammad Mahbubur Rashid, 6 Aug 2025, Advanced Multi-Architecture Deep Learning Framework for BIRADS-Based Mammographic Image Retrieval: Comprehensive Performance Analysis with Super-Ensemble Optimization, https://arxiv.org/abs/2508.04790
  • Daniil Vlasenko, Vadim Ushakov, Alexey Zaikin, Denis Zakharov, 8 Aug 2025, Ensemble-Based Graph Representation of fMRI Data for Cognitive Brain State Classification, https://arxiv.org/abs/2508.06118
  • Dan MacKinlay, 8 Aug 2025, The Ensemble Kalman Update is an Empirical Matheron Update, https://arxiv.org/abs/2502.03048
  • Keumseo Ryum, Jinu Gong, and Joonhyuk Kang, 12 Aug 2025, SHEFL: Resource-Aware Aggregation and Sparsification in Heterogeneous Ensemble Federated Learning, https://arxiv.org/abs/2508.08552
  • Luigi D'Amico, Daniel De Rosso, Ninad Dixit, Raul Salles de Padua, Samuel Palmer, Samuel Mugel, Román Orús, Holger Eble, and Ali Abedi, 12 Aug 2025, Blockchain Network Analysis using Quantum Inspired Graph Neural Networks & Ensemble Models, https://arxiv.org/abs/2508.09237
  • Md. Milon Islam, Md Rezwanul Haque, S M Taslim Uddin Raju, Fakhri Karray, 12 Aug 2025, FusionEnsemble-Net: An Attention-Based Ensemble of Spatiotemporal Networks for Multimodal Sign Language Recognition, https://arxiv.org/abs/2508.09362
  • Hossein Shokouhinejad, Roozbeh Razavi-Far, Griffin Higgins, Ali A Ghorbani, 13 Aug 2025, Explainable Ensemble Learning for Graph-Based Malware Detection, https://arxiv.org/abs/2508.09801
  • DongSeong-Yoon, 9 Aug 2025, A Cooperative Game-Based Multi-Criteria Weighted Ensemble Approach for Multi-Class Classification, https://arxiv.org/abs/2508.10926
  • Jihang Wang, Dongcheng Zhao, Ruolin Chen, Qian Zhang, Yi Zeng, 15 Aug 2025, Boosting the Robustness-Accuracy Trade-off of SNNs by Robust Temporal Self-Ensemble, https://arxiv.org/abs/2508.11279
  • Jiadong Chen, Xiao He, Hengyu Ye, Fuxin Jiang, Tieying Zhang, Jianjun Chen, Xiaofeng Gao, 18 Aug 2025, Online Ensemble Transformer for Accurate Cloud Workload Forecasting in Predictive Auto-Scaling, https://arxiv.org/abs/2508.12773
  • Dongjae Jeon, Taeheon Kim, Seongwon Cho, Minhyuk Seo, Jonghyun Choi, 18 Aug 2025, TTA-DAME: Test-Time Adaptation with Domain Augmentation and Model Ensemble for Dynamic Driving Conditions, https://arxiv.org/abs/2508.12690
  • Qingyan Meng, Mingqing Xiao, Zhengyu Ma, Huihui Zhou, Yonghong Tian, Zhouchen Lin, 18 Aug 2025, A Self-Ensemble Inspired Approach for Effective Training of Binary-Weight Spiking Neural Networks, https://arxiv.org/abs/2508.12609
  • Tzu-Chieh Chen and Wen-Yang Lin, 17 Aug 2025, On Fusing ChatGPT and Ensemble Learning in Discontinuous Named Entity Recognition in Health Corpora, https://arxiv.org/abs/2412.16976
  • Atsushi Nitanda, Anzelle Lee, Damian Tan Xing Kai, Mizuki Sakaguchi, Taiji Suzuki, 17 Aug 2025, Propagation of Chaos for Mean-Field Langevin Dynamics and its Application to Model Ensemble, https://arxiv.org/abs/2502.05784
  • Yifei Chen, Guanting Dong, Yutao Zhu, Zhicheng Dou, 19 Aug 2025, Revisiting RAG Ensemble: A Theoretical and Mechanistic Analysis of Multi-RAG System Collaboration, https://arxiv.org/abs/2508.13828
  • Hanseul Kang, Shervin Karimkashi, Ville Vuorinen, 13 Aug 2025, Parameter-Aware Ensemble SINDy for Interpretable Symbolic SGS Closure, https://arxiv.org/abs/2508.14085
  • Orestis Konstantaropoulos, Stelios Manolis Smirnakis, Maria Papadopouli, 19 Aug 2025, Neuro-inspired Ensemble-to-Ensemble Communication Primitives for Sparse and Efficient ANNs, https://arxiv.org/abs/2508.14140
  • Dylan Bouchard, Mohit Singh Chauhan, 20 Aug 2025, Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers, https://arxiv.org/abs/2504.19254
  • Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li, Gordon Laurie, 21 Aug 2025, Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset, https://arxiv.org/abs/2508.15986
  • Junying Yang, Gang Lu, Xiaoqing Yan, Peng Xia, Di Wu, 25 Aug 2025, Adaptive Ensemble Learning with Gaussian Copula for Load Forecasting, https://arxiv.org/abs/2508.17700
  • Xinrui He, Yikun Ban, Jiaru Zou, Tianxin Wei, Curtiss B. Cook, Jingrui He, 23 Aug 2025, LLM-Forest: Ensemble Learning of LLMs with Graph-Augmented Prompts for Data Imputation, https://arxiv.org/abs/2410.21520
  • Semih Eren and Deniz Kucukahmetler and Nico Scherf, 23 Jul 2025, Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025), https://arxiv.org/abs/2507.17897
  • Ameya Daigavane, Bodhi P. Vani, Darcy Davidson, Saeed Saremi, Joshua Rackers, Joseph Kleinhenz, 21 Jul 2025, JAMUN: Bridging Smoothed Molecular Dynamics and Score-Based Learning for Conformational Ensembles, https://arxiv.org/abs/2410.14621
  • Brian Liu, Rahul Mazumder, Peter Radchenko, 29 Jul 2025, Extracting Interpretable Models from Tree Ensembles: Computational and Statistical Perspectives, https://arxiv.org/abs/2506.20114
  • Juhyeong Kim, Sungyoon Choi, Youngbin Lee, Yejin Kim, Yongmin Choi and Yongjae Lee, 30 Jul 2025, Decision by Supervised Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization, https://arxiv.org/abs/2503.13544
  • Sahil Bansal, Sai Shruthi Sistla, Aarti Arikatala, Sebastian Schreiber, 7 Aug 2025, Planning Agents on an Ego-Trip: Leveraging Hybrid Ego-Graph Ensembles for Improved Tool Retrieval in Enterprise Task Planning, https://arxiv.org/abs/2508.05888
  • Akshat Dubey, Aleksandar Anžel, Bahar İlgen, Georges Hattab, 13 Aug 2025, UbiQTree: Uncertainty Quantification in XAI with Tree Ensembles, https://arxiv.org/abs/2508.09639
  • Ruipu Li, Daniel Menacho, Alexander Rodríguez, 18 Aug 2025, Adaptive Conformal Prediction Intervals Over Trajectory Ensembles, https://arxiv.org/abs/2508.13362
  • Xinzhu Liang, Joseph M. Lukens, Sanjaya Lohani, Brian T. Kirby, Thomas A. Searles, Xin Qiu, Kody J. H. Law, 21 Aug 2025, Scalable Bayesian Monte Carlo: fast uncertainty estimation beyond deep ensembles, https://arxiv.org/abs/2505.13585
  • Yixuan Sun, Romain Egele, Sri Hari Krishna Narayana, Luke Van Roekel, Carmelo Gonzales, Steven Brus, Balu Nadiga, Sandeep Madireddy, Prasanna Balaprakash, 22 Aug 2025, Ensembles of Neural Surrogates for Parametric Sensitivity in Ocean Modeling, https://arxiv.org/abs/2508.16489
  • Dhruv D. Modi, Rong Pan, 18 Aug 2025, Enhancing Transformer-Based Foundation Models for Time Series Forecasting via Bagging, Boosting and Statistical Ensembles, https://arxiv.org/abs/2508.16641

Deployment: Serving Multiple Cloud Models

When serving AI engines in the cloud, a single server typically hosts multiple models, and it must decide how to allocate incoming queries across those models efficiently. There are some papers on the practical deployment aspects of managing multiple models on a cloud server.
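To make the allocation problem concrete, here is a minimal Python sketch of a least-loaded query router across several hosted models. The model names, latency estimates, and routing policy are illustrative assumptions, not taken from any particular paper or serving framework:

    # Minimal sketch of a query router for a server hosting multiple models.
    # The model names, latency estimates, and least-loaded policy are
    # illustrative assumptions, not any specific serving framework's design.

    import threading
    from dataclasses import dataclass, field

    @dataclass
    class HostedModel:
        name: str
        avg_latency_ms: float          # rough per-query cost estimate
        in_flight: int = 0             # queries currently being processed
        lock: threading.Lock = field(default_factory=threading.Lock)

        def estimated_backlog_ms(self) -> float:
            # Simple load signal: queued work times per-query latency.
            return (self.in_flight + 1) * self.avg_latency_ms

    class LeastLoadedRouter:
        """Route each query to the model with the smallest estimated backlog."""

        def __init__(self, models):
            self.models = models

        def dispatch(self, query: str) -> str:
            target = min(self.models, key=lambda m: m.estimated_backlog_ms())
            with target.lock:
                target.in_flight += 1
            try:
                return self._run_inference(target, query)  # placeholder for a real RPC
            finally:
                with target.lock:
                    target.in_flight -= 1

        def _run_inference(self, model: HostedModel, query: str) -> str:
            # A real deployment would call the model server here (HTTP/gRPC).
            return f"[{model.name}] response to: {query}"

    if __name__ == "__main__":
        router = LeastLoadedRouter([
            HostedModel("small-model", avg_latency_ms=50.0),
            HostedModel("large-model", avg_latency_ms=400.0),
        ])
        print(router.dispatch("What is ensemble learning?"))

Real serving stacks layer batching, autoscaling, and per-model quotas on top of this, but the core decision is the same: pick a model instance per query based on an estimate of its current load.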

Submodels (Many-Models-in-One)

Although most ensemble architectures use multiple distinct models, another approach is to have a single model act as many. This is called "submodels," "many-models-in-one," or "many-in-one models."

Several methods have been tried, including training multiple submodels as part of a larger model, or using cut-down versions of a bigger model as multiple smaller submodels (e.g., early exit gives submodels along the depth dimension, width pruning along the width dimension, and so on). In some such architectures, the same model is simply executed with different settings, such as the meta-parameters controlling early exit or width pruning.
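As a rough illustration of the idea, the following Python sketch runs a single stack of shared weights as different submodels, selected by two meta-parameters: an early-exit depth and a width fraction. The toy layers and parameter names are assumptions for illustration only, not any specific published architecture:

    # Sketch of "many-models-in-one": one stack of layers that yields different
    # submodels depending on two meta-parameters, an early-exit depth and a
    # width fraction. Toy layers; the parameter names are illustrative.

    import numpy as np

    class ManyInOneModel:
        def __init__(self, num_layers=8, hidden=64, rng=None):
            rng = rng or np.random.default_rng(0)
            self.weights = [rng.standard_normal((hidden, hidden)) * 0.1
                            for _ in range(num_layers)]

        def forward(self, x, exit_layer=None, width_fraction=1.0):
            """Run the shared weights as a particular submodel.

            exit_layer     -- stop after this many layers (depth-wise submodel).
            width_fraction -- use only the first fraction of each layer's units
                              (a crude stand-in for width pruning).
            """
            exit_layer = exit_layer or len(self.weights)
            keep = max(1, int(self.weights[0].shape[0] * width_fraction))
            h = x[:keep]
            for W in self.weights[:exit_layer]:
                h = np.tanh(W[:keep, :keep] @ h)
            return h

    model = ManyInOneModel()
    x = np.ones(64)
    full = model.forward(x)                                    # the "big" model
    fast = model.forward(x, exit_layer=4, width_fraction=0.5)  # a smaller submodel

The key point is that no extra weights are stored: the "small" and "big" models are just different execution configurations of the same parameters.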

This idea also appears as a specialization of other optimizations. For example, the self-speculative decoding method has the smaller draft model simply be an early exit of the larger verifier model. This avoids the cost of training two separate models, and it allows computation reuse, because the layers of the big model that were already computed by the small model do not need to be recomputed during verification.
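A hedged sketch of that computation reuse is shown below: the draft step runs only the first few layers and caches its hidden state, and verification resumes from that cached state rather than recomputing the drafted layers. The layer and token details are toy placeholders, not a real decoder, and for simplicity it drafts and verifies one token at a time:

    # Sketch of self-speculative decoding with computation reuse: the draft
    # "model" is simply the first `draft_layers` layers of the full model (an
    # early exit), and verification resumes from the draft's cached hidden
    # state instead of recomputing those layers.

    import numpy as np

    class SelfSpeculativeDecoder:
        def __init__(self, layers, draft_layers, vocab_proj):
            self.layers = layers                # list of weight matrices (toy "layers")
            self.draft_layers = draft_layers    # early-exit depth used for drafting
            self.vocab_proj = vocab_proj        # hidden state -> vocab logits

        def _run(self, h, start, end):
            for W in self.layers[start:end]:
                h = np.tanh(W @ h)
            return h

        def draft_token(self, h_in):
            # Early exit: run only the first few layers and predict cheaply.
            h_draft = self._run(h_in, 0, self.draft_layers)
            token = int(np.argmax(self.vocab_proj @ h_draft))
            return token, h_draft               # cache the hidden state for reuse

        def verify_token(self, h_draft, draft_token):
            # Resume from the cached state; drafted layers are not recomputed.
            h_full = self._run(h_draft, self.draft_layers, len(self.layers))
            verified = int(np.argmax(self.vocab_proj @ h_full))
            return verified == draft_token, verified

    rng = np.random.default_rng(1)
    layers = [rng.standard_normal((32, 32)) * 0.1 for _ in range(8)]
    decoder = SelfSpeculativeDecoder(layers, draft_layers=4,
                                     vocab_proj=rng.standard_normal((100, 32)))
    tok, h = decoder.draft_token(rng.standard_normal(32))
    accepted, final_tok = decoder.verify_token(h, tok)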

Research papers on submodels and many-models-in-one architectures:

  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 2024, MatFormer: Nested Transformer for Elastic Inference https://openreview.net/pdf?id=93BaEweoRg (A method of training one large model, and then extracting many smaller sub-models from that model, using FFNs with a subset of parameters, which if done statically can then be similar to a form of model compression, and elastic inference done dynamically is a type of adaptive inference.)
  • Lei Xun, Jonathon Hare, Geoff V. Merrett, 17 Jan 2024, Dynamic DNNs and Runtime Management for Efficient Inference on Mobile/Embedded Devices, https://arxiv.org/abs/2401.08965
  • Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov, 2024, FLEXTRON: Many-in-One Flexible Large Language Model, https://openreview.net/pdf?id=9vKRhnflAs (Using one model to act in different ways by making it "elastic" with parameters, effectively using slimming via techniques such as layer fusion in MLPs and MHA Attention Heads.)
  • Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 2 Jul 2024, S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models, https://arxiv.org/abs/2407.01955 (Creating, managing and integrating multiple draft models as submodels in speculative decoding.)
  • Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Parsa Kavehzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 1 Jun 2024 (v3), SortedNet: A Scalable and Generalized Framework for Training Modular Deep Neural Networks, https://arxiv.org/abs/2309.00255
  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 11 Oct 2023, MatFormer: Nested Transformer for Elastic Inference, https://arxiv.org/abs/2310.07707
  • Janek Haberer, Ali Hojjat, Olaf Landsiedel, 26 Sep 2024, HydraViT: Stacking Heads for a Scalable ViT, https://arxiv.org/abs/2409.17978 https://github.com/ds-kiel/HydraViT
  • Xuan Shen, Pu Zhao, Yifan Gong, Zhenglun Kong, Zheng Zhan, Yushu Wu, Ming Lin, Chao Wu, Xue Lin, Yanzhi Wang, 25 Sep 2024, Search for Efficient Large Language Models, https://arxiv.org/abs/2409.17372 (Looking for subnets inside models as an alternative to NAS.)
  • Shrenik Bhansali, Alwin Jin, Tyler Lizzo, Larry Heck, 23 Oct 2024, LEGO: Language Model Building Blocks, https://arxiv.org/abs/2410.18287 (Extract small models out of large models.)
  • R Cai, Y Ro, GW Kim, P Wang, BE Bejnordi, A Akella, Oct 2024, Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design, 38th Conference on Neural Information Processing Systems (NeurIPS 2024), https://utns.cs.utexas.edu/assets/papers/neurips24-readme.pdf https://github.com/VITA-Group/READ-ME (Extract multiple smaller MoE expert models from a large LLM.)
  • Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 23 Jan 2017, Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, https://arxiv.org/abs/1701.06538
  • Yan Zhuang, Zhenzhe Zheng, Fan Wu, and Guihai Chen. 2024. LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (SenSys '24). Association for Computing Machinery, New York, NY, USA, 521–534. https://doi.org/10.1145/3666025.3699355 https://dl.acm.org/doi/abs/10.1145/3666025.3699355
  • Umesh Deshpande, Travis Janssen, Mudhakar Srivatsa, and Swaminathan Sundararaman. 2024. MoEsaic: Shared Mixture of Experts. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 434–442. https://doi.org/10.1145/3698038.3698521 https://dl.acm.org/doi/abs/10.1145/3698038.3698521
  • Zehua Pei, Lancheng Zou, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 6 Feb 2025, CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference, https://arxiv.org/abs/2502.04416 https://github.com/JarvisPei/CMoE
  • Gabe Guo, Stefano Ermon, 29 Apr 2025, Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding, https://arxiv.org/abs/2504.20456
  • Andrii Balashov, 23 Jul 2025, Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models, https://arxiv.org/abs/2507.17107
  • Francesco Corti, Balz Maag, Joachim Schauer, Ulrich Pferschy, Olga Saukh, 28 Jul 2025, REDS: Resource-Efficient Deep Subnetworks for Dynamic Resource Constraints, https://arxiv.org/abs/2311.13349

Distributed Inference

Distributed inference is the technique of spreading the inference computation for a single query across multiple servers in different locations. It is a generalization of multi-GPU architectures to multiple distributed servers, each with one or more computation engines that handle part of the inference processing stack.
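As a simple illustration, the sketch below splits a toy layer stack across two "servers" and ships the intermediate activation between them. The servers are plain Python objects and the network hop is simulated with JSON serialization; a real deployment would use an RPC or HTTP call between machines, and the layer split points are arbitrary assumptions:

    # Sketch of distributed (pipeline-style) inference: the layer stack is
    # split across two servers, and the intermediate activation is shipped
    # from one to the next. The network hop is simulated with serialization.

    import json
    import numpy as np

    class LayerShardServer:
        """Holds a contiguous slice of the model's layers."""

        def __init__(self, weights):
            self.weights = weights

        def handle_request(self, payload: str) -> str:
            h = np.array(json.loads(payload))    # deserialize incoming activation
            for W in self.weights:
                h = np.tanh(W @ h)
            return json.dumps(h.tolist())        # serialize for the next hop

    rng = np.random.default_rng(2)
    all_layers = [rng.standard_normal((16, 16)) * 0.1 for _ in range(6)]

    # Split the stack: first half on server A, second half on server B.
    server_a = LayerShardServer(all_layers[:3])
    server_b = LayerShardServer(all_layers[3:])

    x = rng.standard_normal(16)
    wire = json.dumps(x.tolist())                # client sends the input
    wire = server_a.handle_request(wire)         # hop 1: server A's layers
    wire = server_b.handle_request(wire)         # hop 2: server B's layers
    output = np.array(json.loads(wire))

In practice the main engineering challenges are the serialization and network latency of these hand-offs, which is why activations are usually batched and sometimes compressed or quantized before transmission.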

Research papers on distributed inference algorithms:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++ programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: