Aussie AI

Ensemble Multi-Model AI

  • Last Updated 30 August, 2025
  • by David Spuler, Ph.D.

If one AI engine is amazing, imagine what two could do. Or ten. Or a hundred.

The idea of using two or more AI engines together to complete a task is not new. This area of research is known as "ensemble learning," and the resulting systems are sometimes called "multi-model" engines.

There are many ways that AI engines could cooperate to achieve more than one would alone. This is an area ripe for exploration, where we have only scratched the surface of possibilities. On the other hand, with today's high cost of GPUs limiting what can be done in both AI inference and training, the full realization of ensemble AI algorithms is still in the distant future.

One way in which two AI models work together has become common in practice: using the output of one model as text for the training data set of a new model. This has been an effective technique for improving downstream models, but it isn't usually classed as an ensemble algorithm, although it has been examined in the research literature (e.g., Honovich et al., 2022). The idea is similar to Knowledge Distillation, but differs in that the goal isn't to create a cut-down smaller model, but usually to improve the accuracy of a large model.

Existing Ensemble AI Algorithms

Despite the costs, several types of ensemble models already have a substantial body of research behind them, and some have a significant number of practical use cases.

  • Generative Adversarial Networks (GANs). These use two AI models against each other, one being critical of the output of the other, to improve overall results. This is an architecture that has proven very successful in practice.
  • Knowledge Distillation (KD). This is an optimization method whereby a large model is built first, and then it is used to "distill" its knowledge into a smaller model, as a "teacher" model to a "student" model. Two models are used in training, but only the smaller model is used for inference. Note that there are also various more advanced forms of "ensemble distillation" (multi-teacher) that involve two or more teachers, plus the student model. See knowledge distillation.
  • Cascade inference optimizations. Cascade optimizations involve the selection of different models, or paths through multiple models, as a type of inference optimization. Two or more models are used in the inference phase, and there are various methods for deciding at runtime which to choose. See cascades.
  • Big-small models. In a specific subtype of the cascade method called big-small models, two models are trained differently. During inference, a heuristic decides which model to invoke: a faster "small" model handles the common cases, and the slower "big" model is invoked only for the rarer, harder cases. This can improve inference latency and total throughput.
  • Speculative decoding. This method is similar to the big-little architecture, but differs because all queries initially go to the small model. The small, faster model does its best to suggest output tokens (i.e., it "speculates"), and then a bigger model is used to "verify" their correctness. If the bigger model has to override the smaller model, the process is slower, but usually the smaller model is accurate enough for the whole process to be faster on average, with accuracy close to using the larger model alone. Read more about speculative decoding.
  • Collaborative inference. This is where two or more models combine to complete inference. They could be on separate servers, or together on one server. See collaborative inference research.
  • Multi-model training. There are various other methods that use an ensemble technique to train better models, including bagging, boosting, stacking, and many variants. There is also the simple training method of using data sets derived from the output of some other model.
  • Multiple parallel models inference. In some architectures, multiple models process the same input data in parallel. Each model produces its own results, and the ensemble then has to choose an overall result. The algorithm for deciding amongst the multiple options could be maximum (or minimum), majority (counting) voting, weighted averages, or many other combinations (see the sketch after this list).
  • Hybrid dual transformer architectures. Rather than duplicating entire Transformer models, there has been research on adding extra components to the basic Transformer architecture, such as two heads or two encoders merged together. See Transformer architectures.
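As a concrete illustration of the parallel multi-model idea, here is a minimal Python sketch of running several models on the same input and combining their answers by majority or weighted voting. The model objects and their classify() method are hypothetical placeholders, not a real library API; a production system would call real model endpoints.

    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def ensemble_classify(models, input_text, weights=None):
        """Run every model on the same input in parallel and combine their answers."""
        with ThreadPoolExecutor() as pool:
            predictions = list(pool.map(lambda m: m.classify(input_text), models))

        if weights is None:
            # Majority (counting) vote: the most common prediction wins.
            return Counter(predictions).most_common(1)[0][0]

        # Weighted vote: accumulate each model's weight behind its prediction.
        scores = {}
        for pred, weight in zip(predictions, weights):
            scores[pred] = scores.get(pred, 0.0) + weight
        return max(scores, key=scores.get)

The aggregation rule is the main design choice here: simple counting treats all models equally, whereas weights can reflect each model's known accuracy on the task.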

Model Selection Algorithms

Model selection algorithms are dynamic inference optimizations where a choice is made between two or more models for execution. One example is "big-little" architectures (see below), where a heuristic attempts to send "easy" queries to a faster "little" model. Various other ensemble architectures are possible with multiple models. Another practical example is model selection at the deployment level, where the decision is which server to send a request to, and each server may host different models or instances of the same model. Other areas of research with similar aims include cascades and collaborative inference.

Model selection architectures are a general class of ensemble architectures where one of two or more models is "selected" to process a query. The general idea is that only one model actually processes the query, with a decision mechanism beforehand (which can be model-based or heuristic-based). Example sub-classes of model selection include big-little architectures and model routing, covered in the sections below.
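A minimal Python sketch of such a decision mechanism is shown below: a lightweight scorer estimates query difficulty and one model from a tiered set is selected to run. The scorer, the tier thresholds, and the model handles (with a generate() method) are hypothetical placeholders; a simple heuristic such as query length could stand in for the scorer.

    def select_and_run(query, scorer, tiers):
        # tiers is a list of (max_difficulty, model) pairs in ascending order,
        # e.g. [(0.3, tiny_model), (0.7, small_model), (1.0, big_model)].
        difficulty = scorer(query)   # estimated difficulty in [0, 1]
        for max_difficulty, model in tiers:
            if difficulty <= max_difficulty:
                return model.generate(query)   # only the selected model runs
        return tiers[-1][1].generate(query)    # fallback: most capable model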

Research on model selection architectures:

  • Bodun Hu, Le Xu, Jeongyoon Moon, Neeraja J. Yadwadkar, Aditya Akella, 27 Oct 2023, MOSEL: Inference Serving Using Dynamic Modality Selection, https://arxiv.org/abs/2310.18481 (Multi-modal model with dynamic selection of modality.)
  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
  • Can Wang, Bolin Zhang, Dianbo Sui, Zhiying Tu, Xiaoyu Liu, Jiabao Kang, 1 Mar 2024 (v2), A Survey on Effective Invocation Methods of Massive LLM Services, https://arxiv.org/abs/2402.03408 (Deployment of LLMs as LLM-as-a-Service or LLMaaS architectures including prompt compression, semantic caching and model selection based on scoring inputs.)
  • Yuyi Mao, Xianghao Yu, Kaibin Huang, Ying-Jun Angela Zhang, Jun Zhang, Dec 2023, Green Edge AI: A Contemporary Survey, https://arxiv.org/abs/2312.00333
  • David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith, 2 Jul 2024, Revisiting Cascaded Ensembles for Efficient Inference https://arxiv.org/abs/2407.02348
  • Ziheng Wang, Pedro Reviriego, Farzad Niknia, Javier Conde, Shanshan Liu, Fabrizio Lombardi, 26 Aug 2024, Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things, https://arxiv.org/abs/2408.14528 (Running a small quantized model and then determining whether to run the full non-quantized model.)
  • Sean Michael Kerner, September 17, 2024, Model routing: The secret weapon for maximizing AI efficiency in enterprises, https://venturebeat.com/ai/why-accenture-and-martian-see-model-routing-as-key-to-enterprise-ai-success/
  • Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica, 21 Jul 2024 (v3), RouteLLM: Learning to Route LLMs with Preference Data, https://arxiv.org/abs/2406.18665
  • Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah, 22 Apr 2024, Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, ICLR 2024, https://arxiv.org/abs/2404.14618
  • Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, Fahad Dogar, 4 Oct 2024, LLMProxy: Reducing Cost to Access Large Language Models, https://arxiv.org/abs/2410.11857 (Deploying a proxy between user and LLM, with handling of conversational history context and caching.)
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, Chaoyang He, 23 Oct 2024 (v3), TensorOpera Router: A Multi-Model Router for Efficient LLM Inference, https://arxiv.org/abs/2408.12320
  • Zesen Zhao, Shuowei Jin, Z. Morley Mao, 23 Sep 2024, Eagle: Efficient Training-Free Router for Multi-LLM Inference, https://arxiv.org/abs/2409.15518
  • Tao Feng, Yanzhen Shen, Jiaxuan You, 4 Oct 2024, GraphRouter: A Graph-based Router for LLM Selections, https://arxiv.org/abs/2410.03834 https://github.com/ulab-uiuc/GraphRouter
  • Kaushal Kumar Maurya, KV Aditya Srivatsa, Ekaterina Kochmar, 16 Aug 2024, SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models, https://arxiv.org/abs/2408.08545
  • Quang H. Nguyen, Duy C. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V. Chawla, Khoa D. Doan, 24 Jul 2024 (v2), MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs, https://arxiv.org/abs/2407.10834
  • Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou, 15 Nov 2023, Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models, https://arxiv.org/abs/2311.08692
  • Małgorzata Łazuka, Andreea Anghel, Thomas Parnell, 3 Oct 2024, LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services, https://arxiv.org/abs/2410.02425
  • Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam, 28 Jun 2024 (v4), AutoMix: Automatically Mixing Language Models, https://arxiv.org/abs/2310.12963
  • Josef Pichlmeier, Philipp Ross, Andre Luckow, 8 Oct 2024 (v2), Performance Characterization of Expert Router for Scalable LLM Inference, https://arxiv.org/abs/2404.15153
  • Ou, Anthony C., Feb 2024, Large Language Model Routing with Benchmark Datasets, Master's Thesis, Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, https://dspace.mit.edu/handle/1721.1/153846
  • KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar, 1 May 2024, Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing, https://arxiv.org/abs/2405.00467
  • David Farr, Nico Manzonelli, Iain Cruickshank, Kate Starbird, Jevin West, 16 Oct 2024, LLM Chain Ensembles for Scalable and Accurate Data Annotation, https://arxiv.org/abs/2410.13006
  • Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C.S. Lui, 2 Oct 2024 (v2), Cost-Effective Online Multi-LLM Selection with Versatile Reward Models, https://arxiv.org/abs/2405.16587
  • Grant Wilkins, Srinivasan Keshav, Richard Mortier, 4 Jul 2024, Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems, https://arxiv.org/abs/2407.04014
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
  • Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica, 20 Feb 2025, Optimizing Model Selection for Compound AI Systems, https://arxiv.org/abs/2502.14815
  • Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, Philip S. Yu, 25 Feb 2025, Harnessing Multiple Large Language Models: A Survey on LLM Ensemble, https://arxiv.org/abs/2502.18036 https://github.com/junchenzhi/Awesome-LLM-Ensemble
  • Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen, 9 Feb 2025, MixLLM: Dynamic Routing in Mixed Large Language Models, https://arxiv.org/abs/2502.18482
  • Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng, 27 Mar 2025, A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond, https://arxiv.org/abs/2503.21614
  • Avinash Kumar, Shashank Nag, Jason Clemons, Lizy John, Poulami Das, 14 Apr 2025, HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving, https://arxiv.org/abs/2504.10724
  • Jianfei Li, Kevin Kam Fung Yuen, 18 Jul 2025, CPC-CMS: Cognitive Pairwise Comparison Classification Model Selection Framework for Document-level Sentiment Analysis, https://arxiv.org/abs/2507.14022
  • Judy Long, Tao Liu, Sean Alexander Woznicki, Miljana Marković, Oskar Marko, Molly Sears, 10 Aug 2025, From Time-series Generation, Model Selection to Transfer Learning: A Comparative Review of Pixel-wise Approaches for Large-scale Crop Mapping, https://arxiv.org/abs/2507.12590
  • Justin Kay, Grant Van Horn, Subhransu Maji, Daniel Sheldon, and Sara Beery, 31 Jul 2025, Consensus-Driven Active Model Selection, https://arxiv.org/abs/2507.23771
  • Lorenzo Volpi, Alejandro Moreo, Fabrizio Sebastiani, 30 Jul 2025, Transductive Model Selection under Prior Probability Shift, https://arxiv.org/abs/2507.22647
  • Basile Lewandowski, Robert Birke, Lydia Y. Chen, 14 Aug 2025, Match & Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models, https://arxiv.org/abs/2508.10993
  • Andrea Napoli, Paul White, 17 Aug 2025, Clustering-Based Validation Splits for Model Selection under Domain Shift, https://arxiv.org/abs/2405.19461
  • Chongyu Qu, Allen J. Luna, Thomas Z. Li, Junchao Zhu, Junlin Guo, Juming Xiong, Kim L. Sandler, Bennett A. Landman, Yuankai Huo, 20 Aug 2025, Cohort-Aware Agents for Individualized Lung Cancer Risk Prediction Using a Retrieval-Augmented Model Selection Framework, https://arxiv.org/abs/2508.14940
  • Jialiang Wang, Hanmo Liu, Shimin Di, Zhili Wang, Jiachuan Wang, Lei Chen, Xiaofang Zhou, 21 Jul 2025, Beyond Model Base Selection: Weaving Knowledge to Master Fine-grained Neural Network Design, https://arxiv.org/abs/2507.15336
  • Prateek Chanda, Saral Sureka, Parth Pratim Chatterjee, Krishnateja Killamsetty, Nikhil Shivakumar Nayak, Ganesh Ramakrishnan, 7 Aug 2025, Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning, https://arxiv.org/abs/2507.12612
  • Bohan Yang, Gang Liu, Yang Zhong, Rirao Dao, Yujia Qian, Ke Shi, Anke Tang, Yong Luo, Qi Kong, Jingnan Liu, 7 Aug 2025, Unsupervised deep learning model for fast energy layer pre-selection of delivery-efficient proton arc therapy plan optimization of nasopharyngeal carcinoma, https://arxiv.org/abs/2506.15803
  • Chenghui Zheng, Garvesh Raskutti, 19 Aug 2025, Comparing Model-agnostic Feature Selection Methods through Relative Efficiency, https://arxiv.org/abs/2508.14268

Model Routing

Model routing is a generalization of model selection that also accounts for issues such as serving and network costs. The idea is to select a model from a set (i.e., model selection), and then route the query over the network to wherever that model is being served. This could include choosing between commercial and open-source models, and many variations therein.
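A minimal sketch of a cost-aware routing rule is shown below, assuming a hypothetical table of served model endpoints with illustrative cost and quality numbers; real routers (such as those in the papers below) typically learn the routing decision from data rather than using a fixed rule.

    ENDPOINTS = [
        # (name, URL, relative cost, expected quality) -- illustrative values only
        ("small-open-model", "https://edge.example.com/v1/generate", 0.1, 0.7),
        ("large-commercial", "https://api.example.com/v1/generate", 1.0, 0.95),
    ]

    def choose_endpoint(min_quality):
        # Cheapest endpoint whose expected quality meets the threshold.
        good_enough = [e for e in ENDPOINTS if e[3] >= min_quality]
        if not good_enough:
            # Nothing meets the bar: fall back to the highest-quality endpoint.
            return max(ENDPOINTS, key=lambda e: e[3])
        return min(good_enough, key=lambda e: e[2])

    # Usage: an easy query tolerates lower quality and is routed to the cheap model.
    name, url, cost, quality = choose_endpoint(min_quality=0.6)
    # ...the query is then sent over the network to `url`...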

Research papers on model routing algorithms:

  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
  • Sean Michael Kerner, September 17, 2024, Model routing: The secret weapon for maximizing AI efficiency in enterprises, https://venturebeat.com/ai/why-accenture-and-martian-see-model-routing-as-key-to-enterprise-ai-success/
  • Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica, 21 Jul 2024 (v3), RouteLLM: Learning to Route LLMs with Preference Data, https://arxiv.org/abs/2406.18665
  • Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah, 22 Apr 2024, Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, ICLR 2024, https://arxiv.org/abs/2404.14618
  • Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay, 12 May 2024, MARS: A Benchmark for Multi-LLM Algorithmic Routing System, ICLR 2024, https://openreview.net/forum?id=C0rs3wM0N8 PDF: https://openreview.net/pdf?id=C0rs3wM0N8
  • Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay, 28 Mar 2024 (v2), RouterBench: A Benchmark for Multi-LLM Routing System, https://arxiv.org/abs/2403.12031 https://github.com/withmartian/routerbench
  • Rinon Gal, Adi Haviv, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Gal Chechik, 2 Oct 2024, ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation, https://arxiv.org/abs/2410.01731 https://comfygen-paper.github.io/
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, Chaoyang He, 23 Oct 2024 (v3), TensorOpera Router: A Multi-Model Router for Efficient LLM Inference, https://arxiv.org/abs/2408.12320
  • Zesen Zhao, Shuowei Jin, Z. Morley Mao, 23 Sep 2024, Eagle: Efficient Training-Free Router for Multi-LLM Inference, https://arxiv.org/abs/2409.15518
  • Tao Feng, Yanzhen Shen, Jiaxuan You, 4 Oct 2024, GraphRouter: A Graph-based Router for LLM Selections, https://arxiv.org/abs/2410.03834 https://github.com/ulab-uiuc/GraphRouter
  • Kaushal Kumar Maurya, KV Aditya Srivatsa, Ekaterina Kochmar, 16 Aug 2024, SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models, https://arxiv.org/abs/2408.08545
  • Quang H. Nguyen, Duy C. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V. Chawla, Khoa D. Doan, 24 Jul 2024 (v2), MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs, https://arxiv.org/abs/2407.10834
  • Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou, 15 Nov 2023, Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models, https://arxiv.org/abs/2311.08692
  • Małgorzata Łazuka, Andreea Anghel, Thomas Parnell, 3 Oct 2024, LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services, https://arxiv.org/abs/2410.02425
  • Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam, 28 Jun 2024 (v4), AutoMix: Automatically Mixing Language Models, https://arxiv.org/abs/2310.12963
  • Josef Pichlmeier, Philipp Ross, Andre Luckow, 8 Oct 2024 (v2), Performance Characterization of Expert Router for Scalable LLM Inference, https://arxiv.org/abs/2404.15153
  • Ou, Anthony C., Feb 2024, Large Language Model Routing with Benchmark Datasets, Master's Thesis, Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, https://dspace.mit.edu/handle/1721.1/153846
  • KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar, 1 May 2024, Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing, https://arxiv.org/abs/2405.00467
  • David Farr, Nico Manzonelli, Iain Cruickshank, Kate Starbird, Jevin West, 16 Oct 2024, LLM Chain Ensembles for Scalable and Accurate Data Annotation, https://arxiv.org/abs/2410.13006
  • Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C.S. Lui, 2 Oct 2024 (v2), Cost-Effective Online Multi-LLM Selection with Versatile Reward Models, https://arxiv.org/abs/2405.16587
  • Arun Shankar, Oct 2024, Designing Cognitive Architectures: Agentic Workflow Patterns from Scratch, https://medium.com/google-cloud/designing-cognitive-architectures-agentic-workflow-patterns-from-scratch-63baa74c54bc
  • Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu, 30 Oct 2024, Accelerated AI Inference via Dynamic Execution Methods, https://arxiv.org/abs/2411.00853
  • Kirill Vasilevski, Dayi Lin, Ahmed Hassan, 14 Nov 2024, Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models, https://arxiv.org/abs/2411.09837
  • AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
  • Yuanshuai Wang, Xingjian Zhang, Jinkun Zhao, Siwei Wen, Peilin Feng, Shuhao Liao, Lei Huang, Wenjun Wu, 5 Dec 2024, Bench-CoE: a Framework for Collaboration of Experts from Benchmark, https://arxiv.org/abs/2412.04167 https://github.com/ZhangXJ199/Bench-CoE
  • Dimitrios Sikeridis, Dennis Ramdass, Pranay Pareek, 12 Dec 2024, PickLLM: Context-Aware RL-Assisted Large Language Model Routing, https://arxiv.org/abs/2412.12170
  • Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
  • Avital Shafran, Roei Schuster, Thomas Ristenpart, Vitaly Shmatikov, 3 Jan 2025, Rerouting LLM Routers, https://arxiv.org/abs/2501.01818
  • Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
  • J. Pichlmeier, P. Ross and A. Luckow, "Performance Characterization of Expert Router for Scalable LLM Inference," 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 2024, pp. 1686-1693, doi: 10.1109/BigData62323.2024.10826121. https://ieeexplore.ieee.org/abstract/document/10826121
  • Clovis Varangot-Reille, Christophe Bouvard, Antoine Gourru, Mathieu Ciancone, Marion Schaeffer, François Jacquenet, 4 Feb 2025 (v2), Doing More with Less -- Implementing Routing Strategies in Large Language Model-Based Systems: An Extended Survey, https://arxiv.org/abs/2502.00409
  • Seamus Somerstep, Felipe Maia Polo, Allysson Flavio Melo de Oliveira, Prattyush Mangal, Mírian Silva, Onkar Bhardwaj, Mikhail Yurochkin, Subha Maity, 5 Feb 2025, CARROT: A Cost Aware Rate Optimal Router, https://arxiv.org/abs/2502.03261
  • Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica, 20 Feb 2025, Optimizing Model Selection for Compound AI Systems, https://arxiv.org/abs/2502.14815
  • Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, Philip S. Yu, 25 Feb 2025, Harnessing Multiple Large Language Models: A Survey on LLM Ensemble, https://arxiv.org/abs/2502.18036 https://github.com/junchenzhi/Awesome-LLM-Ensemble
  • Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, Federico Tombari, 26 Feb 2025, I Know What I Don't Know: Improving Model Cascades Through Confidence Tuning, https://arxiv.org/abs/2502.19335
  • Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen, 9 Feb 2025, MixLLM: Dynamic Routing in Mixed Large Language Models, https://arxiv.org/abs/2502.18482
  • W Zhang, X Ren, Mar 2025, ReM: Sparsify and MoEfy Models with Post-Hoc ReLU Modulation, ICLR 2025 review, https://openreview.net/pdf?id=cizhOu3CZa (Induce activation sparsity for MoE choice in the model router.)
  • Tella Rajashekhar Reddy, Palak, Rohan Gandhi, Anjaly Parayil, Chaojie Zhang, Mike Shepperd, Liangcheng Yu, Jayashree Mohan, Srinivasan Iyengar, Shivkumar Kalyanaraman, Debopam Bhattacherjee, 15 May 2025, AI Greenferencing: Routing AI Inferencing to Green Modular Data Centers with Heron, https://arxiv.org/abs/2505.09989
  • Ben Dickson, July 7, 2025, New 1.5B router model achieves 93% accuracy without costly retraining, https://venturebeat.com/ai/new-1-5b-router-model-achieves-93-accuracy-without-costly-retraining/
  • Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, Sanjiv Kumar, 22 Jul 2025, Universal Model Routing for Efficient LLM Inference, https://arxiv.org/abs/2502.08773
  • Linjiang Cao, Maonan Wang and Xi Xiong, 21 Jul 2025, A Large Language Model-Enhanced Q-learning for Capacitated Vehicle Routing Problem with Time Windows, https://arxiv.org/abs/2505.06178
  • Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Alessandro Sordoni, 9 Aug 2025, A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning, https://arxiv.org/abs/2408.07057
  • Xin He, Junxi Shen, Zhenheng Tang, Xiaowen Chu, Bo Li, Ivor W. Tsang, Yew-Soon Ong, 3 Aug 2025, RouteMark: A Fingerprint for Intellectual Property Attribution in Routing-based Model Merging, https://arxiv.org/abs/2508.01784
  • Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux, and Daniele Vigo, 12 Aug 2025, Hybrid Node-Destroyer Model with Large Neighborhood Search for Solving the Capacitated Vehicle Routing Problem, https://arxiv.org/abs/2508.08659
  • Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux, and Daniele Vigo, 12 Aug 2025, Edge-Selector Model Applied for Local Search Neighborhood for Solving Vehicle Routing Problems, https://arxiv.org/abs/2508.14071
  • Zewei Xin, Qinya Li, Chaoyue Niu, Fan Wu, Guihai Chen, 21 Aug 2025, Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model, https://arxiv.org/abs/2411.13787

Big-Little Transformer Models

Although many ensemble architectures are about doing even more computations to achieve even more advanced capabilities, the idea of big-little or big-small architectures is to improve inference speed and throughput by sending common queries to a smaller model. The larger model is reserved for more difficult or rarer queries which take longer. As such, it's an AI version of the "common case first" code optimization technique.

Note that "collaborative inference" (e.g. "parallel decoding" or "speculative decoding") is also conceptually a similar architecture, but differs because multiple models work together for inference, whereas pure big-little architectures choose the model at the start, and only one model does the inference. Also related are the various non-autoregressive architectures.

Research papers on big-little architectures:

  • Kim, S., Mangalam, K., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K., Big little transformer decoder, arXiv preprint arXiv:2302.07863, May 2023, https://arxiv.org/abs/2302.07863
  • Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J., Feb 2023, Accelerating large language model decoding with speculative sampling, arXiv preprint arXiv:2302.01318, https://arxiv.org/abs/2302.01318
  • Leviathan, Y., Kalman, M., and Matias, Y., Fast inference from transformers via speculative decoding, May 2023, https://arxiv.org/abs/2211.17192
  • Stern, M., Shazeer, N., and Uszkoreit, J., Nov 2018, Blockwise parallel decoding for deep autoregressive models, Advances in Neural Information Processing Systems, 31, https://arxiv.org/abs/1811.03115
  • Z. Peng et al. 2018. AXNet: ApproXimate computing using an end-to-end trainable neural network. 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) https://ieeexplore.ieee.org/document/8605388 (Ensemble dual-model method where one model is a fast approximation of the other.)
  • Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y (See Chapter 10, "Input Hardness Adaptive Models" for methods of running faster on easy image classification problems.)
  • Nan, F. and Saligrama, V., 2017. Dynamic model selection for prediction under a budget. arXiv preprint arXiv:1704.07505. https://arxiv.org/abs/1704.07505
  • Park, E., Kim, D., Kim, S., Kim, Y.-D., Kim, G., Yoon, S., and Yoo, S. (2015). Big/little deep neural network for ultra low power inference. In 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 124–132. https://ieeexplore.ieee.org/document/7331375
  • D Xu, W Yin, X Jin, Y Zhang, S Wei, M Xu, X Liu, Sep 2023, LLMCad: Fast and Scalable On-device Large Language Model Inference, arXiv preprint arXiv:2309.04255, https://arxiv.org/pdf/2309.04255.pdf (Keeps a smaller model in memory, improving speed and reducing memory utilization.)
  • Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. Tabi: An efficient multi-level inference system for large language models. In Proceedings of the Eighteenth European Conference on Computer Systems, pages 233–248, 2023. https://dl.acm.org/doi/10.1145/3552326.3587438, PDF: https://yidingwang.xyz/public/files/tabi_eurosys23.pdf (Has multiple models, some big, some small, with characteristics similar to ensembles, big-little, and cascades.)
  • H Malard, S Zaiem, R Algayres, 2023, Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences, arXiv preprint arXiv:2309.12712, https://arxiv.org/pdf/2309.12712.pdf (Big-little architecture for audio models.)
  • S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a "shallow-deep module" and parallel decoding.)
  • Kaya Y., Hong S., Dumitras T., Shallow-deep networks: Understanding and mitigating network overthinking Proceedings of the international conference on machine learning, ICML (2019), pp. 3301-3310, https://arxiv.org/abs/1810.07052 (Shallow-deep method in a single model.)
  • Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 26 Mar 2024, Tiny Models are the Computational Saver for Large Models, https://arxiv.org/abs/2403.17726v1 (Choose tiny or small models after an initial layer of the larger model, combining early exit with easy-hard queries for multi-model inference.)
  • Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi, 26 Feb 2024, Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding, https://arxiv.org/abs/2402.16844 (Using a large model to train parallel decoding for a small language model.)
  • Chia-Hsuan Lee, Hao Cheng, Mari Ostendorf, Nov 2023, OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking, https://arxiv.org/abs/2311.09758
  • Zichao Shen, Neil Howard and Jose Nunez-Yanez, 2022, Big–Little Adaptive Neural Networks on Low-Power Near-Subthreshold Processors, J. Low Power Electron. Appl. 2022, 12(2), 28, https://doi.org/10.3390/jlpea12020028 https://www.mdpi.com/2079-9268/12/2/28 Code: https://github.com/DarkSZChao/Big-Little_NN_Strategies
  • David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Kaiyan Zhang, Jianyu Wang, Ning Ding, Biqing Qi, Ermo Hua, Xingtai Lv, Bowen Zhou, 18 Jun 2024, Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding, https://arxiv.org/abs/2406.12295 Code: https://github.com/TsinghuaC3I/FS-GEN
  • Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, Ting Cao, June 2024, Hybrid SLM and LLM for Edge-Cloud Collaborative Inference, EdgeFM ’24, June 3–7, 2024, Minato-ku, Tokyo, Japan, https://dl.acm.org/doi/pdf/10.1145/3662006.3662067 (Small model on edge devices with large model in the cloud, performing collaborative inference.)
  • Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, Mike Lewis, 10 Jul 2023 (v2), Contrastive Decoding: Open-ended Text Generation as Optimization, https://arxiv.org/abs/2210.15097
  • Hyunjong Ok, Jegwang Ryu, Jaeho Lee, 26 Jun 2024, Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher, https://arxiv.org/abs/2406.18002 (Examines the idea of not using the larger model to always verify, and when to trust either the smaller or larger models, which is an idea that generalized beyond speculative decoding.)
  • Aishwarya P S, Pranav Ajit Nair, Yashas Samaga B L, Toby James Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, July 2024, Tandem Transformers for Inference Efficient LLMs, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42906-42917, 2024, https://proceedings.mlr.press/v235/s24a.html
  • Ziheng Wang, Pedro Reviriego, Farzad Niknia, Javier Conde, Shanshan Liu, Fabrizio Lombardi, 26 Aug 2024, Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things, https://arxiv.org/abs/2408.14528 (Running a small quantized model and then determining whether to run the full non-quantized model.)
  • Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah, 22 Apr 2024, Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, ICLR 2024, https://arxiv.org/abs/2404.14618
  • J. Niu, W. Zhang, C. J. Xue and N. Guan, 2024, "RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices," 2024 IEEE 30th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Sokcho, Korea, Republic of, 2024, pp. 21-30, doi: 10.1109/RTCSA62462.2024.00013. https://ieeexplore.ieee.org/abstract/document/10695719
  • Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang, 4 Oct 2024, Mixture of Attentions For Speculative Decoding, https://arxiv.org/abs/2410.03804
  • He Guo, Yulong Wang, Zixuan Ye, Jifeng Dai, Yuwen Xiong, 14 Oct 2024, big.LITTLE Vision Transformer for Efficient Visual Recognition, https://arxiv.org/abs/2410.10267
  • Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, Moin Nabi, 10 Oct 2024, KV Prediction for Improved Time to First Token, https://arxiv.org/abs/2410.08391 https://github.com/apple/corenet/tree/main/projects/kv-prediction (Small model creates an approximation of the KV cache for use by a larger model.)
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Sehoon Kim, Oct 2024, Full Stack Approach for Efficient Deep Learning Inference, Doctor of Philosophy, Computer Science, University of California, Berkeley, https://escholarship.org/content/qt4wf834q8/qt4wf834q8.pdf
  • Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, Federico Tombari, 26 Feb 2025, I Know What I Don't Know: Improving Model Cascades Through Confidence Tuning, https://arxiv.org/abs/2502.19335
  • Yang Liu, Bingjie Yan, Tianyuan Zou, Jianqing Zhang, Zixuan Gu, Jianbing Ding, Xidong Wang, Jingyi Li, Xiaozhou Ye, Ye Ouyang, Qiang Yang, Ya-Qin Zhang, 24 Apr 2025, Towards Harnessing the Collaborative Power of Large and Small Models for Domain Tasks, https://arxiv.org/abs/2504.17421

General Research on Ensemble Models

  • Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems. Yoshitomo Matsubara, Luca Soldaini, Eric Lind, Alessandro Moschitti, Dec 2022, https://arxiv.org/abs/2201.05767
  • Yungeng Zhang, Yuru Pei & Hongbin Zha, Learning Dual Transformer Network for Diffeomorphic Registration, Sep 2021, Medical Image Computing and Computer Assisted Intervention, MICCAI 2021, https://link.springer.com/chapter/10.1007/978-3-030-87202-1_13
  • Xian-Feng Han, Yi-Fei Jin, Hui-Xian Cheng, Guo-Qiang Xiao, Dual Transformer for Point Cloud Analysis, Apr 2021, https://arxiv.org/abs/2104.13044
  • Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, Tao Mei, 2023, Dual Vision Transformer, https://arxiv.org/pdf/2207.04976, Code: https://github.com/YehLi/ImageNetModel
  • Mohammed Alhamid, Ensemble Models, March 2021, https://towardsdatascience.com/ensemble-models-5a62d4f4cb0c
  • Oliver R. A. Dunbar, Andrew B. Duncan, Andrew M. Stuart, Marie-Therese Wolfram, Jan 2022, Ensemble Inference Methods for Models With Noisy and Expensive Likelihoods, https://arxiv.org/abs/2104.03384
  • Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou, Revisiting Vision Transformer from the View of Path Ensemble, https://arxiv.org/abs/2308.06548, PDF: https://arxiv.org/pdf/2308.06548.pdf (Treating the internal components of a Transformer as if they are an ensemble model.)
  • T. G. Dietterich. Ensemble methods in machine learning. In Multiple classifier systems, pages 1–15. Springer, 2000, Lecture Notes in Computer Science book series LNCS, volume 1857, https://link.springer.com/chapter/10.1007/3-540-45014-9_1, PDF: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e3b09a777c71a4b88888509ab9bfa12d8bf295ba (Early paper with ensemble idea applied to classifiers, rather than multi-model.)
  • Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q.: Multi-scale dense networks for resource efficient image classification. In: 6th International Conference on Learning Representations, ICLR 2018 (2018). https://doi.org/10.48550/arXiv.1703.09844, https://arxiv.org/abs/1703.09844 (Has multiple models combined in an early-exit configuration.)
  • Y. Matsubara, M. Levorato, and F. Restuccia, “Split computing and early exiting for deep learning applications: Survey and research challenges,” ACM Comput. Surveys, Mar 2022, https://arxiv.org/abs/2103.04505 (Split computing is splitting the inference between server and edge machines.)
  • L. Li, K. Ota and M. Dong, "Deep learning for smart industry: Efficient manufacture inspection system with fog computing", IEEE Trans. Ind. Informat., vol. 14, no. 10, pp. 4665-4673, Oct. 2018. https://ieeexplore.ieee.org/document/8370640 ("Fog computing" is like cloud computing but on servers "nearer" to the ground.)
  • C. Lo, Y.-Y. Su, C.-Y. Lee and S.-C. Chang, "A dynamic deep neural network design for efficient workload allocation in edge computing", Proc. IEEE Int. Conf. Comput. Design (ICCD), pp. 273-280, Nov. 2017. https://ieeexplore.ieee.org/document/8119222
  • G Xu, Z Hao, Y Luo, H Hu, J An, S Mao, 2023, DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices, arXiv preprint arXiv:2309.05015, https://arxiv.org/abs/2309.05015
  • Schwartz, R., Stanovsky, G., Swayamdipta, S., Dodge, J., and Smith, N. A. The right tool for the job: Matching model and instance complexities. In Annual Meeting of the Association for Computational Linguistics, 2020. https://arxiv.org/abs/2004.07453 (Early exit with "wisdom of committees" decisions.)
  • Naftaly, U., N. Intrator, and D. Horn. "Optimal ensemble averaging of neural networks." Network: Computation in Neural Systems 8, no. 3 (1997): 283–296. https://www.tau.ac.il/~horn/publications/optimal.pdf
  • Y. Liu and X. Yao, Ensemble Learning via Negative Correlation, Neural Networks, Volume 12, Issue 10, December 1999, pp. 1399-1404. doi:10.1016/S0893-6080(99)00073-8, https://www.sciencedirect.com/science/article/abs/pii/S0893608099000738
  • Z.S.H. Chan; N. Kasabov, 2005, Fast neural network ensemble learning via negative-correlation data correction, IEEE Transactions on Neural Networks, Volume 16, Issue 6, November 2005, https://ieeexplore.ieee.org/document/1528547
  • E Diao, 2023, Efficient and Collaborative Methods for Distributed Machine Learning, Ph.D. thesis, Department of Electrical and Computer Engineering Duke University, https://www.proquest.com/openview/410ea5eb4275fded25890f04c96a902e/1?pq-origsite=gscholar&cbl=18750&diss=y
  • X Xu, K Yan, S Han, B Wang, X Tao, P Zhang, 2023, Learning-Based Edge-Device Collaborative DNN Inference in IoVT Networks IEEE Internet of Things Journal, https://ieeexplore.ieee.org/abstract/document/10258387
  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, Oct 2023, MatFormer: Nested Transformer for Elastic Inference, https://arxiv.org/abs/2310.07707 (Multiple submodels inside a large model.)
  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 2024, MatFormer: Nested Transformer for Elastic Inference https://openreview.net/pdf?id=93BaEweoRg (A method of training one large model, and then extracting many smaller sub-models from that model, using FFNs with a subset of parameters, which if done statically can be similar to a form of model compression, while elastic inference done dynamically is a type of adaptive inference.)
  • NVIDIA, Aug 2023, Triton Architecture, NVIDIA Triton Inference Server user guide documentation, https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/architecture.html
  • Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, and Haizhou Li, “LightHuBERT: Lightweight and configurable speech representation learning with once-for-all hidden-unit BERT,” in Interspeech, June 2022, https://arxiv.org/abs/2203.15610 2022.
  • Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, Emanuele Rodolà, May 2023, Accelerating Transformer Inference for Translation via Parallel Decoding, https://arxiv.org/abs/2305.10427
  • Meng Wang; Liang Qian; Na Meng; Yusong Cheng; Weiwei Fang, Nov 2023, Model Parallelism Optimization for Distributed DNN Inference on Edge Devices, 2023 IEEE 14th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), https://ieeexplore.ieee.org/abstract/document/10391646 (Distributes inference across multiple edge devices at the layer level, with further optimization using layer fusion.)
  • Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 25 Jan 2024, ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models, https://arxiv.org/abs/2401.14351 Code: https://github.com/ServerlessLLM/ServerlessLLM
  • Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan, and Z. Liu, “Mobile-Former: Bridging mobilenet and transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5270–5279. https://arxiv.org/abs/2108.05895
  • S Latifi, 2023, Efficient and Dependable Deep Learning Systems Ph.D. Thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/176548/salar_1.pdf?sequence=1
  • David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Y Wang, K Chen, H Tan, K Guo, 2023, Tabi: An Efficient Multi-Level Inference System for Large Language Models, https://cse.hkust.edu.hk/~kaichen/papers/tabi-eurosys23.pdf
  • Li Yang, Zhezhi He, Yu Cao, Deliang Fan, Sep 2020, A Progressive Sub-Network Searching Framework for Dynamic Inference, https://arxiv.org/abs/2009.05681
  • Kah Phooi Seng, Li-Minn Ang, 2022, "Embedded Intelligence: State-of-the-Art and Research Challenges", IEEE Access, vol.10, pp.59236-59258, 2022. https://ieeexplore.ieee.org/document/9775683 PDF: https://research.usc.edu.au/esploro/outputs/99640278002621
  • Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Zou, 7 Jun 2024, Mixture-of-Agents Enhances Large Language Model Capabilities, https://arxiv.org/abs/2406.04692
  • Q. Sun, Z. Yin, X. Li, Z. Wu, X. Qiu, and L. Kong, “Corex: Pushing the boundaries of complex reasoning through multi model collaboration,” arXiv preprint arXiv:2310.00280, 2023. https://arxiv.org/abs/2310.00280
  • Matt Murphy, Tim Tully, Derek Xiao, January 18, 2024, The Modern AI Stack: Design Principles for the Future of Enterprise AI Architectures, Menlo Ventures, https://menlovc.com/perspective/the-modern-ai-stack-design-principles-for-the-future-of-enterprise-ai-architectures/ (Various details about the AI tech stack, organizational AI maturity levels, and several interesting facts: inference is 95% of AI cost now, 60% of organizations are using multi-model methods, RAG is the dominant architecture currently, and AI application development teams are primarily made up of non-ML software engineers leveraging on top of AI models.)
  • Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith, 2 Jul 2024, Revisiting Cascaded Ensembles for Efficient Inference https://arxiv.org/abs/2407.02348
  • Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan, 9 Jul 2024, Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules, https://arxiv.org/abs/2407.06677
  • Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, John Langford, Furong Huang, 6 Oct 2024, EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM? https://arxiv.org/abs/2410.04571
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Do Xuan Long, Duong Ngoc Yen, Anh Tuan Luu, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen, 1 Nov 2024, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models, https://arxiv.org/abs/2411.00492
  • Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin, 7 Nov 2024, Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models. https://arxiv.org/abs/2411.04996
  • Yingxuan Yang, Qiuying Peng, Jun Wang, Weinan Zhang, 21 Nov 2024, Multi-LLM-Agent Systems: Techniques and Business Perspectives, https://arxiv.org/abs/2411.14033
  • Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
  • Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, Philip S. Yu, 25 Feb 2025, Harnessing Multiple Large Language Models: A Survey on LLM Ensemble, https://arxiv.org/abs/2502.18036 https://github.com/junchenzhi/Awesome-LLM-Ensemble
  • Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen, 9 Feb 2025, MixLLM: Dynamic Routing in Mixed Large Language Models, https://arxiv.org/abs/2502.18482
  • Chan-Jan Hsu, Davide Buffelli, Jamie McGowan, Feng-Ting Liao, Yi-Chang Chen, Sattar Vakili, Da-shan Shiu, 16 May 2025, Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity, https://arxiv.org/abs/2505.11107
  • Yang Liu, Bingjie Yan, Tianyuan Zou, Jianqing Zhang, Zixuan Gu, Jianbing Ding, Xidong Wang, Jingyi Li, Xiaozhou Ye, Ye Ouyang, Qiang Yang, Ya-Qin Zhang, 24 Apr 2025, Towards Harnessing the Collaborative Power of Large and Small Models for Domain Tasks, https://arxiv.org/abs/2504.17421
  • Juvenal Bassa, Vidya Manian, Sudhir Malik, Arghya Chattopadhyay, 9 Aug 2025, Jet Image Tagging Using Deep Learning: An Ensemble Model, https://arxiv.org/abs/2508.10034
  • Adithya Mohan, Dominik Rößle, Daniel Cremers and Torsten Schön, 22 Jul 2025, Advancing Robustness in Deep Reinforcement Learning with an Ensemble Defense Approach, https://arxiv.org/abs/2507.17070
  • Moncef Garouani, Ayah Barhrhouj, Olivier Teste, 23 Jul 2025, XStacking: Explanation-Guided Stacked Ensemble Learning, https://arxiv.org/abs/2507.17650
  • Md Min-Ha-Zul Abedin and Tazqia Mehrub, 22 Jul 2025, Evaluating Ensemble and Deep Learning Models for Static Malware Detection with Dimensionality Reduction Using the EMBER Dataset, https://arxiv.org/abs/2507.16952
  • Fabian Akkerman, Julien Ferry, Christian Artigues, Emmanuel Hebrard, Thibaut Vidal, 24 Jul 2025, Boosting Revisited: Benchmarking and Advancing LP-Based Ensemble Methods, https://arxiv.org/abs/2507.18242
  • Mizuki Funato and Yohei Sawada, 24 Jul 2025, Multi-Model Ensemble and Reservoir Computing for River Discharge Prediction in Ungauged Basins, https://arxiv.org/abs/2507.18423
  • Hemanth Kumar M, Karthika M, Saianiruth M, Vasanthakumar Venugopal, Anandakumar D, Revathi Ezhumalai, Charulatha K, Kishore Kumar J, Dayana G, Kalyan Sivasailam, Bargava Subramanian, 17 Jul 2025, A Deep Learning-Based Ensemble System for Automated Shoulder Fracture Detection in Clinical Radiographs, https://arxiv.org/abs/2507.13408
  • Kondrup Emma, 17 Jul 2025, Base3: a simple interpolation-based ensemble method for robust dynamic link prediction, https://arxiv.org/abs/2506.12764
  • Satyankar Chandra, Ashutosh Gupta, Kaushik Mallik, Krishna Shankaranarayanan, Namrita Varshney, 19 Jul 2025, Glitches in Decision Tree Ensemble Models, https://arxiv.org/abs/2507.14492
  • Jingwei Huang, Kuroush Nezafati, Ismael Villanueva-Miranda, Zifan Gu, Yueshuang Xu, Ann Marie Navar, Tingyi Wanyan, Qin Zhou, Bo Yao, Ruichen Rong, Xiaowei Zhan, Guanghua Xiao, Eric D. Peterson, Donghan M. Yang, Wenqi Shi, Yang Xie, 18 Jul 2025, Large Language Models Powered Multiagent Ensemble for Mitigating Hallucination and Efficient Atrial Fibrillation Annotation of ECG Reports, https://arxiv.org/abs/2410.16543
  • I. Bentley, J. Tedder, M. Gebran, and A. Paul, 21 Jul 2025, Further exploration of binding energy residuals using machine learning and the development of a composite ensemble model, https://arxiv.org/abs/2503.11066
  • Angel Felipe Magnossão de Paula, Imene Bensalem, Paolo Rosso, Wajdi Zaghouani, 20 Jul 2025, Transformers and Ensemble methods: A solution for Hate Speech Detection in Arabic languages, https://arxiv.org/abs/2303.09823
  • Yuya Kawakami, Daniel Cayan, Dongyu Liu, and Kwan-Liu Ma, 8 Aug 2025, ClimateSOM: A Visual Analysis Workflow for Climate Ensemble Datasets, https://arxiv.org/abs/2508.06732
  • Yu Pan, Yuguang Yang, Jixun Yao, Lei Ma, Jianjun Zhao, 10 Aug 2025, Zero-Shot Voice Conversion via Content-Aware Timbre Ensemble and Conditional Flow Matching, https://arxiv.org/abs/2411.02026
  • Md Basit Azam and Sarangthem Ibotombi Singh, 21 Jul 2025, Clinical-Grade Blood Pressure Prediction in ICU Settings: An Ensemble Framework with Uncertainty Quantification and Cross-Institutional Validation, https://arxiv.org/abs/2507.19530
  • Uzzal Saha, Surya Prakash, 27 Jul 2025, Multi-Attention Stacked Ensemble for Lung Cancer Detection in CT Scans, https://arxiv.org/abs/2507.20221
  • Nicklas Werge, Yi-Shan Wu, Bahareh Tasdighi, Melih Kandemir, 31 Jul 2025, Directional Ensemble Aggregation for Actor-Critics, https://arxiv.org/abs/2507.23501
  • Navid Yazdanjue, Morteza Rakhshaninejad, Hossein Yazdanjouei, Mohammad Sadegh Khorshidi, Mikko S. Niemela, Fang Chen, Amir H. Gandomi, 19 Jul 2025, A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms, https://arxiv.org/abs/2507.22912
  • Jizhou Guo, 31 Jul 2025, LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration, https://arxiv.org/abs/2507.23167
  • Galadrielle Humblot-Renaux, Gianni Franchi, Sergio Escalera and Thomas B. Moeslund, 30 Jul 2025, COOkeD: Ensemble-based OOD detection in the era of zero-shot CLIP, https://arxiv.org/abs/2507.22576
  • Vinicius L. S. Silva, Gabriel S. Seabra, Alexandre A. Emerick, 30 Jul 2025, Mitigating loss of variance in ensemble data assimilation: machine learning-based and distance-free localization, https://arxiv.org/abs/2506.13362
  • Çağatay Demirel, 30 Jul 2025, RocketStack: Level-aware deep recursive ensemble learning framework with adaptive feature fusion and model pruning dynamics, https://arxiv.org/abs/2506.16965
  • Md. Ehsanul Haque, S. M. Jahidul Islam, Shakil Mia, Rumana Sharmin, Ashikuzzaman, Md Samir Morshed, Md. Tahmidul Huque, 31 Jul 2025, StackLiverNet: A Novel Stacked Ensemble Model for Accurate and Interpretable Liver Disease Detection, https://arxiv.org/abs/2508.00117
  • Maxime Bouscary, Saurabh Amin, 4 Aug 2025, OptiHive: Ensemble Selection for LLM-Based Optimization via Statistical Modeling, https://arxiv.org/abs/2508.02503
  • Georgia Papacharalampous, Hristos Tyralis, Nikolaos Doulamis, Anastasios Doulamis, 2 Aug 2025, Ensemble learning for uncertainty estimation with application to the correction of satellite precipitation products, https://arxiv.org/abs/2403.10567
  • Trinh Quoc Nguyen, Oky Dicky Ardiansyah Prima and Katsuyoshi Hotta, 5 Aug 2025, CORE-ReID: Comprehensive Optimization and Refinement through Ensemble fusion in Domain Adaptation for person re-identification, https://arxiv.org/abs/2508.03064
  • Mari Ashiga, Wei Jie, Fan Wu, Vardan Voskanyan, Fateme Dinmohammadi, Paul Brookes, Jingzhi Gong, and Zheng Wang, 5 Aug 2025, Ensemble Learning for Large Language Models in Text and Code Generation: A Survey, https://arxiv.org/abs/2503.13505
  • Trinh Quoc Nguyen, Oky Dicky Ardiansyah Prima, Syahid Al Irfan, Hindriyanto Dwi Purnomo and Radius Tanone, 6 Aug 2025, CORE-ReID V2: Advancing the Domain Adaptation for Object Re-Identification with Optimized Training and Ensemble Fusion, https://arxiv.org/abs/2508.04036
  • Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, and Yanzhi Wang, 5 Aug 2025, VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting, https://arxiv.org/abs/2507.05116
  • Rui Zou, 7 Aug 2025, Self-Error Adjustment: Theory and Practice of Balancing Individual Performance and Diversity in Ensemble Learning, https://arxiv.org/abs/2508.04948
  • Beicheng Xu, Wei Liu, Keyao Ding, Yupeng Lu, Bin Cui, 7 Aug 2025, PSEO: Optimizing Post-hoc Stacking Ensemble Through Hyperparameter Tuning, https://arxiv.org/abs/2508.05144
  • MD Shaikh Rahman, Feiroz Humayara, Syed Maudud E Rabbi, Muhammad Mahbubur Rashid, 6 Aug 2025, Advanced Multi-Architecture Deep Learning Framework for BIRADS-Based Mammographic Image Retrieval: Comprehensive Performance Analysis with Super-Ensemble Optimization, https://arxiv.org/abs/2508.04790
  • Daniil Vlasenko, Vadim Ushakov, Alexey Zaikin, Denis Zakharov, 8 Aug 2025, Ensemble-Based Graph Representation of fMRI Data for Cognitive Brain State Classification, https://arxiv.org/abs/2508.06118
  • Dan MacKinlay, 8 Aug 2025, The Ensemble Kalman Update is an Empirical Matheron Update, https://arxiv.org/abs/2502.03048
  • Keumseo Ryum, Jinu Gong, and Joonhyuk Kang, 12 Aug 2025, SHEFL: Resource-Aware Aggregation and Sparsification in Heterogeneous Ensemble Federated Learning, https://arxiv.org/abs/2508.08552
  • Luigi D'Amico, Daniel De Rosso, Ninad Dixit, Raul Salles de Padua, Samuel Palmer, Samuel Mugel, Román Orús, Holger Eble, and Ali Abedi, 12 Aug 2025, Blockchain Network Analysis using Quantum Inspired Graph Neural Networks & Ensemble Models, https://arxiv.org/abs/2508.09237
  • Md. Milon Islam, Md Rezwanul Haque, S M Taslim Uddin Raju, Fakhri Karray, 12 Aug 2025, FusionEnsemble-Net: An Attention-Based Ensemble of Spatiotemporal Networks for Multimodal Sign Language Recognition, https://arxiv.org/abs/2508.09362
  • Hossein Shokouhinejad, Roozbeh Razavi-Far, Griffin Higgins, Ali A Ghorbani, 13 Aug 2025, Explainable Ensemble Learning for Graph-Based Malware Detection, https://arxiv.org/abs/2508.09801
  • DongSeong-Yoon, 9 Aug 2025, A Cooperative Game-Based Multi-Criteria Weighted Ensemble Approach for Multi-Class Classification, https://arxiv.org/abs/2508.10926
  • Jihang Wang, Dongcheng Zhao, Ruolin Chen, Qian Zhang, Yi Zeng, 15 Aug 2025, Boosting the Robustness-Accuracy Trade-off of SNNs by Robust Temporal Self-Ensemble, https://arxiv.org/abs/2508.11279
  • Jiadong Chen, Xiao He, Hengyu Ye, Fuxin Jiang, Tieying Zhang, Jianjun Chen, Xiaofeng Gao, 18 Aug 2025, Online Ensemble Transformer for Accurate Cloud Workload Forecasting in Predictive Auto-Scaling, https://arxiv.org/abs/2508.12773
  • Dongjae Jeon, Taeheon Kim, Seongwon Cho, Minhyuk Seo, Jonghyun Choi, 18 Aug 2025, TTA-DAME: Test-Time Adaptation with Domain Augmentation and Model Ensemble for Dynamic Driving Conditions, https://arxiv.org/abs/2508.12690
  • Qingyan Meng, Mingqing Xiao, Zhengyu Ma, Huihui Zhou, Yonghong Tian, Zhouchen Lin, 18 Aug 2025, A Self-Ensemble Inspired Approach for Effective Training of Binary-Weight Spiking Neural Networks, https://arxiv.org/abs/2508.12609
  • Tzu-Chieh Chen and Wen-Yang Lin, 17 Aug 2025, On Fusing ChatGPT and Ensemble Learning in Discontinuous Named Entity Recognition in Health Corpora, https://arxiv.org/abs/2412.16976
  • Atsushi Nitanda, Anzelle Lee, Damian Tan Xing Kai, Mizuki Sakaguchi, Taiji Suzuki, 17 Aug 2025, Propagation of Chaos for Mean-Field Langevin Dynamics and its Application to Model Ensemble, https://arxiv.org/abs/2502.05784
  • Yifei Chen, Guanting Dong, Yutao Zhu, Zhicheng Dou, 19 Aug 2025, Revisiting RAG Ensemble: A Theoretical and Mechanistic Analysis of Multi-RAG System Collaboration, https://arxiv.org/abs/2508.13828
  • Hanseul Kang, Shervin Karimkashi, Ville Vuorinen, 13 Aug 2025, Parameter-Aware Ensemble SINDy for Interpretable Symbolic SGS Closure, https://arxiv.org/abs/2508.14085
  • Orestis Konstantaropoulos, Stelios Manolis Smirnakis, Maria Papadopouli, 19 Aug 2025, Neuro-inspired Ensemble-to-Ensemble Communication Primitives for Sparse and Efficient ANNs, https://arxiv.org/abs/2508.14140
  • Dylan Bouchard, Mohit Singh Chauhan, 20 Aug 2025, Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers, https://arxiv.org/abs/2504.19254
  • Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li, Gordon Laurie, 21 Aug 2025, Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset, https://arxiv.org/abs/2508.15986
  • Junying Yang, Gang Lu, Xiaoqing Yan, Peng Xia, Di Wu, 25 Aug 2025, Adaptive Ensemble Learning with Gaussian Copula for Load Forecasting, https://arxiv.org/abs/2508.17700
  • Xinrui He, Yikun Ban, Jiaru Zou, Tianxin Wei, Curtiss B. Cook, Jingrui He, 23 Aug 2025, LLM-Forest: Ensemble Learning of LLMs with Graph-Augmented Prompts for Data Imputation, https://arxiv.org/abs/2410.21520
  • Semih Eren and Deniz Kucukahmetler and Nico Scherf, 23 Jul 2025, Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025), https://arxiv.org/abs/2507.17897
  • Ameya Daigavane, Bodhi P. Vani, Darcy Davidson, Saeed Saremi, Joshua Rackers, Joseph Kleinhenz, 21 Jul 2025, JAMUN: Bridging Smoothed Molecular Dynamics and Score-Based Learning for Conformational Ensembles, https://arxiv.org/abs/2410.14621
  • Brian Liu, Rahul Mazumder, Peter Radchenko, 29 Jul 2025, Extracting Interpretable Models from Tree Ensembles: Computational and Statistical Perspectives, https://arxiv.org/abs/2506.20114
  • Juhyeong Kim, Sungyoon Choi, Youngbin Lee, Yejin Kim, Yongmin Choi and Yongjae Lee, 30 Jul 2025, Decision by Supervised Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization, https://arxiv.org/abs/2503.13544
  • Sahil Bansal, Sai Shruthi Sistla, Aarti Arikatala, Sebastian Schreiber, 7 Aug 2025, Planning Agents on an Ego-Trip: Leveraging Hybrid Ego-Graph Ensembles for Improved Tool Retrieval in Enterprise Task Planning, https://arxiv.org/abs/2508.05888
  • Akshat Dubey, Aleksandar Anžel, Bahar İlgen, Georges Hattab, 13 Aug 2025, UbiQTree: Uncertainty Quantification in XAI with Tree Ensembles, https://arxiv.org/abs/2508.09639
  • Ruipu Li, Daniel Menacho, Alexander Rodríguez, 18 Aug 2025, Adaptive Conformal Prediction Intervals Over Trajectory Ensembles, https://arxiv.org/abs/2508.13362
  • Xinzhu Liang, Joseph M. Lukens, Sanjaya Lohani, Brian T. Kirby, Thomas A. Searles, Xin Qiu, Kody J. H. Law, 21 Aug 2025, Scalable Bayesian Monte Carlo: fast uncertainty estimation beyond deep ensembles, https://arxiv.org/abs/2505.13585
  • Yixuan Sun, Romain Egele, Sri Hari Krishna Narayana, Luke Van Roekel, Carmelo Gonzales, Steven Brus, Balu Nadiga, Sandeep Madireddy, Prasanna Balaprakash, 22 Aug 2025, Ensembles of Neural Surrogates for Parametric Sensitivity in Ocean Modeling, https://arxiv.org/abs/2508.16489
  • Dhruv D. Modi, Rong Pan, 18 Aug 2025, Enhancing Transformer-Based Foundation Models for Time Series Forecasting via Bagging, Boosting and Statistical Ensembles, https://arxiv.org/abs/2508.16641

Deployment: Serving Multiple Cloud Models

When serving AI engines in the cloud, a single server typically hosts multiple models, and it must decide how to allocate incoming queries across those models efficiently. There are some papers on the practical deployment aspects of managing multiple models on a cloud server.
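To make the allocation problem concrete, here is a minimal Python sketch of a least-loaded query router across several hosted models. The model names, latency estimates, and routing policy are illustrative assumptions, not taken from any particular paper or serving framework:

    # Minimal sketch of a query router for a server hosting multiple models.
    # The model names, latency estimates, and least-loaded policy are
    # illustrative assumptions, not any specific serving framework's design.

    import threading
    from dataclasses import dataclass, field

    @dataclass
    class HostedModel:
        name: str
        avg_latency_ms: float          # rough per-query cost estimate
        in_flight: int = 0             # queries currently being processed
        lock: threading.Lock = field(default_factory=threading.Lock)

        def estimated_backlog_ms(self) -> float:
            # Simple load signal: queued work times per-query latency.
            return (self.in_flight + 1) * self.avg_latency_ms

    class LeastLoadedRouter:
        """Route each query to the model with the smallest estimated backlog."""

        def __init__(self, models):
            self.models = models

        def dispatch(self, query: str) -> str:
            target = min(self.models, key=lambda m: m.estimated_backlog_ms())
            with target.lock:
                target.in_flight += 1
            try:
                return self._run_inference(target, query)  # placeholder for a real RPC
            finally:
                with target.lock:
                    target.in_flight -= 1

        def _run_inference(self, model: HostedModel, query: str) -> str:
            # A real deployment would call the model server here (HTTP/gRPC).
            return f"[{model.name}] response to: {query}"

    if __name__ == "__main__":
        router = LeastLoadedRouter([
            HostedModel("small-model", avg_latency_ms=50.0),
            HostedModel("large-model", avg_latency_ms=400.0),
        ])
        print(router.dispatch("What is ensemble learning?"))

Real serving stacks layer batching, autoscaling, and per-model quotas on top of this, but the core decision is the same: pick a model instance per query based on an estimate of its current load.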

Submodels (Many-Models-in-One)

Although most ensemble architectures use multiple distinct models, another approach is to have a single model act as many. This is called "submodels," "many-models-in-one," or "many-in-one models."

Several methods have been tried, including training multiple submodels as part of a larger model, or using cut-down versions of a bigger model as multiple smaller submodels (e.g., early exit gives submodels along the depth dimension, width pruning along the width dimension, and so on). In some such architectures, the same model is simply executed with different settings, such as the meta-parameters controlling early exit or width pruning.
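As a rough illustration of the idea, the following Python sketch runs a single stack of shared weights as different submodels, selected by two meta-parameters: an early-exit depth and a width fraction. The toy layers and parameter names are assumptions for illustration only, not any specific published architecture:

    # Sketch of "many-models-in-one": one stack of layers that yields different
    # submodels depending on two meta-parameters, an early-exit depth and a
    # width fraction. Toy layers; the parameter names are illustrative.

    import numpy as np

    class ManyInOneModel:
        def __init__(self, num_layers=8, hidden=64, rng=None):
            rng = rng or np.random.default_rng(0)
            self.weights = [rng.standard_normal((hidden, hidden)) * 0.1
                            for _ in range(num_layers)]

        def forward(self, x, exit_layer=None, width_fraction=1.0):
            """Run the shared weights as a particular submodel.

            exit_layer     -- stop after this many layers (depth-wise submodel).
            width_fraction -- use only the first fraction of each layer's units
                              (a crude stand-in for width pruning).
            """
            exit_layer = exit_layer or len(self.weights)
            keep = max(1, int(self.weights[0].shape[0] * width_fraction))
            h = x[:keep]
            for W in self.weights[:exit_layer]:
                h = np.tanh(W[:keep, :keep] @ h)
            return h

    model = ManyInOneModel()
    x = np.ones(64)
    full = model.forward(x)                                    # the "big" model
    fast = model.forward(x, exit_layer=4, width_fraction=0.5)  # a smaller submodel

The key point is that no extra weights are stored: the "small" and "big" models are just different execution configurations of the same parameters.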

This idea also appears as a specialization of other optimizations. For example, the self-speculative decoding method has the smaller draft model simply be an early exit of the larger verifier model. This avoids the cost of training two separate models, and it allows computation reuse, because the layers of the big model that were already computed by the small model do not need to be recomputed during verification.
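A hedged sketch of that computation reuse is shown below: the draft step runs only the first few layers and caches its hidden state, and verification resumes from that cached state rather than recomputing the drafted layers. The layer and token details are toy placeholders, not a real decoder, and for simplicity it drafts and verifies one token at a time:

    # Sketch of self-speculative decoding with computation reuse: the draft
    # "model" is simply the first `draft_layers` layers of the full model (an
    # early exit), and verification resumes from the draft's cached hidden
    # state instead of recomputing those layers.

    import numpy as np

    class SelfSpeculativeDecoder:
        def __init__(self, layers, draft_layers, vocab_proj):
            self.layers = layers                # list of weight matrices (toy "layers")
            self.draft_layers = draft_layers    # early-exit depth used for drafting
            self.vocab_proj = vocab_proj        # hidden state -> vocab logits

        def _run(self, h, start, end):
            for W in self.layers[start:end]:
                h = np.tanh(W @ h)
            return h

        def draft_token(self, h_in):
            # Early exit: run only the first few layers and predict cheaply.
            h_draft = self._run(h_in, 0, self.draft_layers)
            token = int(np.argmax(self.vocab_proj @ h_draft))
            return token, h_draft               # cache the hidden state for reuse

        def verify_token(self, h_draft, draft_token):
            # Resume from the cached state; drafted layers are not recomputed.
            h_full = self._run(h_draft, self.draft_layers, len(self.layers))
            verified = int(np.argmax(self.vocab_proj @ h_full))
            return verified == draft_token, verified

    rng = np.random.default_rng(1)
    layers = [rng.standard_normal((32, 32)) * 0.1 for _ in range(8)]
    decoder = SelfSpeculativeDecoder(layers, draft_layers=4,
                                     vocab_proj=rng.standard_normal((100, 32)))
    tok, h = decoder.draft_token(rng.standard_normal(32))
    accepted, final_tok = decoder.verify_token(h, tok)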

Research papers on submodels and many-models-in-one architectures:

  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 2024, MatFormer: Nested Transformer for Elastic Inference https://openreview.net/pdf?id=93BaEweoRg (A method of training one large model, and then extracting many smaller sub-models from that model, using FFNs with a subset of parameters, which if done statically can then be similar to a form of model compression, and elastic inference done dynamically is a type of adaptive inference.)
  • Lei Xun, Jonathon Hare, Geoff V. Merrett, 17 Jan 2024, Dynamic DNNs and Runtime Management for Efficient Inference on Mobile/Embedded Devices, https://arxiv.org/abs/2401.08965
  • Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov, 2024, FLEXTRON: Many-in-One Flexible Large Language Model, https://openreview.net/pdf?id=9vKRhnflAs (Using one model to act in different ways by making it "elastic" with parameters, effectively using slimming via techniques such as layer fusion in MLPs and MHA Attention Heads.)
  • Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 2 Jul 2024, S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models, https://arxiv.org/abs/2407.01955 (Creating, managing and integrating multiple draft models as submodels in speculative decoding.)
  • Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Parsa Kavehzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 1 Jun 2024 (v3), SortedNet: A Scalable and Generalized Framework for Training Modular Deep Neural Networks, https://arxiv.org/abs/2309.00255
  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 11 Oct 2023, MatFormer: Nested Transformer for Elastic Inference, https://arxiv.org/abs/2310.07707
  • Janek Haberer, Ali Hojjat, Olaf Landsiedel, 26 Sep 2024, HydraViT: Stacking Heads for a Scalable ViT, https://arxiv.org/abs/2409.17978 https://github.com/ds-kiel/HydraViT
  • Xuan Shen, Pu Zhao, Yifan Gong, Zhenglun Kong, Zheng Zhan, Yushu Wu, Ming Lin, Chao Wu, Xue Lin, Yanzhi Wang, 25 Sep 2024, Search for Efficient Large Language Models, https://arxiv.org/abs/2409.17372 (Looking for subnets inside models as an alternative to NAS.)
  • Shrenik Bhansali, Alwin Jin, Tyler Lizzo, Larry Heck, 23 Oct 2024, LEGO: Language Model Building Blocks, https://arxiv.org/abs/2410.18287 (Extract small models out of large models.)
  • R Cai, Y Ro, GW Kim, P Wang, BE Bejnordi, A Akella, Oct 2024, Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design, 38th Conference on Neural Information Processing Systems (NeurIPS 2024), https://utns.cs.utexas.edu/assets/papers/neurips24-readme.pdf https://github.com/VITA-Group/READ-ME (Extract multiple smaller MoE expert models from a large LLM.)
  • Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 23 Jan 2017, Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, https://arxiv.org/abs/1701.06538
  • Yan Zhuang, Zhenzhe Zheng, Fan Wu, and Guihai Chen. 2024. LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (SenSys '24). Association for Computing Machinery, New York, NY, USA, 521–534. https://doi.org/10.1145/3666025.3699355 https://dl.acm.org/doi/abs/10.1145/3666025.3699355
  • Umesh Deshpande, Travis Janssen, Mudhakar Srivatsa, and Swaminathan Sundararaman. 2024. MoEsaic: Shared Mixture of Experts. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 434–442. https://doi.org/10.1145/3698038.3698521 https://dl.acm.org/doi/abs/10.1145/3698038.3698521
  • Zehua Pei, Lancheng Zou, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 6 Feb 2025, CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference, https://arxiv.org/abs/2502.04416 https://github.com/JarvisPei/CMoE
  • Gabe Guo, Stefano Ermon, 29 Apr 2025, Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding, https://arxiv.org/abs/2504.20456
  • Andrii Balashov, 23 Jul 2025, Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models, https://arxiv.org/abs/2507.17107
  • Francesco Corti, Balz Maag, Joachim Schauer, Ulrich Pferschy, Olga Saukh, 28 Jul 2025, REDS: Resource-Efficient Deep Subnetworks for Dynamic Resource Constraints, https://arxiv.org/abs/2311.13349

Distributed Inference

Distributed inference is the technique of spreading the inference computation for a single query across multiple servers in different locations. It is a generalization of multi-GPU architectures to multiple distributed servers, each with one or more computation engines that handle part of the inference processing stack.
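As a simple illustration, the sketch below splits a toy layer stack across two "servers" and ships the intermediate activation between them. The servers are plain Python objects and the network hop is simulated with JSON serialization; a real deployment would use an RPC or HTTP call between machines, and the layer split points are arbitrary assumptions:

    # Sketch of distributed (pipeline-style) inference: the layer stack is
    # split across two servers, and the intermediate activation is shipped
    # from one to the next. The network hop is simulated with serialization.

    import json
    import numpy as np

    class LayerShardServer:
        """Holds a contiguous slice of the model's layers."""

        def __init__(self, weights):
            self.weights = weights

        def handle_request(self, payload: str) -> str:
            h = np.array(json.loads(payload))    # deserialize incoming activation
            for W in self.weights:
                h = np.tanh(W @ h)
            return json.dumps(h.tolist())        # serialize for the next hop

    rng = np.random.default_rng(2)
    all_layers = [rng.standard_normal((16, 16)) * 0.1 for _ in range(6)]

    # Split the stack: first half on server A, second half on server B.
    server_a = LayerShardServer(all_layers[:3])
    server_b = LayerShardServer(all_layers[3:])

    x = rng.standard_normal(16)
    wire = json.dumps(x.tolist())                # client sends the input
    wire = server_a.handle_request(wire)         # hop 1: server A's layers
    wire = server_b.handle_request(wire)         # hop 2: server B's layers
    output = np.array(json.loads(wire))

In practice the main engineering challenges are the serialization and network latency of these hand-offs, which is why activations are usually batched and sometimes compressed or quantized before transmission.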

Research papers on distributed inference algorithms:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++ programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: