Aussie AI

Model Routing

  • Last Updated 29 August, 2025
  • by David Spuler, Ph.D.

Research on Model Routing

Model routing is the technique of dynamically selecting which LLM should handle each query, typically sending easy queries to a small, cheap model and harder queries to a larger, more expensive one, so as to trade off inference cost against output quality.

Research papers include:

  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of adaptive inference optimization techniques, with a focus on image and video processing optimizations for LLMs.)
  • Sean Michael Kerner, September 17, 2024, Model routing: The secret weapon for maximizing AI efficiency in enterprises, https://venturebeat.com/ai/why-accenture-and-martian-see-model-routing-as-key-to-enterprise-ai-success/
  • Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica, 21 Jul 2024 (v3), RouteLLM: Learning to Route LLMs with Preference Data, https://arxiv.org/abs/2406.18665
  • Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah, 22 Apr 2024, Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, ICLR 2024, https://arxiv.org/abs/2404.14618
  • Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay, 12 May 2024, MARS: A Benchmark for Multi-LLM Algorithmic Routing System, ICLR 2024, https://openreview.net/forum?id=C0rs3wM0N8 PDF: https://openreview.net/pdf?id=C0rs3wM0N8
  • Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay, 28 Mar 2024 (v2), RouterBench: A Benchmark for Multi-LLM Routing System, https://arxiv.org/abs/2403.12031 https://github.com/withmartian/routerbench
  • Rinon Gal, Adi Haviv, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Gal Chechik, 2 Oct 2024, ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation, https://arxiv.org/abs/2410.01731 https://comfygen-paper.github.io/
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, Chaoyang He, 23 Oct 2024 (v3), TensorOpera Router: A Multi-Model Router for Efficient LLM Inference, https://arxiv.org/abs/2408.12320
  • Zesen Zhao, Shuowei Jin, Z. Morley Mao, 23 Sep 2024, Eagle: Efficient Training-Free Router for Multi-LLM Inference, https://arxiv.org/abs/2409.15518
  • Tao Feng, Yanzhen Shen, Jiaxuan You, 4 Oct 2024, GraphRouter: A Graph-based Router for LLM Selections, https://arxiv.org/abs/2410.03834 https://github.com/ulab-uiuc/GraphRouter
  • Kaushal Kumar Maurya, KV Aditya Srivatsa, Ekaterina Kochmar, 16 Aug 2024, SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models, https://arxiv.org/abs/2408.08545
  • Quang H. Nguyen, Duy C. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V. Chawla, Khoa D. Doan, 24 Jul 2024 (v2), MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs, https://arxiv.org/abs/2407.10834
  • Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou, 15 Nov 2023, Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models, https://arxiv.org/abs/2311.08692
  • Małgorzata Łazuka, Andreea Anghel, Thomas Parnell, 3 Oct 2024, LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services, https://arxiv.org/abs/2410.02425
  • Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam, 28 Jun 2024 (v4), AutoMix: Automatically Mixing Language Models, https://arxiv.org/abs/2310.12963
  • Josef Pichlmeier, Philipp Ross, Andre Luckow, 8 Oct 2024 (v2), Performance Characterization of Expert Router for Scalable LLM Inference, https://arxiv.org/abs/2404.15153
  • Ou, Anthony C., Feb 2024, Large Language Model Routing with Benchmark Datasets, Master's Thesis, Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, https://dspace.mit.edu/handle/1721.1/153846
  • KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar, 1 May 2024, Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing, https://arxiv.org/abs/2405.00467
  • David Farr, Nico Manzonelli, Iain Cruickshank, Kate Starbird, Jevin West, 16 Oct 2024, LLM Chain Ensembles for Scalable and Accurate Data Annotation, https://arxiv.org/abs/2410.13006
  • Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C.S. Lui, 2 Oct 2024 (v2), Cost-Effective Online Multi-LLM Selection with Versatile Reward Models, https://arxiv.org/abs/2405.16587
  • Arun Shankar, Oct 2024, Designing Cognitive Architectures: Agentic Workflow Patterns from Scratch, https://medium.com/google-cloud/designing-cognitive-architectures-agentic-workflow-patterns-from-scratch-63baa74c54bc
  • Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu, 30 Oct 2024, Accelerated AI Inference via Dynamic Execution Methods, https://arxiv.org/abs/2411.00853
  • Kirill Vasilevski, Dayi Lin, Ahmed Hassan, 14 Nov 2024, Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models, https://arxiv.org/abs/2411.09837
  • AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
  • Yuanshuai Wang, Xingjian Zhang, Jinkun Zhao, Siwei Wen, Peilin Feng, Shuhao Liao, Lei Huang, Wenjun Wu, 5 Dec 2024, Bench-CoE: a Framework for Collaboration of Experts from Benchmark, https://arxiv.org/abs/2412.04167 https://github.com/ZhangXJ199/Bench-CoE
  • Dimitrios Sikeridis, Dennis Ramdass, Pranay Pareek, 12 Dec 2024, PickLLM: Context-Aware RL-Assisted Large Language Model Routing, https://arxiv.org/abs/2412.12170
  • Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
  • Avital Shafran, Roei Schuster, Thomas Ristenpart, Vitaly Shmatikov, 3 Jan 2025, Rerouting LLM Routers, https://arxiv.org/abs/2501.01818
  • Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
  • J. Pichlmeier, P. Ross and A. Luckow, "Performance Characterization of Expert Router for Scalable LLM Inference," 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 2024, pp. 1686-1693, doi: 10.1109/BigData62323.2024.10826121. https://ieeexplore.ieee.org/abstract/document/10826121
  • Clovis Varangot-Reille, Christophe Bouvard, Antoine Gourru, Mathieu Ciancone, Marion Schaeffer, François Jacquenet, 4 Feb 2025 (v2), Doing More with Less -- Implementing Routing Strategies in Large Language Model-Based Systems: An Extended Survey, https://arxiv.org/abs/2502.00409
  • Seamus Somerstep, Felipe Maia Polo, Allysson Flavio Melo de Oliveira, Prattyush Mangal, Mírian Silva, Onkar Bhardwaj, Mikhail Yurochkin, Subha Maity, 5 Feb 2025, CARROT: A Cost Aware Rate Optimal Router, https://arxiv.org/abs/2502.03261
  • Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica, 20 Feb 2025, Optimizing Model Selection for Compound AI Systems, https://arxiv.org/abs/2502.14815
  • Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, Philip S. Yu, 25 Feb 2025, Harnessing Multiple Large Language Models: A Survey on LLM Ensemble, https://arxiv.org/abs/2502.18036 https://github.com/junchenzhi/Awesome-LLM-Ensemble
  • Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, Federico Tombari, 26 Feb 2025, I Know What I Don't Know: Improving Model Cascades Through Confidence Tuning, https://arxiv.org/abs/2502.19335
  • Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen, 9 Feb 2025, MixLLM: Dynamic Routing in Mixed Large Language Models, https://arxiv.org/abs/2502.18482
  • W Zhang, X Ren, Mar 2025, ReM: Sparsify and MoEfy Models with Post-Hoc ReLU Modulation, ICLR 2025 review, https://openreview.net/pdf?id=cizhOu3CZa (Induce activation sparsity for MoE choice in the model router.)
  • Tella Rajashekhar Reddy, Palak, Rohan Gandhi, Anjaly Parayil, Chaojie Zhang, Mike Shepperd, Liangcheng Yu, Jayashree Mohan, Srinivasan Iyengar, Shivkumar Kalyanaraman, Debopam Bhattacherjee, 15 May 2025, AI Greenferencing: Routing AI Inferencing to Green Modular Data Centers with Heron, https://arxiv.org/abs/2505.09989
  • Ben Dickson, July 7, 2025, New 1.5B router model achieves 93% accuracy without costly retraining, https://venturebeat.com/ai/new-1-5b-router-model-achieves-93-accuracy-without-costly-retraining/
  • Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, Sanjiv Kumar, 22 Jul 2025, Universal Model Routing for Efficient LLM Inference, https://arxiv.org/abs/2502.08773
  • Linjiang Cao, Maonan Wang and Xi Xiong, 21 Jul 2025, A Large Language Model-Enhanced Q-learning for Capacitated Vehicle Routing Problem with Time Windows, https://arxiv.org/abs/2505.06178
  • Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Alessandro Sordoni, 9 Aug 2025, A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning, https://arxiv.org/abs/2408.07057
  • Xin He, Junxi Shen, Zhenheng Tang, Xiaowen Chu, Bo Li, Ivor W. Tsang, Yew-Soon Ong, 3 Aug 2025, RouteMark: A Fingerprint for Intellectual Property Attribution in Routing-based Model Merging, https://arxiv.org/abs/2508.01784
  • Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux, and Daniele Vigo, 12 Aug 2025, Hybrid Node-Destroyer Model with Large Neighborhood Search for Solving the Capacitated Vehicle Routing Problem, https://arxiv.org/abs/2508.08659
  • Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux, and Daniele Vigo, 12 Aug 2025, Edge-Selector Model Applied for Local Search Neighborhood for Solving Vehicle Routing Problems, https://arxiv.org/abs/2508.14071
  • Zewei Xin, Qinya Li, Chaoyue Niu, Fan Wu, Guihai Chen, 21 Aug 2025, Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model, https://arxiv.org/abs/2411.13787
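The core idea behind many of the routers surveyed above can be sketched as a query-aware dispatcher: score each incoming query for difficulty, then send easy queries to a cheap model and hard ones to a strong model. The sketch below is a minimal toy illustration; the model names, difficulty signals, and threshold are all hypothetical and not taken from any specific paper listed here.

```python
# Toy query-aware model router (illustrative only; hypothetical
# model names, signals, and threshold).

def difficulty_score(query: str) -> float:
    """Crude proxy for query difficulty: longer queries and queries
    containing reasoning-style keywords are assumed to be harder."""
    words = query.split()
    length_signal = min(len(words) / 50.0, 1.0)
    markers = ("why", "how", "prove", "explain", "compare")
    marker_hits = sum(m in query.lower() for m in markers)
    marker_signal = min(marker_hits / 3.0, 1.0)
    return 0.6 * length_signal + 0.4 * marker_signal

def route(query: str, threshold: float = 0.5) -> str:
    """Route easy queries to a small, cheap model and hard
    queries to a large, expensive model."""
    if difficulty_score(query) >= threshold:
        return "large-model"
    return "small-model"

print(route("What is 2+2?"))  # small-model
print(route("Explain how transformers compute attention and prove "
            "why softmax normalization matters, then compare with "
            "linear attention."))  # large-model
```

Real routers (e.g., the learned routers in RouteLLM or RouterBench) replace this hand-written heuristic with a trained classifier over query features or preference data, but the dispatch structure is the same.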

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging
