Big-Little LLM Architectures

  • Last Updated 15 August, 2025
  • by David Spuler, Ph.D.

Research on Big-Little LLM Architectures

Big-little architectures pair a small, fast model with a larger, more accurate one, answering easy queries cheaply with the small model and escalating harder queries to the large model. Research papers include:

  • Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 26 Mar 2024, Tiny Models are the Computational Saver for Large Models, https://arxiv.org/abs/2403.17726v1 (Routes to a tiny or small model after an initial layer of the larger model, combining early exit with easy-hard query routing for multi-model inference.)
  • Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, 2023, Speculative Decoding with Big Little Decoder, https://arxiv.org/abs/2302.07863 https://openreview.net/pdf?id=EfMyf9MC3t Code: https://github.com/kssteven418/BigLittleDecoder (A simplified sketch of this draft-and-verify pattern appears after this list.)
  • Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi, 26 Feb 2024, Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding, https://arxiv.org/abs/2402.16844 (Using a large model to train parallel decoding for a small language model.)
  • Chia-Hsuan Lee, Hao Cheng, Mari Ostendorf, Nov 2023, OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking, https://arxiv.org/abs/2311.09758
  • Zichao Shen, Neil Howard and Jose Nunez-Yanez, 2022, Big–Little Adaptive Neural Networks on Low-Power Near-Subthreshold Processors, J. Low Power Electron. Appl. 2022, 12(2), 28, https://doi.org/10.3390/jlpea12020028 https://www.mdpi.com/2079-9268/12/2/28 Code: https://github.com/DarkSZChao/Big-Little_NN_Strategies
  • David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Kaiyan Zhang, Jianyu Wang, Ning Ding, Biqing Qi, Ermo Hua, Xingtai Lv, Bowen Zhou, 18 Jun 2024, Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding, https://arxiv.org/abs/2406.12295 Code: https://github.com/TsinghuaC3I/FS-GEN
  • Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, Ting Cao, June 2024, Hybrid SLM and LLM for Edge-Cloud Collaborative Inference, EdgeFM ’24, June 3–7, 2024, Minato-ku, Tokyo, Japan, https://dl.acm.org/doi/pdf/10.1145/3662006.3662067 (Small model on edge devices with large model in the cloud, performing collaborative inference.)
  • Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, Mike Lewis, 10 Jul 2023 (v2), Contrastive Decoding: Open-ended Text Generation as Optimization, https://arxiv.org/abs/2210.15097
  • Hyunjong Ok, Jegwang Ryu, Jaeho Lee, 26 Jun 2024, Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher, https://arxiv.org/abs/2406.18002 (Examines when the larger model need not verify every step, and when to trust the smaller versus the larger model, an idea that generalizes beyond speculative decoding.)
  • Aishwarya P S, Pranav Ajit Nair, Yashas Samaga B L, Toby James Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, July 2024, Tandem Transformers for Inference Efficient LLMs, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42906-42917, 2024, https://proceedings.mlr.press/v235/s24a.html
  • Ziheng Wang, Pedro Reviriego, Farzad Niknia, Javier Conde, Shanshan Liu, Fabrizio Lombardi, 26 Aug 2024, Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things, https://arxiv.org/abs/2408.14528 (Runs a small quantized model, then decides whether to also run the full non-quantized model; the first code sketch after this list shows this cascade pattern.)
  • Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah, 22 Apr 2024, Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, ICLR 2024, https://arxiv.org/abs/2404.14618
  • J. Niu, W. Zhang, C. J. Xue and N. Guan, 2024, "RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices," 2024 IEEE 30th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Sokcho, Korea, Republic of, 2024, pp. 21-30, doi: 10.1109/RTCSA62462.2024.00013. https://ieeexplore.ieee.org/abstract/document/10695719
  • Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang, 4 Oct 2024, Mixture of Attentions For Speculative Decoding, https://arxiv.org/abs/2410.03804
  • He Guo, Yulong Wang, Zixuan Ye, Jifeng Dai, Yuwen Xiong, 14 Oct 2024, big.LITTLE Vision Transformer for Efficient Visual Recognition, https://arxiv.org/abs/2410.10267
  • Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, Moin Nabi, 10 Oct 2024, KV Prediction for Improved Time to First Token, https://arxiv.org/abs/2410.08391 https://github.com/apple/corenet/tree/main/projects/kv-prediction (Small model creates an approximation of the KV cache for use by a larger model.)
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Sehoon Kim, Oct 2024, Full Stack Approach for Efficient Deep Learning Inference, Doctor of Philosophy, Computer Science, University of California, Berkeley, https://escholarship.org/content/qt4wf834q8/qt4wf834q8.pdf
  • Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, Federico Tombari, 26 Feb 2025, I Know What I Don't Know: Improving Model Cascades Through Confidence Tuning, https://arxiv.org/abs/2502.19335
  • Yang Liu, Bingjie Yan, Tianyuan Zou, Jianqing Zhang, Zixuan Gu, Jianbing Ding, Xidong Wang, Jingyi Li, Xiaozhou Ye, Ye Ouyang, Qiang Yang, Ya-Qin Zhang, 24 Apr 2025, Towards Harnessing the Collaborative Power of Large and Small Models for Domain Tasks, https://arxiv.org/abs/2504.17421
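
Several of the cascade-style papers above (e.g., FrugalGPT, Hybrid LLM, ARI, and the confidence-tuning work) share one basic control flow: answer with the small model when it is confident, and escalate to the large model otherwise. Below is a minimal Python sketch of that pattern; the model interfaces, the toy models, and the 0.9 threshold are illustrative assumptions, not APIs from any of the papers.

    # Big-little cascade: try the cheap model first, escalate on low confidence.
    from typing import Callable, Tuple

    def cascade_generate(
        prompt: str,
        small_model: Callable[[str], Tuple[str, float]],  # returns (answer, confidence)
        large_model: Callable[[str], str],
        threshold: float = 0.9,  # illustrative confidence cutoff
    ) -> str:
        answer, confidence = small_model(prompt)
        if confidence >= threshold:
            return answer           # cheap path: small model was confident enough
        return large_model(prompt)  # expensive path: escalate to the big model

    # Toy stand-ins so the sketch runs end to end.
    def tiny_llm(prompt: str) -> Tuple[str, float]:
        return ("a short answer", 0.5 if "hard" in prompt else 0.95)

    def big_llm(prompt: str) -> str:
        return "a carefully reasoned answer"

    print(cascade_generate("an easy question", tiny_llm, big_llm))  # small model answers
    print(cascade_generate("a hard question", tiny_llm, big_llm))   # escalates

In practice the confidence signal is the interesting part: raw softmax probabilities are often miscalibrated, which is what cascade-tuning papers such as the Rabanser et al. work above address.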
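
The speculative-decoding entries (Big Little Decoder, Mixture of Attentions, FS-GEN) instead interleave the two models at the token level: the small model drafts a run of tokens and the large model verifies them. The sketch below uses greedy token matching and per-token verification for clarity; real implementations verify the whole draft in one batched forward pass and use probabilistic acceptance rules, and both toy models here are assumptions.

    # Big-little speculative decoding: small model drafts, large model verifies.
    from typing import Callable, List

    def speculative_decode(
        prompt: List[str],
        draft_next: Callable[[List[str], int], List[str]],  # small model: propose k tokens
        target_next: Callable[[List[str]], str],            # large model: one next token
        k: int = 4,
        max_new_tokens: int = 12,
    ) -> List[str]:
        tokens = list(prompt)
        while len(tokens) - len(prompt) < max_new_tokens:
            for tok in draft_next(tokens, k):
                target = target_next(tokens)  # what the large model would emit here
                if tok == target:
                    tokens.append(tok)        # draft token accepted
                else:
                    tokens.append(target)     # first mismatch: keep the large model's token
                    break                     # discard the rest of the draft
                if len(tokens) - len(prompt) >= max_new_tokens:
                    break
        return tokens

    # Toy word-level "models" so the sketch runs end to end.
    VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

    def toy_draft(context: List[str], k: int) -> List[str]:
        return [VOCAB[(len(context) + j) % len(VOCAB)] for j in range(k)]

    def toy_target(context: List[str]) -> str:
        i = len(context)
        return "dog" if i % 5 == 4 else VOCAB[i % len(VOCAB)]  # occasional disagreement

    print(" ".join(speculative_decode(["hello"], toy_draft, toy_target)))

When the drafter's acceptance rate is high, most tokens cost only a small-model forward pass plus amortized verification, which is where the speedup comes from.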

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging
