Big-Little LLM Architectures

  • Last Updated 15 August, 2025
  • by David Spuler, Ph.D.

Research on Big-Little LLM Architectures

Big-little architectures pair a small, fast model with a larger, more accurate one, answering easy queries cheaply with the small model and escalating harder queries to the large model. Research papers include:

  • Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 26 Mar 2024, Tiny Models are the Computational Saver for Large Models, https://arxiv.org/abs/2403.17726v1 (Routes to a tiny or small model after an initial layer of the larger model, combining early exit with easy-hard query routing for multi-model inference.)
  • Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, 2023, Speculative Decoding with Big Little Decoder, https://arxiv.org/abs/2302.07863 https://openreview.net/pdf?id=EfMyf9MC3t Code: https://github.com/kssteven418/BigLittleDecoder (A simplified sketch of this draft-and-verify pattern appears after this list.)
  • Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi, 26 Feb 2024, Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding, https://arxiv.org/abs/2402.16844 (Using a large model to train parallel decoding for a small language model.)
  • Chia-Hsuan Lee, Hao Cheng, Mari Ostendorf, Nov 2023, OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking, https://arxiv.org/abs/2311.09758
  • Zichao Shen, Neil Howard and Jose Nunez-Yanez, 2022, Big–Little Adaptive Neural Networks on Low-Power Near-Subthreshold Processors, J. Low Power Electron. Appl. 2022, 12(2), 28, https://doi.org/10.3390/jlpea12020028 https://www.mdpi.com/2079-9268/12/2/28 Code: https://github.com/DarkSZChao/Big-Little_NN_Strategies
  • David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Kaiyan Zhang, Jianyu Wang, Ning Ding, Biqing Qi, Ermo Hua, Xingtai Lv, Bowen Zhou, 18 Jun 2024, Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding, https://arxiv.org/abs/2406.12295 Code: https://github.com/TsinghuaC3I/FS-GEN
  • Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, Ting Cao, June 2024, Hybrid SLM and LLM for Edge-Cloud Collaborative Inference, EdgeFM ’24, June 3–7, 2024, Minato-ku, Tokyo, Japan, https://dl.acm.org/doi/pdf/10.1145/3662006.3662067 (Small model on edge devices with large model in the cloud, performing collaborative inference.)
  • Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, Mike Lewis, 10 Jul 2023 (v2), Contrastive Decoding: Open-ended Text Generation as Optimization, https://arxiv.org/abs/2210.15097
  • Hyunjong Ok, Jegwang Ryu, Jaeho Lee, 26 Jun 2024, Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher, https://arxiv.org/abs/2406.18002 (Examines when the larger model need not verify every step, and when to trust the smaller versus the larger model, an idea that generalizes beyond speculative decoding.)
  • Aishwarya P S, Pranav Ajit Nair, Yashas Samaga B L, Toby James Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, July 2024, Tandem Transformers for Inference Efficient LLMs, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42906-42917, 2024, https://proceedings.mlr.press/v235/s24a.html
  • Ziheng Wang, Pedro Reviriego, Farzad Niknia, Javier Conde, Shanshan Liu, Fabrizio Lombardi, 26 Aug 2024, Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things, https://arxiv.org/abs/2408.14528 (Runs a small quantized model, then decides whether to also run the full non-quantized model; the first code sketch after this list shows this cascade pattern.)
  • Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah, 22 Apr 2024, Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, ICLR 2024, https://arxiv.org/abs/2404.14618
  • J. Niu, W. Zhang, C. J. Xue and N. Guan, 2024, "RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices," 2024 IEEE 30th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Sokcho, Korea, Republic of, 2024, pp. 21-30, doi: 10.1109/RTCSA62462.2024.00013. https://ieeexplore.ieee.org/abstract/document/10695719
  • Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang, 4 Oct 2024, Mixture of Attentions For Speculative Decoding, https://arxiv.org/abs/2410.03804
  • He Guo, Yulong Wang, Zixuan Ye, Jifeng Dai, Yuwen Xiong, 14 Oct 2024, big.LITTLE Vision Transformer for Efficient Visual Recognition, https://arxiv.org/abs/2410.10267
  • Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, Moin Nabi, 10 Oct 2024, KV Prediction for Improved Time to First Token, https://arxiv.org/abs/2410.08391 https://github.com/apple/corenet/tree/main/projects/kv-prediction (Small model creates an approximation of the KV cache for use by a larger model.)
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Sehoon Kim, Oct 2024, Full Stack Approach for Efficient Deep Learning Inference, Doctor of Philosophy, Computer Science, University of California, Berkeley, https://escholarship.org/content/qt4wf834q8/qt4wf834q8.pdf
  • Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, Federico Tombari, 26 Feb 2025, I Know What I Don't Know: Improving Model Cascades Through Confidence Tuning, https://arxiv.org/abs/2502.19335
  • Yang Liu, Bingjie Yan, Tianyuan Zou, Jianqing Zhang, Zixuan Gu, Jianbing Ding, Xidong Wang, Jingyi Li, Xiaozhou Ye, Ye Ouyang, Qiang Yang, Ya-Qin Zhang, 24 Apr 2025, Towards Harnessing the Collaborative Power of Large and Small Models for Domain Tasks, https://arxiv.org/abs/2504.17421
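
Several of the cascade-style papers above (e.g., FrugalGPT, Hybrid LLM, ARI, and the confidence-tuning work) share one basic control flow: answer with the small model when it is confident, and escalate to the large model otherwise. Below is a minimal Python sketch of that pattern; the model interfaces, the toy models, and the 0.9 threshold are illustrative assumptions, not APIs from any of the papers.

    # Big-little cascade: try the cheap model first, escalate on low confidence.
    from typing import Callable, Tuple

    def cascade_generate(
        prompt: str,
        small_model: Callable[[str], Tuple[str, float]],  # returns (answer, confidence)
        large_model: Callable[[str], str],
        threshold: float = 0.9,  # illustrative confidence cutoff
    ) -> str:
        answer, confidence = small_model(prompt)
        if confidence >= threshold:
            return answer           # cheap path: small model was confident enough
        return large_model(prompt)  # expensive path: escalate to the big model

    # Toy stand-ins so the sketch runs end to end.
    def tiny_llm(prompt: str) -> Tuple[str, float]:
        return ("a short answer", 0.5 if "hard" in prompt else 0.95)

    def big_llm(prompt: str) -> str:
        return "a carefully reasoned answer"

    print(cascade_generate("an easy question", tiny_llm, big_llm))  # small model answers
    print(cascade_generate("a hard question", tiny_llm, big_llm))   # escalates

In practice the confidence signal is the interesting part: raw softmax probabilities are often miscalibrated, which is what cascade-tuning papers such as the Rabanser et al. work above address.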
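
The speculative-decoding entries (Big Little Decoder, Mixture of Attentions, FS-GEN) instead interleave the two models at the token level: the small model drafts a run of tokens and the large model verifies them. The sketch below uses greedy token matching and per-token verification for clarity; real implementations verify the whole draft in one batched forward pass and use probabilistic acceptance rules, and both toy models here are assumptions.

    # Big-little speculative decoding: small model drafts, large model verifies.
    from typing import Callable, List

    def speculative_decode(
        prompt: List[str],
        draft_next: Callable[[List[str], int], List[str]],  # small model: propose k tokens
        target_next: Callable[[List[str]], str],            # large model: one next token
        k: int = 4,
        max_new_tokens: int = 12,
    ) -> List[str]:
        tokens = list(prompt)
        while len(tokens) - len(prompt) < max_new_tokens:
            for tok in draft_next(tokens, k):
                target = target_next(tokens)  # what the large model would emit here
                if tok == target:
                    tokens.append(tok)        # draft token accepted
                else:
                    tokens.append(target)     # first mismatch: keep the large model's token
                    break                     # discard the rest of the draft
                if len(tokens) - len(prompt) >= max_new_tokens:
                    break
        return tokens

    # Toy word-level "models" so the sketch runs end to end.
    VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

    def toy_draft(context: List[str], k: int) -> List[str]:
        return [VOCAB[(len(context) + j) % len(VOCAB)] for j in range(k)]

    def toy_target(context: List[str]) -> str:
        i = len(context)
        return "dog" if i % 5 == 4 else VOCAB[i % len(VOCAB)]  # occasional disagreement

    print(" ".join(speculative_decode(["hello"], toy_draft, toy_target)))

When the drafter's acceptance rate is high, most tokens cost only a small-model forward pass plus amortized verification, which is where the speedup comes from.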

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging
