Aussie AI
Knowledge Distillation Research
Last Updated 30 August, 2025
by David Spuler, Ph.D.
Knowledge Distillation (KD) is a model optimization technique in which a larger pre-trained model (the "teacher") is used to train a smaller, more efficient model (the "student"). When it works well, the result is a small model with faster inference that closely matches the accuracy of the larger model.
Distillation is not technically an ensemble method, because the larger model is not used during inference. Hence, it is not the same as "big-small" dual inference architectures.
Distillation also differs from "fine tuning" or "re-training", which involve extra training on the (large) model, whereas knowledge distillation involves training a new, smaller model from scratch.
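As a rough illustration of how a smaller model can be trained against a larger one, the sketch below follows the common soft-target formulation associated with Hinton et al. (2015): the smaller (student) model is penalized both for diverging from the larger (teacher) model's temperature-softened output distribution and for missing the true label. The function names and the `temperature` and `alpha` parameters are illustrative assumptions, not taken from any particular library, and real training code would compute this loss over batches inside a deep learning framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    temperature and alpha are assumed hyperparameters for this sketch.
    """
    # Soft-target term: match the teacher's softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * soft + (1.0 - alpha) * hard

# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]
print(distillation_loss(close_student, teacher, true_label=0))
print(distillation_loss(far_student, teacher, true_label=0))
```

A student whose logits resemble the teacher's receives a lower loss than one that disagrees with it, which is the signal that drives the transfer. Only the student's parameters would be updated; the teacher is used to produce targets and is not needed at inference time.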
Recent advances in Knowledge Distillation include novel ways to directly transfer the learning, weighting approaches rather than exact probability transfer, and multi-model distillation approaches whereby the smaller student model can gain information from multiple teachers.
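One simple way to picture multi-teacher distillation is to blend the teachers' softened output distributions into a single target and train the student against that blend. The sketch below is a minimal illustration under that assumption; the weighted-average scheme, the function names, and the `weights` and `temperature` parameters are hypothetical choices for this example and are not the method of any specific paper listed on this page.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def combine_teachers(teacher_logits_list, weights, temperature=2.0):
    """Weighted average of the teachers' softened distributions."""
    total_weight = sum(weights)
    probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    num_classes = len(probs[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probs)) / total_weight
        for i in range(num_classes)
    ]

def multi_teacher_soft_loss(student_logits, teacher_logits_list, weights,
                            temperature=2.0):
    """KL divergence from the blended teacher target to the student."""
    target = combine_teachers(teacher_logits_list, weights, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(target, p_student) if t > 0.0)

# Two hypothetical teachers that partly disagree, with the first trusted more.
teachers = [[4.0, 1.0, 0.5], [2.0, 2.5, 0.5]]
weights = [0.7, 0.3]
print(combine_teachers(teachers, weights))
print(multi_teacher_soft_loss([3.0, 1.5, 0.5], teachers, weights))
```

Because the blended target is itself a probability distribution, it can stand in for a single teacher's output in an ordinary soft-target loss. The weights here are fixed by hand; schemes that learn or adapt them are a separate design choice.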
Survey Papers on Knowledge Distillation
Review papers with coverage of KD include:
- Y Tian, S Pei, X Zhang, C Zhang, 2023, Knowledge Distillation on Graphs: A Survey, arXiv preprint, https://arxiv.org/abs/2302.00219
- Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017. https://arxiv.org/abs/1710.09282 (A survey paper from 2017 that includes KD.)
- Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive survey paper on multiple areas, including a section on Knowledge Distillation.)
- Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches including knowledge distillation.)
- Wang L, Yoon KJ. Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks. 2021;44:3048-3068 https://arxiv.org/abs/2004.05937 (Distillation in vision context.)
- Tom Wallace, Naser Ezzati-Jivan, Beatrice Ombuki-Berman, 16 Jan 2025, Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models, https://arxiv.org/abs/2502.00046
Research on Knowledge Distillation
KD is a longstanding and widely used method of optimizing model inference. Research papers on KD include:
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. https://arxiv.org/abs/1503.02531 (The early paper that seems to have coined the name.)
- Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf, Oct 2019 (revised March 2020), DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019), https://arxiv.org/abs/1910.01108
- Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, TinyBERT: Distilling BERT for Natural Language Understanding, arXiv preprint arXiv:1909.10351, Sep 2019 (updated Oct 2020), https://arxiv.org/abs/1909.10351, Code: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT
- Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, arXiv preprint arXiv:2004.02984 (2020), https://arxiv.org/abs/2004.02984
- Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu, Patient Knowledge Distillation for BERT Model Compression, arXiv preprint arXiv:1908.09355 (Aug 2019), https://arxiv.org/abs/1908.09355
- Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MINILMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers, arXiv preprint arXiv:2012.15828, 2020 (revised June 2021), https://arxiv.org/abs/2012.15828
- Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, Jimmy Lin, Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, arXiv preprint arXiv:1903.12136 (Mar 2019), https://arxiv.org/abs/1903.12136
- Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. FastBERT: a self-distilling BERT with adaptive inference time. arXiv preprint arXiv:2004.02178, 2020. https://arxiv.org/abs/2004.02178
- Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU. In Proc. of WNG, 2018. https://www.aclweb.org/anthology/W18-2715
- Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962, https://arxiv.org/abs/1908.08962v1
- Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and Practical BERT Models for Sequence Labeling. arXiv preprint arXiv:1909.00100. https://arxiv.org/abs/1909.00100
- Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D Manning, and Quoc V Le. July 2019. BAM! Born-Again Multi-Task Networks for Natural Language Understanding arXiv preprint arXiv:1907.04829. https://arxiv.org/abs/1907.04829
- Yoon Kim and Alexander M Rush. Sep 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, https://arxiv.org/abs/1606.07947
- Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task. In Proceedings of the Sixth Conference on Machine Translation, pages 795–798, Online. Association for Computational Linguistics, https://aclanthology.org/2021.wmt-1.77/, Code: https://github.com/TenTrans/TenTrans
- Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. 2017. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pages 742–751. https://dl.acm.org/doi/10.5555/3294771.3294842
- Mao, Y.; Wang, Y.; Wu, C.; Zhang, C.; Wang, Y.; Zhang, Q.; Yang, Y.; Tong, Y.; and Bai, J. 2020. LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression. In COLING, 3225–3234. International Committee on Computational Linguistics. https://arxiv.org/abs/2004.04124 (A combination of weight pruning, matrix factorization and knowledge distillation.)
- Lin, S.C.; Yang, J.H.; Lin, J., Distilling dense representations for ranking using tightly-coupled teachers. arXiv preprint arXiv:2010.11386 2020. https://arxiv.org/abs/2010.11386
- Dawei Li, Xiaolong Wang, and Deguang Kong. 2018. DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices. In AAAI’18, https://arxiv.org/abs/1708.04728 (Includes a distinct type of distillation.)
- Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Tina Eliassi-Rad, Lyle H. Ungar, Mark Craven, and Dimitrios Gunopulos (eds.), Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pp. 535–541. ACM, 2006. https://www.semanticscholar.org/paper/Model-compression-Bucila-Caruana/30c9bb327b7f2b9f1d1e5b69b9d0c97b410948d9, PDF: http://www.cs.cornell.edu/~caruana/compression.kdd06.pdf (Early 2006 paper on using one model to teach another, before the technique came to be called "distillation" in 2015.)
- Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. Distilling word embeddings: An encoding approach. In CIKM, pp. 1977–1980. ACM, 2016. https://arxiv.org/abs/1506.04488 (Distillation of embeddings.)
- Canwen Xu, Wangchunshu Zhou, Tao Ge, Ke Xu, Julian J. McAuley, and Furu Wei. Beyond preserved accuracy: Evaluating loyalty and robustness of BERT compression. CoRR, abs/2109.03228, 2021, https://arxiv.org/abs/2109.03228 (Evaluation of the loyalty and robustness of compressed BERT models, beyond preserved accuracy.)
- Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, and Andrew Gordon Wilson. Does knowledge distillation really work? CoRR, abs/2106.05945, 2021, https://arxiv.org/abs/2106.05945 (Evaluation of the efficacy of distillation.)
- Dae Young Park, Moon-Hyun Cha, Changwook Jeong, Daesin Kim, and Bohyung Han. Learning student-friendly teacher networks for knowledge distillation. CoRR, abs/2102.07650, 2021. https://arxiv.org/abs/2102.07650
- Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014 https://arxiv.org/abs/1412.6550
- Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y Code: https://github.com/anilkagak2/DiSK_Distilling_Scaffolded_Knowledge (See Chapter 13: Distilling Selective/Scaffolded Knowledge)
- Zhang, C.; Yang, Y.; Liu, J.; Wang, J.; Xian, Y.; Wang, B.; and Song, D. 2023. Lifting the Curse of Capacity Gap in Distilling Language Models. arXiv:2305.12129. https://arxiv.org/abs/2305.12129
- Chen X, He B, Hui K, Sun L, Sun Y. Simplified Tinybert: Knowledge Distillation for Document Retrieval. 2020. Arxiv preprint, https://arxiv.org/abs/2009.07531
- Tian Y, Krishnan D, Isola P. Contrastive Representation Distillation. 2019. Arxiv Preprint: https://arxiv.org/pdf/1910.10699.pdf
- Do T, Tran H, Do T, Tjiputra E, Tran Q. Compact Trilinear Interaction for Visual Question Answering. In: Proceedings of the IEEE International Conference on Computer Vision. 2019:392-401. https://arxiv.org/abs/1909.11874
- T Chen, S Liu, Z Chen, W Hu, D Chen, Y Wang, Q Lyu, 2023, Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks, https://www.oajaiml.com/uploads/archivepdf/27841181.pdf
- B. Heo, M. Lee, S. Yun and J. Y. Choi, "Knowledge transfer via distillation of activation boundaries formed by hidden neurons", Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 33, no. 1, pp. 3779-3787, 2019, https://arxiv.org/abs/1811.03233
- Guorui Zhou, Ying Fan, Runpeng Cui, Weijie Bian, Xiaoqiang Zhu, Kun Gai, "Rocket launching: A universal and efficient framework for training well-performing light net", Proc. AAAI Conf. Artif. Intell., pp. 1-8, 2018. https://arxiv.org/abs/1708.04106 (Combined training of teacher and student models.)
- A. Chaulwar et al., "Extreme compression of sentence-transformer ranker models: Faster inference longer battery life and less storage on edge devices", arXiv:2207.12852, 2022. https://arxiv.org/abs/2207.12852v1 (Distillation from the point of view of embeddings.)
- T Shen, C Lee, V Narayanan, Oct 2023, Multi-Exit Vision Transformer with Custom Fine-Tuning for Fine-Grained Image Recognition, 2023 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/10222298 (Early exit from multiple places, combined with self-distillation.)
- Z Zhao, Q Liu, H Gui, B An, L Hong, H Chi, Oct 2023, Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication, arXiv preprint arXiv:2310.03188, https://arxiv.org/pdf/2310.03188.pdf
- T Udagawa, A Trivedi, M Merler, B Bhattacharjee, Oct 2023, A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models, arXiv preprint arXiv:2310.08797, https://arxiv.org/abs/2310.08797
- Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal, Oct 2023, DistillSpec: Improving Speculative Decoding via Knowledge Distillation, https://arxiv.org/abs/2310.08461
- Jin Wang, Dawei Liao, You Zhang, Dan Xu, Xuejie Zhang, 2024, Layerwised multimodal knowledge distillation for vision-language pretrained model, Neural Networks Available online 26 March 2024, 106272, https://doi.org/10.1016/j.neunet.2024.106272
- Zao Zhang, 23 May 2024, Design Efficient Deep Neural Networks with System Optimization, Ph.D. Thesis, School of Electrical and Information Engineering, Faculty of Engineering, The University of Sydney, Australia, PDF: https://ses.library.usyd.edu.au/bitstream/handle/2123/32642/zhang_z_thesis.pdf?sequence=1&isAllowed=y https://ses.library.usyd.edu.au/handle/2123/32642 https://hdl.handle.net/2123/32642
- Maolin Wang, Yao Zhao, Jiajia Liu, Jingdong Chen, Chenyi Zhuang, Jinjie Gu, Ruocheng Guo, Xiangyu Zhao, May 2024, Large Multimodal Model Compression via Iterative Efficient Pruning and Distillation, WWW '24: Companion Proceedings of the ACM on Web Conference 2024, Pages 235–244, https://doi.org/10.1145/3589335.3648321
- Maolin Wang, Yao Zhao, Jiajia Liu, Jingdong Chen, Chenyi Zhuang, Jinjie Gu, Ruocheng Guo, Xiangyu Zhao, Dec 2023, Large Multimodal Model Compression via Efficient Pruning and Distillation at AntGroup, https://arxiv.org/abs/2312.05795
- Canwen Xu, 2024, Efficient Natural Language Processing for Language Models, Ph.D. thesis, Computer Science, University of California San Diego, https://escholarship.org/uc/item/9dv1k5xv PDF: https://escholarship.org/content/qt9dv1k5xv/qt9dv1k5xv.pdf?t=sc34ay (Evaluates several acceleration methods including early-exit, PEFT, and distillation.)
- Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
- Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
- Busayo Awobade, Mardiyyah Oduwole, Steven Kolawole, 6 Apr 2024, What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models, https://arxiv.org/abs/2404.04759 (General article showing that the big three of model compression work not just for compressing big LLMs, but also for making small models even smaller.)
- Zuo, G., Zhang, C., Zheng, Z. et al., 2024, Knowledge distillation based on projector integration and classifier sharing. Complex Intell. Syst. (2024). https://doi.org/10.1007/s40747-024-01394-3 https://link.springer.com/article/10.1007/s40747-024-01394-3
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu, Nov 2023, Initializing Models with Larger Ones, https://arxiv.org/abs/2311.18823 Code: https://github.com/OscarXZQ/weight-selection
- Y Liu, Z Lin, F Yuan, 2021, Rosita: Refined bert compression with integrated techniques, The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), https://arxiv.org/abs/2103.11367 Code: https://github.com/llyx97/Rosita (Pruning on multiple dimensions of layer, FFN outputs, and embeddings, also combined with distillation.)
- Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
- Sean Farhat, Deming Chen, 4 Apr 2024, On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small Models, https://arxiv.org/abs/2404.03263
- Rachel Gordon, March 21, 2024, AI generates high-quality images 30 times faster in a single step, MIT News, https://news.mit.edu/2024/ai-generates-high-quality-images-30-times-faster-single-step-0321 (MIT's new image generation framework called "distribution matching distillation" is faster than diffusion models.)
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748 (General survey of various model compression and other inference optimizations.)
- Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi, 26 Feb 2024, Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding, https://arxiv.org/abs/2402.16844 (Using a large model to train parallel decoding for a small language model.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Nathan Brown, Ashton Williamson, Tahj Anderson, Logan Lawrence, 22 Nov 2023, Efficient Transformer Knowledge Distillation: A Performance Review, https://arxiv.org/abs/2311.13657
- Chang Liu, Chongyang Tao, Jianxin Liang, Jiazhan Feng, Tao Shen, Quzhe Huang, Dongyan Zhao, 2023, Length-Adaptive Distillation: Customizing Small Language Model for Dynamic Token Pruning, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4452–4463, December 6-10, 2023, https://aclanthology.org/2023.findings-emnlp.294.pdf (Explores combining static model compression via knowledge distillation with dynamic adaptive inference via token pruning. This creates a modified distillation algorithm that prepares the model for token pruning during training.)
- QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen, May 2023, https://arxiv.org/abs/2210.17114 (Intel Labs paper. Low-bit quantization, distillation, and the Length-Adaptive Transformer (LAT) technique.)
- Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- Erik Pettersson, Sep 2023, Knowledge distillation for anomaly detection, Master's Thesis, Faculty of Science and Technology, Uppsala University, https://www.diva-portal.org/smash/get/diva2:1805667/FULLTEXT01.pdf
- Ba, J. and Caruana, R. (2014). Do deep nets really need to be deep? In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K., editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc. URL: https://proceedings.neurips.cc/paper/2014/file/ea8fcd92d59581717e06eb187f10666d-Paper.pdf
- Bucila, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, page 535–541, New York, NY, USA. Association for Computing Machinery. URL: https://doi.org/10.1145/1150402.1150464.
- Zeng, X. and Martinez, T. R. (2000). Using a neural network to approximate an ensemble of classifiers. In Neural Processing Letters, 2000. URL: https://doi.org/10.1023/A:1026530200837
- K Nan, S Liu, J Du, H Liu, 2019, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762 PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
- David Spuler, March 2024, Chapter 45. Knowledge Distillation, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- A Gudibande, E Wallace, C Snell, X Geng, H Liu 2023, The false promise of imitating proprietary llms, https://arxiv.org/abs/2305.15717
- Y Wang, W Zhong, L Li, F Mi, X Zeng, W Huang 2023, Aligning large language models with human: A survey, https://arxiv.org/abs/2307.12966
- Y Gu, L Dong, F Wei, M Huang, 2023, Knowledge Distillation of Large Language Models, https://arxiv.org/abs/2306.08543
- X Wan, R Sun, H Dai, SO Arik, T Pfister, 2023, Better zero-shot reasoning with self-adaptive prompting, https://arxiv.org/abs/2305.14106
- S Horawalavithana, S Munikoti, I Stewart, 2023, SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions, https://arxiv.org/abs/2307.01139
- X Daull, P Bellot, E Bruno, V Martin, 2023, Complex QA and language models hybrid architectures, Survey, https://arxiv.org/abs/2302.09051
- K Wu, J Zhang, H Peng, M Liu, B Xiao, J Fu, 2022, Tinyvit: Fast pretraining distillation for small vision transformers, https://arxiv.org/pdf/2207.10666.pdf
- S Norouzi, R Hosseinzadeh, F Perez, 2023, DiMS: Distilling Multiple Steps of Iterative Non-Autoregressive Transformers for Machine Translation, https://aclanthology.org/2023.findings-acl.542/
- Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, Tie-yan Liu, 6 Jul 2023 (v2), A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond, https://arxiv.org/pdf/2204.09269.pdf
- Asit Mishra and Debbie Marr. 2017. Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy. arXiv:1711.05852, Nov. 2017. http://arxiv.org/abs/1711.05852
- Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Louie Peters, Aug 27, 2024, Two Paths to Small LMs? Synthetic Data (Phi 3.5) vs Pruning & Distillation (Llama-3.1-Minitron), https://newsletter.towardsai.net/p/114-two-paths-to-small-lms-synthetic
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- Meta, August 14, 2024, How NVIDIA is using structured weight pruning and knowledge distillation to build new Llama models, Meta AI Blog, https://ai.meta.com/blog/nvidia-llama/
- Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 19 Jul 2024, Compact Language Models via Pruning and Knowledge Distillation, https://arxiv.org/abs/2407.14679 https://github.com/NVlabs/Minitron (Combination of distillation and structured pruning on the depth and width dimensions.)
- Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 26 Aug 2024 (v2), LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796
- Sharath Sreenivas, Vinh Nguyen, Saurav Muralidharan, Marcin Chochowski and Raviraj Joshi, How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model, Aug 14, 2024, https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/
- Siru Ouyang, Shuohang Wang, Minhao Jiang, Ming Zhong, Donghan Yu, Jiawei Han, Yelong Shen, 14 Oct 2024, Temperature-Centric Investigation of Speculative Decoding with Knowledge Distillation, https://arxiv.org/abs/2410.10141 https://github.com/ozyyshr/TempSpec
- Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
- Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques.)
- Shen, J., Liu, Y., Jiang, Y., Chen, Y., Han, W. (2025). Model-Agnostic Knowledge Distillation Between Heterogeneous Models. In: Wong, D.F., Wei, Z., Yang, M. (eds) Natural Language Processing and Chinese Computing. NLPCC 2024. Lecture Notes in Computer Science(), vol 15359. Springer, Singapore. https://doi.org/10.1007/978-981-97-9431-7_19 https://link.springer.com/chapter/10.1007/978-981-97-9431-7_19
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Thanaphon Suwannaphong, Ferdian Jovan, Ian Craddock, Ryan McConville, 12 Dec 2024, Optimising TinyML with Quantization and Distillation of Transformer and Mamba Models for Indoor Localisation on Edge Devices, https://arxiv.org/abs/2412.09289
- Dongting Hu, Jierun Chen, Xijie Huang, Huseyin Coskun, Arpit Sahni, Aarush Gupta, Anujraaj Goyal, Dishani Lahiri, Rajesh Singh, Yerlan Idelbayev, Junli Cao, Yanyu Li, Kwang-Ting Cheng, S.-H. Gary Chan, Mingming Gong, Sergey Tulyakov, Anil Kag, Yanwu Xu, Jian Ren, 12 Dec 2024, SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training, https://arxiv.org/abs/2412.09619
- Guiyu Li, Shang Zheng, Haitao Zou, Hualong Yu, Shang Gao, 2024, Model compression through distillation with cross-layer integrated guidance at word level, Neurocomputing, 129162, ISSN 0925-2312, https://doi.org/10.1016/j.neucom.2024.129162. https://www.sciencedirect.com/science/article/abs/pii/S0925231224019337
- Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
- Giordano d'Aloisio, Luca Traini, Federica Sarro, Antinisca Di Marco, 18 Dec 2024, On the Compression of Language Models for Code: An Empirical Study on CodeBERT, https://arxiv.org/abs/2412.13737 (Quantization, pruning and distillation on code generation models.)
- Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, Stuart Shieber, 2 Nov 2023, Implicit Chain of Thought Reasoning via Knowledge Distillation, https://arxiv.org/abs/2311.01460 (Knowledge distillation applied to optimizing the interim computations in Chain-of-Thought.)
- Kalle Kujanpää, Harri Valpola, Alexander Ilin, 19 Dec 2024, Knowledge Injection via Prompt Distillation, https://arxiv.org/abs/2412.14964
- Yifan Yu, Yu Gan, Lily Tasi, Nikhil Sarda, Jiaming Shen, Yanqi Zhou, Arvind Krishnamurthy, Fan Lai, Henry M. Levy, David Culler, 22 Jan 2025, EchoLM: Accelerating LLM Serving with Real-time Knowledge Distillation, https://arxiv.org/abs/2501.12689 (Uses a semantic cache to prepend previously computed answers from similar queries as prompt examples, improving the final result from a smaller LLM.)
- DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z.F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, et al. (100+ additional authors not shown), 22 Jan 2025, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948 (The DeepSeek R1 large reasoning model.)
- Sebastian Raschka, PhD, Feb 05, 2025, Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
- Cristian Leo, Feb 2025, How to Distill a LLM: Step-by-step. The Google Paper that started efficient LLM distillation. Let’s explore how it works, the math behind this technique, and how to implement it with code. https://medium.com/data-science-collective/how-to-distill-a-llm-step-by-step-58f06fcf4bfa
- Jasmine Wu, Deirdre Bosa, Feb 21 2025, How DeepSeek used distillation to train its artificial intelligence model, and what it means for companies such as OpenAI, https://www.cnbc.com/2025/02/21/deepseek-trained-ai-model-using-distillation-now-a-disruptive-force.html
- Kayhan Behdin, Yun Dai, Ata Fatahibaarzi, Aman Gupta, Qingquan Song, Shao Tang, Hejian Sang, Gregory Dexter, Sirou Zhu, Siyu Zhu, Tejas Dharamsi, Maziar Sanjabi, Vignesh Kothapalli, Hamed Firooz, Zhoutong Fu, Yihan Cao, Pin-Lun Hsu, Fedor Borisyuk, Zhipeng Wang, Rahul Mazumder, Natesh Pillai, Luke Simon, 20 Feb 2025, Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications, https://arxiv.org/abs/2502.14305 (Deploying small models for efficiency via distillation and quantization/pruning.)
- Shaibal Saha, Lanyu Xu, 26 Feb 2025, Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies, https://arxiv.org/abs/2503.02891
- Yuichiro Hoshino, Hideyuki Tachibana, Muneyoshi Inahara, Hiroto Takegawa, 28 May 2025, RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding, https://arxiv.org/abs/2505.22135
- Michael List, July 2025, Distillation for Efficient History Compression in Reinforcement Learning, Master’s Thesis, https://epub.jku.at/obvulihs/content/titleinfo/12295461/full.pdf
- Youping Gu, Xiaolong Li, Yuhao Hu, Bohan Zhuang, 14 Aug 2025, Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation, https://arxiv.org/abs/2508.10774
- Juntao Lin, Xianghao Zhan, 22 Jul 2025, Sensor Drift Compensation in Electronic-Nose-Based Gas Recognition Using Knowledge Distillation, https://arxiv.org/abs/2507.17071
- Youneng Bao, Yiping Liu, Zhuo Chen, Yongsheng Liang, Mu Li, Kede Ma, 23 Jul 2025, Dataset Distillation as Data Compression: A Rate-Utility Perspective, https://arxiv.org/abs/2507.17221
- Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu, 23 Jul 2025, AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation, https://arxiv.org/abs/2503.02832
- Cairong Zhao, Yufeng Jin, Zifan Song, Haonan Chen, Duoqian Miao, Guosheng Hu, 22 Jul 2025, Cross-Modal Distillation For Widely Differing Modalities, https://arxiv.org/abs/2507.16296
- Norah Alballa, Ahmed M. Abdelmoniem, Marco Canini, 22 Jul 2025, Practical Insights into Knowledge Distillation for Pre-Trained Models, https://arxiv.org/abs/2402.14922
- Yuki Kadokawa, Hirotaka Tahara, Takamitsu Matsubara, 22 Jul 2025, Progressive-Resolution Policy Distillation: Leveraging Coarse-Resolution Simulations for Time-Efficient Fine-Resolution Policy Learning, https://arxiv.org/abs/2412.07477
- Lakshmana Sri Harsha Nemani, P.K. Srijith, Tomasz Kuśmierczyk, 24 Jul 2025, Efficient Uncertainty in LLMs through Evidential Knowledge Distillation, https://arxiv.org/abs/2507.18366
- Magnus Bengtsson, Kenneth Östberg, 24 Jul 2025, C2G-KD: PCA-Constrained Generator for Data-Free Knowledge Distillation, https://arxiv.org/abs/2507.18533
- Zhen Han, Mattias Teye, Derek Yadgaroff, Judith Bütepage, 24 Jul 2025, Tiny is not small enough: High-quality, low-resource facial animation models through hybrid knowledge distillation, https://arxiv.org/abs/2507.18352
- Anshumann, Mohd Abbas Zaidi, Akhil Kedia, Jinwoo Ahn, Taehwak Kwon, Kangwook Lee, Haejun Lee, Joohyung Lee, 24 Jul 2025, Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs, https://arxiv.org/abs/2503.16870
- Jiequan Cui, Beier Zhu, Qingshan Xu, Xiaogang Xu, Pengguang Chen, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong, 19 Jul 2025, Generative Distribution Distillation, https://arxiv.org/abs/2507.14503
- Zihao Hu, Jia Yan, Ying-Jun Angela Zhang, Jun Zhang, Khaled B. Letaief, 21 Jul 2025, Optimal Transceiver Design in Over-the-Air Federated Distillation, https://arxiv.org/abs/2507.15256
- Songming Zhang, Yuxiao Luo, Ziyu Lyu, Xiaofeng Chen, 19 Jul 2025, ShiftKD: Benchmarking Knowledge Distillation under Distribution Shift, https://arxiv.org/abs/2312.16242
- Yuqi Li, Chuanguang Yang, Hansheng Zeng, Zeyu Dong, Zhulin An, Yongjun Xu, Yingli Tian, Hao Wu, 20 Jul 2025, Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting, https://arxiv.org/abs/2507.02939
- Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang, 21 Jul 2025, Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents, https://arxiv.org/abs/2502.19545
- Changli Wang, Fang Yin, Jiafeng Liu, Rui Wu, 20 Jul 2025, HMID-Net: An Exploration of Masked Image Modeling and Knowledge Distillation in Hyperbolic Space, https://arxiv.org/abs/2507.09487
- Hayeon Kim, Ji Ha Jang and Se Young Chun, 21 Jul 2025, Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling, https://arxiv.org/abs/2507.11061
- Aswin RRV, Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar, Chitta Baral, Ben Zhou, 11 Aug 2025, ThinkTuning: Instilling Cognitive Reflections without Distillation, https://arxiv.org/abs/2508.07616
- Ziqi Zhang, Ali Shahin Shamsabadi, Hanxiao Lu, Yifeng Cai, Hamed Haddadi, 9 Aug 2025, Membership and Memorization in LLM Knowledge Distillation, https://arxiv.org/abs/2508.07054
- Christos Tsirigotis, Vaibhav Adlakha, Joao Monteiro, Aaron Courville, Perouz Taslakian, 9 Aug 2025, BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation, https://arxiv.org/abs/2508.06781
- Zihao Hu, Jia Yan, Ying-Jun Angela Zhang, 6 Aug 2025, Communication-Learning Co-Design for Differentially Private Over-the-Air Federated Distillation, https://arxiv.org/abs/2508.06557
- Simon Baur, Alexandra Benova, Emilio Dolgener Cantú, Jackie Ma, 6 Aug 2025, On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications, https://arxiv.org/abs/2508.06558
- Deepon Halder, Thanmay Jayakumar, Raj Dabre, 9 Aug 2025, CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation, https://arxiv.org/abs/2506.19952
- Robert Frenken, Sidra Ghayour Bhatti, Hanqin Zhang, Qadeer Ahmed, 25 Jul 2025, KD-GAT: Combining Knowledge Distillation and Graph Attention Transformer for a Controller Area Network Intrusion Detection System, https://arxiv.org/abs/2507.19686
- Yang Zhao, Shusheng Li, Xueshang Feng, 28 Jul 2025, Lightweight Remote Sensing Scene Classification on Edge Devices via Knowledge Distillation and Early-exit, https://arxiv.org/abs/2507.20623
- Joey Chan, Zhen Chen, Ershun Pan, 27 Jul 2025, Foundation Models Knowledge Distillation For Battery Capacity Degradation Forecast, https://arxiv.org/abs/2505.08151
- Ren Zhuang, Ben Wang, Shuifa Sun, 25 Jul 2025, AGORA: Incentivizing Group Emergence Capability in LLMs via Group Distillation, https://arxiv.org/abs/2507.21166
- Siddhartha Pradhan, Shikshya Shiwakoti, Neha Bathuri, 29 Jul 2025, Teach Me to Trick: Exploring Adversarial Transferability via Knowledge Distillation, https://arxiv.org/abs/2507.21992
- Sheng-Feng Yu, Jia-Jiun Yao, and Wei-Chen Chiu, 29 Jul 2025, Boost Self-Supervised Dataset Distillation via Parameterization, Predefined Augmentation, and Approximation, https://arxiv.org/abs/2507.21455
- Giovanni Dispoto, Paolo Bonetti, Marcello Restelli, 29 Jul 2025, "So, Tell Me About Your Policy...": Distillation of interpretable policies from Deep Reinforcement Learning agents, https://arxiv.org/abs/2507.07848
- Guopeng Li, Qiang Wang, Ke Yan, Shouhong Ding, Yuan Gao, Gui-Song Xia, 29 Jul 2025, Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation, https://arxiv.org/abs/2410.12342
- Kunyang Li, Jeffrey A Chan Santiago, Sarinda Dhanesh Samarasinghe, Gaowen Liu, Mubarak Shah, 30 Jul 2025, GVD: Guiding Video Diffusion Model for Scalable Video Distillation, https://arxiv.org/abs/2507.22360
- Wenchao Gu, Zongyi Lyu, Yanlin Wang, Hongyu Zhang, Cuiyun Gao, Michael R. Lyu, 1 Aug 2025, SPENCER: Self-Adaptive Model Distillation for Efficient Code Retrieval, https://arxiv.org/abs/2508.00546
- Zhen Wu, Ritam Dutt, Luke M. Breitfeller, Armineh Nourbakhsh, Siddharth Parekh, Carolyn Rosé, 2 Aug 2025, $R^2$-CoD: Understanding Text-Graph Complementarity in Relational Reasoning via Knowledge Co-Distillation, https://arxiv.org/abs/2508.01475
- Kotaro Yoshida, Yuji Naraki, Takafumi Horie, Ryotaro Shimizu, Hiroki Naganuma, 2 Aug 2025, DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging, https://arxiv.org/abs/2508.01148
- Hung-Chieh Fang, Hsuan-Tien Lin, Irwin King, Yifei Zhang, 2 Aug 2025, Soft Separation and Distillation: Toward Global Uniformity in Federated Unsupervised Learning, https://arxiv.org/abs/2508.01251
- Kuiyuan Ding, Caili Guo, Yang Yang, Zhongtian Du, and Walid Saad, 4 Aug 2025, Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation, https://arxiv.org/abs/2508.02148
- Runkai Zheng, Vishnu Asutosh Dasu, Yinong Oliver Wang, Haohan Wang, Fernando De la Torre, 3 Aug 2025, Improving Noise Efficiency in Privacy-preserving Dataset Distillation, https://arxiv.org/abs/2508.01749
- Tobias Jülg, Wolfram Burgard, Florian Walter, 4 Aug 2025, Refined Policy Distillation: From VLA Generalists to RL Experts, https://arxiv.org/abs/2503.05833
- Chaoyang Gao, Xiang Chen, Jiyu Wang, Jibin Wang, Guang Yang, 30 Jul 2025, Resource-Efficient Automatic Software Vulnerability Assessment via Knowledge Distillation and Particle Swarm Optimization, https://arxiv.org/abs/2508.02840
- Jiahui Bai, Hai Dong, A. K. Qin, 5 Aug 2025, On the Fast Adaptation of Delayed Clients in Decentralized Federated Learning: A Centroid-Aligned Distillation Approach, https://arxiv.org/abs/2508.02993
- Hyung Gun Chi, Florian Pesce, Wonil Chang, Oggi Rudovic, Arturo Argueta, Stefan Braun, Vineet Garg, Ahmed Hussen Abdelaziz, 4 Aug 2025, Adaptive Knowledge Distillation for Device-Directed Speech Detection, https://arxiv.org/abs/2508.02801
- Jisoo Kim, Wooseok Seo, Junwan Kim, Seungho Park, Sooyeon Park, Youngjae Yu, 5 Aug 2025, V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models, https://arxiv.org/abs/2508.03254
- Seyedhamidreza Mousavi, Seyedali Mousavi and Masoud Daneshtalab, 5 Aug 2025, ProARD: progressive adversarial robustness distillation: provide wide range of robust students, https://arxiv.org/abs/2506.07666
- Ryota Ikeda, 5 Aug 2025, Do GNN-based QEC Decoders Require Classical Knowledge? Evaluating the Efficacy of Knowledge Distillation from MWPM, https://arxiv.org/abs/2508.03782
- Sriram Mandalika, Lalitha V, 6 Aug 2025, CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework, https://arxiv.org/abs/2508.04816
- Haonan Shangguan, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, and Ge Yu, 7 Aug 2025, Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation, https://arxiv.org/abs/2508.05234
- Martin Weyssow, Chengran Yang, Junkai Chen, Ratnadira Widyasari, Ting Zhang, Huihui Huang, Huu Hung Nguyen, Yan Naing Tun, Tan Bui, Yikun Li, Ang Han Wei, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, David Lo, 7 Aug 2025, R2Vul: Learning to Reason about Software Vulnerabilities with Reinforcement Learning and Structured Reasoning Distillation, https://arxiv.org/abs/2504.04699
- Lingyuan Liu, Mengxiang Zhang, 8 Aug 2025, Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models, https://arxiv.org/abs/2508.06135
- Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, Alessandro Sordoni, 8 Aug 2025, Training Plug-n-Play Knowledge Modules with Deep Context Distillation, https://arxiv.org/abs/2503.08727
- Shibin Su, Guoqiang Liang, De Cheng, Shizhou Zhang, Lingyan Ran, Yanning Zhang, 12 Aug 2025, Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL, https://arxiv.org/abs/2508.08677
- Jinlin Xiang, Minho Choi, Yubo Zhang, Zhihao Zhou, Arka Majumdar, Eli Shlizerman, 11 Aug 2025, Neural Tangent Knowledge Distillation for Optical Convolutional Networks, https://arxiv.org/abs/2508.08421
- Kristian Miok, Blaz Škrlj, Daniela Zaharie, and Marko Robnik Šikonja, 30 Jul 2025, TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning, https://arxiv.org/abs/2508.08273
- Xiaojun Wu, Xiaoguang Jiang, Huiyang Li, Jucai Zhai, Dengfeng Liu, Qiaobo Hao, Huang Liu, Zhiguo Yang, Ji Xie, Ninglun Gu, Jin Yang, Kailai Zhang, Yelun Bao, Jun Wang, 13 Aug 2025, Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning, https://arxiv.org/abs/2508.09883
- Aman Anand, Elyas Rashno, Amir Eskandari, Farhana Zulkernine, 12 Aug 2025, Depth-Guided Self-Supervised Human Keypoint Detection via Cross-Modal Distillation, https://arxiv.org/abs/2410.14700
- Van Duc Cuong, Ta Dinh Tam, Tran Duc Chinh and Nguyen Thi Hanh, 10 Aug 2025, FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning, https://arxiv.org/abs/2508.07264
- Durgesh Mishra, Rishabh Uikey, 15 Aug 2025, Unified Knowledge Distillation Framework: Fine-Grained Alignment and Geometric Relationship Preservation for Deep Face Recognition, https://arxiv.org/abs/2508.11376
- Siyamalan Manivannan, 15 Aug 2025, Semi-Supervised Learning with Online Knowledge Distillation for Skin Lesion Classification, https://arxiv.org/abs/2508.11511
- Thinh Dao, Khoa D Doan, Kok-Seng Wong, 14 Aug 2025, Clean-Label Physical Backdoor Attacks with Data Distillation, https://arxiv.org/abs/2407.19203
- Manning Zhu, Songtao Guo, Pengzhan Zhou, Yansong Ning, Chang Han, Dewen Qiao, 18 Aug 2025, FedSODA: Federated Fine-tuning of LLMs via Similarity Group Pruning and Orchestrated Distillation Alignment, https://arxiv.org/abs/2508.12727
- Taeheon Kim, San Kim, Minhyuk Seo, Dongjae Jeon, Wonje Jeong, Jonghyun Choi, 18 Aug 2025, Multi-Level Knowledge Distillation and Dynamic Self-Supervised Learning for Continual Learning, https://arxiv.org/abs/2508.12692
- Xinhe Li, Jiajun Liu, Peng Wang, 18 Aug 2025, Can Large Models Teach Student Models to Solve Mathematical Problems Like Human Beings? A Reasoning Distillation Method via Multi-LoRA Interaction, https://arxiv.org/abs/2508.13037
- Anshul Ahluwalia, Payman Behnam, Rohit Das, Alind Khare, Biswadeep Chakraborty, Pan Li, Alexey Tumanov, 16 Aug 2025, STRIDE: Structure and Embedding Distillation with Attention for Graph Neural Networks, https://arxiv.org/abs/2310.15938
- Nikita Gushchin, David Li, Daniil Selikhanovych, Evgeny Burnaev, Dmitry Baranchuk, Alexander Korotin, 18 Aug 2025, Inverse Bridge Matching Distillation, https://arxiv.org/abs/2502.01362
- Jihyun Lim, Junhyuk Jo, Tuo Zhang, Sunwoo Lee, 17 Aug 2025, Enabling Weak Client Participation via On-device Knowledge Distillation in Heterogenous Federated Learning, https://arxiv.org/abs/2503.11151
- Simardeep Singh, 15 Aug 2025, From Teacher to Student: Tracking Memorization Through Model Distillation, https://arxiv.org/abs/2506.16170
- Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, 6 Aug 2025, Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL, https://arxiv.org/abs/2508.13167
- Yunxiang Yang, Ningning Xu, Jidong J. Yang, 19 Aug 2025, Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference, https://arxiv.org/abs/2508.13439
- Wenhao Li, Xiu Su, Jingyi Wu, Feng Yang, Yang Liu, Yi Chen, Shan You, Chang Xu, 19 Aug 2025, Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation, https://arxiv.org/abs/2507.04680
- Ye Su, Hezhe Qiao, Wei Huang, Lin Chen, 12 Aug 2025, Toward Generalist Semi-supervised Regression via Decoupled Representation Distillation, https://arxiv.org/abs/2508.14082
- Ahmed Mujtaba, Gleb Radchenko, Radu Prodan, Marc Masana, 20 Aug 2025, Federated Distillation on Edge Devices: Efficient Client-Side Filtering for Non-IID Data, https://arxiv.org/abs/2508.14769
- Suleyman Olcay Polat, Poli A. Nemkova, Mark V. Albert, 20 Aug 2025, Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method, https://arxiv.org/abs/2508.14783
- Yichen Li, Xiuying Wang, Wenchao Xu, Haozhao Wang, Yining Qi, Jiahua Dong, Ruixuan Li, 20 Aug 2025, Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning, https://arxiv.org/abs/2507.10348
- Bahri Batuhan Bilecen, Ahmet Berke Gokmen, Furkan Guzelant, Aysegul Dundar, 20 Aug 2025, Identity Preserving 3D Head Stylization with Multiview Score Distillation, https://arxiv.org/abs/2411.13536
- Yifan Zhang, Junhui Hou, 20 Aug 2025, Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?, https://arxiv.org/abs/2412.08973
- Jiacheng Xie, Ziyang Zhang, Biplab Poudel, Congyu Guo, Yang Yu, Guanghui An, Xiaoting Tang, Lening Zhao, Chunhui Xu, Dong Xu, 19 Aug 2025, TOM: An Open-Source Tongue Segmentation Method with Multi-Teacher Distillation and Task-Specific Data Augmentation, https://arxiv.org/abs/2508.14932
- Aqib Nazir Mir, Danish Raza Rizvi, 21 Aug 2025, Explainable Knowledge Distillation for Efficient Medical Image Classification, https://arxiv.org/abs/2508.15251
- Ruiqi Wang, Zezhou Yang, Cuiyun Gao, Xin Xia, Qing Liao, 21 Aug 2025, An Empirical Study of Knowledge Distillation for Code Understanding Tasks, https://arxiv.org/abs/2508.15423
- Nicholas Mehlman, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Kelvin Niu, Alexander H. Miller, Shagun Sodhani, 29 Jul 2025, Scaling and Distilling Transformer Models for sEMG, https://arxiv.org/abs/2507.22094
- Chen Zhang, Qiuchi Li, Dawei Song, Zheyu Ye, Yan Gao, Yan Hu, 30 Jul 2025, Towards the Law of Capacity Gap in Distilling Language Models, https://arxiv.org/abs/2311.07052
- Maciej Mozolewski, Szymon Bobek, Grzegorz J. Nalepa, 3 Aug 2025, From SHAP to Rules: Distilling Expert Knowledge from Post-hoc Model Explanations in Time Series Classification, https://arxiv.org/abs/2508.01687
- Sisuo Lyu, Siru Zhong, Weilin Ruan, Qingxiang Liu, Qingsong Wen, Hui Xiong, Yuxuan Liang, 3 Aug 2025, OccamVTS: Distilling Vision Models to 1% Parameters for Time Series Forecasting, https://arxiv.org/abs/2508.01727
- Diana-Nicoleta Grigore, Neelu Madan, Andreas Mogelmose, Thomas B. Moeslund, Radu Tudor Ionescu, 5 Aug 2025, SlotMatch: Distilling Temporally Consistent Object-Centric Representations for Unsupervised Video Segmentation, https://arxiv.org/abs/2508.03411
- Connor Wilhelm, Dan Ventura, 12 Aug 2025, Distilling Reinforcement Learning into Single-Batch Datasets, https://arxiv.org/abs/2508.09283
- Abdul Matin, Tanjim Bin Faruk, Shrideep Pallickara, Sangmi Lee Pallickara, 13 Aug 2025, HyperKD: Distilling Cross-Spectral Knowledge in Masked Autoencoders via Inverse Domain Shift with Spatial-Aware Masking and Specialized Loss, https://arxiv.org/abs/2508.09453
- Duygu Altinok, 18 Aug 2025, Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts, https://arxiv.org/abs/2508.13376
- Haydn Thomas Jones, Natalie Maus, Josh Magnus Ludan, Maggie Ziyu Huan, Jiaming Liang, Marcelo Der Torossian Torres, Jiatao Liang, Zachary Ives, Yoseph Barash, Cesar de la Fuente-Nunez, Jacob R. Gardner, Mark Yatskar, 14 Aug 2025, A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design, https://arxiv.org/abs/2508.10899
- Max Rehman Linder, 14 Aug 2025, KL-based self-distillation for large language models, https://arxiv.org/abs/2508.15807
- Stephen Ekaputra Limantoro, 22 Aug 2025, Parameter-Free Logit Distillation via Sorting Mechanism, https://arxiv.org/abs/2508.16544
- Rosni Vasu, Chandrayee Basu, Bhavana Dalvi Mishra, Cristina Sarasua, Peter Clark, Abraham Bernstein, 21 Aug 2025, HypER: Literature-grounded Hypothesis Generation and Distillation with Provenance, https://arxiv.org/abs/2506.12937
- Khoi Do, Binh-Son Hua, 21 Aug 2025, Text-to-3D Generation using Jensen-Shannon Score Distillation, https://arxiv.org/abs/2503.10660
- Gousia Habib, Tausifa Jan Saleem, Ishfaq Ahmad Malik, Brejesh Lall, 21 Aug 2025, LIB-KD: Teaching Inductive Bias for Efficient Vision Transformer Distillation and Compression, https://arxiv.org/abs/2310.00369
- Lei Jiang, Wen Ge, Niels Cariou-Kotlarek, Mingxuan Yi, Po-Yu Chen, Lingyi Yang, Francois Buet-Golfouse, Gaurav Mittal, Hao Ni, 23 Aug 2025, Sig-DEG for Distillation: Making Diffusion Models Faster and Lighter, https://arxiv.org/abs/2508.16939
- Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo, 25 Aug 2025, FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation, https://arxiv.org/abs/2508.17868
- Hariprasath Govindarajan, Maciej K. Wozniak, Marvin Klingner, Camille Maurice, B Ravi Kiran, Senthil Yogamani, 25 Aug 2025, CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation, https://arxiv.org/abs/2503.09878
- Xuhui Fan, Zhangkai Wu and Hongyu Wu, 15 Aug 2025, A Survey on Pre-Trained Diffusion Model Distillations, https://arxiv.org/abs/2502.08364
Ensemble Knowledge Distillation (Multi-Model)
Rather than using a single teacher-student pair of models, some research suggests that distilling from multiple teacher models, or using other ensemble distillation techniques, can be even more effective.
- Wenxian Shi, Yuxuan Song, Hao Zhou, Bohan Li, and Lei Li. Learning from deep model via exploring local targets, 2021. https://openreview.net/forum?id=5slGDu_bVc6 (Distillation with multiple teachers)
- Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 5191–5198. AAAI Press, 2020. https://arxiv.org/abs/1902.03393 (multiple teachers)
- Jangho Kim, Minsung Hyun, Inseop Chung, and Nojun Kwak. Feature fusion for online mutual knowledge distillation. In 25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event / Milan, Italy, January 10-15, 2021, pp. 4619–4625. IEEE, 2020. https://arxiv.org/abs/1904.09058 (Ensemble methods for distillation.)
- Inseop Chung, Seonguk Park, Jangho Kim, and Nojun Kwak. Feature-map-level online adversarial knowledge distillation. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 2006–2015. PMLR, 2020. https://arxiv.org/abs/2002.01775 (Multiple teacher models.)
- Defang Chen, Jian-Ping Mei, Can Wang, Yan Feng, and Chun Chen. Online knowledge distillation with diverse peers. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 3430–3437. AAAI Press, 2020. https://arxiv.org/abs/1912.00350 (Ensemble distillation with multiple "peer" teachers.)
- Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormándi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. https://arxiv.org/abs/1804.03235
- Mehdi Rezagholizadeh, Aref Jafari, Puneeth Salad, Pranav Sharma, Ali Saheb Pasand, Ali Ghodsi, Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher, arXiv preprint arXiv:2110.08532, 2021. https://arxiv.org/abs/2110.08532
- Y. Zhang, T. Xiang, T. M. Hospedales and H. Lu, "Deep mutual learning", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 4320-4328, Jun. 2018. https://arxiv.org/abs/1706.00384
- L. Yuan, F. E. Tay, G. Li, T. Wang and J. Feng, "Revisiting knowledge distillation via label smoothing regularization", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3903-3911, Jun. 2020. https://arxiv.org/abs/1909.11723 (Improved learning, and also looks at reverse student-to-teacher learning.)
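One simple way to combine several teachers is to average their softened output distributions and train the student against that average. The sketch below is a minimal illustration under stated assumptions, not the method of any paper listed above: the logits are made up, the teachers are weighted equally by default, and the function names (`softmax`, `ensemble_soft_targets`, `kl_divergence`) are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(teacher_logits, temperature=2.0, weights=None):
    """Weighted average of the teachers' softened distributions."""
    n = len(teacher_logits)
    if weights is None:
        weights = [1.0 / n] * n  # assumption: equal trust in every teacher
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    num_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs))
            for c in range(num_classes)]

def kl_divergence(target, predicted):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log(t / p)
               for t, p in zip(target, predicted) if t > 0.0)

# Hypothetical logits from three teachers for one 3-class example.
teachers = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3], [2.2, 0.4, 0.2]]
student_logits = [1.0, 1.0, 1.0]  # an untrained, uniform student

target = ensemble_soft_targets(teachers, temperature=2.0)
student_probs = softmax(student_logits, temperature=2.0)
loss = kl_divergence(target, student_probs)
print("ensemble target:", target)
print("distillation loss:", loss)
```

Averaging probabilities is only one design choice. Non-uniform `weights` could favor a stronger teacher, and some of the approaches listed above train peers or teacher assistants jointly rather than combining fixed teachers.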
Unnatural Data Set Creation
Training one model on the output of another is not exactly distillation, but it is a widespread practice. Research papers on "unnatural instructions", which omit human curation, include:
- Or Honovich, Thomas Scialom, Omer Levy, Timo Schick, Dec 2022, Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor, https://arxiv.org/abs/2212.09689, https://github.com/orhonovich/unnatural-instructions (Using a model to automatically create a training data set, including automatically creating both instructions and responses.)
Dataset Distillation
The technique of "dataset distillation" borrows the same terminology, but is a different technique from knowledge distillation. The term refers to methods that reduce a training dataset to a smaller derived set of training data, for example to avoid privacy or copyright concerns. The derived dataset is smaller and, in theory, can be used to train a similarly capable model.
Papers on dataset distillation:
- T. Wang, J.-Y. Zhu, A. Torralba and A. A. Efros, "Dataset distillation", arXiv:1811.10959, 2018. https://arxiv.org/abs/1811.10959
- Yu R, Liu S, Wang X, 2023, Dataset Distillation: A Comprehensive Review. https://arxiv.org/abs/2301.07014
- Or Honovich, Thomas Scialom, Omer Levy, Timo Schick, Dec 2022, Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor, https://arxiv.org/abs/2212.09689, https://github.com/orhonovich/unnatural-instructions (Using a model to automatically create a training data set, including automatically creating both instructions and responses.)
- Mami Nagoya, Keiichi Shiohara, Xing Chen, 2017, "A method for reducing the amounts of training samples for developing AI systems", 2017 International Electronics Symposium on Knowledge Creation and Intelligent Computing (IES-KCIC), pp.13-20, 2017. https://ieeexplore.ieee.org/document/8228448
- David Spuler, March 2024, Chapter 45. Knowledge Distillation, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Black Box Knowledge Distillation
Black box knowledge distillation is any type of KD in which the teacher model is treated as a black box: only its outputs (such as generated text) are available, not its weights, logits, or internal activations. Such architectures use the teacher model to create training inputs for the student model, without any access to the teacher's internals.
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
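In a black-box setting, the teacher is typically reached only through a text-in, text-out interface, so the distillation step reduces to collecting its responses as training data for the student. The sketch below illustrates that data-collection loop only. `teacher_generate` is a hypothetical stand-in with canned answers, not a real API, and the student's supervised fine-tuning on the collected pairs is deliberately left out.

```python
def teacher_generate(prompt):
    """Hypothetical stand-in for a remote teacher model's text interface."""
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris",
    }
    return canned.get(prompt, "I don't know.")

def build_distillation_set(prompts, teacher):
    """Query the teacher once per prompt and record its text response."""
    dataset = []
    for prompt in prompts:
        response = teacher(prompt)  # only the output text is observed
        dataset.append({"prompt": prompt, "response": response})
    return dataset

prompts = ["What is 2 + 2?", "Capital of France?"]
distill_set = build_distillation_set(prompts, teacher_generate)
for example in distill_set:
    print(example)
```

Nothing in this loop touches the teacher's weights or logits, which is what makes the approach usable with API-only models. The resulting prompt-response pairs would then serve as ordinary supervised training examples for the student.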
White Box Knowledge Distillation
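White box knowledge distillation is any use of KD where the internals of the teacher model, such as its weights, logits, or intermediate activations, are accessible during training, so that the student's weights can be updated directly against them. For example, such architectures may involve matching the student's output distribution or hidden states to the teacher's, reusing teacher weights directly in the student, or otherwise editing the internal architecture of the student model.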
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
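One widely used white-box setup gives the training loop access to the teacher's raw logits, so the student can be trained to match the teacher's temperature-softened distribution in addition to the ground-truth label. The sketch below is an illustration under assumptions, not the method of any survey listed above: the logits are invented, the `temperature` and `alpha` defaults are arbitrary, and the function names are hypothetical. The soft term is scaled by the squared temperature, a common convention that keeps its gradient magnitude comparable to the hard-label term.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft teacher-matching term with a hard-label term."""
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    soft = sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    soft *= temperature ** 2
    # Standard cross-entropy against the true label at temperature 1.
    hard = -log_softmax(student_logits)[label]
    return alpha * soft + (1.0 - alpha) * hard

# Hypothetical logits for one 3-class example whose true label is class 0.
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.2, 3.0, 0.1]

print("close student loss:", distillation_loss(close_student, teacher_logits, 0))
print("far student loss:", distillation_loss(far_student, teacher_logits, 0))
```

In a real training loop this loss would be computed with an autodiff framework and backpropagated through the student only, with the teacher kept frozen.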
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
More AI Research
Read more about: