Aussie AI

Knowledge Distillation Research

Last Updated 22 October, 2025

by David Spuler, Ph.D.

Knowledge Distillation (KD) is a model optimization technique where a larger pre-trained model (the "teacher") is used to train a smaller, more efficient model (the "student"). When used successfully, the result is a small model with faster inference that closely matches the accuracy of the larger model.
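
As a rough illustration of how a larger model can train a smaller one, the sketch below shows one common formulation of the training loss, following the soft-target approach of the Hinton et al. (2015) paper listed further down. The function names and the temperature and alpha parameters are illustrative assumptions for this example, not a reference implementation from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    alpha and temperature are hypothetical hyperparameter names.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Scaling by T^2 keeps the soft-term gradients comparable across temperatures.
    soft_term = (temperature ** 2) * kl_divergence(soft_teacher, soft_student)
    # Ordinary cross-entropy against the ground-truth label (temperature 1).
    hard_term = -math.log(softmax(student_logits)[true_label])
    return alpha * soft_term + (1.0 - alpha) * hard_term

teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 3.0, 1.0]     # disagrees with the teacher
print(distillation_loss(close_student, teacher, true_label=0))
print(distillation_loss(far_student, teacher, true_label=0))
```

A student whose logits track the teacher's gets a lower loss than one that disagrees, and the soft term is zero when the two sets of logits are identical. In practice this loss would be minimized by gradient descent over the student's weights only; the teacher stays frozen.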

Distillation is not technically an ensemble method, because the larger model is not used during inference. Hence, it is not the same as "big-small" dual inference architectures.

Distillation also differs from "fine tuning" or "re-training", which involve extra training on the (large) model, whereas knowledge distillation involves training a new, smaller model from scratch.

Recent advances in Knowledge Distillation include novel ways to directly transfer the learning, weighting approaches rather than exact probability transfer, and multi-model distillation approaches whereby the smaller student model can gain information from multiple teachers.
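
One simple way to picture multi-teacher distillation is to blend the softened output distributions of several teachers into a single soft target for the student. The sketch below assumes a plain weighted average; the function names, the weights, and the temperature value are hypothetical choices for illustration, and the research papers in this area use a variety of more elaborate weighting schemes.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def combine_teachers(teacher_logits_list, weights=None, temperature=2.0):
    """Blend several teachers' softened distributions into one soft target."""
    n = len(teacher_logits_list)
    if weights is None:
        weights = [1.0 / n] * n  # default assumption: trust every teacher equally
    norm = sum(weights)
    num_classes = len(teacher_logits_list[0])
    target = [0.0] * num_classes
    for logits, w in zip(teacher_logits_list, weights):
        probs = softmax(logits, temperature)
        for i, p in enumerate(probs):
            target[i] += (w / norm) * p
    return target

def soft_cross_entropy(target, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened output against the blended target."""
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(target, student))

teachers = [[3.0, 1.0, 0.2], [2.5, 0.5, 1.5]]
target = combine_teachers(teachers, weights=[0.7, 0.3])
print(target)
print(soft_cross_entropy(target, student_logits=[2.8, 0.9, 0.6]))
```

The weights make it possible to favor a teacher that is more reliable on a given task; setting one weight to zero reduces this to ordinary single-teacher distillation.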

Survey Papers on Knowledge Distillation

Review papers with coverage of KD include:

Y Tian, S Pei, X Zhang, C Zhang, 2023, Knowledge Distillation on Graphs: A Survey, arXiv preprint, https://arxiv.org/abs/2302.00219
Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017. https://arxiv.org/abs/1710.09282 (A survey paper from 2017 that includes KD.)
Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive survey paper on multiple areas, including a section on Knowledge Distillation.)
Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches including knowledge distillation.)
Wang L, Yoon KJ. Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021;44:3048-3068 https://arxiv.org/abs/2004.05937 (Distillation in vision context.)
Tom Wallace, Naser Ezzati-Jivan, Beatrice Ombuki-Berman, 16 Jan 2025, Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models, https://arxiv.org/abs/2502.00046

Research on Knowledge Distillation

KD is a longstanding and widely used method of optimizing model inference. Research papers on KD include:

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. https://arxiv.org/abs/1503.02531 (The early paper that seems to have coined the name.)
Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf, Oct 2019 (revised March 2020), DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019), https://arxiv.org/abs/1910.01108
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, TinyBERT: Distilling BERT for Natural Language Understanding, arXiv preprint arXiv:1909.10351, Sep 2019 (updated Oct 2020), https://arxiv.org/abs/1909.10351, Code: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, arXiv preprint arXiv:2004.02984 (2020), https://arxiv.org/abs/2004.02984
Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu, Patient Knowledge Distillation for BERT Model Compression, arXiv preprint arXiv:1908.09355 (Aug 2019), https://arxiv.org/abs/1908.09355
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MINILMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers, arXiv preprint arXiv:2012.15828, 2020 (revised June 2021), https://arxiv.org/abs/2012.15828
Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, Jimmy Lin, Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, arXiv preprint arXiv:1903.12136 (Mar 2019), https://arxiv.org/abs/1903.12136
Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. FastBERT: a self-distilling BERT with adaptive inference time. arXiv preprint arXiv:2004.02178, 2020. https://arxiv.org/abs/2004.02178
Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU. In Proc. of WNG, 2018. https://www.aclweb.org/anthology/W18-2715
Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962, https://arxiv.org/abs/1908.08962v1
Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and Practical BERT Models for Sequence Labeling. arXiv preprint arXiv:1909.00100. https://arxiv.org/abs/1909.00100
Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D Manning, and Quoc V Le. July 2019. BAM! Born-Again Multi-Task Networks for Natural Language Understanding arXiv preprint arXiv:1907.04829. https://arxiv.org/abs/1907.04829
Yoon Kim and Alexander M Rush. Sep 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, https://arxiv.org/abs/1606.07947
Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task. In Proceedings of the Sixth Conference on Machine Translation, pages 795–798, Online. Association for Computational Linguistics, https://aclanthology.org/2021.wmt-1.77/, Code: https://github.com/TenTrans/TenTrans
Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. 2017. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pages 742–751. https://dl.acm.org/doi/10.5555/3294771.3294842
Mao, Y.; Wang, Y.; Wu, C.; Zhang, C.; Wang, Y.; Zhang, Q.; Yang, Y.; Tong, Y.; and Bai, J. 2020. LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression. In COLING, 3225–3234. International Committee on Computational Linguistics. https://arxiv.org/abs/2004.04124 (A combination of weight pruning, matrix factorization and knowledge distillation.)
Lin, S.C.; Yang, J.H.; Lin, J., Distilling dense representations for ranking using tightly-coupled teachers. arXiv preprint arXiv:2010.11386 2020. https://arxiv.org/abs/2010.11386
Dawei Li, Xiaolong Wang, and Deguang Kong. 2018. DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices. In AAAI’18, https://arxiv.org/abs/1708.04728 (Includes a distinct type of distillation.)
Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Tina Eliassi-Rad, Lyle H. Ungar, Mark Craven, and Dimitrios Gunopulos (eds.), Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pp. 535–541. ACM, 2006. https://www.semanticscholar.org/paper/Model-compression-Bucila-Caruana/30c9bb327b7f2b9f1d1e5b69b9d0c97b410948d9, PDF: http://www.cs.cornell.edu/~caruana/compression.kdd06.pdf (Early 2006 paper on teaching models before it became called "distillation" in 2015.)
Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. Distilling word embeddings: An encoding approach. In CIKM, pp. 1977–1980. ACM, 2016. https://arxiv.org/abs/1506.04488 (Distillation of embeddings.)
Canwen Xu, Wangchunshu Zhou, Tao Ge, Ke Xu, Julian J. McAuley, and Furu Wei. Beyond preserved accuracy: Evaluating loyalty and robustness of BERT compression. CoRR, abs/2109.03228, 2021, https://arxiv.org/abs/2109.03228 (Evaluation of the loyalty and robustness of compressed BERT models.)
Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, and Andrew Gordon Wilson. Does knowledge distillation really work? CoRR, abs/2106.05945, 2021, https://arxiv.org/abs/2106.05945 (Evaluation of the efficacy of distillation.)
Dae Young Park, Moon-Hyun Cha, Changwook Jeong, Daesin Kim, and Bohyung Han. Learning student-friendly teacher networks for knowledge distillation. CoRR, abs/2102.07650, 2021. https://arxiv.org/abs/2102.07650
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014 https://arxiv.org/abs/1412.6550
Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y Code: https://github.com/anilkagak2/DiSK_Distilling_Scaffolded_Knowledge (See Chapter 13: Distilling Selective/Scaffolded Knowledge)
Zhang, C.; Yang, Y.; Liu, J.; Wang, J.; Xian, Y.; Wang, B.; and Song, D. 2023. Lifting the Curse of Capacity Gap in Distilling Language Models. arXiv:2305.12129. https://arxiv.org/abs/2305.12129
Chen X, He B, Hui K, Sun L, Sun Y. Simplified Tinybert: Knowledge Distillation for Document Retrieval. 2020. Arxiv preprint, https://arxiv.org/abs/2009.07531
Tian Y, Krishnan D, Isola P. Contrastive Representation Distillation. 2019. Arxiv Preprint: https://arxiv.org/pdf/1910.10699.pdf
Do T, Tran H, Do T, Tjiputra E, Tran Q. Compact Trilinear Interaction for Visual Question Answering. In: Proceedings of the IEEE International Conference on Computer Vision. 2019:392-401. https://arxiv.org/abs/1909.11874
T Chen, S Liu, Z Chen, W Hu, D Chen, Y Wang, Q Lyu, 2023, Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks, https://www.oajaiml.com/uploads/archivepdf/27841181.pdf
B. Heo, M. Lee, S. Yun and J. Y. Choi, "Knowledge transfer via distillation of activation boundaries formed by hidden neurons", Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 33, no. 1, pp. 3779-3787, 2019, https://arxiv.org/abs/1811.03233
Guorui Zhou, Ying Fan, Runpeng Cui, Weijie Bian, Xiaoqiang Zhu, Kun Gai, "Rocket launching: A universal and efficient framework for training well-performing light net", Proc. AAAI Conf. Artif. Intell., pp. 1-8, 2018. https://arxiv.org/abs/1708.04106 (Combined training of teacher and student models.)
A. Chaulwar et al., "Extreme compression of sentence-transformer ranker models: Faster inference longer battery life and less storage on edge devices", arXiv:2207.12852, 2022. https://arxiv.org/abs/2207.12852v1 (Distillation from the point of view of embeddings.)
T Shen, C Lee, V Narayanan, Oct 2023, Multi-Exit Vision Transformer with Custom Fine-Tuning for Fine-Grained Image Recognition, 2023 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/10222298 (Early exit from multiple places, combined with self-distillation.)
Z Zhao, Q Liu, H Gui, B An, L Hong, H Chi, Oct 2023, Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication, arXiv preprint arXiv:2310.03188, https://arxiv.org/pdf/2310.03188.pdf
T Udagawa, A Trivedi, M Merler, B Bhattacharjee, Oct 2023, A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models, arXiv preprint arXiv:2310.08797, https://arxiv.org/abs/2310.08797
Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal, Oct 2023, DistillSpec: Improving Speculative Decoding via Knowledge Distillation, https://arxiv.org/abs/2310.08461
Jin Wang, Dawei Liao, You Zhang, Dan Xu, Xuejie Zhang, 2024, Layerwised multimodal knowledge distillation for vision-language pretrained model, Neural Networks Available online 26 March 2024, 106272, https://doi.org/10.1016/j.neunet.2024.106272
Zao Zhang, 23 May 2024, Design Efficient Deep Neural Networks with System Optimization, Ph.D. Thesis, School of Electrical and Information Engineering, Faculty of Engineering, The University of Sydney, Australia, PDF: https://ses.library.usyd.edu.au/bitstream/handle/2123/32642/zhang_z_thesis.pdf?sequence=1&isAllowed=y https://ses.library.usyd.edu.au/handle/2123/32642 https://hdl.handle.net/2123/32642
Maolin Wang, Yao Zhao, Jiajia Liu, Jingdong Chen, Chenyi Zhuang, Jinjie Gu, Ruocheng Guo, Xiangyu Zhao, May 2024, Large Multimodal Model Compression via Iterative Efficient Pruning and Distillation, WWW '24: Companion Proceedings of the ACM on Web Conference 2024, Pages 235–244, https://doi.org/10.1145/3589335.3648321
Maolin Wang, Yao Zhao, Jiajia Liu, Jingdong Chen, Chenyi Zhuang, Jinjie Gu, Ruocheng Guo, Xiangyu Zhao, Dec 2023, Large Multimodal Model Compression via Efficient Pruning and Distillation at AntGroup, https://arxiv.org/abs/2312.05795
Canwen Xu, 2024, Efficient Natural Language Processing for Language Models, Ph.D. thesis, Computer Science, UNIVERSITY OF CALIFORNIA SAN DIEGO, PDF: https://escholarship.org/uc/item/9dv1k5xv PDF: https://escholarship.org/content/qt9dv1k5xv/qt9dv1k5xv.pdf?t=sc34ay (Evaluates several acceleration methods including early-exit, PEFT, and distillation.)
Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
Busayo Awobade, Mardiyyah Oduwole, Steven Kolawole, 6 Apr 2024, What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models, https://arxiv.org/abs/2404.04759 (General article showing that the big three of model compression work not just on compressing big LLMs, but also on making small models even smaller.)
Zuo, G., Zhang, C., Zheng, Z. et al., 2024, Knowledge distillation based on projector integration and classifier sharing. Complex Intell. Syst. (2024). https://doi.org/10.1007/s40747-024-01394-3 https://link.springer.com/article/10.1007/s40747-024-01394-3
Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu, Nov 2023, Initializing Models with Larger Ones, https://arxiv.org/abs/2311.18823 Code: https://github.com/OscarXZQ/weight-selection
Y Liu, Z Lin, F Yuan, 2021, Rosita: Refined bert compression with integrated techniques, The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), https://arxiv.org/abs/2103.11367 Code: https://github.com/llyx97/Rosita (Pruning on multiple dimensions of layer, FFN outputs, and embeddings, also combined with distillation.)
Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
Sean Farhat, Deming Chen, 4 Apr 2024, On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small Models, https://arxiv.org/abs/2404.03263
Rachel Gordon, March 21, 2024, AI generates high-quality images 30 times faster in a single step, MIT News, https://news.mit.edu/2024/ai-generates-high-quality-images-30-times-faster-single-step-0321 (MIT's new image generation framework called "distribution matching distillation" is faster than diffusion models.)
Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748 (General survey of various model compression and other inference optimizations.)
Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi, 26 Feb 2024, Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding, https://arxiv.org/abs/2402.16844 (Using a large model to train parallel decoding for a small language model.)
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
Nathan Brown, Ashton Williamson, Tahj Anderson, Logan Lawrence, 22 Nov 2023, Efficient Transformer Knowledge Distillation: A Performance Review, https://arxiv.org/abs/2311.13657
Chang Liu, Chongyang Tao, Jianxin Liang, Jiazhan Feng, Tao Shen, Quzhe Huang, Dongyan Zhao, 2023, Length-Adaptive Distillation: Customizing Small Language Model for Dynamic Token Pruning, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4452–4463, December 6-10, 2023, https://aclanthology.org/2023.findings-emnlp.294.pdf (Explores combining static model compression via knowledge distillation with dynamic adaptive inference via token pruning. This creates a modified distillation algorithm that prepares the model for token pruning during training.)
Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen, May 2023, QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, https://arxiv.org/abs/2210.17114 (Intel Labs paper. Low-bit quantization, distillation, and Length-Adaptive Transformer (LAT) technique.)
Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
Erik Pettersson, Sep 2023, Knowledge distillation for anomaly detection, Master's Thesis, Faculty of Science and Technology, Uppsala University, https://www.diva-portal.org/smash/get/diva2:1805667/FULLTEXT01.pdf
Ba, J. and Caruana, R. (2014). Do deep nets really need to be deep? In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K., editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc. URL: https://proceedings.neurips.cc/paper/2014/file/ea8fcd92d59581717e06eb187f10666d-Paper.pdf
Bucila, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, page 535–541, New York, NY, USA. Association for Computing Machinery. URL: https://doi.org/10.1145/1150402.1150464.
Zeng, X. and Martinez, T. R. (2000). Using a neural network to approximate an ensemble of classifiers. Neural Processing Letters, 2000. URL: https://doi.org/10.1023/A:1026530200837
K Nan, S Liu, J Du, H Liu, 2019, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762 PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
David Spuler, March 2024, Chapter 45. Knowledge Distillation, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
A Gudibande, E Wallace, C Snell, X Geng, H Liu 2023, The false promise of imitating proprietary llms, https://arxiv.org/abs/2305.15717
Y Wang, W Zhong, L Li, F Mi, X Zeng, W Huang 2023, Aligning large language models with human: A survey, https://arxiv.org/abs/2307.12966
Y Gu, L Dong, F Wei, M Huang, 2023, Knowledge Distillation of Large Language Models, https://arxiv.org/abs/2306.08543
X Wan, R Sun, H Dai, SO Arik, T Pfister, 2023, Better zero-shot reasoning with self-adaptive prompting, https://arxiv.org/abs/2305.14106
S Horawalavithana, S Munikoti, I Stewart, 2023, SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions, https://arxiv.org/abs/2307.01139
X Daull, P Bellot, E Bruno, V Martin, 2023, Complex QA and language models hybrid architectures, Survey, https://arxiv.org/abs/2302.09051
K Wu, J Zhang, H Peng, M Liu, B Xiao, J Fu, 2022, TinyViT: Fast pretraining distillation for small vision transformers, https://arxiv.org/pdf/2207.10666.pdf
S Norouzi, R Hosseinzadeh, F Perez, 2023, DiMS: Distilling Multiple Steps of Iterative Non-Autoregressive Transformers for Machine Translation, https://aclanthology.org/2023.findings-acl.542/
Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, Tie-yan Liu, 6 Jul 2023 (v2), A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond, https://arxiv.org/pdf/2204.09269.pdf
Asit Mishra and Debbie Marr. 2017. Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy. arXiv preprint arXiv:1711.05852 (Nov. 2017). http://arxiv.org/abs/1711.05852
Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
Louie Peters, Aug 27, 2024, Two Paths to Small LMs? Synthetic Data (Phi 3.5) vs Pruning & Distillation (Llama-3.1-Minitron), https://newsletter.towardsai.net/p/114-two-paths-to-small-lms-synthetic
Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
Meta, August 14, 2024, How NVIDIA is using structured weight pruning and knowledge distillation to build new Llama models, Meta AI Blog, https://ai.meta.com/blog/nvidia-llama/
Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 19 Jul 2024, Compact Language Models via Pruning and Knowledge Distillation, https://arxiv.org/abs/2407.14679 https://github.com/NVlabs/Minitron (Combination of distillation and structured pruning on the depth and width dimensions.)
Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 26 Aug 2024 (v2), LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796
Sharath Sreenivas, Vinh Nguyen, Saurav Muralidharan, Marcin Chochowski and Raviraj Joshi, How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model, Aug 14, 2024, https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/
Siru Ouyang, Shuohang Wang, Minhao Jiang, Ming Zhong, Donghan Yu, Jiawei Han, Yelong Shen, 14 Oct 2024, Temperature-Centric Investigation of Speculative Decoding with Knowledge Distillation, https://arxiv.org/abs/2410.10141 https://github.com/ozyyshr/TempSpec
Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques.)
Shen, J., Liu, Y., Jiang, Y., Chen, Y., Han, W. (2025). Model-Agnostic Knowledge Distillation Between Heterogeneous Models. In: Wong, D.F., Wei, Z., Yang, M. (eds) Natural Language Processing and Chinese Computing. NLPCC 2024. Lecture Notes in Computer Science, vol 15359. Springer, Singapore. https://doi.org/10.1007/978-981-97-9431-7_19 https://link.springer.com/chapter/10.1007/978-981-97-9431-7_19
Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
Thanaphon Suwannaphong, Ferdian Jovan, Ian Craddock, Ryan McConville, 12 Dec 2024, Optimising TinyML with Quantization and Distillation of Transformer and Mamba Models for Indoor Localisation on Edge Devices, https://arxiv.org/abs/2412.09289
Dongting Hu, Jierun Chen, Xijie Huang, Huseyin Coskun, Arpit Sahni, Aarush Gupta, Anujraaj Goyal, Dishani Lahiri, Rajesh Singh, Yerlan Idelbayev, Junli Cao, Yanyu Li, Kwang-Ting Cheng, S.-H. Gary Chan, Mingming Gong, Sergey Tulyakov, Anil Kag, Yanwu Xu, Jian Ren, 12 Dec 2024, SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training, https://arxiv.org/abs/2412.09619
Guiyu Li, Shang Zheng, Haitao Zou, Hualong Yu, Shang Gao, 2024, Model compression through distillation with cross-layer integrated guidance at word level, Neurocomputing, 129162, ISSN 0925-2312, https://doi.org/10.1016/j.neucom.2024.129162. https://www.sciencedirect.com/science/article/abs/pii/S0925231224019337
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
Giordano d'Aloisio, Luca Traini, Federica Sarro, Antinisca Di Marco, 18 Dec 2024, On the Compression of Language Models for Code: An Empirical Study on CodeBERT, https://arxiv.org/abs/2412.13737 (Quantization, pruning and distillation on code generation models.)
Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, Stuart Shieber, 2 Nov 2023, Implicit Chain of Thought Reasoning via Knowledge Distillation, https://arxiv.org/abs/2311.01460 (Knowledge distillation applied to optimizing the interim computations in Chain-of-Thought.)
Kalle Kujanpää, Harri Valpola, Alexander Ilin, 19 Dec 2024, Knowledge Injection via Prompt Distillation, https://arxiv.org/abs/2412.14964
Yifan Yu, Yu Gan, Lily Tasi, Nikhil Sarda, Jiaming Shen, Yanqi Zhou, Arvind Krishnamurthy, Fan Lai, Henry M. Levy, David Culler, 22 Jan 2025, EchoLM: Accelerating LLM Serving with Real-time Knowledge Distillation, https://arxiv.org/abs/2501.12689 (Using a semantic cache to prepend previously computed answers from similar queries as prompt examples, to improve the final result from a smaller LLM.)
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z.F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, et al. (100+ additional authors not shown), 22 Jan 2025, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948 (The DeepSeek R1 large reasoning model.)
Sebastian Raschka, PhD, Feb 05, 2025, Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
Cristian Leo, Feb 2025, How to Distill a LLM: Step-by-step. The Google Paper that started efficient LLM distillation. Let’s explore how it works, the math behind this technique, and how to implement it with code. https://medium.com/data-science-collective/how-to-distill-a-llm-step-by-step-58f06fcf4bfa
Jasmine Wu, Deirdre Bosa, Feb 21 2025, How DeepSeek used distillation to train its artificial intelligence model, and what it means for companies such as OpenAI, https://www.cnbc.com/2025/02/21/deepseek-trained-ai-model-using-distillation-now-a-disruptive-force.html
Kayhan Behdin, Yun Dai, Ata Fatahibaarzi, Aman Gupta, Qingquan Song, Shao Tang, Hejian Sang, Gregory Dexter, Sirou Zhu, Siyu Zhu, Tejas Dharamsi, Maziar Sanjabi, Vignesh Kothapalli, Hamed Firooz, Zhoutong Fu, Yihan Cao, Pin-Lun Hsu, Fedor Borisyuk, Zhipeng Wang, Rahul Mazumder, Natesh Pillai, Luke Simon, 20 Feb 2025, Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications, https://arxiv.org/abs/2502.14305 (Deploying small models for efficiency via distillation and quantization/pruning.)
Shaibal Saha, Lanyu Xu, 26 Feb 2025, Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies, https://arxiv.org/abs/2503.02891
Yuichiro Hoshino, Hideyuki Tachibana, Muneyoshi Inahara, Hiroto Takegawa, 28 May 2025, RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding, https://arxiv.org/abs/2505.22135
Michael List, July 2025, Distillation for Efficient History Compression in Reinforcement Learning, Master’s Thesis, https://epub.jku.at/obvulihs/content/titleinfo/12295461/full.pdf
Youping Gu, Xiaolong Li, Yuhao Hu, Bohan Zhuang, 14 Aug 2025, Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation, https://arxiv.org/abs/2508.10774
Juntao Lin, Xianghao Zhan, 22 Jul 2025, Sensor Drift Compensation in Electronic-Nose-Based Gas Recognition Using Knowledge Distillation, https://arxiv.org/abs/2507.17071
Youneng Bao, Yiping Liu, Zhuo Chen, Yongsheng Liang, Mu Li, Kede Ma, 23 Jul 2025, Dataset Distillation as Data Compression: A Rate-Utility Perspective, https://arxiv.org/abs/2507.17221
Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu, 23 Jul 2025, AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation, https://arxiv.org/abs/2503.02832
Cairong Zhao, Yufeng Jin, Zifan Song, Haonan Chen, Duoqian Miao, Guosheng Hu, 22 Jul 2025, Cross-Modal Distillation For Widely Differing Modalities, https://arxiv.org/abs/2507.16296
Norah Alballa, Ahmed M. Abdelmoniem, Marco Canini, 22 Jul 2025, Practical Insights into Knowledge Distillation for Pre-Trained Models, https://arxiv.org/abs/2402.14922
Yuki Kadokawa, Hirotaka Tahara, Takamitsu Matsubara, 22 Jul 2025, Progressive-Resolution Policy Distillation: Leveraging Coarse-Resolution Simulations for Time-Efficient Fine-Resolution Policy Learning, https://arxiv.org/abs/2412.07477
Lakshmana Sri Harsha Nemani, P.K. Srijith, Tomasz Kuśmierczyk, 24 Jul 2025, Efficient Uncertainty in LLMs through Evidential Knowledge Distillation, https://arxiv.org/abs/2507.18366
Magnus Bengtsson, Kenneth Östberg, 24 Jul 2025, C2G-KD: PCA-Constrained Generator for Data-Free Knowledge Distillation, https://arxiv.org/abs/2507.18533
Zhen Han, Mattias Teye, Derek Yadgaroff, Judith Bütepage, 24 Jul 2025, Tiny is not small enough: High-quality, low-resource facial animation models through hybrid knowledge distillation, https://arxiv.org/abs/2507.18352
Anshumann, Mohd Abbas Zaidi, Akhil Kedia, Jinwoo Ahn, Taehwak Kwon, Kangwook Lee, Haejun Lee, Joohyung Lee, 24 Jul 2025, Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs, https://arxiv.org/abs/2503.16870
Jiequan Cui, Beier Zhu, Qingshan Xu, Xiaogang Xu, Pengguang Chen, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong, 19 Jul 2025, Generative Distribution Distillation, https://arxiv.org/abs/2507.14503
Zihao Hu (1), Jia Yan (2), Ying-Jun Angela Zhang (1), Jun Zhang (3), Khaled B. Letaief (3) ((1) The Chinese University of Hong Kong, (2) The Hong Kong University of Science and Technology (Guangzhou), (3) The Hong Kong University of Science and Technology), 21 Jul 2025, Optimal Transceiver Design in Over-the-Air Federated Distillation, https://arxiv.org/abs/2507.15256
Songming Zhang and Yuxiao Luo and Ziyu Lyu and Xiaofeng Chen, 19 Jul 2025, ShiftKD: Benchmarking Knowledge Distillation under Distribution Shift, https://arxiv.org/abs/2312.16242
Yuqi Li, Chuanguang Yang, Hansheng Zeng, Zeyu Dong, Zhulin An, Yongjun Xu, Yingli Tian, Hao Wu, 20 Jul 2025, Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting, https://arxiv.org/abs/2507.02939
Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang, 21 Jul 2025, Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents, https://arxiv.org/abs/2502.19545
Changli Wang, Fang Yin, Jiafeng Liu, Rui Wu, 20 Jul 2025, HMID-Net: An Exploration of Masked Image Modeling and Knowledge Distillation in Hyperbolic Space, https://arxiv.org/abs/2507.09487
Hayeon Kim, Ji Ha Jang and Se Young Chun, 21 Jul 2025, Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling, https://arxiv.org/abs/2507.11061
Aswin RRV, Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar, Chitta Baral, Ben Zhou, 11 Aug 2025, ThinkTuning: Instilling Cognitive Reflections without Distillation, https://arxiv.org/abs/2508.07616
Ziqi Zhang, Ali Shahin Shamsabadi, Hanxiao Lu, Yifeng Cai, Hamed Haddadi, 9 Aug 2025, Membership and Memorization in LLM Knowledge Distillation, https://arxiv.org/abs/2508.07054
Christos Tsirigotis, Vaibhav Adlakha, Joao Monteiro, Aaron Courville, Perouz Taslakian, 9 Aug 2025, BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation, https://arxiv.org/abs/2508.06781
Zihao Hu (1), Jia Yan (2), Ying-Jun Angela Zhang (1) ((1) The Chinese University of Hong Kong, (2) The Hong Kong University of Science and Technology (Guangzhou)), 6 Aug 2025, Communication-Learning Co-Design for Differentially Private Over-the-Air Federated Distillation, https://arxiv.org/abs/2508.06557
Simon Baur, Alexandra Benova, Emilio Dolgener Cantú, Jackie Ma, 6 Aug 2025, On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications, https://arxiv.org/abs/2508.06558
Deepon Halder, Thanmay Jayakumar, Raj Dabre, 9 Aug 2025, CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation, https://arxiv.org/abs/2506.19952
Robert Frenken, Sidra Ghayour Bhatti, Hanqin Zhang, Qadeer Ahmed, 25 Jul 2025, KD-GAT: Combining Knowledge Distillation and Graph Attention Transformer for a Controller Area Network Intrusion Detection System, https://arxiv.org/abs/2507.19686
Yang Zhao, Shusheng Li, Xueshang Feng, 28 Jul 2025, Lightweight Remote Sensing Scene Classification on Edge Devices via Knowledge Distillation and Early-exit, https://arxiv.org/abs/2507.20623
Joey Chan, Zhen Chen, Ershun Pan, 27 Jul 2025, Foundation Models Knowledge Distillation For Battery Capacity Degradation Forecast, https://arxiv.org/abs/2505.08151
Ren Zhuang, Ben Wang, Shuifa Sun, 25 Jul 2025, AGORA: Incentivizing Group Emergence Capability in LLMs via Group Distillation, https://arxiv.org/abs/2507.21166
Siddhartha Pradhan, Shikshya Shiwakoti, Neha Bathuri, 29 Jul 2025, Teach Me to Trick: Exploring Adversarial Transferability via Knowledge Distillation, https://arxiv.org/abs/2507.21992
Sheng-Feng Yu, Jia-Jiun Yao, and Wei-Chen Chiu, 29 Jul 2025, Boost Self-Supervised Dataset Distillation via Parameterization, Predefined Augmentation, and Approximation, https://arxiv.org/abs/2507.21455
Giovanni Dispoto, Paolo Bonetti, Marcello Restelli, 29 Jul 2025, "So, Tell Me About Your Policy...": Distillation of interpretable policies from Deep Reinforcement Learning agents, https://arxiv.org/abs/2507.07848
Guopeng Li, Qiang Wang, Ke Yan, Shouhong Ding, Yuan Gao, Gui-Song Xia, 29 Jul 2025, Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation, https://arxiv.org/abs/2410.12342
Kunyang Li, Jeffrey A Chan Santiago, Sarinda Dhanesh Samarasinghe, Gaowen Liu, Mubarak Shah, 30 Jul 2025, GVD: Guiding Video Diffusion Model for Scalable Video Distillation, https://arxiv.org/abs/2507.22360
Wenchao Gu and Zongyi Lyu and Yanlin Wang and Hongyu Zhang and Cuiyun Gao and Michael R. Lyu, 1 Aug 2025, SPENCER: Self-Adaptive Model Distillation for Efficient Code Retrieval, https://arxiv.org/abs/2508.00546
Zhen Wu, Ritam Dutt, Luke M. Breitfeller, Armineh Nourbakhsh, Siddharth Parekh, Carolyn Rosé, 2 Aug 2025, $R^2$-CoD: Understanding Text-Graph Complementarity in Relational Reasoning via Knowledge Co-Distillation, https://arxiv.org/abs/2508.01475
Kotaro Yoshida, Yuji Naraki, Takafumi Horie, Ryotaro Shimizu, Hiroki Naganuma, 2 Aug 2025, DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging, https://arxiv.org/abs/2508.01148
Hung-Chieh Fang, Hsuan-Tien Lin, Irwin King, Yifei Zhang, 2 Aug 2025, Soft Separation and Distillation: Toward Global Uniformity in Federated Unsupervised Learning, https://arxiv.org/abs/2508.01251
Kuiyuan Ding, Caili Guo, Yang Yang, Zhongtian Du, and Walid Saad, 4 Aug 2025, Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation, https://arxiv.org/abs/2508.02148
Runkai Zheng, Vishnu Asutosh Dasu, Yinong Oliver Wang, Haohan Wang, Fernando De la Torre, 3 Aug 2025, Improving Noise Efficiency in Privacy-preserving Dataset Distillation, https://arxiv.org/abs/2508.01749
Tobias Jülg, Wolfram Burgard, Florian Walter, 4 Aug 2025, Refined Policy Distillation: From VLA Generalists to RL Experts, https://arxiv.org/abs/2503.05833
Chaoyang Gao, Xiang Chen, Jiyu Wang, Jibin Wang, Guang Yang, 30 Jul 2025, Resource-Efficient Automatic Software Vulnerability Assessment via Knowledge Distillation and Particle Swarm Optimization, https://arxiv.org/abs/2508.02840
Jiahui Bai, Hai Dong, A. K. Qin, 5 Aug 2025, On the Fast Adaptation of Delayed Clients in Decentralized Federated Learning: A Centroid-Aligned Distillation Approach, https://arxiv.org/abs/2508.02993
Hyung Gun Chi, Florian Pesce, Wonil Chang, Oggi Rudovic, Arturo Argueta, Stefan Braun, Vineet Garg, Ahmed Hussen Abdelaziz, 4 Aug 2025, Adaptive Knowledge Distillation for Device-Directed Speech Detection, https://arxiv.org/abs/2508.02801
Jisoo Kim, Wooseok Seo, Junwan Kim, Seungho Park, Sooyeon Park, Youngjae Yu, 5 Aug 2025, V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models, https://arxiv.org/abs/2508.03254
Seyedhamidreza Mousavi, Seyedali Mousavi and Masoud Daneshtalab, 5 Aug 2025, ProARD: progressive adversarial robustness distillation: provide wide range of robust students, https://arxiv.org/abs/2506.07666
Ryota Ikeda, 5 Aug 2025, Do GNN-based QEC Decoders Require Classical Knowledge? Evaluating the Efficacy of Knowledge Distillation from MWPM, https://arxiv.org/abs/2508.03782
Sriram Mandalika, Lalitha V, 6 Aug 2025, CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework, https://arxiv.org/abs/2508.04816
Haonan Shangguan, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, and Ge Yu, 7 Aug 2025, Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation, https://arxiv.org/abs/2508.05234
Martin Weyssow, Chengran Yang, Junkai Chen, Ratnadira Widyasari, Ting Zhang, Huihui Huang, Huu Hung Nguyen, Yan Naing Tun, Tan Bui, Yikun Li, Ang Han Wei, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, David Lo, 7 Aug 2025, R2Vul: Learning to Reason about Software Vulnerabilities with Reinforcement Learning and Structured Reasoning Distillation, https://arxiv.org/abs/2504.04699
Lingyuan Liu, Mengxiang Zhang, 8 Aug 2025, Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models, https://arxiv.org/abs/2508.06135
Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, Alessandro Sordoni, 8 Aug 2025, Training Plug-n-Play Knowledge Modules with Deep Context Distillation, https://arxiv.org/abs/2503.08727
Shibin Su, Guoqiang Liang, De Cheng, Shizhou Zhang, Lingyan Ran, Yanning Zhang, 12 Aug 2025, Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL, https://arxiv.org/abs/2508.08677
Jinlin Xiang, Minho Choi, Yubo Zhang, Zhihao Zhou, Arka Majumdar, Eli Shlizerman, 11 Aug 2025, Neural Tangent Knowledge Distillation for Optical Convolutional Networks, https://arxiv.org/abs/2508.08421
Kristian Miok, Blaz Škrlj, Daniela Zaharie, and Marko Robnik Šikonja, 30 Jul 2025, TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning, https://arxiv.org/abs/2508.08273
Xiaojun Wu, Xiaoguang Jiang, Huiyang Li, Jucai Zhai, Dengfeng Liu, Qiaobo Hao, Huang Liu, Zhiguo Yang, Ji Xie, Ninglun Gu, Jin Yang, Kailai Zhang, Yelun Bao, Jun Wang, 13 Aug 2025, Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning, https://arxiv.org/abs/2508.09883
Aman Anand, Elyas Rashno, Amir Eskandari, Farhana Zulkernine, 12 Aug 2025, Depth-Guided Self-Supervised Human Keypoint Detection via Cross-Modal Distillation, https://arxiv.org/abs/2410.14700
Van Duc Cuong, Ta Dinh Tam, Tran Duc Chinh and Nguyen Thi Hanh, 10 Aug 2025, FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning, https://arxiv.org/abs/2508.07264
Durgesh Mishra, Rishabh Uikey, 15 Aug 2025, Unified Knowledge Distillation Framework: Fine-Grained Alignment and Geometric Relationship Preservation for Deep Face Recognition, https://arxiv.org/abs/2508.11376
Siyamalan Manivannan, 15 Aug 2025, Semi-Supervised Learning with Online Knowledge Distillation for Skin Lesion Classification, https://arxiv.org/abs/2508.11511
Thinh Dao, Khoa D Doan, Kok-Seng Wong, 14 Aug 2025, Clean-Label Physical Backdoor Attacks with Data Distillation, https://arxiv.org/abs/2407.19203
Manning Zhu, Songtao Guo, Pengzhan Zhou, Yansong Ning, Chang Han, Dewen Qiao, 18 Aug 2025, FedSODA: Federated Fine-tuning of LLMs via Similarity Group Pruning and Orchestrated Distillation Alignment, https://arxiv.org/abs/2508.12727
Taeheon Kim, San Kim, Minhyuk Seo, Dongjae Jeon, Wonje Jeong, Jonghyun Choi, 18 Aug 2025, Multi-Level Knowledge Distillation and Dynamic Self-Supervised Learning for Continual Learning, https://arxiv.org/abs/2508.12692
Xinhe Li, Jiajun Liu, Peng Wang, 18 Aug 2025, Can Large Models Teach Student Models to Solve Mathematical Problems Like Human Beings? A Reasoning Distillation Method via Multi-LoRA Interaction, https://arxiv.org/abs/2508.13037
Anshul Ahluwalia, Payman Behnam, Rohit Das, Alind Khare, Biswadeep Chakraborty, Pan Li, Alexey Tumanov, 16 Aug 2025, STRIDE: Structure and Embedding Distillation with Attention for Graph Neural Networks, https://arxiv.org/abs/2310.15938
Nikita Gushchin, David Li, Daniil Selikhanovych, Evgeny Burnaev, Dmitry Baranchuk, Alexander Korotin, 18 Aug 2025, Inverse Bridge Matching Distillation, https://arxiv.org/abs/2502.01362
Jihyun Lim, Junhyuk Jo, Tuo Zhang, Sunwoo Lee, 17 Aug 2025, Enabling Weak Client Participation via On-device Knowledge Distillation in Heterogenous Federated Learning, https://arxiv.org/abs/2503.11151
Simardeep Singh, 15 Aug 2025, From Teacher to Student: Tracking Memorization Through Model Distillation, https://arxiv.org/abs/2506.16170
Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, 6 Aug 2025, Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL, https://arxiv.org/abs/2508.13167
Yunxiang Yang, Ningning Xu, Jidong J. Yang, 19 Aug 2025, Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference, https://arxiv.org/abs/2508.13439
Wenhao Li, Xiu Su, Jingyi Wu, Feng Yang, Yang Liu, Yi Chen, Shan You, Chang Xu, 19 Aug 2025, Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation, https://arxiv.org/abs/2507.04680
Ye Su, Hezhe Qiao, Wei Huang, Lin Chen, 12 Aug 2025, Toward Generalist Semi-supervised Regression via Decoupled Representation Distillation, https://arxiv.org/abs/2508.14082
Ahmed Mujtaba, Gleb Radchenko, Radu Prodan, Marc Masana, 20 Aug 2025, Federated Distillation on Edge Devices: Efficient Client-Side Filtering for Non-IID Data, https://arxiv.org/abs/2508.14769
Suleyman Olcay Polat, Poli A. Nemkova, Mark V. Albert, 20 Aug 2025, Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method, https://arxiv.org/abs/2508.14783
Yichen Li, Xiuying Wang, Wenchao Xu, Haozhao Wang, Yining Qi, Jiahua Dong, Ruixuan Li, 20 Aug 2025, Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning, https://arxiv.org/abs/2507.10348
Bahri Batuhan Bilecen, Ahmet Berke Gokmen, Furkan Guzelant, Aysegul Dundar, 20 Aug 2025, Identity Preserving 3D Head Stylization with Multiview Score Distillation, https://arxiv.org/abs/2411.13536
Yifan Zhang, Junhui Hou, 20 Aug 2025, Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?, https://arxiv.org/abs/2412.08973
Jiacheng Xie, Ziyang Zhang, Biplab Poudel, Congyu Guo, Yang Yu, Guanghui An, Xiaoting Tang, Lening Zhao, Chunhui Xu, Dong Xu, 19 Aug 2025, TOM: An Open-Source Tongue Segmentation Method with Multi-Teacher Distillation and Task-Specific Data Augmentation, https://arxiv.org/abs/2508.14932
Aqib Nazir Mir, Danish Raza Rizvi, 21 Aug 2025, Explainable Knowledge Distillation for Efficient Medical Image Classification, https://arxiv.org/abs/2508.15251
Ruiqi Wang, Zezhou Yang, Cuiyun Gao, Xin Xia, Qing Liao, 21 Aug 2025, An Empirical Study of Knowledge Distillation for Code Understanding Tasks, https://arxiv.org/abs/2508.15423
Nicholas Mehlman, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Kelvin Niu, Alexander H. Miller, Shagun Sodhani, 29 Jul 2025, Scaling and Distilling Transformer Models for sEMG, https://arxiv.org/abs/2507.22094
Chen Zhang, Qiuchi Li, Dawei Song, Zheyu Ye, Yan Gao, Yan Hu, 30 Jul 2025, Towards the Law of Capacity Gap in Distilling Language Models, https://arxiv.org/abs/2311.07052
Maciej Mozolewski, Szymon Bobek, Grzegorz J. Nalepa, 3 Aug 2025, From SHAP to Rules: Distilling Expert Knowledge from Post-hoc Model Explanations in Time Series Classification, https://arxiv.org/abs/2508.01687
Sisuo Lyu, Siru Zhong, Weilin Ruan, Qingxiang Liu, Qingsong Wen, Hui Xiong, Yuxuan Liang, 3 Aug 2025, OccamVTS: Distilling Vision Models to 1% Parameters for Time Series Forecasting, https://arxiv.org/abs/2508.01727
Diana-Nicoleta Grigore, Neelu Madan, Andreas Mogelmose, Thomas B. Moeslund, Radu Tudor Ionescu, 5 Aug 2025, SlotMatch: Distilling Temporally Consistent Object-Centric Representations for Unsupervised Video Segmentation, https://arxiv.org/abs/2508.03411
Connor Wilhelm, Dan Ventura, 12 Aug 2025, Distilling Reinforcement Learning into Single-Batch Datasets, https://arxiv.org/abs/2508.09283
Abdul Matin, Tanjim Bin Faruk, Shrideep Pallickara, Sangmi Lee Pallickara, 13 Aug 2025, HyperKD: Distilling Cross-Spectral Knowledge in Masked Autoencoders via Inverse Domain Shift with Spatial-Aware Masking and Specialized Loss, https://arxiv.org/abs/2508.09453
Duygu Altinok, 18 Aug 2025, Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts, https://arxiv.org/abs/2508.13376
Haydn Thomas Jones, Natalie Maus, Josh Magnus Ludan, Maggie Ziyu Huan, Jiaming Liang, Marcelo Der Torossian Torres, Jiatao Liang, Zachary Ives, Yoseph Barash, Cesar de la Fuente-Nunez, Jacob R. Gardner, Mark Yatskar, 14 Aug 2025, A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design, https://arxiv.org/abs/2508.10899
Max Rehman Linder, 14 Aug 2025, KL-based self-distillation for large language models, https://arxiv.org/abs/2508.15807
Stephen Ekaputra Limantoro, 22 Aug 2025, Parameter-Free Logit Distillation via Sorting Mechanism, https://arxiv.org/abs/2508.16544
Rosni Vasu, Chandrayee Basu, Bhavana Dalvi Mishra, Cristina Sarasua, Peter Clark, Abraham Bernstein, 21 Aug 2025, HypER: Literature-grounded Hypothesis Generation and Distillation with Provenance, https://arxiv.org/abs/2506.12937
Khoi Do, Binh-Son Hua, 21 Aug 2025, Text-to-3D Generation using Jensen-Shannon Score Distillation, https://arxiv.org/abs/2503.10660
Gousia Habib, Tausifa Jan Saleem, Ishfaq Ahmad Malik, Brejesh Lall, 21 Aug 2025, LIB-KD: Teaching Inductive Bias for Efficient Vision Transformer Distillation and Compression, https://arxiv.org/abs/2310.00369
Lei Jiang, Wen Ge, Niels Cariou-Kotlarek, Mingxuan Yi, Po-Yu Chen, Lingyi Yang, Francois Buet-Golfouse, Gaurav Mittal, Hao Ni, 23 Aug 2025, Sig-DEG for Distillation: Making Diffusion Models Faster and Lighter, https://arxiv.org/abs/2508.16939
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo, 25 Aug 2025, FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation, https://arxiv.org/abs/2508.17868
Hariprasath Govindarajan, Maciej K. Wozniak, Marvin Klingner, Camille Maurice, B Ravi Kiran, Senthil Yogamani, 25 Aug 2025, CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation, https://arxiv.org/abs/2503.09878
Xuhui Fan, Zhangkai Wu and Hongyu Wu, 15 Aug 2025, A Survey on Pre-Trained Diffusion Model Distillations, https://arxiv.org/abs/2502.08364
Justin Kur, Kaiqi Zhao, 4 Sep 2025, Data-Augmented Quantization-Aware Knowledge Distillation, https://arxiv.org/abs/2509.03850
Hong Ye Tan, Emma Slade, 3 Sep 2025, Dataset Distillation as Pushforward Optimal Quantization, https://arxiv.org/abs/2501.07681
Xiaoxiong Zhang, Zhiwei Zeng, Xin Zhou, Zhiqi Shen, 5 Sep 2025, Low-Dimensional Federated Knowledge Graph Embedding via Knowledge Distillation, https://arxiv.org/abs/2408.05748
Md Anwar Hossen, Fatema Siddika, Wensheng Zhang, Anuj Sharma, and Ali Jannesari, 26 Aug 2025, FedProtoKD: Dual Knowledge Distillation with Adaptive Class-wise Prototype Margin for Heterogeneous Federated Learning, https://arxiv.org/abs/2508.19009
Viktor N. Zhuravlev and Artur R. Khairullin and Ernest A. Dyagin and Alena N. Sitkina and Nikita I. Kulin, 26 Aug 2025, Automatic Prompt Optimization with Prompt Distillation, https://arxiv.org/abs/2508.18992
Yihe Tang, Wenlong Huang, Yingke Wang, Chengshu Li, Roy Yuan, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, 25 Aug 2025, UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation, https://arxiv.org/abs/2506.09284
Wangyang Ying, Jinghan Zhang, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Kunpeng Liu, Chandan K. Reddy, Yanjie Fu, 27 Aug 2025, Data-Efficient Symbolic Regression via Foundation Model Distillation, https://arxiv.org/abs/2508.19487
Felix Nützel, Mischa Dombrowski, Bernhard Kainz, 27 Aug 2025, Ontology-Based Concept Distillation for Radiology Report Retrieval and Labeling, https://arxiv.org/abs/2508.19915
Yanfei Li, Teng Yin, Wenyi Shang, Jingyu Liu, Xi Wang, Kaiyang Zhao, 27 Aug 2025, PGAD: Prototype-Guided Adaptive Distillation for Multi-Modal Learning in AD Diagnosis, https://arxiv.org/abs/2503.04836
Suyoung Kim, Seonguk Park, Junhoo Lee, Nojun Kwak, 27 Aug 2025, The Role of Teacher Calibration in Knowledge Distillation, https://arxiv.org/abs/2508.20224
Leyang Wang, Mingtian Zhang, Zijing Ou and David Barber, 28 Aug 2025, VarDiU: A Variational Diffusive Upper Bound for One-Step Diffusion Distillation, https://arxiv.org/abs/2508.20646
Ayaka Tsutsumi, Guang Li, Ren Togo, Takahiro Ogawa, Satoshi Kondo, Miki Haseyama, 28 Aug 2025, Dual-Model Weight Selection and Self-Knowledge Distillation for Medical Image Classification, https://arxiv.org/abs/2508.20461
Jiahao Xiao, Jiangming Liu, 28 Aug 2025, Adaptive Federated Distillation for Multi-Domain Non-IID Textual Data, https://arxiv.org/abs/2508.20557
Kang-Hyun Lee, Faez Ahmed, 27 Aug 2025, MicroLad: 2D-to-3D Microstructure Reconstruction and Generation via Latent Diffusion and Score Distillation, https://arxiv.org/abs/2508.20138
Holger Severin Bovbjerg (1), Jan Østergaard (1), Jesper Jensen (1, 2), Shinji Watanabe (3), Zheng-Hua Tan ((1) Aalborg University (2) Eriksholm Research Centre, (3) Carnegie Mellon University), 28 Aug 2025, Learning Robust Spatial Representations from Binaural Audio through Feature Distillation, https://arxiv.org/abs/2508.20914
Yifei Yuan, Jiatong Li, Weijia Zhang, Mohammad Aliannejadi, Evangelos Kanoulas, Renjun Hu, 29 Aug 2025, Summarize-Exemplify-Reflect: Data-driven Insight Distillation Empowers LLMs for Few-shot Tabular Classification, https://arxiv.org/abs/2508.21561
Can Cui, Zilong Fu, Penghe Huang, Yuanyuan Li, Wu Deng, Dongyan Li, 30 Aug 2025, An Efficient GNNs-to-KANs Distillation via Self-Attention Dynamic Sampling with Potential for Consumer Electronics Edge Deployment, https://arxiv.org/abs/2509.00560
Armin Hadžić, Milan Papez, Tomáš Pevný, 1 Sep 2025, Distillation of a tractable model from the VQ-VAE, https://arxiv.org/abs/2509.01400
Xinlu Zhang, Na Yan, Yang Su, Yansha Deng, Toktam Mahmoodi, 1 Sep 2025, Communication-Aware Knowledge Distillation for Federated LLM Fine-Tuning over Wireless Networks, https://arxiv.org/abs/2509.01750
Long Jiang, Yang Yang, Ting Fong May Chui, Morgan Thornwell, Hoshin Vijai Gupta, 2 Sep 2025, Knowledge distillation as a pathway toward next-generation intelligent ecohydrological modeling systems, https://arxiv.org/abs/2509.01972
Hainan Wang, Mehdi Hosseinzadeh, Reza Rawassizadeh, 31 Aug 2025, TinyMusician: On-Device Music Generation with Knowledge Distillation and Mixed Precision Quantization, https://arxiv.org/abs/2509.00914
Kanchon Gharami, Hansaka Aluvihare, Shafika Showkat Moni, Berker Peköz, 31 Aug 2025, Clone What You Can't Steal: Black-Box LLM Replication via Logit Leakage and Distillation, https://arxiv.org/abs/2509.00973
Xingyu Su, Xiner Li, Masatoshi Uehara, Sunwoo Kim, Yulai Zhao, Gabriele Scalia, Ehsan Hajiramezanali, Tommaso Biancalani, Degui Zhi, Shuiwang Ji, 30 Aug 2025, Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models in Biomolecular Design, https://arxiv.org/abs/2507.00445
Shiva Raj Pokhrel, Deol Satish, Jonathan Kua and Anwar Walid, 2 Sep 2025, Distilling Large Language Models for Network Active Queue Management, https://arxiv.org/abs/2501.16734
Mingfeng Lin, 3 Sep 2025, Deep Self-knowledge Distillation: A hierarchical supervised learning for coronary artery segmentation, https://arxiv.org/abs/2509.03173
Zongheng Guo, Tao Chen, Manuela Ferrario, 8 Sep 2025, QualityFM: a Multimodal Physiological Signal Foundation Model with Self-Distillation for Signal Quality Challenges in Critically Ill Patients, https://arxiv.org/abs/2509.06516
Saghar Ganji, Mohammad Naisipour, Alireza Hassani, Arash Adib, 7 Sep 2025, Distillation of CNN Ensemble Results for Enhanced Long-Term Prediction of the ENSO Phenomenon, https://arxiv.org/abs/2509.06227
Eduardo Fernandes Montesuma, 8 Sep 2025, KD$^{2}$M: A unifying framework for feature knowledge distillation, https://arxiv.org/abs/2504.01757
Devansh, Sep 2025, The Chocolate Milk Cult’s Guide to Inference Scaling for AI Models: How to Reduce the costs of Running LLMs https://machine-learning-made-simple.medium.com/the-chocolate-milk-cults-guide-to-inference-scaling-for-ai-models-50aa2290eb50 (Deep analysis of using many progressive optimizations to real-life LLM inference.)
Xinyu Zhang and Changzhi Zhou and Linmei Hu and Luhao Zhang and Xiancai Chen and Haomin Fu and Yang Yang and Mengdi Zhang, 9 Sep 2025, SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs, https://arxiv.org/abs/2509.07858
Akram Khatami-Rizi, Ahmad Mahmoudi-Aznaveh, 9 Sep 2025, Involution and BSConv Multi-Depth Distillation Network for Lightweight Image Super-Resolution, https://arxiv.org/abs/2503.14779
Xin Jin, Bohan Li, BAAO Xie, Wenyao Zhang, Jinming Liu, Ziqiang Li, Tao Yang, Wenjun Zeng, 9 Sep 2025, Closed-Loop Unsupervised Representation Disentanglement with $\beta$-VAE Distillation and Diffusion Probabilistic Feedback, https://arxiv.org/abs/2402.02346
Yang Chen, Shuai Fu, Yu Zhang, 12 Sep 2025, MoPD: Mixture-of-Prompts Distillation for Vision-Language Models, https://arxiv.org/abs/2412.19087
Haipeng Liu, Ting Long, Jing Fu, 11 Sep 2025, Constructing a Question-Answering Simulator through the Distillation of LLMs, https://arxiv.org/abs/2509.09226
Seung Gyu Jeong, Seong Eun Kim, 11 Sep 2025, Adaptive Knowledge Distillation using a Device-Aware Teacher for Low-Complexity Acoustic Scene Classification, https://arxiv.org/abs/2509.09262
Zhanming Shen, Zeyu Qin, Zenan Huang, Hao Chen, Jiaqi Hu, Yihong Zhuang, Guoshan Lu, Gang Chen, Junbo Zhao, 11 Sep 2025, Merge-of-Thought Distillation, https://arxiv.org/abs/2509.08814
Anas Anwarul Haq Khan, Utkarsh Verma, Ganesh Ramakrishnan, 11 Sep 2025, Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization, https://arxiv.org/abs/2504.21831
Davide Ettori, Nastaran Darabi, Sureshkumar Senthilkumar, and Amit Ranjan Trivedi, 19 Sep 2025, RMT-KD: Random Matrix Theoretic Causal Knowledge Distillation, https://arxiv.org/abs/2509.15724
Qiaolin Wang, Xilin Jiang, Linyang He, Junkai Wu, Nima Mesgarani, 19 Sep 2025, SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models, https://arxiv.org/abs/2509.15661
Luca Della Libera, Cem Subakan, Mirco Ravanelli, 19 Sep 2025, FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation, https://arxiv.org/abs/2509.16195
Zhangkai Wu, Xuhui Fan, Hongyu Wu, Longbing Cao, 18 Sep 2025, SCoT: Straight Consistent Trajectory for Pre-Trained Diffusion Model Distillations, https://arxiv.org/abs/2502.16972
Xiang Xue, Yatu Ji, Qing-dao-er-ji Ren, Bao Shi, Min Lu, Nier Wu, Xufei Zhuang, Haiteng Xu, Gan-qi-qi-ge Cha, 16 Sep 2025, iCD: A Implicit Clustering Distillation Mathod for Structural Information Mining, https://arxiv.org/abs/2509.12553
Florian Zager and Hamza A. A. Gardi, 15 Sep 2025, GhostNetV3-Small: A Tailored Architecture and Comparative Study of Distillation Strategies for Tiny Images, https://arxiv.org/abs/2509.12380
Liming Lu, Shuchao Pang, Xu Zheng, Xiang Gu, Anan Du, Yunhuai Liu, Yongbin Zhou, 16 Sep 2025, CIARD: Cyclic Iterative Adversarial Robustness Distillation, https://arxiv.org/abs/2509.12633
Robin Vujanic, Thomas Rueckstiess, 16 Sep 2025, LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations, https://arxiv.org/abs/2509.12539
Lin Luo, Xin Wang, Bojia Zi, Shihao Zhao, Xingjun Ma, Yu-Gang Jiang, 16 Sep 2025, Adversarial Prompt Distillation for Vision-Language Models, https://arxiv.org/abs/2411.15244
Jing Zou, Shungeng Zhang, Meikang Qiu, Chong Li, 15 Sep 2025, DARD: Dice Adversarial Robustness Distillation against Adversarial Attacks, https://arxiv.org/abs/2509.11525
Chenghan Li and Garnet Kin-Lic Chan, 13 Sep 2025, Predictive Free Energy Simulations Through Hierarchical Distillation of Quantum Hamiltonians, https://arxiv.org/abs/2509.10967
Tong Wang, K. Sudhir and Dat Hong, 13 Sep 2025, Can Advanced LLMs Coach Smaller LLMs? Knowledge Distillation for Goal-Oriented Dialogs, https://arxiv.org/abs/2408.07238
Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, Issam H. Laradji, 13 Sep 2025, FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering, https://arxiv.org/abs/2412.07030
Yuanjie Lyu, Chengyu Wang, Jun Huang, Tong Xu, 12 Sep 2025, From Correction to Mastery: Reinforced Distillation of Large Language Model Agents, https://arxiv.org/abs/2509.14257
Yihan Cao, Yanbin Kang, Zhengming Xing, Ruijie Jiang, 18 Sep 2025, Delta Knowledge Distillation for Large Language Models, https://arxiv.org/abs/2509.14526
Enzhi Wang, Qicheng Li, Zhiyuan Tang, Yuhang Jia, 18 Sep 2025, Cross-Modal Knowledge Distillation for Speech Large Language Models, https://arxiv.org/abs/2509.14930
Botao Zhu, Jeslyn Wang, Dusit Niyato, Xianbin Wang, 9 Sep 2025, Trust Semantics Distillation for Collaborator Selection via Memory-Augmented Agentic AI, https://arxiv.org/abs/2509.08151
Matthew Nolan and Lina Yao and Robert Davidson, 10 Sep 2025, Ensemble Distribution Distillation for Self-Supervised Human Activity Recognition, https://arxiv.org/abs/2509.08225
Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan, 10 Sep 2025, Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation, https://arxiv.org/abs/2504.02438

Ensemble Knowledge Distillation (Multi-Model)

Rather than using a single teacher-student pair of models, research suggests that distilling from multiple teacher models, or using other ensemble distillation techniques, can be more effective.

Wenxian Shi, Yuxuan Song, Hao Zhou, Bohan Li, and Lei Li. Learning from deep model via exploring local targets, 2021. https://openreview.net/forum?id=5slGDu_bVc6 (Distillation with multiple teachers)
Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 5191–5198. AAAI Press, 2020. https://arxiv.org/abs/1902.03393 (multiple teachers)
Jangho Kim, Minsung Hyun, Inseop Chung, and Nojun Kwak. Feature fusion for online mutual knowledge distillation. In 25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event / Milan, Italy, January 10-15, 2021, pp. 4619–4625. IEEE, 2020. https://arxiv.org/abs/1904.09058 (Ensemble methods for distillation.)
Inseop Chung, Seonguk Park, Jangho Kim, and Nojun Kwak. Feature-map-level online adversarial knowledge distillation. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 2006–2015. PMLR, 2020. https://arxiv.org/abs/2002.01775 (Multiple teacher models.)
Defang Chen, Jian-Ping Mei, Can Wang, Yan Feng, and Chun Chen. Online knowledge distillation with diverse peers. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 3430–3437. AAAI Press, 2020. https://arxiv.org/abs/1912.00350 (Ensemble distillation with multiple "peer" teachers.)
Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormándi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. https://arxiv.org/abs/1804.03235
Mehdi Rezagholizadeh, Aref Jafari, Puneeth Salad, Pranav Sharma, Ali Saheb Pasand, Ali Ghodsi, Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher, arXiv preprint arXiv:2110.08532, 2021. https://arxiv.org/abs/2110.08532
Y. Zhang, T. Xiang, T. M. Hospedales and H. Lu, "Deep mutual learning", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 4320-4328, Jun. 2018. https://arxiv.org/abs/1706.00384
L. Yuan, F. E. Tay, G. Li, T. Wang and J. Feng, "Revisiting knowledge distillation via label smoothing regularization", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3903-3911, Jun. 2020. https://arxiv.org/abs/1909.11723 (Improved learning, and also looks at reverse student-to-teacher learning.)
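
The multi-teacher idea above can be illustrated with a minimal sketch, assuming the simplest combination rule: a weighted average of each teacher's temperature-softened probabilities, which the student then matches via KL divergence. The function names, the example logits, and the temperature of 2.0 are illustrative assumptions and are not taken from any of the papers listed.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(teacher_logits, weights=None, temperature=2.0):
    """Weighted average of the teachers' softened distributions (assumed rule)."""
    n = len(teacher_logits)
    if weights is None:
        weights = [1.0 / n] * n  # equal weighting by default
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    num_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs))
            for c in range(num_classes)]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) between two discrete distributions."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted))

def multi_teacher_loss(student_logits, teacher_logits, weights=None,
                       temperature=2.0):
    """Distillation loss of one student against the averaged teachers."""
    target = ensemble_soft_targets(teacher_logits, weights, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(target, student)

# Three hypothetical teachers scoring the same 3-class input.
teachers = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3], [2.2, 0.8, 0.0]]
target = ensemble_soft_targets(teachers)
loss_close = multi_teacher_loss([1.9, 1.0, 0.1], teachers)  # agrees with teachers
loss_far = multi_teacher_loss([0.0, 0.5, 3.0], teachers)    # disagrees with teachers
print(loss_close < loss_far)
```

Simple averaging is only one option. The papers above explore alternatives such as peer students that teach each other online and intermediate "teacher assistant" models.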

Unnatural Data Set Creation

Training one model on the output of another is not strictly distillation, but it is a widespread practice. Research papers on "unnatural instructions" that omit human curation include:

Or Honovich, Thomas Scialom, Omer Levy, Timo Schick, Dec 2022, Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor, https://arxiv.org/abs/2212.09689, https://github.com/orhonovich/unnatural-instructions (Using a model to automatically create a training data set, including automatically creating both instructions and responses.)

Dataset Distillation

The technique of "dataset distillation" borrows the same terminology, but is a different technique from knowledge distillation. The term refers to methods that reduce a training dataset to a smaller derived set of training data, for example to avoid privacy or copyright concerns. In theory, the derived dataset can be used to train a similarly capable model.
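
As a rough illustration of reducing a dataset to a small derived set, the sketch below condenses a toy two-class dataset to one synthetic point per class (the class mean) and compares a nearest-neighbour classifier built on the full set with one built on the condensed set. This is a deliberately crude stand-in: the dataset distillation papers typically optimize the synthetic samples with gradient-based methods rather than taking means. All names and the toy data are assumptions made for illustration.

```python
import random

def make_dataset(n_per_class=200, seed=0):
    """Toy 2-D data: two Gaussian blobs, one per class (illustrative only)."""
    rng = random.Random(seed)
    centers = {0: (-2.0, 0.0), 1: (2.0, 0.0)}
    data = []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    return data

def condense(data):
    """Derive one synthetic training point per class: the class mean."""
    sums, counts = {}, {}
    for (x, y), label in data:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return [((sx / counts[label], sy / counts[label]), label)
            for label, (sx, sy) in sorted(sums.items())]

def predict(train_set, point):
    """1-nearest-neighbour prediction using whichever training set is given."""
    px, py = point
    nearest = min(train_set,
                  key=lambda item: (item[0][0] - px) ** 2 + (item[0][1] - py) ** 2)
    return nearest[1]

def accuracy(train_set, test_set):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(predict(train_set, p) == label for p, label in test_set)
    return correct / len(test_set)

full = make_dataset(seed=0)   # 400 original samples
test = make_dataset(seed=1)   # held-out samples from the same distribution
small = condense(full)        # 2 derived samples
acc_full = accuracy(full, test)
acc_small = accuracy(small, test)
print(len(full), len(small), acc_full, acc_small)
```

Note that the derived points are not members of the original dataset, which is the property that makes the approach interesting for privacy and copyright concerns.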

Papers on dataset distillation:

T. Wang, J.-Y. Zhu, A. Torralba and A. A. Efros, "Dataset distillation", arXiv:1811.10959, 2018. https://arxiv.org/abs/1811.10959
Yu R, Liu S, Wang X, 2023, Dataset Distillation: A Comprehensive Review. https://arxiv.org/abs/2301.07014
Mami Nagoya, Keiichi Shiohara, Xing Chen, 2017, "A method for reducing the amounts of training samples for developing AI systems", 2017 International Electronics Symposium on Knowledge Creation and Intelligent Computing (IES-KCIC), pp.13-20, 2017. https://ieeexplore.ieee.org/document/8228448
David Spuler, March 2024, Chapter 45. Knowledge Distillation, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9

Black Box Knowledge Distillation

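Black box knowledge distillation is any type of KD where the teacher model is treated as a black box: only its outputs (such as generated text or predicted labels) are available, not its weights, logits, or internal activations. Such architectures use the teacher model's outputs as training data for the student model, without access to any internal state of the teacher.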

Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
M Xu, D Cai, W Yin, S Wang, X Jin, X Liu, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, ACM Computing Surveys, https://dl.acm.org/doi/pdf/10.1145/3706418
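
The black box setting can be sketched as follows: the student learns purely from data labelled by calling the teacher, and never reads the teacher's logits or weights. In this minimal sketch, `teacher_api` is a hypothetical stand-in for a remote model, and the student is a tiny logistic-regression model. Both are assumptions chosen to keep the example self-contained.

```python
import math
import random

def teacher_api(x):
    """Hypothetical opaque teacher: callers see only the final label."""
    return 1 if 3.0 * x[0] - 2.0 * x[1] + 0.5 > 0 else 0

def build_transfer_set(n, seed=0):
    """Sample unlabeled inputs and label them by querying the teacher."""
    rng = random.Random(seed)
    inputs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    return [(x, teacher_api(x)) for x in inputs]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_student(transfer_set, epochs=50, lr=0.5):
    """Fit a logistic-regression student on teacher-generated labels via SGD."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in transfer_set:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            grad = p - y  # gradient of cross-entropy w.r.t. the pre-activation
            w[0] -= lr * grad * x[0]
            w[1] -= lr * grad * x[1]
            b -= lr * grad
    return w, b

def student_predict(params, x):
    """Predict a hard label from the student's linear score."""
    w, b = params
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

transfer = build_transfer_set(500, seed=0)
student = train_student(transfer)
held_out = build_transfer_set(500, seed=1)
agreement = sum(student_predict(student, x) == y
                for x, y in held_out) / len(held_out)
print(agreement)
```

The `agreement` value measures how often the student reproduces the teacher's label on fresh inputs, which is the usual success criterion when the teacher itself is the only source of supervision.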

White Box Knowledge Distillation

White box knowledge distillation is any use of KD where the internals of the teacher model are accessible, such as its weights, logits, or intermediate activations. For example, such architectures may train the student to match the teacher's output probability distribution or hidden states, or may initialize the student directly from a subset of the teacher's weights.

Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
M Xu, D Cai, W Yin, S Wang, X Jin, X Liu, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, ACM Computing Surveys, https://dl.acm.org/doi/pdf/10.1145/3706418
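
One common form of white box signal is a loss computed from the teacher's internal values. The sketch below assumes two such signals: a KL term on temperature-softened logits, scaled by the squared temperature as in classic soft-target distillation, and a mean-squared-error term on one hidden-state vector of equal width in both models. The dictionary layout, the `alpha` mixing weight, and the example numbers are illustrative assumptions, not a description of any particular paper's method.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    kl = sum(ti * math.log(ti / si) for ti, si in zip(t, s))
    return kl * temperature ** 2

def hidden_state_loss(student_hidden, teacher_hidden):
    """Mean squared error between activations of equal width."""
    diffs = [(a - b) ** 2 for a, b in zip(student_hidden, teacher_hidden)]
    return sum(diffs) / len(diffs)

def white_box_loss(student_out, teacher_out, alpha=0.5, temperature=2.0):
    """Blend the logit-matching and hidden-state-matching terms."""
    logit_term = soft_target_loss(student_out["logits"],
                                  teacher_out["logits"], temperature)
    hidden_term = hidden_state_loss(student_out["hidden"],
                                    teacher_out["hidden"])
    return alpha * logit_term + (1.0 - alpha) * hidden_term

# Hypothetical per-example outputs exposed by each model.
teacher = {"logits": [4.0, 1.0, -2.0], "hidden": [0.2, -0.1, 0.7]}
student = {"logits": [2.5, 1.5, -1.0], "hidden": [0.1, 0.0, 0.4]}
loss_same = white_box_loss(teacher, teacher)  # identical outputs
loss_diff = white_box_loss(student, teacher)  # mismatched outputs
print(loss_same, loss_diff)
```

In practice this loss would be minimized by gradient descent over the student's parameters, often combined with an ordinary cross-entropy term on ground-truth labels. When teacher and student hidden widths differ, a learned projection is typically needed before the hidden-state term can be computed.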