Aussie AI

LLM Model Pruning Research

Last Updated 22 October, 2025

by David Spuler, Ph.D.

Model pruning is a type of model compression for Large Language Models (LLMs) and other neural networks. Conceptually, this involves removal of weights within the model, which reduces the total computation required for model inference (i.e. "running" the AI model to generate a response).

Types of Model Pruning

The top-level classification involves whether pruning is done to the model, or during inference.

Static pruning. Changing the model during or after training, but not during inference.
Dynamic pruning. Inference decisions that effectively prune parts of the network.

There is also the very general category according to strategy:

Unstructured pruning. Remove low weight links ("magnitude pruning") regardless of where they are in the structure of the model.
Structured pruning. Everything else, which means removing weight groups in a specific part of the structure of the model (e.g. all weights in a layer, channel, filter, etc.)

Note that a hybrid strategy is to do unstructured magnitude pruning of all the weights in one particular type of structure, rather than removing an entire structural component (e.g. layer-specific magnitude pruning).

Unstructured Pruning

Weight pruning or "magnitude pruning" is unstructured pruning that removes very low weights, including small negative weights. Technically, weight pruning is the general class of pruning decisions made about a single weight at a time, whereas magnitude pruning is the specific case of pruning weights whose absolute value is near zero (below some threshold). Magnitude pruning also removes weights that are exactly zero (see also zero skipping), including the floating-point oddity called "negative-zero" if that is found.
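A minimal sketch of threshold-based magnitude pruning, as described above. The function names and the threshold value are illustrative assumptions, not taken from any particular library.

```python
def magnitude_prune(weights, threshold):
    """Return a copy of weights with small-magnitude entries set to 0.0.

    A weight is pruned when abs(w) < threshold, so small negative
    weights are removed as well as small positive ones.
    """
    return [0.0 if abs(w) < threshold else w for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Hypothetical weight vector, including a zero and a negative-zero.
weights = [0.8, -0.02, 0.003, -0.6, 0.0, -0.0, 0.05, 1.2]
pruned = magnitude_prune(weights, threshold=0.1)

print(pruned)            # [0.8, 0.0, 0.0, -0.6, 0.0, 0.0, 0.0, 1.2]
print(sparsity(pruned))  # 0.625
```

Comparing on the absolute value is what lets small negative weights and negative-zero be treated the same as small positive weights.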

Note that quantization techniques can also increase the number of zero weights throughout the quantized model. Quantization may round low-magnitude weights down to zero, with a smaller number of discrete weights possible. However, it depends on the level of quantization, and the choices made about rounding versus truncation in the quantization method.
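The effect of rounding versus truncation can be illustrated with a toy uniform quantizer. This is a sketch only: the step size `scale` and the weight values are hypothetical, and real quantization schemes differ in many details.

```python
import math


def quantize(weights, scale, mode="round"):
    """Map float weights to integer levels q = w / scale.

    mode="round" picks the nearest level; mode="truncate" drops the
    fractional part (toward zero), which sends more small weights to zero.
    """
    if mode == "round":
        return [int(round(w / scale)) for w in weights]
    return [math.trunc(w / scale) for w in weights]


weights = [0.8, -0.02, 0.1, -0.6, 0.2, 1.2]
scale = 0.25  # hypothetical step size between quantization levels

rounded = quantize(weights, scale, mode="round")
truncated = quantize(weights, scale, mode="truncate")

print(rounded, rounded.count(0))      # [3, 0, 0, -2, 1, 5] 2
print(truncated, truncated.count(0))  # [3, 0, 0, -2, 0, 4] 3
```

With this step size, truncation zeroes one more weight than rounding does (the 0.2 entry), and a coarser step size would zero more weights under either mode.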

The idea is simply that weights with a low magnitude are not contributing much to the overall inference, and hence skipping these calculations should not greatly affect the overall accuracy of the model. Theory says that this works quite well. This is the simplest and most longstanding type of pruning, and has a large body of research.

An important practical aspect of unstructured pruning is that it has no latency benefit unless multiplication by zeros in vector dot products can be efficiently avoided. For example, if a vector with some zeroed elements is still sent in full to an accelerator, any benefit depends on the characteristics of the hardware, or on the algorithms used by the deep learning compiler (or graph compiler), and the latency improvement may be limited.
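One way to avoid the multiplications by zero is to store only the non-zero weights and their positions. The sketch below assumes a simple list-of-pairs sparse format, chosen for illustration; production kernels use more elaborate sparse layouts and hardware support.

```python
def to_sparse(weights):
    """Store only the non-zero weights as (index, value) pairs."""
    return [(i, w) for i, w in enumerate(weights) if w != 0.0]


def sparse_dot(sparse_weights, activations):
    """Dot product that performs one multiply per non-zero weight."""
    return sum(w * activations[i] for i, w in sparse_weights)


def dense_dot(weights, activations):
    """Ordinary dot product: one multiply per element, zeros included."""
    return sum(w * a for w, a in zip(weights, activations))


pruned = [0.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

sparse = to_sparse(pruned)
print(len(pruned), len(sparse))  # 8 multiplies dense, 3 sparse
print(dense_dot(pruned, x), sparse_dot(sparse, x))  # 0.5 0.5
```

The result is the same, but the sparse version now needs an index lookup per weight, which is one reason the real-world speedup depends on the hardware and the compiler.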

In addition to the simplistic idea of cutting all low weights throughout the model, much research has been done on various techniques to increase the number of such low weights, by making the weight matrices sparser. Research on such issues is addressed under topics such as "magnitude pruning", "sparsification" and "sparse matrix" theory.

Movement Pruning

Another type of unstructured pruning is "movement pruning". This is a pruning of weights during training, based on how they change during the training process. Weights that are "moving" towards zero as training progresses are removed (pruned), whereas weights "moving away from zero" are retained. Note that this relates to both positive and negative weights (with opposite movement directions). It's also possible to combine a movement threshold and an absolute magnitude threshold, so that important weights with large absolute values are not subsequently removed if they change a tiny amount towards zero. Overall, this means movement pruning focuses on the changes to weights as they are trained, rather than on their absolute value at the end of training.
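The description above can be sketched by comparing a snapshot of the weights before and after some training steps. This is a deliberate simplification: the function name, the two thresholds, and the snapshot comparison are assumptions made for illustration, and the methods in the cited papers derive their pruning scores during fine-tuning rather than from two snapshots.

```python
def movement_prune(before, after, move_threshold=0.0, keep_magnitude=float("inf")):
    """Zero weights whose magnitude shrank during training.

    movement = abs(after) - abs(before); a negative value means the weight
    moved toward zero, for positive and negative weights alike.
    A weight is pruned when movement < -move_threshold, unless its final
    magnitude is at least keep_magnitude (the combined-threshold variant).
    """
    pruned = []
    for w0, w1 in zip(before, after):
        movement = abs(w1) - abs(w0)
        toward_zero = movement < -move_threshold
        protected = abs(w1) >= keep_magnitude
        pruned.append(0.0 if toward_zero and not protected else w1)
    return pruned


# Hypothetical snapshots of five weights before and after training.
before = [0.5, -0.5, 0.125, -0.125, 2.0]
after = [0.25, -0.125, 0.375, -0.25, 1.75]

movement_only = movement_prune(before, after, move_threshold=0.1)
combined = movement_prune(before, after, move_threshold=0.1, keep_magnitude=1.0)

print(movement_only)  # [0.0, 0.0, 0.375, -0.25, 0.0]
print(combined)       # [0.0, 0.0, 0.375, -0.25, 1.75]
```

The last weight moved toward zero, so movement alone prunes it, but the magnitude guard keeps it because its final value is still large.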

Research papers on movement pruning:

Francois Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Block pruning for faster transformers. arXiv preprint arXiv:2109.04838, 2021. https://arxiv.org/abs/2109.04838, Code: https://github.com/huggingface/nn_pruning
Victor Sanh, Thomas Wolf, and Alexander M Rush. “Movement pruning: Adaptive sparsity by fine-tuning”. In: arXiv preprint arXiv:2005.07683 (2020). https://arxiv.org/abs/2005.07683
David Spuler, March 2024, Chapter 33. Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 16 Jul 2024, MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models, https://arxiv.org/abs/2407.11681
David Spuler, March 2024, Movement Pruning, in Generative AI in C++, https://www.aussieai.com/book/ch33-movement-pruning

Semi-Structured Pruning

Semi-structured pruning is part-way between structured and unstructured pruning. It involves zeroing some weights in an unstructured manner, but doing so within particular structures, or according to various patterns or limitations.
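One pattern often given as an example of semi-structured pruning is N:M sparsity, where only N weights are kept in each group of M consecutive weights (for instance 2:4). The text above does not name a specific pattern, so the choice of N:M here is an assumption for illustration, as is the function name.

```python
def prune_n_of_m(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in each group of m; zero the rest."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    out = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions of the n largest magnitudes within this group.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
pruned = prune_n_of_m(weights, n=2, m=4)
print(pruned)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Which weights are zeroed within a group is decided by magnitude (unstructured), but every group ends up with the same number of zeros (structured), which is what makes the pattern predictable for hardware.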

Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325

Types of Structured Pruning

The more specific types of structured pruning depend on which type of "structure" has its weights pruned:

Layer pruning
Early exit (effectively dynamic layer pruning)
Attention head pruning
Channel pruning
Token pruning
Embeddings pruning
Block pruning
Filter pruning
FFN pruning
Normalization pruning
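To illustrate the difference between static layer pruning and early exit from the list above, here is a toy sketch. The "layers" and the "confidence" function are hypothetical stand-ins rather than real model components; only the control flow is the point.

```python
def run_with_early_exit(layers, x, confidence, threshold):
    """Apply layers in order, stopping once confidence(x) >= threshold.

    Returns the final activation and the number of layers executed.
    """
    executed = 0
    for layer in layers:
        x = layer(x)
        executed += 1
        if confidence(x) >= threshold:
            break  # remaining layers are skipped for this input only
    return x, executed


# Toy stand-ins: each "layer" adds 0.25, and "confidence" is the value itself.
layers = [lambda v: v + 0.25] * 6

# Dynamic (early exit): the number of layers run depends on the input.
out, used = run_with_early_exit(layers, 0.0, confidence=lambda v: v, threshold=0.75)
print(out, used)  # 0.75 3

# Static layer pruning: the last two layers are removed for every input.
static_layers = layers[:4]
out_static, used_static = run_with_early_exit(
    static_layers, 0.0, confidence=lambda v: v, threshold=float("inf")
)
print(out_static, used_static)  # 1.0 4
```

Static pruning changes the model once, whereas early exit leaves all layers in place and makes the skipping decision at inference time.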

Another technique similar to structured pruning is to cut weights by using smaller matrices. Advanced matrix algebra can be used to factorize the large matrices into smaller "low-rank" matrices, with fewer rows and columns (hence, fewer weights); see matrix algebra.
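The weight savings from low-rank factorization follow from simple arithmetic: a matrix with `rows x cols` weights is approximated by the product of a `rows x rank` matrix and a `rank x cols` matrix. The sizes below are illustrative assumptions, not figures from any specific model.

```python
def dense_params(rows, cols):
    """Number of weights in the original matrix W."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Weights in A (rows x rank) plus B (rank x cols), where W is approximated by A times B."""
    return rank * (rows + cols)


rows, cols, rank = 4096, 4096, 64  # hypothetical sizes

full = dense_params(rows, cols)
factored = low_rank_params(rows, cols, rank)

print(full)              # 16777216
print(factored)          # 524288
print(full // factored)  # 32
```

The factorization only saves weights when `rank` is well below `rows * cols / (rows + cols)`, and the approximation error generally grows as the rank shrinks.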

Hybrid pruning

There are various hybrid pruning strategies, where it is possible to combine two types of pruning (or indeed with other non-pruning optimizations such as quantization). For example, you can try to prune both depth and width at the same time (called "dual pruning"). There's also "triple pruning" of depth, width, and length. Or you can combine structured and unstructured pruning, e.g. structured pruning of a particular part of the model (e.g. a layer), but then only do unstructured weight pruning (magnitude pruning) of that structure, such that only low-value unimportant weights are pruned. This is an area of intensive ongoing research, and many hybrid pruning strategies are being tested.
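The combination of structured selection with unstructured magnitude pruning can be sketched as follows. The dictionary-of-lists model representation, the layer names, and the threshold are assumptions made for illustration.

```python
def prune_selected_layers(model, layer_names, threshold):
    """Magnitude-prune only the named layers; copy the others unchanged."""
    pruned = {}
    for name, weights in model.items():
        if name in layer_names:
            # Unstructured step: zero low-magnitude weights in this layer.
            pruned[name] = [0.0 if abs(w) < threshold else w for w in weights]
        else:
            pruned[name] = list(weights)
    return pruned


# Hypothetical three-layer model with three weights per layer.
model = {
    "layer0": [0.9, -0.05, 0.02],
    "layer1": [0.03, -0.8, 0.04],
    "layer2": [0.01, 0.6, -0.02],
}

result = prune_selected_layers(model, {"layer1"}, threshold=0.1)
print(result["layer0"])  # [0.9, -0.05, 0.02]
print(result["layer1"])  # [0.0, -0.8, 0.0]
print(result["layer2"])  # [0.01, 0.6, -0.02]
```

The structured decision is which layer to touch; the unstructured decision is which individual weights inside that layer are small enough to remove.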

Multidimensional Pruning

There are many types of pruning, and some are orthogonal to each other. This means it is possible, and desirable, to prune a model from multiple dimensions. Some of the types include:

Dual pruning (usually depth and width pruning, such as combining layer pruning and channel pruning).
Triple pruning

Quadruple pruning? Is there a way to do four? For example, can you combine the four dimensions:

Depth: layer
Width: attention heads, neurons, or FFNs
Length: tokens, and
Model dimension: embeddings dimension

Survey Papers on Model Pruning

Review papers with coverage of pruning include:

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various types of pruning.)
T Liang, J Glossner, L Wang, S Shi, X Zhang, 2021, Neurocomputing, Pruning and quantization for deep neural network acceleration: A survey, https://arxiv.org/abs/2101.09671
Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
T Choudhary, V Mishra, A Goswami, 2020, A comprehensive survey on model compression and acceleration, Artificial Intelligence Review, https://doi.org/10.1007/s10462-020-09816-7, https://link.springer.com/article/10.1007/s10462-020-09816-7
J Liu, S Tripathi, U Kurup, M Shah, 2020, Pruning algorithms to accelerate convolutional neural networks for edge applications: A survey, arXiv preprint arXiv:2005.04275, https://arxiv.org/abs/2005.04275
Y Cheng, D Wang, P Zhou, T Zhang, June 2020 (revised), A survey of model compression and acceleration for deep neural networks, arXiv preprint arXiv:1710.09282, https://arxiv.org/abs/1710.09282
K Nan, S Liu, J Du, H Liu, 2019, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762, PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
Yu Cheng; Duo Wang; Pan Zhou; Tao Zhang, 2018, Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges, IEEE Signal Processing Magazine (Volume 35, Issue 1, January 2018), https://ieeexplore.ieee.org/document/8253600
G Menghani, 2023, Efficient deep learning: A survey on making deep learning models smaller, faster, and better, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3578938, https://arxiv.org/abs/2106.08962
L Deng, G Li, S Han, L Shi, Y Xie, 2020, Model compression and hardware acceleration for neural networks: A comprehensive survey, Proceedings of the IEEE (Volume 108, Issue 4, April 2020), https://ieeexplore.ieee.org/abstract/document/9043731
S Xu, A Huang, L Chen, B Zhang, 2020, Convolutional neural network pruning: A survey, 2020 39th Chinese Control Conference (CCC), https://ieeexplore.ieee.org/document/9189610
K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
H Wang, C Qin, Y Bai, Y Zhang, Y Fu, 2022, Recent advances on neural network pruning at initialization, arXiv preprint arXiv:2103.06460, 2021 (updated May 2022), https://arxiv.org/abs/2103.06460
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
David Spuler, September 2nd, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631. https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
Tom Wallace, Naser Ezzati-Jivan, Beatrice Ombuki-Berman, 16 Jan 2025, Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models, https://arxiv.org/abs/2502.00046
Mozhgan Navardi, Romina Aalishah, Yuzhe Fu, Yueqian Lin, Hai Li, Yiran Chen, Tinoosh Mohsenin, 19 Feb 2025, GenAI at the Edge: Comprehensive Survey on Empowering Edge Devices, https://arxiv.org/abs/2502.15816
Shaibal Saha, Lanyu Xu, 26 Feb 2025, Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies, https://arxiv.org/abs/2503.02891

Model Pruning General Research

Model pruning is a long-standing algorithm for model compression, with extensive literature, and numerous types of pruning.

Trim insignificant weights (TensorFlow), https://www.tensorflow.org/model_optimization/guide/pruning
Pruning comprehensive guide (TensorFlow), https://www.tensorflow.org/model_optimization/guide/pruning/comprehensive_guide
Song Han, Huizi Mao, William J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", arXiv:1510.00149v5 [cs.CV], 15 Feb 2016, https://arxiv.org/abs/1510.00149
Kim Martineau, "A foolproof way to shrink deep learning models", MIT News, April 30, 2020, https://news.mit.edu/2020/foolproof-way-shrink-deep-learning-models-0430
Jonathan Frankle, Michael Carbin, "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks", March 2019, https://arxiv.org/abs/1803.03635
Shaokai Ye, Tianyun Zhang, Kaiqi Zhang, Jiayu Li, Jiaming Xie, Yun Liang, Sijia Liu, Xue Lin, Yanzhi Wang, "A Unified Framework of DNN Weight Pruning and Weight Clustering/Quantization Using ADMM", November 2018, https://arxiv.org/abs/1811.01907
Fahim Dalvi, Hassan Sajjad, Nadir Durrani, and Yonatan Belinkov. 2020. Analyzing redundancy in pretrained transformer models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4908–4926, Online. Association for Computational Linguistics. https://arxiv.org/abs/2004.04010 (Detailed analysis finding redundancy in 85% of parameters, with relevance to pruning and sharing.)
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, M Xia, T Gao, Z Zeng, D Chen, arXiv preprint arXiv:2310.06694, Oct 2023, https://arxiv.org/pdf/2310.06694.pdf, Code: https://github.com/princeton-nlp/LLM-Shearing
Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle, 29 May 2024, STAT: Shrinking Transformers After Training, https://arxiv.org/abs/2406.00061
Maolin Wang, Yao Zhao, Jiajia Liu, Jingdong Chen, Chenyi Zhuang, Jinjie Gu, Ruocheng Guo, Xiangyu Zhao, May 2024, Large Multimodal Model Compression via Iterative Efficient Pruning and Distillation, WWW '24: Companion Proceedings of the ACM on Web Conference 2024, May 2024, Pages 235–244, https://doi.org/10.1145/3589335.3648321
Maolin Wang, Yao Zhao, Jiajia Liu, Jingdong Chen, Chenyi Zhuang, Jinjie Gu, Ruocheng Guo, Xiangyu Zhao, Dec 2023, Large Multimodal Model Compression via Efficient Pruning and Distillation at AntGroup, https://arxiv.org/abs/2312.05795
Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London, https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
Busayo Awobade, Mardiyyah Oduwole, Steven Kolawole, 6 Apr 2024, What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models, https://arxiv.org/abs/2404.04759 (General article shows that the big three of model compression work not just on compressing big LLMs, but also on making small models even smaller.)
Seungtae Hong, Gunju Park, Jeong-Si Kim, 9 June 2024, Automated deep-learning model optimization framework for microcontrollers, https://doi.org/10.4218/etrij.2023-0522 https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2023-0522 (Framework for using quantization and pruning on microcontroller devices.)
Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren. 2020. PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-Based Weight Pruning. Association for Computing Machinery, New York, NY, USA, 907–922. https://doi.org/10.1145/3373376.3378534 (Pattern-based pruning method.)
W Li, H Hacid, E Almazrouei, M Debbah, 2023, A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39
T Liang, J Glossner, L Wang, S Shi, X Zhang, 2021, Neurocomputing, Pruning and quantization for deep neural network acceleration: A survey, https://arxiv.org/abs/2101.09671
Abhiroop Bhattacharjee, Yeshwanth Venkatesha, Abhishek Moitra, Priyadarshini Panda, MIME: adapting a single neural network for multi-task inference with memory-efficient dynamic pruning. In: DAC (2022) https://arxiv.org/abs/2204.05274
Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982
Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017. https://arxiv.org/abs/1710.09282
Gale, T., Elsen, E., and Hooker, S., The state of sparsity in deep neural networks, arXiv preprint arXiv:1902.09574, 2019, https://arxiv.org/abs/1902.09574
Kwon, W., Kim, S., Mahoney, M. W., Hassoun, J., Keutzer, K., and Gholami, A., 2022, A fast post-training pruning framework for transformers, arXiv preprint arXiv:2204.09656, https://arxiv.org/abs/2204.09656
S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, 2016, Eie: Efficient inference engine on compressed deep neural network, in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA, IEEE Press, 2016, pp. 243–254, https://doi.org/10.1109/ISCA.2016.30 https://arxiv.org/abs/1602.01528
Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017, Thinet: A filter level pruning method for deep neural network compression. In ICCV, pages 5058–5066, https://arxiv.org/abs/1707.06342
Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748 (General survey of various model compression and other inference optimizations.)
Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang, 19 Dec 2023, Fluctuation-based Adaptive Structured Pruning for Large Language Models, https://arxiv.org/abs/2312.11983 Code: https://github.com/CASIA-IVA-Lab/FLAP
Vladimír Boža, 1 Jan 2024, Fast and Optimal Weight Update for Pruned Large Language Models, https://arxiv.org/abs/2401.02938 Code: https://github.com/fmfi-compbio/admm-pruning (Fast algorithm for fine-tuning after pruning to recover any lost model accuracy efficiently.)
S Guo, J Xu, LL Zhang, M Yang, Oct 2023, Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models, arXiv preprint arXiv:2310.05015, https://arxiv.org/pdf/2310.05015.pdf Code: https://github.com/microsoft/Moonlit/tree/main/Compresso
Zixiao Wang, Jingwei Zhang, Wenqian Zhao, Farzan Farnia, Bei Yu, 11 Jun 2024, MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations, https://arxiv.org/abs/2406.07017 Code: https://github.com/ShiningSord/MoreauPruner
K Nan, S Liu, J Du, H Liu, 2019, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762 PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
Meriam Dhouibi, Ahmed Karim Ben Salem, Afef Saidi, Slim Ben Saoud, March 2021, Accelerating deep neural networks implementation: A survey, https://doi.org/10.1049/cdt2.12016 PDF: https://ietresearch.onlinelibrary.wiley.com/doi/pdfdirect/10.1049/cdt2.12016
David Spuler, March 2024, Chapter 33. Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Xinyin Ma, Gongfan Fang, Xinchao Wang, May 2023, LLM-Pruner: On the Structural Pruning of Large Language Models, https://arxiv.org/abs/2305.11627
J. Choi and S. Venkataramani, 2019, Approximate Computing Techniques for Deep Neural Networks. Cham: Springer, 2019, pp. 307–329, Chapter 15, https://link.springer.com/chapter/10.1007/978-3-319-99322-5_15
Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
Sharath Sreenivas, Vinh Nguyen, Saurav Muralidharan, Marcin Chochowski and Raviraj Joshi, How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model, Aug 14, 2024, https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/
Szabolcs Cséfalvay, James Imber, 31 Jan 2023 (v2), Self-Compressing Neural Networks, https://arxiv.org/abs/2301.13142
Sophia R. Cunningham, Dominique Archambault, Austin Kung, May 2024, Efficient Training and Inference: Techniques for Large Language Models Using Llama, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171651876.65094225/v1
Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
David Spuler, September 2nd, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631. https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
Yanyue Xie, Zhi Zhang, Ding Zhou, Cong Xie, Ziang Song, Xin Liu, Yanzhi Wang, Xue Lin, An Xu, 15 Oct 2024, MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router, https://arxiv.org/abs/2410.12013 (Pruning applied to MoE.)
Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin, 21 Oct 2024, Pruning Foundation Models for High Accuracy without Retraining, https://arxiv.org/abs/2410.15567 https://github.com/piuzha/APT
Mostafa Hussien, Mahmoud Afifi, Kim Khoa Nguyen, Mohamed Cheriet, 21 Oct 2024, Small Contributions, Small Networks: Efficient Neural Network Pruning Based on Relative Importance, https://arxiv.org/abs/2410.16151
Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
M Xu, D Cai, W Yin, S Wang, X Jin, X Liu, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, ACM Computing Surveys, https://dl.acm.org/doi/pdf/10.1145/3706418
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
Bairu Hou, Qibin Chen, Jianyu Wang, Guoli Yin, Chong Wang, Nan Du, Ruoming Pang, Shiyu Chang, Tao Lei, 7 Jan 2025 (v2), Instruction-Following Pruning for Large Language Models, https://arxiv.org/abs/2501.02086
Mingkuan Feng, Jinyang Wu, Shuai Zhang, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, Jianhua Tao, Feihu Che, 29 Jan 2025, DReSS: Data-driven Regularized Structured Streamlining for Large Language Models, https://arxiv.org/abs/2501.17905 (Regularize before pruning, then fine-tune.)
Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer, 24 Jun 2025, Beyond Token Pruning: Operation Pruning in Vision-Language Models, https://arxiv.org/abs/2507.02909 https://github.com/zxcvfd13502/GSOP (Pruning of entire kernel operations.)
Bin Hong, Jiayu Liu, Zhenya Huang, Kai Zhang, Mengdi Zhang, 13 Aug 2025, Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization, https://arxiv.org/abs/2508.10164
Taibiao Zhao, Mingxuan Sun, Hao Wang, Xiaobing Chen, Xiangwei Zhou, 14 Aug 2025, Pruning and Malicious Injection: A Retraining-Free Backdoor Attack on Transformer Models, https://arxiv.org/abs/2508.10243
Eugen Barbulescu, Antonio Alexoaie and Lucian Busoniu, 12 Aug 2025, Hyperflux: Pruning Reveals the Importance of Weights, https://arxiv.org/abs/2504.05349
Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi, 14 Aug 2025, Data Pruning by Information Maximization, https://arxiv.org/abs/2506.01701
Jaeheun Jung, Jaehyuk Lee, Yeajin Lee, Donghun Lee, 22 Jul 2025, IPPRO: Importance-based Pruning with PRojective Offset for Magnitude-indifferent Structural Pruning, https://arxiv.org/abs/2507.14171
Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li, 24 Jul 2025, When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning, https://arxiv.org/abs/2503.07588
Jaeheun Jung, Donghun Lee, 10 Jul 2025, Catalyst: a Novel Regularizer for Structured Pruning with Auxiliary Extension of Parameter Space, https://arxiv.org/abs/2507.14170
Daniel Fein, Gabriela Aranguiz-Dias, 18 Jul 2025, Influence Functions for Preference Dataset Pruning, https://arxiv.org/abs/2507.14344
Yiding Song, 19 Jul 2025, Pruning Increases Orderedness in Recurrent Computation, https://arxiv.org/abs/2507.14747
Ganesh Sundaram, Jonas Ulmen, Amjad Haider, Daniel Görges, 20 Jul 2025, Application-Specific Component-Aware Structured Pruning of Deep Neural Networks via Soft Coefficient Optimization, https://arxiv.org/abs/2507.14882
Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen, 21 Jul 2025, EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent, https://arxiv.org/abs/2507.15428
Jaeseong Lee, seung-won hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He, 21 Jul 2025, STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning, https://arxiv.org/abs/2409.06211
Tingna Wang, Sikai Zhang, Mingming Song, Limin Sun, 21 Jul 2025, Dictionary-Learning-Based Data Pruning for System Identification, https://arxiv.org/abs/2502.11484
Ganesh Sundaram, Jonas Ulmen, and Daniel Görges, 20 Jul 2025, Enhanced Pruning Strategy for Multi-Component Neural Architectures Using Component-Aware Graph Analysis, https://arxiv.org/abs/2504.13296
Dimitris Tsaras, Xing Li, Lei Chen, Zhiyao Xie, Mingxuan Yuan, 11 Aug 2025, ELF: Efficient Logic Synthesis by Pruning Redundancy in Refactoring, https://arxiv.org/abs/2508.08073
Ganesh Sundaram, Jonas Ulmen, Amjad Haider and Daniel Görges, 11 Aug 2025, COMponent-Aware Pruning for Accelerated Control Tasks in Latent Space Models, https://arxiv.org/abs/2508.08144
Enjun Du, Siyi Liu, Yongqi Zhang, 28 Jul 2025, Mixture of Length and Pruning Experts for Knowledge Graphs Reasoning, https://arxiv.org/abs/2507.20498
Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang, 28 Jul 2025, TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model, https://arxiv.org/abs/2507.20630
Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang, 29 Jul 2025, AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning, https://arxiv.org/abs/2412.03248
Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Xiaobao Wei, Sixiang Chen, Zhuo Li, Yang Wang, Liyun Li, Xianming Liu, Ming Lu, Shanghang Zhang, 31 Jul 2025, FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning, https://arxiv.org/abs/2507.23318
Kuan-Ting Tu, Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien, 30 Jul 2025, FGFP: A Fractional Gaussian Filter and Pruning for Deep Neural Networks Compression, https://arxiv.org/abs/2507.22527
Çağatay Demirel, 30 Jul 2025, RocketStack: Level-aware deep recursive ensemble learning framework with adaptive feature fusion and model pruning dynamics, https://arxiv.org/abs/2506.16965
Hongze Sun, Wuque Cai, Duo Chen, Shifeng Mao, Jiayi He, Zhenxing Wang, Dezhong Yao, Daqing Guo, 4 Aug 2025, Toward Efficient Spiking Transformers: Synapse Pruning Meets Synergistic Learning-Based Compensation, https://arxiv.org/abs/2508.01992
Yike Zhang and Zhiyuan He and Huiqiang Jiang and Chengruidong Zhang and Yuqing Yang and Jianyong Wang and Lili Qiu, 4 Aug 2025, LeanK: Learnable K Cache Channel Pruning for Efficient Decoding, https://arxiv.org/abs/2508.02215
Chenqing Lin, Mostafa Hussien, Chengyao Yu, Mohamed Cheriet, Osama Abdelrahman, and Ruixing Ming, 4 Aug 2025, Flexible Automatic Identification and Removal (FAIR)-Pruner: An Efficient Neural Network Pruning Method, https://arxiv.org/abs/2508.02291
Zuxin Ma, Yunhe Cui, Yongbin Qin, 4 Aug 2025, Beyond Manually Designed Pruning Policies with Second-Level Performance Prediction: A Pruning Framework for LLMs, https://arxiv.org/abs/2508.02381
Taehan Lee, Hyukjun Lee, 3 Aug 2025, Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance, https://arxiv.org/abs/2504.01690
Jiaxi Li, Lu Yin, Xilu Wang, 4 Aug 2025, OWLed: Outlier-weighed Layerwise Pruning for Efficient Autonomous Driving Framework, https://arxiv.org/abs/2411.07711
Yiheng Liu, Junhao Ning, Sichen Xia, Xiaohui Gao, Ning Qiang, Bao Ge, Junwei Han, Xintao Hu, 7 Aug 2025, Pruning Large Language Models by Identifying and Preserving Functional Networks, https://arxiv.org/abs/2508.05239
Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, Xiaodong Gu, 8 Aug 2025, Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal, https://arxiv.org/abs/2508.05988
Jucheng Hu, Surong Yang, Lijun Wu, Dongzhan Zhou, 8 Aug 2025, DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning, https://arxiv.org/abs/2504.14810
Zhixuan Lin, Johan Obando-Ceron, Xu Owen He, Aaron Courville, 11 Aug 2025, Adaptive Computation Pruning for the Forgetting Transformer, https://arxiv.org/abs/2504.06949
Shangyu Wu, Hongchao Du, Ying Xiong, Shuai Chen, Tei-wei Kuo, Nan Guan, Chun Jason Xue, 12 Aug 2025, EvoP: Robust LLM Inference via Evolutionary Pruning, https://arxiv.org/abs/2502.14910
Gideon Vos, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi, 12 Aug 2025, Synaptic Pruning: A Biological Inspiration for Deep Learning Regularization, https://arxiv.org/abs/2508.09330
Omar Bazarbachi, Zijun Sun, Yanning Shen, 13 Aug 2025, EGGS-PTP: An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models, https://arxiv.org/abs/2508.09471
Devvrat Joshi and Islem Rekik, 13 Aug 2025, NEURAL: Attention-Guided Pruning for Unified Multimodal Resource-Constrained Clinical Evaluation, https://arxiv.org/abs/2508.09715
Bailey J. Eccles, Leon Wong, Blesson Varghese, 12 Aug 2025, Mosaic: Composite Projection Pruning for Resource-efficient LLMs, https://arxiv.org/abs/2504.06323
Aakash Kumar, Emanuele Natale, 14 Aug 2025, Quantization vs Pruning: Insights from the Strong Lottery Ticket Hypothesis, https://arxiv.org/abs/2508.11020
Manning Zhu, Songtao Guo, Pengzhan Zhou, Yansong Ning, Chang Han, Dewen Qiao, 18 Aug 2025, FedSODA: Federated Fine-tuning of LLMs via Similarity Group Pruning and Orchestrated Distillation Alignment, https://arxiv.org/abs/2508.12727
Ruijia Zhang, Xinyan Zhao, Ruixiang Wang, Sigen Chen, Guibin Zhang, An Zhang, Kun Wang and Qingsong Wen, 15 Aug 2025, SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication, https://arxiv.org/abs/2508.11733
Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Shao Tang, Sayan Ghosh, Xuanzhao Dong, Rajat Koner, Yalin Wang, 16 Aug 2025, EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models, https://arxiv.org/abs/2508.11886
Fabrizio Sandri, Elia Cunegatti, Giovanni Iacca, 17 Aug 2025, 2SSP: A Two-Stage Framework for Structured Pruning of LLMs, https://arxiv.org/abs/2501.17771
Mikołaj Janusz, Tomasz Wojnar, Yawei Li, Luca Benini, Kamil Adamczewski, 19 Aug 2025, One Shot vs. Iterative: Rethinking Pruning Strategies for Model Compression, https://arxiv.org/abs/2508.13836
Nicolò Romandini, Cristian Borcea, Rebecca Montanari, Luca Foschini, 19 Aug 2025, FedUP: Efficient Pruning-based Federated Unlearning for Model Poisoning Attacks, https://arxiv.org/abs/2508.13853
Shuang Ao, Gopal Rumchurn, 20 Aug 2025, S3LoRA: Safe Spectral Sharpness-Guided Pruning in Adaptation of Agent Planner, https://arxiv.org/abs/2508.15068
Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu, 21 Aug 2025, SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning, https://arxiv.org/abs/2508.15212
Samiul Basir Bhuiyan, Md. Sazzad Hossain Adib, Mohammed Aman Bhuiyan, Muhammad Rafsan Kabir, Moshiur Farazi, Shafin Rahman, Nabeel Mohammed, 18 Aug 2025, Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining, https://arxiv.org/abs/2508.15828
Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li, 22 Aug 2025, SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning, https://arxiv.org/abs/2508.16201
Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang, 24 Aug 2025, CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models, https://arxiv.org/abs/2508.17243
Po-Hsien Yu, Yu-Syuan Tseng, and Shao-Yi Chien, 24 Aug 2025, FedKLPR: Personalized Federated Learning for Person Re-Identification with Adaptive Pruning, https://arxiv.org/abs/2508.17431
Krishna Teja Chitty-Venkata, Jie Ye, Xian-He Sun, Anthony Kougkas, Murali Emani, Venkatram Vishwanath, Bogdan Nicolae, 4 Sep 2025, PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference, https://arxiv.org/abs/2509.04377
Makbule Gulcin Ozsoy, 4 Sep 2025, Text2Cypher: Data Pruning using Hard Example Selection, https://arxiv.org/abs/2505.05122
Mounvik K and N Harshit, 5 Sep 2025, Accuracy-Constrained CNN Pruning for Efficient and Reliable EEG-Based Seizure Detection, https://arxiv.org/abs/2509.05190
Hao Zhang, Mengsi Lyu, Yulong Ao, Yonghua Lin, 29 Aug 2025, Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference, https://arxiv.org/abs/2509.04467
Diego Coello de Portugal Mecke, Haya Alyoussef, Maximilian Stubbemann, Ilia Koloiarov, Tom Hanika, Lars Schmidt-Thieme, 5 Sep 2025, STADE: Standard Deviation as a Pruning Metric, https://arxiv.org/abs/2503.22451
Santosh Chapagain, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi, 27 Aug 2025, Pruning Strategies for Backdoor Defense in LLMs, https://arxiv.org/abs/2508.20032
Yi Jiang, Malyaban Bal, Brian Matejek, Susmit Jha, Adam Cobb, Abhronil Sengupta, 23 Aug 2025, Spatio-Temporal Pruning for Compressed Spiking Large Language Models, https://arxiv.org/abs/2508.20122
Wuque Cai, Hongze Sun, Jiayi He, Qianqian Liao, Yunliang Zang, Duo Chen, Dezhong Yao, Daqing Guo, 29 Aug 2025, NSPDI-SNN: An efficient lightweight SNN based on nonlinear synaptic pruning and dendritic integration, https://arxiv.org/abs/2508.21566
Manish Verma, Vivek Sharma, Vishal Singh, 31 Aug 2025, A Hybrid Ai Framework For Strategic Patent Portfolio Pruning: Integrating Learning To-Rank And Market Need Analysis For Technology Transfer Optimization, https://arxiv.org/abs/2509.00958
Yao Fu, Runchao Li, Xianxuan Long, Haotian Yu, Xiaotian Han, Yu Yin, and Pan Li, 27 Aug 2025, Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs, https://arxiv.org/abs/2509.00096
Hanzhen Wang, Jiaming Xu, Jiayi Pan, Yongkang Zhou and Guohao Dai, 6 Sep 2025, SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning, https://arxiv.org/abs/2509.05614
Jaemin Son, Sujin Choi, Inyong Yun, 8 Sep 2025, Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models, https://arxiv.org/abs/2509.06415
Eugene Kwek and Wenpeng Yin, 8 Sep 2025, COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens, https://arxiv.org/abs/2509.06836
Pavithra Elumalai, Sudharsan Vijayaraghavan, Madhumita Mondal, Areejit Samal, 30 Aug 2025, Application of discrete Ricci curvature in pruning randomly wired neural networks: A case study with chest x-ray classification of COVID-19, https://arxiv.org/abs/2509.05322
Xiang Meng, Kayhan Behdin, Haoyue Wang, Rahul Mazumder, 8 Sep 2025, ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models, https://arxiv.org/abs/2406.07831
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma, 5 Sep 2025, Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments, https://arxiv.org/abs/2505.08299
Tianjun Yao, Haoxuan Li, Yongqiang Chen, Tongliang Liu, Le Song, Eric Xing, Zhiqiang Shen, 6 Sep 2025, Pruning Spurious Subgraphs for Graph Out-of-Distribution Generalization, https://arxiv.org/abs/2506.05957
Jiarui Chen, Yikeng Chen, Yingshuang Zou, Ye Huang, Peng Wang, Yuan Liu, Yujing Sun, Wenping Wang, 7 Sep 2025, MEGS²: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning, https://arxiv.org/abs/2509.07021
Ozgu Goksu, Nicolas Pugeault, 9 Sep 2025, Hybrid-Regularized Magnitude Pruning for Robust Federated Learning under Covariate Shift, https://arxiv.org/abs/2412.15010
Hengyu Fang, Yijiang Liu, Yuan Du, Li Du and Huanrui Yang, 11 Sep 2025, SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models, https://arxiv.org/abs/2509.09090
Wenda Qin, Andrea Burns, Bryan A. Plummer, Margrit Betke, 18 Sep 2025, Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning, https://arxiv.org/abs/2509.15250
Avinash Madasu, Vasudev Lal, Phillip Howard, 19 Sep 2025, Pruning the Paradox: How CLIP's Most Informative Heads Enhance Performance While Amplifying Bias, https://arxiv.org/abs/2503.11103
Dieter Balemans, Thomas Huybrechts, Jan Steckel, Siegfried Mercelis, 4 Sep 2025, Resource-Aware Neural Network Pruning Using Graph-based Reinforcement Learning, https://arxiv.org/abs/2509.10526
Jinying Xiao, Ping Li, Jie Nie, Zhe Tang, 15 Sep 2025, SEVEN: Pruning Transformer Model by Reserving Sentinels, https://arxiv.org/abs/2403.12688
Jinying Xiao, Ping Li, Zhe Tang, Jie Nie, 15 Sep 2025, LNPT: Label-free Network Pruning and Training, https://arxiv.org/abs/2403.12690
Wei Huang, Anda Cheng, Yinggui Wang, 10 Sep 2025, Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning, https://arxiv.org/abs/2509.08255
Ahmed Sadaqa, Di Liu, 10 Sep 2025, Compressing CNN models for resource-constrained systems by channel and layer pruning, https://arxiv.org/abs/2509.08714
Mengting Ai, Tianxin Wei, Sirui Chen, Jingrui He, 17 Sep 2025, NIRVANA: Structured pruning reimagined for large language models compression, https://arxiv.org/abs/2509.14230
Colin Hong, Xu Guo, Anand Chaanan Singh, Esha Choukse, Dmitrii Ustiugov, 17 Sep 2025, Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency, https://arxiv.org/abs/2509.13990
Michal Szczepanski, Martyna Poreba and Karim Haroun, 17 Sep 2025, Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions, https://arxiv.org/abs/2509.14165

Structured Pruning

Research papers on structured pruning:

Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle, 29 May 2024, STAT: Shrinking Transformers After Training, https://arxiv.org/abs/2406.00061
Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
Abigail See, Minh-Thang Luong, and Christopher D. Manning. Compression of neural machine translation models via pruning. In CoNLL, pages 291–301. ACL, 2016 https://arxiv.org/abs/1606.09274
Sharan Narang, Gregory F. Diamos, Shubho Sengupta, and Erich Elsen. Exploring sparsity in recurrent neural networks. CoRR, abs/1704.05119, 2017, https://arxiv.org/abs/1704.05119
Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1135–1143. Curran Associates, Inc., 2015, https://arxiv.org/abs/1506.02626
Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar, 9 Feb 2024 (v2), Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes, https://arxiv.org/abs/2402.05406 Code: https://github.com/ldery/Bonsai (Structured pruning of very large LLMs down to 1B or 2B.)
Zhepeng Wang, Isaacshubhanand Putla, Weiwen Jiang, Youzuo Lin, Oct 2023, Edge-InversionNet: Enabling Efficient Inference of InversionNet on Edge Devices, https://arxiv.org/abs/2310.09667 (Using structured pruning via layerwise filter pruning to run a model on a Raspberry Pi.)
H. Tann, S. Hashemi, R. I. Bahar, and S. Reda. Runtime configurable deep neural networks for energy-accuracy trade-off. In CODES + ISSS, pages 34:1–34:10, 2016. https://ieeexplore.ieee.org/document/9166549
David Spuler, March 2024, Chapter 46. Structured Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 16 Jul 2024, MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models, https://arxiv.org/abs/2407.11681
Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
Oshin Dutta, Ritvik Gupta, Sumeet Agarwal, 2024, Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference, https://openreview.net/pdf?id=cqhAzteLzc
Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
Meta, August 14, 2024, How NVIDIA is using structured weight pruning and knowledge distillation to build new Llama models, Meta AI Blog, https://ai.meta.com/blog/nvidia-llama/
Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 19 Jul 2024, Compact Language Models via Pruning and Knowledge Distillation, https://arxiv.org/abs/2407.14679 https://github.com/NVlabs/Minitron (Combination of distillation and structured pruning on the depth and width dimensions.)
Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 26 Aug 2024 (v2), LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796
Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 21 Nov 2024, FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers, https://arxiv.org/abs/2411.14507
Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
Haihang Wu, 9 Dec 2024, LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation, https://arxiv.org/abs/2412.06419
Shangqian Gao, Ting Hua, Reza Shirkavand, Chi-Heng Lin, Zhen Tang, Zhengao Li, Longge Yuan, Fangyi Li, Zeyu Zhang, Alireza Ganjdanesh, Lou Qian, Xu Jie, Yen-Chang Hsu, 25 Jan 2025, ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning, https://arxiv.org/abs/2501.15316
Fabrizio Sandri, Elia Cunegatti, Giovanni Iacca, 29 Jan 2025, 2SSP: A Two-Stage Framework for Structured Pruning of LLMs, https://arxiv.org/abs/2501.17771 https://github.com/FabrizioSandri/2SSP (Dual width-depth pruning of neurons and attention heads.)
Zihuai Xu, Yang Xu, Hongli Xu, Yunming Liao, Zhiwei Yao, Zuan Xie, 25 Jan 2025, Lightweight and Post-Training Structured Pruning for On-Device Large Lanaguage Models, https://arxiv.org/abs/2501.15255
Shaibal Saha, Lanyu Xu, 26 Feb 2025, Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies, https://arxiv.org/abs/2503.02891
Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang, 28 May 2025, SlimLLM: Accurate Structured Pruning for Large Language Models, https://arxiv.org/abs/2505.22689
Jiujun He, Huazhen Lin, 10 Jun 2025, Olica: Efficient Structured Pruning of Large Language Models without Retraining, https://arxiv.org/abs/2506.08436
Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata, 28 Jul 2025, Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study, https://arxiv.org/abs/2507.20749
Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer, 24 Jun 2025, Beyond Token Pruning: Operation Pruning in Vision-Language Models, https://arxiv.org/abs/2507.02909 https://github.com/zxcvfd13502/GSOP (Pruning of entire kernel operations.)
Jaeheun Jung, Donghun Lee, 10 Jul 2025, Catalyst: a Novel Regularizer for Structured Pruning with Auxiliary Extension of Parameter Space, https://arxiv.org/abs/2507.14170
Ganesh Sundaram, Jonas Ulmen, Amjad Haider, Daniel Görges, 20 Jul 2025, Application-Specific Component-Aware Structured Pruning of Deep Neural Networks via Soft Coefficient Optimization, https://arxiv.org/abs/2507.14882
Jaeseong Lee, seung-won hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He, 21 Jul 2025, STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning, https://arxiv.org/abs/2409.06211
Omar Bazarbachi, Zijun Sun, Yanning Shen, 13 Aug 2025, EGGS-PTP: An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models, https://arxiv.org/abs/2508.09471
Krishna Teja Chitty-Venkata, Jie Ye, Xian-He Sun, Anthony Kougkas, Murali Emani, Venkatram Vishwanath, Bogdan Nicolae, 4 Sep 2025, PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference, https://arxiv.org/abs/2509.04377
Mengting Ai, Tianxin Wei, Sirui Chen, Jingrui He, 17 Sep 2025, NIRVANA: Structured pruning reimagined for large language models compression, https://arxiv.org/abs/2509.14230

Block Pruning

Block pruning is structured pruning at a very fine level of granularity, at the vector level or below. A "block" is a small chunk of a tensor: either a sub-sequence of a vector, or a small rectangular region of a matrix (also called a "tile"). Whole blocks are kept or removed together, so that computation on a pruned block can be skipped entirely.
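
As a rough illustration of the idea, here is a minimal pure-Python sketch of tile-level pruning. The function name `block_prune`, the mean-absolute-value scoring rule, and the threshold are all assumptions made for this example rather than a method taken from any of the papers listed below.

```python
def block_prune(weights, block_rows, block_cols, threshold):
    """Zero every tile whose mean absolute weight is below threshold.

    Hypothetical helper for illustration only.
    Returns (pruned copy of weights, set of (row, col) origins of pruned tiles).
    """
    n_rows, n_cols = len(weights), len(weights[0])
    pruned = [row[:] for row in weights]  # copy, so the input is not modified
    pruned_tiles = set()
    for r0 in range(0, n_rows, block_rows):
        for c0 in range(0, n_cols, block_cols):
            rows = range(r0, min(r0 + block_rows, n_rows))
            cols = range(c0, min(c0 + block_cols, n_cols))
            total = sum(abs(weights[r][c]) for r in rows for c in cols)
            mean_abs = total / (len(rows) * len(cols))
            if mean_abs < threshold:
                # The whole tile is removed together (structured, not per-weight).
                for r in rows:
                    for c in cols:
                        pruned[r][c] = 0.0
                pruned_tiles.add((r0, c0))
    return pruned, pruned_tiles


weights = [
    [0.90, -0.80, 0.01, 0.02],
    [0.70, 0.60, -0.03, 0.00],
    [0.02, 0.01, 0.50, -0.40],
    [-0.01, 0.03, 0.30, 0.90],
]
pruned, pruned_tiles = block_prune(weights, 2, 2, threshold=0.1)
```

In this toy 4x4 matrix with 2x2 tiles, the two off-diagonal tiles hold only near-zero weights and are zeroed as units. A matrix multiplication kernel that knows which tiles were pruned could then skip them, which is the source of the speedup.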

Research papers on block pruning:

Shikhar Tuli, Niraj K. Jha, EdgeTran: Co-designing Transformers for Efficient Inference on Mobile Edge Platforms, arXiv preprint arXiv:2303.13745, 2023, https://arxiv.org/abs/2303.13745
Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. CVPR, pages 8817–8826, 2018. https://arxiv.org/abs/1711.08393 Code: https://github.com/Tushar-N/blockdrop
Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similarity in model layers.)
Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
Anonymous Author, 2024, A deeper look at depth pruning of LLMs, https://openreview.net/pdf?id=9B7ayWclwN
Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
Yang Zhang, Yawei Li, Xinpeng Wang, Qianli Shen, Barbara Plank, Bernd Bischl, Mina Rezaei, Kenji Kawaguchi, 28 May 2024, FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models, https://arxiv.org/abs/2405.18218
Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie, 19 Sep 2024, CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs, https://arxiv.org/abs/2409.12490
Jinming Lou, Wenyang Luo, Yufan Liu, Bing Li, Xinmiao Ding, Weiming Hu, Jiajiong Cao, Yuming Li, Chenguang Ma, 27 Sep 2024, Token Caching for Diffusion Transformer Acceleration, https://arxiv.org/abs/2409.18523
Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 21 Nov 2024, FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers, https://arxiv.org/abs/2411.14507
Haihang Wu, 9 Dec 2024, LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation, https://arxiv.org/abs/2412.06419
Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, Song Han, 30 Jan 2025, SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer, https://arxiv.org/abs/2501.18427 (Diffusion model optimization using block-level depth pruning and inference-time scaling.)
Juyun Wee, Minjae Park, Jaeho Lee, 4 Feb 2025, Prompt-based Depth Pruning of Large Language Models, https://arxiv.org/abs/2502.04348
Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han, 20 Feb 2025, LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention, https://arxiv.org/abs/2502.14866
Krishna Teja Chitty-Venkata, Jie Ye, Xian-He Sun, Anthony Kougkas, Murali Emani, Venkatram Vishwanath, Bogdan Nicolae, 4 Sep 2025, PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference, https://arxiv.org/abs/2509.04377

Vector Pruning

Vector pruning is structured pruning at a very fine level of granularity, at the vector level. Block pruning is slightly finer-grained, since a block can be a sub-portion of a vector.
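
A minimal sketch of vector-level pruning is shown below, treating each row of a weight matrix as the vector to keep or drop. The function name `prune_rows` and the L2-norm criterion are assumptions for illustration; real methods use a variety of importance scores.

```python
import math


def prune_rows(weights, threshold):
    """Drop whole rows whose L2 norm is below threshold.

    Hypothetical helper for illustration only.
    Returns (smaller matrix, list of original row indices that were kept).
    """
    kept = [
        i for i, row in enumerate(weights)
        if math.sqrt(sum(w * w for w in row)) >= threshold
    ]
    return [weights[i][:] for i in kept], kept


weights = [
    [0.5, -0.5, 0.5, 0.5],     # L2 norm 1.0
    [0.01, 0.0, -0.02, 0.01],  # L2 norm about 0.024
    [0.0, 3.0, 4.0, 0.0],      # L2 norm 5.0
]
smaller, kept = prune_rows(weights, threshold=0.1)
```

Unlike zeroing individual weights, this physically shrinks the matrix from three rows to two. In a real model, whatever consumes this matrix's output would also need the matching dimension removed, which is why the kept indices are returned.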

Research papers on vector pruning:

David Spuler, March 2024, Chapter 46. Structured Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)
Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similarity in model layers.)
Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)

Dynamic Pruning

Dynamic pruning refers to pruning of network weights, links, or entire layers at runtime during inference. This differs from "static pruning", which is done offline, either during training or as a post-training optimization that creates a modified model. The types of dynamic pruning may include:

Dynamic weight pruning (dynamic unstructured pruning): This means suppressing near-zero weights and computed probabilities dynamically, which creates sparsity, sometimes called "network sparsification". This allows optimizations related to sparse matrices to reduce overall computations. Weight magnitude pruning is often done statically, as in most of the research, but can also be done dynamically during inference.
Dynamic depth pruning: Skipping of inference of entire layers of the model using an "early exit" of the inference loop. See also depth pruning, layer pruning, layer skipping, layer fusion, and shallow decoders.
Dynamic width pruning: Dynamically reducing the "width" of the model based on the input. See width pruning, attention head pruning, channel pruning, filter pruning.
Dynamic length pruning: Adaptive to the input to modify internal dimensions related to tokens, embeddings, etc. See length pruning, token pruning, embeddings pruning, autoregressive algorithms.
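
To make the "dynamic" part concrete, the sketch below shows one hypothetical form of runtime sparsification: a matrix-vector product that skips input elements whose magnitude falls below a threshold. The function name `dynamic_sparse_matvec` and the `eps` cutoff are assumptions for this example. It prunes on activation magnitude rather than on stored weights, so treat it as an illustration of input-dependent skipping rather than any specific published method.

```python
def dynamic_sparse_matvec(weights, x, eps):
    """Compute weights @ x, skipping input elements with |x[j]| < eps.

    Hypothetical helper for illustration only.
    Returns (result vector, number of multiplications performed).
    """
    # The pruning decision is made per input, at inference time.
    active = [j for j, v in enumerate(x) if abs(v) >= eps]
    result = [sum(row[j] * x[j] for j in active) for row in weights]
    return result, len(weights) * len(active)


weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
x = [2.0, 0.001, -1.0]
y, mults = dynamic_sparse_matvec(weights, x, eps=0.01)
```

Here the middle input element is treated as zero, so only 4 of the 6 multiplications are performed. The result is approximate, and a different input would prune a different set of columns.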

Note that all types of dynamic pruning incur some extra inference cost in the calculations that decide whether to prune or not. The hope is that the benefit of pruning will exceed the cost of the decision logic. For example, choosing an "early exit" criterion for layer pruning will require extra computation at each layer, which is hopefully recouped by skipping layers often enough.
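
This trade-off can be sketched with a toy early-exit loop and a simple cost model. Everything here is an assumption for illustration: the names `run_with_early_exit` and `net_saving`, the stand-in layers, the confidence function, and the cost units are all hypothetical.

```python
def run_with_early_exit(layers, confidence, x, threshold):
    """Run layers in order, stopping once confidence(x) >= threshold.

    Hypothetical helper for illustration only.
    Returns (output, layers_run, checks_run).
    """
    layers_run = checks_run = 0
    for layer in layers:
        x = layer(x)
        layers_run += 1
        checks_run += 1  # the exit test is paid at every layer that runs
        if confidence(x) >= threshold:
            break
    return x, layers_run, checks_run


def net_saving(total_layers, layers_run, checks_run, layer_cost, check_cost):
    """Cost of full inference minus cost of the early-exit run (toy units)."""
    full = total_layers * layer_cost
    actual = layers_run * layer_cost + checks_run * check_cost
    return full - actual


layers = [lambda v: v + 1.0] * 6       # six stand-in "layers"
confidence = lambda v: v / 6.0         # stand-in confidence score
out, layers_run, checks_run = run_with_early_exit(layers, confidence, 0.0, 0.5)
saving = net_saving(6, layers_run, checks_run, layer_cost=100, check_cost=5)
```

With these made-up costs, exiting after 3 of 6 layers saves 285 units. If the exit test never fires, all 6 layers run plus 6 checks, and the "saving" is -30 units, which is the overhead case described above.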

Research Papers on Dynamic Pruning

Some general research on dynamic or adaptive pruning methods:

Yiwen Guo, Anbang Yao, and Yurong Chen. 2016, Dynamic network surgery for efficient DNNs. Advances in neural information processing systems, 29, 2016. https://arxiv.org/abs/1608.04493 (Dynamic pruning.)
Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. 2020, Dynamic model pruning with feedback. In International Conference on Learning Representations, https://arxiv.org/abs/2006.07253
Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 2023, SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks https://arxiv.org/abs/2309.00255 (Generalization of multi-dimensional pruning, by training a large neural network with many sub-networks across different width and depth dimensions.)
H Wang, C Qin, Y Bai, Y Zhang, Y Fu, 2022, Recent advances on neural network pruning at initialization, arXiv preprint arXiv:2103.06460, 2021 (updated May 2022), https://arxiv.org/abs/2103.06460
Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao, 22 Jan 2024, APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference, https://arxiv.org/abs/2401.12200
David Spuler, March 2024, Chapter 33. Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Jinting Chen, Zhaocheng Zhu, Cheng Li, Yuming Zhao, Oct 2019, Self-Adaptive Network Pruning, https://arxiv.org/abs/1910.08906
Zejian Liu, Fanrong Li, Gang Li, and Jian Cheng. 2021. EBERT: Efficient BERT Inference with Dynamic Structured Pruning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 4814–4823. https://aclanthology.org/2021.findings-acl.425/ PDF: https://aclanthology.org/2021.findings-acl.425.pdf
Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao, July 2024, APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:60812-60831, 2024. https://proceedings.mlr.press/v235/zhao24g.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/zhao24g/zhao24g.pdf
Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
David Spuler, March 2024, Static vs Dynamic Pruning, in Generative AI in C++, https://www.aussieai.com/book/ch33-static-vs-dynamic-pruning
Y An, Z Chen, C Xiong, B Chen, Jan 2025, Herd: Grouping before Pruning for Batch Inference, https://oasis-git.github.io/data/herd.pdf (Dynamic activation pruning across multiple queries in a batch.)
Haihang Wu, Wei Wang, Tamasha Malepathirana, Sachith Seneviratne, Denny Oetomo, Saman Halgamuge, 10 Dec 2024, TT-MPD: Test Time Model Pruning and Distillation, https://arxiv.org/abs/2412.07114
Shaibal Saha, Lanyu Xu, 26 Feb 2025, Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies, https://arxiv.org/abs/2503.02891
Huanrong Liu, Chunlin Tian, Xuyang Wei, Jiaheng Dai, Qin Liu, Tianqi Wei, Qingbiao Li, Li Li, 6 May 2025 (v2), RAP: Runtime-Adaptive Pruning for LLM Inference, https://arxiv.org/abs/2505.17138

Combined Dynamic Width/Depth Pruning (Dual Pruning)

It's possible to prune both the depth (e.g. layer pruning, early exit) and the width (e.g. width pruning, channel pruning, token pruning, slimming networks, etc.) at the same time. This is also called "hybrid pruning" or "dual pruning". See more research papers on dual pruning and triple pruning.
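As a rough illustration of combining the two dimensions, the Python sketch below runs a toy residual stack in which each layer only multiplies the largest-magnitude inputs (a stand-in for dynamic width pruning) and the loop stops once a layer barely changes the hidden state (a stand-in for early exit). The function name `dual_pruned_forward`, the `keep_ratio` and `exit_threshold` parameters, and both pruning criteria are assumptions chosen for brevity, not a method taken from any of the cited papers.

```python
def dual_pruned_forward(x, layers, keep_ratio=0.5, exit_threshold=0.01):
    """Toy forward pass with dynamic width pruning and early exit.

    x: hidden state as a list of floats.
    layers: list of square weight matrices, each a list of rows.
    Returns (final_hidden_state, number_of_layers_executed).
    """
    h = list(x)
    executed = 0
    for w in layers:
        # Width pruning (assumed criterion): keep the k inputs with the
        # largest absolute value and skip the remaining weight columns.
        k = max(1, int(len(h) * keep_ratio))
        keep = sorted(range(len(h)), key=lambda i: abs(h[i]), reverse=True)[:k]
        update = [sum(row[i] * h[i] for i in keep) for row in w]

        # Residual connection: add the layer's update to the hidden state.
        h = [a + b for a, b in zip(h, update)]
        executed += 1

        # Depth pruning (assumed criterion): exit once a layer's update
        # is negligible, so all later layers are skipped.
        if max(abs(u) for u in update) < exit_threshold:
            break
    return h, executed


# Three hypothetical 2x2 layers; the second produces no update.
toy_layers = [
    [[0.5, 100.0], [0.0, 100.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
state, layers_run = dual_pruned_forward([2.0, 0.1], toy_layers)
```

With these toy values, the small second input is skipped in the first layer despite its large weights, and the loop exits after the second layer, so the third layer is never executed. Both decisions add their own overhead (a sort and a maximum per layer here), which is the decision cost that any dynamic pruning scheme has to recoup.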