Aussie AI

LLM Model Pruning Research

  • Last Updated 29 August, 2025
  • by David Spuler, Ph.D.

Model pruning is a type of model compression for Large Language Models (LLMs) and other neural networks. Conceptually, this involves removal of weights within the model, which reduces the total computation required for model inference (i.e. "running" the AI model to generate a response).

Types of Model Pruning

The top-level classification is whether pruning is done to the model itself, or during inference.

  • Static pruning. Changing the model during or after training, but not during inference.
  • Dynamic pruning. Inference decisions that effectively prune parts of the network.
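
As a rough illustration of this distinction, here is a minimal sketch using a toy list of weights. The function names `static_prune` and `dynamic_dot`, the thresholds, and the choice of skipping near-zero activations are all assumptions made for this example; activation-based skipping is only one possible form of dynamic pruning.

```python
def static_prune(weights, threshold):
    """Static pruning: zero small weights once, before inference."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def dynamic_dot(weights, activations, act_threshold):
    """Dynamic pruning: decide per input which terms to skip.

    The stored weights are not modified; a term is skipped whenever
    its activation is close to zero for this particular input.
    """
    total = 0.0
    skipped = 0
    for w, a in zip(weights, activations):
        if abs(a) < act_threshold:
            skipped += 1
            continue
        total += w * a
    return total, skipped


weights = [0.5, -0.01, 0.3, 0.02]

# Static: the model itself is changed.
pruned = static_prune(weights, threshold=0.05)

# Dynamic: the model is unchanged, and the skipped terms depend on the input.
result, skipped = dynamic_dot(weights, [1.0, 2.0, 0.0, 0.001], act_threshold=0.01)
```

With a different input vector, the dynamic version may skip a different set of terms, whereas the statically pruned weights stay the same for every input.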

There is also the very general category according to strategy:

  • Unstructured pruning. Remove low weight links ("magnitude pruning") regardless of where they are in the structure of the model.
  • Structured pruning. Everything else, which means removing weight groups in a specific part of the structure of the model (e.g. all weights in a layer, channel, filter, etc.).

Note that a hybrid strategy is to do unstructured magnitude pruning of all the weights in one particular type of structure, rather than removing an entire structural component (e.g. layer-specific magnitude pruning).
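
The difference between removing a whole structure and magnitude pruning inside one structure can be sketched as follows. This is a toy illustration only: the model is assumed to be a dictionary mapping hypothetical layer names to flat weight lists, and the helper names `remove_layer` and `prune_within_layer` are invented for this example.

```python
def remove_layer(model, layer_name):
    """Structured pruning: drop every weight belonging to one layer."""
    return {name: list(ws) for name, ws in model.items() if name != layer_name}


def prune_within_layer(model, layer_name, threshold):
    """Hybrid: magnitude-prune only the weights inside one layer."""
    pruned = {name: list(ws) for name, ws in model.items()}
    pruned[layer_name] = [
        0.0 if abs(w) < threshold else w for w in model[layer_name]
    ]
    return pruned


model = {
    "layer0": [0.4, -0.02, 0.01],
    "layer1": [0.03, -0.6, 0.001],
}

structured = remove_layer(model, "layer1")
hybrid = prune_within_layer(model, "layer1", threshold=0.05)
```

In the hybrid result, `layer1` still exists but only its large-magnitude weight survives, and `layer0` is untouched.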

Unstructured Pruning

Weight pruning or "magnitude pruning" is unstructured pruning that removes very low weights, including small negative weights. Technically, weight pruning is the general class of pruning decisions made about a single weight at a time, whereas magnitude pruning is the specific case of pruning based on cutting small near-zero absolute-value weights (using some threshold). Magnitude pruning also obviously removes zero weights (see also zero skipping), or indeed the nebulous oddity called "negative-zero" if that is found.
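
A minimal sketch of threshold-based magnitude pruning is shown below. The threshold value and the helper names `magnitude_prune` and `sparsity` are assumptions for illustration; how to choose a good threshold is not addressed here.

```python
def magnitude_prune(weights, threshold):
    """Zero every weight whose absolute value is at or below the threshold.

    Small negative weights are removed too, and negative zero (-0.0)
    is normalized to plain 0.0.
    """
    return [0.0 if abs(w) <= threshold else w for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.8, -0.003, 0.0, -0.0, 0.02, -0.5]
pruned = magnitude_prune(weights, threshold=0.01)
```

Here three of the six weights end up as zero: the small negative weight, the existing zero, and the negative zero.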

Note that quantization techniques can also increase the number of zero weights throughout the quantized model. Quantization may round low-magnitude weights down to zero, with a smaller number of discrete weights possible. However, it depends on the level of quantization, and the choices made about rounding versus truncation in the quantization method.
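
To see how the quantization level and the rounding choice affect the number of zeros, consider this simplified sketch. It assumes a basic symmetric quantization scheme with integer levels from `-num_levels` to `+num_levels`; the function `quantize` and its parameters are hypothetical, and real quantization methods are more involved.

```python
def quantize(weights, num_levels, mode="round"):
    """Map weights to integers in [-num_levels, num_levels].

    mode="round" rounds to the nearest level; any other value
    truncates toward zero.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / num_levels
    if mode == "round":
        return [round(w / scale) for w in weights]
    return [int(w / scale) for w in weights]


weights = [1.0, 0.2, -0.12, 0.05, -0.6]

fine_rounded = quantize(weights, num_levels=4, mode="round")
fine_truncated = quantize(weights, num_levels=4, mode="truncate")
coarse_rounded = quantize(weights, num_levels=1, mode="round")

# Count how many weights became zero in each case.
zero_counts = (
    fine_rounded.count(0),
    fine_truncated.count(0),
    coarse_rounded.count(0),
)
```

In this toy case, truncation produces more zeros than rounding at the same level, and the coarser quantization also produces more zeros than the finer one.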

The idea is simply that weights with a low magnitude are not contributing much to the overall inference, and hence skipping these calculations should not greatly affect the overall accuracy of the model. Theory says that this works quite well. This is the simplest and most longstanding type of pruning, and has a large body of research.

An important practical aspect of unstructured pruning is that it has no latency benefit unless multiplication by zeros in vector dot products can be efficiently avoided. For example, if a vector with some zeroed elements is still sent in full to an accelerator, any benefit depends on the characteristics of the hardware, or on the algorithms used by the deep learning compiler (or graph compiler), and the latency improvement may be limited.
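
The point can be sketched with a toy dot product. In this illustration (the function names and the index-value pair storage are assumptions, not any particular library's format), the dense version still multiplies by every zero, whereas the sparse version stores only the nonzero weights and skips the rest.

```python
def dense_dot(weights, x):
    """Multiply every element, including the zeros."""
    return sum(w * xi for w, xi in zip(weights, x))


def to_sparse(weights):
    """Keep only the nonzero weights as (index, value) pairs."""
    return [(i, w) for i, w in enumerate(weights) if w != 0.0]


def sparse_dot(sparse_weights, x):
    """Multiply only the stored nonzero weights."""
    return sum(w * x[i] for i, w in sparse_weights)


weights = [0.5, 0.0, 0.0, -0.25, 0.0, 2.0]   # already pruned
x = [2.0, 3.0, 4.0, 4.0, 5.0, 0.5]

sparse = to_sparse(weights)
dense_result = dense_dot(weights, x)      # 6 multiplications
sparse_result = sparse_dot(sparse, x)     # 3 multiplications
```

Both versions give the same answer, but whether the sparse form is actually faster depends on the indexing overhead and on how the hardware handles irregular memory access.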

In addition to the simplistic idea of cutting all low weights throughout the model, much research has been done on various techniques to increase the number of such low weights, by making the weight matrices sparser. Research on such issues is addressed under topics such as "magnitude pruning", "sparsification" and "sparse matrix" theory.

Movement Pruning

Another type of unstructured pruning is "movement pruning". This is a pruning of weights during training, based on how they change during the training process. Weights that are "moving" towards zero as training progresses are removed (pruned), whereas weights "moving away from zero" are retained. Note that this relates to both positive and negative weights (with opposite movement directions). It's also possible to combine a movement threshold and an absolute magnitude threshold, so that important weights with large absolute values are not subsequently removed if they change a tiny amount towards zero. Overall, this means movement pruning focuses on the changes to weights as they are trained, rather than on their absolute value at the end of training.
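
The description above can be sketched in a simplified form by comparing each weight before and after some training steps. This is a toy illustration of the idea as described here, not the algorithm of any particular paper; the function `movement_prune` and both thresholds are assumptions.

```python
def movement_prune(before, after, move_threshold, keep_magnitude):
    """Prune weights that moved toward zero during training.

    A weight is removed only if its absolute value shrank by more than
    move_threshold AND its final magnitude is below keep_magnitude, so
    large important weights are protected from small movements.
    """
    pruned = []
    for w0, w1 in zip(before, after):
        moved_toward_zero = abs(w0) - abs(w1)   # positive when shrinking
        if moved_toward_zero > move_threshold and abs(w1) < keep_magnitude:
            pruned.append(0.0)
        else:
            pruned.append(w1)
    return pruned


before = [0.30, -0.40, 0.10, 2.00, -0.05]
after = [0.10, -0.60, 0.15, 1.90, -0.01]

result = movement_prune(before, after, move_threshold=0.03, keep_magnitude=1.0)
```

The first and last weights are pruned because they shrank toward zero (one positive, one negative). The weight at 1.90 also moved toward zero, but is retained because its magnitude is still large.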

Research papers on movement pruning:

Semi-Structured Pruning

Semi-structured pruning is part-way between structured and unstructured pruning. It involves zeroing some weights in an unstructured manner, but doing so in particular structures, or according to various patterns or limitations.
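
One commonly cited example of such a pattern is "N:M" sparsity, where only N weights are kept in every group of M consecutive weights (e.g. 2:4). The sketch below is an illustrative assumption rather than the method of any paper listed here, and the function name `prune_n_of_m` is hypothetical.

```python
def prune_n_of_m(weights, n=2, m=4):
    """In each group of m consecutive weights, keep the n largest by magnitude."""
    out = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest-magnitude weights within this group.
        keep = sorted(
            range(len(group)), key=lambda i: abs(group[i]), reverse=True
        )[:n]
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
pruned = prune_n_of_m(weights, n=2, m=4)
```

Which weights are zeroed is decided by magnitude (unstructured), but the number of zeros per group is fixed (structured), which is what makes the pattern predictable for hardware or kernels.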

  • Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325

Types of Structured Pruning

The more specific types of structured pruning depend on what type of "structure" has the weights being pruned:

Another technique similar to structured pruning is to cut weights by using smaller matrices. Advanced matrix algebra can be used to factorize the large matrices into smaller "low-rank" matrices, with fewer rows and columns (hence, fewer weights); see matrix algebra.
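
The weight savings from a low-rank factorization are simple arithmetic. As a sketch, assume a weight matrix W of size rows x cols is approximated by the product of two smaller matrices A (rows x rank) and B (rank x cols). The sizes below are hypothetical, and the example says nothing about how much accuracy is lost.

```python
def dense_params(rows, cols):
    """Number of weights in the original matrix W."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Number of weights in A (rows x rank) plus B (rank x cols)."""
    return rows * rank + rank * cols


rows, cols, rank = 4096, 4096, 64

original = dense_params(rows, cols)
factorized = low_rank_params(rows, cols, rank)
reduction = original / factorized
```

For these assumed sizes, the two factors together hold 32 times fewer weights than the original matrix.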

Hybrid pruning

There are various hybrid pruning strategies, where it is possible to combine two types of pruning (or indeed with other non-pruning optimizations such as quantization). For example, you can try to prune both depth and width at the same time (called "dual pruning"). There's also "triple pruning" of depth, width, and length. Or you can combine structured and unstructured pruning, e.g. structured pruning of a particular part of the model (e.g. a layer), but then only do unstructured weight pruning (magnitude pruning) of that structure, such that only low-value unimportant weights are pruned. This is an area of intensive ongoing research, and many hybrid pruning strategies are being tested.
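
As a rough sketch of dual pruning, the toy example below prunes depth and width together. Everything here is an assumption for illustration: the model is a list of layers, each layer is a list of attention heads with a made-up importance score, depth pruning simply drops the final layers, and width pruning keeps the highest-scoring heads.

```python
def dual_prune(layers, keep_layers, keep_heads):
    """Prune depth (layers) and width (heads per layer) together.

    layers: list of layers; each layer is a list of (head_name, importance).
    """
    kept_layers = layers[:keep_layers]            # depth pruning
    result = []
    for heads in kept_layers:
        ranked = sorted(heads, key=lambda h: h[1], reverse=True)
        result.append(ranked[:keep_heads])        # width pruning
    return result


layers = [
    [("h0", 0.9), ("h1", 0.1), ("h2", 0.5)],
    [("h0", 0.2), ("h1", 0.8), ("h2", 0.3)],
    [("h0", 0.4), ("h1", 0.6), ("h2", 0.7)],
]

pruned = dual_prune(layers, keep_layers=2, keep_heads=2)
```

The result has two layers with two heads each, so both dimensions have been reduced in a single pass.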

Multidimensional Pruning

There are many types of pruning, and some are orthogonal to each other. This means it is possible, and desirable, to prune a model from multiple dimensions. Some of the types include:

Quadruple pruning? Is there a way to do four? For example, can you combine the four dimensions:

  • Depth: layers
  • Width: attention heads, neurons, or FFNs
  • Length: tokens, and
  • Model dimension: embeddings dimension

Survey Papers on Model Pruning

Review papers with coverage of pruning include:

Model Pruning General Research

Model pruning is a long-standing algorithm for model compression, with extensive literature, and numerous types of pruning.

  • Trim insignificant weights (TensorFlow), https://www.tensorflow.org/model_optimization/guide/pruning
  • Pruning comprehensive guide (TensorFlow), https://www.tensorflow.org/model_optimization/guide/pruning/comprehensive_guide
  • Song Han, Huizi Mao, William J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", arXiv:1510.00149v5 [cs.CV], 15 Feb 2016, https://arxiv.org/abs/1510.00149
  • Kim Martineau, "A foolproof way to shrink deep learning models", MIT News, April 30, 2020, https://news.mit.edu/2020/foolproof-way-shrink-deep-learning-models-0430
  • Jonathan Frankle, Michael Carbin, "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks", March 2019, https://arxiv.org/abs/1803.03635
  • Shaokai Ye, Tianyun Zhang, Kaiqi Zhang, Jiayu Li, Jiaming Xie, Yun Liang, Sijia Liu, Xue Lin, Yanzhi Wang, "A Unified Framework of DNN Weight Pruning and Weight Clustering/Quantization Using ADMM", November 2018, https://arxiv.org/abs/1811.01907
  • Fahim Dalvi, Hassan Sajjad, Nadir Durrani, and Yonatan Belinkov. 2020. Analyzing redundancy in pretrained transformer models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4908–4926, Online. Association for Computational Linguistics. https://arxiv.org/abs/2004.04010 (Detailed analysis finding redundancy in 85% of parameters, with relevance to pruning and sharing.)
  • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, M Xia, T Gao, Z Zeng, D Chen, arXiv preprint arXiv:2310.06694, Oct 2023, https://arxiv.org/pdf/2310.06694.pdf, Code: https://github.com/princeton-nlp/LLM-Shearing
  • Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle, 29 May 2024, STAT: Shrinking Transformers After Training, https://arxiv.org/abs/2406.00061
  • Maolin Wang, Yao Zhao, Jiajia Liu, Jingdong Chen, Chenyi Zhuang, Jinjie Gu, Ruocheng Guo, Xiangyu Zhao, May 2024, Large Multimodal Model Compression via Iterative Efficient Pruning and Distillation, WWW '24: Companion Proceedings of the ACM on Web Conference 2024, May 2024, Pages 235–244, https://doi.org/10.1145/3589335.3648321
  • Maolin Wang, Yao Zhao, Jiajia Liu, Jingdong Chen, Chenyi Zhuang, Jinjie Gu, Ruocheng Guo, Xiangyu Zhao, Dec 2023, Large Multimodal Model Compression via Efficient Pruning and Distillation at AntGroup, https://arxiv.org/abs/2312.05795
  • Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
  • Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
  • Busayo Awobade, Mardiyyah Oduwole, Steven Kolawole, 6 Apr 2024, What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models, https://arxiv.org/abs/2404.04759 (General article showing that the big three of model compression work not just on compressing big LLMs, but also on making small models even smaller.)
  • Seungtae Hong, Gunju Park, Jeong-Si Kim, 9 June 2024, Automated deep-learning model optimization framework for microcontrollers, https://doi.org/10.4218/etrij.2023-0522 https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2023-0522 (Framework for using quantization and pruning on microcontroller devices.)
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren. 2020. PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-Based Weight Pruning. Association for Computing Machinery, New York, NY, USA, 907–922. https://doi.org/10.1145/3373376.3378534 (Pattern-based pruning method.)
  • W Li, H Hacid, E Almazrouei, M Debbah, 2023, A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39
  • T Liang, J Glossner, L Wang, S Shi, X Zhang, 2021, Neurocomputing, Pruning and quantization for deep neural network acceleration: A survey, https://arxiv.org/abs/2101.09671
  • Abhiroop Bhattacharjee, Yeshwanth Venkatesha, Abhishek Moitra, Priyadarshini Panda, MIME: adapting a single neural network for multi-task inference with memory-efficient dynamic pruning. In: DAC (2022) https://arxiv.org/abs/2204.05274
  • Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982
  • Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
  • Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017. https://arxiv.org/abs/1710.09282
  • Gale, T., Elsen, E., and Hooker, S., The state of sparsity in deep neural networks, arXiv preprint arXiv:1902.09574, 2019, https://arxiv.org/abs/1902.09574
  • Kwon, W., Kim, S., Mahoney, M. W., Hassoun, J., Keutzer, K., and Gholami, A., 2022, A fast post-training pruning framework for transformers, arXiv preprint arXiv:2204.09656, https://arxiv.org/abs/2204.09656
  • S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, 2016, EIE: Efficient inference engine on compressed deep neural network, in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA, IEEE Press, pp. 243–254, https://doi.org/10.1109/ISCA.2016.30 https://arxiv.org/abs/1602.01528
  • Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017, Thinet: A filter level pruning method for deep neural network compression. In ICCV, pages 5058–5066, https://arxiv.org/abs/1707.06342
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748 (General survey of various model compression and other inference optimizations.)
  • Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang, 19 Dec 2023, Fluctuation-based Adaptive Structured Pruning for Large Language Models, https://arxiv.org/abs/2312.11983 Code: https://github.com/CASIA-IVA-Lab/FLAP
  • Vladimír Boža, 1 Jan 2024, Fast and Optimal Weight Update for Pruned Large Language Models, https://arxiv.org/abs/2401.02938 Code: https://github.com/fmfi-compbio/admm-pruning (Fast algorithm for fine-tuning after pruning to recover any lost model accuracy efficiently.)
  • S Guo, J Xu, LL Zhang, M Yang, Oct 2023, Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models, arXiv preprint arXiv:2310.05015, https://arxiv.org/pdf/2310.05015.pdf Code: https://github.com/microsoft/Moonlit/tree/main/Compresso
  • Zixiao Wang, Jingwei Zhang, Wenqian Zhao, Farzan Farnia, Bei Yu, 11 Jun 2024, MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations, https://arxiv.org/abs/2406.07017 Code: https://github.com/ShiningSord/MoreauPruner
  • K Nan, S Liu, J Du, H Liu, 2019, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762 PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
  • Meriam Dhouibi, Ahmed Karim Ben Salem, Afef Saidi, Slim Ben Saoud, March 2021, Accelerating deep neural networks implementation: A survey, https://doi.org/10.1049/cdt2.12016 PDF: https://ietresearch.onlinelibrary.wiley.com/doi/pdfdirect/10.1049/cdt2.12016
  • David Spuler, March 2024, Chapter 33. Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Xinyin Ma, Gongfan Fang, Xinchao Wang, May 2023, LLM-Pruner: On the Structural Pruning of Large Language Models, https://arxiv.org/abs/2305.11627
  • J. Choi and S. Venkataramani, 2019, Approximate Computing Techniques for Deep Neural Networks. Cham: Springer, 2019, pp. 307–329, Chapter 15, https://link.springer.com/chapter/10.1007/978-3-319-99322-5_15
  • Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
  • Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Sharath Sreenivas, Vinh Nguyen, Saurav Muralidharan, Marcin Chochowski and Raviraj Joshi, How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model, Aug 14, 2024, https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/
  • Szabolcs Cséfalvay, James Imber, 31 Jan 2023 (v2), Self-Compressing Neural Networks, https://arxiv.org/abs/2301.13142
  • Sophia R. Cunningham, Dominique Archambault, Austin Kung, May 2024, Efficient Training and Inference: Techniques for Large Language Models Using Llama, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171651876.65094225/v1
  • Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
  • David Spuler, September 2nd, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
  • Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631. https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
  • Yanyue Xie, Zhi Zhang, Ding Zhou, Cong Xie, Ziang Song, Xin Liu, Yanzhi Wang, Xue Lin, An Xu, 15 Oct 2024, MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router, https://arxiv.org/abs/2410.12013 (Pruning applied to MoE.)
  • Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin, 21 Oct 2024, Pruning Foundation Models for High Accuracy without Retraining, https://arxiv.org/abs/2410.15567 https://github.com/piuzha/APT
  • Mostafa Hussien, Mahmoud Afifi, Kim Khoa Nguyen, Mohamed Cheriet, 21 Oct 2024, Small Contributions, Small Networks: Efficient Neural Network Pruning Based on Relative Importance, https://arxiv.org/abs/2410.16151
  • Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
  • M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
  • Bairu Hou, Qibin Chen, Jianyu Wang, Guoli Yin, Chong Wang, Nan Du, Ruoming Pang, Shiyu Chang, Tao Lei, 7 Jan 2025 (v2), Instruction-Following Pruning for Large Language Models, https://arxiv.org/abs/2501.02086
  • Mingkuan Feng, Jinyang Wu, Shuai Zhang, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, Jianhua Tao, Feihu Che, 29 Jan 2025, DReSS: Data-driven Regularized Structured Streamlining for Large Language Models, https://arxiv.org/abs/2501.17905 (Regularize before pruning, then fine-tune.)
  • Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer, 24 Jun 2025, Beyond Token Pruning: Operation Pruning in Vision-Language Models, https://arxiv.org/abs/2507.02909 https://github.com/zxcvfd13502/GSOP (Pruning of entire kernel operations.)
  • Bin Hong, Jiayu Liu, Zhenya Huang, Kai Zhang, Mengdi Zhang, 13 Aug 2025, Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization, https://arxiv.org/abs/2508.10164
  • Taibiao Zhao, Mingxuan Sun, Hao Wang, Xiaobing Chen, Xiangwei Zhou, 14 Aug 2025, Pruning and Malicious Injection: A Retraining-Free Backdoor Attack on Transformer Models, https://arxiv.org/abs/2508.10243
  • Eugen Barbulescu, Antonio Alexoaie and Lucian Busoniu, 12 Aug 2025, Hyperflux: Pruning Reveals the Importance of Weights, https://arxiv.org/abs/2504.05349
  • Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi, 14 Aug 2025, Data Pruning by Information Maximization, https://arxiv.org/abs/2506.01701
  • Jaeheun Jung, Jaehyuk Lee, Yeajin Lee, Donghun Lee, 22 Jul 2025, IPPRO: Importance-based Pruning with PRojective Offset for Magnitude-indifferent Structural Pruning, https://arxiv.org/abs/2507.14171
  • Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li, 24 Jul 2025, When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning, https://arxiv.org/abs/2503.07588
  • Jaeheun Jung, Donghun Lee, 10 Jul 2025, Catalyst: a Novel Regularizer for Structured Pruning with Auxiliary Extension of Parameter Space, https://arxiv.org/abs/2507.14170
  • Daniel Fein, Gabriela Aranguiz-Dias, 18 Jul 2025, Influence Functions for Preference Dataset Pruning, https://arxiv.org/abs/2507.14344
  • Yiding Song, 19 Jul 2025, Pruning Increases Orderedness in Recurrent Computation, https://arxiv.org/abs/2507.14747
  • Ganesh Sundaram, Jonas Ulmen, Amjad Haider, Daniel Görges, 20 Jul 2025, Application-Specific Component-Aware Structured Pruning of Deep Neural Networks via Soft Coefficient Optimization, https://arxiv.org/abs/2507.14882
  • Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen, 21 Jul 2025, EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent, https://arxiv.org/abs/2507.15428
  • Jaeseong Lee, seung-won hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He, 21 Jul 2025, STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning, https://arxiv.org/abs/2409.06211
  • Tingna Wang, Sikai Zhang, Mingming Song, Limin Sun, 21 Jul 2025, Dictionary-Learning-Based Data Pruning for System Identification, https://arxiv.org/abs/2502.11484
  • Ganesh Sundaram, Jonas Ulmen, Daniel Görges, 20 Jul 2025, Enhanced Pruning Strategy for Multi-Component Neural Architectures Using Component-Aware Graph Analysis, https://arxiv.org/abs/2504.13296
  • Dimitris Tsaras, Xing Li, Lei Chen, Zhiyao Xie, Mingxuan Yuan, 11 Aug 2025, ELF: Efficient Logic Synthesis by Pruning Redundancy in Refactoring, https://arxiv.org/abs/2508.08073
  • Ganesh Sundaram, Jonas Ulmen, Amjad Haider, Daniel Görges, 11 Aug 2025, COMponent-Aware Pruning for Accelerated Control Tasks in Latent Space Models, https://arxiv.org/abs/2508.08144
  • Enjun Du, Siyi Liu, Yongqi Zhang, 28 Jul 2025, Mixture of Length and Pruning Experts for Knowledge Graphs Reasoning, https://arxiv.org/abs/2507.20498
  • Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang, 28 Jul 2025, TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model, https://arxiv.org/abs/2507.20630
  • Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang, 29 Jul 2025, AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning, https://arxiv.org/abs/2412.03248
  • Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Xiaobao Wei, Sixiang Chen, Zhuo Li, Yang Wang, Liyun Li, Xianming Liu, Ming Lu, Shanghang Zhang, 31 Jul 2025, FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning, https://arxiv.org/abs/2507.23318
  • Kuan-Ting Tu, Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien, 30 Jul 2025, FGFP: A Fractional Gaussian Filter and Pruning for Deep Neural Networks Compression, https://arxiv.org/abs/2507.22527
  • Çağatay Demirel, 30 Jul 2025, RocketStack: Level-aware deep recursive ensemble learning framework with adaptive feature fusion and model pruning dynamics, https://arxiv.org/abs/2506.16965
  • Hongze Sun, Wuque Cai, Duo Chen, Shifeng Mao, Jiayi He, Zhenxing Wang, Dezhong Yao, Daqing Guo, 4 Aug 2025, Toward Efficient Spiking Transformers: Synapse Pruning Meets Synergistic Learning-Based Compensation, https://arxiv.org/abs/2508.01992
  • Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu, 4 Aug 2025, LeanK: Learnable K Cache Channel Pruning for Efficient Decoding, https://arxiv.org/abs/2508.02215
  • Chenqing Lin, Mostafa Hussien, Chengyao Yu, Mohamed Cheriet, Osama Abdelrahman, and Ruixing Ming, 4 Aug 2025, Flexible Automatic Identification and Removal (FAIR)-Pruner: An Efficient Neural Network Pruning Method, https://arxiv.org/abs/2508.02291
  • Zuxin Ma, Yunhe Cui, Yongbin Qin, 4 Aug 2025, Beyond Manually Designed Pruning Policies with Second-Level Performance Prediction: A Pruning Framework for LLMs, https://arxiv.org/abs/2508.02381
  • Taehan Lee, Hyukjun Lee, 3 Aug 2025, Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance, https://arxiv.org/abs/2504.01690
  • Jiaxi Li, Lu Yin, Xilu Wang, 4 Aug 2025, OWLed: Outlier-weighed Layerwise Pruning for Efficient Autonomous Driving Framework, https://arxiv.org/abs/2411.07711
  • Yiheng Liu, Junhao Ning, Sichen Xia, Xiaohui Gao, Ning Qiang, Bao Ge, Junwei Han, Xintao Hu, 7 Aug 2025, Pruning Large Language Models by Identifying and Preserving Functional Networks, https://arxiv.org/abs/2508.05239
  • Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, Xiaodong Gu, 8 Aug 2025, Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal, https://arxiv.org/abs/2508.05988
  • Jucheng Hu, Surong Yang, Lijun Wu, Dongzhan Zhou, 8 Aug 2025, DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning, https://arxiv.org/abs/2504.14810
  • Zhixuan Lin, Johan Obando-Ceron, Xu Owen He, Aaron Courville, 11 Aug 2025, Adaptive Computation Pruning for the Forgetting Transformer, https://arxiv.org/abs/2504.06949
  • Shangyu Wu, Hongchao Du, Ying Xiong, Shuai Chen, Tei-wei Kuo, Nan Guan, Chun Jason Xue, 12 Aug 2025, EvoP: Robust LLM Inference via Evolutionary Pruning, https://arxiv.org/abs/2502.14910
  • Gideon Vos, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi, 12 Aug 2025, Synaptic Pruning: A Biological Inspiration for Deep Learning Regularization, https://arxiv.org/abs/2508.09330
  • Omar Bazarbachi, Zijun Sun, Yanning Shen, 13 Aug 2025, EGGS-PTP: An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models, https://arxiv.org/abs/2508.09471
  • Devvrat Joshi and Islem Rekik, 13 Aug 2025, NEURAL: Attention-Guided Pruning for Unified Multimodal Resource-Constrained Clinical Evaluation, https://arxiv.org/abs/2508.09715
  • Bailey J. Eccles, Leon Wong, Blesson Varghese, 12 Aug 2025, Mosaic: Composite Projection Pruning for Resource-efficient LLMs, https://arxiv.org/abs/2504.06323
  • Aakash Kumar, Emanuele Natale, 14 Aug 2025, Quantization vs Pruning: Insights from the Strong Lottery Ticket Hypothesis, https://arxiv.org/abs/2508.11020
  • Manning Zhu, Songtao Guo, Pengzhan Zhou, Yansong Ning, Chang Han, Dewen Qiao, 18 Aug 2025, FedSODA: Federated Fine-tuning of LLMs via Similarity Group Pruning and Orchestrated Distillation Alignment, https://arxiv.org/abs/2508.12727
  • Ruijia Zhang, Xinyan Zhao, Ruixiang Wang, Sigen Chen, Guibin Zhang, An Zhang, Kun Wang and Qingsong Wen, 15 Aug 2025, SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication, https://arxiv.org/abs/2508.11733
  • Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Shao Tang, Sayan Ghosh, Xuanzhao Dong, Rajat Koner, Yalin Wang, 16 Aug 2025, EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models, https://arxiv.org/abs/2508.11886
  • Fabrizio Sandri, Elia Cunegatti, Giovanni Iacca, 17 Aug 2025, 2SSP: A Two-Stage Framework for Structured Pruning of LLMs, https://arxiv.org/abs/2501.17771
  • Mikołaj Janusz, Tomasz Wojnar, Yawei Li, Luca Benini, Kamil Adamczewski, 19 Aug 2025, One Shot vs. Iterative: Rethinking Pruning Strategies for Model Compression, https://arxiv.org/abs/2508.13836
  • Nicolò Romandini, Cristian Borcea, Rebecca Montanari, Luca Foschini, 19 Aug 2025, FedUP: Efficient Pruning-based Federated Unlearning for Model Poisoning Attacks, https://arxiv.org/abs/2508.13853
  • Shuang Ao, Gopal Rumchurn, 20 Aug 2025, S3LoRA: Safe Spectral Sharpness-Guided Pruning in Adaptation of Agent Planner, https://arxiv.org/abs/2508.15068
  • Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu, 21 Aug 2025, SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning, https://arxiv.org/abs/2508.15212
  • Samiul Basir Bhuiyan, Md. Sazzad Hossain Adib, Mohammed Aman Bhuiyan, Muhammad Rafsan Kabir, Moshiur Farazi, Shafin Rahman, Nabeel Mohammed, 18 Aug 2025, Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining, https://arxiv.org/abs/2508.15828
  • Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li, 22 Aug 2025, SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning, https://arxiv.org/abs/2508.16201
  • Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang, 24 Aug 2025, CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models, https://arxiv.org/abs/2508.17243
  • Po-Hsien Yu, Yu-Syuan Tseng, and Shao-Yi Chien, 24 Aug 2025, FedKLPR: Personalized Federated Learning for Person Re-Identification with Adaptive Pruning, https://arxiv.org/abs/2508.17431

Structured Pruning

Research papers on structured pruning:

  • Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle, 29 May 2024, STAT: Shrinking Transformers After Training, https://arxiv.org/abs/2406.00061
  • Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
  • Abigail See, Minh-Thang Luong, and Christopher D. Manning. Compression of neural machine translation models via pruning. In CoNLL, pages 291–301. ACL, 2016 https://arxiv.org/abs/1606.09274
  • Sharan Narang, Gregory F. Diamos, Shubho Sengupta, and Erich Elsen. Exploring sparsity in recurrent neural networks. CoRR, abs/1704.05119, 2017, https://arxiv.org/abs/1704.05119
  • Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1135–1143. Curran Associates, Inc., 2015, https://arxiv.org/abs/1506.02626
  • Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar, 9 Feb 2024 (v2), Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes, https://arxiv.org/abs/2402.05406 Code: https://github.com/ldery/Bonsai (Structured pruning of very large LLMs down to 1B or 2B.)
  • Zhepeng Wang, Isaacshubhanand Putla, Weiwen Jiang, Youzuo Lin, Oct 2023, Edge-InversionNet: Enabling Efficient Inference of InversionNet on Edge Devices, https://arxiv.org/abs/2310.09667 (Using structured pruning via layerwise filter pruning to run a model on a Raspberry Pi.)
  • H. Tann, S. Hashemi, R. I. Bahar, and S. Reda. Runtime configurable deep neural networks for energy-accuracy trade-off. In CODES + ISSS, pages 34:1–34:10, 2016. https://ieeexplore.ieee.org/document/9166549
  • David Spuler, March 2024, Chapter 46. Structured Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
  • Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 16 Jul 2024, MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models, https://arxiv.org/abs/2407.11681
  • Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
  • Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
  • Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Oshin Dutta, Ritvik Gupta, Sumeet Agarwal, 2024, Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference, https://openreview.net/pdf?id=cqhAzteLzc
  • Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
  • Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
  • Meta, August 14, 2024, How NVIDIA is using structured weight pruning and knowledge distillation to build new Llama models, Meta AI Blog, https://ai.meta.com/blog/nvidia-llama/
  • Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 19 Jul 2024, Compact Language Models via Pruning and Knowledge Distillation, https://arxiv.org/abs/2407.14679 https://github.com/NVlabs/Minitron (Combination of distillation and structured pruning on the depth and width dimensions.)
  • Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 26 Aug 2024 (v2), LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796
  • Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
  • Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
  • Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 21 Nov 2024, FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers, https://arxiv.org/abs/2411.14507
  • Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
  • M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
  • Haihang Wu, 9 Dec 2024, LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation, https://arxiv.org/abs/2412.06419
  • Shangqian Gao, Ting Hua, Reza Shirkavand, Chi-Heng Lin, Zhen Tang, Zhengao Li, Longge Yuan, Fangyi Li, Zeyu Zhang, Alireza Ganjdanesh, Lou Qian, Xu Jie, Yen-Chang Hsu, 25 Jan 2025, ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning, https://arxiv.org/abs/2501.15316
  • Fabrizio Sandri, Elia Cunegatti, Giovanni Iacca, 29 Jan 2025, 2SSP: A Two-Stage Framework for Structured Pruning of LLMs, https://arxiv.org/abs/2501.17771 https://github.com/FabrizioSandri/2SSP (Dual width-depth pruning of neurons and attention heads.)
  • Zihuai Xu, Yang Xu, Hongli Xu, Yunming Liao, Zhiwei Yao, Zuan Xie, 25 Jan 2025, Lightweight and Post-Training Structured Pruning for On-Device Large Lanaguage Models, https://arxiv.org/abs/2501.15255
  • Shaibal Saha, Lanyu Xu, 26 Feb 2025, Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies, https://arxiv.org/abs/2503.02891
  • Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang, 28 May 2025, SlimLLM: Accurate Structured Pruning for Large Language Models, https://arxiv.org/abs/2505.22689
  • Jiujun He, Huazhen Lin, 10 Jun 2025, Olica: Efficient Structured Pruning of Large Language Models without Retraining, https://arxiv.org/abs/2506.08436
  • Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata, 28 Jul 2025, Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study, https://arxiv.org/abs/2507.20749
  • Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer, 24 Jun 2025, Beyond Token Pruning: Operation Pruning in Vision-Language Models, https://arxiv.org/abs/2507.02909 https://github.com/zxcvfd13502/GSOP (Pruning of entire kernel operations.)
  • Jaeheun Jung, Donghun Lee, 10 Jul 2025, Catalyst: a Novel Regularizer for Structured Pruning with Auxiliary Extension of Parameter Space, https://arxiv.org/abs/2507.14170
  • Ganesh Sundaram, Jonas Ulmen, Amjad Haider, Daniel Görges, 20 Jul 2025, Application-Specific Component-Aware Structured Pruning of Deep Neural Networks via Soft Coefficient Optimization, https://arxiv.org/abs/2507.14882
  • Jaeseong Lee, seung-won hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He, 21 Jul 2025, STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning, https://arxiv.org/abs/2409.06211
  • Omar Bazarbachi, Zijun Sun, Yanning Shen, 13 Aug 2025, EGGS-PTP: An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models, https://arxiv.org/abs/2508.09471

Block Pruning

Block pruning is structured pruning at a very fine granularity, at the vector level or below. A "block" is a small chunk of a tensor: either a sub-sequence of a vector or a small rectangular sub-matrix (also called a "tile"). Whole blocks of weights are pruned together, rather than individual weights.
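As an illustration only, here is a minimal sketch of tile-level pruning in plain Python. The function names (`block_prune`, `block_sparse_matvec`), the mean-absolute-value score, and the fixed threshold are all assumptions made for this example; they are not taken from any of the papers listed below, which use their own importance criteria.

```python
def block_prune(matrix, block_rows, block_cols, threshold):
    """Zero every tile whose mean absolute weight is below `threshold`.

    Returns (pruned, kept) where kept[i][j] is True if tile (i, j) survived.
    The scoring rule is a hypothetical choice for illustration.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    pruned = [row[:] for row in matrix]  # copy so the input is left untouched
    kept = []
    for r0 in range(0, n_rows, block_rows):
        kept_row = []
        for c0 in range(0, n_cols, block_cols):
            rows = range(r0, min(r0 + block_rows, n_rows))
            cols = range(c0, min(c0 + block_cols, n_cols))
            total = sum(abs(matrix[r][c]) for r in rows for c in cols)
            keep = total / (len(rows) * len(cols)) >= threshold
            kept_row.append(keep)
            if not keep:
                for r in rows:
                    for c in cols:
                        pruned[r][c] = 0.0
        kept.append(kept_row)
    return pruned, kept


def block_sparse_matvec(pruned, kept, x, block_rows, block_cols):
    """Matrix-vector product that skips pruned tiles entirely."""
    n_rows, n_cols = len(pruned), len(pruned[0])
    y = [0.0] * n_rows
    for bi, r0 in enumerate(range(0, n_rows, block_rows)):
        for bj, c0 in enumerate(range(0, n_cols, block_cols)):
            if not kept[bi][bj]:
                continue  # no multiplications are done for a pruned tile
            for r in range(r0, min(r0 + block_rows, n_rows)):
                for c in range(c0, min(c0 + block_cols, n_cols)):
                    y[r] += pruned[r][c] * x[c]
    return y
```

The point of keeping the `kept` mask is that the inner loops for a pruned tile never run. This is the source of the speedup in tiled MatMul/GEMM kernels, as opposed to unstructured pruning, where zeros are scattered and harder to skip efficiently.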

Research papers on block pruning:

  • Shikhar Tuli, Niraj K. Jha, EdgeTran: Co-designing Transformers for Efficient Inference on Mobile Edge Platforms, arXiv preprint arXiv:2303.13745, 2023, https://arxiv.org/abs/2303.13745
  • Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. CVPR, pages 8817–8826, 2018. https://arxiv.org/abs/1711.08393 Code: https://github.com/Tushar-N/blockdrop
  • Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similarity in model layers.)
  • Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
  • Anonymous Author, 2024, A deeper look at depth pruning of LLMs, https://openreview.net/pdf?id=9B7ayWclwN
  • Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
  • Yang Zhang, Yawei Li, Xinpeng Wang, Qianli Shen, Barbara Plank, Bernd Bischl, Mina Rezaei, Kenji Kawaguchi, 28 May 2024, FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models, https://arxiv.org/abs/2405.18218
  • Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
  • Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie, 19 Sep 2024, CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs, https://arxiv.org/abs/2409.12490
  • Jinming Lou, Wenyang Luo, Yufan Liu, Bing Li, Xinmiao Ding, Weiming Hu, Jiajiong Cao, Yuming Li, Chenguang Ma, 27 Sep 2024, Token Caching for Diffusion Transformer Acceleration, https://arxiv.org/abs/2409.18523
  • Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 21 Nov 2024, FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers, https://arxiv.org/abs/2411.14507
  • Haihang Wu, 9 Dec 2024, LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation, https://arxiv.org/abs/2412.06419
  • Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
  • Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, Song Han, 30 Jan 2025, SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer, https://arxiv.org/abs/2501.18427 (Diffusion model optimization using block-level depth pruning and inference-time scaling.)
  • Juyun Wee, Minjae Park, Jaeho Lee, 4 Feb 2025, Prompt-based Depth Pruning of Large Language Models, https://arxiv.org/abs/2502.04348
  • Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han, 20 Feb 2025, LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention, https://arxiv.org/abs/2502.14866

Vector Pruning

Vector pruning is structured pruning at a very fine granularity, where an entire vector (such as a row or column of a weight matrix) is removed at once. Block pruning is slightly finer-grained, since a block can be a sub-portion of a vector.
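A minimal sketch of the idea, assuming a simple L2-norm ranking of rows (an illustrative criterion, not the method of any specific paper below). The helper names `prune_rows` and `drop_columns` are hypothetical. Removing a row from one weight matrix shrinks that layer's output, so the matching input columns of the next matrix must be removed too.

```python
import math


def prune_rows(weights, keep_count):
    """Keep the `keep_count` rows with the largest L2 norm and drop the rest.

    Returns (smaller_matrix, kept_indices); original row order is preserved.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep_count])
    return [weights[i][:] for i in kept], kept


def drop_columns(weights, kept):
    """Remove the matching input columns from the next layer's matrix."""
    return [[row[j] for j in kept] for row in weights]
```

Unlike zeroing weights in place, this produces physically smaller dense matrices, so ordinary dense kernels run faster without any sparse-format support.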

Research papers on vector pruning:

  • David Spuler, March 2024, Chapter 46. Structured Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)
  • Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similarity in model layers.)
  • Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)

Dynamic Pruning

Dynamic pruning refers to pruning of network weights, links, or entire layers at runtime during inference. This differs from "static pruning" that is done offline during training, or in a post-training optimization to create a modified model. The types of dynamic pruning may include:

Note that all types of dynamic pruning incur some extra inference cost in the calculations that decide whether to prune or not. The hope is that the benefit of pruning will exceed the cost of the decision logic. For example, applying an "early exit" criterion for layer pruning requires extra computation at each layer, which is hopefully recouped by skipping layers often enough.
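The trade-off can be seen in a toy sketch of early exit. This is a simplified assumption, not a real Transformer: each "layer" is a hypothetical Python callable that maps logits to logits, and the exit test is a softmax confidence threshold. Production early-exit methods typically attach a separate classifier head to the hidden state instead.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_with_early_exit(layers, x, threshold):
    """Run layers in order; stop once the top probability reaches `threshold`.

    Returns (output, layers_run, checks_done). Every check is extra work
    that is only repaid if it lets later layers be skipped.
    """
    checks = 0
    for depth, layer in enumerate(layers, start=1):
        x = layer(x)
        checks += 1  # decision cost: one softmax and one max per layer
        if max(softmax(x)) >= threshold:
            return x, depth, checks
    return x, len(layers), checks
```

If the threshold is rarely reached, the model runs every layer and also pays for every check, which is strictly slower than no pruning at all. The technique pays off only when exits happen early and often enough.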

Research Papers on Dynamic Pruning

Some general research on dynamic or adaptive pruning methods:

Combined Dynamic Width/Depth Pruning (Dual Pruning)

It's possible to simultaneously prune the depth (e.g. layer pruning, early-exit) and width (i.e., width pruning, channel pruning, token pruning, slimming networks, etc.). This is also called "hybrid pruning" or "dual pruning". See more research papers on dual pruning and triple pruning.
