Aussie AI

Channel Pruning

Last Updated 25 April, 2026

by David Spuler, Ph.D.

What is Channel Pruning?

Channel pruning is a type of model inference optimization that reduces computation along the width dimension of a model by removing entire channels, such as whole convolutional filters together with the feature maps they produce. It is primarily used with CNNs, and is analogous to attention head pruning in Transformer architectures.
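As a minimal sketch of the idea, the code ranks the output channels of one convolution layer by the L1 norm of each filter, a common magnitude-based criterion. It is only one of several scoring methods; the research papers listed on this page use others. The `ConvLayer` struct and the functions `filter_l1_norms`, `select_channels`, `prune_outputs`, and `prune_inputs` are hypothetical illustrations that assume a simple row-major weight layout. They are not any framework's API. Removing a channel means dropping the whole filter (and its bias) from one layer and the matching input channel from the next layer. This is what makes the savings structured: the resulting layers are simply narrower dense layers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical minimal convolution layer (not from any library).
// Weights are stored row-major as [out_channel][in_channel][kernel_area].
struct ConvLayer {
    std::size_t in_channels;
    std::size_t out_channels;
    std::size_t kernel_area;     // kernel height * kernel width
    std::vector<float> weights;  // out_channels * in_channels * kernel_area
    std::vector<float> bias;     // one bias per output channel

    std::size_t index(std::size_t o, std::size_t i, std::size_t k) const {
        return (o * in_channels + i) * kernel_area + k;
    }
};

// Importance score per output channel: L1 norm of that channel's filter weights.
std::vector<float> filter_l1_norms(const ConvLayer& layer) {
    std::vector<float> norms(layer.out_channels, 0.0f);
    for (std::size_t o = 0; o < layer.out_channels; ++o)
        for (std::size_t i = 0; i < layer.in_channels; ++i)
            for (std::size_t k = 0; k < layer.kernel_area; ++k)
                norms[o] += std::fabs(layer.weights[layer.index(o, i, k)]);
    return norms;
}

// Pick the `keep` highest-scoring output channels, returned in ascending index order.
std::vector<std::size_t> select_channels(const ConvLayer& layer, std::size_t keep) {
    assert(keep <= layer.out_channels);
    std::vector<float> norms = filter_l1_norms(layer);
    std::vector<std::size_t> order(norms.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return norms[a] > norms[b]; });
    order.resize(keep);
    std::sort(order.begin(), order.end());
    return order;
}

// Copy only the kept output channels (whole filters plus biases) into a narrower layer.
ConvLayer prune_outputs(const ConvLayer& layer, const std::vector<std::size_t>& kept) {
    ConvLayer out{layer.in_channels, kept.size(), layer.kernel_area, {}, {}};
    for (std::size_t o : kept) {
        for (std::size_t i = 0; i < layer.in_channels; ++i)
            for (std::size_t k = 0; k < layer.kernel_area; ++k)
                out.weights.push_back(layer.weights[layer.index(o, i, k)]);
        out.bias.push_back(layer.bias[o]);
    }
    return out;
}

// The next layer reads the pruned feature maps, so drop its matching input channels.
ConvLayer prune_inputs(const ConvLayer& layer, const std::vector<std::size_t>& kept) {
    ConvLayer out{kept.size(), layer.out_channels, layer.kernel_area, {}, layer.bias};
    for (std::size_t o = 0; o < layer.out_channels; ++o)
        for (std::size_t i : kept)
            for (std::size_t k = 0; k < layer.kernel_area; ++k)
                out.weights.push_back(layer.weights[layer.index(o, i, k)]);
    return out;
}
```

For example, consider a layer with three single-weight filters whose L1 norms are 0.1, 2.0, and 0.5. Keeping two channels retains channels 1 and 2, and the following layer keeps only its input weights for those same two channels. A real implementation would also adjust any batch-normalization parameters attached to the pruned channels. Accuracy is usually recovered by fine-tuning after pruning.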

Channel Pruning: Book Excerpts and Blog Articles

Free book excerpts with full-text chapters online and free PDF downloads, plus related articles from the Aussie AI blog:

David Spuler, March 2024, Chapter 48. Width Pruning, in book "Generative AI in C++", https://www.aussieai.com/book/ch48-width-pruning
David Spuler, March 2024, Generative AI in C++: Coding Transformers and LLMs, https://www.aussieai.com/book/toc PDF: https://www.aussieai.com/pdf/BOOK-Generative-AI-CPP-Spuler-2024.pdf

Research on Channel Pruning

Research papers on channel pruning include:

M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques, with a strong focus on image and video processing optimizations for LLMs.)
Xitong Gao, Yiren Zhao, Lukasz Dudziak, Robert D. Mullins, and Cheng-Zhong Xu. 2019. Dynamic Channel Pruning: Feature Boosting and Suppression. Proceedings of the 7th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1810.05331
Mogaka, O.M., Zewail, R., Inoue, K. et al. TinyEmergencyNet: a hardware-friendly ultra-lightweight deep learning model for aerial scene image classification. J Real-Time Image Proc 21, 51 (2024). https://doi.org/10.1007/s11554-024-01430-y https://link.springer.com/article/10.1007/s11554-024-01430-y#citeas (Use of both power-of-two quantization and channel pruning for fast image analysis.)
Ji Liu, Dehua Tang, Yuanxian Huang, Li Zhang, Xiaocheng Zeng, Dong Li, Mingjie Lu, Jinzhang Peng, Yu Wang, Fan Jiang, Lu Tian, Ashish Sirasao, 12 Jan 2024, UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer, https://arxiv.org/abs/2401.06426 (Block pruning strategy gives a type of depth pruning.)
LOCP: Latency-optimized channel pruning for CNN inference acceleration on GPUs 2023, Journal of Supercomputing https://doi.org/10.1007/s11227-023-05212-4
LRP-based network pruning and policy distillation of robust and non-robust DRL agents for embedded systems 2023, Concurrency and Computation: Practice and Experience https://doi.org/10.1002/cpe.7351
David Spuler, March 2024, Chapter 48. Width Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Y Li, K Adamczewski, W Li, S Gu, 2022, Revisiting random channel pruning for neural network compression, http://openaccess.thecvf.com/content/CVPR2022/html/Li_Revisiting_Random_Channel_Pruning_for_Neural_Network_Compression_CVPR_2022_paper.html
Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications. In ECCV, 2018, https://arxiv.org/abs/1804.03230
Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)
Bejnordi, B.E., Blankevoort, T., Welling, M.: Batch-shaping for learning conditional channel gated networks. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=Bke89JBtvB
Gao, X., Zhao, Y., Dudziak, Ł., Mullins, R., Xu, C.-Z: Dynamic channel pruning: Feature boosting and suppression. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=BJxh2j0qYm
Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
Xiaotong Luo, Zekun Ai, Qiuyuan Liang, Yuan Xie, 06 August 2024, EdgeFormer: Edge-aware Efficient Transformer for Image Super-resolution, IEEE Transactions on Instrumentation and Measurement (Early Access), DOI: 10.1109/TIM.2024.3436070, https://ieeexplore.ieee.org/abstract/document/10623619 https://github.com/xiaotongtt/EdgeFormer
Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li, 2 Sep 2024, CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification, https://arxiv.org/abs/2409.01366
Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang, 16 Sep 2024, CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios, https://arxiv.org/abs/2409.10593 (KV cache compression on the "channel" or "width" dimension.)
Xia, Wenhan, Sep 2024, Methods for Efficient and Scalable Deep Learning, Ph.D. Thesis, Electrical and Computer Engineering Department, Princeton University, http://arks.princeton.edu/ark:/88435/dsp015q47rs12x (Covers PEFT/LoRA on training, and dual pruning with layer skipping and channel/width pruning for inference.)
Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu, 4 Aug 2025, LeanK: Learnable K Cache Channel Pruning for Efficient Decoding, https://arxiv.org/abs/2508.02215
Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu, 21 Aug 2025, SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning, https://arxiv.org/abs/2508.15212
Ahmed Sadaqa, Di Liu, 10 Sep 2025, Compressing CNN models for resource-constrained systems by channel and layer pruning, https://arxiv.org/abs/2509.08714
Jintao Guo, Lei Qi, Yinghuan Shi, Yang Gao, 17 Sep 2023 (v2), PLACE dropout: A Progressive Layer-wise and Channel-wise Dropout for Domain Generalization, https://arxiv.org/abs/2112.03676 https://github.com/lingeringlight/PLACEdropout