Aussie AI

Sparsity Optimizations in LLMs

  • Last Updated 29 August, 2025
  • by David Spuler, Ph.D.

What is Sparsity?

Sparse matrices are model weight matrices that contain a large proportion of zero values. Various techniques are used to avoid performing multiplications involving these zero values, thereby creating a more efficient model.

Sparsity of the weight matrices is called "static sparsity" because the zeroed weights do not change at runtime. By contrast, sparsity of the activations is called "dynamic sparsity" because it changes depending on the input.

Static Sparsification

Various techniques can be used to "sparsify" a matrix, so that the model becomes more sparse. Sparsification applied after training is a form of model compression, but sparsity can also be introduced during training.

The simplest sparsification technique is "magnitude pruning," whereby small near-zero values are converted to zero. The result is a model with more efficient inference, at the cost of some accuracy. Another common technique is top-K pruning, which keeps only the K largest-magnitude weights; there are many other sparsification techniques.
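As a minimal illustration of these two techniques, here is a NumPy sketch of magnitude pruning and top-K pruning of a weight matrix. The threshold and K values are arbitrary illustrative choices, and this is a generic demonstration rather than any particular library's implementation.

    import numpy as np

    def magnitude_prune(weights: np.ndarray, threshold: float) -> np.ndarray:
        """Zero out all weights whose absolute value is below the threshold."""
        pruned = weights.copy()
        pruned[np.abs(pruned) < threshold] = 0.0
        return pruned

    def topk_prune(weights: np.ndarray, k: int) -> np.ndarray:
        """Keep only the k largest-magnitude weights; zero out the rest."""
        pruned = weights.copy()
        if k < pruned.size:
            flat = np.abs(pruned).ravel()
            cutoff = np.partition(flat, flat.size - k)[flat.size - k]  # k-th largest magnitude
            pruned[np.abs(pruned) < cutoff] = 0.0
        return pruned

    # Illustrative usage on a random weight matrix.
    W = np.random.randn(512, 512).astype(np.float32)
    W_mag = magnitude_prune(W, threshold=0.5)
    W_top = topk_prune(W, k=W.size // 10)   # keep roughly 10% of the weights
    print(f"Magnitude pruning sparsity: {1 - np.count_nonzero(W_mag) / W.size:.1%}")
    print(f"Top-K pruning sparsity:     {1 - np.count_nonzero(W_top) / W.size:.1%}")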

If a matrix has enough zeros in it, the odds are high that some rows and/or columns are all zeros (or near-zeros). In such cases, a smaller-dimension matrix can replace the full matrix without much loss of accuracy. This related optimization technique is called "low-rank matrices" (or low-rank factorization).
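As a rough sketch of the low-rank idea, the code below factors a weight matrix into two thin matrices using truncated SVD. The rank of 64 is an arbitrary illustrative value, and a real weight matrix needs enough near-zero singular values for the approximation to preserve accuracy.

    import numpy as np

    def low_rank_factorize(W: np.ndarray, rank: int):
        """Approximate W (m x n) as A @ B, with A (m x rank) and B (rank x n), via truncated SVD."""
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :rank] * S[:rank]   # fold the singular values into the left factor
        B = Vt[:rank, :]
        return A, B

    W = np.random.randn(1024, 1024).astype(np.float32)
    A, B = low_rank_factorize(W, rank=64)

    # Applying the factored layer costs about 2 * 1024 * 64 multiplications per input
    # vector, versus 1024 * 1024 for the original dense matrix.
    x = np.random.randn(1024).astype(np.float32)
    y_approx = A @ (B @ x)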

Dynamic Sparsification

Dynamic sparsification creates zeros on the fly during inference, rather than zeroing weights statically in the model file. There are several ways to do this dynamically:

  • Activation sparsity
  • Dynamic structural pruning (i.e., when pruning is used in adaptive inference techniques)

Pruning and Sparsity

There is a close association and much overlap between sparsification and pruning. After all, the effect of pruning is to set weights to zero, and doing enough of this produces sparsity. Hence, pruning lots of weights is what sparsification is about.

Static pruning is the removal of weights from the model files. Magnitude pruning is unstructured pruning, which removes weights regardless of where they sit in any structure. Static structured pruning involves removing whole structures, such as static layer pruning (removing entire layers of weights).

Dynamic pruning is an adaptive inference optimization applied at runtime. Dynamic unstructured pruning is not very useful, but there are many types of dynamic structured pruning. In fact, pruning can be applied along four dimensions of the model: depth (layers), width (attention heads and neurons), length (tokens), and the internal embedding dimension.

KV Caching and Sparsity

There are analogous sparsification optimizations for KV cache data. Research has shown that the K and V vectors are often sparse, because the underlying attention computation is usually sparse. Hence, KV sparsification can be a good way to reduce the in-memory size of the KV cache, and thereby its memory and computation cost, for faster inference. Read more about these KV cache research areas below; a rough sketch of score-based KV eviction comes first.
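The sketch below illustrates one simple flavor of KV cache sparsification: evicting the cached K/V entries of tokens that have received little accumulated attention. It is a generic, hedged illustration (not the method of any particular paper), and the keep ratio is an arbitrary value.

    import numpy as np

    def evict_kv_by_score(K: np.ndarray, V: np.ndarray,
                          attn_scores: np.ndarray, keep_ratio: float = 0.5):
        """Keep only the cached tokens with the highest accumulated attention scores.

        K, V:        [seq_len, head_dim] cached key/value vectors for one head.
        attn_scores: [seq_len] accumulated attention weight each cached token has received.
        keep_ratio:  fraction of tokens to retain (illustrative value).
        """
        seq_len = K.shape[0]
        keep = max(1, int(seq_len * keep_ratio))
        kept_idx = np.sort(np.argsort(attn_scores)[-keep:])  # highest-scoring tokens, kept in order
        return K[kept_idx], V[kept_idx], kept_idx

    # Illustrative usage with random data.
    seq_len, head_dim = 1024, 64
    K = np.random.randn(seq_len, head_dim).astype(np.float32)
    V = np.random.randn(seq_len, head_dim).astype(np.float32)
    scores = np.random.rand(seq_len).astype(np.float32)
    K_small, V_small, kept = evict_kv_by_score(K, V, scores, keep_ratio=0.25)
    print(K_small.shape)  # (256, 64)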

Research on KV sparsity:

Sparse Attention

The attention computations are core to Transformer inference, and research has shown that they are often sparse. Hence, there is much research on "sparse attention" optimizations.

Sparse attention is somewhat related to fully deactivating attention heads (or neurons), as in attention head pruning and other types of pruning on the width dimension; see width pruning.
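One simple form of sparse attention is to let each query attend only to its top-k highest-scoring keys. The NumPy sketch below shows the math of this idea; it is a hedged, generic illustration rather than any listed paper's algorithm, and because it still computes the full score matrix it demonstrates the approximation, not the speedup (efficient kernels avoid computing or storing the masked-out entries).

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def topk_sparse_attention(Q, K, V, k=32):
        """Attention where each query attends only to its k highest-scoring keys.

        Q: [n_q, d], K: [n_kv, d], V: [n_kv, d]. Returns [n_q, d].
        """
        d = Q.shape[-1]
        scores = (Q @ K.T) / np.sqrt(d)                      # [n_q, n_kv]
        if k < scores.shape[-1]:
            # Mask out everything below each row's k-th largest score.
            kth = np.partition(scores, -k, axis=-1)[:, -k][:, None]
            scores = np.where(scores >= kth, scores, -np.inf)
        return softmax(scores, axis=-1) @ V

    # Illustrative usage.
    Q = np.random.randn(8, 64).astype(np.float32)
    K = np.random.randn(1024, 64).astype(np.float32)
    V = np.random.randn(1024, 64).astype(np.float32)
    out = topk_sparse_attention(Q, K, V, k=32)
    print(out.shape)  # (8, 64)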

Research papers on sparse attention:

  • Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele, 4 Jun 2024, Loki: Low-Rank Keys for Efficient Sparse Attention, https://arxiv.org/abs/2406.02542 (Sparsification of the KV cache values with a focus on the key vectors.)
  • Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
  • Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
  • Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin, 9 Apr 2024, LIPT: Latency-aware Image Processing Transformer, https://arxiv.org/abs/2404.06075
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • M Pagliardini, D Paliotta, M Jaggi, F Fleuret, 2023, Fast Attention Over Long Sequences With Dynamic Sparse Flash Attention, https://openreview.net/pdf?id=UINHuKeWUa
  • Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
  • Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, xFormers: A modular and hackable Transformer modelling library, 2022, Facebook Research, Code: https://github.com/facebookresearch/xformers
  • Lilian Weng, January 10, 2023, Large Transformer Model Inference Optimization, https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
  • Iz Beltagy, Matthew E. Peters, Arman Cohan, Dec 2020, Longformer: The Long-Document Transformer, arXiv preprint arXiv:2004.05150 (2020). https://arxiv.org/abs/2004.05150
  • Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509
  • 3 Feb 2024, Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models, Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi, https://arxiv.org/abs/2402.02244 (A survey of processing long context length using methods such as positional encoding and approximate attention including Softmax-free attention.)
  • S Dai, H Genc, R Venkatesan, B Khailany, 2023 Efficient Transformer Inference with Statically Structured Sparse Attention, https://ieeexplore.ieee.org/abstract/document/10247993
  • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
  • Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
  • Heejun Lee, Geon Park, Youngwan Lee, Jina Kim, Wonyoung Jeong, Myeongjae Jeon, Sung Ju Hwang, 14 Jun 2024, HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning, https://arxiv.org/abs/2406.09827 (Sparse attention using the top-k features and a tree-based structure.)
  • Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
  • Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
  • Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
  • Bokyeong Yoon; Ah-Hyun Lee; Jinsung Kim; Gordon Euhyun Moon, 9 July 2024, Exploring Attention Sparsity to Accelerate Transformer Training on GPUs, IEEE Access (Early Access), DOI: 10.1109/ACCESS.2024.3425638, https://ieeexplore.ieee.org/document/10589623
  • Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Minh Lenhat, Viet Anh Nguyen, Khoa Nguyen, Duong Duc Hieu, Dao Huu Hung, Truong Son Hy, 10 Aug 2024, SAMSA: Efficient Transformer for Many Data Modalities, https://arxiv.org/abs/2408.05391 https://github.com/HySonLab/SAMSA
  • Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng, 11 Aug 2024, Post-Training Sparse Attention with Double Sparsity, https://arxiv.org/abs/2408.07092 Code: https://github.com/andy-yang-1/DoubleSparse (Combined token-level sparse attention with reduced KV data accesses.)
  • Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr, July 2024, SparQ Attention: Bandwidth-Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42558-42583, 2024, https://proceedings.mlr.press/v235/ribar24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/ribar24a/ribar24a.pdf
  • Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
  • Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang, 21 Feb 2024, Do Efficient Transformers Really Save Computation? https://arxiv.org/abs/2402.13934
  • Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021. https://arxiv.org/abs/2110.15343 (Attention optimization using both sparse attention and low-rank matrix attention.)
  • Agniv Sharma, Jonas Geiping, 24 Sep 2024 (v2), Efficiently Dispatching Flash Attention For Partially Filled Attention Masks, https://arxiv.org/abs/2409.15097 (Optimizing Flash attention for sparse attention data.)
  • Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov, 19 Feb 2025, RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression, https://arxiv.org/abs/2502.14051 (Combines KV token pruning with sparse attention algorithms.)
  • Nicolas Lapautre, Maria Marchenko, Carlos Miguel Patiño, Xin Zhou, 14 Aug 2025, Natively Trainable Sparse Attention for Hierarchical Point Cloud Datasets, https://arxiv.org/abs/2508.10758
  • Youping Gu, Xiaolong Li, Yuhao Hu, Bohan Zhuang, 14 Aug 2025, Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation, https://arxiv.org/abs/2508.10774
  • Luyi Ma, Wanjia Zhang, Kai Zhao, Abhishek Kulkarni, Lalitesh Morishetti, Anjana Ganesh, Ashish Ranjan, Aashika Padmanabhan, Jianpeng Xu, Jason Cho, Praveen Kanumala, Kaushiki Nag, Sumit Dutta, Kamiya Motwani, Malay Patel, Evren Korpeoglu, Sushant Kumar, Kannan Achan, 19 Jul 2025, GRACE: Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization, https://arxiv.org/abs/2507.14758
  • Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan, Yiwei Chen, Zhihao Jia, Ravi Netravali, 9 Aug 2025, Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning, https://arxiv.org/abs/2508.07101
  • Jingze Shi and Yifan Wu and Bingheng Wu and Yiran Peng and Liangdong Wang and Guang Liu and Yuyu Luo, 4 Aug 2025, Trainable Dynamic Mask Sparse Attention, https://arxiv.org/abs/2508.02124
  • Minsuk Jang, Changick Kim, 3 Aug 2025, SPARTA: Advancing Sparse Attention in Spiking Neural Networks via Spike-Timing-Based Prioritization, https://arxiv.org/abs/2508.01646
  • Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim, 12 Aug 2025, Retrospective Sparse Attention for Efficient Long-Context Generation, https://arxiv.org/abs/2508.09001
  • Ziyi Cao, Qingyi Si, Jingbin Zhang, Bingquan Liu, 6 Aug 2025, Sparse Attention across Multiple-context KV Cache, https://arxiv.org/abs/2508.11661
  • Zhihao Zhan, Jianan Zhao, Zhaocheng Zhu, Jian Tang, 16 Aug 2025, Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention, https://arxiv.org/abs/2507.00449
  • Siddharth Chaudhary, Bennett Browning, 20 Aug 2025, Hydra: A 1.6B-Parameter State-Space Language Model with Sparse Attention, Mixture-of-Experts, and Memory, https://arxiv.org/abs/2508.15099
  • Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu, 22 Aug 2025, Dynamic Sparse Attention on Mobile SoCs, https://arxiv.org/abs/2508.16703
  • Ran Yan, Youhe Jiang, Binhang Yuan, 25 Aug 2025, Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel, https://arxiv.org/abs/2508.18224
  • Chuang Chen and Xiaolin Qin and Jing Hu and Wenyi Ge, 23 Jul 2025, SRMambaV2: Biomimetic Attention for Sparse Point Cloud Upsampling in Autonomous Driving, https://arxiv.org/abs/2507.17479
  • Yuqi Pan, Yongqi An, Zheng Li, Yuhong Chou, Ruijie Zhu, Xiaohui Wang, Mingxuan Wang, Jinqiao Wang and Guoqi Li, 22 Jul 2025, Scaling Linear Attention with Sparse State Expansion, https://arxiv.org/abs/2507.16577
  • Edward McDugald, Arvind Mohan, Darren Engwirda, Agnese Marcato, and Javier Santos, 19 Jul 2025, Attention-Based Reconstruction of Full-Field Tsunami Waves from Sparse Tsunameter Networks, https://arxiv.org/abs/2411.12948
  • Kabir Khan, Priya Sharma, Arjun Mehta, Neha Gupta and Ravi Narayanan, 10 Aug 2025, DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention, https://arxiv.org/abs/2508.07185
  • Leon Dimitrov, 24 Aug 2025, Scaling Graph Transformers: A Comparative Study of Sparse and Dense Attention, https://arxiv.org/abs/2508.17175

Feed-Forward Network Sparsity

FFN sparsity restricts sparsification to the feed-forward network (FFN) modules within each model layer. There is a close relationship between FFN sparsity and FFN pruning optimizations.
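Much of the benefit comes from the intermediate activation vector, which can contain many exact zeros after a ReLU-style nonlinearity; the rows of the second FFN weight matrix that correspond to zero activations contribute nothing and can be skipped. The sketch below assumes a plain two-layer ReLU FFN and is only illustrative; note that NumPy fancy indexing copies data, so real speedups require kernels that operate on the sparse activations directly.

    import numpy as np

    def sparse_ffn(x, W1, b1, W2, b2):
        """Two-layer ReLU FFN that skips the second matmul's work for zero activations.

        x: [d_model], W1: [d_model, d_ff], W2: [d_ff, d_model].
        """
        h = np.maximum(x @ W1 + b1, 0.0)   # ReLU produces many exact zeros
        active = np.nonzero(h)[0]          # indices of the non-zero intermediate neurons
        # Only the rows of W2 for active neurons contribute to the output.
        return h[active] @ W2[active, :] + b2

    # Illustrative usage with small random weights.
    d_model, d_ff = 512, 2048
    x = np.random.randn(d_model).astype(np.float32)
    W1 = np.random.randn(d_model, d_ff).astype(np.float32) * 0.02
    b1 = np.zeros(d_ff, dtype=np.float32)
    W2 = np.random.randn(d_ff, d_model).astype(np.float32) * 0.02
    b2 = np.zeros(d_model, dtype=np.float32)
    y = sparse_ffn(x, W1, b1, W2, b2)
    print(y.shape)  # (512,)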

Research on FFN sparsity:

Activation Sparsity

Activation sparsity refers to sparsity in the "activations" computed dynamically during inference. It is a particular type of "dynamic sparsity" optimization (other types are optimizations that dynamically remove model data, such as dynamic structural pruning).

There is a close relationship between "activation sparsity" and pruning along the same dimension; see embedding-dimension pruning optimizations.
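When the activation function does not produce exact zeros (e.g., GELU or SiLU rather than ReLU), activation sparsity can still be induced by zeroing small-magnitude activations at runtime. Here is a minimal, hedged sketch of that thresholding idea; the threshold is an arbitrary illustrative value, not a recommendation from any of the papers below.

    import numpy as np

    def silu(x):
        return x / (1.0 + np.exp(-x))

    def threshold_activations(h, tau=0.1):
        """Induce activation sparsity by zeroing small-magnitude activations."""
        return np.where(np.abs(h) < tau, 0.0, h)

    # Illustrative usage: SiLU outputs are rarely exactly zero until thresholded.
    h = silu(np.random.randn(4096).astype(np.float32))
    h_sparse = threshold_activations(h, tau=0.1)
    print(f"Induced sparsity: {1.0 - np.count_nonzero(h_sparse) / h_sparse.size:.1%}")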

Research on activation sparsity:

  • Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Computes a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference on the width dimension.)
  • Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun, 27 Feb 2024 (v2), ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models, https://arxiv.org/abs/2402.13516 (Increases activation sparsity by using RELU and other techniques.)
  • Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
  • Pranjal Awasthi, Nishanth Dikkala, Pritish Kamath, Raghu Meka, 26 Jun 2024, Learning Neural Networks with Sparse Activations, https://arxiv.org/abs/2406.17989
  • James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun, 26 Aug 2024, Training-Free Activation Sparsity in Large Language Models, https://arxiv.org/abs/2408.14690
  • Cody Wild, Jesper Anderson, 10 Jul 2024, Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers, https://arxiv.org/abs/2407.07848
  • Xiaolong Yu, Cong Tian, 30 May 2024, Dual sparse training framework: inducing activation map sparsity via Transformed ℓ1 regularization, https://arxiv.org/abs/2405.19652
  • Rongyu Zhang, Aosong Cheng, Yulin Luo, Gaole Dai, Huanrui Yang, Jiaming Liu, Ran Xu, Li Du, Yuan Du, Yanbing Jiang, Shanghang Zhang, 26 May 2024, Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation, https://arxiv.org/abs/2405.16486 https://github.com/RoyZry98/MoASE-Pytorch
  • Rishav Mukherji, Mark Schöne, Khaleelulla Khan Nazeer, Christian Mayr, David Kappel, Anand Subramoney, 1 May 2024, Weight Sparsity Complements Activity Sparsity in Neuromorphic Language Models, https://arxiv.org/abs/2405.00433
  • Andreas Müller, Erwin Quiring, 27 Mar 2024, The Impact of Uniform Inputs on Activation Sparsity and Energy-Latency Attacks in Computer Vision, https://arxiv.org/abs/2403.18587
  • Ilan Price, Nicholas Daultry Ball, Samuel C.H. Lam, Adam C. Jones, Jared Tanner, 25 Feb 2024, Deep Neural Network Initialization with Sparsity Inducing Activations, https://arxiv.org/abs/2402.16184
  • Rishav Mukherji, Mark Schöne, Khaleelulla Khan Nazeer, Christian Mayr, Anand Subramoney, 7 Dec 2023 (v2), Activity Sparsity Complements Weight Sparsity for Efficient RNN Inference, https://arxiv.org/abs/2311.07625
  • Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar, 6 Oct 2023, ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, https://arxiv.org/abs/2310.04564
  • Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei, 24 Jul 2024 (v3), Q-Sparse: All Large Language Models can be Fully Sparsely-Activated, https://arxiv.org/abs/2407.10969
  • Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash, 3 Jun 2024 (v3), Learn To be Efficient: Build Structured Sparsity in Large Language Models, https://arxiv.org/abs/2402.06126
  • Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594
  • Neural Magic, 2024, DeepSparse: Sparsity-aware deep learning inference runtime for CPUs, https://github.com/neuralmagic/deepsparse https://neuralmagic.com/deepsparse/
  • Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li, 2 Sep 2024, CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification, https://arxiv.org/abs/2409.01366
  • Chaojun Xiao, Zhengyan Zhang, Chenyang Song, Dazhi Jiang, Feng Yao, Xu Han, Xiaozhi Wang, Shuo Wang, Yufei Huang, Guanyu Lin, Yingfa Chen, Weilin Zhao, Yuge Tu, Zexuan Zhong, Ao Zhang, Chenglei Si, Khai Hao Moo, Chenyang Zhao, Huimin Chen, Yankai Lin, Zhiyuan Liu, Jingbo Shang, Maosong Sun, Sep 2024, Configurable Foundation Models: Building LLMs from a Modular Perspective, https://arxiv.org/pdf/2409.02877
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen, 23 Oct 2024, CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation, https://arxiv.org/abs/2410.18311 https://wangqinsi1.github.io/coreinfer_page/
  • Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun, 4 Nov 2024, Sparsing Law: Towards Large Language Models with Greater Activation Sparsity, https://arxiv.org/abs/2411.02335
  • Jiho Shin, Hoeseok Yang, Youngmin Yi, 19 Nov 2024, SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference, https://arxiv.org/abs/2411.12692
  • Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin, 3 Dec 2024 (v2), Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification, https://arxiv.org/abs/2412.00876 https://github.com/Osilly/dynamic_llava (Sparsification of the context in vision model.)
  • Yilong Chen, Junyuan Shang, Zhengyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang, 10 Dec 2024 (v2), Mixture of Hidden-Dimensions Transformer, https://arxiv.org/abs/2412.05644
  • Vui Seng Chua, Yujie Pan, Nilesh Jain, 10 Dec 2024, Post-Training Statistical Calibration for Higher Activation Sparsity, https://arxiv.org/abs/2412.07174 https://github.com/IntelLabs/SCAP
  • Nobel Dhar, Bobin Deng, Md Romyull Islam, Kazi Fahim Ahmad Nasif, Liang Zhao, Kun Suo, 13 Dec 2024, Activation Sparsity Opportunities for Compressing General Large Language Models, https://arxiv.org/abs/2412.12178
  • Zihao Zheng, Yuanchun Li, Jiayu Chen, Peng Zhou, Xiang Chen, Yunxin Liu, 18 Dec 2024, Threshold Neuron: A Brain-inspired Artificial Neuron for Efficient On-device Inference, https://arxiv.org/abs/2412.13902 (Multiplication-free model architecture using comparisons and subtraction, including a threshold mechanism that make it analogous to activation sparsification.)
  • Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
  • Y An, Z Chen, C Xiong, B Chen, Jan 2025, Herd: Grouping before Pruning for Batch Inference, https://oasis-git.github.io/data/herd.pdf (Dynamic activation pruning across multiple queries in a batch.)
  • Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen, 12 Jun 2024 (v2), PowerInfer-2: Fast Large Language Model Inference on a Smartphone, https://arxiv.org/abs/2406.06282 Project: https://powerinfer.ai/v2/ Code: https://github.com/SJTU-IPADS/PowerInfer (Runs 47B models on phones using neuron cluster approach to matrix multiplication on NPUs and dynamic activation sparsity, with different approaches for prefill versus decoding phases.)
  • Shangqian Gao, Ting Hua, Reza Shirkavand, Chi-Heng Lin, Zhen Tang, Zhengao Li, Longge Yuan, Fangyi Li, Zeyu Zhang, Alireza Ganjdanesh, Lou Qian, Xu Jie, Yen-Chang Hsu, 25 Jan 2025, ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning, https://arxiv.org/abs/2501.15316
  • W Zhang, X Ren, Mar 2025, ReM: Sparsify and MoEfy Models with Post-Hoc ReLU Modulation, ICLR 2025 review, https://openreview.net/pdf?id=cizhOu3CZa (Induce activation sparsity for MoE choice in the model router.)
  • Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun, 30 Jul 2025, BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity, https://arxiv.org/abs/2507.08771
  • Tai An, Ruwu Cai, Yanzhe Zhang, Yang Liu, Hao Chen, Pengcheng Xie, Sheng Chang, Yiwu Yao, and Gongyi Wang, 4 Aug 2025, Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models, https://arxiv.org/abs/2508.02128
  • Nobel Dhar, Bobin Deng, Md Romyull Islam, Xinyue Zhang, Kazi Fahim Ahmad Nasif, and Kun Suo, 11 Jul 2025, A Sparsity Predicting Approach for Large Language Models via Activation Pattern Clustering, https://arxiv.org/abs/2507.14179

Sparse Matrix Multiplication

There are special optimizations made possible by sparsity in the MatMul/GEMM kernels, typically based on compressed storage formats that store and multiply only the non-zero entries. Research on sparse matrix computations:
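As a minimal illustration of what such kernels exploit, the sketch below builds a CSR (compressed sparse row) representation of a weight matrix and multiplies it by a vector, touching only the stored non-zero values. It is a pure-Python teaching sketch; production kernels (e.g., GPU sparse libraries) are far more sophisticated.

    import numpy as np

    def dense_to_csr(W):
        """Convert a dense matrix to CSR form: (values, column indices, row pointers)."""
        values, col_idx, row_ptr = [], [], [0]
        for row in W:
            nz = np.nonzero(row)[0]
            values.extend(row[nz])
            col_idx.extend(nz)
            row_ptr.append(len(values))
        return (np.array(values, dtype=W.dtype),
                np.array(col_idx, dtype=np.int64),
                np.array(row_ptr, dtype=np.int64))

    def csr_matvec(values, col_idx, row_ptr, x):
        """Compute y = W @ x using only the stored non-zero entries."""
        y = np.zeros(len(row_ptr) - 1, dtype=x.dtype)
        for i in range(len(y)):
            start, end = row_ptr[i], row_ptr[i + 1]
            y[i] = np.dot(values[start:end], x[col_idx[start:end]])
        return y

    # Illustrative usage: a roughly 90%-sparse matrix.
    W = np.random.randn(256, 256).astype(np.float32)
    W[np.abs(W) < 1.6] = 0.0
    vals, cols, ptrs = dense_to_csr(W)
    x = np.random.randn(256).astype(np.float32)
    assert np.allclose(csr_matvec(vals, cols, ptrs, x), W @ x, atol=1e-4)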

Block Sparsity

Research on block-level sparsity:

  • Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, xFormers: A modular and hackable Transformer modelling library, 2022, Facebook Research, Code: https://github.com/facebookresearch/xformers
  • Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
  • Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei, 24 Jul 2024 (v3), Q-Sparse: All Large Language Models can be Fully Sparsely-Activated, https://arxiv.org/abs/2407.10969
  • Lee, E., Han, Y., Moon, G.E. (2024). Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_1 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_1
  • Cong Guo; Fengchen Xue; Jingwen Leng; Yuxian Qiu, May 2024, Accelerating Sparse DNNs Based on Tiled GEMM, IEEE Transactions on Computers, vol. 73, no. 5, pp. 1275-1289, May 2024, doi: 10.1109/TC.2024.3365942, https://ieeexplore.ieee.org/abstract/document/10436533
  • Paolo D'Alberto, Taehee Jeong, Akshai Jain, Shreyas Manjunath, Mrinal Sarmah, Samuel Hsu, Yaswanth Raparti, Nitesh Pipralia, 12 Jul 2024, Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators, https://arxiv.org/abs/2407.09453
  • Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu, 20 Aug 2024, LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models, https://arxiv.org/abs/2408.10631 https://github.com/YupengSu/LLM-Barber
  • Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin, 30 Jul 2024, Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs, https://arxiv.org/abs/2407.20496
  • Kuo-Wei Chang, Tian-Sheuan Chang, 2 May 2022, VSCNN: Convolution Neural Network Accelerator With Vector Sparsity, https://arxiv.org/abs/2205.02271
  • Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie, 19 Sep 2024, CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs, https://arxiv.org/abs/2409.12490
  • Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang, 18 Oct 2024 (v2), SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs, https://arxiv.org/abs/2410.13276
  • Q. Dong, S. Zhang and Z. Wang, "An Efficient Window-Based Vision Transformer Accelerator via Mixed-Granularity Sparsity," in IEEE Transactions on Circuits and Systems I: Regular Papers, doi: 10.1109/TCSI.2025.3527541. https://ieeexplore.ieee.org/abstract/document/10844888
  • Youping Gu, Xiaolong Li, Yuhao Hu, Bohan Zhuang, 14 Aug 2025, Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation, https://arxiv.org/abs/2508.10774

Vector Sparsity

Vector sparsity is similar to block sparsity, but applies only along a single dimension; an illustrative N:M sparsity sketch follows the research list below. Research on vector-level sparsity:

  • S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, 2016, Eie: Efficient inference engine on compressed deep neural network, in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA, IEEE Press, 2016, pp. 243–254, https://doi.org/10.1109/ISCA.2016.30 https://arxiv.org/abs/1602.01528
  • Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin, 30 Jul 2024, Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs, https://arxiv.org/abs/2407.20496
  • Kuo-Wei Chang, Tian-Sheuan Chang, 2 May 2022, VSCNN: Convolution Neural Network Accelerator With Vector Sparsity, https://arxiv.org/abs/2205.02271
  • M. Zhu, T. Zhang, Z. Gu and Y. Xie, "Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs", Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture, pp. 359-371, Oct. 2019. https://dl.acm.org/doi/pdf/10.1145/3352460.3358269 (Vector-wise sparsity.)
  • Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler, Oct 2023, VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores, https://browse.arxiv.org/abs/2310.02065
  • Wenlun Zhang, Shimpei Ando, Yung-Chin Chen, Satomi Miyagi, Shinya Takamaeda-Yamazaki, Kentaro Yoshioka, 29 Aug 2024, PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation, https://arxiv.org/abs/2408.16246
  • Aditya Desai, Shuo Yang, Alejandro Cuadron, Ana Klimovic, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica, 19 Dec 2024, HashAttention: Semantic Sparsity for Faster Inference, https://arxiv.org/abs/2412.14468
  • Piotr Indyk, Michael Kapralov, Kshiteej Sheth, Tal Wagner, 31 Jul 2025, Improved Algorithms for Kernel Matrix-Vector Multiplication Under Sparsity Assumptions, https://arxiv.org/abs/2507.23539
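
As noted above, several of these papers use N:M formats, which keep at most N non-zero weights in every group of M consecutive weights along one dimension (for example, the 2:4 pattern supported by recent NVIDIA sparse tensor cores). The sketch below enforces 2:4 sparsity by magnitude within each row; it is a generic illustration, not any specific paper's algorithm.

    import numpy as np

    def enforce_n_m_sparsity(W, n=2, m=4):
        """In every group of m consecutive weights along each row, keep only the n
        largest-magnitude weights and zero the rest (e.g., 2:4 sparsity)."""
        rows, cols = W.shape
        assert cols % m == 0
        groups = W.reshape(rows, cols // m, m)
        # Rank weights within each group by magnitude; keep only the top n.
        order = np.argsort(np.abs(groups), axis=-1)
        mask = np.zeros(groups.shape, dtype=bool)
        np.put_along_axis(mask, order[..., -n:], True, axis=-1)
        return (groups * mask).reshape(rows, cols)

    # Illustrative usage: exactly 50% of the weights survive with 2:4 sparsity.
    W = np.random.randn(8, 16).astype(np.float32)
    W24 = enforce_n_m_sparsity(W, n=2, m=4)
    print(np.count_nonzero(W24) / W24.size)  # 0.5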

Tensor Sparsity

Research on sparse tensors:

SLIDE (Sparse Hashing for Back-Propagation in Training)

Research papers on SLIDE:

Dynamic Sparsity Research

Papers on dynamic sparsity include:

General Research on Sparsity Techniques

  • Li, Y.; Yu, Y.; Zhang, Q.; Liang, C.; He, P.; Chen, W.; and Zhao, T. 2023. LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation. In Krause, A.; Brunskill, E.; Cho, K.; Engelhardt, B.; Sabato, S.; and Scarlett, J., eds., Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, 20336–20350. PMLR. https://arxiv.org/abs/2306.11222
  • Frantar, E.; and Alistarh, D. 2023. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv:2301.00774. https://arxiv.org/abs/2301.00774
  • X. Dai, H. Yin and N. K. Jha, "NeST: A neural network synthesis tool based on a grow-and-prune paradigm", IEEE Trans. Comput., vol. 68, no. 10, pp. 1487-1497, Oct. 2019. https://arxiv.org/abs/1711.02017
  • S. Cao et al., "Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity", Proc. Int. Symp. Field-Prog. Gate Arrays, pp. 63-72, 2019. PDF: https://wencongxiao.github.io/res/fpga19/FPGA19.pdf
  • W. Wen, C. Wu, Y. Wang, Y. Chen and H. Li, "Learning structured sparsity in deep neural networks", Proc. Adv. Neural Inf. Process. Syst., vol. 29, pp. 2074-2082, 2016. https://arxiv.org/abs/1608.03665
  • M. Zhu, T. Zhang, Z. Gu and Y. Xie, "Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs", Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture, pp. 359-371, Oct. 2019. https://dl.acm.org/doi/pdf/10.1145/3352460.3358269 (Vector-wise sparsity.)
  • Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally, "Exploring the regularity of sparse structure in convolutional neural networks", arXiv:1705.08922, 2017. https://arxiv.org/abs/1705.08922
  • H. Wang, Q. Zhang, Y. Wang, L. Yu and H. Hu, "Structured pruning for efficient ConvNets via incremental regularization", Proc. Int. Joint Conf. Neural Netw. (IJCNN), pp. 1-8, Jul. 2019. https://openreview.net/pdf?id=S1e_xM7_iQ
  • S. Narang, E. Elsen, G. Diamos and S. Sengupta, "Exploring sparsity in recurrent neural networks", arXiv:1704.05119, 2017. https://arxiv.org/abs/1704.05119
  • Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, William J. Dally, "ESE: Efficient speech recognition engine with sparse LSTM on FPGA", Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), pp. 75-84, 2017. https://arxiv.org/abs/1612.00694
  • M. Zhu and S. Gupta, "To prune or not to prune: Exploring the efficacy of pruning for model compression", arXiv:1710.01878, 2017. https://arxiv.org/abs/1710.01878
  • S. Zhang et al., "Cambricon-X: An accelerator for sparse neural networks", Proc. Int. Symp. Microarchitecture, pp. 1-12, 2016. https://ieeexplore.ieee.org/document/7783723
  • Z.-G. Liu, P. N. Whatmough and M. Mattina, "Systolic tensor array: An efficient structured-sparse GEMM accelerator for mobile CNN inference", IEEE Comput. Archit. Lett., vol. 19, no. 1, pp. 34-37, Jan. 2020. https://arxiv.org/abs/2005.08098
  • Alessandro Aimar, Hesham Mostafa, Enrico Calabrese, Antonio Rios-Navarro, Ricardo Tapiador-Morales, Iulia-Alexandra Lungu, Moritz B. Milde, Federico Corradi, Alejandro Linares-Barranco, Shih-Chii Liu, Tobi Delbruck, "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps", IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644-656, Mar. 2019. https://arxiv.org/abs/1706.01406
  • Y. Lu, C. Wang, L. Gong, X. Zhou, SparseNN: a performance-efficient accelerator for large-scale sparse neural networks, Int. J. Parallel Program. 46 (4) (2018) 648–659. https://arxiv.org/abs/1711.01263
  • J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N.E. Jerger, A. Moshovos, Cnvlutin: ineffectual-neuron-free deep neural network computing, in: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 1–13. https://ieeexplore.ieee.org/document/7551378
  • S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, Y. Chen, Cambricon-x: an accelerator for sparse neural networks, in: The 49th Annual IEEE/ACM International Symposium on Microarchitecture, Taipei, 2016, p. 20. https://ieeexplore.ieee.org/document/7783723
  • A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, W.J. Dally, SCNN: an accelerator for compressed-sparse convolutional neural networks, in: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, 2017, pp. 27–40. https://arxiv.org/abs/1708.04485
  • Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. 2017. meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting. In International Conference on Machine Learning. 3299–3308. https://arxiv.org/abs/1706.06197 Code: https://github.com/lancopku/meProp (Structural sparsification.)
  • Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS). 2074–2082. https://arxiv.org/abs/1608.03665 (Structural sparsification.)
  • P. Grigoras, P. Burovskiy, E. Hung, and W. Luk. Accelerating SpMV on FPGAs by compressing nonzero values. In International Symposium on Field Programmable Gate Arrays, pages 64–67, 2015. https://ieeexplore.ieee.org/document/7160041 (Sparse multiplication of non-zero values.)
  • S Liu, Z Wang, 2023, Ten lessons we have learned in the new" sparseland": A short handbook for sparse neural network researchers arXiv preprint arXiv:2302.02596, https://arxiv.org/abs/2302.02596
  • Enmao Diao, Ganghua Wang, Jiawei Zhan, Yuhong Yang, Jie Ding, Vahid Tarokh, Aug 2023, Pruning Deep Neural Networks from a Sparsity Perspective, https://arxiv.org/abs/2302.05601
  • B Yoon, Y Han, GE Moon, 2023, SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling arXiv preprint arXiv:2309.12578, https://arxiv.org/pdf/2309.12578.pdf
  • Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler, Oct 2023, VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores, https://browse.arxiv.org/abs/2310.02065
  • A Jaiswal, Z Gan, X Du, B Zhang, Z Wang, Y Yang, Oct 2023, Compressing LLMs: The Truth is Rarely Pure and Never Simple, arXiv preprint arXiv:2310.01382, https://browse.arxiv.org/pdf/2310.01382.pdf
  • Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar, Oct 2023 ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, https://arxiv.org/abs/2310.04564 (Recommends reinstating the simpler RELU rather than GELU or SiLU, with a focus on inference efficiency.)
  • Zichang Liu, April 2024, Ph.D. Thesis, Rice University, Houston, Texas, https://repository.rice.edu/server/api/core/bitstreams/a089344e-6f6b-44d2-a1c3-6cef2c303e86/content (Using sparsity to compress the KV cache for long context windows.)
  • Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
  • Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594 (High sparsity on Llama2 models.)
  • Jitai Hao, WeiWei Sun, Xin Xin, Qi Meng, Zhumin Chen, Pengjie Ren, Zhaochun Ren, 7 Jun 2024, MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter, https://arxiv.org/abs/2406.04984 Code: https://github.com/CURRENTF/MEFT
  • Ganesh Jawahar, April 2024, Methods for design of efficient on-device natural language processing architectures, Ph.D. thesis, Computer Science, The University of British Columbia (Vancouver) https://open.library.ubc.ca/media/download/pdf/24/1.0441384/4
  • Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, Jun Zhu, 16 Apr 2024, SparseDM: Toward Sparse Efficient Diffusion Models, https://arxiv.org/abs/2404.10445
  • Je-Yong Lee, Donghyun Lee, Genghan Zhang, Mo Tiwari, Azalia Mirhoseini, 12 Apr 2024, CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models, https://arxiv.org/abs/2404.08763 (Sparsity with dynamic control over the thresholds with an effect that is similar to intra-model MoE.)
  • Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
  • Panjie Qi; Edwin Hsing-Mean Sha; Qingfeng Zhuge; Hongwu Peng; Shaoyi Hua, 2021, Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization, 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), https://ieeexplore.ieee.org/document/9643586
  • Cong Wei, Brendan Duke, Ruowei Jiang, Parham Aarabi, Graham W. Taylor, Florian Shkurti, Mar 2023, Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers, CVPR 2023, https://arxiv.org/abs/2303.13755 https://openaccess.thecvf.com/content/CVPR2023/papers/Wei_Sparsifiner_Learning_Sparse_Instance-Dependent_Attention_for_Efficient_Vision_Transformers_CVPR_2023_paper.pdf
  • Rahul Chand, Yashoteja Prabhu, Pratyush Kumar, 20 Dec 2023, DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization, https://arxiv.org/abs/2312.13211
  • Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang, 9 Jan 2024, FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, https://arxiv.org/abs/2401.03868 (Does FFN optimization by splitting FFNs into two categories, those commonly firing and those rarely used, in both RELU and non-RELU models; effectively this is FFN pruning of a subset of FFNs.)
  • Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun, Dec 2019, Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection, https://arxiv.org/abs/1912.11637
  • Georgios Georgiadis. 2019. Accelerating Convolutional Neural Networks via Activation Map Compression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7085–7095. https://arxiv.org/abs/1812.04056
  • Bunyodbek Ibrokhimov, Cheonghwan Hur, and Sanggil Kang. 2020. Effective node selection technique towards sparse learning. APPLIED INTELLIGENCE (2020), https://dl.acm.org/doi/abs/10.1007/s10489-020-01720-5
  • Zehao Huang and Naiyan Wang. 2018. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV). 304–320. https://arxiv.org/abs/1707.01213
  • Shiyao Xu; Jingfei Jiang; Jinwei Xu; Chaorun Liu; Yuanhong He; Xiaohang Liu, 2022, Sparkle: A High Efficient Sparse Matrix Multiplication Accelerator for Deep Learning, 2022 IEEE 40th International Conference on Computer Design (ICCD) https://ieeexplore.ieee.org/document/9978530
  • C. Deng, S. Liao, Y. Xie, K. K. Parhi, X. Qian and B. Yuan, "PermDNN: Efficient compressed DNN architecture with permuted diagonal matrices", Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), pp. 189-202, Oct. 2018. https://arxiv.org/abs/2004.10936
  • Q Wei, G Zeng, B Zeng, 2023, Efficient Training for Visual Tracking with Deformable Transformer, arXiv preprint arXiv:2309.02676, https://arxiv.org/pdf/2309.02676.pdf (Optimization and also investigated effects of number of decoder layers.)
  • C. Gao, D. Neil, E. Ceolini, S.-C. Liu, and T. Delbruck, “DeltaRNN: A power-efficient recurrent neural network accelerator,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2018, pp. 21–30. PDF: https://dl.acm.org/doi/pdf/10.1145/3174243.3174261
  • Gale, T., Elsen, E., and Hooker, S., The state of sparsity in deep neural networks, arXiv preprint arXiv:1902.09574, 2019, https://arxiv.org/abs/1902.09574
  • Kwon, W., Kim, S., Mahoney, M. W., Hassoun, J., Keutzer, K., and Gholami, A., 2022, A fast post-training pruning framework for transformers, arXiv preprint arXiv:2204.09656, https://arxiv.org/abs/2204.09656
  • Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré, Feb 2023, Hyena Hierarchy: Towards Larger Convolutional Language Models, https://arxiv.org/abs/2302.10866
  • Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, Haibo Chen, 11 Jun 2024 (v2), Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters, https://arxiv.org/abs/2406.05955
  • Splash: Sparse Flash Attention, 2024, https://github.com/google/jax/blob/main/jax/experimental/pallas/ops/tpu/splash_attention/splash_attention_kernel.py
  • 1 Jun 2023, Faster Causal Attention Over Large Sequences Through Sparse Flash Attention, Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret, https://arxiv.org/abs/2306.01160
  • Mingxuan He, Mithuna Thottethodi, T.N. Vijaykumar, 6 Apr 2024, Efficient Sparse Processing-in-Memory Architecture (ESPIM) for Machine Learning Inference, https://arxiv.org/abs/2404.04708
  • Mirko Farina, Usman Ahmad, Ahmad Taha, Hussein Younes, Yusuf Mesbah, Xiao Yu, Witold Pedrycz, 2024, Sparsity in transformers: A systematic literature review, Neurocomputing, Volume 582, 14 May 2024, 127468, https://www.sciencedirect.com/science/article/abs/pii/S092523122400239X (General survey of sparsity methods, and techniques that create sparsity.)
  • Reece Shuttleworth, CHARACTERIZING SPARSITY IN TRANSFORMERS https://reeceshuttle.me/assets/9.58-Final-Project-Report.pdf Code: https://github.com/reeceshuttle/958
  • Jianlei Yang, Jiacheng Liao, Fanding Lei, Meichen Liu, Junyi Chen, Lingkun Long, Han Wan, Bei Yu, Weisheng Zhao, Nov 2023, TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices, https://arxiv.org/abs/2311.01759
  • Jun Liu; Guohao Dai; Hao Xia; Lidong Guo; Xiangsheng Shi; Jiaming Xu; Nov 2023, TSTC: Two-Level Sparsity Tensor Core Enabling both Algorithm Flexibility and Hardware Efficiency, 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), https://ieeexplore.ieee.org/abstract/document/10323775 (Managing sparse tensors efficiently by using two-level data structures that allows granular control of sparsity.)
  • Eunji Kwon; Jongho Yoon; Seokhyeong Kang, Dec 2023, Mobile Transformer Accelerator Exploiting Various Line Sparsity and Tile-Based Dynamic Quantization, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Early Access), https://ieeexplore.ieee.org/abstract/document/10375766
  • Luca Dordoni, Dec 2023, Sparsification of deep neural network via ternary quantization, Masters Thesis, POLITECNICO DI TORINO, Italy, https://webthesis.biblio.polito.it/29424/1/tesi.pdf
  • Trevor Gale, Matei Zaharia, Cliff Young, Erich Elsen, Aug 2020, Sparse GPU Kernels for Deep Learning, https://arxiv.org/abs/2006.10901
  • Ziheng Wang, Aug 2020, SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference, https://arxiv.org/abs/2008.11849
  • Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Mykola Pechenizkiy, Yi Liang, Zhangyang Wang, Shiwei Liu, Oct 2023, Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity, https://arxiv.org/abs/2310.05175
  • Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
  • Anonymous, ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, OpenReview, https://openreview.net/pdf?id=osoWxY8q2E
  • Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Lukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, and Jonni Kanerva. Sparse is enough in scaling transformers. In Advances in Neural Information Processing Systems, 2021. https://openreview.net/forum?id=-b5OSCydOMe. https://arxiv.org/abs/2111.12763
  • M Piórczyński, F Szatkowski, K Bałazy, B Wójcik, 2023, Exploiting Transformer Activation Sparsity with Dynamic Inference https://arxiv.org/pdf/2310.04361.pdf
  • KAA Fuad, L Chen, 2023, A Survey on Sparsity Exploration in Transformer-Based Accelerators https://www.mdpi.com/2079-9292/12/10/2299
  • Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen, 12 Jun 2024 (v2), PowerInfer-2: Fast Large Language Model Inference on a Smartphone, https://arxiv.org/abs/2406.06282 Project: https://powerinfer.ai/v2/ Code: https://github.com/SJTU-IPADS/PowerInfer (Runs 47B models on phones using neuron cluster approach to matrix multiplication on NPUs and dynamic activation sparsity, with different approaches for prefill versus decoding phases.)
  • Zehao Huang. 2018. Data-Driven Sparse Structure Selection for Deep Neural Networks. Papers with Code. https://paperswithcode.com/paper/data-driven-sparse-structure-selection-for (2021).
  • M. A. Nasution, D. Chahyati and M. I. Fanany, 2017, "Faster R-CNN with structured sparsity learning and Ristretto for mobile environment", Proc. Int. Conf. Adv. Comput. Sci. Inf. Syst. (ICACSIS), pp. 309-314, Oct. 2017. https://ieeexplore.ieee.org/document/8355051
  • 25 May 2024, Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection, Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, Jindong Chen, https://arxiv.org/abs/2405.16178
  • Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin, 2019, Adaptive attention span in transformers. CoRR, abs/1905.07799, 2019, http://arxiv.org/abs/1905.07799.
  • Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019, Augmenting self-attention with persistent memory. CoRR, abs/1907.01470, 2019. http://arxiv.org/abs/1907.01470
  • Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. https://openai.com/blog/sparse-transformers, 2019, https://arxiv.org/abs/1904.10509
  • Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. Large memory layers with product keys. CoRR, abs/1907.05242, 2019. http://arxiv.org/abs/1907.05242
  • Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • Donghyeon Joo, Ramyad Hadidi, Soheil Feizi, Bahar Asgari, 17 Jun 2024, Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference, https://arxiv.org/abs/2406.11674
  • Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus, 29 Apr 2022 (v2), ST-MoE: Designing Stable and Transferable Sparse Expert Models, https://arxiv.org/abs/2202.08906
  • Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
  • Geonhwa Jeong, Po-An Tsai, Stephen W. Keckler, Tushar Krishna, 19 Jun 2024, SDQ: Sparse Decomposed Quantization for LLM Inference, https://arxiv.org/abs/2406.13868 (Combining sparsity and quantization.)
  • Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
  • Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
  • Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang, 16 Jul 2024, Scaling Diffusion Transformers to 16 Billion Parameters, https://arxiv.org/abs/2407.11633 Project: https://github.com/feizc/DiT-MoE
  • Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
  • Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song, 19 Sep 2023, Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity, https://arxiv.org/abs/2309.10285 Code: https://github.com/AlibabaResearch/flash-llm (Unstructured pruning on tensor cores in GPUs with sparse MatMul optimizations.)
  • Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar, 4 Jan 2024 (v2), LLM in a flash: Efficient Large Language Model Inference with Limited Memory, https://arxiv.org/abs/2312.11514 (Storing model parameters in flash memory on phones.)
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Peter Belcak, Roger Wattenhofer, Aug 2024, UltraSparseBERT: 99% Conditionally Sparse Language Modelling, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 104–108, August 11-16, 2024, https://aclanthology.org/2024.acl-short.10.pdf
  • Szabolcs Cséfalvay, James Imber, 31 Jan 2023 (v2), Self-Compressing Neural Networks, https://arxiv.org/abs/2301.13142
  • Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng, 11 Aug 2024, Post-Training Sparse Attention with Double Sparsity, https://arxiv.org/abs/2408.07092 Code: https://github.com/andy-yang-1/DoubleSparse (Combined token-level sparse attention with reduced KV data accesses.)
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
  • Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, 22 Aug 2024, A Tighter Complexity Analysis of SparseGPT, https://arxiv.org/abs/2408.12151
  • Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
  • Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang, 21 Feb 2024, Do Efficient Transformers Really Save Computation? https://arxiv.org/abs/2402.13934
  • James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun, 26 Aug 2024, Training-Free Activation Sparsity in Large Language Models, https://arxiv.org/abs/2408.14690
  • Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
  • Amir Basic, 2024, Sparsification with Variational Dropout, Master’s thesis, Data Science, Department of Mathematics, Faculty of Mathematics and Natural Sciences, University of Oslo, Norway, https://www.duo.uio.no/bitstream/handle/10852/112199/1/Amir_Basic_Masteroppgave.pdf
  • Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei, 24 Jul 2024 (v3), Q-Sparse: All Large Language Models can be Fully Sparsely-Activated, https://arxiv.org/abs/2407.10969
  • Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash, 3 Jun 2024 (v3), Learn To be Efficient: Build Structured Sparsity in Large Language Models, https://arxiv.org/abs/2402.06126
  • My Social, May 17, 2024, Sparse Llama: Revolutionizing LLMs with 70% Sparsity, https://medium.com/aimonks/sparse-llama-revolutionizing-llms-with-70-sparsity-e6e9664f38e1
  • Cerebras, May 15, 2024, Introducing Sparse Llama: 70% Smaller, 3x Faster, Full Accuracy, https://cerebras.ai/blog/introducing-sparse-llama-70-smaller-3x-faster-full-accuracy
  • Neural Magic, 2024, Sparse Foundational Llama 2 Models, https://docs.neuralmagic.com/llms/models/sparse-foundational-llama-2/
  • Joo Hyung Lee, Wonpyo Park, Nicole Elyse Mitchell, Jonathan Pilault, Johan Samir Obando Ceron, Han-Byul Kim, Namhoon Lee, Elias Frantar, Yun Long, Amir Yazdanbakhsh, Woohyun Han, Shivani Agrawal, Suvinay Subramanian, Xin Wang, Sheng-Chun Kao, Xingyao Zhang, Trevor Gale, Aart J.C. Bik, Milen Ferev, Zhonglin Han, Hong-Seok Kim, Yann Dauphin, Gintare Karolina Dziugaite, Pablo Samuel Castro, Utku Evci, 2024, Jaxpruner: A Concise Library for Sparsity Research, Conference on Parsimony and Learning, PMLR 234:515-528, https://proceedings.mlr.press/v234/lee24a.html https://proceedings.mlr.press/v234/lee24a/lee24a.pdf https://openreview.net/forum?id=H2rCZCfXkS https://openreview.net/pdf?id=H2rCZCfXkS
  • Simla Burcu Harma, Ayan Chakraborty, Elizaveta Kostenok, Danila Mishin, Dongho Ha, Babak Falsafi, Martin Jaggi, Ming Liu, Yunho Oh, Suvinay Subramanian, Amir Yazdanbakhsh, 31 May 2024, Effective Interplay between Sparsity and Quantization: From Theory to Practice, https://arxiv.org/abs/2405.20935
  • Krisna Pinasthika, Blessius Sheldo Putra Laksono, Riyandi Banovbi Putera Irsal, Syifa Hukma Shabiyya, Novanto Yudistira, 11 Sep 2023, SparseSwin: Swin Transformer with Sparse Transformer Block, https://arxiv.org/abs/2309.05224 https://www.sciencedirect.com/science/article/abs/pii/S0925231224002042
  • Zhang, H., Ma, W., Yuan, W. et al. Mixed-precision block incomplete sparse approximate preconditioner on Tensor core. CCF Trans. HPC 6, 54–67 (2024). https://doi.org/10.1007/s42514-023-00165-9 https://link.springer.com/article/10.1007/s42514-023-00165-9
  • Bobby Yan, Alexander J. Root, Trevor Gale, David Broman, Fredrik Kjolstad, 20 Jun 2024 (v2), Scorch: A Library for Sparse Deep Learning, https://arxiv.org/abs/2405.16883
  • Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li, 2 Sep 2024, CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification, https://arxiv.org/abs/2409.01366
  • Yuzong Chen, Jian Meng, Jae-sun Seo, Mohamed S. Abdelfattah, 8 Sep 2024, BBS: Bi-directional Bit-level Sparsity for Deep Learning Acceleration, https://arxiv.org/abs/2409.05227
  • Jordan Dotzel, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang, Sep 2024, Opportunities for Post-Training Dynamic Layer Sparsity in Large Vision and Language Models, https://openaccess.thecvf.com/content/CVPR2024W/ELVM/papers/Dotzel_Opportunities_for_Post-Training_Dynamic_Layer_Sparsity_in_Large_Vision_and_CVPRW_2024_paper.pdf (Layerwise dynamic sparsity for large vision and language models.)
  • Y. Jin, R. Zhong, S. Long and J. Zhai, "Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment," in IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2024.3462092. https://ieeexplore.ieee.org/document/10682058 https://www.computer.org/csdl/journal/td/5555/01/10682058/20jHtbSkOJO https://doi.ieeecomputersociety.org/10.1109/TPDS.2024.3462092
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4, https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
  • Elias Frantar, September, 2024, Compressing Large Neural Networks Algorithms, Systems and Scaling Laws, Ph.D. Thesis, Graduate School, Institute of Science and Technology, Austria, https://research-explorer.ista.ac.at/download/17485/17880/frantar_thesis_final.pdf
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Juan Pablo Muñoz, Jinjie Yuan, Nilesh Jain, 1 Oct 2024, SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models, https://arxiv.org/abs/2410.03750 https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning
  • Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang, 9 Oct 2024 (v2), SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference, https://arxiv.org/abs/2410.04417 https://github.com/Gumpest/SparseVLMs
  • C. Zhang et al., "DSTC: Dual-Side Sparsity Tensor Core for DNNs Acceleration on Modern GPU Architectures," in IEEE Transactions on Computers, doi: 10.1109/TC.2024.3475814. https://ieeexplore.ieee.org/abstract/document/10709841 (Sparse kernels in hardware.)
  • Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen, 21 Oct 2024, Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs, https://arxiv.org/abs/2410.16135
  • Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin, 21 Oct 2024, Pruning Foundation Models for High Accuracy without Retraining, https://arxiv.org/abs/2410.15567 https://github.com/piuzha/APT
  • Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen, 23 Oct 2024, CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation, https://arxiv.org/abs/2410.18311 https://wangqinsi1.github.io/coreinfer_page/
  • Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin, 3 Dec 2024 (v2), Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification, https://arxiv.org/abs/2412.00876 https://github.com/Osilly/dynamic_llava (Sparsification of the vision-language context in a multimodal model.)
  • Hongxuan Zhang, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen, 16 Dec 2024, CSR:Achieving 1 Bit Key-Value Cache via Sparse Representation, https://arxiv.org/abs/2412.11741
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
  • Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Xing Mei, Lean Fu, 23 Dec 2024, GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference, https://arxiv.org/abs/2412.17560
  • Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak, 25 Jan 2025 (v2), Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models, https://arxiv.org/abs/2501.12370
  • Tiernan Ray, Feb. 19, 2025, What is sparsity? DeepSeek AI's secret, revealed by Apple researchers: The AI model that shook the world is part of a broad trend to squeeze more out of chips. Here's how it works. https://www.zdnet.com/article/what-is-sparsity-deepseek-ais-secret-revealed-by-apple-researchers/
  • Ahmed F. AbouElhamayed, Jordan Dotzel, Yash Akhauri, Chi-Chih Chang, Sameh Gobriel, J. Pablo Muñoz, Vui Seng Chua, Nilesh Jain, Mohamed S. Abdelfattah, 18 Feb 2025, SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs, https://arxiv.org/abs/2502.12444
  • Ruibo Fan, Xiangrui Yu, Peijie Dong, Zeyu Li, Gu Gong, Qiang Wang, Wei Wang, and Xiaowen Chu. 2025. SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys '25). Association for Computing Machinery, New York, NY, USA, 243–260. https://doi.org/10.1145/3689031.3717481 https://dl.acm.org/doi/abs/10.1145/3689031.3717481
  • Petr Kasalický, Martin Spišák, Vojtěch Vančura, Daniel Bohuněk, Rodrigo Alves, Pavel Kordík, 16 May 2025, The Future is Sparse: Embedding Compression for Scalable Retrieval in Recommender Systems, https://arxiv.org/abs/2505.11388
  • Zhiyong Jin, Runhua Xu, Chao Li, Yizhong Liu, Jianxin Li, 18 Jul 2025, Sparsification Under Siege: Defending Against Poisoning Attacks in Communication-Efficient Federated Learning, https://arxiv.org/abs/2505.01454
  • Andrii Balashov, 23 Jul 2025, Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models, https://arxiv.org/abs/2507.17107
  • Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal, 9 Aug 2025, ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization, https://arxiv.org/abs/2311.13171
  • Selahattin Akkas and Ariful Azad, 28 Jul 2025, Shapley-Value-Based Graph Sparsification for GNN Inference, https://arxiv.org/abs/2507.20460
  • Seokho Han, Seoyeon Yoon, Jinhee Kim, Dongwei Wang, Kang Eun Jeon, Huanrui Yang, Jong Hwan Ko, 30 Jul 2025, MSQ: Memory-Efficient Bit Sparsification Quantization, https://arxiv.org/abs/2507.22349
  • Joyentanuj Das, Suranjan De, He Sun, 7 Aug 2025, Online Sparsification of Bipartite-Like Clusters in Graphs, https://arxiv.org/abs/2508.05437
  • Yingfeng Luo, Dingyang Lin, Junxin Wang, Ziqiang Xu, Kaiyan Chang, Tong Zheng, Bei Li, Anxiang Ma, Tong Xiao, Zhengtao Yu, Jingbo Zhu, 8 Aug 2025, One Size Does Not Fit All: A Distribution-Aware Sparsification for More Precise Model Merging, https://arxiv.org/abs/2508.06163
  • Keumseo Ryum, Jinu Gong, and Joonhyuk Kang, 12 Aug 2025, SHEFL: Resource-Aware Aggregation and Sparsification in Heterogeneous Ensemble Federated Learning, https://arxiv.org/abs/2508.08552
  • Jiexia Ye, Weiqi Zhang, Ziyue Li, Jia Li, Fugee Tsung, 17 Aug 2025, MedSpaformer: a Transferable Transformer with Multi-granularity Token Sparsification for Medical Time Series Classification, https://arxiv.org/abs/2503.15578
  • Aly M. Kassem, Zhuan Shi, Negar Rostamzadeh, Golnoosh Farnadi, 19 Jun 2025, Reviving Your MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing, https://arxiv.org/abs/2507.21084
  • Peter Plantinga, Jen-Kai Chen, Roozbeh Sattari, Mirco Ravanelli, Denise Klein, 16 Jul 2025, From Black Box to Biomarker: Sparse Autoencoders for Interpreting Speech Models of Parkinson's Disease, https://arxiv.org/abs/2507.16836
  • Suchit Gupte, Vishnu Kabir Chhabra and Mohammad Mahdi Khalili, 21 Jul 2025, On the transferability of Sparse Autoencoders for interpreting compressed models, https://arxiv.org/abs/2507.15977
  • Jintao Zhang and Zirui Liu and Mingyue Cheng and Shilong Zhang and Tingyue Pan and Yitong zhou and Qi Liu and Yanhu Xie, 22 Jul 2025, Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model, https://arxiv.org/abs/2505.22116
  • Galip Ümit Yolcu, Moritz Weckbecker, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin, 24 Jul 2025, DualXDA: Towards Sparse, Efficient and Explainable Data Attribution in Large AI Models, https://arxiv.org/abs/2402.12118
  • Kenza Bouzid, Shruthi Bannur, Felix Meissen, Daniel Coelho de Castro, Anton Schwaighofer, Javier Alvarez-Valle, Stephanie L. Hyland, 18 Jul 2025, Insights into a radiology-specialised multimodal large language model with sparse autoencoders, https://arxiv.org/abs/2507.12950
  • Md Arafat Hossain, Xingfu Wu, Valerie Taylor, Ali Jannesari, 8 Aug 2025, Generalizing Scaling Laws for Dense and Sparse Large Language Models, https://arxiv.org/abs/2508.06617
  • Kabir Khan, Priya Sharma, Arjun Mehta, Neha Gupta and Ravi Narayanan, 10 Aug 2025, DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention, https://arxiv.org/abs/2508.07185
  • Viktoria Schuster, 29 Jul 2025, Can sparse autoencoders make sense of gene expression latent variable models?, https://arxiv.org/abs/2410.11468
  • Carolina Zheng, Nicolas Beltran-Velez, Sweta Karlekar, Claudia Shi, Achille Nazaret, Asif Mallik, Amir Feder, David M. Blei, 31 Jul 2025, Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders, https://arxiv.org/abs/2507.23220
  • Shuyi Zhang, Wei Shi, Sihang Li, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang, 14 Aug 2025, Interpretable Reward Model via Sparse Autoencoder, https://arxiv.org/abs/2508.08746
  • David Ye, Jan Williams, Mars Gao, Stefano Riva, Matteo Tomasetto, David Zoro, J. Nathan Kutz, 28 Jul 2025, PySHRED: A Python package for SHallow REcurrent Decoding for sparse sensing, model reduction and scientific discovery, https://arxiv.org/abs/2507.20954
  • Chao Wu, Zhenyi Wang, Kangxian Xie, Naresh Kumar Devulapally, Vishnu Suresh Lokhande, Mingchen Gao, 28 Jul 2025, Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder, https://arxiv.org/abs/2507.20973
  • Christopher F. Brown, Michal R. Kazmierski, Valerie J. Pasquarella, William J. Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, Noel Gorelick, Lihui Lydia Zhang, Sophia Alj, Emily Schechter, Sean Askay, Oliver Guinan, Rebecca Moore, Alexis Boukouvalas, Pushmeet Kohli, 29 Jul 2025, AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data, https://arxiv.org/abs/2507.22291
  • Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du, 29 Jul 2025, Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering, https://arxiv.org/abs/2505.15038
  • Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Liwu Xu, Tianjin Huang, Wenwu Wang, Shiwei Liu, Xilu Wang, 4 Aug 2025, LOST: Low-rank and Sparse Pre-training for Large Language Models, https://arxiv.org/abs/2508.02668
  • Sadegh Ebrahimkhani and John Lataire, 2 Aug 2025, Kernel-Based Sparse Additive Nonlinear Model Structure Detection through a Linearization Approach, https://arxiv.org/abs/2508.01453
  • Xin Zhang, Quanyu Zhu, Liangbei Xu, Zain Huda, Wang Zhou, Jin Fang, Dennis van der Staay, Yuxi Hu, Jade Nie, Jiyan Yang and Chunzhi Yang, 5 Aug 2025, Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training, https://arxiv.org/abs/2508.03854
  • Adit Krishnan, Chu Wang, Chris Kong, 12 Aug 2025, Classifier Language Models: Unifying Sparse Finetuning and Adaptive Tokenization for Specialized Classification Tasks, https://arxiv.org/abs/2508.08635
  • Andela Ilic, Jiaxi Jiang, Paul Streli, Xintong Liu, Christian Holz, 13 Aug 2025, Human Motion Capture from Loose and Sparse Inertial Sensors with Garment-aware Diffusion Models, https://arxiv.org/abs/2506.15290
  • Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng, 15 Aug 2025, Probing the Representational Power of Sparse Autoencoders in Vision Models, https://arxiv.org/abs/2508.11277
  • Zhihao Zhan, Jianan Zhao, Zhaocheng Zhu, Jian Tang, 16 Aug 2025, Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention, https://arxiv.org/abs/2507.00449
  • Siddharth Chaudhary, Bennett Browning, 20 Aug 2025, Hydra: A 1.6B-Parameter State-Space Language Model with Sparse Attention, Mixture-of-Experts, and Memory, https://arxiv.org/abs/2508.15099
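Several of the papers above (e.g., Flash-LLM and the sparse tensor core work) build highly tuned sparse MatMul kernels. As a rough illustration of the basic idea they share, which is to store only the nonzero weights and skip the multiplications that would involve zeros, here is a minimal sketch in plain C++ of a matrix-vector product over a CSR (compressed sparse row) matrix. The CsrMatrix struct and spmv function are illustrative names invented for this sketch, not code from any of the cited projects; production kernels add blocking, structured patterns such as 2:4, vectorization, and GPU-specific memory layouts.

    // Minimal CSR (compressed sparse row) sparse matrix-vector multiply.
    // Only the nonzero weights are stored and multiplied, which is the
    // basic payoff of unstructured weight sparsity.
    #include <cstdio>
    #include <vector>

    struct CsrMatrix {
        int rows = 0;
        std::vector<int>   row_ptr;   // size rows+1; row i spans [row_ptr[i], row_ptr[i+1])
        std::vector<int>   col_idx;   // column index of each stored nonzero
        std::vector<float> values;    // the nonzero weight values themselves
    };

    // y = A * x, touching only the stored nonzeros.
    std::vector<float> spmv(const CsrMatrix& A, const std::vector<float>& x) {
        std::vector<float> y(A.rows, 0.0f);
        for (int i = 0; i < A.rows; ++i) {
            float sum = 0.0f;
            for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
                sum += A.values[k] * x[A.col_idx[k]];   // zero weights never appear here
            }
            y[i] = sum;
        }
        return y;
    }

    int main() {
        // 3x3 sparsified weight matrix with 4 nonzeros: [[1 0 2], [0 0 3], [4 0 0]]
        CsrMatrix A;
        A.rows    = 3;
        A.row_ptr = {0, 2, 3, 4};
        A.col_idx = {0, 2, 2, 0};
        A.values  = {1.0f, 2.0f, 3.0f, 4.0f};
        std::vector<float> x = {1.0f, 1.0f, 1.0f};
        std::vector<float> y = spmv(A, x);
        std::printf("%g %g %g\n", y[0], y[1], y[2]);   // prints: 3 3 4
        return 0;
    }

The inner loop here does 4 multiply-adds instead of the 9 a dense kernel would, and the saving grows with the sparsity level; the hard part, and the subject of much of the research above, is making this irregular memory access pattern fast on real hardware.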

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: