Aussie AI
Zero-Padding Removal
Last Updated 25 April, 2026
by David Spuler, Ph.D.
What is Zero Padding?
One technique for speeding up Transformer inference is to avoid zero padding in the input vectors (see also length pruning). Any additional zero padding in a vector represents unnecessary computation that merely adds zero to the result. Padding arises in some architectures because keeping all vectors the same size helps with pipelining calculations through the GPU. However, research has shown that it can also cause inefficiency by performing redundant computations whose results are never used, and various papers advocate removing the zero padding bytes.
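As a rough illustration of the idea, the redundant work can be skipped by gathering only the non-padded token rows before a dense layer, computing on that smaller packed matrix, and scattering the results back. This is a minimal NumPy sketch, not any paper's actual kernel; the function name `unpad_matmul` and the mask convention are assumptions for illustration:

```python
import numpy as np

def unpad_matmul(x, mask, weights):
    """Apply x @ weights only to real (non-padded) tokens.

    x:       [batch, seq, dim] activations, zeros at padded positions
    mask:    [batch, seq] boolean, True for real tokens
    weights: [dim, out_dim] layer weights
    """
    batch, seq, dim = x.shape
    flat = x.reshape(batch * seq, dim)
    idx = np.flatnonzero(mask.reshape(-1))   # indices of real tokens
    packed = flat[idx]                       # gather: drop the padding rows
    out_packed = packed @ weights            # matmul only on real tokens
    out = np.zeros((batch * seq, weights.shape[1]), dtype=x.dtype)
    out[idx] = out_packed                    # scatter back into padded layout
    return out.reshape(batch, seq, weights.shape[1])

# Usage: with heavily padded batches, the matmul shrinks proportionally.
x = np.random.rand(2, 4, 8).astype(np.float32)
mask = np.array([[True, True, False, False],
                 [True, True, True, False]])
x[~mask] = 0.0
w = np.random.rand(8, 3).astype(np.float32)
y = unpad_matmul(x, mask, w)
```

In a real GPU kernel the gather/scatter would itself be fused or amortized across layers, since moving data twice per layer could eat the savings; the sketch only shows the arithmetic that is avoided.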
An alternative approach is to use packing of input sequences to avoid or reduce padding bytes. This is effective for training data sets, or when batching multiple inference queries.
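The packing idea can be sketched as greedily concatenating variable-length token sequences into fixed-size buffers, joined by a separator token, so that padding only appears at the tail of each buffer rather than after every sequence. This is a simplified illustration with assumed token ids (`SEP = -1`, `PAD = 0`) and an assumed helper name `pack_sequences`, not any specific library's implementation:

```python
SEP = -1   # assumed separator token id
PAD = 0    # assumed padding token id

def pack_sequences(seqs, max_len):
    """Greedily pack token sequences into buffers of length max_len,
    joined by SEP; only the remaining tail space is padded with PAD."""
    buffers, current = [], []
    for seq in seqs:
        needed = len(seq) + (1 if current else 0)  # +1 for the separator
        if len(current) + needed > max_len:
            # Flush the current buffer, padding only its tail.
            buffers.append(current + [PAD] * (max_len - len(current)))
            current = list(seq)
        else:
            if current:
                current.append(SEP)
            current.extend(seq)
    if current:
        buffers.append(current + [PAD] * (max_len - len(current)))
    return buffers

# Usage: three sequences fit in two buffers of length 8.
packed = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=8)
```

A production packer would also adjust the attention mask and positional encodings so that packed sequences cannot attend to each other across the separator.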
And it's worth noting that not all padding bytes are evil. Some of them are quite charismatic if you take them out for a cup of tea. In fact, the need for padding removal in Transformers arose for good reason from the well-intentioned optimizations of professional programmers using very nice and hospitable padding zeros. The use of padding is a positive optimization in numerous situations, particularly when GPUs are involved. Read more about padding byte optimizations.
Zero Padding: Book Excerpts and Blog Articles
Free online book excerpts with full text chapters online and free PDF downloads, and the Aussie AI blog, including related articles:
- David Spuler, March 2024, Chapter 50. Adaptive Inference, in book "Generative AI in C++", https://www.aussieai.com/book/ch50-adaptive-inference
- David Spuler, March 2024, Generative AI in C++: Coding Transformers and LLMs, https://www.aussieai.com/book/toc PDF: https://www.aussieai.com/pdf/BOOK-Generative-AI-CPP-Spuler-2024.pdf
Research Papers on Zero Padding Removal
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (One of the optimizations suggested is to avoid computations involving zero padding bytes.)
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (Removing zero-padding inputs is one of the major optimizations in this paper.)
- J Du, J Jiang, J Zheng, H Zhang, D Huang, Y Lu, August 2023, Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs, ACM Transactions on Architecture and Code Optimization, https://dl.acm.org/doi/10.1145/3617689, PDF: https://dl.acm.org/doi/pdf/10.1145/3617689
- H Peng, S Huang, S Chen, B Li, T Geng, A Li, 2022, A length adaptive algorithm-hardware co-design of transformer on fpga through sparse attention and dynamic pipelining, DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference, July 2022, Pages 1135–1140, https://doi.org/10.1145/3489517.3530585, https://dl.acm.org/doi/10.1145/3489517.3530585 https://arxiv.org/pdf/2208.03646
- Taylor Simons and Dah-Jye Lee, 2019, A Review of Binarized Neural Networks, Electronics 2019, 8, 661; doi:10.3390/electronics8060661, MDPI, https://www.mdpi.com/2079-9292/8/6/661/review_report (Includes an interesting review of practical problems with zero padding in binarized networks, where the weights are only -1 and +1.)
- Zhai, Yujia, 2023, Ph.D. thesis, Architectural-Aware Performance Optimization: From the Foundational Math Library to Cutting-Edge Applications, Computer Science, University of California, Riverside, https://escholarship.org/content/qt8s28g07q/qt8s28g07q.pdf (Includes examination of padding-free algorithms such as ByteTransformer.)
- Gongzheng Li, Yadong Xi, Jingzhen Ding, Duan Wang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao, Easy and Efficient Transformer : Scalable Inference Solution For large NLP model, May 2022, https://arxiv.org/abs/2104.12470 (Optimizations include avoiding padding computations in the attention heads.)
- Ashraf Eassa, Bo Yang Hsueh, Brian Pharris, Zhihan Jiang and Ashwin Nanjappa, Sep 08, 2022, Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA, NVIDIA Technical Blog, https://developer.nvidia.com/blog/full-stack-innovation-fuels-highest-mlperf-inference-2-1-results-for-nvidia/ (The NVIDIA Bert submission used zero-padding removal and also various kernel fusions.)
- Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Used packing of sequences in training with a SEP separator token rather than CLS. Note: code uses deprecated nvFuser compiler.)
- S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16, https://arxiv.org/abs/1910.02054 Code: part of: https://github.com/microsoft/deepspeed (Zero Redundancy Optimizer (ZeRO) provides memory optimization, improved utilization, and fragmentation avoidance, allowing improved pipelining during training.)
- M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019, https://arxiv.org/abs/1909.08053
- David Spuler, March 2024, Chapter 50. Adaptive Inference, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Xin Tan, Jingzong Li, Jiamin Li, Yitao Yang, Hong Xu, August 2024, Arlo: Serving Transformer-based Language Models with Dynamic, Input Lengths, ICPP ’24, August 12–15, 2024, Gotland, Sweden, https://doi.org/10.1145/3673038.3673124 https://kanonjz.github.io/academic/share/xin-icpp24.pdf
- Blaise Delattre, Quentin Barthélemy, Alexandre Allauzen, 31 Jan 2024, Spectral Norm of Convolutional Layers with Circular and Zero Paddings, https://arxiv.org/abs/2402.00240
- D. Liu, X. Guo, N. Wang and Q. Wu, 2024, Lightweight Deep Neural Network Model With Padding-Free Downsampling, IEEE Signal Processing Letters, vol. 31, pp. 865-869, 2024, doi: 10.1109/LSP.2024.3374057, https://ieeexplore.ieee.org/abstract/document/10461068
- Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang, 13 May 2024, EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models, https://arxiv.org/abs/2405.07542 https://github.com/niyunsheng/EMS-SD
- Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville, 4 Oct 2024 (v2), Scattered Mixture-of-Experts Implementation, https://arxiv.org/abs/2403.08245
- Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli, 19 Dec 2024 (v2), Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, https://arxiv.org/abs/2412.13663 (Encoder-only BERT model updated with modern optimizations including Flash attention, bias removal, RoPE, pre-norm, GeGLU (a GELU variant), hybrid local-global attention, and zero padding removal.)
- S Ullah, SH Song, 2024, Design of compensation algorithms for zero padding and its application to a patch based deep neural network, https://peerj.com/articles/cs-2287.pdf
- Md Amirul Islam, Matthew Kowal, Sen Jia, Konstantinos G. Derpanis, Neil D. B. Bruce, 28 Jan 2021, Position, Padding and Predictions: A Deeper Look at Position Information in CNNs, https://arxiv.org/abs/2101.12322
- Rom Himelstein, Amit LeVi, Yonatan Belinkov, Avi Mendelson, 23 Sep 2025, Silent Tokens, Loud Effects: Padding in LLMs, https://arxiv.org/abs/2510.01238
- Bumjun Kim, Dongjae Jeon, Dueun Kim, Wonje Jeung, Albert No, 4 Oct 2025, Rainbow Padding: Mitigating Early Termination in Instruction-Tuned Diffusion LLMs, https://arxiv.org/abs/2510.03680
- Nan Yang, Laicheng Zhong, Fan Huang, Dong Yuan, Wei Bao, Feb 2023, Random Padding Data Augmentation, https://arxiv.org/abs/2302.08682
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar, 5 Nov 2024 (v2), Privacy Risks of Speculative Decoding in Large Language Models, https://arxiv.org/abs/2411.01076
- Conner Takehana, Aaryan Singhal, Nov 28, 2024, ThunderMittens For Your ThunderKittens, https://hazyresearch.stanford.edu/blog/2024-11-28-tk-mlx (Porting TK to Apple Metal and MLX on the M2 chips.)
- NVIDIA, Dec 2024, Multi-Head, Multi-Query, and Group-Query Attention, https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#kv-cache
- Mingcong Song, Xinru Tang, Fengfan Hou, Jing Li, Wei Wei, Yipeng Ma, Runqiu Xiao, Hongjie Si, Dingcheng Jiang, Shouyi Yin, Yang Hu, Guoping Long, 24 Dec 2024, Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels, https://arxiv.org/abs/2412.18106
- Yifu Ding, Wentao Jiang, Shunyu Liu, Yongcheng Jing, Jinyang Guo, Yingjie Wang, Jing Zhang, Zengmao Wang, Ziwei Liu, Bo Du, Xianglong Liu, Dacheng Tao, 27 Feb 2025 (v2), Dynamic Parallel Tree Search for Efficient LLM Reasoning, https://arxiv.org/abs/2502.16235
- Jiaqi Zhao, Miao Zhang, Weili Guan, Liqiang Nie, 21 May 2025, Boost Post-Training Quantization via Null Space Optimization for Large Language Models, https://arxiv.org/abs/2506.11044 https://github.com/zjq0455/q2n
- M Toker, I Galil, H Orgad, R Gal, Y Tewel, G Chechik, 2025, Padding tone: A mechanistic analysis of padding tokens in t2i models, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, (Volume 1: Long Papers), pages 7618–7632, April 29- May 4, 2025, https://aclanthology.org/anthology-files/pdf/naacl/2025.naacl-long.389.pdf (Padding effect on image generation.)
- Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao, 26 Mar 2026 (this version, v2), SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations, https://arxiv.org/abs/2512.14080 https://github.com/Dao-AILab/sonic-moe https://openreview.net/pdf?id=KzTJ1raEgB
- Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover, 15 Apr 2024, Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, https://arxiv.org/abs/2404.09529 Code: https://github.com/siyan-zhao/prepacking (Optimizes prefill KV cache computations by batching multiple query prefill phases together via packing, since prefill token sequence lengths are fully known, and further combined with simple modifications to positional encoding and masking to avoid cross-query attention.)
- David Spuler, March 2024, Chapter 50. Adaptive Inference, in book "Generative AI in C++", https://www.aussieai.com/book/ch50-adaptive-inference
- David Spuler, March 2024, Generative AI in C++: Coding Transformers and LLMs, https://www.aussieai.com/book/toc PDF: https://www.aussieai.com/pdf/BOOK-Generative-AI-CPP-Spuler-2024.pdf
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer Importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end):
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
AI Books from Aussie AI
The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
Get your copy from Amazon: The Sweetest Lesson
RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
Get your copy from Amazon: RAG Optimization
Generative AI Applications book:
Get your copy from Amazon: Generative AI Applications
Generative AI programming book:
Get your copy from Amazon: Generative AI in C++
CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization
CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging
More AI Research
Read more about: