Aussie AI

Zero-Padding Removal

  • Last Updated 11 June, 2025
  • by David Spuler, Ph.D.

One technique for speeding up Transformer inference is to avoid zero padding in the input vectors (see also length pruning). The need for padding arises in some architectures because keeping vectors the same size helps with pipelining calculations through the GPU. However, research has shown that padding can also cause inefficiency, since redundant computations are performed on values that are never used, and various papers have advocated removing the zero padding bytes.
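As a minimal sketch of the idea (not any particular paper's implementation), the "unpadding" step can be thought of as flattening a padded batch into one packed token list plus a list of cumulative sequence lengths, so that later kernels touch only real tokens. The pad id of 0 and the function name `unpad` are illustrative assumptions.

```python
# Hypothetical sketch of zero-padding removal ("unpadding").
# Assumption: pad tokens use id 0.
PAD_ID = 0

def unpad(batch):
    """Strip pad tokens from a padded batch.

    Returns the packed (concatenated) real tokens and the cumulative
    sequence lengths, which record where each sequence starts/ends
    inside the packed list.
    """
    packed = []
    cu_seqlens = [0]  # sequence boundaries within the packed list
    for seq in batch:
        real = [tok for tok in seq if tok != PAD_ID]
        packed.extend(real)
        cu_seqlens.append(len(packed))
    return packed, cu_seqlens

batch = [
    [5, 7, 9, 0, 0],   # length 3, padded to 5
    [4, 2, 0, 0, 0],   # length 2, padded to 5
    [8, 1, 3, 6, 2],   # full length 5, no padding
]
packed, cu_seqlens = unpad(batch)
# packed     -> [5, 7, 9, 4, 2, 8, 1, 3, 6, 2]
# cu_seqlens -> [0, 3, 5, 10]
```

Instead of a 3x5 batch (15 slots, 5 wasted on padding), downstream computation runs over 10 real tokens, with the boundary offsets available to rebuild per-sequence views.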

An alternative approach is to use packing of input sequences to avoid or reduce padding bytes. This is effective for training datasets, or when batching multiple inference queries.
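The packing idea can be sketched with a simple greedy first-fit scheme: concatenate several short sequences into one fixed-length row, so that far fewer pad slots remain than with one-sequence-per-row padding. The buffer length of 8 and the function name `pack_sequences` are assumptions for illustration.

```python
# Hypothetical sketch of sequence packing via greedy first-fit.
# Assumptions: pad id 0, fixed row length of 8 tokens.
MAX_LEN = 8
PAD_ID = 0

def pack_sequences(seqs, max_len=MAX_LEN, pad_id=PAD_ID):
    """Append each sequence to the first row with enough room;
    open a new row when none fits. Pad only the leftover space."""
    rows = []
    for seq in seqs:
        for row in rows:
            if len(row) + len(seq) <= max_len:
                row.extend(seq)
                break
        else:
            rows.append(list(seq))
    # Pad each row out to max_len only once, at the end.
    return [row + [pad_id] * (max_len - len(row)) for row in rows]

seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]]
rows = pack_sequences(seqs)
# rows -> [[1, 2, 3, 4, 5, 11, 0, 0],
#          [6, 7, 8, 9, 10, 0, 0, 0]]
```

Four separately padded rows of length 8 would waste 21 slots on padding; the two packed rows waste only 5. Note that a real implementation must also adjust the attention mask so that tokens from different packed sequences cannot attend to each other.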

And it's worth noting that not all padding bytes are evil. Some of them are quite charismatic if you take them out for a cup of tea. In fact, the need for padding removal in Transformers arose for good reason, from well-intentioned optimization by professional programmers using very nice and hospitable padding zeros. The use of padding is a positive optimization in numerous situations, particularly when GPUs are involved. Read more about padding byte optimizations.

Research Papers on Zero Padding Removal

More Research on Pruning Types

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++ programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: