Aussie AI

Attention Head Pruning Research

  • Last Updated 10 June, 2025
  • by David Spuler, Ph.D.

Attention head pruning, often abbreviated to "head pruning", is a form of structured pruning that removes whole attention heads from a Transformer. It is a type of "width pruning" that makes the network "thinner". Multi-head attention was one of the main advances in the seminal 2017 Transformer paper, but research has shown that the attention mechanism is computationally expensive. There are various ways to optimize its efficiency, including removing attention heads that turn out to be redundant.
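To make the idea concrete, here is a minimal pure-Python sketch of head pruning on a toy multi-head attention layer. It is an illustration under stated assumptions, not a method taken from any particular paper: the helper names (`make_head`, `head_output`, `prune_heads`, `count_params`) and the tiny hand-picked weights are all hypothetical, and the redundant head is simulated by giving it an all-zero output projection. The sketch relies on the fact that concatenating the head outputs and multiplying by the output matrix is the same as summing each head's output times its own slice of that matrix, so removing a head simply removes one term from the sum.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def make_head(scale, out_scale):
    """Hypothetical toy head: d_model = 2, d_head = 1 (weights chosen by hand)."""
    return {
        "wq": [[scale], [0.5 * scale]],
        "wk": [[0.5 * scale], [scale]],
        "wv": [[scale], [-scale]],
        "wo": [[out_scale, -out_scale]],  # this head's slice of the output projection
    }

def head_output(x, head):
    """softmax(Q K^T / sqrt(d_head)) V, projected through the head's W_O slice."""
    q, k, v = matmul(x, head["wq"]), matmul(x, head["wk"]), matmul(x, head["wv"])
    d_head = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(matmul(weights, v), head["wo"])

def multi_head_attention(x, heads):
    """Sum of per-head contributions (equivalent to concat followed by W_O)."""
    outs = [head_output(x, h) for h in heads]
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*outs)]

def prune_heads(heads, drop):
    """Structured pruning: remove whole heads by index."""
    return [h for i, h in enumerate(heads) if i not in drop]

def count_params(heads):
    return sum(len(m) * len(m[0]) for h in heads for m in h.values())

# Three tokens, three heads; head 1 is made redundant on purpose (zero W_O slice).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
heads = [make_head(1.0, 1.0), make_head(0.5, 0.0), make_head(-1.0, 0.5)]

full = multi_head_attention(x, heads)
pruned = multi_head_attention(x, prune_heads(heads, {1}))
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(full, pruned)
    for a, b in zip(row_a, row_b)
)
print("output unchanged:", same)
print("parameters:", count_params(heads), "->", count_params(prune_heads(heads, {1})))
```

In a real model no head is exactly redundant, so pruning trades a small accuracy loss for the saved computation. How to score the heads and choose which ones to drop is the subject of the research this page covers, and is deliberately left out of the sketch.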

In addition to head pruning techniques that remove redundant or under-utilized attention heads, there is also research into using simpler attention heads (see approximate attention heads) and reducing the cost of attention on long sequences (see non-autoregression architectures). There is also more general research into optimized Transformer architectures.

Head pruning can be combined with other optimization techniques such as quantization. It is also orthogonal to "depth pruning" techniques such as "layer pruning" and "early exit", so combined depth/width pruning is possible.
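The orthogonality of the two pruning directions can be sketched at the bookkeeping level. The following is a hypothetical illustration only: the model is represented as a list of layers, each holding a list of head labels, and layer pruning is assumed to mean simply truncating the top layers, which is one simple variant among several. The function names and the chosen indices are made up for the example.

```python
def prune_width(layers, heads_to_drop):
    """Head pruning: drop the listed head indices in each layer; depth is unchanged."""
    return [
        [h for i, h in enumerate(heads) if i not in heads_to_drop.get(idx, set())]
        for idx, heads in enumerate(layers)
    ]

def prune_depth(layers, keep_layers):
    """Layer pruning (assumed here to be truncation): keep only the first keep_layers layers."""
    return layers[:keep_layers]

def total_heads(layers):
    return sum(len(heads) for heads in layers)

# Hypothetical model: 6 layers with 4 heads each.
model = [[f"layer{l}.head{h}" for h in range(4)] for l in range(6)]
drops = {0: {1, 2}, 3: {0}}  # layer index -> head indices to remove

width_then_depth = prune_depth(prune_width(model, drops), 4)
depth_then_width = prune_width(prune_depth(model, 4), drops)

print(total_heads(model), "heads before,", total_heads(width_then_depth), "after")
print("order does not matter:", width_then_depth == depth_then_width)
```

Because one operation acts on the list of layers and the other acts inside each layer, they can be applied in either order in this sketch. In practice the accuracy impact of the two together still has to be measured jointly.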

Attention Head Pruning Research Papers

Research papers on head pruning:
