
Mixture-of-Attention

  • Last Updated 16 April, 2026
  • by David Spuler, Ph.D.

What is Mixture-of-Attention?

Mixture-of-Attention (MoA) is the application of the Mixture-of-Experts (MoE) optimization to the attention modules in LLM layers. Traditionally, MoE has been used to optimize the Feed-Forward Network (FFN) modules in LLMs, and this has become a mainstream optimization in frontier model architectures. Although attention is typically less compute-bound than the FFNs, the same ideas can be applied to make the QKV attention module more efficient.
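
To make the idea concrete, here is a minimal sketch in PyTorch-style Python. It is illustrative only, not any particular paper's method: all class, function, and parameter names here are assumptions. A learned router scores a set of small attention "experts" and activates only the top-k of them per input, so just a fraction of the attention parameters run on each forward pass:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MixtureOfAttention(nn.Module):
        # Hypothetical MoA layer: a router picks the top-k attention experts,
        # and their outputs are combined using the renormalized gate weights.
        def __init__(self, d_model, n_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts)   # gating network
            self.experts = nn.ModuleList(
                nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
                for _ in range(n_experts)
            )

        def forward(self, x):                             # x: [batch, seq, d_model]
            # Route per sequence for simplicity; papers typically route per token.
            scores = self.router(x.mean(dim=1))           # [batch, n_experts]
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)          # renormalize over chosen experts
            out = torch.zeros_like(x)
            for b in range(x.size(0)):                    # slow loops: clarity over speed
                for slot in range(self.top_k):
                    expert = self.experts[int(idx[b, slot])]
                    y, _ = expert(x[b:b+1], x[b:b+1], x[b:b+1])   # self-attention
                    out[b:b+1] += weights[b, slot] * y
            return out

    x = torch.randn(2, 16, 64)                            # [batch, seq, d_model]
    y = MixtureOfAttention(d_model=64)(x)                 # output shape: [2, 16, 64]

Real implementations route per token, batch the expert computation, and usually add a load-balancing loss so the router does not collapse onto a few experts; the loops above are only for readability.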

This technique is not yet mainstream, and appears mostly in research papers. Some of the earliest works date back to 2019 and 2020, and various papers have used different names:

  • Mixture-of-Attention (MoA)
  • Mixture of Attention Extenders (MAE)
  • Mixture of Heads (MoH)

There is some overlap with other attention optimization research areas, such as:

  • Sparse attention
  • Attention head pruning
  • Attention head fusion

However, like MoE for FFNs, MoA for attention heads has dual aims (see the arithmetic sketch after this list):

  • Attention module optimization — faster.
  • More parameters possible in attention — smarter.
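
As rough back-of-envelope arithmetic (all numbers here are assumed for illustration), the total attention capacity grows with the number of experts, while per-token compute grows only with the top-k experts that are activated:

    d_model, head_dim = 4096, 128
    n_experts, top_k = 16, 2               # 16 head-experts, 2 active per token
    per_expert = 4 * d_model * head_dim    # Q, K, V, and output projections per head
    total  = n_experts * per_expert        # parameters available ("smarter")
    active = top_k * per_expert            # parameters run per token ("faster")
    print(total, active, active / total)   # 33554432 4194304 0.125

So, in this example, the layer holds 8x more attention weights than it executes for any one token.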

This area has a lot of potential and appears to be attracting more research, but it is not yet as prolific as other types of attention optimization, such as sparse attention or KV cache compression.

Mixture-of-Attention: Book Excerpts and Blog Articles

Free online book excerpts, with full-text chapters and free PDF downloads, plus related articles from the Aussie AI blog:

Survey Papers on Mixture-of-Attention

Research on Mixture-of-Attention

Research papers include:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging



C++ AVX Optimization: CPU SIMD Vectorization:
  • Introduction to AVX SIMD intrinsics
  • Vectorization and horizontal reductions
  • Low latency tricks and branchless programming
  • Instruction-level parallelism and out-of-order execution
  • Loop unrolling & double loop unrolling

Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization



C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
  • Low-level C++ efficiency techniques
  • C++ multithreading optimizations
  • AI LLM inference backend speedups
  • Low latency data structures
  • General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency
