Aussie AI
Mixture-of-Attention
-
Last Updated 16 April, 2026
-
by David Spuler, Ph.D.
What is Mixture-of-Attention?
Mixture-of-Attention (MoA) is the application of the Mixture-of-Experts (MoE) optimization to the attention modules in LLM layers. Traditionally, MoE has been used to optimize the Feed-Forward Network (FFN) in LLMs, and it has become a mainstream optimization in frontier model architectures. Although attention is less compute-bound than the FFN, the same ideas can be applied to make the QKV attention module more efficient.
This technique is not yet mainstream, and appears mostly in research papers. Some of the earliest works date back to 2019 and 2020, and various papers have used different names:
- Mixture-of-Attention (MoA)
- Mixture of Attention Extenders (MAE)
- Mixture of Heads (MoH)
There is some overlap with other attention optimization research areas, such as:
- Sparse attention
- Attention head pruning
- Attention head fusion
However, like MoE for FFNs, MoA for attention heads has dual aims:
- Attention module optimization — faster.
- More parameters possible in attention — smarter.
This area has considerable potential and is attracting more research, but it is not as prolific as some other types of attention optimization, such as sparse attention or KV cache compression.
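To make the idea concrete, here is a minimal PyTorch sketch of per-token attention-head routing, loosely in the spirit of the Mixture-of-Head (MoH) papers listed below. The class name, dimensions, and top-k gating scheme are illustrative assumptions rather than any specific paper's method, and for simplicity the unselected heads are merely zeroed out after the fact; a real implementation would skip computing them in order to actually save work.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfAttentionHeads(nn.Module):
    """Toy mixture-of-attention-heads layer: a router picks top-k heads per token.
    Hypothetical sketch for illustration only; requires PyTorch 2.x."""
    def __init__(self, d_model=512, n_heads=8, top_k=2):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.top_k = n_heads, top_k
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # shared Q/K/V projections
        self.out = nn.Linear(d_model, d_model)       # output projection
        self.router = nn.Linear(d_model, n_heads)    # per-token score for each head

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each of Q, K, V to (B, n_heads, T, d_head).
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))

        # Ordinary scaled dot-product attention, computed per head.
        heads = F.scaled_dot_product_attention(q, k, v)  # (B, n_heads, T, d_head)
        heads = heads.transpose(1, 2)                    # (B, T, n_heads, d_head)

        # Router: softmax gate over the top-k heads for each token; other heads get 0.
        scores = self.router(x)                          # (B, T, n_heads)
        topk_vals, topk_idx = scores.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(scores).scatter(
            -1, topk_idx, F.softmax(topk_vals, dim=-1))
        heads = heads * gates.unsqueeze(-1)              # zero out unselected heads

        return self.out(heads.reshape(B, T, D))

# Usage: each token is routed through only 2 of the 8 attention heads.
moa = MixtureOfAttentionHeads()
y = moa(torch.randn(4, 16, 512))   # (batch, sequence, d_model)
print(y.shape)                     # torch.Size([4, 16, 512])

As with MoE for FFNs, the router gives the "faster" benefit (each token pays for only top_k heads once the skipped heads are genuinely not computed) and the "smarter" benefit (the total number of heads, and hence parameters, can grow without increasing per-token compute).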
Mixture-of-Attention: Book Excerpts and Blog Articles
Free online book excerpts with full-text chapters and free PDF downloads, plus related articles from the Aussie AI blog:
- David Spuler, Ph.D., LLM Attention and FFN Optimization are Opposites, March 22nd, 2026, Aussie AI Blog, https://www.aussieai.com/blog/attention-ffn-llm-optimize
Survey Papers on Mixture-of-Attention
Research on Mixture-of-Attention
Research papers include:
- Asankhaya Sharma, 26 Jul 2024, Patched MOA: optimizing inference for diverse software development tasks, https://arxiv.org/abs/2407.18521
- Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville, 4 Oct 2024 (v2), Scattered Mixture-of-Experts Implementation, https://arxiv.org/abs/2403.08245
- Dr. Ashish Bamania, Oct 27, 2024, Amazing Things Happen When Attention Heads Are Supercharged Using Mixture-Of-Experts: A deep dive into how the Attention mechanism works and how it is being enhanced by the Mixture-of-Experts architecture, resulting in Mixture-of-Head Attention (MoH) that makes our existing LLMs more efficient than ever. https://levelup.gitconnected.com/amazing-things-happen-when-attention-heads-are-supercharged-using-mixture-of-experts-b55a6b9a0ac8
- Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 23 Jan 2017, Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, https://arxiv.org/abs/1701.06538
- Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, Zhang Xiong, 11 Oct 2022, Mixture of Attention Heads: Selecting Attention Heads Per Token, https://arxiv.org/abs/2210.05144 https://aclanthology.org/2022.emnlp-main.278/
- Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan, 15 Oct 2024, MoH: Multi-Head Attention as Mixture-of-Head Attention, https://arxiv.org/abs/2410.11842 https://github.com/SkyworkAI/MoH
- Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber, 30 Sep 2024 (v3), SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention, https://arxiv.org/abs/2312.07987
- Hao Peng, Roy Schwartz, Dianqi Li, Noah A. Smith, 13 May 2020, A Mixture of h−1 Heads is Better than h Heads, https://arxiv.org/abs/2005.06537
- Xun Wu, Shaohan Huang, Wenhui Wang, Furu Wei, 23 Apr 2024, Multi-Head Mixture-of-Experts, https://arxiv.org/abs/2404.15045
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- Yuanhang Yang, Chaozheng Wang, Jing Li, 12 May 2025, UMoE: Unifying Attention and FFN with Shared Experts, https://arxiv.org/abs/2505.07260
- Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng, 13 Aug 2025, Speed Always Wins: A Survey on Efficient Architectures for Large Language Models, https://arxiv.org/abs/2508.09834
- Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu, Weigao Sun, and Yu Cheng, 24 November, 2024, LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training, https://arxiv.org/abs/2411.15708
- G. Blecher and S. Fine, 2023, MoEAtt: A Deep Mixture of Experts Model using Attention-based Routing Gate, 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 2023, pp. 1018-1024, doi: 10.1109/ICMLA58977.2023.00151, https://ieeexplore.ieee.org/abstract/document/10459810
- Oktawiusz Jerzy Majewski, Oct 28, 2025, Sparse Adaptive Attention “MoE”: How I Solved OpenAI’s $650B Problem With a £700 GPU, https://medium.com/@hyborian_/sparse-adaptive-attention-moe-how-i-solved-openais-650b-problem-with-a-700-gpu-343f47b2d6c1
- Siyuan Mu, Sen Lin, 24 Jan 2026 (v4), A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications, https://arxiv.org/abs/2503.07137
- Kuan-Chieh Wang, Daniil Ostashev, Yuwei Fang, Sergey Tulyakov, Kfir Aberman, 6 May 2024, MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation, https://arxiv.org/abs/2404.11565 https://snap-research.github.io/mixture-of-attention
- Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov, 2019, Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned, Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 5797–5808., Association for Computational Linguistics, https://aclanthology.org/P19-1580/ (Related to "attention head pruning" but proposes "specialized attention heads" which is related to MoA.)
- Xuechen Li, Yupeng Li, Jian Liu, Xiaolin Jin, Xin Hu, 22 Sep 2025 (v3), Multi-branch of Attention Yields Accurate Results for Tabular Data, https://arxiv.org/abs/2502.12507
- Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang, 3 Apr 2025 (v2), Mixture of Attentions For Speculative Decoding, https://arxiv.org/abs/2410.03804
- Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang, 24 Nov 2025 (v3), Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths, https://arxiv.org/abs/2406.14909 https://github.com/thu-nics/MoA
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
- C++ AVX Optimization: CPU SIMD Vectorization. Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization
- C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations. Get your copy from Amazon: C++ Ultra-Low Latency
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home