Aussie AI
Mixture-of-Attention
-
Last Updated 16 April, 2026
-
by David Spuler, Ph.D.
What is Mixture-of-Attention?
Mixture-of-Attention (MoA) is the application of the Mixture-of-Experts (MoE) optimization to the attention modules in LLM layers. Traditionally, MoE has been used to optimize the Feed-Forward Network (FFN) in LLMs, and it has become a mainstream optimization in frontier model architectures. Although attention is less compute-bound than the FFN, the same ideas can be applied to make the QKV attention module more efficient.
This technique is not yet mainstream, and appears mostly in research papers. Some of the earliest works date back to 2019 and 2020, and various papers have used different names:
- Mixture-of-Attention (MoA)
- Mixture of Attention Extenders (MAE)
- Mixture of Heads (MoH)
There is some overlap with other attention optimization research areas, such as:
- Sparse attention
- Attention head pruning
- Attention head fusion
However, like MoE for FFNs, MoA for attention heads has dual aims:
- Attention module optimization — faster.
- More parameters possible in attention — smarter.
This area has considerable potential and is attracting more research, but it is not as prolific as some other types of attention optimization, such as sparse attention or KV cache compression.
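To make the idea concrete, here is a minimal PyTorch sketch of per-token attention-head routing, loosely in the spirit of the Mixture-of-Head (MoH) papers listed below. The class name, dimensions, and top-k gating scheme are illustrative assumptions rather than any specific paper's method, and for simplicity the unselected heads are merely zeroed out after the fact; a real implementation would skip computing them in order to actually save work.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfAttentionHeads(nn.Module):
    """Toy mixture-of-attention-heads layer: a router picks top-k heads per token.
    Hypothetical sketch for illustration only; requires PyTorch 2.x."""
    def __init__(self, d_model=512, n_heads=8, top_k=2):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.top_k = n_heads, top_k
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # shared Q/K/V projections
        self.out = nn.Linear(d_model, d_model)       # output projection
        self.router = nn.Linear(d_model, n_heads)    # per-token score for each head

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each of Q, K, V to (B, n_heads, T, d_head).
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))

        # Ordinary scaled dot-product attention, computed per head.
        heads = F.scaled_dot_product_attention(q, k, v)  # (B, n_heads, T, d_head)
        heads = heads.transpose(1, 2)                    # (B, T, n_heads, d_head)

        # Router: softmax gate over the top-k heads for each token; other heads get 0.
        scores = self.router(x)                          # (B, T, n_heads)
        topk_vals, topk_idx = scores.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(scores).scatter(
            -1, topk_idx, F.softmax(topk_vals, dim=-1))
        heads = heads * gates.unsqueeze(-1)              # zero out unselected heads

        return self.out(heads.reshape(B, T, D))

# Usage: each token is routed through only 2 of the 8 attention heads.
moa = MixtureOfAttentionHeads()
y = moa(torch.randn(4, 16, 512))   # (batch, sequence, d_model)
print(y.shape)                     # torch.Size([4, 16, 512])

As with MoE for FFNs, the router gives the "faster" benefit (each token pays for only top_k heads once the skipped heads are genuinely not computed) and the "smarter" benefit (the total number of heads, and hence parameters, can grow without increasing per-token compute).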
Mixture-of-Attention: Book Excerpts and Blog Articles
Free online book excerpts with full-text chapters and free PDF downloads, plus related articles from the Aussie AI blog:
- David Spuler, Ph.D., LLM Attention and FFN Optimization are Opposites, March 22nd, 2026, Aussie AI Blog, https://www.aussieai.com/blog/attention-ffn-llm-optimize
Survey Papers on Mixture-of-Attention
Research on Mixture-of-Attention
Research papers include:
- Asankhaya Sharma, 26 Jul 2024, Patched MOA: optimizing inference for diverse software development tasks, https://arxiv.org/abs/2407.18521
- Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville, 4 Oct 2024 (v2), Scattered Mixture-of-Experts Implementation, https://arxiv.org/abs/2403.08245
- Dr. Ashish Bamania, Oct 27, 2024, Amazing Things Happen When Attention Heads Are Supercharged Using Mixture-Of-Experts: A deep dive into how the Attention mechanism works and how it is being enhanced by the Mixture-of-Experts architecture, resulting in Mixture-of-Head Attention (MoH) that makes our existing LLMs more efficient than ever. https://levelup.gitconnected.com/amazing-things-happen-when-attention-heads-are-supercharged-using-mixture-of-experts-b55a6b9a0ac8
- Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 23 Jan 2017, Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, https://arxiv.org/abs/1701.06538
- Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, Zhang Xiong, 11 Oct 2022, Mixture of Attention Heads: Selecting Attention Heads Per Token, https://arxiv.org/abs/2210.05144 https://aclanthology.org/2022.emnlp-main.278/
- Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan, 15 Oct 2024, MoH: Multi-Head Attention as Mixture-of-Head Attention, https://arxiv.org/abs/2410.11842 https://github.com/SkyworkAI/MoH
- Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber, 30 Sep 2024 (v3), SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention, https://arxiv.org/abs/2312.07987
- Hao Peng, Roy Schwartz, Dianqi Li, Noah A. Smith, 13 May 2020, A Mixture of h−1 Heads is Better than h Heads, https://arxiv.org/abs/2005.06537
- Xun Wu, Shaohan Huang, Wenhui Wang, Furu Wei, 23 Apr 2024, Multi-Head Mixture-of-Experts, https://arxiv.org/abs/2404.15045
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- Yuanhang Yang, Chaozheng Wang, Jing Li, 12 May 2025, UMoE: Unifying Attention and FFN with Shared Experts, https://arxiv.org/abs/2505.07260
- Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng, 13 Aug 2025, Speed Always Wins: A Survey on Efficient Architectures for Large Language Models, https://arxiv.org/abs/2508.09834
- Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu, Weigao Sun, and Yu Cheng, 24 November, 2024, LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training, https://arxiv.org/abs/2411.15708
- G. Blecher and S. Fine, 2023, MoEAtt: A Deep Mixture of Experts Model using Attention-based Routing Gate, 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 2023, pp. 1018-1024, doi: 10.1109/ICMLA58977.2023.00151, https://ieeexplore.ieee.org/abstract/document/10459810
- Oktawiusz Jerzy Majewski, Oct 28, 2025, Sparse Adaptive Attention “MoE”: How I Solved OpenAI’s $650B Problem With a £700 GPU, https://medium.com/@hyborian_/sparse-adaptive-attention-moe-how-i-solved-openais-650b-problem-with-a-700-gpu-343f47b2d6c1
- Siyuan Mu, Sen Lin, 24 Jan 2026 (v4), A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications, https://arxiv.org/abs/2503.07137
- Kuan-Chieh Wang, Daniil Ostashev, Yuwei Fang, Sergey Tulyakov, Kfir Aberman, 6 May 2024, MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation, https://arxiv.org/abs/2404.11565 https://snap-research.github.io/mixture-of-attention
- Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov, 2019, Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned, Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 5797–5808., Association for Computational Linguistics, https://aclanthology.org/P19-1580/ (Related to "attention head pruning" but proposes "specialized attention heads" which is related to MoA.)
- Xuechen Li, Yupeng Li, Jian Liu, Xiaolin Jin, Xin Hu, 22 Sep 2025 (v3), Multi-branch of Attention Yields Accurate Results for Tabular Data, https://arxiv.org/abs/2502.12507
- Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang, 3 Apr 2025 (v2), Mixture of Attentions For Speculative Decoding, https://arxiv.org/abs/2410.03804
- Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang, 24 Nov 2025 (v3), Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths, https://arxiv.org/abs/2406.14909 https://github.com/thu-nics/MoA
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
- C++ AVX Optimization: CPU SIMD Vectorization. Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization
- C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations. Get your copy from Amazon: C++ Ultra-Low Latency
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home