Aussie AI
Multi-head Latent Attention (MLA)
Last Updated 29 August, 2025
by David Spuler, Ph.D.
What is Multi-head Latent Attention (MLA)?
Multi-head Latent Attention (MLA) is an LLM attention optimization developed by DeepSeek. It became widely known with the release of the DeepSeek R1 reasoning model in early 2025, but was actually introduced earlier for DeepSeek's V2 and V3 non-reasoning models in mid-to-late 2024.
MLA improves upon the well-known LLM attention mechanisms: Multi-Head Attention (MHA) from the original Transformer paper, and its follow-on optimizations Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). The core idea is to compress the keys and values into a much smaller latent vector, shrinking the KV cache. DeepSeek has subsequently open-sourced FlashMLA, an efficient MLA decoding kernel that combines MLA with Flash Attention.
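To make the KV cache saving concrete, below is a minimal sketch of the latent-compression idea in plain Python/NumPy. The dimensions and weight names (W_dkv, W_uk, W_uv, W_q) are illustrative assumptions, and the sketch omits DeepSeek's decoupled RoPE keys and the matrix-absorption trick described in the DeepSeek-V2 paper; it is not the production implementation.

```python
# Minimal sketch of MLA-style latent KV compression (illustrative only).
# Assumed shapes and weight names; omits RoPE decoupling and weight absorption.
import numpy as np

d_model  = 512   # hidden size
n_heads  = 8     # attention heads
d_head   = 64    # per-head dimension
d_latent = 64    # compressed KV latent dimension (much smaller than n_heads * d_head)

rng   = np.random.default_rng(0)
W_dkv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)            # KV down-projection
W_uk  = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)  # K up-projection
W_uv  = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)  # V up-projection
W_q   = rng.standard_normal((d_model, n_heads * d_head)) / np.sqrt(d_model)    # query projection

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mla_decode_step(h_t, latent_cache):
    """One autoregressive decode step: cache only a small latent vector per token."""
    c_t = h_t @ W_dkv                       # (d_latent,) compressed KV representation
    latent_cache.append(c_t)
    C = np.stack(latent_cache)              # (seq_len, d_latent)

    # Reconstruct per-head keys and values from the latent cache on the fly.
    K = (C @ W_uk).reshape(len(latent_cache), n_heads, d_head)
    V = (C @ W_uv).reshape(len(latent_cache), n_heads, d_head)
    q = (h_t @ W_q).reshape(n_heads, d_head)

    scores = np.einsum('hd,shd->hs', q, K) / np.sqrt(d_head)   # (n_heads, seq_len)
    attn = softmax(scores, axis=-1)
    out = np.einsum('hs,shd->hd', attn, V)                     # (n_heads, d_head)
    return out.reshape(-1)

latent_cache = []
for _ in range(4):                          # decode a few toy tokens
    h_t = rng.standard_normal(d_model)
    y = mla_decode_step(h_t, latent_cache)

print("output size:", y.shape[0])
print("cached floats per token:", d_latent, "(MLA) vs", 2 * n_heads * d_head, "(standard MHA)")
```

In this toy configuration the cache holds 64 floats per token instead of 1,024 for standard MHA with full per-head keys and values, which is the order of KV cache reduction that motivates MLA; GQA and MQA sit in between, since they reduce the number of cached KV heads rather than projecting into a shared latent space.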
Research on MLA
Research papers on MLA include:
- The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
- DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
- Tim Urista, Dec 2024, Dramatically Reduce Inference Costs with DeepSeek-V3: A New Era in Open-Source LLMs, https://ai.gopubby.com/dramatically-reduce-inference-costs-with-deepseek-v3-a-new-era-in-open-source-llms-4f1adf760ee1
- Minhajul Hoque, Jan 4, 2025, DeepSeek V3: How They Achieved Big Results with Small Compute, https://ai.plainenglish.io/deepseek-v3-how-they-achieved-big-results-with-small-compute-fb694606d59a (DeepSeek optimizations included FP8 quantization with outlier handling, attention and KV cache optimization via Multi-Head Latent Attention (MHLA), and multi-token decoding.)
- Dr. Ashish Bamania, Feb 2025, Multi-Head Latent Attention Is The Powerful Engine Behind DeepSeek: A deep dive Into DeepSeek’s innovative Attention mechanism that makes its LLMs so good https://levelup.gitconnected.com/multi-head-latent-attention-is-the-powerful-engine-behind-deepseek-0ecfd29e0b04 (MLA versus GQA/MQA attention and how MLA achieves KV cache compression.)
- Fanxu Meng, Zengwei Yao, Muhan Zhang, 13 Feb 2025 (v2), TransMLA: Multi-Head Latent Attention Is All You Need, https://arxiv.org/abs/2502.07864
- Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui, 20 Feb 2025, Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs, https://arxiv.org/abs/2502.14837
- DeepSeek, Feb 2025 (accessed), FlashMLA: FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences serving, https://github.com/deepseek-ai/FlashMLA
- Nickie Louise, February 24, 2025, DeepSeek launches FlashMLA: A breakthrough in AI speed and efficiency for NVIDIA GPUs, https://techstartups.com/2025/02/24/deepseek-launches-flashmla-a-breakthrough-in-ai-speed-and-efficiency-for-nvidia-gpus/
- Ashley Goolam, March 4, 2025, DeepSeek Open Source Week: A Complete Summary, https://apidog.com/blog/deepseek-open-source-week/
- Benjamin Spector, Aaryan Singhal, Dan Fu, Chris Ré, March 4, 2025, ThunderMLA: FlashMLA, Faster and Fused-er! https://hazyresearch.stanford.edu/blog/2025-03-04-thundermla https://github.com/HazyResearch/ThunderKittens/blob/mla/kernels/attn/demo/mla_decode/template_mla_decode.cu (Using a single CUDA "megakernel" to perform all jobs and passing it meta-instructions, thereby avoiding launching and shutting down kernels.)
- Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum, 14 Mar 2025, X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression, https://arxiv.org/abs/2503.11132
- Chengen Wang, Murat Kantarcioglu, 14 Mar 2025, A Review of DeepSeek Models' Key Innovative Techniques, https://arxiv.org/abs/2503.11486
- L. Xiong et al., May 2025, DeepSeek: Paradigm Shifts and Technical Evolution in Large AI Models, IEEE/CAA Journal of Automatica Sinica, vol. 12, no. 5, pp. 841-858, May 2025, doi: 10.1109/JAS.2025.125495, https://ieeexplore.ieee.org/abstract/document/11005752
- Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y.X. Wei, 14 May 2025, Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures, https://arxiv.org/abs/2505.09343
- Ted Zadouri, Hubert Strauss, Tri Dao, 27 May 2025, Hardware-Efficient Attention for Fast Decoding, https://arxiv.org/abs/2505.21487
- Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, Emad Barsoum, 22 May 2025, Zebra-Llama: Towards Extremely Efficient Hybrid Models, https://arxiv.org/abs/2505.17272 (Merging SSM and LLM.)
- Sebastian Raschka, Jul 19, 2025, The Big LLM Architecture Comparison: From DeepSeek-V3 to Kimi K2: A Look At Modern LLM Architecture Design, https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison
- Stephen Diehl, 2025, Attention Wasn't All We Needed, https://www.stephendiehl.com/posts/post_transformers/
- Keqi Deng, Philip C. Woodland, 21 May 2025 (v2), Multi-head Temporal Latent Attention, https://arxiv.org/abs/2505.13544
- Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng, 13 Aug 2025, Speed Always Wins: A Survey on Efficient Architectures for Large Language Models, https://arxiv.org/abs/2508.09834
- Luoyang Sun, Cheng Deng, Jiwen Jiang, Xinjian Wu, Haifeng Zhang, Lei Chen, Lionel Ni, Jun Wang, 23 Jul 2025, GTA: Grouped-head latenT Attention, https://arxiv.org/abs/2506.17286
- Sungmin Yun and Seonyong Park and Hwayong Nam and Younjoo Lee and Gunjun Lee and Kwanhee Kyung and Sangpyo Kim and Nam Sung Kim and Jongmin Kim and Hyungyo Kim and Juhwan Cho and Seungmin Baek and Jung Ho Ahn, 23 Jul 2025, The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts, https://arxiv.org/abs/2507.15465
- Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat, 2 Aug 2025, Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models, https://arxiv.org/abs/2508.01261
- Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, Muhan Zhang, 21 Aug 2025, TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference, https://arxiv.org/abs/2508.15881
- Hei Shing Cheung and Boya Zhang, 26 Jul 2025, Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion, https://arxiv.org/abs/2507.19991
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
More AI Research
Read more about: