Aussie AI
Multi-head Latent Attention (MLA)
Last Updated 29 August, 2025
by David Spuler, Ph.D.
What is Multi-head Latent Attention (MLA)?
Multi-head Latent Attention (MLA) is an LLM attention optimization developed by DeepSeek. It became widely known with the release of the DeepSeek R1 reasoning model in early 2025, but was actually introduced earlier for DeepSeek's V2 and V3 non-reasoning models in mid-to-late 2024.
MLA improves upon the well-known LLM attention mechanisms: Multi-Head Attention (MHA) from the original Transformer paper, and its follow-on optimizations Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). The core idea is to compress the keys and values into a much smaller latent vector, shrinking the KV cache. DeepSeek has subsequently open-sourced FlashMLA, an efficient MLA decoding kernel that combines MLA with Flash Attention.
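To make the KV cache saving concrete, below is a minimal sketch of the latent-compression idea in plain Python/NumPy. The dimensions and weight names (W_dkv, W_uk, W_uv, W_q) are illustrative assumptions, and the sketch omits DeepSeek's decoupled RoPE keys and the matrix-absorption trick described in the DeepSeek-V2 paper; it is not the production implementation.

```python
# Minimal sketch of MLA-style latent KV compression (illustrative only).
# Assumed shapes and weight names; omits RoPE decoupling and weight absorption.
import numpy as np

d_model  = 512   # hidden size
n_heads  = 8     # attention heads
d_head   = 64    # per-head dimension
d_latent = 64    # compressed KV latent dimension (much smaller than n_heads * d_head)

rng   = np.random.default_rng(0)
W_dkv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)            # KV down-projection
W_uk  = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)  # K up-projection
W_uv  = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)  # V up-projection
W_q   = rng.standard_normal((d_model, n_heads * d_head)) / np.sqrt(d_model)    # query projection

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mla_decode_step(h_t, latent_cache):
    """One autoregressive decode step: cache only a small latent vector per token."""
    c_t = h_t @ W_dkv                       # (d_latent,) compressed KV representation
    latent_cache.append(c_t)
    C = np.stack(latent_cache)              # (seq_len, d_latent)

    # Reconstruct per-head keys and values from the latent cache on the fly.
    K = (C @ W_uk).reshape(len(latent_cache), n_heads, d_head)
    V = (C @ W_uv).reshape(len(latent_cache), n_heads, d_head)
    q = (h_t @ W_q).reshape(n_heads, d_head)

    scores = np.einsum('hd,shd->hs', q, K) / np.sqrt(d_head)   # (n_heads, seq_len)
    attn = softmax(scores, axis=-1)
    out = np.einsum('hs,shd->hd', attn, V)                     # (n_heads, d_head)
    return out.reshape(-1)

latent_cache = []
for _ in range(4):                          # decode a few toy tokens
    h_t = rng.standard_normal(d_model)
    y = mla_decode_step(h_t, latent_cache)

print("output size:", y.shape[0])
print("cached floats per token:", d_latent, "(MLA) vs", 2 * n_heads * d_head, "(standard MHA)")
```

In this toy configuration the cache holds 64 floats per token instead of 1,024 for standard MHA with full per-head keys and values, which is the order of KV cache reduction that motivates MLA; GQA and MQA sit in between, since they reduce the number of cached KV heads rather than projecting into a shared latent space.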
Research on MLA
Research papers on MLA include:
- The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
- DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
- Tim Urista, Dec 2024, Dramatically Reduce Inference Costs with DeepSeek-V3: A New Era in Open-Source LLMs, https://ai.gopubby.com/dramatically-reduce-inference-costs-with-deepseek-v3-a-new-era-in-open-source-llms-4f1adf760ee1
- Minhajul Hoque, Jan 4, 2025, DeepSeek V3: How They Achieved Big Results with Small Compute, https://ai.plainenglish.io/deepseek-v3-how-they-achieved-big-results-with-small-compute-fb694606d59a (DeepSeek optimizations included FP8 quantization with outlier handling, attention and KV cache optimization via Multi-Head Latent Attention (MHLA), and multi-token decoding.)
- Dr. Ashish Bamania, Feb 2025, Multi-Head Latent Attention Is The Powerful Engine Behind DeepSeek: A deep dive Into DeepSeek’s innovative Attention mechanism that makes its LLMs so good https://levelup.gitconnected.com/multi-head-latent-attention-is-the-powerful-engine-behind-deepseek-0ecfd29e0b04 (MLA versus GQA/MQA attention and how MLA achieves KV cache compression.)
- Fanxu Meng, Zengwei Yao, Muhan Zhang, 13 Feb 2025 (v2), TransMLA: Multi-Head Latent Attention Is All You Need, https://arxiv.org/abs/2502.07864
- Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui, 20 Feb 2025, Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs, https://arxiv.org/abs/2502.14837
- DeepSeek, Feb 2025 (accessed), FlashMLA: FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences serving, https://github.com/deepseek-ai/FlashMLA
- Nickie Louise, February 24, 2025, DeepSeek launches FlashMLA: A breakthrough in AI speed and efficiency for NVIDIA GPUs, https://techstartups.com/2025/02/24/deepseek-launches-flashmla-a-breakthrough-in-ai-speed-and-efficiency-for-nvidia-gpus/
- Ashley Goolam, March 4, 2025, DeepSeek Open Source Week: A Complete Summary, https://apidog.com/blog/deepseek-open-source-week/
- Benjamin Spector, Aaryan Singhal, Dan Fu, Chris Ré, March 4, 2025, ThunderMLA: FlashMLA, Faster and Fused-er! https://hazyresearch.stanford.edu/blog/2025-03-04-thundermla https://github.com/HazyResearch/ThunderKittens/blob/mla/kernels/attn/demo/mla_decode/template_mla_decode.cu (Using a single CUDA "megakernel" to perform all jobs and passing it meta-instructions, thereby avoiding launching and shutting down kernels.)
- Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum, 14 Mar 2025, X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression, https://arxiv.org/abs/2503.11132
- Chengen Wang, Murat Kantarcioglu, 14 Mar 2025, A Review of DeepSeek Models' Key Innovative Techniques, https://arxiv.org/abs/2503.11486
- L. Xiong et al., May 2025, DeepSeek: Paradigm Shifts and Technical Evolution in Large AI Models, IEEE/CAA Journal of Automatica Sinica, vol. 12, no. 5, pp. 841-858, May 2025, doi: 10.1109/JAS.2025.125495, https://ieeexplore.ieee.org/abstract/document/11005752
- Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y.X. Wei, 14 May 2025, Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures, https://arxiv.org/abs/2505.09343
- Ted Zadouri, Hubert Strauss, Tri Dao, 27 May 2025, Hardware-Efficient Attention for Fast Decoding, https://arxiv.org/abs/2505.21487
- Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, Emad Barsoum, 22 May 2025, Zebra-Llama: Towards Extremely Efficient Hybrid Models, https://arxiv.org/abs/2505.17272 (Merging SSM and LLM.)
- Sebastian Raschka, Jul 19, 2025, The Big LLM Architecture Comparison: From DeepSeek-V3 to Kimi K2: A Look At Modern LLM Architecture Design, https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison
- Stephen Diehl, 2025, Attention Wasn't All We Needed, https://www.stephendiehl.com/posts/post_transformers/
- Keqi Deng, Philip C. Woodland, 21 May 2025 (v2), Multi-head Temporal Latent Attention, https://arxiv.org/abs/2505.13544
- Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng, 13 Aug 2025, Speed Always Wins: A Survey on Efficient Architectures for Large Language Models, https://arxiv.org/abs/2508.09834
- Luoyang Sun, Cheng Deng, Jiwen Jiang, Xinjian Wu, Haifeng Zhang, Lei Chen, Lionel Ni, Jun Wang, 23 Jul 2025, GTA: Grouped-head latenT Attention, https://arxiv.org/abs/2506.17286
- Sungmin Yun and Seonyong Park and Hwayong Nam and Younjoo Lee and Gunjun Lee and Kwanhee Kyung and Sangpyo Kim and Nam Sung Kim and Jongmin Kim and Hyungyo Kim and Juhwan Cho and Seungmin Baek and Jung Ho Ahn, 23 Jul 2025, The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts, https://arxiv.org/abs/2507.15465
- Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat, 2 Aug 2025, Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models, https://arxiv.org/abs/2508.01261
- Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, Muhan Zhang, 21 Aug 2025, TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference, https://arxiv.org/abs/2508.15881
- Hei Shing Cheung and Boya Zhang, 26 Jul 2025, Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion, https://arxiv.org/abs/2507.19991
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
More AI Research
Read more about: