Aussie AI
Ring Attention
Last Updated 31 May, 2026
by David Spuler, Ph.D.
What is Ring Attention?
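Ring attention is an LLM optimization of the attention module that uses blockwise computations. The input sequence is split into blocks that are distributed across multiple devices arranged in a ring, and the key-value blocks are passed from device to device while each device computes attention for its own block of queries. The aim is to speed up the self-attention step and to support much longer contexts, in either training or inference. Ring attention is orthogonal to several other memory-efficient attention algorithms, such as Flash attention, and can be combined with them.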
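The blockwise idea can be illustrated with a small single-process simulation. This is only a minimal sketch under stated assumptions, not the implementation from any of the papers below: the "devices" are plain Python lists rather than GPUs, attention is non-causal, there is a single head, and the names `full_attention`, `ring_attention` and `num_devices` are hypothetical. Each simulated device keeps its own block of queries, processes whichever key-value block it currently holds using a running (online) softmax, and then hands that block to its neighbor in the ring.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(Q, K, V):
    """Reference: ordinary softmax attention over the whole sequence."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [dot(q, k) * scale for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def ring_attention(Q, K, V, num_devices):
    """Simulated ring attention (hypothetical sketch, non-causal, one head)."""
    n, d, dv = len(Q), len(Q[0]), len(V[0])
    assert n % num_devices == 0, "sequence must split evenly across devices"
    block = n // num_devices
    scale = 1.0 / math.sqrt(d)

    # Each simulated device owns one query block and starts with its own KV block.
    q_blocks = [Q[i * block:(i + 1) * block] for i in range(num_devices)]
    kv_blocks = [(K[i * block:(i + 1) * block], V[i * block:(i + 1) * block])
                 for i in range(num_devices)]

    # Per-query running state for the online softmax.
    run_max = [[-math.inf] * block for _ in range(num_devices)]
    run_sum = [[0.0] * block for _ in range(num_devices)]
    run_out = [[[0.0] * dv for _ in range(block)] for _ in range(num_devices)]

    for _ in range(num_devices):            # one ring step per KV block
        for dev in range(num_devices):      # would run in parallel on real hardware
            k_blk, v_blk = kv_blocks[dev]
            for r, q in enumerate(q_blocks[dev]):
                scores = [dot(q, k) * scale for k in k_blk]
                new_max = max(run_max[dev][r], max(scores))
                # Rescale what was accumulated under the previous running max.
                correction = math.exp(run_max[dev][r] - new_max)
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[dev][r] = run_sum[dev][r] * correction + sum(weights)
                run_out[dev][r] = [
                    o * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                    for j, o in enumerate(run_out[dev][r])
                ]
                run_max[dev][r] = new_max
        # Rotate: every device passes its KV block to the next device in the ring.
        kv_blocks = [kv_blocks[(dev - 1) % num_devices]
                     for dev in range(num_devices)]

    # Final normalization, concatenated back into sequence order.
    return [[o / run_sum[dev][r] for o in run_out[dev][r]]
            for dev in range(num_devices) for r in range(block)]

if __name__ == "__main__":
    random.seed(0)
    Q, K, V = ([[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
               for _ in range(3))
    ref = full_attention(Q, K, V)
    out = ring_attention(Q, K, V, num_devices=4)
    print(max(abs(a - b) for ra, rb in zip(ref, out) for a, b in zip(ra, rb)))
```

Because the running maximum, sum and output are rescaled at every step, no device ever needs the full attention matrix or the full key-value sequence in memory at once. On real hardware the rotation step would be an inter-device transfer that can overlap with the block computation; here it is just a list reshuffle.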
Ring Attention: Book Excerpts and Blog Articles
Free book excerpts (full-text chapters online and free PDF downloads) and related Aussie AI blog articles:
- David Spuler, Ph.D., March 3rd, 2025, What's Hot in LLM Inference Optimization in 2025? Aussie AI Blog, https://www.aussieai.com/blog/hot-inference-optimization-2025
- David Spuler, May 31st, 2026, Chapter 41. Other Attention Kernels, in book LLM Inference Optimization: State-of-the-Art Research, Table of Contents: https://www.aussieai.com/book/llm-inference-optimization https://www.amazon.com/dp/B0H3FKR39T
Research on Ring Attention
Research papers on ring attention include:
- Hao Liu, Matei Zaharia, Pieter Abbeel, 27 Nov 2023 (v4), Ring Attention with Blockwise Transformers for Near-Infinite Context, https://arxiv.org/abs/2310.01889 https://github.com/lhao499/llm_large_context (Original paper for ring attention.)
- William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley, 15 Nov 2023, Striped Attention: Faster Ring Attention for Causal Transformers, https://arxiv.org/abs/2311.09431
- Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
- Zongwu Wang, Fangxin Liu, Mingshuai Li, Li Jiang, 29 Dec 2024, TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication, https://arxiv.org/abs/2412.20501 https://github.com/ACA-Lab-SJTU/token-ring (Ring attention with inter-GPU network transmission optimizations.)
- Seongho Hong, Yong-Hoon Choi, 2 Jan 2025, RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer, https://arxiv.org/abs/2501.01182
- zhuzilin, Jan 2025, ring-flash-attention: Ring attention implementation with flash attention, https://github.com/zhuzilin/ring-flash-attention
- Kilian Haefeli, Simon Zirui Guo, Bonnie Li, 10 Apr 2024, Ring Attention Explained, https://coconut-mode.com/posts/ring-attention/
- Tanuj Sharma, Feb 23, 2024, Breaking the Boundaries: Understanding Context Window Limitations and the idea of Ring Attention, https://medium.com/@iamtanujsharma/breaking-the-boundaries-understanding-context-window-limitations-and-the-idea-of-ring-attention-170e522d44b2
- Nivas Jayaseelan, November 1, 2023, Understanding Ring Attention: Building Transformers With Near-Infinite Context, https://www.e2enetworks.com/blog/understanding-ring-attention-building-transformers-with-near-infinite-context
- Peter Chng, August 19, 2024, Ring Attention - scaling attention across multiple devices, https://peterchng.com/blog/2024/08/19/ring-attention-scaling-attention-across-multiple-devices/
- Nouamane Tazi, Ferdinand Mom, Haojun Zhao, Phuc Nguyen, Mohamed Mekkouri, Leandro Werra, Thomas Wolf, Feb 19, 2025, The Ultra-Scale Playbook: Training LLMs on GPU Clusters, Hugging Face, https://huggingface.co/spaces/nanotron/ultrascale-playbook https://huggingface.co/spaces/nanotron/ultrascale-playbook/resolve/main/The_Ultra-Scale_Playbook_Training_LLMs_on_GPU_Clusters.pdf
- Stephen Diehl, 2025, Attention Wasn't All We Needed, https://www.stephendiehl.com/posts/post_transformers/
- James Pan, Guoliang Li, 27 Jun 2025, A Survey of LLM Inference Systems, https://arxiv.org/abs/2506.21901
- Rajarshi Chowdhury, March 2026, I/O for LLM Inference: A Survey of Storage and Memory Bottlenecks, https://assets-eu.researchsquare.com/files/rs-9036613/v1_covered_6447e1e8-b399-4716-af42-991e77ccd91b.pdf
More Attention Research Topics
Related LLM research areas on optimizing attention for long contexts include:
- Attention optimization (main page)
- Local attention
- Linear attention
- Sparse attention
- Multi-Head Attention (MHA)
- Multi-Query Attention (MQA)
- Grouped-Query Attention (GQA)
- Flash attention
- Paged attention
Other topics in attention research:
- Low-rank matrix attention
- Medusa attention
- Block attention
- Cross attention
- Fused head attention
- Hybrid local-global attention
- FFT attention
- QKV computation optimizations
- Additive attention
- Multiplicative attention
- Graph attention
- Chunked attention
- Attention sink
- Attention steering
- Bilinear attention
- Attention-free methods
- Mixture-of-Heads (MOH) Attention (MoE+MHA)
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI (new book on AI intelligence theory), available from Amazon
- RAG Optimization: Accurate and Efficient LLM Applications (new book on RAG architectures), available from Amazon
- Generative AI Applications, available from Amazon
- Generative AI in C++ (Generative AI programming book), available from Amazon
- CUDA C++ Optimization, available from Amazon
- CUDA C++ Debugging, available from Amazon