Aussie AI
Distributed Training
-
Last Updated 22 October, 2025
-
by David Spuler, Ph.D.
Research on Distributed Training
Research papers include:
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- WenZheng Zhang, Yang Hu, Jing Shi, Xiaoying Bai, 22 Aug 2024, Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters, https://arxiv.org/abs/2408.12596
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4, https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training. A review of the challenges in synchronous distributed training and the best solutions for stragglers and high latency. https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
- Palak (Microsoft Research India), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India), Venkata N. Padmanabhan (Microsoft Research India), 16 Nov 2024, Improving training time and GPU utilization in geo-distributed language model training, https://arxiv.org/abs/2411.14458
- Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, and Yida Wang. 2023. GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23). Association for Computing Machinery, New York, NY, USA, 364–381. https://doi.org/10.1145/3600006.3613145 https://dl.acm.org/doi/10.1145/3600006.3613145 https://www.cs.rice.edu/~eugeneng/papers/SOSP23.pdf (First paper on in-memory checkpointing to CPU memory, and also covers interleaving of checkpointing network traffic with training traffic; an illustrative sketch of the general idea appears after this list.)
- Youshao Xiao, Lin Ju, Zhenglei Zhou, Siyuan Li, Zhaoxin Huan, Dalong Zhang, Rujie Jiang, Lin Wang, Xiaolu Zhang, Lei Liang, Jun Zhou, 15 Apr 2024, AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes, https://arxiv.org/abs/2404.09679
- Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang, 28 Jun 2024 (v2), Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training, https://arxiv.org/abs/2406.18820
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, ACM Computing Surveys, https://dl.acm.org/doi/pdf/10.1145/3706418
- Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma, 29 Nov 2024, DeMo: Decoupled Momentum Optimization, https://arxiv.org/abs/2411.19870 https://github.com/bloc97/DeMo (Extension to ADAM optimizer that greatly reduces network communication in training.)
- Carl Franzen, August 27, 2024, ‘This could change everything!’ Nous Research unveils new tool to train powerful AI models with 10,000x efficiency, https://venturebeat.com/ai/this-could-change-everything-nous-research-unveils-new-tool-to-train-powerful-ai-models-with-10000x-efficiency/
- Carl Franzen, December 2, 2024, Nous Research is training an AI model using machines distributed across the internet, https://venturebeat.com/ai/nous-research-is-training-an-ai-model-using-machines-distributed-across-the-internet/
- Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, Hong Xu, 17 Dec 2024, Echo: Simulating Distributed Training At Scale, https://arxiv.org/abs/2412.12487
- Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, Dongsheng Li, 21 Jan 2025, A Survey on Memory-Efficient Large-Scale Model Training in AI for Science, https://arxiv.org/abs/2501.11847
- Xinyi Liu, Yujie Wang, Shenhan Zhu, Fangcheng Fu, Qingshuo Liu, Guangming Lin, Bin Cui, 30 Apr 2025, Galvatron: An Automatic Distributed System for Efficient Foundation Model Training, https://arxiv.org/abs/2504.21411 https://github.com/PKU-DAIR/Hetu-Galvatron
- Nouamane Tazi, Ferdinand Mom, Haojun Zhao, Phuc Nguyen, Mohamed Mekkouri, Leandro von Werra, Thomas Wolf, Feb 19, 2025, The Ultra-Scale Playbook: Training LLMs on GPU Clusters, Hugging Face, https://huggingface.co/spaces/nanotron/ultrascale-playbook https://huggingface.co/spaces/nanotron/ultrascale-playbook/resolve/main/The_Ultra-Scale_Playbook_Training_LLMs_on_GPU_Clusters.pdf
- Zihao Song, Shirantha Welikala, Panos J. Antsaklis and Hai Lin, 22 Jul 2025, Graph Neural Network-Based Distributed Optimal Control for Linear Networked Systems: An Online Distributed Training Approach, https://arxiv.org/abs/2504.06439
- Seth Ockerman, Amal Gueroudji, Tanwi Mallick, Yixuan He, Line Pouchard, Robert Ross, Shivaram Venkataraman, 20 Jul 2025, PGT-I: Scaling Spatiotemporal GNNs with Memory-Efficient Distributed Training, https://arxiv.org/abs/2507.11683
- Tolga Dimlioglu, Anna Choromanska, 27 Jul 2025, Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning, https://arxiv.org/abs/2507.20424
- Samarth Gupta, Raghudeep Gadde, Rui Chen, Aleix M. Martinez, 20 Aug 2025, Disentanglement in T-space for Faster and Distributed Training of Diffusion Models with Fewer Latent-states, https://arxiv.org/abs/2508.14413
- Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu, 4 Aug 2025, VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo, https://arxiv.org/abs/2508.02317
- Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, Zilong Wang, Zhenghang Ren, Xinyang Huang, Wenxue Li, Kin Fai Tse, Zhizhen Zhong, Guyue Liu, Ying Zhang, Xiaofeng Ye, Yiming Zhang, Kai Chen, 4 Aug 2025, MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training, https://arxiv.org/abs/2501.03905
- Arefin Niam, Tevfik Kosar and M S Q Zulkar Nine, 5 Sep 2025, RapidGNN: Energy and Communication-Efficient Distributed Training on Large-Scale Graph Neural Networks, https://arxiv.org/abs/2509.05207
- Yunfei Teng, Sixin Zhang, 3 Sep 2025, LSAM: Asynchronous Distributed Training with Landscape-Smoothed Sharpness-Aware Minimization, https://arxiv.org/abs/2509.03110
- Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan, 12 Sep 2025, Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective, https://arxiv.org/abs/2509.10371
- Ying Cao, Kun Yuan, Ali H. Sayed, 14 Sep 2025, On the Escaping Efficiency of Distributed Adversarial Training Algorithms, https://arxiv.org/abs/2509.11337
- Yuwen Cao, Guijun Liu, Tomoaki Ohtsuki, Howard H. Yang, Tony Q. S. Quek, 31 Aug 2025, Distributed Gossip-GAN for Low-overhead CSI Feedback Training in FDD mMIMO-OFDM Systems, https://arxiv.org/abs/2509.10490
- Wenjiao Feng and Rongxing Xiao and Zonghang Li and Hongfang Yu and Gang Sun and Long Luo and Mohsen Guizani and Qirong Ho and Steve Liu, 13 Sep 2025, Learning In Chaos: Efficient Autoscaling and Self-Healing for Multi-Party Distributed Training, https://arxiv.org/abs/2505.12815
- Kai Yi, 10 Sep 2025, Strategies for Improving Communication Efficiency in Distributed and Federated Learning: Compression, Local Training, and Personalization, https://arxiv.org/abs/2509.08233
- Kai Yi, Georg Meinhardt, Laurent Condat, Peter Richtárik, 10 Sep 2025, FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized Models, https://arxiv.org/abs/2403.09904
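As a concrete illustration of the in-memory checkpointing idea described in the GEMINI entry above, the sketch below copies model state into pinned CPU RAM so that recovery can restart from host memory rather than slow remote storage. This is a minimal sketch under assumed conditions (PyTorch, a single CUDA device), not the GEMINI implementation; the helper names snapshot_to_cpu and restore_from_cpu are illustrative only.

```python
# Minimal sketch of in-memory checkpointing to CPU RAM (illustrative only,
# not the GEMINI implementation). Assumes PyTorch with a CUDA device.
import torch
import torch.nn as nn

def snapshot_to_cpu(model: nn.Module) -> dict:
    """Copy every parameter and buffer into pinned host (CPU) memory."""
    snapshot = {}
    for name, tensor in model.state_dict().items():
        host = torch.empty(tensor.shape, dtype=tensor.dtype, pin_memory=True)
        host.copy_(tensor, non_blocking=True)  # asynchronous device-to-host copy
        snapshot[name] = host
    torch.cuda.synchronize()  # ensure all copies have completed before trusting the snapshot
    return snapshot

def restore_from_cpu(model: nn.Module, snapshot: dict) -> None:
    """Reload the in-memory checkpoint, e.g. after a failure or rollback."""
    model.load_state_dict(snapshot)

if __name__ == "__main__":
    model = nn.Linear(1024, 1024).cuda()
    checkpoint = snapshot_to_cpu(model)   # e.g. taken every N training steps
    restore_from_cpu(model, checkpoint)   # e.g. on failure recovery
```

A production system would also interleave these host copies with the training network traffic and replicate snapshots to peer machines, which is the harder engineering that the GEMINI paper addresses.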
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book: Get your copy from Amazon: Generative AI Applications
- Generative AI programming book: Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book: Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book: Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home