Aussie AI
Distributed Training
-
Last Updated 22 October, 2025
-
by David Spuler, Ph.D.
Research on Distributed Training
Research papers include:
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- WenZheng Zhang, Yang Hu, Jing Shi, Xiaoying Bai, 22 Aug 2024, Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters, https://arxiv.org/abs/2408.12596
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4, https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training. A review of the challenges in synchronous distributed training and the best solutions for stragglers and high latency. https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
- Palak (Microsoft Research India), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India), Venkata N. Padmanabhan (Microsoft Research India), 16 Nov 2024, Improving training time and GPU utilization in geo-distributed language model training, https://arxiv.org/abs/2411.14458
- Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, and Yida Wang. 2023. GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23). Association for Computing Machinery, New York, NY, USA, 364–381. https://doi.org/10.1145/3600006.3613145 https://dl.acm.org/doi/10.1145/3600006.3613145 https://www.cs.rice.edu/~eugeneng/papers/SOSP23.pdf (First paper on in-memory checkpointing to CPU memory, and also covers interleaving of checkpointing network traffic with training traffic; an illustrative sketch of the general idea appears after this list.)
- Youshao Xiao, Lin Ju, Zhenglei Zhou, Siyuan Li, Zhaoxin Huan, Dalong Zhang, Rujie Jiang, Lin Wang, Xiaolu Zhang, Lei Liang, Jun Zhou, 15 Apr 2024, AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes, https://arxiv.org/abs/2404.09679
- Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang, 28 Jun 2024 (v2), Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training, https://arxiv.org/abs/2406.18820
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, ACM Computing Surveys, https://dl.acm.org/doi/pdf/10.1145/3706418
- Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma, 29 Nov 2024, DeMo: Decoupled Momentum Optimization, https://arxiv.org/abs/2411.19870 https://github.com/bloc97/DeMo (Extension to ADAM optimizer that greatly reduces network communication in training.)
- Carl Franzen, August 27, 2024, ‘This could change everything!’ Nous Research unveils new tool to train powerful AI models with 10,000x efficiency, https://venturebeat.com/ai/this-could-change-everything-nous-research-unveils-new-tool-to-train-powerful-ai-models-with-10000x-efficiency/
- Carl Franzen, December 2, 2024, Nous Research is training an AI model using machines distributed across the internet, https://venturebeat.com/ai/nous-research-is-training-an-ai-model-using-machines-distributed-across-the-internet/
- Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, Hong Xu, 17 Dec 2024, Echo: Simulating Distributed Training At Scale, https://arxiv.org/abs/2412.12487
- Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, Dongsheng Li, 21 Jan 2025, A Survey on Memory-Efficient Large-Scale Model Training in AI for Science, https://arxiv.org/abs/2501.11847
- Xinyi Liu, Yujie Wang, Shenhan Zhu, Fangcheng Fu, Qingshuo Liu, Guangming Lin, Bin Cui, 30 Apr 2025, Galvatron: An Automatic Distributed System for Efficient Foundation Model Training, https://arxiv.org/abs/2504.21411 https://github.com/PKU-DAIR/Hetu-Galvatron
- Nouamane Tazi, Ferdinand Mom, Haojun Zhao, Phuc Nguyen, Mohamed Mekkouri, Leandro von Werra, Thomas Wolf, Feb 19, 2025, The Ultra-Scale Playbook: Training LLMs on GPU Clusters, Hugging Face, https://huggingface.co/spaces/nanotron/ultrascale-playbook https://huggingface.co/spaces/nanotron/ultrascale-playbook/resolve/main/The_Ultra-Scale_Playbook_Training_LLMs_on_GPU_Clusters.pdf
- Zihao Song, Shirantha Welikala, Panos J. Antsaklis and Hai Lin, 22 Jul 2025, Graph Neural Network-Based Distributed Optimal Control for Linear Networked Systems: An Online Distributed Training Approach, https://arxiv.org/abs/2504.06439
- Seth Ockerman, Amal Gueroudji, Tanwi Mallick, Yixuan He, Line Pouchard, Robert Ross, Shivaram Venkataraman, 20 Jul 2025, PGT-I: Scaling Spatiotemporal GNNs with Memory-Efficient Distributed Training, https://arxiv.org/abs/2507.11683
- Tolga Dimlioglu, Anna Choromanska, 27 Jul 2025, Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning, https://arxiv.org/abs/2507.20424
- Samarth Gupta, Raghudeep Gadde, Rui Chen, Aleix M. Martinez, 20 Aug 2025, Disentanglement in T-space for Faster and Distributed Training of Diffusion Models with Fewer Latent-states, https://arxiv.org/abs/2508.14413
- Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu, 4 Aug 2025, VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo, https://arxiv.org/abs/2508.02317
- Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, Zilong Wang, Zhenghang Ren, Xinyang Huang, Wenxue Li, Kin Fai Tse, Zhizhen Zhong, Guyue Liu, Ying Zhang, Xiaofeng Ye, Yiming Zhang, Kai Chen, 4 Aug 2025, MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training, https://arxiv.org/abs/2501.03905
- Arefin Niam, Tevfik Kosar and M S Q Zulkar Nine, 5 Sep 2025, RapidGNN: Energy and Communication-Efficient Distributed Training on Large-Scale Graph Neural Networks, https://arxiv.org/abs/2509.05207
- Yunfei Teng, Sixin Zhang, 3 Sep 2025, LSAM: Asynchronous Distributed Training with Landscape-Smoothed Sharpness-Aware Minimization, https://arxiv.org/abs/2509.03110
- Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan, 12 Sep 2025, Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective, https://arxiv.org/abs/2509.10371
- Ying Cao, Kun Yuan, Ali H. Sayed, 14 Sep 2025, On the Escaping Efficiency of Distributed Adversarial Training Algorithms, https://arxiv.org/abs/2509.11337
- Yuwen Cao, Guijun Liu, Tomoaki Ohtsuki, Howard H. Yang, Tony Q. S. Quek, 31 Aug 2025, Distributed Gossip-GAN for Low-overhead CSI Feedback Training in FDD mMIMO-OFDM Systems, https://arxiv.org/abs/2509.10490
- Wenjiao Feng and Rongxing Xiao and Zonghang Li and Hongfang Yu and Gang Sun and Long Luo and Mohsen Guizani and Qirong Ho and Steve Liu, 13 Sep 2025, Learning In Chaos: Efficient Autoscaling and Self-Healing for Multi-Party Distributed Training, https://arxiv.org/abs/2505.12815
- Kai Yi, 10 Sep 2025, Strategies for Improving Communication Efficiency in Distributed and Federated Learning: Compression, Local Training, and Personalization, https://arxiv.org/abs/2509.08233
- Kai Yi, Georg Meinhardt, Laurent Condat, Peter Richtárik, 10 Sep 2025, FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized Models, https://arxiv.org/abs/2403.09904
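As a concrete illustration of the in-memory checkpointing idea described in the GEMINI entry above, the sketch below copies model state into pinned CPU RAM so that recovery can restart from host memory rather than slow remote storage. This is a minimal sketch under assumed conditions (PyTorch, a single CUDA device), not the GEMINI implementation; the helper names snapshot_to_cpu and restore_from_cpu are illustrative only.

```python
# Minimal sketch of in-memory checkpointing to CPU RAM (illustrative only,
# not the GEMINI implementation). Assumes PyTorch with a CUDA device.
import torch
import torch.nn as nn

def snapshot_to_cpu(model: nn.Module) -> dict:
    """Copy every parameter and buffer into pinned host (CPU) memory."""
    snapshot = {}
    for name, tensor in model.state_dict().items():
        host = torch.empty(tensor.shape, dtype=tensor.dtype, pin_memory=True)
        host.copy_(tensor, non_blocking=True)  # asynchronous device-to-host copy
        snapshot[name] = host
    torch.cuda.synchronize()  # ensure all copies have completed before trusting the snapshot
    return snapshot

def restore_from_cpu(model: nn.Module, snapshot: dict) -> None:
    """Reload the in-memory checkpoint, e.g. after a failure or rollback."""
    model.load_state_dict(snapshot)

if __name__ == "__main__":
    model = nn.Linear(1024, 1024).cuda()
    checkpoint = snapshot_to_cpu(model)   # e.g. taken every N training steps
    restore_from_cpu(model, checkpoint)   # e.g. on failure recovery
```

A production system would also interleave these host copies with the training network traffic and replicate snapshots to peer machines, which is the harder engineering that the GEMINI paper addresses.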
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book: Get your copy from Amazon: Generative AI Applications
- Generative AI programming book: Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book: Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book: Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home