Aussie AI
Gradient Optimizer Research
Last Updated 29 August, 2025
by David Spuler, Ph.D.
ADAM
- Diederik P. Kingma, Jimmy Ba, 30 Jan 2017 (v9), Adam: A Method for Stochastic Optimization, https://arxiv.org/abs/1412.6980
- Soham De, Anirbit Mukherjee, Enayat Ullah, 20 Nov 2018 (v3), Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration, https://arxiv.org/abs/1807.06766
- Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu, 25 Jun 2019 (v3), A Sufficient Condition for Convergences of Adam and RMSProp, https://arxiv.org/abs/1811.09358
- Qi Zhang, Yi Zhou, Shaofeng Zou, 3 Apr 2024 (v2), Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance, https://arxiv.org/abs/2404.01436
- Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier, 17 Oct 2022 (v3), A Simple Convergence Proof of Adam and Adagrad, https://arxiv.org/abs/2003.02395
- Kushal Chakrabarti, Nikhil Chopra, 30 Sep 2021 (v2), Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective, https://arxiv.org/abs/2106.00092
- Sebastian Bock, Josef Goppold, Martin Weiß, 27 Apr 2018, An improvement of the convergence proof of the ADAM-Optimizer, https://arxiv.org/abs/1804.10587
- Jiawei Zhang, Fisher B. Gouza, 10 Mar 2019 (v2), GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization, https://arxiv.org/abs/1805.07500
- Xiangyi Chen, Sijia Liu, Ruoyu Sun, Mingyi Hong, 10 Mar 2019 (v2), On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization, https://arxiv.org/abs/1808.02941
- Remi Genet, Hugo Inzirillo, 31 Oct 2024, CaAdam: Improving Adam optimizer using connection aware methods, https://arxiv.org/abs/2410.24216
- R Abdulkadirov, P Lyakhov, N Nagornov - Mathematics, 2023, Survey of optimization algorithms in modern neural networks, https://doi.org/10.3390/math11112466 https://www.mdpi.com/2227-7390/11/11/2466
- Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma, 29 Nov 2024, DeMo: Decoupled Momentum Optimization, https://arxiv.org/abs/2411.19870 https://github.com/bloc97/DeMo (An extension of the Adam optimizer that greatly reduces network communication during training.)
- Shaowen Wang, Anan Liu, Jian Xiao, Huan Liu, Yuekui Yang, Cong Xu, Qianqian Pu, Suncong Zheng, Wei Zhang, Jian Li, 29 Nov 2024, CAdam: Confidence-Based Optimization for Online Learning, https://arxiv.org/abs/2411.19647
- Kwangryeol Park, Seulki Lee, 12 Dec 2024, SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization, https://arxiv.org/abs/2412.08894 (Makes the Adam gradient optimizer memory-efficient via low-rank matrix factorization of its momentum state.)
- Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen, 17 Dec 2024 (v2), No More Adam: Learning Rate Scaling at Initialization is All You Need, https://arxiv.org/abs/2412.11768
- Wenhan Jiang, Jinlan Liu, Naimin Zhang, Dongpo Xu, DMAdam: Dual averaging enhanced adaptive gradient method for deep neural networks, Knowledge-Based Systems, 2024, 112886, ISSN 0950-7051, https://doi.org/10.1016/j.knosys.2024.112886 https://www.sciencedirect.com/science/article/abs/pii/S095070512401520X
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- Yuanzhe Tao, Huizhuo Yuan, Xun Zhou, Yuan Cao, Quanquan Gu, 27 Dec 2024, Towards Simple and Provable Parameter-Free Adaptive Gradient Methods, https://arxiv.org/abs/2412.19444
- Y. Li et al., 2025, "Q-DADAM: A Quantized Distributed Online Optimization Algorithm With Adaptive Momentum," in IEEE Transactions on Control of Network Systems, doi: 10.1109/TCNS.2025.3526555. https://ieeexplore.ieee.org/abstract/document/10830565
- Michael Nuñez, July 11, 2025, Moonshot AI’s Kimi K2 outperforms GPT-4 in key benchmarks — and it’s free, https://venturebeat.com/ai/moonshot-ais-kimi-k2-outperforms-gpt-4-in-key-benchmarks-and-its-free/ (A one-trillion-parameter MoE model with 32B parameters activated per token. Examines the new MuonClip training optimizer as more efficient and more stable than AdamW variants.)
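
For quick reference alongside the papers above, the core Adam update of Kingma & Ba maintains exponential moving averages of the gradient and its square, applies bias correction, and scales each coordinate's step individually. Below is a minimal NumPy sketch, not a reference implementation; the toy quadratic objective and the hyperparameter values are illustrative only.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: EMA first/second moment estimates plus bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (EMA of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (EMA of squared gradients)
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t counts from 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate adaptive step
    return w, m, v

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0, 3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(w)  # approaches the minimizer at the origin
```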
RMSprop
- Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky, 2012, Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude, https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
- Mahesh Chandra Mukkamala, Matthias Hein, 28 Nov 2017 (v2), Variants of RMSProp and Adagrad with Logarithmic Regret Bounds, https://arxiv.org/abs/1706.05507
- Thomas Kurbiel, Shahrzad Khaleghian, 6 Aug 2017, Training of Deep Neural Networks based on Distance Measures using RMSProp, https://arxiv.org/abs/1708.01911
- Mohammad Emtiyaz Khan, Zuozhu Liu, Voot Tangkaratt, Yarin Gal, 4 Dec 2017, Vprop: Variational Inference using RMSprop, https://arxiv.org/abs/1712.01038
- Soham De, Anirbit Mukherjee, Enayat Ullah, 20 Nov 2018 (v3), Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration, https://arxiv.org/abs/1807.06766
- Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu, 25 Jun 2019 (v3), A Sufficient Condition for Convergences of Adam and RMSProp, https://arxiv.org/abs/1811.09358
- Huan Li, Zhouchen Lin, 15 Apr 2024 (v3), On the O(√d/T^(1/4)) Convergence Rate of RMSProp and Its Momentum Extension Measured by ℓ1 Norm, https://arxiv.org/abs/2402.00389
- Qi Zhang, Yi Zhou, Shaofeng Zou, 3 Apr 2024 (v2), Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance, https://arxiv.org/abs/2404.01436
- Bilel Bensaid (CEA-CESTA, IMB), Gaël Poëtte (CEA-CESTA), Rodolphe Turpault (IMB), 22 Jul 2024, Convergence of the Iterates for Momentum and RMSProp for Local Smooth Functions: Adaptation is the Key, https://arxiv.org/abs/2407.15471
- Patrick McNamee, Zahra Nili Ahmadabadi, 18 Sep 2024, Adaptive Extremum Seeking Control via the RMSprop Optimizer, https://arxiv.org/abs/2409.12290
- R Abdulkadirov, P Lyakhov, N Nagornov - Mathematics, 2023, Survey of optimization algorithms in modern neural networks, https://doi.org/10.3390/math11112466 https://www.mdpi.com/2227-7390/11/11/2466
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
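
As background for the entries above, RMSprop (Hinton et al., lecture 6e) divides each gradient coordinate by a running root-mean-square of its recent magnitudes. A minimal NumPy sketch follows; the decay rate, learning rate, and toy objective are illustrative assumptions, not values taken from any of the papers listed.

```python
import numpy as np

def rmsprop_step(w, grad, sq_avg, lr=0.01, alpha=0.9, eps=1e-8):
    """One RMSprop step: divide the gradient by a running RMS of its magnitude."""
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2   # EMA of squared gradients
    w = w - lr * grad / (np.sqrt(sq_avg) + eps)
    return w, sq_avg

# Toy usage on f(w) = ||w||^2 (gradient 2w).
w = np.array([1.0, -2.0, 3.0])
sq_avg = np.zeros_like(w)
for _ in range(1000):
    w, sq_avg = rmsprop_step(w, 2 * w, sq_avg, lr=0.05)
print(w)  # settles near the origin, oscillating within roughly one step size
```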
AdaDelta
- Matthew D. Zeiler, 22 Dec 2012, ADADELTA: An Adaptive Learning Rate Method, https://arxiv.org/abs/1212.5701
- R Abdulkadirov, P Lyakhov, N Nagornov - Mathematics, 2023, Survey of optimization algorithms in modern neural networks, https://doi.org/10.3390/math11112466 https://www.mdpi.com/2227-7390/11/11/2466
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
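
For context, AdaDelta (Zeiler, above) removes the explicit learning rate: each step is scaled by the ratio of the RMS of recent parameter updates to the RMS of recent gradients. A minimal NumPy sketch under the same toy quadratic assumption as the other sketches on this page; the ρ and ε values are illustrative.

```python
import numpy as np

def adadelta_step(w, grad, sq_grad, sq_delta, rho=0.95, eps=1e-6):
    """One AdaDelta step: step size is RMS(past updates) / RMS(past gradients)."""
    sq_grad = rho * sq_grad + (1 - rho) * grad ** 2             # EMA of squared gradients
    delta = -np.sqrt(sq_delta + eps) / np.sqrt(sq_grad + eps) * grad
    sq_delta = rho * sq_delta + (1 - rho) * delta ** 2          # EMA of squared updates
    return w + delta, sq_grad, sq_delta

# Toy usage on f(w) = ||w||^2 (gradient 2w).
w = np.array([1.0, -2.0, 3.0])
sq_grad, sq_delta = np.zeros_like(w), np.zeros_like(w)
for _ in range(5000):
    w, sq_grad, sq_delta = adadelta_step(w, 2 * w, sq_grad, sq_delta)
print(w)  # moves toward the origin (AdaDelta ramps up slowly from a zero state)
```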
AdaGrad
- Mahesh Chandra Mukkamala, Matthias Hein, 28 Nov 2017 (v2), Variants of RMSProp and Adagrad with Logarithmic Regret Bounds, https://arxiv.org/abs/1706.05507
- Rachel Ward, Xiaoxia Wu, Leon Bottou, 19 Apr 2021 (v8), AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, https://arxiv.org/abs/1806.01811
- Li Shen, Congliang Chen, Fangyu Zou, Zequn Jie, Ju Sun, Wei Liu, 15 May 2023 (v4), A Unified Analysis of AdaGrad with Weighted Aggregation and Momentum Acceleration, https://arxiv.org/abs/1808.03408
- Qian Qian, Xiaoyuan Qian, 9 Jun 2019, The Implicit Bias of AdaGrad on Separable Data, https://arxiv.org/abs/1906.03559
- Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier, 17 Oct 2022 (v3), A Simple Convergence Proof of Adam and Adagrad, https://arxiv.org/abs/2003.02395
- Peter Kairouz, Mónica Ribero, Keith Rush, Abhradeep Thakurta, 30 Jan 2021 (v2), Fast Dimension Independent Private AdaGrad on Publicly Estimated Subspaces, https://arxiv.org/abs/2008.06570
- Cheik Traoré, Edouard Pauwels, 13 Apr 2021 (v3), Sequential convergence of AdaGrad algorithm for smooth convex optimization, https://arxiv.org/abs/2011.12341
- Benjamin Dubois-Taine, Sharan Vaswani, Reza Babanezhad, Mark Schmidt, Simon Lacoste-Julien, 3 Nov 2021 (v2), SVRG Meets AdaGrad: Painless Variance Reduction, https://arxiv.org/abs/2102.09645
- Kushal Chakrabarti, Nikhil Chopra, 30 Sep 2021 (v2), Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective, https://arxiv.org/abs/2106.00092
- Luofeng Liao, Li Shen, Jia Duan, Mladen Kolar, Dacheng Tao, 23 Sep 2022 (v2), Local AdaGrad-Type Algorithm for Stochastic Convex-Concave Optimization, https://arxiv.org/abs/2106.10022
- Ruinan Jin, Yu Xing, Xingkang He, 26 Jan 2022, On the Convergence of mSGD and AdaGrad for Stochastic Optimization, https://arxiv.org/abs/2201.11204
- Ali Kavis, Kfir Yehuda Levy, Volkan Cevher, 6 Apr 2022, High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize, https://arxiv.org/abs/2204.02833
- Zijian Liu, Ta Duy Nguyen, Alina Ene, Huy L. Nguyen, 4 Oct 2023 (v4), On the Convergence of AdaGrad(Norm) on ℝ^d: Beyond Convexity, Non-Asymptotic Rate and Acceleration, https://arxiv.org/abs/2209.14827
- Amit Attia, Tomer Koren, 11 Jun 2023 (v2), SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance, https://arxiv.org/abs/2302.08783
- R. Selvaraj, T. Satheesh, V. Suresh, V. Yathavaraj, 30 Apr 2023, Optimized Machine Learning for CHD Detection using 3D CNN-based Segmentation, Transfer Learning and Adagrad Optimization, https://arxiv.org/abs/2305.00411
- Bohan Wang, Huishuai Zhang, Zhi-Ming Ma, Wei Chen, 28 Sep 2023 (v2), Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and Relaxed Assumptions, https://arxiv.org/abs/2305.18471
- Yusu Hong, Junhong Lin, 13 Sep 2024 (v2), Revisiting Convergence of AdaGrad with Relaxed Assumptions, https://arxiv.org/abs/2402.13794
- Sayantan Choudhury, Nazarii Tupitsa, Nicolas Loizou, Samuel Horvath, Martin Takac, Eduard Gorbunov, 5 Jun 2024 (v2), Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad, https://arxiv.org/abs/2403.02648
- Antoine Godichon-Baggioni (LPSM (UMR 8001)), Wei Lu (LMI), Bruno Portier (LMI), 3 May 2024, A Full Adagrad algorithm with O(Nd) operations, https://arxiv.org/abs/2405.01908
- Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov, 6 Jun 2024, Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed, https://arxiv.org/abs/2406.04443
- Anton Rodomanov, Xiaowen Jiang, Sebastian Stich, 10 Jun 2024, Universality of AdaGrad Stepsizes for Stochastic Optimization: Inexact Oracle, Acceleration and Variance Reduction, https://arxiv.org/abs/2406.06398
- Yuxing Liu, Rui Pan, Tong Zhang, 14 Oct 2024 (v2), AdaGrad under Anisotropic Smoothness, https://arxiv.org/abs/2406.15244
- Serge Gratton, Sadok Jerad, Philippe L. Toint, 1 Nov 2024 (v3), Complexity of Adagrad and other first-order methods for nonconvex optimization problems with bounds constraints, https://arxiv.org/abs/2406.15793
- Ruinan Jin, Xiaoyu Wang, Baoxiang Wang, 8 Sep 2024, Asymptotic and Non-Asymptotic Convergence Analysis of AdaGrad for Non-Convex Optimization via Novel Stopping Time-based Analysis, https://arxiv.org/abs/2409.05023
- R Abdulkadirov, P Lyakhov, N Nagornov - Mathematics, 2023, Survey of optimization algorithms in modern neural networks, https://doi.org/10.3390/math11112466 https://www.mdpi.com/2227-7390/11/11/2466
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- Yuanzhe Tao, Huizhuo Yuan, Xun Zhou, Yuan Cao, Quanquan Gu, 27 Dec 2024, Towards Simple and Provable Parameter-Free Adaptive Gradient Methods, https://arxiv.org/abs/2412.19444
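
The AdaGrad family analyzed in the papers above shrinks each coordinate's step size by the square root of that coordinate's accumulated squared gradients (Duchi, Hazan & Singer, 2011). A minimal NumPy sketch on a toy quadratic; the learning rate and ε are illustrative choices.

```python
import numpy as np

def adagrad_step(w, grad, sq_sum, lr=0.5, eps=1e-8):
    """One AdaGrad step: per-coordinate steps decay with accumulated squared gradients."""
    sq_sum = sq_sum + grad ** 2                      # cumulative (not decayed) sum
    w = w - lr * grad / (np.sqrt(sq_sum) + eps)
    return w, sq_sum

# Toy usage on f(w) = ||w||^2 (gradient 2w).
w = np.array([1.0, -2.0, 3.0])
sq_sum = np.zeros_like(w)
for _ in range(2000):
    w, sq_sum = adagrad_step(w, 2 * w, sq_sum)
print(w)  # approaches the origin; effective step sizes shrink over time
```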
AMSGrad
- Jun-Kun Wang, Xiaoyun Li, Belhal Karimi, Ping Li, 3 Nov 2020 (v3), An Optimistic Acceleration of AMSGrad for Nonconvex Optimization, https://arxiv.org/abs/1903.01435
- Tran Thi Phuong, Le Trieu Phong, 31 Oct 2019 (v4), On the Convergence Proof of AMSGrad and a New Version, https://arxiv.org/abs/1904.03590
- Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, 19 Apr 2019, On the Convergence of Adam and Beyond, https://arxiv.org/abs/1904.09237
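
AMSGrad (Reddi et al., above) modifies Adam by normalizing with the running maximum of the second-moment estimate, so the effective per-coordinate step size never increases; this is the fix for the Adam non-convergence counterexample in that paper. Below is a minimal NumPy sketch in the simplified form without bias correction; the hyperparameters and toy objective are illustrative.

```python
import numpy as np

def amsgrad_step(w, grad, m, v, v_max, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step: like Adam, but normalize by the max of past second moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)                 # the key difference from Adam
    w = w - lr * m / (np.sqrt(v_max) + eps)
    return w, m, v, v_max

# Toy usage on f(w) = ||w||^2 (gradient 2w).
w = np.array([1.0, -2.0, 3.0])
m, v, v_max = np.zeros_like(w), np.zeros_like(w), np.zeros_like(w)
for _ in range(2000):
    w, m, v, v_max = amsgrad_step(w, 2 * w, m, v, v_max)
print(w)  # approaches the origin
```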
Stochastic Gradient Descent (SGD)
- Amit Attia, Tomer Koren, 11 Jun 2023 (v2), SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance, https://arxiv.org/abs/2302.08783
- R Abdulkadirov, P Lyakhov, N Nagornov - Mathematics, 2023, Survey of optimization algorithms in modern neural networks, https://doi.org/10.3390/math11112466 https://www.mdpi.com/2227-7390/11/11/2466
- Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training (a review of the challenges in synchronous distributed training and the best solutions for stragglers and high latency), https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
- Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz, 21 Mar 2017 (v3), Revisiting Distributed Synchronous SGD, https://arxiv.org/abs/1604.00981
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
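
Plain SGD, the baseline against which the adaptive methods above are compared, takes a step along a noisy minibatch gradient, usually combined with momentum. A minimal NumPy sketch with classical (heavy-ball) momentum; the noise model, learning rate, and momentum value are illustrative assumptions.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD step with classical (heavy-ball) momentum."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy usage: noisy "minibatch" gradients of f(w) = ||w||^2.
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 3.0])
velocity = np.zeros_like(w)
for _ in range(500):
    grad = 2 * w + 0.01 * rng.standard_normal(w.shape)
    w, velocity = sgd_momentum_step(w, grad, velocity)
print(w)  # fluctuates near the origin at the scale of the gradient noise
```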
Research on Gradient Optimizers
- Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, Yutaka Matsuo, 5 Nov 2024, ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate, Neural Information Processing Systems (NeurIPS 2024), https://arxiv.org/abs/2411.02853 https://github.com/iShohei220/adopt
- Diederik P. Kingma, Jimmy Ba, 30 Jan 2017 (v9), Adam: A Method for Stochastic Optimization, https://arxiv.org/abs/1412.6980
- Jun-Kun Wang, Xiaoyun Li, Belhal Karimi, Ping Li, 3 Nov 2020 (v3), An Optimistic Acceleration of AMSGrad for Nonconvex Optimization, https://arxiv.org/abs/1903.01435
- Tran Thi Phuong, Le Trieu Phong, 31 Oct 2019 (v4), On the Convergence Proof of AMSGrad and a New Version, https://arxiv.org/abs/1904.03590
- Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, 19 Apr 2019, On the Convergence of Adam and Beyond, https://arxiv.org/abs/1904.09237
- Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky, 2012, Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude, https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
- Mahesh Chandra Mukkamala, Matthias Hein, 28 Nov 2017 (v2), Variants of RMSProp and Adagrad with Logarithmic Regret Bounds, https://arxiv.org/abs/1706.05507
- Thomas Kurbiel, Shahrzad Khaleghian, 6 Aug 2017, Training of Deep Neural Networks based on Distance Measures using RMSProp, https://arxiv.org/abs/1708.01911
- Mohammad Emtiyaz Khan, Zuozhu Liu, Voot Tangkaratt, Yarin Gal, 4 Dec 2017, Vprop: Variational Inference using RMSprop, https://arxiv.org/abs/1712.01038
- Soham De, Anirbit Mukherjee, Enayat Ullah, 20 Nov 2018 (v3), Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration, https://arxiv.org/abs/1807.06766
- Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu, 25 Jun 2019 (v3), A Sufficient Condition for Convergences of Adam and RMSProp, https://arxiv.org/abs/1811.09358
- Huan Li, Zhouchen Lin, 15 Apr 2024 (v3), On the O(√d/T^(1/4)) Convergence Rate of RMSProp and Its Momentum Extension Measured by ℓ1 Norm, https://arxiv.org/abs/2402.00389
- Qi Zhang, Yi Zhou, Shaofeng Zou, 3 Apr 2024 (v2), Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance, https://arxiv.org/abs/2404.01436
- Bilel Bensaid (CEA-CESTA, IMB), Gaël Poëtte (CEA-CESTA), Rodolphe Turpault (IMB), 22 Jul 2024, Convergence of the Iterates for Momentum and RMSProp for Local Smooth Functions: Adaptation is the Key, https://arxiv.org/abs/2407.15471
- Patrick McNamee, Zahra Nili Ahmadabadi, 18 Sep 2024, Adaptive Extremum Seeking Control via the RMSprop Optimizer, https://arxiv.org/abs/2409.12290
- Rachel Ward, Xiaoxia Wu, Leon Bottou, 19 Apr 2021 (v8), AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, https://arxiv.org/abs/1806.01811
- Li Shen, Congliang Chen, Fangyu Zou, Zequn Jie, Ju Sun, Wei Liu, 15 May 2023 (v4), A Unified Analysis of AdaGrad with Weighted Aggregation and Momentum Acceleration, https://arxiv.org/abs/1808.03408
- Qian Qian, Xiaoyuan Qian, 9 Jun 2019, The Implicit Bias of AdaGrad on Separable Data, https://arxiv.org/abs/1906.03559
- Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier, 17 Oct 2022 (v3), A Simple Convergence Proof of Adam and Adagrad, https://arxiv.org/abs/2003.02395
- Peter Kairouz, Mónica Ribero, Keith Rush, Abhradeep Thakurta, 30 Jan 2021 (v2), Fast Dimension Independent Private AdaGrad on Publicly Estimated Subspaces, https://arxiv.org/abs/2008.06570
- Cheik Traoré, Edouard Pauwels, 13 Apr 2021 (v3), Sequential convergence of AdaGrad algorithm for smooth convex optimization, https://arxiv.org/abs/2011.12341
- Benjamin Dubois-Taine, Sharan Vaswani, Reza Babanezhad, Mark Schmidt, Simon Lacoste-Julien, 3 Nov 2021 (v2), SVRG Meets AdaGrad: Painless Variance Reduction, https://arxiv.org/abs/2102.09645
- Kushal Chakrabarti, Nikhil Chopra, 30 Sep 2021 (v2), Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective, https://arxiv.org/abs/2106.00092
- Luofeng Liao, Li Shen, Jia Duan, Mladen Kolar, Dacheng Tao, 23 Sep 2022 (v2), Local AdaGrad-Type Algorithm for Stochastic Convex-Concave Optimization, https://arxiv.org/abs/2106.10022
- Ruinan Jin, Yu Xing, Xingkang He, 26 Jan 2022, On the Convergence of mSGD and AdaGrad for Stochastic Optimization, https://arxiv.org/abs/2201.11204
- Ali Kavis, Kfir Yehuda Levy, Volkan Cevher, 6 Apr 2022, High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize, https://arxiv.org/abs/2204.02833
- Zijian Liu, Ta Duy Nguyen, Alina Ene, Huy L. Nguyen, 4 Oct 2023 (v4), On the Convergence of AdaGrad(Norm) on ℝ^d: Beyond Convexity, Non-Asymptotic Rate and Acceleration, https://arxiv.org/abs/2209.14827
- Amit Attia, Tomer Koren, 11 Jun 2023 (v2), SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance, https://arxiv.org/abs/2302.08783
- R. Selvaraj, T. Satheesh, V. Suresh, V. Yathavaraj, 30 Apr 2023, Optimized Machine Learning for CHD Detection using 3D CNN-based Segmentation, Transfer Learning and Adagrad Optimization, https://arxiv.org/abs/2305.00411
- Bohan Wang, Huishuai Zhang, Zhi-Ming Ma, Wei Chen, 28 Sep 2023 (v2), Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and Relaxed Assumptions, https://arxiv.org/abs/2305.18471
- Yusu Hong, Junhong Lin, 13 Sep 2024 (v2), Revisiting Convergence of AdaGrad with Relaxed Assumptions, https://arxiv.org/abs/2402.13794
- Sayantan Choudhury, Nazarii Tupitsa, Nicolas Loizou, Samuel Horvath, Martin Takac, Eduard Gorbunov, 5 Jun 2024 (v2), Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad, https://arxiv.org/abs/2403.02648
- Antoine Godichon-Baggioni (LPSM (UMR 8001)), Wei Lu (LMI), Bruno Portier (LMI), 3 May 2024, A Full Adagrad algorithm with O(Nd) operations, https://arxiv.org/abs/2405.01908
- Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov, 6 Jun 2024, Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed, https://arxiv.org/abs/2406.04443
- Anton Rodomanov, Xiaowen Jiang, Sebastian Stich, 10 Jun 2024, Universality of AdaGrad Stepsizes for Stochastic Optimization: Inexact Oracle, Acceleration and Variance Reduction, https://arxiv.org/abs/2406.06398
- Yuxing Liu, Rui Pan, Tong Zhang, 14 Oct 2024 (v2), AdaGrad under Anisotropic Smoothness, https://arxiv.org/abs/2406.15244
- Serge Gratton, Sadok Jerad, Philippe L. Toint, 1 Nov 2024 (v3), Complexity of Adagrad and other first-order methods for nonconvex optimization problems with bounds constraints, https://arxiv.org/abs/2406.15793
- Ruinan Jin, Xiaoyu Wang, Baoxiang Wang, 8 Sep 2024, Asymptotic and Non-Asymptotic Convergence Analysis of AdaGrad for Non-Convex Optimization via Novel Stopping Time-based Analysis, https://arxiv.org/abs/2409.05023
- Matthew D. Zeiler, 22 Dec 2012, ADADELTA: An Adaptive Learning Rate Method, https://arxiv.org/abs/1212.5701
- Sebastian Bock, Josef Goppold, Martin Weiß, 27 Apr 2018, An improvement of the convergence proof of the ADAM-Optimizer, https://arxiv.org/abs/1804.10587
- Jiawei Zhang, Fisher B. Gouza, 10 Mar 2019 (v2), GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization, https://arxiv.org/abs/1805.07500
- Jiawei Zhang, 11 Mar 2019, Gradient Descent based Optimization Algorithms for Deep Learning Models Training, https://arxiv.org/abs/1903.03614
- Xiangyi Chen, Sijia Liu, Ruoyu Sun, Mingyi Hong, 10 Mar 2019 (v2), On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization, https://arxiv.org/abs/1808.02941
- Remi Genet, Hugo Inzirillo, 31 Oct 2024, CaAdam: Improving Adam optimizer using connection aware methods, https://arxiv.org/abs/2410.24216
- R Abdulkadirov, P Lyakhov, N Nagornov - Mathematics, 2023, Survey of optimization algorithms in modern neural networks, https://doi.org/10.3390/math11112466 https://www.mdpi.com/2227-7390/11/11/2466
- Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training (a review of the challenges in synchronous distributed training and the best solutions for stragglers and high latency), https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
- Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz, 21 Mar 2017 (v3), Revisiting Distributed Synchronous SGD, https://arxiv.org/abs/1604.00981
- Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma, 29 Nov 2024, DeMo: Decoupled Momentum Optimization, https://arxiv.org/abs/2411.19870 https://github.com/bloc97/DeMo (An extension of the Adam optimizer that greatly reduces network communication during training.)
- Shaowen Wang, Anan Liu, Jian Xiao, Huan Liu, Yuekui Yang, Cong Xu, Qianqian Pu, Suncong Zheng, Wei Zhang, Jian Li, 29 Nov 2024, CAdam: Confidence-Based Optimization for Online Learning, https://arxiv.org/abs/2411.19647
- Abulikemu Abuduweili, Changliu Liu, 3 Dec 2024, Revisiting the Initial Steps in Adaptive Gradient Descent Optimization, https://arxiv.org/abs/2412.02153
- Kwangryeol Park, Seulki Lee, 12 Dec 2024, SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization, https://arxiv.org/abs/2412.08894 (Makes the Adam gradient optimizer memory-efficient via low-rank matrix factorization of its momentum state.)
- Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen, 17 Dec 2024 (v2), No More Adam: Learning Rate Scaling at Initialization is All You Need, https://arxiv.org/abs/2412.11768
- Wenhan Jiang, Jinlan Liu, Naimin Zhang, Dongpo Xu, DMAdam: Dual averaging enhanced adaptive gradient method for deep neural networks, Knowledge-Based Systems, 2024, 112886, ISSN 0950-7051, https://doi.org/10.1016/j.knosys.2024.112886 https://www.sciencedirect.com/science/article/abs/pii/S095070512401520X
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- O. F. Razzouki, A. Charroud, Z. E. Allali, A. Chetouani and N. Aslimani, "A Survey of Advanced Gradient Methods in Machine Learning," 2024 7th International Conference on Advanced Communication Technologies and Networking (CommNet), Rabat, Morocco, 2024, pp. 1-7, doi: 10.1109/CommNet63022.2024.10793249. https://ieeexplore.ieee.org/abstract/document/10793249
- Shubhankar Bhakta, Utpal Nandi, Chiranjit Changdar, Bachchu Paul, Tapas Si, Rajat Kumar Pal, aMacP: An adaptive optimization algorithm for Deep Neural Network, Neurocomputing, Volume 620, 2025, 129242, ISSN 0925-2312, https://doi.org/10.1016/j.neucom.2024.129242 https://www.sciencedirect.com/science/article/abs/pii/S0925231224020137
- Yuanzhe Tao, Huizhuo Yuan, Xun Zhou, Yuan Cao, Quanquan Gu, 27 Dec 2024, Towards Simple and Provable Parameter-Free Adaptive Gradient Methods, https://arxiv.org/abs/2412.19444
- Y. Li et al., 2025, "Q-DADAM: A Quantized Distributed Online Optimization Algorithm With Adaptive Momentum," in IEEE Transactions on Control of Network Systems, doi: 10.1109/TCNS.2025.3526555. https://ieeexplore.ieee.org/abstract/document/10830565
- Jing Wang, Anna Choromanska, 24 Jan 2025, A Survey of Optimization Methods for Training DL Models: Theoretical Perspective on Convergence and Generalization, https://arxiv.org/abs/2501.14458
- Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, Zhiyuan Li, 13 Mar 2025, Structured Preconditioners in Adaptive Optimization: A Unified Analysis, https://arxiv.org/abs/2503.10537
- Michael Nuñez, July 11, 2025, Moonshot AI’s Kimi K2 outperforms GPT-4 in key benchmarks — and it’s free, https://venturebeat.com/ai/moonshot-ais-kimi-k2-outperforms-gpt-4-in-key-benchmarks-and-its-free/ (A one-trillion-parameter MoE model with 32B parameters activated per token. Examines the new MuonClip training optimizer as more efficient and more stable than AdamW variants.)
- Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, Zhilin Yang, 24 Feb 2025, Muon is Scalable for LLM Training, https://arxiv.org/abs/2502.16982
- Tim Tsz-Kit Lau, Qi Long, Weijie Su, 2 Aug 2025, PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective, https://arxiv.org/abs/2505.21799
- Martin Pelikan, Sheikh Shams Azam, Vitaly Feldman, Jan "Honza" Silovsky, Kunal Talwar, Christopher G. Brinton, Tatiana Likhomanenko, 14 Aug 2025, Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers and Gradient Clipping, https://arxiv.org/abs/2310.00098
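
Several of the unified analyses above (e.g., the structured-preconditioner and parameter-free papers) view AdaGrad, RMSprop, and Adam-type methods as preconditioned SGD, w ← w − lr · g / P, differing only in how the diagonal preconditioner P is accumulated. The sketch below illustrates that framing only; the helper names and the toy quadratic are assumptions for exposition, not code from any cited work.

```python
import numpy as np

def run(precond_update, steps=2000, lr=0.1):
    """Diagonal-preconditioned SGD, w <- w - lr * g / P, on f(w) = ||w||^2.
    precond_update(state, g) returns (new_state, diagonal preconditioner P)."""
    w = np.array([1.0, -2.0, 3.0])
    state = np.zeros_like(w)
    for _ in range(steps):
        g = 2 * w
        state, P = precond_update(state, g)
        w = w - lr * g / P
    return w

def adagrad_precond(state, g, eps=1e-8):
    state = state + g ** 2                        # cumulative sum of squared gradients
    return state, np.sqrt(state) + eps

def rmsprop_precond(state, g, alpha=0.9, eps=1e-8):
    state = alpha * state + (1 - alpha) * g ** 2  # exponential moving average
    return state, np.sqrt(state) + eps

print(run(adagrad_precond), run(rmsprop_precond))  # both move toward the origin
```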