Aussie AI
Gradient Optimizer Research
Last Updated 29 August, 2025
by David Spuler, Ph.D.
ADAM
- Diederik P. Kingma, Jimmy Ba, 30 Jan 2017 (v9), Adam: A Method for Stochastic Optimization, https://arxiv.org/abs/1412.6980
- Soham De, Anirbit Mukherjee, Enayat Ullah, 20 Nov 2018 (v3), Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration, https://arxiv.org/abs/1807.06766
- Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu, 25 Jun 2019 (v3), A Sufficient Condition for Convergences of Adam and RMSProp, https://arxiv.org/abs/1811.09358
- Qi Zhang, Yi Zhou, Shaofeng Zou, 3 Apr 2024 (v2), Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance, https://arxiv.org/abs/2404.01436
- Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier, 17 Oct 2022 (v3), A Simple Convergence Proof of Adam and Adagrad, https://arxiv.org/abs/2003.02395
- Kushal Chakrabarti, Nikhil Chopra, 30 Sep 2021 (v2), Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective, https://arxiv.org/abs/2106.00092
- Sebastian Bock, Josef Goppold, Martin Weiß, 27 Apr 2018, An improvement of the convergence proof of the ADAM-Optimizer, https://arxiv.org/abs/1804.10587
- Jiawei Zhang, Fisher B. Gouza, 10 Mar 2019 (v2), GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization, https://arxiv.org/abs/1805.07500
- Xiangyi Chen, Sijia Liu, Ruoyu Sun, Mingyi Hong, 10 Mar 2019 (v2), On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization, https://arxiv.org/abs/1808.02941
- Remi Genet, Hugo Inzirillo, 31 Oct 2024, CaAdam: Improving Adam optimizer using connection aware methods, https://arxiv.org/abs/2410.24216
- R Abdulkadirov, P Lyakhov, N Nagornov - Mathematics, 2023, Survey of optimization algorithms in modern neural networks, https://doi.org/10.3390/math11112466 https://www.mdpi.com/2227-7390/11/11/2466
- Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma, 29 Nov 2024, DeMo: Decoupled Momentum Optimization, https://arxiv.org/abs/2411.19870 https://github.com/bloc97/DeMo (An extension of the Adam optimizer that greatly reduces network communication during training.)
- Shaowen Wang, Anan Liu, Jian Xiao, Huan Liu, Yuekui Yang, Cong Xu, Qianqian Pu, Suncong Zheng, Wei Zhang, Jian Li, 29 Nov 2024, CAdam: Confidence-Based Optimization for Online Learning, https://arxiv.org/abs/2411.19647
- Kwangryeol Park, Seulki Lee, 12 Dec 2024, SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization, https://arxiv.org/abs/2412.08894 (Makes the Adam gradient optimizer memory-efficient via low-rank matrix factorization of its momentum state.)
- Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen, 17 Dec 2024 (v2), No More Adam: Learning Rate Scaling at Initialization is All You Need, https://arxiv.org/abs/2412.11768
- Wenhan Jiang, Jinlan Liu, Naimin Zhang, Dongpo Xu, DMAdam: Dual averaging enhanced adaptive gradient method for deep neural networks, Knowledge-Based Systems, 2024, 112886, ISSN 0950-7051, https://doi.org/10.1016/j.knosys.2024.112886 https://www.sciencedirect.com/science/article/abs/pii/S095070512401520X
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- Yuanzhe Tao, Huizhuo Yuan, Xun Zhou, Yuan Cao, Quanquan Gu, 27 Dec 2024, Towards Simple and Provable Parameter-Free Adaptive Gradient Methods, https://arxiv.org/abs/2412.19444
- Y. Li et al., 2025, "Q-DADAM: A Quantized Distributed Online Optimization Algorithm With Adaptive Momentum," in IEEE Transactions on Control of Network Systems, doi: 10.1109/TCNS.2025.3526555. https://ieeexplore.ieee.org/abstract/document/10830565
- Michael Nuñez, July 11, 2025, Moonshot AI’s Kimi K2 outperforms GPT-4 in key benchmarks — and it’s free, https://venturebeat.com/ai/moonshot-ais-kimi-k2-outperforms-gpt-4-in-key-benchmarks-and-its-free/ (A one-trillion-parameter MoE model with 32B parameters activated per token. Examines the new MuonClip training optimizer as more efficient and more stable than AdamW variants.)
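
For quick reference alongside the papers above, the core Adam update of Kingma & Ba maintains exponential moving averages of the gradient and its square, applies bias correction, and scales each coordinate's step individually. Below is a minimal NumPy sketch, not a reference implementation; the toy quadratic objective and the hyperparameter values are illustrative only.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: EMA first/second moment estimates plus bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (EMA of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (EMA of squared gradients)
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t counts from 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate adaptive step
    return w, m, v

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0, 3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(w)  # approaches the minimizer at the origin
```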
RMSprop
- Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky, 2012, Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude, https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
- Mahesh Chandra Mukkamala, Matthias Hein, 28 Nov 2017 (v2), Variants of RMSProp and Adagrad with Logarithmic Regret Bounds, https://arxiv.org/abs/1706.05507
- Thomas Kurbiel, Shahrzad Khaleghian, 6 Aug 2017, Training of Deep Neural Networks based on Distance Measures using RMSProp, https://arxiv.org/abs/1708.01911
- Mohammad Emtiyaz Khan, Zuozhu Liu, Voot Tangkaratt, Yarin Gal, 4 Dec 2017, Vprop: Variational Inference using RMSprop, https://arxiv.org/abs/1712.01038
- Soham De, Anirbit Mukherjee, Enayat Ullah, 20 Nov 2018 (v3), Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration, https://arxiv.org/abs/1807.06766
- Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu, 25 Jun 2019 (v3), A Sufficient Condition for Convergences of Adam and RMSProp, https://arxiv.org/abs/1811.09358
- Huan Li, Zhouchen Lin, 15 Apr 2024 (v3), On the O(√d/T^(1/4)) Convergence Rate of RMSProp and Its Momentum Extension Measured by ℓ1 Norm, https://arxiv.org/abs/2402.00389
- Qi Zhang, Yi Zhou, Shaofeng Zou, 3 Apr 2024 (v2), Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance, https://arxiv.org/abs/2404.01436
- Bilel Bensaid (CEA-CESTA, IMB), Gaël Poëtte (CEA-CESTA), Rodolphe Turpault (IMB), 22 Jul 2024, Convergence of the Iterates for Momentum and RMSProp for Local Smooth Functions: Adaptation is the Key, https://arxiv.org/abs/2407.15471
- Patrick McNamee, Zahra Nili Ahmadabadi, 18 Sep 2024, Adaptive Extremum Seeking Control via the RMSprop Optimizer, https://arxiv.org/abs/2409.12290
- R Abdulkadirov, P Lyakhov, N Nagornov - Mathematics, 2023, Survey of optimization algorithms in modern neural networks, https://doi.org/10.3390/math11112466 https://www.mdpi.com/2227-7390/11/11/2466
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
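
As background for the entries above, RMSprop (Hinton et al., lecture 6e) divides each gradient coordinate by a running root-mean-square of its recent magnitudes. A minimal NumPy sketch follows; the decay rate, learning rate, and toy objective are illustrative assumptions, not values taken from any of the papers listed.

```python
import numpy as np

def rmsprop_step(w, grad, sq_avg, lr=0.01, alpha=0.9, eps=1e-8):
    """One RMSprop step: divide the gradient by a running RMS of its magnitude."""
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2   # EMA of squared gradients
    w = w - lr * grad / (np.sqrt(sq_avg) + eps)
    return w, sq_avg

# Toy usage on f(w) = ||w||^2 (gradient 2w).
w = np.array([1.0, -2.0, 3.0])
sq_avg = np.zeros_like(w)
for _ in range(1000):
    w, sq_avg = rmsprop_step(w, 2 * w, sq_avg, lr=0.05)
print(w)  # settles near the origin, oscillating within roughly one step size
```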
AdaDelta
- Matthew D. Zeiler, 22 Dec 2012, ADADELTA: An Adaptive Learning Rate Method, https://arxiv.org/abs/1212.5701
- R Abdulkadirov, P Lyakhov, N Nagornov - Mathematics, 2023, Survey of optimization algorithms in modern neural networks, https://doi.org/10.3390/math11112466 https://www.mdpi.com/2227-7390/11/11/2466
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
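
For context, AdaDelta (Zeiler, above) removes the explicit learning rate: each step is scaled by the ratio of the RMS of recent parameter updates to the RMS of recent gradients. A minimal NumPy sketch under the same toy quadratic assumption as the other sketches on this page; the ρ and ε values are illustrative.

```python
import numpy as np

def adadelta_step(w, grad, sq_grad, sq_delta, rho=0.95, eps=1e-6):
    """One AdaDelta step: step size is RMS(past updates) / RMS(past gradients)."""
    sq_grad = rho * sq_grad + (1 - rho) * grad ** 2             # EMA of squared gradients
    delta = -np.sqrt(sq_delta + eps) / np.sqrt(sq_grad + eps) * grad
    sq_delta = rho * sq_delta + (1 - rho) * delta ** 2          # EMA of squared updates
    return w + delta, sq_grad, sq_delta

# Toy usage on f(w) = ||w||^2 (gradient 2w).
w = np.array([1.0, -2.0, 3.0])
sq_grad, sq_delta = np.zeros_like(w), np.zeros_like(w)
for _ in range(5000):
    w, sq_grad, sq_delta = adadelta_step(w, 2 * w, sq_grad, sq_delta)
print(w)  # moves toward the origin (AdaDelta ramps up slowly from a zero state)
```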
AdaGrad
- Mahesh Chandra Mukkamala, Matthias Hein, 28 Nov 2017 (v2), Variants of RMSProp and Adagrad with Logarithmic Regret Bounds, https://arxiv.org/abs/1706.05507
- Rachel Ward, Xiaoxia Wu, Leon Bottou, 19 Apr 2021 (v8), AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, https://arxiv.org/abs/1806.01811
- Li Shen, Congliang Chen, Fangyu Zou, Zequn Jie, Ju Sun, Wei Liu, 15 May 2023 (v4), A Unified Analysis of AdaGrad with Weighted Aggregation and Momentum Acceleration, https://arxiv.org/abs/1808.03408
- Qian Qian, Xiaoyuan Qian, 9 Jun 2019, The Implicit Bias of AdaGrad on Separable Data, https://arxiv.org/abs/1906.03559
- Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier, 17 Oct 2022 (v3), A Simple Convergence Proof of Adam and Adagrad, https://arxiv.org/abs/2003.02395
- Peter Kairouz, Mónica Ribero, Keith Rush, Abhradeep Thakurta, 30 Jan 2021 (v2), Fast Dimension Independent Private AdaGrad on Publicly Estimated Subspaces, https://arxiv.org/abs/2008.06570
- Cheik Traoré, Edouard Pauwels, 13 Apr 2021 (v3), Sequential convergence of AdaGrad algorithm for smooth convex optimization, https://arxiv.org/abs/2011.12341
- Benjamin Dubois-Taine, Sharan Vaswani, Reza Babanezhad, Mark Schmidt, Simon Lacoste-Julien, 3 Nov 2021 (v2), SVRG Meets AdaGrad: Painless Variance Reduction, https://arxiv.org/abs/2102.09645
- Kushal Chakrabarti, Nikhil Chopra, 30 Sep 2021 (v2), Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective, https://arxiv.org/abs/2106.00092
- Luofeng Liao, Li Shen, Jia Duan, Mladen Kolar, Dacheng Tao, 23 Sep 2022 (v2), Local AdaGrad-Type Algorithm for Stochastic Convex-Concave Optimization, https://arxiv.org/abs/2106.10022
- Ruinan Jin, Yu Xing, Xingkang He, 26 Jan 2022, On the Convergence of mSGD and AdaGrad for Stochastic Optimization, https://arxiv.org/abs/2201.11204
- Ali Kavis, Kfir Yehuda Levy, Volkan Cevher, 6 Apr 2022, High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize, https://arxiv.org/abs/2204.02833
- Zijian Liu, Ta Duy Nguyen, Alina Ene, Huy L. Nguyen, 4 Oct 2023 (v4), On the Convergence of AdaGrad(Norm) on ℝ^d: Beyond Convexity, Non-Asymptotic Rate and Acceleration, https://arxiv.org/abs/2209.14827
- Amit Attia, Tomer Koren, 11 Jun 2023 (v2), SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance, https://arxiv.org/abs/2302.08783
- R. Selvaraj, T. Satheesh, V. Suresh, V. Yathavaraj, 30 Apr 2023, Optimized Machine Learning for CHD Detection using 3D CNN-based Segmentation, Transfer Learning and Adagrad Optimization, https://arxiv.org/abs/2305.00411
- Bohan Wang, Huishuai Zhang, Zhi-Ming Ma, Wei Chen, 28 Sep 2023 (v2), Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and Relaxed Assumptions, https://arxiv.org/abs/2305.18471
- Yusu Hong, Junhong Lin, 13 Sep 2024 (v2), Revisiting Convergence of AdaGrad with Relaxed Assumptions, https://arxiv.org/abs/2402.13794
- Sayantan Choudhury, Nazarii Tupitsa, Nicolas Loizou, Samuel Horvath, Martin Takac, Eduard Gorbunov, 5 Jun 2024 (v2), Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad, https://arxiv.org/abs/2403.02648
- Antoine Godichon-Baggioni (LPSM (UMR 8001)), Wei Lu (LMI), Bruno Portier (LMI), 3 May 2024, A Full Adagrad algorithm with O(Nd) operations, https://arxiv.org/abs/2405.01908
- Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov, 6 Jun 2024, Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed, https://arxiv.org/abs/2406.04443
- Anton Rodomanov, Xiaowen Jiang, Sebastian Stich, 10 Jun 2024, Universality of AdaGrad Stepsizes for Stochastic Optimization: Inexact Oracle, Acceleration and Variance Reduction, https://arxiv.org/abs/2406.06398
- Yuxing Liu, Rui Pan, Tong Zhang, 14 Oct 2024 (v2), AdaGrad under Anisotropic Smoothness, https://arxiv.org/abs/2406.15244
- Serge Gratton, Sadok Jerad, Philippe L. Toint, 1 Nov 2024 (v3), Complexity of Adagrad and other first-order methods for nonconvex optimization problems with bounds constraints, https://arxiv.org/abs/2406.15793
- Ruinan Jin, Xiaoyu Wang, Baoxiang Wang, 8 Sep 2024, Asymptotic and Non-Asymptotic Convergence Analysis of AdaGrad for Non-Convex Optimization via Novel Stopping Time-based Analysis, https://arxiv.org/abs/2409.05023
- R Abdulkadirov, P Lyakhov, N Nagornov - Mathematics, 2023, Survey of optimization algorithms in modern neural networks, https://doi.org/10.3390/math11112466 https://www.mdpi.com/2227-7390/11/11/2466
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- Yuanzhe Tao, Huizhuo Yuan, Xun Zhou, Yuan Cao, Quanquan Gu, 27 Dec 2024, Towards Simple and Provable Parameter-Free Adaptive Gradient Methods, https://arxiv.org/abs/2412.19444
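
The AdaGrad family analyzed in the papers above shrinks each coordinate's step size by the square root of that coordinate's accumulated squared gradients (Duchi, Hazan & Singer, 2011). A minimal NumPy sketch on a toy quadratic; the learning rate and ε are illustrative choices.

```python
import numpy as np

def adagrad_step(w, grad, sq_sum, lr=0.5, eps=1e-8):
    """One AdaGrad step: per-coordinate steps decay with accumulated squared gradients."""
    sq_sum = sq_sum + grad ** 2                      # cumulative (not decayed) sum
    w = w - lr * grad / (np.sqrt(sq_sum) + eps)
    return w, sq_sum

# Toy usage on f(w) = ||w||^2 (gradient 2w).
w = np.array([1.0, -2.0, 3.0])
sq_sum = np.zeros_like(w)
for _ in range(2000):
    w, sq_sum = adagrad_step(w, 2 * w, sq_sum)
print(w)  # approaches the origin; effective step sizes shrink over time
```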
AMSGrad
- Jun-Kun Wang, Xiaoyun Li, Belhal Karimi, Ping Li, 3 Nov 2020 (v3), An Optimistic Acceleration of AMSGrad for Nonconvex Optimization, https://arxiv.org/abs/1903.01435
- Tran Thi Phuong, Le Trieu Phong, 31 Oct 2019 (v4), On the Convergence Proof of AMSGrad and a New Version, https://arxiv.org/abs/1904.03590
- Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, 19 Apr 2019, On the Convergence of Adam and Beyond, https://arxiv.org/abs/1904.09237
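
AMSGrad (Reddi et al., above) modifies Adam by normalizing with the running maximum of the second-moment estimate, so the effective per-coordinate step size never increases; this is the fix for the Adam non-convergence counterexample in that paper. Below is a minimal NumPy sketch in the simplified form without bias correction; the hyperparameters and toy objective are illustrative.

```python
import numpy as np

def amsgrad_step(w, grad, m, v, v_max, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step: like Adam, but normalize by the max of past second moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)                 # the key difference from Adam
    w = w - lr * m / (np.sqrt(v_max) + eps)
    return w, m, v, v_max

# Toy usage on f(w) = ||w||^2 (gradient 2w).
w = np.array([1.0, -2.0, 3.0])
m, v, v_max = np.zeros_like(w), np.zeros_like(w), np.zeros_like(w)
for _ in range(2000):
    w, m, v, v_max = amsgrad_step(w, 2 * w, m, v, v_max)
print(w)  # approaches the origin
```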
Stochastic Gradient Descent (SGD)
- Amit Attia, Tomer Koren, 11 Jun 2023 (v2), SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance, https://arxiv.org/abs/2302.08783
- R Abdulkadirov, P Lyakhov, N Nagornov - Mathematics, 2023, Survey of optimization algorithms in modern neural networks, https://doi.org/10.3390/math11112466 https://www.mdpi.com/2227-7390/11/11/2466
- Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training (a review of the challenges in synchronous distributed training and the best solutions for stragglers and high latency), https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
- Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz, 21 Mar 2017 (v3), Revisiting Distributed Synchronous SGD, https://arxiv.org/abs/1604.00981
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
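
Plain SGD, the baseline against which the adaptive methods above are compared, takes a step along a noisy minibatch gradient, usually combined with momentum. A minimal NumPy sketch with classical (heavy-ball) momentum; the noise model, learning rate, and momentum value are illustrative assumptions.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD step with classical (heavy-ball) momentum."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy usage: noisy "minibatch" gradients of f(w) = ||w||^2.
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 3.0])
velocity = np.zeros_like(w)
for _ in range(500):
    grad = 2 * w + 0.01 * rng.standard_normal(w.shape)
    w, velocity = sgd_momentum_step(w, grad, velocity)
print(w)  # fluctuates near the origin at the scale of the gradient noise
```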
Research on Gradient Optimizers
- Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, Yutaka Matsuo, 5 Nov 2024, ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate, Neural Information Processing Systems (NeurIPS 2024), https://arxiv.org/abs/2411.02853 https://github.com/iShohei220/adopt
- Diederik P. Kingma, Jimmy Ba, 30 Jan 2017 (v9), Adam: A Method for Stochastic Optimization, https://arxiv.org/abs/1412.6980
- Jun-Kun Wang, Xiaoyun Li, Belhal Karimi, Ping Li, 3 Nov 2020 (v3), An Optimistic Acceleration of AMSGrad for Nonconvex Optimization, https://arxiv.org/abs/1903.01435
- Tran Thi Phuong, Le Trieu Phong, 31 Oct 2019 (v4), On the Convergence Proof of AMSGrad and a New Version, https://arxiv.org/abs/1904.03590
- Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, 19 Apr 2019, On the Convergence of Adam and Beyond, https://arxiv.org/abs/1904.09237
- Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky, 2012, Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude, https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
- Mahesh Chandra Mukkamala, Matthias Hein, 28 Nov 2017 (v2), Variants of RMSProp and Adagrad with Logarithmic Regret Bounds, https://arxiv.org/abs/1706.05507
- Thomas Kurbiel, Shahrzad Khaleghian, 6 Aug 2017, Training of Deep Neural Networks based on Distance Measures using RMSProp, https://arxiv.org/abs/1708.01911
- Mohammad Emtiyaz Khan, Zuozhu Liu, Voot Tangkaratt, Yarin Gal, 4 Dec 2017, Vprop: Variational Inference using RMSprop, https://arxiv.org/abs/1712.01038
- Soham De, Anirbit Mukherjee, Enayat Ullah, 20 Nov 2018 (v3), Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration, https://arxiv.org/abs/1807.06766
- Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu, 25 Jun 2019 (v3), A Sufficient Condition for Convergences of Adam and RMSProp, https://arxiv.org/abs/1811.09358
- Huan Li, Zhouchen Lin, 15 Apr 2024 (v3), On the O(√d/T^(1/4)) Convergence Rate of RMSProp and Its Momentum Extension Measured by ℓ1 Norm, https://arxiv.org/abs/2402.00389
- Qi Zhang, Yi Zhou, Shaofeng Zou, 3 Apr 2024 (v2), Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance, https://arxiv.org/abs/2404.01436
- Bilel Bensaid (CEA-CESTA, IMB), Gaël Poëtte (CEA-CESTA), Rodolphe Turpault (IMB), 22 Jul 2024, Convergence of the Iterates for Momentum and RMSProp for Local Smooth Functions: Adaptation is the Key, https://arxiv.org/abs/2407.15471
- Patrick McNamee, Zahra Nili Ahmadabadi, 18 Sep 2024, Adaptive Extremum Seeking Control via the RMSprop Optimizer, https://arxiv.org/abs/2409.12290
- Rachel Ward, Xiaoxia Wu, Leon Bottou, 19 Apr 2021 (v8), AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, https://arxiv.org/abs/1806.01811
- Li Shen, Congliang Chen, Fangyu Zou, Zequn Jie, Ju Sun, Wei Liu, 15 May 2023 (v4), A Unified Analysis of AdaGrad with Weighted Aggregation and Momentum Acceleration, https://arxiv.org/abs/1808.03408
- Qian Qian, Xiaoyuan Qian, 9 Jun 2019, The Implicit Bias of AdaGrad on Separable Data, https://arxiv.org/abs/1906.03559
- Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier, 17 Oct 2022 (v3), A Simple Convergence Proof of Adam and Adagrad, https://arxiv.org/abs/2003.02395
- Peter Kairouz, Mónica Ribero, Keith Rush, Abhradeep Thakurta, 30 Jan 2021 (v2), Fast Dimension Independent Private AdaGrad on Publicly Estimated Subspaces, https://arxiv.org/abs/2008.06570
- Cheik Traoré, Edouard Pauwels, 13 Apr 2021 (v3), Sequential convergence of AdaGrad algorithm for smooth convex optimization, https://arxiv.org/abs/2011.12341
- Benjamin Dubois-Taine, Sharan Vaswani, Reza Babanezhad, Mark Schmidt, Simon Lacoste-Julien, 3 Nov 2021 (v2), SVRG Meets AdaGrad: Painless Variance Reduction, https://arxiv.org/abs/2102.09645
- Kushal Chakrabarti, Nikhil Chopra, 30 Sep 2021 (v2), Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective, https://arxiv.org/abs/2106.00092
- Luofeng Liao, Li Shen, Jia Duan, Mladen Kolar, Dacheng Tao, 23 Sep 2022 (v2), Local AdaGrad-Type Algorithm for Stochastic Convex-Concave Optimization, https://arxiv.org/abs/2106.10022
- Ruinan Jin, Yu Xing, Xingkang He, 26 Jan 2022, On the Convergence of mSGD and AdaGrad for Stochastic Optimization, https://arxiv.org/abs/2201.11204
- Ali Kavis, Kfir Yehuda Levy, Volkan Cevher, 6 Apr 2022, High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize, https://arxiv.org/abs/2204.02833
- Zijian Liu, Ta Duy Nguyen, Alina Ene, Huy L. Nguyen, 4 Oct 2023 (v4), On the Convergence of AdaGrad(Norm) on ℝ^d: Beyond Convexity, Non-Asymptotic Rate and Acceleration, https://arxiv.org/abs/2209.14827
- Amit Attia, Tomer Koren, 11 Jun 2023 (v2), SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance, https://arxiv.org/abs/2302.08783
- R. Selvaraj, T. Satheesh, V. Suresh, V. Yathavaraj, 30 Apr 2023, Optimized Machine Learning for CHD Detection using 3D CNN-based Segmentation, Transfer Learning and Adagrad Optimization, https://arxiv.org/abs/2305.00411
- Bohan Wang, Huishuai Zhang, Zhi-Ming Ma, Wei Chen, 28 Sep 2023 (v2), Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and Relaxed Assumptions, https://arxiv.org/abs/2305.18471
- Yusu Hong, Junhong Lin, 13 Sep 2024 (v2), Revisiting Convergence of AdaGrad with Relaxed Assumptions, https://arxiv.org/abs/2402.13794
- Sayantan Choudhury, Nazarii Tupitsa, Nicolas Loizou, Samuel Horvath, Martin Takac, Eduard Gorbunov, 5 Jun 2024 (v2), Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad, https://arxiv.org/abs/2403.02648
- Antoine Godichon-Baggioni (LPSM (UMR 8001)), Wei Lu (LMI), Bruno Portier (LMI), 3 May 2024, A Full Adagrad algorithm with O(Nd) operations, https://arxiv.org/abs/2405.01908
- Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov, 6 Jun 2024, Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed, https://arxiv.org/abs/2406.04443
- Anton Rodomanov, Xiaowen Jiang, Sebastian Stich, 10 Jun 2024, Universality of AdaGrad Stepsizes for Stochastic Optimization: Inexact Oracle, Acceleration and Variance Reduction, https://arxiv.org/abs/2406.06398
- Yuxing Liu, Rui Pan, Tong Zhang, 14 Oct 2024 (v2), AdaGrad under Anisotropic Smoothness, https://arxiv.org/abs/2406.15244
- Serge Gratton, Sadok Jerad, Philippe L. Toint, 1 Nov 2024 (v3), Complexity of Adagrad and other first-order methods for nonconvex optimization problems with bounds constraints, https://arxiv.org/abs/2406.15793
- Ruinan Jin, Xiaoyu Wang, Baoxiang Wang, 8 Sep 2024, Asymptotic and Non-Asymptotic Convergence Analysis of AdaGrad for Non-Convex Optimization via Novel Stopping Time-based Analysis, https://arxiv.org/abs/2409.05023
- Matthew D. Zeiler, 22 Dec 2012, ADADELTA: An Adaptive Learning Rate Method, https://arxiv.org/abs/1212.5701
- Sebastian Bock, Josef Goppold, Martin Weiß, 27 Apr 2018, An improvement of the convergence proof of the ADAM-Optimizer, https://arxiv.org/abs/1804.10587
- Jiawei Zhang, Fisher B. Gouza, 10 Mar 2019 (v2), GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization, https://arxiv.org/abs/1805.07500
- Jiawei Zhang, 11 Mar 2019, Gradient Descent based Optimization Algorithms for Deep Learning Models Training, https://arxiv.org/abs/1903.03614
- Xiangyi Chen, Sijia Liu, Ruoyu Sun, Mingyi Hong, 10 Mar 2019 (v2), On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization, https://arxiv.org/abs/1808.02941
- Remi Genet, Hugo Inzirillo, 31 Oct 2024, CaAdam: Improving Adam optimizer using connection aware methods, https://arxiv.org/abs/2410.24216
- R Abdulkadirov, P Lyakhov, N Nagornov - Mathematics, 2023, Survey of optimization algorithms in modern neural networks, https://doi.org/10.3390/math11112466 https://www.mdpi.com/2227-7390/11/11/2466
- Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training (a review of the challenges in synchronous distributed training and the best solutions for stragglers and high latency), https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
- Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz, 21 Mar 2017 (v3), Revisiting Distributed Synchronous SGD, https://arxiv.org/abs/1604.00981
- Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma, 29 Nov 2024, DeMo: Decoupled Momentum Optimization, https://arxiv.org/abs/2411.19870 https://github.com/bloc97/DeMo (An extension of the Adam optimizer that greatly reduces network communication during training.)
- Shaowen Wang, Anan Liu, Jian Xiao, Huan Liu, Yuekui Yang, Cong Xu, Qianqian Pu, Suncong Zheng, Wei Zhang, Jian Li, 29 Nov 2024, CAdam: Confidence-Based Optimization for Online Learning, https://arxiv.org/abs/2411.19647
- Abulikemu Abuduweili, Changliu Liu, 3 Dec 2024, Revisiting the Initial Steps in Adaptive Gradient Descent Optimization, https://arxiv.org/abs/2412.02153
- Kwangryeol Park, Seulki Lee, 12 Dec 2024, SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization, https://arxiv.org/abs/2412.08894 (Makes the Adam gradient optimizer memory-efficient via low-rank matrix factorization of its momentum state.)
- Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen, 17 Dec 2024 (v2), No More Adam: Learning Rate Scaling at Initialization is All You Need, https://arxiv.org/abs/2412.11768
- Wenhan Jiang, Jinlan Liu, Naimin Zhang, Dongpo Xu, DMAdam: Dual averaging enhanced adaptive gradient method for deep neural networks, Knowledge-Based Systems, 2024, 112886, ISSN 0950-7051, https://doi.org/10.1016/j.knosys.2024.112886 https://www.sciencedirect.com/science/article/abs/pii/S095070512401520X
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- O. F. Razzouki, A. Charroud, Z. E. Allali, A. Chetouani and N. Aslimani, "A Survey of Advanced Gradient Methods in Machine Learning," 2024 7th International Conference on Advanced Communication Technologies and Networking (CommNet), Rabat, Morocco, 2024, pp. 1-7, doi: 10.1109/CommNet63022.2024.10793249. https://ieeexplore.ieee.org/abstract/document/10793249
- Shubhankar Bhakta, Utpal Nandi, Chiranjit Changdar, Bachchu Paul, Tapas Si, Rajat Kumar Pal, aMacP: An adaptive optimization algorithm for Deep Neural Network, Neurocomputing, Volume 620, 2025, 129242, ISSN 0925-2312, https://doi.org/10.1016/j.neucom.2024.129242 https://www.sciencedirect.com/science/article/abs/pii/S0925231224020137
- Yuanzhe Tao, Huizhuo Yuan, Xun Zhou, Yuan Cao, Quanquan Gu, 27 Dec 2024, Towards Simple and Provable Parameter-Free Adaptive Gradient Methods, https://arxiv.org/abs/2412.19444
- Y. Li et al., 2025, "Q-DADAM: A Quantized Distributed Online Optimization Algorithm With Adaptive Momentum," in IEEE Transactions on Control of Network Systems, doi: 10.1109/TCNS.2025.3526555. https://ieeexplore.ieee.org/abstract/document/10830565
- Jing Wang, Anna Choromanska, 24 Jan 2025, A Survey of Optimization Methods for Training DL Models: Theoretical Perspective on Convergence and Generalization, https://arxiv.org/abs/2501.14458
- Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, Zhiyuan Li, 13 Mar 2025, Structured Preconditioners in Adaptive Optimization: A Unified Analysis, https://arxiv.org/abs/2503.10537
- Michael Nuñez, July 11, 2025, Moonshot AI’s Kimi K2 outperforms GPT-4 in key benchmarks — and it’s free, https://venturebeat.com/ai/moonshot-ais-kimi-k2-outperforms-gpt-4-in-key-benchmarks-and-its-free/ (A one-trillion-parameter MoE model with 32B parameters activated per token. Examines the new MuonClip training optimizer as more efficient and more stable than AdamW variants.)
- Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, Zhilin Yang, 24 Feb 2025, Muon is Scalable for LLM Training, https://arxiv.org/abs/2502.16982
- Tim Tsz-Kit Lau, Qi Long, Weijie Su, 2 Aug 2025, PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective, https://arxiv.org/abs/2505.21799
- Martin Pelikan, Sheikh Shams Azam, Vitaly Feldman, Jan "Honza" Silovsky, Kunal Talwar, Christopher G. Brinton, Tatiana Likhomanenko, 14 Aug 2025, Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers and Gradient Clipping, https://arxiv.org/abs/2310.00098
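
Several of the unified analyses above (e.g., the structured-preconditioner and parameter-free papers) view AdaGrad, RMSprop, and Adam-type methods as preconditioned SGD, w ← w − lr · g / P, differing only in how the diagonal preconditioner P is accumulated. The sketch below illustrates that framing only; the helper names and the toy quadratic are assumptions for exposition, not code from any cited work.

```python
import numpy as np

def run(precond_update, steps=2000, lr=0.1):
    """Diagonal-preconditioned SGD, w <- w - lr * g / P, on f(w) = ||w||^2.
    precond_update(state, g) returns (new_state, diagonal preconditioner P)."""
    w = np.array([1.0, -2.0, 3.0])
    state = np.zeros_like(w)
    for _ in range(steps):
        g = 2 * w
        state, P = precond_update(state, g)
        w = w - lr * g / P
    return w

def adagrad_precond(state, g, eps=1e-8):
    state = state + g ** 2                        # cumulative sum of squared gradients
    return state, np.sqrt(state) + eps

def rmsprop_precond(state, g, alpha=0.9, eps=1e-8):
    state = alpha * state + (1 - alpha) * g ** 2  # exponential moving average
    return state, np.sqrt(state) + eps

print(run(adagrad_precond), run(rmsprop_precond))  # both move toward the origin
```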