On Empirical Comparisons of Optimizers for Deep Learning

A recent paper by Hui Li et al. compares the performance of various optimizers for deep learning, including SGD, Adam, and RMSProp, among others.

In recent years, deep learning has become increasingly popular, with a wide variety of applications in both academia and industry. As a result, optimizing deep neural networks (DNNs) has become an important research problem. In this paper, we empirically compare the performance of six popular optimizers for training DNNs: SGD, Adam, RMSProp, Adagrad, Adadelta, and Nadam. We find that Nadam outperforms all other optimizers on a variety of standard benchmarks for image classification and machine translation.


There is a wide variety of optimizers available for training deep neural networks, each with its own advantages and disadvantages. In recent years, a number of empirical comparisons of these optimizers have been conducted. In this paper, we review and compare the results of these studies.

Empirical Comparisons

As deep learning models become increasingly complex, the training process requires more sophisticated optimization algorithms to minimize the cost function. A number of different optimization algorithms have been proposed, but it is not clear which is best for a given problem. In this paper, we compare several popular optimization algorithms empirically on a number of deep learning tasks. We find that no single algorithm is best across all tasks, but some general trends emerge. Overall, algorithms based on momentum are generally more effective than those without, and Nesterov momentum generally outperforms standard momentum. Additionally, adaptive learning-rate algorithms such as AdaGrad and RMSProp tend to outperform those with fixed learning rates. Finally, recently proposed methods such as Adam and Nadam often perform well, although they are not always the best choice.
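The difference between the two momentum variants can be made concrete with a minimal pure-Python sketch on a toy 1-D quadratic. The objective, step count, and hyperparameters here are illustrative assumptions for exposition, not the benchmark settings used in any of the studies discussed:

```python
def sgd_momentum(grad, x, lr=0.1, mu=0.9, steps=200):
    """Standard (heavy-ball) momentum: velocity built from the gradient at x."""
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(x)
        x += v
    return x

def sgd_nesterov(grad, x, lr=0.1, mu=0.9, steps=200):
    """Nesterov momentum: gradient evaluated at the look-ahead point x + mu*v."""
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(x + mu * v)
        x += v
    return x

# Toy objective f(x) = x^2, so grad(x) = 2x; both variants should approach 0.
grad = lambda x: 2.0 * x
print(sgd_momentum(grad, 5.0), sgd_nesterov(grad, 5.0))
```

The only change in the Nesterov variant is where the gradient is evaluated: at the look-ahead point rather than the current iterate, which damps the oscillations that plain momentum produces on curved objectives.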


Our results show that the Adam optimizer outperforms other optimizers on a wide range of deep learning tasks. In particular, Adam performs well on image classification, text classification, and reinforcement learning tasks. Furthermore, our results suggest that Adam is robust across different data types and network architectures.
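For reference, here is what the Adam update computes per parameter, as a minimal single-scalar sketch. The hyperparameter defaults follow the original Adam paper; the toy quadratic and the larger learning rate in the demo loop are illustrative assumptions:

```python
import math

def adam_step(x, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter x given gradient g at step t (1-indexed)."""
    m = b1 * m + (1 - b1) * g        # EMA of gradients (first moment)
    v = b2 * v + (1 - b2) * g * g    # EMA of squared gradients (second moment)
    m_hat = m / (1 - b1 ** t)        # bias correction for zero-initialized EMAs
    v_hat = v / (1 - b2 ** t)
    return x - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Toy demo: minimize f(x) = x^2 (gradient 2x) starting from x = 5.0.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2.0 * x, m, v, t, lr=0.05)
print(x)
```

With these illustrative settings the iterate oscillates around and settles near the minimum at 0. The bias-correction terms matter most in the early steps, when m and v are still close to their zero initialization.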


There are a lot of different optimizers out there for deep learning, and it can be hard to keep track of all the different options. In this discussion, we will compare some of the most popular optimizers and see how they stack up against each other.


Taken together, our empirical comparisons suggest that different optimizers can lead to significantly different performance on deep learning tasks. Adam seems to perform well on a wide range of problems, while SGD with momentum and RMSProp seem to be more sensitive to the specifics of the problem. These results are in line with the previous literature on the subject.

Future Work

One interesting direction for future work is broader empirical comparison of optimizers for deep learning. It is possible that different optimizers work well for different types of neural networks or data sets, so it would be worthwhile to compare a number of optimizers across a wider variety of tasks. Another interesting direction is to investigate ways to make training deep neural networks faster; this could involve new training methods as well as better use of hardware such as GPUs.


We would like to thank the reviewers for their insightful comments. This work was partially supported by China Scholarship Council, under grant no. XXXXXXXX.



The systems compared in this paper are:

-A: TensorFlow
-B: PyTorch
-C: MXNet
-D: Caffe2

We compare the optimizers on the following deep learning networks:

-1. AlexNet
-2. VGG16
-3. InceptionV3
-4. ResNet50
-5. DenseNet121
