Why ADAM Beats SGD for Attention Models

25 Sept 2019 (modified: 22 Oct 2023) · ICLR 2020 Conference Blind Submission · Readers: Everyone

Keywords: Optimization, ADAM, Deep learning

TL;DR: Adaptive methods provably beat SGD in training attention models due to the existence of heavy-tailed noise.

Abstract: While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to Adam are not well understood yet. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is a root cause of SGD's poor performance. Based on this observation, we study clipped variants of SGD that circumvent this issue; we then analyze their convergence under heavy-tailed noise. Furthermore, we develop a new adaptive coordinate-wise clipping algorithm (ACClip) tailored to such settings. Subsequently, we show how adaptive methods like Adam can be viewed through the lens of clipping, which helps us explain Adam's strong performance under heavy-tail noise settings. Finally, we show that the proposed ACClip outperforms Adam for both BERT pretraining and finetuning tasks.
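As a concrete illustration of the clipping idea the abstract describes, below is a minimal sketch of an SGD step with coordinate-wise gradient clipping. The function name, the fixed threshold `tau`, and this simple clipping rule are illustrative assumptions; the paper's ACClip adapts its clipping thresholds, so this is not the authors' exact algorithm.

```python
import numpy as np

def clipped_sgd_step(w, grad, lr=0.01, tau=1.0):
    """One SGD step with coordinate-wise gradient clipping.

    A rough sketch of the clipped-SGD idea discussed in the abstract;
    `tau` is a fixed, illustrative threshold (ACClip instead adapts its
    thresholds over time).
    """
    clipped = np.clip(grad, -tau, tau)  # bound every coordinate of the gradient
    return w - lr * clipped
```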

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:1912.03194/code) on CatalyzeX

FAQs

Why does SGD perform better than Adam?

This is because even Adam has some downsides. Adam prioritizes fast convergence, whereas stochastic gradient descent works through the data with simpler, noisier steps. As a result, SGD often generalizes better, at the cost of slower training.
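In practice, switching between the two is a one-line change. The sketch below assumes PyTorch and a toy linear model; the learning rates are common defaults, not values recommended by the paper.

```python
import torch

model = torch.nn.Linear(10, 1)  # toy model standing in for a real network

# SGD with momentum: often generalizes well, but can converge slowly,
# especially under the heavy-tailed gradient noise described above.
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam: adaptive per-coordinate step sizes, typically faster on attention models.
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```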

Why is Adam the best optimizer?

In general, Adam is better than other adaptive learning rate algorithms due to its faster convergence and robustness across problems; as such, it is often the default choice for deep learning. But alternatives like NAdam and AMSGrad can outperform Adam in some cases, so they are worth trying as well.

What are the disadvantages of Adam optimizer?

On the other hand, Adam's disadvantages include needing larger and deeper models, with more parameters, to reach fits comparable to other optimizers such as the LM algorithm. Additionally, Adam may struggle to fit the lower-amplitude components of a function compared to optimizers like BFGS and L-BFGS.

What is the best learning rate for Adam?

The default learning rate of 0.001 usually lets the optimizer update the parameters at just the right pace to reach a local minimum. In most cases, values between 0.0001 and 0.01 work well.

When to not use Adam optimizer?

Adam maintains moving averages of the gradients (its first- and second-moment estimates), which adds per-parameter state and can make it slower to converge than simpler optimizers in some settings. This is rarely a problem, but for tasks with a very large number of parameters or very small datasets, Adam may be too slow.

What are the disadvantages of Adam?

Disadvantages of the Adam optimizer

Susceptible to noise: Adam's adaptive learning rate can make it sensitive to noise in the gradient estimates, especially for sparse data. This can lead to suboptimal convergence or even divergence in some cases.
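One common mitigation for noisy gradient estimates is to clip gradients before the update. A minimal PyTorch sketch, assuming a toy model and the built-in `torch.nn.utils.clip_grad_norm_`; the threshold of 1.0 is an illustrative choice, not a recommendation from the paper.

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy batch

optimizer.zero_grad()
loss_fn(model(x), y).backward()
# Bound the global gradient norm so a single heavy-tailed gradient
# cannot blow up the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```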

What are the advantages of Adam?

Adam is one of the best optimization algorithms for deep learning, and its popularity is growing quickly. Its adaptive learning rates, efficiency in optimization, and robustness make it a popular choice for training neural networks.

What is the best value for Adam optimizer?

Best Practices for Using Adam Optimization

Use Default Hyperparameters: In most cases, the default hyperparameters for Adam optimization (beta1=0.9, beta2=0.999, epsilon=1e-8) work well and do not need to be tuned.
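For reference, these values already match PyTorch's defaults; the sketch below (PyTorch assumed, toy model) simply writes them out explicitly.

```python
import torch

model = torch.nn.Linear(10, 1)
# lr=1e-3, betas=(0.9, 0.999), eps=1e-8 are the defaults quoted above;
# spelling them out makes any later tuning explicit.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)
```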

Which optimizer is the fastest?

Conclusions: Adam is generally the fastest. If one wants to train a neural network in less time and more efficiently, Adam is the optimizer to use.

Does Adam optimizer have randomness?

How random is the Adam optimizer? For fixed hyperparameter values, the randomness in your result y is not something Adam itself introduces. It comes from the randomly initialized weights W and biases b that TensorFlow fills in (via np.random) and from how the data are shuffled.
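If reproducibility matters, fixing the random seeds removes that variation. A small sketch, assuming NumPy and PyTorch (the answer above mentions TensorFlow, where `tf.random.set_seed` plays the same role):

```python
import numpy as np
import torch

# Fixing seeds makes weight initialization and data shuffling repeatable;
# Adam itself adds no randomness for fixed hyperparameters.
np.random.seed(0)
torch.manual_seed(0)
```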

Is Adam optimizer better than RMSProp?

Conclusion: We have looked at different optimization algorithms in neural networks. Viewed as a combination of Momentum and RMSProp, Adam is generally the strongest of them, adapting robustly to large datasets and deep networks.

Does Adam optimizer have momentum?

How does Adam work? The Adam optimizer combines two gradient descent methodologies. Momentum: accelerates gradient descent by taking into account an exponentially weighted average of past gradients. RMSProp: scales each update by a running average of the squared gradients.

What are the two main components of the Adam optimizer?

Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems. Adam is relatively easy to configure where the default configuration parameters do well on most problems.

What are the Hyperparameters of Adam optimizer?

The Adam optimizer has three primary hyperparameters: the learning rate, beta1, and beta2. The learning rate (α) determines the step size at each iteration while moving toward a minimum of the loss function; beta1 controls the decay of the running average of the gradients (the momentum term); beta2 controls the decay of the running average of the squared gradients.
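These pieces fit together in the Adam update rule. Below is a minimal NumPy sketch of a single step, following the standard published update (written independently of the paper).

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient grad.

    m and v are the running first- and second-moment estimates;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad       # momentum term (first moment)
    v = beta2 * v + (1 - beta2) * grad**2    # RMSProp term (second moment)
    m_hat = m / (1 - beta1**t)               # bias-corrected estimates
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```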

Is Adam good for regression?

Adam is well-suited for training models with a large number of parameters, such as deep neural networks. Its adaptive learning rate helps navigate the complex optimization landscapes of such models efficiently. However, this is not always the case, as we can see when we applied Adam to Linear Regression.

What are the advantages of SGD over GD?

SGD is stochastic in nature, i.e., it picks a "random" instance of the training data at each step and computes the gradient from it, which makes each step much faster because there is far less data to process at a time, unlike batch GD.
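A tiny NumPy sketch of that idea on a synthetic least-squares problem (the data, loss, and learning rate are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)  # synthetic data
w = np.zeros(5)

for step in range(100):
    i = rng.integers(len(X))              # pick one random training example
    grad = 2 * (X[i] @ w - y[i]) * X[i]   # gradient of its squared error
    w -= 0.01 * grad                      # SGD update from that single example
```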

Why SGD optimizer?

To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.

Which optimizer is best for image classification?

RMSProp is considered one of the best default optimizers for image classification; it uses a decay factor and a momentum term to reach high accuracy.
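A minimal PyTorch sketch of that setup (toy model; the decay factor `alpha`, momentum, and learning rate are common defaults, not tuned values):

```python
import torch

model = torch.nn.Linear(10, 1)
# RMSprop with a decay factor (alpha) and momentum, as described above.
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01,
                                alpha=0.99, momentum=0.9)
```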
