Why ADAM Beats SGD for Attention Models

25 Sept 2019 (modified: 22 Oct 2023) · ICLR 2020 Conference Blind Submission · Readers: Everyone

Keywords: Optimization, ADAM, Deep learning

TL;DR: Adaptive methods provably beat SGD in training attention models due to the existence of heavy-tailed noise.

Abstract: While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to Adam are not well understood yet. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is a root cause of SGD's poor performance. Based on this observation, we study clipped variants of SGD that circumvent this issue; we then analyze their convergence under heavy-tailed noise. Furthermore, we develop a new adaptive coordinate-wise clipping algorithm (ACClip) tailored to such settings. Subsequently, we show how adaptive methods like Adam can be viewed through the lens of clipping, which helps us explain Adam's strong performance under heavy-tail noise settings. Finally, we show that the proposed ACClip outperforms Adam for both BERT pretraining and finetuning tasks.
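As a concrete illustration of the clipping idea the abstract describes, below is a minimal sketch of an SGD step with coordinate-wise gradient clipping. The function name, the fixed threshold `tau`, and this simple clipping rule are illustrative assumptions; the paper's ACClip adapts its clipping thresholds, so this is not the authors' exact algorithm.

```python
import numpy as np

def clipped_sgd_step(w, grad, lr=0.01, tau=1.0):
    """One SGD step with coordinate-wise gradient clipping.

    A rough sketch of the clipped-SGD idea discussed in the abstract;
    `tau` is a fixed, illustrative threshold (ACClip instead adapts its
    thresholds over time).
    """
    clipped = np.clip(grad, -tau, tau)  # bound every coordinate of the gradient
    return w - lr * clipped
```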

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:1912.03194/code) on CatalyzeX

FAQs

Why does SGD perform better than Adam?

This is because even Adam has some downsides. Adam prioritizes fast convergence, whereas stochastic gradient descent works through the data with simpler, noisier steps. As a result, SGD often generalizes better, at the cost of slower training.
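In practice, switching between the two is a one-line change. The sketch below assumes PyTorch and a toy linear model; the learning rates are common defaults, not values recommended by the paper.

```python
import torch

model = torch.nn.Linear(10, 1)  # toy model standing in for a real network

# SGD with momentum: often generalizes well, but can converge slowly,
# especially under the heavy-tailed gradient noise described above.
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam: adaptive per-coordinate step sizes, typically faster on attention models.
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```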

Why is Adam the best optimizer?

In general, Adam is better than other adaptive learning rate algorithms due to its faster convergence and robustness across problems; as such, it is often the default choice for deep learning. But alternatives like NAdam and AMSGrad can outperform Adam in some cases, so they are worth trying as well.

What are the disadvantages of Adam optimizer?

On the other hand, Adam's disadvantages include needing larger and deeper models, with more parameters, to reach fits comparable to other optimizers such as the LM algorithm. Additionally, Adam may struggle to fit the lower-amplitude components of a function compared to optimizers like BFGS and L-BFGS.

What is the best learning rate for Adam?

The default learning rate of 0.001 usually lets the optimizer update the parameters at just the right pace to reach a local minimum. In most cases, values between 0.0001 and 0.01 work well.

When to not use Adam optimizer?

Adam maintains moving averages of the gradients (its first- and second-moment estimates), which adds per-parameter state and can make it slower to converge than simpler optimizers in some settings. This is rarely a problem, but for tasks with a very large number of parameters or very small datasets, Adam may be too slow.

What are the disadvantages of Adam?

Disadvantages of the Adam optimizer

Susceptible to noise: Adam's adaptive learning rate can make it sensitive to noise in the gradient estimates, especially for sparse data. This can lead to suboptimal convergence or even divergence in some cases.
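One common mitigation for noisy gradient estimates is to clip gradients before the update. A minimal PyTorch sketch, assuming a toy model and the built-in `torch.nn.utils.clip_grad_norm_`; the threshold of 1.0 is an illustrative choice, not a recommendation from the paper.

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy batch

optimizer.zero_grad()
loss_fn(model(x), y).backward()
# Bound the global gradient norm so a single heavy-tailed gradient
# cannot blow up the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```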

What are the advantages of Adam?

Adam is one of the best optimization algorithms for deep learning, and its popularity is growing quickly. Its adaptive learning rates, efficiency in optimization, and robustness make it a popular choice for training neural networks.

What is the best value for Adam optimizer?

Best Practices for Using Adam Optimization

Use Default Hyperparameters: In most cases, the default hyperparameters for Adam optimization (beta1=0.9, beta2=0.999, epsilon=1e-8) work well and do not need to be tuned.
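For reference, these values already match PyTorch's defaults; the sketch below (PyTorch assumed, toy model) simply writes them out explicitly.

```python
import torch

model = torch.nn.Linear(10, 1)
# lr=1e-3, betas=(0.9, 0.999), eps=1e-8 are the defaults quoted above;
# spelling them out makes any later tuning explicit.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)
```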

Which optimizer is the fastest?

Conclusions: Adam is generally the fastest. If one wants to train a neural network in less time and more efficiently, Adam is the optimizer to use.

Does Adam optimizer have randomness?

How random is the Adam optimizer? For fixed hyperparameter values, the randomness in your result y is not something Adam itself introduces. It comes from the randomly initialized weights W and biases b that TensorFlow fills in (via np.random) and from how the data are shuffled.
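If reproducibility matters, fixing the random seeds removes that variation. A small sketch, assuming NumPy and PyTorch (the answer above mentions TensorFlow, where `tf.random.set_seed` plays the same role):

```python
import numpy as np
import torch

# Fixing seeds makes weight initialization and data shuffling repeatable;
# Adam itself adds no randomness for fixed hyperparameters.
np.random.seed(0)
torch.manual_seed(0)
```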

Is Adam optimizer better than RMSProp?

Conclusion: We have looked at different optimization algorithms in neural networks. Viewed as a combination of Momentum and RMSProp, Adam is generally the strongest of them, adapting robustly to large datasets and deep networks.

Does Adam optimizer have momentum?

How does Adam work? The Adam optimizer combines two gradient descent methodologies. Momentum: accelerates gradient descent by taking into account an exponentially weighted average of past gradients. RMSProp: scales each update by a running average of the squared gradients.

What are the two main components of the Adam optimizer?

Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems. Adam is relatively easy to configure where the default configuration parameters do well on most problems.

What are the Hyperparameters of Adam optimizer?

The Adam optimizer has three primary hyperparameters: the learning rate, beta1, and beta2. The learning rate (α) determines the step size at each iteration while moving toward a minimum of the loss function; beta1 controls the decay of the running average of the gradients (the momentum term); beta2 controls the decay of the running average of the squared gradients.
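These pieces fit together in the Adam update rule. Below is a minimal NumPy sketch of a single step, following the standard published update (written independently of the paper).

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient grad.

    m and v are the running first- and second-moment estimates;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad       # momentum term (first moment)
    v = beta2 * v + (1 - beta2) * grad**2    # RMSProp term (second moment)
    m_hat = m / (1 - beta1**t)               # bias-corrected estimates
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```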

Is Adam good for regression?

Adam is well-suited for training models with a large number of parameters, such as deep neural networks. Its adaptive learning rate helps navigate the complex optimization landscapes of such models efficiently. However, this is not always the case, as we can see when we applied Adam to Linear Regression.

What are the advantages of SGD over GD?

SGD is stochastic in nature, i.e., it picks a "random" instance of the training data at each step and computes the gradient from it, which makes each step much faster because there is far less data to process at a time, unlike batch GD.
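A tiny NumPy sketch of that idea on a synthetic least-squares problem (the data, loss, and learning rate are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)  # synthetic data
w = np.zeros(5)

for step in range(100):
    i = rng.integers(len(X))              # pick one random training example
    grad = 2 * (X[i] @ w - y[i]) * X[i]   # gradient of its squared error
    w -= 0.01 * grad                      # SGD update from that single example
```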

Why SGD optimizer?

To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.

Which optimizer is best for image classification?

RMSProp is considered one of the best default optimizers for image classification; it uses a decay factor and a momentum term to reach high accuracy.
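A minimal PyTorch sketch of that setup (toy model; the decay factor `alpha`, momentum, and learning rate are common defaults, not tuned values):

```python
import torch

model = torch.nn.Linear(10, 1)
# RMSprop with a decay factor (alpha) and momentum, as described above.
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01,
                                alpha=0.99, momentum=0.9)
```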
