7 tips to choose the best optimizer

Based on my experience.

Published in Towards Data Science · 5 min read · Jul 25, 2020

In machine learning, when we need to compute the distance between a predicted value and an actual value, we use the so-called loss function.

Contrary to what many believe, the loss function is not the same thing as the cost function. While the loss function computes the distance of a single prediction from its actual value, the cost function is usually more general. Indeed, the cost function can be, for example, the sum of loss functions over the training set plus some regularization.

Another term often incorrectly used as a synonym of the first two is the objective function, which is the most general term for any function optimized during training.
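
To make the distinction concrete, here is a minimal sketch; the squared-error loss, the L2 penalty and the function names are just assumptions for the example:

```python
import numpy as np

def loss(y_pred, y_true):
    # Loss function: distance of a single prediction from its actual value
    # (squared error chosen only for illustration).
    return (y_pred - y_true) ** 2

def cost(y_pred, y_true, weights, lam=0.01):
    # Cost function: aggregation of the losses over the training set
    # plus an L2 regularization term.
    return np.mean(loss(y_pred, y_true)) + lam * np.sum(weights ** 2)
```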

Now that the terminology is clear, we can give the definition of an optimizer.

Optimizers in machine learning are used to tune the parameters of a neural network in order to minimize the cost function.

The choice of the optimizer is, therefore, an important aspect that can make the difference between good and bad training.

There are many optimizers, so the choice is not straightforward. In this article, I'll briefly describe the most commonly used ones and then give you some guidelines that have helped me in different projects, and that I hope will help you when it comes to choosing the most suitable optimizer for your task.

For further information about how the specific optimizers work, you can take a look at this website.

As just mentioned, there are many optimizers. Each of them has advantages and disadvantages, often related to the specific task.

I like to divide optimizers into two families: gradient descent optimizers and adaptive optimizers. This division is based purely on an operational aspect: gradient descent algorithms force you to manually tune the learning rate, while adaptive algorithms adjust it automatically, which is where their name comes from.

Gradient Descent:

  • Batch gradient descent
  • Stochastic gradient descent
  • Mini-batch gradient descent

Adaptive:

  • Adagrad
  • Adadelta
  • RMSprop
  • Adam

There are three types of gradient descent optimizers, which differ in how much data we use to compute the gradient of the objective function.

Batch gradient descent

Also known as vanilla gradient descent, it’s the most basic algorithm among the three. It computes the gradients of the objective function J with respect to the parameters θ for the entire training set.

θ = θ − η · ∇θ J(θ), where η is the learning rate.

As we use the entire dataset to perform just one step, batch gradient descent can be very slow. Moreover, it is not suitable for datasets that don’t fit in memory.
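
As a rough sketch, one batch gradient descent step could look like the following; the linear model and the mean squared error objective are just assumptions for the example:

```python
import numpy as np

def batch_gd_step(theta, X, y, lr=0.01):
    # Gradient of the MSE objective computed over the ENTIRE training set.
    grad = 2 * X.T @ (X @ theta - y) / len(y)
    # A single parameter update per full pass over the data.
    return theta - lr * grad
```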

Stochastic gradient descent

It is an improved version of batch gradient descent. Instead of computing the gradients over the entire dataset, it performs a parameter update for each example in the dataset.

So the formula now also depends on the values of the input x and the output y of a single training example.

θ = θ − η · ∇θ J(θ; x^(i), y^(i))

The problem with SGD is that the updates are frequent and have a high variance, so the objective function fluctuates heavily during training.

This fluctuation can be an advantage over batch gradient descent because it allows the function to jump to better local minima, but at the same time it can be a disadvantage when it comes to converging to a specific local minimum.

A solution to this problem is to slowly decrease the learning rate in order to make the updates smaller and smaller, thus avoiding large oscillations.
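
A minimal sketch of SGD with a slowly decaying learning rate; the linear model, the MSE gradient and the decay schedule are assumptions for the example:

```python
import numpy as np

def sgd_epoch(theta, X, y, lr=0.01, decay=1e-4, step=0):
    # One parameter update PER TRAINING EXAMPLE, with the learning rate
    # slowly decreased to damp the oscillations described above.
    for i in np.random.permutation(len(y)):
        grad = 2 * X[i] * (X[i] @ theta - y[i])
        theta = theta - (lr / (1 + decay * step)) * grad
        step += 1
    return theta, step
```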

Mini-batch gradient descent

The intuition behind this algorithm is to exploit the advantages of both of the gradient descent methods we have seen so far.

It basically computes the gradients on small batches of data in order to reduce the variance of the updates.

θ = θ − η · ∇θ J(θ; x^(i:i+n), y^(i:i+n))

To summarize the gradient descent family:

  • Mini-batch gradient descent is the best choice among the three in most cases.
  • Learning rate tuning problem: all of them are subject to the choice of a good learning rate. Unfortunately, this choice is not straightforward.
  • Not good for sparse data: there is no mechanism to emphasize rarely occurring features. All parameters are updated equally.
  • High possibility of getting stuck in a suboptimal local minimum.
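
Putting the pieces together, a minimal mini-batch loop could look like the following; again, the linear model and the MSE objective are just assumptions for the example:

```python
import numpy as np

def minibatch_gd_epoch(theta, X, y, lr=0.01, batch_size=32):
    # Shuffle once per epoch, then update on small batches of data
    # to reduce the variance of the single-example updates.
    idx = np.random.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ theta - yb) / len(yb)
        theta = theta - lr * grad
    return theta
```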

The adaptive family of optimizers was introduced to solve the issues of the gradient descent algorithms. Its most important feature is that these optimizers don't require tuning of the learning rate value. Having said that, some libraries, e.g. Keras, still give you the possibility to set it manually for more advanced experiments.
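
For example, in Keras you can either rely on an adaptive optimizer's default learning rate or set it explicitly; the value 1e-3 below is just an illustration:

```python
from tensorflow import keras

# Rely on the default learning rate of the adaptive optimizer...
opt_default = keras.optimizers.Adam()

# ...or still set it manually for more advanced experiments.
opt_manual = keras.optimizers.Adam(learning_rate=1e-3)
```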

Adagrad

It adapts the learning rate to the parameters, performing small updates for frequently occurring features and large updates for the rarest ones.

In this way, the network is able to capture information belonging to features that are not frequent, emphasizing them and giving them the right weight.

The problem with Adagrad is that it adjusts the learning rate for each parameter according to all the past gradients. Since these squared gradients keep accumulating, there is a concrete possibility of ending up with a very small learning rate after a high number of steps.

If the learning rate becomes too small, we simply can't update the weights anymore, and the consequence is that the network stops learning.
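
A minimal sketch of the Adagrad update for a single parameter vector; epsilon is the usual small constant added for numerical stability:

```python
import numpy as np

def adagrad_step(theta, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    # Accumulate the squares of ALL past gradients...
    grad_sq_sum = grad_sq_sum + grad ** 2
    # ...so the effective per-parameter learning rate keeps shrinking,
    # which is exactly the vanishing learning rate problem described above.
    theta = theta - lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return theta, grad_sq_sum
```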

Adadelta

It improves the previous algorithm by introducing a history window, which restricts the past gradients taken into consideration during training to a fixed window.

In this way, we don’t have the problem of the vanishing learning rate.

RMSprop

It is very similar to Adadelta. The only difference is in the way they manage the past gradients.
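
In both cases the accumulation over all past gradients is replaced by an exponentially decaying average of squared gradients. A minimal sketch of the RMSprop version; the decay rate 0.9 and the learning rate 0.001 are the commonly used defaults:

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-8):
    # Exponentially decaying average of squared gradients instead of a full sum,
    # so the effective learning rate no longer vanishes over time.
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq
```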

Adam

In addition to the advantages of Adadelta and RMSprop, it also stores an exponentially decaying average of past gradients, similar to momentum.

  • Adam is the best among the adaptive optimizers in most of the cases.
  • Good with sparse data: the adaptive learning rate is perfect for this type of dataset.
  • There is no need to focus on the learning rate value.
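
A minimal sketch of the Adam update, combining a decaying average of past gradients (a momentum-like first moment) with a decaying average of past squared gradients; the beta values below are the commonly used defaults and t is the 1-based step counter:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponentially decaying average of past gradients (momentum-like).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponentially decaying average of past squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moving averages.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```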

Adam is the best choice in general. However, many recent papers state that SGD can lead to better results if combined with a good learning rate annealing schedule, which manages its value during training.

My suggestion is to first try Adam in any case, because it is more likely to return good results without advanced fine-tuning.

Then, if Adam achieves good results, it could be a good idea to switch to SGD to see what happens.
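
In Keras terms, this "try Adam first, then experiment with SGD plus a schedule" advice could look roughly like the sketch below; the specific schedule and its values are just assumptions for the example:

```python
from tensorflow import keras

# First attempt: Adam with its defaults usually gives good results out of the box.
adam = keras.optimizers.Adam()

# Later experiment: SGD with momentum and a learning rate annealing schedule.
schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1, decay_steps=10_000, decay_rate=0.9
)
sgd = keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```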
