CSC321 Tutorial 6: Optimization and Convolutional Neural Networks

In lecture 5, we talked about different issues that may arise when training an artificial neural network. Today, we'll explore some of these issues, and look at different ways that we can optimize a neural network's cost function.

In lecture 6, we will cover convolutional neural networks. Since this is the last tutorial before reading week, we will also train some CNNs today. If you are in the Tuesday lecture section, don't worry! Think of a CNN as a neural network with a slightly different architecture, one where the weights are "wired" differently. These weights (parameters) can still be optimized via gradient descent, and we will still use the back-propagation algorithm.

Please note that there is stochasticity in the way we initialize the neural network weights, so we will get different results (final training/validation accuracies) if we run the initialization and training multiple times. You will need to run some of the provided code multiple times to draw a conclusion about which optimization methods work well.

Data

We'll use the MNIST dataset, the same dataset that we introduced in Tutorial 4. The MNIST dataset contains black-and-white, hand-written (numerical) digits that are 28x28 pixels large. As in Tutorial 4, we'll only use the first 2500 images in the MNIST dataset. The first time you run this code, we will download the MNIST dataset.
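As a rough sketch of what this data loading might look like with torchvision (the exact transforms, split, and batch size used in the original notebook may differ):

```python
import torch
from torchvision import datasets, transforms

# Download MNIST the first time, then keep only the first 2500 training images.
mnist_train = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
mnist_train = torch.utils.data.Subset(mnist_train, range(2500))
train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=64, shuffle=True)
```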

Models in PyTorch

We'll work with two models: an MLP and a convolutional neural network.
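The exact architectures from the tutorial are not reproduced here; a plausible sketch of a small MLP and a small CNN for 28x28 MNIST inputs might look like the following (the layer sizes and channel counts are assumptions):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """A simple multi-layer perceptron for 28x28 MNIST images (sizes are illustrative)."""
    def __init__(self, num_hidden=30):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, num_hidden)
        self.fc2 = nn.Linear(num_hidden, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)                  # flatten the image
        return self.fc2(torch.relu(self.fc1(x)))

class ConvNet(nn.Module):
    """A small convolutional network; the channel counts are illustrative."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(16 * 7 * 7, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))   # 28x28 -> 14x14
        x = self.pool(torch.relu(self.conv2(x)))   # 14x14 -> 7x7
        return self.fc(x.view(x.size(0), -1))
```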

One way to gauge the "complexity" or the "capacity" of the neural network is by looking at the number of parameters that it has.
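A common way to count trainable parameters in PyTorch (not necessarily the exact helper used in the tutorial), applied to the model sketches above:

```python
def count_parameters(model):
    # Sum the number of elements in every trainable parameter tensor.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_parameters(MLP(num_hidden=30)))   # (784+1)*30 + (30+1)*10 = 23,860
print(count_parameters(ConvNet()))
```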

Training the neural network

We'll use a fairly configurable training function that computes both training and validation accuracy in each iteration. This is more expensive than evaluating only once per epoch, but it gives us more detailed learning curves.
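The actual training function in the tutorial has more options; a minimal sketch with plain SGD, using the model sketches above and the get_accuracy helper described below, might look like this (the signature and defaults are assumptions):

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train(model, train_loader, val_loader, num_epochs=3, lr=0.1):
    """Train with plain SGD, recording accuracy after every iteration (illustrative)."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)
    train_acc, val_acc = [], []
    for epoch in range(num_epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            # Track accuracy on the training and validation sets each iteration.
            train_acc.append(get_accuracy(model, train_loader))
            val_acc.append(get_accuracy(model, val_loader))
    return train_acc, val_acc
```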

And of course, we need the get_accuracy helper function. To turn the probabilities into a discrete prediction, we will take the digit with the highest probability. Because of the way softmax is computed, the digit with the highest probability is the same as the digit with the largest (pre-activation) output value.
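A sketch of such a helper; since softmax is monotonic, taking the argmax of the raw (pre-activation) outputs gives the same prediction as taking the argmax of the softmax probabilities:

```python
import torch

def get_accuracy(model, data_loader):
    """Fraction of examples where the most probable digit matches the label."""
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in data_loader:
            output = model(images)          # pre-activation (logit) outputs
            pred = output.argmax(dim=1)     # same argmax as the softmax probabilities
            correct += (pred == labels).sum().item()
            total += labels.size(0)
    return correct / total
```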

Let's see what the training curve of a multi-layer perceptron looks like. This code will take a couple of minutes to run...

The first thing we'll explore is the hidden unit size. If we increase the number of hidden units in an MLP, we'll increase its parameter count.
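For instance, assuming a single hidden layer as in the MLP sketch above, an MLP on 28x28 inputs with H hidden units and 10 outputs has (784 + 1)·H + (H + 1)·10 parameters, so going from H = 30 to H = 100 raises the count from 23,860 to 79,510.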

With more hidden units, our model has more "capacity", and can learn more intricate patterns in the training data. Our training accuracy will therefore be higher. However, the computation time for training and using these networks will also increase.

Adding more parameters tends to widen the gap between training and validation accuracy. If we add too many parameters, we could overfit. However, we won't show that here, since the computations would take a long time.

A smaller network will train faster, but may have worse training accuracy. Bear in mind that since the neural network's initialization is random, your results may vary from run to run.

Interlude: shuffling the dataset

What if we turn off data_shuffle? That is, what if we use the same mini-batches across all of our epochs? Can you explain what's going on in this learning curve?
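In PyTorch, whether fresh mini-batches are drawn each epoch is typically controlled by the DataLoader's shuffle flag; the data_shuffle argument above is the tutorial's own switch for the same idea. A sketch, reusing the mnist_train subset from earlier:

```python
import torch

# shuffle=True draws a new random ordering (and hence new mini-batches) at
# the start of every epoch; shuffle=False reuses the same mini-batches.
shuffled_loader = torch.utils.data.DataLoader(mnist_train, batch_size=64, shuffle=True)
fixed_loader = torch.utils.data.DataLoader(mnist_train, batch_size=64, shuffle=False)
```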

Conv Net

The learning curve for the convolutional network looks similar. This network is a lot more compact, with far fewer parameters. The computation time is a bit longer than training the MLPs, but we get fairly good results. (The learning rate of 0.1 looks a little high for this CNN, based on the noisiness of the learning curves.)

Momentum

We'll mainly experiment with the MLP(30) model, since it trains the fastest. We'll measure how quickly the model trains by looking at how far we get in the first 3 epochs of training. Here's how far our model gets without using momentum, with a learning rate of 0.1.

With a well-tuned learning rate and momentum parameter, our training can go faster. (Note: we had to try a few settings before finding one that worked well, and encourage you to try different combinations of the learning rate and momentum. For example, a learning rate of 0.1 and a momentum of ...)
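With torch.optim, momentum is a single extra argument to SGD. A sketch, assuming `model` is one of the models sketched earlier; 0.9 is just a common starting value to experiment with, not necessarily the tuned setting from the tutorial:

```python
import torch.optim as optim

# SGD with momentum; try different (lr, momentum) combinations.
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```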

The Adam optimizer works well and is the most popular optimizer nowadays. Adam typically requires a smaller learning rate: start at 0.001, then increase/decrease it as you see fit. For this example, 0.005 works well.
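A sketch of the corresponding optimizer setup, again assuming `model` is one of the models sketched earlier:

```python
import torch.optim as optim

# Adam usually wants a smaller learning rate than SGD; 0.005 is the value
# mentioned above for this example.
optimizer = optim.Adam(model.parameters(), lr=0.005)
```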

Convnets can also be trained using SGD with momentum or with Adam. In particular, our CNN generalizes very well. (Since our validation accuracy is about equal to our training accuracy, we can afford to increase the model capacity if we want to.)

Batch Normalization

Batch normalization speeds up training significantly!

There is a debate as to whether batch normalization should be applied before or after the activation. The original batch normalization paper applied the normalization before the ReLU activation, but applying normalization after the ReLU tends to perform better in practice. (A sketch of both placements is shown below.)

I (Lisa) believe the reason to be as follows:

  1. If we apply normalization before the ReLU, then we are effectively ignoring the bias parameters of those units, since those units' activations get centered anyway.
  2. If we apply normalization after the ReLU, we will have both positive and negative information being passed to the next layer.
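Here is a sketch of the two placements for a single fully-connected hidden layer, using nn.BatchNorm1d; the layer sizes are illustrative:

```python
import torch.nn as nn

# Batch norm *before* the ReLU, as in the original paper:
bn_before = nn.Sequential(
    nn.Linear(28 * 28, 30),
    nn.BatchNorm1d(30),
    nn.ReLU(),
    nn.Linear(30, 10),
)

# Batch norm *after* the ReLU, which often works better in practice:
bn_after = nn.Sequential(
    nn.Linear(28 * 28, 30),
    nn.ReLU(),
    nn.BatchNorm1d(30),
    nn.Linear(30, 10),
)
```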

Batch normalization can be used in CNNs too.
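For convolutional layers, the 2D variant is used instead; a sketch, with channel counts mirroring the ConvNet sketch above:

```python
import torch.nn as nn

# nn.BatchNorm2d normalizes each channel over the batch and spatial dimensions.
conv_block = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(8),
    nn.MaxPool2d(2),
)
```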

Weight Initialization

If we initialize the weights to zeros, our neural network will be stuck in a saddle point. Since we are using stochastic gradient descent, we will see only noise in the training curve and no progress.
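One way to reproduce this failure mode, using a hypothetical helper (not from the tutorial) and the MLP sketch from earlier:

```python
import torch.nn as nn

def zero_init(model):
    # Set every weight and bias to zero. The hidden units then all compute
    # the same thing, their gradients vanish (ReLU of zero pre-activations),
    # and SGD makes essentially no progress.
    for param in model.parameters():
        nn.init.zeros_(param)

model = MLP(num_hidden=30)   # MLP as sketched earlier
zero_init(model)
```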
