CSC321 Tutorial 6: Optimization and Convolutional Neural Networks

In lecture 5, we talked about different issues that may arise when training an artificial neural network. Today, we'll explore some of these issues, and look at different ways that we can optimize a neural network's cost function.

In lecture 6, we will cover convolutional neural networks. Since this is the last tutorial before reading week, we will also train some CNNs today. If you are in the Tuesday lecture section, don't worry! Think of a CNN as a neural network with a slightly different architecture, one where the weights are "wired" differently. These weights (parameters) can still be optimized via gradient descent, and we will still use the back-propagation algorithm.

Please note that there is stochasticity in the way we initialize the neural network weights, so we will get different results (final training/validation accuracies) if we run the initialization and training multiple times. You will need to run some of the provided code multiple times to draw a conclusion about which optimization methods work well.

Data

We'll use the MNIST dataset, the same dataset that we introduced in Tutorial 4. The MNIST dataset contains black-and-white, hand-written (numerical) digits that are 28x28 pixels large. As in Tutorial 4, we'll only use the first 2500 images in the MNIST dataset. The first time you run this code, we will download the MNIST dataset.
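As a rough sketch of what this data loading might look like with torchvision (the exact transforms, split, and batch size used in the original notebook may differ):

```python
import torch
from torchvision import datasets, transforms

# Download MNIST the first time, then keep only the first 2500 training images.
mnist_train = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
mnist_train = torch.utils.data.Subset(mnist_train, range(2500))
train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=64, shuffle=True)
```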

Models in PyTorch

We'll work with two models: an MLP and a convolutional neural network.
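The exact architectures from the tutorial are not reproduced here; a plausible sketch of a small MLP and a small CNN for 28x28 MNIST inputs might look like the following (the layer sizes and channel counts are assumptions):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """A simple multi-layer perceptron for 28x28 MNIST images (sizes are illustrative)."""
    def __init__(self, num_hidden=30):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, num_hidden)
        self.fc2 = nn.Linear(num_hidden, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)                  # flatten the image
        return self.fc2(torch.relu(self.fc1(x)))

class ConvNet(nn.Module):
    """A small convolutional network; the channel counts are illustrative."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(16 * 7 * 7, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))   # 28x28 -> 14x14
        x = self.pool(torch.relu(self.conv2(x)))   # 14x14 -> 7x7
        return self.fc(x.view(x.size(0), -1))
```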

One way to gauge the "complexity" or the "capacity" of the neural network is by looking at the number of parameters that it has.
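A common way to count trainable parameters in PyTorch (not necessarily the exact helper used in the tutorial), applied to the model sketches above:

```python
def count_parameters(model):
    # Sum the number of elements in every trainable parameter tensor.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_parameters(MLP(num_hidden=30)))   # (784+1)*30 + (30+1)*10 = 23,860
print(count_parameters(ConvNet()))
```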

Training the neural network

We'll use a fairly configurable training function that computes both training and validation accuracy in each iteration. This is more expensive than evaluating only once per epoch, but it gives us more detailed learning curves.
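The actual training function in the tutorial has more options; a minimal sketch with plain SGD, using the model sketches above and the get_accuracy helper described below, might look like this (the signature and defaults are assumptions):

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train(model, train_loader, val_loader, num_epochs=3, lr=0.1):
    """Train with plain SGD, recording accuracy after every iteration (illustrative)."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)
    train_acc, val_acc = [], []
    for epoch in range(num_epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            # Track accuracy on the training and validation sets each iteration.
            train_acc.append(get_accuracy(model, train_loader))
            val_acc.append(get_accuracy(model, val_loader))
    return train_acc, val_acc
```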

And of course, we need the get_accuracy helper function. To turn the probabilities into a discrete prediction, we will take the digit with the highest probability. Because of the way softmax is computed, the digit with the highest probability is the same as the digit with the largest (pre-activation) output value.
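A sketch of such a helper; since softmax is monotonic, taking the argmax of the raw (pre-activation) outputs gives the same prediction as taking the argmax of the softmax probabilities:

```python
import torch

def get_accuracy(model, data_loader):
    """Fraction of examples where the most probable digit matches the label."""
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in data_loader:
            output = model(images)          # pre-activation (logit) outputs
            pred = output.argmax(dim=1)     # same argmax as the softmax probabilities
            correct += (pred == labels).sum().item()
            total += labels.size(0)
    return correct / total
```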

Let's see what the training curve of a multi-layer perceptron looks like. This code will take a couple of minutes to run...

The first thing we'll explore is the hidden unit size. If we increase the number of hidden units in an MLP, we'll increase its parameter count.
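For instance, assuming a single hidden layer as in the MLP sketch above, an MLP on 28x28 inputs with H hidden units and 10 outputs has (784 + 1)·H + (H + 1)·10 parameters, so going from H = 30 to H = 100 raises the count from 23,860 to 79,510.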

With more hidden units, our model has more "capacity", and can learn more intricate patterns in the training data. Our training accuracy will therefore be higher. However, the computation time for training and using these networks will also increase.

Adding more parameters tends to widen the gap between training and validation accuracy. If we add too many parameters, we could overfit. However, we won't show that here, since the computations would take a long time.

A smaller network will train faster, but may have worse training accuracy. Bear in mind that since the neural network's initialization is random, your results may vary from run to run.

Interlude: shuffling the dataset

What if we turn off data_shuffle? That is, what if we use the same mini-batches across all of our epochs? Can you explain what's going on in this learning curve?
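In PyTorch, whether fresh mini-batches are drawn each epoch is typically controlled by the DataLoader's shuffle flag; the data_shuffle argument above is the tutorial's own switch for the same idea. A sketch, reusing the mnist_train subset from earlier:

```python
import torch

# shuffle=True draws a new random ordering (and hence new mini-batches) at
# the start of every epoch; shuffle=False reuses the same mini-batches.
shuffled_loader = torch.utils.data.DataLoader(mnist_train, batch_size=64, shuffle=True)
fixed_loader = torch.utils.data.DataLoader(mnist_train, batch_size=64, shuffle=False)
```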

Conv Net

The learning curve for the convolutional network looks similar. This network is a lot more compact, with far fewer parameters. The computation time is a bit longer than training the MLPs, but we get fairly good results. (The learning rate of 0.1 looks a little high for this CNN, based on the noisiness of the learning curves.)

Momentum

We'll mainly experiment with the MLP(30) model, since it trains the fastest. We'll measure how quickly the model trains by looking at how far we get in the first 3 epochs of training. Here's how far our model gets without using momentum, with a learning rate of 0.1.

With a well-tuned learning rate and momentum parameter, our training can go faster. (Note: we had to try a few settings before finding one that worked well, and encourage you to try different combinations of the learning rate and momentum. For example, a learning rate of 0.1 and a momentum of ...)
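With torch.optim, momentum is a single extra argument to SGD. A sketch, assuming `model` is one of the models sketched earlier; 0.9 is just a common starting value to experiment with, not necessarily the tuned setting from the tutorial:

```python
import torch.optim as optim

# SGD with momentum; try different (lr, momentum) combinations.
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```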

The Adam optimizer works well and is the most popular optimizer nowadays. Adam typically requires a smaller learning rate: start at 0.001, then increase/decrease it as you see fit. For this example, 0.005 works well.
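A sketch of the corresponding optimizer setup, again assuming `model` is one of the models sketched earlier:

```python
import torch.optim as optim

# Adam usually wants a smaller learning rate than SGD; 0.005 is the value
# mentioned above for this example.
optimizer = optim.Adam(model.parameters(), lr=0.005)
```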

Convnets can also be trained using SGD with momentum or with Adam. In particular, our CNN generalizes very well. (Since our validation accuracy is about equal to our training accuracy, we can afford to increase the model capacity if we want to.)

Batch Normalization

Batch normalization speeds up training significantly!

There is a debate as to whether batch normalization should be applied before or after the activation. The original batch normalization paper applied the normalization before the ReLU activation, but applying normalization after the ReLU tends to perform better in practice. (A sketch of both placements is shown below.)

I (Lisa) believe the reason to be as follows:

  1. If we apply normalization before the ReLU, then we are effectively ignoring the bias parameters of those units, since those units' activations get centered anyway.
  2. If we apply normalization after the ReLU, we will have both positive and negative information being passed to the next layer.
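Here is a sketch of the two placements for a single fully-connected hidden layer, using nn.BatchNorm1d; the layer sizes are illustrative:

```python
import torch.nn as nn

# Batch norm *before* the ReLU, as in the original paper:
bn_before = nn.Sequential(
    nn.Linear(28 * 28, 30),
    nn.BatchNorm1d(30),
    nn.ReLU(),
    nn.Linear(30, 10),
)

# Batch norm *after* the ReLU, which often works better in practice:
bn_after = nn.Sequential(
    nn.Linear(28 * 28, 30),
    nn.ReLU(),
    nn.BatchNorm1d(30),
    nn.Linear(30, 10),
)
```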

Batch normalization can be used in CNNs too.
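For convolutional layers, the 2D variant is used instead; a sketch, with channel counts mirroring the ConvNet sketch above:

```python
import torch.nn as nn

# nn.BatchNorm2d normalizes each channel over the batch and spatial dimensions.
conv_block = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(8),
    nn.MaxPool2d(2),
)
```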

Weight Initialization

If we initialize the weights to zeros, our neural network will be stuck in a saddle point. Since we are using stochastic gradient descent, we will see only noise in the training curve and no progress.
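One way to reproduce this failure mode, using a hypothetical helper (not from the tutorial) and the MLP sketch from earlier:

```python
import torch.nn as nn

def zero_init(model):
    # Set every weight and bias to zero. The hidden units then all compute
    # the same thing, their gradients vanish (ReLU of zero pre-activations),
    # and SGD makes essentially no progress.
    for param in model.parameters():
        nn.init.zeros_(param)

model = MLP(num_hidden=30)   # MLP as sketched earlier
zero_init(model)
```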
