ResNets: Why do they perform better than Classic ConvNets? (Conceptual Analysis)

A deeper insight into what makes Residual Networks so efficient


Today, we are all familiar with the role neural networks play in making machines artificially intelligent in almost every field, from the medical industry to space exploration. Machine learning enthusiasts and researchers around the world keep looking for ways to make these networks as robust and efficient as possible. One such network is the Residual Network (ResNet), a ubiquitously used architecture that has made it practical to train deeper and bigger networks. In this article, we will first go through a brief overview of classic networks like LeNet-5, AlexNet and VGGNet-16/19, and then discuss how Residual Networks (ResNets) work and how they mitigate some limitations of those classic networks, enabling efficient training of deep models. This will help you brush up on these concepts and smooth your path to understanding ResNets more conceptually. If you don’t wish to go through the classic architectures, skip the next section and move straight to ResNets (2. How Deep can we Go?).

1. Classic Network Architectures

The design of a basic Convolutional Neural Network (CNN) starts with an image input layer (dimensions: height, width, channels), followed by a stack of convolutional layers and max- or average-pooling layers, then one or more fully connected layers, and finally a Softmax or Sigmoid output function for class-label prediction. The number of channels in subsequent layers grows with the number of filters used in the convolutional layers. This generic recipe is common to every architecture; it is the number and arrangement of these layers that makes architectures differ from each other in performance and efficiency.
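To make this generic recipe concrete, here is a minimal sketch in PyTorch. The framework, the filter counts and the 32x32 input size are my own illustrative assumptions, not something prescribed by the article:

```python
import torch
import torch.nn as nn

# A minimal, generic CNN following the pattern described above:
# input -> convolution + pooling blocks -> fully connected layer -> class scores.
class SimpleCNN(nn.Module):
    def __init__(self, in_channels=3, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),  # channels grow with the number of filters
            nn.ReLU(),
            nn.MaxPool2d(2),                                       # halves height and width
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)       # assumes a 32x32 input

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)  # raw scores; Softmax/Sigmoid is applied by the loss or at inference

# Example: a batch of four 32x32 RGB images.
logits = SimpleCNN()(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```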

The LeNet-5 architecture, developed in 1998, was created to recognize gray-scale images of hand-written digits. It consists of an image input layer of size (32x32x1) and convolutional layers with (5x5) filters (6 filters in the first, 16 in the second) using stride 1, each followed by (2x2) average pooling with stride 2. The network ends with fully connected layers and an output function with 10 possible predictions (the digits 0 to 9). In total it has roughly 60,000 trainable parameters.

[Figure 1: LeNet-5 architecture]
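Below is a rough PyTorch rendering of the layer sizes listed above. The original LeNet-5 used tanh/sigmoid-style activations and a more elaborate connection scheme, so treat this as a modern sketch rather than a faithful reproduction:

```python
import torch
import torch.nn as nn

# A sketch of LeNet-5 as described above: (32x32x1) input, (5x5) convolutions,
# (2x2) average pooling with stride 2, and fully connected layers ending in 10 class scores.
class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),   # 32x32x1 -> 28x28x6
            nn.AvgPool2d(kernel_size=2, stride=2),       # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),  # -> 10x10x16
            nn.AvgPool2d(kernel_size=2, stride=2),       # -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

model = LeNet5()
print(sum(p.numel() for p in model.parameters()))  # about 61,700 parameters, i.e. on the order of 60k
```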

In 2012, researchers came up with a much bigger network than LeNet, called AlexNet, to classify the 1.2 million high-resolution images of the ImageNet ILSVRC-2010 dataset. With an input size of (227x227x3), it consists of a combination of convolutional and max-pooling layers (with “same” padding in several convolutional layers) and ends with a Softmax function over 1000 classes. The network has about 60 million trainable parameters, far more than LeNet. Its architecture is shown in Figure 2.

[Figure 2: AlexNet architecture]

VGGNet-16/19, introduced in 2014 and trained on the ImageNet dataset, is the deepest of these three architectures. VGG-16 has about 138 million trainable parameters and consists of 16 weight layers (13 convolutional layers using “same” padding plus 3 fully connected layers); VGG-19 has 19 weight layers and works the same way. The convolutional stages use max pooling, the fully connected layers have 4096 units, and a Softmax classifier at the end predicts 1000 object classes. The systematic reduction of the input’s width and height through the layers, accompanied by an organized increase in the number of channels, is the feature that grabbed the attention of many developers and researchers. Given their efficacy, these deep networks are widely used in industry to train various machine learning models, as well as for teaching the basics to beginners.

2. How Deep can we Go?

What happens when we increase the depth of these networks further in order to make the model more robust and enhance its performance? In theory, the training and test error should keep decreasing as we go deeper; experimentally, the opposite happens. Instead of steadily decreasing, the error reaches a minimum and then starts increasing again. This degradation is largely driven by the vanishing and exploding gradient problem, and since the training error itself rises, it is not simply overfitting; techniques like L2 regularization or dropout therefore do not fix it. As we go deeper, useful parameters and activations (even those of a plain identity mapping) get lost along the way, because the deeper layers fail to preserve them as weights and biases keep being updated.

[Figure: training error rising again as plain networks get deeper]
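The effect is easy to reproduce on a toy example. The sketch below is my own construction (using PyTorch and saturating sigmoid activations to make the effect pronounced); it measures how much gradient actually reaches the first layer of a plain network as depth grows:

```python
import torch
import torch.nn as nn

# Illustration of vanishing gradients: build a plain stack of layers of a given depth,
# back-propagate once, and inspect the gradient that reaches the very first layer.
def first_layer_grad_norm(depth, width=64):
    torch.manual_seed(0)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.Sigmoid()]  # saturating activations shrink gradients
    net = nn.Sequential(*layers)
    x = torch.randn(16, width)
    net(x).sum().backward()
    return net[0].weight.grad.norm().item()

for depth in (5, 20, 50):
    print(depth, first_layer_grad_norm(depth))
# The norm typically drops by orders of magnitude as depth grows: the early layers
# barely receive a learning signal, which is what makes very deep plain networks hard to train.
```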

Fortunately, Residual Networks have proved quite effective at solving this problem: in addition to the direct connections between consecutive layers, they add a skip connection, or “shortcut”, around every pair of layers. This lets us take the activation of one layer and feed it to a layer much deeper in the network, preserving the network’s learned parameters in those deeper layers.

[Figure: a skip connection feeding an earlier activation to a deeper layer]

One block of such a connection is called a “residual block”. These blocks are stacked on top of one another in a ResNet so that, even in much deeper layers, the network can at least learn the identity function efficiently.

3. What happens inside a Residual Block?

Let’s compare the structure and functionality of a plain network layer with a residual block to understand how a residual block works. In a plain network, each layer applies a linear transformation (z) followed by a non-linear activation (ReLU, here), and its output serves as the input to the next layer. Every layer uses a single activation function.

[Figure: two consecutive layers of a plain network]

Where,

z[l+1] = w[l+1] a[l] + b[l+1]

ReLU: a[l+1] = g(z[l+1])

z[l+2] = w[l+2] a[l+1] + b[l+2]

ReLU: a[l+2] = g(z[l+2])

(and so on for (l+n) layers, where n is the number of layers)

w[l] is the weight matrix for the lth layer.

b[l] is the bias for the lth layer.
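Written as code, the plain forward pass over these two layers is just the following (a NumPy sketch with arbitrary layer sizes of my own choosing):

```python
import numpy as np

# The plain forward pass written out exactly as in the equations above.
def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
a_l = rng.standard_normal(4)                         # a[l], the input to this pair of layers

W1, b1 = rng.standard_normal((4, 4)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((4, 4)), rng.standard_normal(4)

z_l1 = W1 @ a_l + b1      # z[l+1] = w[l+1] a[l] + b[l+1]
a_l1 = relu(z_l1)         # a[l+1] = g(z[l+1])
z_l2 = W2 @ a_l1 + b2     # z[l+2] = w[l+2] a[l+1] + b[l+2]
a_l2 = relu(z_l2)         # a[l+2] = g(z[l+2])
```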

In a residual block, however, the (l+2)th layer does not only use the output of the preceding layer, a[l+1] = g(z[l+1]); it also reuses the activation from two layers back, a[l] itself. This happens through a skip connection, illustrated below.

[Figure: a residual block with its skip connection]

Residual Block Non-Linear ReLU function:

a[l+2] = g(z[l+2] + a[l]) (Important Equation)

If z[l+2] = 0 (which happens when w[l+2] = 0 and b[l+2] = 0), then

a[l+2] = g(a[l]) = a[l] (the (l+2)th layer still has a useful activation to learn from)

The connection shown above adds a[l] to the input of the ReLU non-linearity of the (l+2)th layer, so the information in a[l] is fast-forwarded deeper into the network through the shortcut. The activation of that layer therefore becomes g(z[l+2] + a[l]). In this way, even if in some layer deep in the network the weight matrix is a zero matrix and the bias is also zero, nullifying z[l+2], the (l+2)th layer still has an effective learning signal from the lth layer, namely a[l].
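A tiny numerical check of this argument (my own illustration, with arbitrary dimensions) shows that when w[l+2] and b[l+2] are zero, the block really does collapse to the identity:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

a_l = relu(np.random.randn(4))          # a[l]: non-negative, since it came through a ReLU
W1, b1 = np.random.randn(4, 4), np.random.randn(4)
a_l1 = relu(W1 @ a_l + b1)              # layer l+1 as usual

W2, b2 = np.zeros((4, 4)), np.zeros(4)  # layer l+2 has learned nothing: w[l+2] = 0, b[l+2] = 0
z_l2 = W2 @ a_l1 + b2                   # z[l+2] = 0
a_l2 = relu(z_l2 + a_l)                 # residual connection: g(z[l+2] + a[l])

print(np.allclose(a_l2, a_l))           # True: the block reduces to the identity, a[l+2] = a[l]
```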

NOTE: Sometimes the dimensions of a[l] and a[l+2] are not the same, so the matrix addition z[l+2] + a[l] is not defined. In this case a projection matrix Ws is applied to a[l], so the sum becomes z[l+2] + Ws a[l], making the dimensions match and enabling the addition. Thankfully, most of these adjustments are handled when designing ResNets, because “same” convolutions are used in their convolutional layers.
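Putting the pieces together, a basic residual block might look like the PyTorch sketch below. The 1x1 convolution plays the role of the Ws projection when shapes differ; the layer sizes and the use of batch normalization are illustrative choices in line with common ResNet implementations, not something dictated by the article:

```python
import torch
import torch.nn as nn

# A sketch of a basic two-convolution residual block.
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut (the Ws of the note) only when shapes differ; otherwise use the identity.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # a[l+2] = g(z[l+2] + a[l])

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64, 128, stride=2)(x).shape)  # torch.Size([1, 128, 28, 28])
```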

4. Layers in ResNets

[Figure 9: 34-layer ResNet architecture]

Deep ResNets are built by stacking residual blocks on top of one another, and they can be well over a hundred layers deep while still carrying the parameters learned in early activations into the deeper layers. The convolutional layers of a ResNet look something like Figure 9: a 34-layer ResNet with (3x3) convolutional filters using “same” padding, pooling layers and a fully connected layer, ending with a Softmax function to predict 1000 classes.
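You rarely need to wire those 34 layers by hand. Assuming torchvision is installed, a reference implementation of this architecture can be loaded directly; it stacks residual blocks as described and ends with a 1000-way classifier:

```python
import torch
from torchvision.models import resnet34

model = resnet34()                                  # ResNet-34 with randomly initialised weights
x = torch.randn(1, 3, 224, 224)                     # one 224x224 RGB image
print(model(x).shape)                               # torch.Size([1, 1000])
print(sum(p.numel() for p in model.parameters()))   # roughly 21.8 million parameters
```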


In conclusion, ResNets are one of the most effective neural network architectures, as they keep the error rate low even very deep in the network. They have therefore proved to perform really well wherever deep networks are required, such as feature extraction, semantic segmentation and various Generative Adversarial Network architectures. They can also be used to build demanding AI-equipped computer vision systems where intricate features need to be extracted, or to enhance the resolution of images and videos. I hope this article helped you gain greater clarity about the underlying concepts behind ResNets. Thanks for reading!

References:

[1] ResNets research paper: https://arxiv.org/pdf/1603.05027.pdf

[2] LeNet research paper: http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf

[3] AlexNet research paper: https://papers.nips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

[4] VGGNet research paper: https://arxiv.org/pdf/1409.1556.pdf

[5] CNN course by deeplearning.ai: https://www.coursera.org/learn/convolutional-neural-networks/lecture/HAhz9/resnets

[6] Introduction to ResNets: https://towardsdatascience.com/introduction-to-resnets-c0a830a288a4
