The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (2024)

Clothes shopping is a taxing experience. My eyes get bombarded with too much information. Sales, coupons, colors, toddlers, flashing lights, and crowded aisles are just a few examples of all the signals forwarded to my visual cortex, whether or not I actively try to pay attention. The visual system absorbs an abundance of information. Should I go for that H&M khaki pants? Is that a Nike tank top? What color are those Adidas sneakers?

Can a computer automatically detect pictures of shirts, pants, dresses, and sneakers? It turns out that accurately classifying images of fashion items is surprisingly straight-forward to do, given quality training data to start from. In this tutorial, we’ll walk through building a machine learning model for recognizing images of fashion objects using the Fashion-MNIST dataset. We’ll walk through how to train a model, design the input and output for category classifications, and finally display the accuracy results for each model.

Image Classification

The problem of Image Classification goes like this: Given a set of images that are all labeled with a single category, we are asked to predict these categories for a novel set of test images and measure the accuracy of the predictions. There are a variety of challenges associated with this task, including viewpoint variation, scale variation, intra-class variation, image deformation, image occlusion, illumination conditions, background clutter etc.

How might we go about writing an algorithm that can classify images into distinct categories? Computer Vision researchers have come up with a data-driven approach to solve this. Instead of trying to specify what every one of the image categories of interest look like directly in code, they provide the computer with many examples of each image class and then develop learning algorithms that look at these examples and learn about the visual appearance of each class. In other words, they first accumulate a training dataset of labeled images, then feed it to the computer in order for it to get familiar with the data.

Given that fact, the complete image classification pipeline can be formalized as follows:

  • Our input is a training dataset that consists of N images, each labeled with one of K different classes.
  • Then, we use this training set to train a classifier to learn what every one of the classes looks like.
  • In the end, we evaluate the quality of the classifier by asking it to predict labels for a new set of images that it has never seen before. We will then compare the true labels of these images to the ones predicted by the classifier.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) is the most popular neural network model being used for image classification problem. The big idea behind CNNs is that a local understanding of an image is good enough. The practical benefit is that having fewer parameters greatly improves the time it takes to learn as well as reduces the amount of data required to train the model. Instead of a fully connected network of weights from each pixel, a CNN has just enough weights to look at a small patch of the image. It’s like reading a book by using a magnifying glass; eventually, you read the whole page, but you look at only a small patch of the page at any given time.

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (3)

Consider a 256 x 256 image. CNN can efficiently scan it chunk by chunk — say, a 5 × 5 window. The 5 × 5 window slides along the image (usually left to right, and top to bottom), as shown below. How “quickly” it slides is called its stride length. For example, a stride length of 2 means the 5 × 5 sliding window moves by 2 pixels at a time until it spans the entire image.

A convolution is a weighted sum of the pixel values of the image, as the window slides across the whole image. Turns out, this convolution process throughout an image with a weight matrix produces another image (of the same size, depending on the convention). Convolving is the process of applying a convolution.

The sliding-window shenanigans happen in the convolution layer of the neural network. A typical CNN has multiple convolution layers. Each convolutional layer typically generates many alternate convolutions, so the weight matrix is a tensor of 5 × 5 × n, where n is the number of convolutions.

As an example, let’s say an image goes through a convolution layer on a weight matrix of 5 × 5 × 64. It generates 64 convolutions by sliding a 5 × 5 window. Therefore, this model has 5 × 5 × 64 (= 1,600) parameters, which is remarkably fewer parameters than a fully connected network, 256 × 256 (= 65,536).

The beauty of the CNN is that the number of parameters is independent of the size of the original image. You can run the same CNN on a 300 × 300 image, and the number of parameters won’t change in the convolution layer.

Data Augmentation

Image classification research datasets are typically very large. Nevertheless, data augmentation is often used in order to improve generalisation properties. Typically, random cropping of rescaled images together with random horizontal flipping and random RGB colour and brightness shifts are used. Different schemes exist for rescaling and cropping the images (i.e. single scale vs. multi scale training). Multi-crop evaluation during test time is also often used, although computationally more expensive and with limited performance improvement. Note that the goal of the random rescaling and cropping is to learn the important features of each object at different scales and positions. Keras does not implement all of these data augmentation techniques out of the box, but they can easily implemented through the preprocessing function of the ImageDataGenerator modules.

Fashion MNIST Dataset

Recently, Zalando research published a new dataset, which is very similar to the well known MNIST database of handwritten digits. The dataset is designed for machine learning classification tasks and contains in total 60 000 training and 10 000 test images (gray scale) with each 28x28 pixel. Each training and test case is associated with one of ten labels (0–9). Up till here Zalando’s dataset is basically the same as the original handwritten digits data. However, instead of having images of the digits 0–9, Zalando’s data contains (not unsurprisingly) images with 10 different fashion products. Consequently, the dataset is called Fashion-MNIST dataset, which can be downloaded from GitHub. The data is also featured on Kaggle. A few examples are shown in the following image, where each row contains one fashion item.

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (4)

The 10 different class labels are:

  • 0 T-shirt/top
  • 1 Trouser
  • 2 Pullover
  • 3 Dress
  • 4 Coat
  • 5 Sandal
  • 6 Shirt
  • 7 Sneaker
  • 8 Bag
  • 9 Ankle boot

According to the authors, the Fashion-MNIST data is intended to be a direct drop-in replacement for the old MNIST handwritten digits data, since there were several issues with the handwritten digits. For example, it was possible to correctly distinguish between several digits, by simply looking at a few pixels. Even with linear classifiers it was possible to achieve high classification accuracy. The Fashion-MNIST data promises to be more diverse so that machine learning (ML) algorithms have to learn more advanced features in order to be able to separate the individual classes reliably.

Embedding Visualization of Fashion MNIST

Embedding is a way to map discrete objects (images, words, etc.) to high dimensional vectors. The individual dimensions in these vectors typically have no inherent meaning. Instead, it’s the overall patterns of location and distance between vectors that machine learning takes advantage of. Embeddings, thus, are important for input to machine learning; since classifiers and neural networks, more generally, work on vectors of real numbers. They train best on dense vectors, where all values contribute to define an object.

TensorBoard has a built-in visualizer, called the Embedding Projector, for interactive visualization and analysis of high-dimensional data like embeddings. The embedding projector will read the embeddings from my model checkpoint file. Although it’s most useful for embeddings, it will load any 2D tensor, including my training weights.

Here, I’ll attempt to represent the high-dimensional Fashion MNIST data using TensorBoard. After reading the data and create the test labels, I use this code to build TensorBoard’s Embedding Projector:

The Embedding Projector has three methods of reducing the dimensionality of a data set: two linear and one nonlinear. Each method can be used to create either a two- or three-dimensional view.

Principal Component Analysis: A straightforward technique for reducing dimensions is Principal Component Analysis (PCA). The Embedding Projector computes the top 10 principal components. The menu lets me project those components onto any combination of two or three. PCA is a linear projection, often effective at examining global geometry.

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (5)

t-SNE: A popular non-linear dimensionality reduction technique is t-SNE. The Embedding Projector offers both two- and three-dimensional t-SNE views. Layout is performed client-side animating every step of the algorithm. Because t-SNE often preserves some local structure, it is useful for exploring local neighborhoods and finding clusters.

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (6)

Custom: I can also construct specialized linear projections based on text searches for finding meaningful directions in space. To define a projection axis, enter two search strings or regular expressions. The program computes the centroids of the sets of points whose labels match these searches, and uses the difference vector between centroids as a projection axis.

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (7)

You can view the full code for the visualization steps at this notebook: TensorBoard-Visualization.ipynb

Training CNN Models on Fashion MNIST

Let’s now move to the fun part: I will create a variety of different CNN-based classification models to evaluate performances on Fashion MNIST. I will be building our model using the Keras framework. For more information on the framework, you can refer to the documentation here. Here are the list of models I will try out and compare their results:

  1. CNN with 1 Convolutional Layer
  2. CNN with 3 Convolutional Layer
  3. CNN with 4 Convolutional Layer
  4. VGG-19 Pre-Trained Model

For all the models (except for the pre-trained one), here is my approach:

  • Split the original training data (60,000 images) into 80% training (48,000 images) and 20% validation (12000 images) optimize the classifier, while keeping the test data (10,000 images) to finally evaluate the accuracy of the model on the data it has never seen. This helps to see whether I’m over-fitting on the training data and whether I should lower the learning rate and train for more epochs if validation accuracy is higher than training accuracy or stop over-training if training accuracy shift higher than the validation.
  • Train the model for 10 epochs with batch size of 256, compiled with categorical_crossentropy loss function and Adam optimizer.
  • Then, add data augmentation, which generates new training samples by rotating, shifting and zooming on the training samples, and train the model on updated data for another 50 epochs.

Here’s the code to load and split the data:

After loading and splitting the data, I preprocess them by reshaping them into the shape the network expects and scaling them so that all values are in the [0, 1] interval. Previously, for instance, the training data were stored in an array of shape (60000, 28, 28) of type uint8 with values in the [0, 255] interval. I transform it into a float32 array of shape (60000, 28 * 28) with values between 0 and 1.

1 — 1-Conv CNN

Here’s the code for the CNN with 1 Convolutional Layer:

After training the model, here’s the test loss and test accuracy:

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (8)

After applying data augmentation, here’s the test loss and test accuracy:

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (9)

For visual purpose, I plot the training and validation accuracy and loss:

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (10)
The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (11)

You can view the full code for this model at this notebook: CNN-1Conv.ipynb

2 — 3-Conv CNN

Here’s the code for the CNN with 3 Convolutional Layer:

After training the model, here’s the test loss and test accuracy:

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (12)

After applying data augmentation, here’s the test loss and test accuracy:

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (13)

For visual purpose, I plot the training and validation accuracy and loss:

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (14)
The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (15)

You can view the full code for this model at this notebook: CNN-3Conv.ipynb

3 — 4-Conv CNN

Here’s the code for the CNN with 4 Convolutional Layer:

After training the model, here’s the test loss and test accuracy:

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (16)

After applying data augmentation, here’s the test loss and test accuracy:

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (17)

For visual purpose, I plot the training and validation accuracy and loss:

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (18)
The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (19)

You can view the full code for this model at this notebook: CNN-4Conv.ipynb

4 — Transfer Learning

A common and highly effective approach to deep learning on small image datasets is to use a pre-trained network. A pre-trained network is a saved network that was previously trained on a large dataset, typically on a large-scale image-classification task. If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pre-trained network can effectively act as a generic model of the visual world, and hence its features can prove useful for many different computer-vision problems, even though these new problems may involve completely different classes than those of the original task.

I attempted to implement the VGG19 pre-trained model, which is a widely used ConvNets architecture for ImageNet. Here’s the code you can follow:

After training the model, here’s the test loss and test accuracy:

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (20)

For visual purpose, I plot the training and validation accuracy and loss:

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (21)
The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (22)

You can view the full code for this model at this notebook: VGG19-GPU.ipynb

Last Takeaway

The fashion domain is a very popular playground for applications of machine learning and computer vision. The problems in this domain is challenging due to the high level of subjectivity and the semantic complexity of the features involved. I hope that this post has been helpful for you to learn about the 4 different approaches to build your own convolutional neural networks to classify fashion images. You can view all the source code in my GitHub repo at this link. Let me know if you have any questions or suggestions on improvement!

— —

If you enjoyed this piece, I’d love it if you hit the clap button 👏 so others might stumble upon it. You can find my own code on GitHub, and more of my writing and projects at https://jameskle.com/. You can also follow me on Twitter, email me directly or find me on LinkedIn. Sign up for my newsletter to receive my latest thoughts on data science, machine learning, and artificial intelligence right at your inbox!

The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images (2024)

FAQs

What are the CNN models used for image classification? ›

The four types of CNN layers are the convolutional layer, ReLU layer, pooling layer, and fully connected layer. An image classifier passes an image through these layers to generate a classification. The convolutional layer extracts the features of an image by scanning through the image with filters.

What is 4 layer CNN? ›

The different layers of a CNN. There are four types of layers for a convolutional neural network: the convolutional layer, the pooling layer, the ReLU correction layer and the fully-connected layer.

What are the different types of CNN models? ›

Different Types of CNN Models
  • LeNet.
  • AlexNet.
  • ResNet.
  • GoogleNet.
  • MobileNet.
  • VGG.
Mar 13, 2024

Why is CNN used for image classification? ›

The CNN architecture is especially useful for image recognition and image classification, as well as other computer vision tasks because they can process large amounts of data and produce highly accurate predictions.

Why only CNN is used for image classification? ›

The Convolutional Neural Network (CNN or ConvNet) is a subtype of Neural Networks that is mainly used for applications in image and speech recognition. Its built-in convolutional layer reduces the high dimensionality of images without losing its information. That is why CNNs are especially suited for this use case.

Which CNN has 4 convolutional layers? ›

The CNN has 4 convolutional layers, 3 max pooling layers, two fully connected layers and one softmax output layer. The input consists of three 48 × 48 patches from axial, sagittal and coronal image slices centered around the target voxel.

What are the 4 different layers on CNN briefly explain the function of each layer? ›

Here, we define a simple CNN that accepts an input, applies a convolution layer, then an activation layer, then a fully connected layer, and, finally, a softmax classifier to obtain the output classification probabilities.

What is a CNN model? ›

A convolutional neural network (CNN or ConvNet) is a network architecture for deep learning that learns directly from data. CNNs are particularly useful for finding patterns in images to recognize objects, classes, and categories. They can also be quite effective for classifying audio, time-series, and signal data.

What is the most used CNN model? ›

LeNet-5 architecture is perhaps the most widely known CNN architecture. It was created by Yann LeCun in 1998 and widely used for written digits recognition (MNIST). Here is the LeNet-5 architecture. We start off with a grayscale image (LeNet-5 was trained on grayscale images), with a shape of 32×32 x1.

Which neural network is best for image classification? ›

Convolutional Neural Networks (CNNs) is the most popular neural network model being used for image classification problem.

How many algorithms are there in CNN? ›

CNN algorithm has two main processes: convolution and sampling .

What is better than CNN for image classification? ›

Classification Accuracy of SVM and CNN In this study, it is shown that SVM overcomes CNN, where it gives best results in classification, the accuracy in PCA- band the SVM linear 97.44%, SVM-RBF 98.84% and the CNN 94.01%, But in the all bands just have accuracy for SVM-linear 96.35% due to the big data hyper spectral.

What are the different types of image classification? ›

Unsupervised and supervised image classification is the two most common approaches. However, object-based classification has gained more popularity because it's useful for high-resolution data.

How many types of image classification are there? ›

Two general methods of classification are 'supervised' and 'unsupervised'. Unsupervised classification method is a fully automated process without the use of training data. Using a suitable algorithm, the specified characteristics of an image is detected systematically during the image processing stage.

What are CNN methods in image processing? ›

A convolutional neural network (CNN) is a type of artificial neural network used primarily for image recognition and processing, due to its ability to recognize patterns in images. A CNN is a powerful tool but requires millions of labelled data points for training.

What is ResNet CNN image classification? ›

Defining the CNN

ResNet is recommended. First, it is deep enough with 34 layers, 50 layers, or 101 layers. The deeper the hierarchy, the stronger the representation capability, and the higher the classification accuracy. Second, it is learnable.

Why CNN is better than Ann for image classification? ›

Here are some key reasons why CNNs are better suited for image classification: Hierarchical Feature Learning:CNNs are designed to automatically learn hierarchical features from images. Convolutional layers use filters (kernels) to capture local patterns, edges, and textures.

Top Articles
Latest Posts
Article information

Author: Moshe Kshlerin

Last Updated:

Views: 6165

Rating: 4.7 / 5 (57 voted)

Reviews: 80% of readers found this page helpful

Author information

Name: Moshe Kshlerin

Birthday: 1994-01-25

Address: Suite 609 315 Lupita Unions, Ronnieburgh, MI 62697

Phone: +2424755286529

Job: District Education Designer

Hobby: Yoga, Gunsmithing, Singing, 3D printing, Nordic skating, Soapmaking, Juggling

Introduction: My name is Moshe Kshlerin, I am a gleaming, attractive, outstanding, pleasant, delightful, outstanding, famous person who loves writing and wants to share my knowledge and understanding with you.