Reshaping inputs for convolutional neural network: Some common and uncommon methods

Pattern Recognition, Volume 93, September 2019, Pages 79-94

Abstract

Convolutional neural networks (CNNs) have become very common in computer vision in recent years, but they come with a severe restriction on the size of the input image: most CNNs are designed so that they can only accept images of a fixed size. This creates several challenges during data acquisition and model deployment. The common practice for overcoming this limitation is to reshape the input images so that they can be fed into the network. Many standard pre-trained networks and datasets are built to work with square images. In this work we analyze 25 different reshaping methods across 6 datasets from different domains, trained on three well-known architectures: Inception-V3, which is an extension of GoogLeNet, the Residual Network (ResNet-18) and the 121-layer DenseNet. While some reshaping methods, such as “interpolation” and “cropping”, are commonly used with CNNs, some uncommon techniques such as “containing”, “tiling” and “mirroring” are also demonstrated. In total, 450 neural networks were trained from scratch to support analyses of the convergence of the validation loss and the accuracy obtained on the test data. Statistical measures are provided to demonstrate the dependence between parameter choices and datasets. Several key observations are noted, such as the benefits of using randomized processes and the poor performance of the commonly used “cropping” techniques. The paper intends to provide empirical evidence to help the reader choose a suitable technique for reshaping inputs to their convolutional neural networks. The official code is available at https://github.com/DVLP-CMATERJU/Reshaping-Inputs-for-CNN.

Introduction

Before the last decade, traditional feature-based methods [1] were preferred for tasks like image localization, detection, recognition and segmentation. But feature-based methods are very difficult to formulate, because many latent patterns in the input space can be too complicated for the human mind to comprehend. In 1998, Yann LeCun proposed the convolutional neural network (CNN) [2] for classifying handwritten English numerals in the MNIST [3] database, which showed a new way of automatically learning features from images. In 2012, a similar network was proposed [4], combined with rectified linear units and local response normalization, which made a significant leap over the existing accuracy in the ImageNet Large Scale Visual Recognition Competition (ILSVRC) [5]. Since then, CNNs have become one of the most common techniques for computer vision tasks [6], [7], [8]. However, CNNs have one significant restriction: they always need inputs of a consistent size. LeNet5 [2] uses 32 × 32 inputs. ImageNet [9] based architectures tend to use inputs of size 224 × 224 [4], [10], [11], with the exception of the GoogLeNet family [12], [13], [14], which accepts inputs of size 299 × 299. Over the years this restriction could not be overlooked, and it creates a big problem during both data collection and model deployment. Data collection can be problematic, especially when images are captured by different sensors [15]. In many cases, bulk data are collected through crowd-sourcing [9] or internet crawling [16], which makes it difficult to maintain consistent image sizes. Similarly, during large-scale deployment, fixing the image size makes the product highly device dependent.
In the present era, where a large section of the population has access to smartphone cameras, deploying a product as a mobile app [17], [18] becomes difficult because of factors like sensor resolution and portrait or landscape orientation. CNNs, on the other hand, cannot deal with inputs of different sizes. A typical CNN can be represented as a sequence of tensor-based functions that operate on the input tensor. Hence a change in the size of the input also changes the size of the activations in every layer. This creates various problems, such as defining the number of neurons in fully connected or output layers, or formulating a suitable loss function. There has been some work showing how to use variable-sized inputs [19], but these approaches are very domain specific and do not scale well to larger problems. Another work that can address this problem is ROI pooling, as used in the Regional Convolutional Neural Network (RCNN) [20]. However, ROI pooling is a module used as an intermediate layer to convert activations of various sizes to a fixed size using variable-length max-pooling windows. Max-pooling is sensitive to strong activations, as it tracks the major activations in a region, and that is not suitable for input images: unlike intermediate activations, input images are generally not sparse, so ROI-pooling operations on images would incur a huge loss of information. The typical practice is therefore to reshape the input image to a fixed size and work with that. In this work, several techniques for reshaping an input image are thoroughly analyzed across various datasets and architectures to gain insight into the suitability of these reshaping methods in various scenarios. While some of these methods are commonly used in various networks, others are quite rare. In the next section, we discuss the motivation behind this empirical study.
Following that, a theoretical explanation of the methodologies used is provided so that the experiments, results and analysis may be clearly understood. Our observations are kept strictly consistent with the experimental results, and the analysis is based on empirical evidence. In the conclusion, we provide definite guidelines for readers on the choice of methods and parameters depending on the dataset.
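
As a rough illustration of what reshaping an image to a fixed square size can look like, the sketch below implements four of the method families named in the abstract (cropping, containing, tiling and mirroring) with plain NumPy. This preview does not give the paper's exact definitions, so the function names, the zero fill value and the use of nearest-neighbour interpolation as the resizing step are all assumptions made for illustration; the official repository may implement these methods differently.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize to (out_h, out_w); aspect ratio is not preserved."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows[:, None], cols]

def fit_inside(img, size):
    """Scale so the longer side equals `size`, preserving the aspect ratio."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h = min(size, max(1, round(h * scale)))
    new_w = min(size, max(1, round(w * scale)))
    return resize_nearest(img, new_h, new_w)

def center_crop(img, size):
    """Cropping (assumed variant): largest centred square, resized to size x size."""
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    return resize_nearest(img[top:top + s, left:left + s], size, size)

def contain(img, size, fill=0):
    """Containing (assumed variant): fit the image inside the square, pad the rest."""
    small = fit_inside(img, size)
    h, w = small.shape[:2]
    canvas = np.full((size, size) + img.shape[2:], fill, dtype=img.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    canvas[top:top + h, left:left + w] = small
    return canvas

def tile(img, size):
    """Tiling (assumed variant): repeat the fitted image until the square is full."""
    small = fit_inside(img, size)
    h, w = small.shape[:2]
    reps = (-(-size // h), -(-size // w)) + (1,) * (small.ndim - 2)
    return np.tile(small, reps)[:size, :size]

def mirror(img, size):
    """Mirroring (assumed variant): like tiling, but each repeat is a reflection."""
    small = fit_inside(img, size)
    h, w = small.shape[:2]

    def reflect_index(n, length):
        idx = np.arange(length) % (2 * n)      # period covers image + its mirror
        return np.where(idx >= n, 2 * n - 1 - idx, idx)

    rows, cols = reflect_index(h, size), reflect_index(w, size)
    return small[rows[:, None], cols]

if __name__ == "__main__":
    image = np.arange(6 * 4 * 3, dtype=np.uint8).reshape(6, 4, 3)  # toy 6x4 RGB image
    for method in (center_crop, contain, tile, mirror):
        print(method.__name__, method(image, 8).shape)
```

Every function returns an array of shape `(size, size)` plus any channel axis, so the outputs can be batched regardless of the original resolution or aspect ratio. Cropping discards content outside the central square, while the other three keep the whole image and differ only in how they fill the leftover area.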

Section snippets

Motivation

CNNs have become one of the most common algorithms for a large number of computer vision problems [6], [21], [22]. However, as discussed in the previous section, CNNs are extremely restrictive about the size of the input images. Many architectural elements of a CNN depend on the size of the input image, and hence CNNs cannot accept inputs of different sizes. For example, architectures like LeNet [2], AlexNet [4] and VGGNet [10] use fully connected layers. The
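
To make the dependence on input size concrete, the following sketch tracks the spatial size of the activations through a LeNet-style stack of two 5 × 5 convolutions, each followed by 2 × 2 pooling, and counts the features that would reach the first fully connected layer. The layer configuration and the helper names `conv_out` and `flattened_features` are assumptions chosen for illustration and are not taken from the paper.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Output size along one axis of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

def flattened_features(side, channels=16):
    """Feature count entering the first dense layer of a LeNet-style stack."""
    side = conv_out(side, kernel=5)             # 5x5 convolution, no padding
    side = conv_out(side, kernel=2, stride=2)   # 2x2 pooling
    side = conv_out(side, kernel=5)             # 5x5 convolution, no padding
    side = conv_out(side, kernel=2, stride=2)   # 2x2 pooling
    return channels * side * side

if __name__ == "__main__":
    for side in (32, 40):
        print(side, flattened_features(side))   # 32 -> 400, 40 -> 784
```

A fully connected layer whose weight matrix was sized for 400 input features cannot accept 784, which is why a network with such a layer is tied to one input resolution.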

Convolutional neural network (CNN)

Since CNNs were brought into the limelight [2], there has been a significant shift in machine learning methods towards a new class called deep learning. The surge of deep learning approaches is especially apparent in the field of image processing. CNNs have proved to be superior to traditional feature-based approaches. As is evident from the performance of AlexNet [4], CNNs provide a significant boost over traditional methods. Many variations of CNNs came into the

Datasets

Unlike standardized datasets like ImageNet [9], CIFAR [34] or MNIST [3], many real-life data samples come in a variety of resolutions. For our experiments, we have chosen five different datasets to cover different domains. For simple classification tasks like optical character recognition, we selected the Devanagari dataset [35] and the Bangla compound character dataset [36]. Increasing the complexity of the dataset further, we move to the Street View House Numbers dataset (SVHN) [37]. Moving from

Experimental results

From the experiments mentioned above, we aim to extract relations between reshaping techniques and various types of datasets. The results are tabulated in a manner that reflects the dataset-specific performance achieved by each reshaping method.

Analysis of results

Corresponding to the experiments mentioned in Section 4.3, a couple of observations can be noted that lead to some specific conclusions

Conclusion

Convolutional neural networks are known to share a common restriction of being bound to a fixed-size input. Moreover, in many settings such as crowd-sourcing, domain adaptation and mobile-based deployment, using square images provides the least compromise for handling unknown sensor resolutions or the aspect ratios of a variety of objects. To address this issue, it is common practice to reshape inputs before pushing them through a CNN. Our work attempts to provide

Acknowledgment

This work is partially supported by the project sponsored by SERB (Government of India, order no. SB/S3/EECE/054/2016) (dated 25/11/2016), and carried out at the Centre for Microprocessor Application for Training Education and Research (CMATER), CSE Department, Jadavpur University.

References (40)

  • K. Turkowski, Filters for common resampling tasks, Graphics Gems (1990)
  • X. Cao et al., Transfer learning for pedestrian detection, Neurocomputing (2013)
  • K. Nogueira et al., Towards better exploiting convolutional neural networks for remote sensing scene classification, Pattern Recognit. (2017)
  • D. Cireşan et al., Multi-column deep neural network for traffic sign classification, Neural Netw. (2012)
  • R. Sarkhel et al., A multi-scale deep quad tree based feature extraction method for the recognition of isolated handwritten characters of popular indic scripts, Pattern Recognit. (2017)
  • S. Roy et al., Handwritten isolated Bangla compound character recognition: a new benchmark using a novel deep learning approach, Pattern Recognit. Lett. (2017)
  • J. Schmidhuber, Deep learning in neural networks: an overview, Neural Netw. (2015)
  • M. Egmont-Petersen et al., Image processing with neural networks - a review, Pattern Recognit. (2002)
  • Y. LeCun et al., Gradient-based learning applied to document recognition, Proc. IEEE (1998)
  • Y. LeCun, The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/...
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Proceedings of the Advances in Neural Information Processing Systems (2012)
  • O. Russakovsky et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. (2015)
  • J. Deng et al., ImageNet: a large-scale hierarchical image database, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009 (2009)
  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556...
  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • C. Szegedy et al., Going deeper with convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015 (2015)
  • S. Ioffe et al., Batch normalization: accelerating deep network training by reducing internal covariate shift, Proceedings of the International Conference on Machine Learning (2015)
  • C. Szegedy et al., Rethinking the inception architecture for computer vision, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • W. Hu et al., Deep convolutional neural networks for hyperspectral image classification, J. Sens. (2015)
  • P. Young et al., From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions, Trans. Assoc. Comput. Linguist. (2014)

Swarnendu Ghosh received his B.Tech degree in Computer Science and Engineering from West Bengal University of Technology in 2012. He received his Masters in Computer Science and Engineering from Jadavpur University in 2014. He has been a doctoral fellow under the Erasmus Mundus Mobility with Asia at University of Evora, Portugal. Currently he is continuing his Ph.D. in Computer Science and Engineering at Jadavpur University. He is also a junior research fellow under the project entitled “Development of knowledge graph from images using deep learning” sponsored by SERB (GOI). His areas of interest are deep learning, graph-based learning and knowledge representation.

Nibaran Das received his B.Tech degree in Computer Science and Technology from Kalyani Govt. Engineering College under Kalyani University in 2003. He received his M.C.S.E. degree from Jadavpur University in 2005 and his Ph.D. (Engg.) degree from Jadavpur University in 2012. He joined J.U. as a lecturer in 2006. His current research interests are OCR of handwritten text, optimization techniques, image processing and deep learning. He has been an editor of the Bengali monthly magazine Computer Jagat since 2005.

Mita Nasipuri received her B.E.Tel.E., M.E.Tel.E., and Ph.D. (Engg.) degrees from Jadavpur University in 1979, 1981 and 1990, respectively. Prof. Nasipuri has been a faculty member of J.U. since 1987. Her current research interests include image processing, pattern recognition, and multimedia systems. She is a senior member of the IEEE, U.S.A., and a Fellow of I.E. (India) and W.B.A.S.T., Kolkata, India.

© 2019 Elsevier Ltd. All rights reserved.
