Convolutional Neural Networks

  • Neural networks history:
    • The first perceptron machine was developed by Frank Rosenblatt in 1957. It was used to recognize letters of the alphabet. Backpropagation hadn’t been developed yet.
    • Multilayer networks (Adaline/Madaline) were developed by Widrow and Hoff in 1960. Backpropagation still hadn’t been developed.
    • Backpropagation was developed in 1986 by Rumelhart, Hinton, and Williams.
    • There was then a period in which little new was happening with neural networks, because of the limited computing resources and data.
    • In 2006, Hinton released a paper showing that we can train a deep neural network by using Restricted Boltzmann Machines to initialize the weights and then applying backpropagation.
    • The first strong results came in 2012 from Hinton’s group in speech recognition, and from AlexNet, the convolutional neural network that won the ImageNet challenge in 2012, also from Hinton’s team.
    • Since then, neural networks have been widely used in various applications.
  • Convolutional neural networks history:
    • In experiments on the cat visual cortex from 1959 to 1968, Hubel & Wiesel found that there is a topographical mapping in the cortex and that neurons have a hierarchical organization, from simple to complex cells.
    • In 1998, Yann LeCun published the paper “Gradient-Based Learning Applied to Document Recognition”, which introduced convolutional neural networks. It was good at recognizing zip code digits but couldn’t scale to more complex examples.
    • In 2012, AlexNet used essentially the same architecture as LeCun’s and won the ImageNet challenge. The difference from 1998 is that we now have large datasets to train on, and the power of GPUs solved a lot of performance problems.
    • Starting from 2012, CNNs have been used for various tasks. Here are some applications:
      • Image classification.
      • Image retrieval.
        • Extract features using a NN, then do similarity matching.
      • Object detection.
      • Segmentation.
        • Each pixel in an image takes a label.
      • Face recognition.
      • Pose recognition.
      • Medical images.
      • Playing Atari games with reinforcement learning.
      • Galaxy classification.
      • Street signs recognition.
      • Image captioning.
      • Deep dream.
  • ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture.
  • There are a few distinct types of layers in a ConvNet (e.g. CONV/FC/RELU/POOL are by far the most popular)
  • Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don’t)
  • Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn’t)
  • How do convolutional neural networks work?
    • A fully connected layer is a layer in which every neuron is connected to all the inputs. Sometimes we call it a dense layer.
      • If the input shape is \((X, M)\) (\(X\) features, \(M\) examples), the weight shape for this layer will be \((NoOfHiddenNeurons, X)\), as in the sketch below.
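      • A minimal numpy sketch of a dense layer’s forward pass (the shapes follow the notes; the variable names and sizes are illustrative, not from the course):

        ```python
        import numpy as np

        X, M, hidden = 3072, 16, 10            # e.g. a flattened 32*32*3 image, batch of 16

        x = np.random.randn(X, M)              # input, shape (X, M)
        W = 0.01 * np.random.randn(hidden, X)  # weights, shape (NoOfHiddenNeurons, X)
        b = np.zeros((hidden, 1))              # bias, broadcast across the batch

        out = W @ x + b                        # output, shape (NoOfHiddenNeurons, M)
        print(out.shape)                       # (10, 16)
        ```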
    • A convolution layer is a layer that preserves the spatial structure of the input by sliding a filter across the whole image.
      • At each location we compute a dot product: \(W^T x + b\) (the bias is added via broadcasting).
      • So we need to learn the values of \(W\) and \(b\).
      • We usually stretch the filter (\(W\)) into a vector, not a matrix, when computing the dot product.
    • We call the output of the convolution an activation map. We want multiple activation maps, one per filter.
      • Example: if we have 6 filters, here are the shapes (see the sketch after this list):
        • Input image \((32,32,3)\)
        • filter size \((5,5,3)\)
          • We apply 6 such filters. The filter depth must be 3 because the input has a depth of 3.
        • Output of Conv. \((28,28,6)\)
          • With one filter it would be \((28,28,1)\)
        • After RELU \((28,28,6)\)
        • A second conv layer with 10 filters of shape \((5,5,6)\)
        • Output of Conv. \((24,24,10)\)
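      • A naive numpy sketch that reproduces these shapes (unvectorized, stride 1, no padding; the function name is illustrative):

        ```python
        import numpy as np

        def conv_forward(x, filters):
            """Naive convolution: x is (H, W, C); filters is (K, F, F, C)."""
            H, W, C = x.shape
            K, F, _, _ = filters.shape
            H_out, W_out = H - F + 1, W - F + 1       # (N - F)/stride + 1, stride 1
            out = np.zeros((H_out, W_out, K))
            for k in range(K):                        # one activation map per filter
                for i in range(H_out):
                    for j in range(W_out):
                        patch = x[i:i + F, j:j + F, :]             # local region
                        out[i, j, k] = np.sum(patch * filters[k])  # dot product
            return out

        x = np.random.randn(32, 32, 3)
        a1 = np.maximum(0, conv_forward(x, np.random.randn(6, 5, 5, 3)))  # conv + ReLU
        a2 = conv_forward(a1, np.random.randn(10, 5, 5, 6))
        print(a1.shape, a2.shape)   # (28, 28, 6) (24, 24, 10)
        ```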
    • It turns out that ConvNets learn low-level features in the first layers, then mid-level features, and then high-level features.
    • After the conv layers we can attach a linear classifier for a classification task.
    • In convolutional neural networks we usually have a few (Conv ==> ReLU) blocks and then apply a pooling operation to downsample the size of the activations.
  • What is the stride when doing a convolution?
    • When building a conv layer we have choices to make regarding the stride. Let’s explain this with examples.
    • The stride is the step size of the filter as it slides. By default it’s 1.
    • Given a matrix of shape \((7,7)\) and a filter of shape \((3,3)\):
      • If the stride is \(1\), the output shape will be \((5,5)\) # the size shrinks by 2
      • If the stride is \(2\), the output shape will be \((3,3)\) # the size shrinks by 4
      • If the stride is \(3\), it doesn’t work: the filter doesn’t fit evenly.
    • A general formula is \(O = ((N-F)/stride) + 1\) (checked by the sketch after these examples):
      • If stride is \(1\) then \(O = ((7-3)/1)+1 = 4 + 1 = 5\)
      • If stride is \(2\) then \(O = ((7-3)/2)+1 = 2 + 1 = 3\)
      • If stride is \(3\) then \(O = ((7-3)/3)+1 = 1.33 + 1 = 2.33\) # doesn’t work
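    • A small helper to check the formula (an illustrative sketch, not course code):

      ```python
      def conv_output_size(N, F, stride):
          """(N - F)/stride + 1, or None if the filter doesn't fit evenly."""
          if (N - F) % stride != 0:
              return None                       # "doesn't work"
          return (N - F) // stride + 1

      for s in (1, 2, 3):
          print(s, conv_output_size(7, 3, s))   # 1 -> 5, 2 -> 3, 3 -> None
      ```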
  • In practice it’s common to zero pad the border. # Padding from both sides
    • Given a stride of \(1\), it’s common to pad with \((F-1)/2\) zeros on each side, where \(F\) is the filter size:
      • Example \(F = 3\) ==> Zero pad with \(1\)
      • Example \(F = 5\) ==> Zero pad with \(2\)
    • If we pad this way, we call it a “same” convolution.
    • Adding zeros introduces artificial features at the edges; that’s why there are other padding techniques (e.g. padding with nearby pixel values instead of zeros), but in practice zeros work!
    • We do this to maintain the full spatial size of the input. If we didn’t, the activations would shrink too fast and we would lose a lot of information, as the sketch below illustrates.
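    • A quick numpy sketch of “same” padding (assuming stride 1 and odd \(F\); the function name is illustrative):

      ```python
      import numpy as np

      def same_pad(x, F):
          """Zero-pad an (H, W, C) input so a stride-1 conv with an FxF filter keeps H and W."""
          p = (F - 1) // 2
          return np.pad(x, ((p, p), (p, p), (0, 0)))  # pad H and W on both sides, not C

      x = np.random.randn(32, 32, 3)
      print(same_pad(x, 3).shape)   # (34, 34, 3); 34 - 3 + 1 = 32 after the conv
      print(same_pad(x, 5).shape)   # (36, 36, 3); 36 - 5 + 1 = 32 after the conv
      ```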
  • Example:
    • If we have an input of shape \((32,32,3)\) and ten filters of shape \((5,5)\) (each with depth 3) with stride \(1\) and pad \(2\):
      • Output size will be \((32,32,10)\) # We maintain the size
    • Number of parameters per filter \(= 5*5*3 + 1 = 76\) # the \(+1\) is the bias
    • All parameters: \(76 * 10 = 760\)
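    • The same arithmetic as a sketch (the helper name is illustrative):

      ```python
      def conv_params(F, C_in, K):
          """Parameters of a conv layer: F*F*C_in weights + 1 bias per filter, times K filters."""
          per_filter = F * F * C_in + 1
          return per_filter, per_filter * K

      print(conv_params(5, 3, 10))   # (76, 760)
      ```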
  • The number of filters is usually a power of 2 so the computation vectorizes well.
  • So here are the hyperparameters of a conv layer (put together in the sketch after this list):
    • Number of filters K.
      • Usually a power of 2.
    • The spatial extent (filter size) \(F\).
      • Usually 3, 5, 7, …
    • The stride S.
      • Usually 1 or 2. (A large stride downsamples the input, similar to pooling but via a different mechanism.)
    • The amount of padding \(P\).
      • If we want the output shape to match the input shape: if \(F\) is 3 pad 1, if \(F\) is 5 pad 2, and so on, i.e. \((F-1)/2\).
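  • Putting the four hyperparameters together, the output volume follows from the same formula (an illustrative sketch):

    ```python
    def conv_output_volume(H, W, C, K, F, S, P):
        """(H, W, C) input through a conv layer with K filters of size F,
        stride S, and padding P -> (H', W', K) output volume."""
        H_out = (H - F + 2 * P) // S + 1
        W_out = (W - F + 2 * P) // S + 1
        return H_out, W_out, K

    print(conv_output_volume(32, 32, 3, K=10, F=5, S=1, P=2))   # (32, 32, 10)
    ```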
  • Pooling makes the representation smaller and more manageable.
  • Pooling operates over each activation map independently.
  • An example of pooling is max pooling.
    • The parameters of max pooling are the filter size and the stride (see the sketch below).
      • Example: \(2x2\) with stride \(2\) # usually the two parameters are the same: 2, 2
  • Another example of pooling is average pooling.
    • In this case it might even be learnable (average pooling is like a filter with fixed uniform weights, so those weights could be learned).
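  • A minimal numpy sketch of \(2x2\) max pooling with stride \(2\) (assuming even H and W; the function name is illustrative):

    ```python
    import numpy as np

    def max_pool_2x2(x):
        """2x2 max pooling with stride 2 on an (H, W, C) volume,
        applied to each activation map (channel) independently."""
        H, W, C = x.shape
        return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

    a = np.random.randn(28, 28, 6)
    print(max_pool_2x2(a).shape)   # (14, 14, 6)
    ```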

Citation

If you found our work useful, please cite it as:

@article{Chadha2020ConvolutionalNeuralNetworks,
  title   = {Convolutional Neural Networks},
  author  = {Chadha, Aman},
  journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}