Convolutional Neural Networks

• Neural networks history:
• The first perceptron machine was developed by Frank Rosenblatt in 1957. It was used to recognize letters of the alphabet. Back propagation wasn't developed yet.
• Multilayer perceptron-like machines (Adaline/Madaline) were developed by Widrow and Hoff in 1960. Back propagation wasn't developed yet.
• Back propagation was developed in 1986 by Rumelhart.
• There followed a period in which little new was happening with NNs, because of limited computing resources and data.
• In 2006, Hinton released a paper showing that we can train a deep neural network by using Restricted Boltzmann machines to initialize the weights and then applying back propagation.
• The first strong results came in 2012: speech recognition by Hinton's group, and AlexNet, a convolutional neural network that won the ImageNet challenge that year, also from Hinton's team.
• Since then, neural nets have been widely used in various applications.
• Convolutional neural networks history:
• Hubel & Wiesel's experiments on the cat visual cortex (1959 to 1968) found that there is a topographical mapping in the cortex and that the neurons have a hierarchical organization from simple to complex cells.
• In 1998, Yann LeCun published the paper Gradient-based learning applied to document recognition, which introduced convolutional neural networks. It was good at recognizing zip code digits but couldn't scale to more complex examples.
• In 2012, AlexNet used essentially the same architecture as LeCun's and won the ImageNet challenge. The difference from 1998 is that large datasets had become available, and the power of GPUs solved a lot of performance problems.
• Starting from 2012, CNNs have been used for various tasks (here are some applications):
• Image classification.
• Image retrieval.
• Extract features using a NN and then do similarity matching.
• Object detection.
• Segmentation.
• Each pixel in an image takes a label.
• Face recognition.
• Pose recognition.
• Medical images.
• Playing Atari games with reinforcement learning.
• Galaxy classification.
• Street sign recognition.
• Image captioning.
• Deep dream.
• ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture.
• There are a few distinct types of layers in a ConvNet (e.g. CONV/FC/RELU/POOL are by far the most popular)
• Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don’t)
• Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn’t)
• How do convolutional neural networks work?
• A fully connected layer is a layer in which all the neurons are connected. Sometimes we call it a dense layer.
• If the input shape is $$(X, M)$$ (where $$M$$ is the batch size), the weight shape for this will be $$(NoOfHiddenNeurons, X)$$.
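• A minimal NumPy sketch of such a fully connected layer under these shape conventions (the variable names and sizes are illustrative, e.g. X = 32*32*3 flattened):

```python
import numpy as np

X, M = 3072, 64                      # input features (e.g., 32*32*3 flattened), batch size
num_hidden = 100                     # NoOfHiddenNeurons

x = np.random.randn(X, M)            # input of shape (X, M)
W = np.random.randn(num_hidden, X)   # weights of shape (NoOfHiddenNeurons, X)
b = np.zeros((num_hidden, 1))        # bias, broadcast across the batch

out = W @ x + b                      # output of shape (NoOfHiddenNeurons, M)
print(out.shape)                     # (100, 64)
```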
• A convolution layer is a layer in which we keep the spatial structure of the input by sliding a filter over the whole image.
• We do this with a dot product: $$W^Tx + b$$. This computation uses broadcasting.
• So we need to learn the values of $$W$$ and $$b$$.
• We usually deal with the filter ($$W$$) stretched out into a vector, not a matrix.
• We call the output of the convolution an activation map. We need multiple activation maps (one per filter).
• Example: if we have 6 filters, here are the shapes (see the sketch after this list):
• Input image $$(32,32,3)$$
• filter size $$(5,5,3)$$
• We apply 6 filters. The filter depth must be three because the input has a depth of three.
• Output of Conv. $$(28,28,6)$$
• With one filter it would be $$(28,28,1)$$.
• After RELU $$(28,28,6)$$
• A second conv layer with 10 filters of shape $$(5,5,6)$$ (the depth matches the previous output):
• Output of Conv. $$(24,24,10)$$
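• To sanity-check these shapes, here is a naive, loop-based convolution sketch in NumPy (stride 1, no padding; the function name `conv_forward` and the array layout are illustrative, and a real implementation would be vectorized):

```python
import numpy as np

def conv_forward(x, filters, b):
    """Naive convolution: x is (H, W, C), filters is (K, F, F, C), b is (K,)."""
    H, W, C = x.shape
    K, F, _, _ = filters.shape
    H_out, W_out = H - F + 1, W - F + 1              # stride 1, no padding
    out = np.zeros((H_out, W_out, K))
    for k in range(K):                               # one activation map per filter
        for i in range(H_out):
            for j in range(W_out):
                # dot product of the filter with the local patch, plus bias
                out[i, j, k] = np.sum(x[i:i+F, j:j+F, :] * filters[k]) + b[k]
    return out

x = np.random.randn(32, 32, 3)                       # input image (32,32,3)
w1 = np.random.randn(6, 5, 5, 3)                     # 6 filters of shape (5,5,3)
a1 = np.maximum(0, conv_forward(x, w1, np.zeros(6))) # Conv + ReLU
print(a1.shape)                                      # (28, 28, 6)

w2 = np.random.randn(10, 5, 5, 6)                    # 10 filters of shape (5,5,6)
a2 = conv_forward(a1, w2, np.zeros(10))
print(a2.shape)                                      # (24, 24, 10)
```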
• It turns out that ConvNets learn low-level features in the first layers, then mid-level features, and then high-level features.
• After the conv layers we can have a linear classifier for a classification task.
• In convolutional neural networks we usually have some (Conv ==> ReLU) blocks and then apply a pooling operation to downsample the size of the activation.
• What is the stride when we are doing convolution?
• While doing a conv layer we have many choices to make regarding the stride we take. I will explain this by example.
• Stride is how far we skip while sliding the filter. By default it's 1.
• Given a matrix of shape $$(7,7)$$ and a filter of shape $$(3,3)$$:
• If stride is $$1$$ then the output shape will be $$(5,5)$$ # 2 are dropped
• If stride is $$2$$ then the output shape will be $$(3,3)$$ # 4 are dropped
• If stride is $$3$$ it doesn’t work.
• A general formula would be $$O = ((N-F)/stride) + 1$$ (see the helper sketch after these examples):
• If stride is $$1$$ then $$O = ((7-3)/1)+1 = 4 + 1 = 5$$
• If stride is $$2$$ then $$O = ((7-3)/2)+1 = 2 + 1 = 3$$
• If stride is $$3$$ then $$O = ((7-3)/3)+1 = 1.33 + 1 = 2.33$$ # doesn’t work
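• A small helper (illustrative, not from the original notes) that applies this formula and flags strides that don't fit:

```python
def conv_output_size(N, F, stride):
    """Output size ((N - F) / stride) + 1; valid only if it divides evenly."""
    if (N - F) % stride != 0:
        raise ValueError(f"stride {stride} doesn't fit: (N-F)/stride is not an integer")
    return (N - F) // stride + 1

print(conv_output_size(7, 3, 1))   # 5
print(conv_output_size(7, 3, 2))   # 3
# conv_output_size(7, 3, 3) -> raises: stride 3 doesn't work
```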
• In practice it’s common to zero pad the border. # Padding from both sides
• Given a stride of $$1$$, it's common to pad according to this equation: $$(F-1)/2$$, where $$F$$ is the filter size.
• Example $$F = 3$$ ==> Zero pad with $$1$$
• Example $$F = 5$$ ==> Zero pad with $$2$$
• If we pad this way, we call it a "same" convolution.
• Adding zeros introduces artificial features at the edges; that's why there are other padding techniques (e.g., padding with nearby pixel values instead of zeros), but in practice zeros work!
• We do this to maintain the full size of the input. If we didn't, the activations would shrink too fast and we would lose a lot of information.
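• A quick sketch showing that zero-padding with $$(F-1)/2$$ preserves the spatial size (a "same" convolution), using NumPy's np.pad:

```python
import numpy as np

N, F, stride = 7, 3, 1
pad = (F - 1) // 2                   # = 1 for a 3x3 filter
x = np.random.randn(N, N)
x_padded = np.pad(x, pad)            # zero-pad both sides: shape (9, 9)
out_size = (x_padded.shape[0] - F) // stride + 1
print(out_size)                      # 7, same as the input
```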
• Example (checked in the snippet after this list):
• If we have an input of shape $$(32,32,3)$$ and ten filters of shape $$(5,5)$$ with stride $$1$$ and pad $$2$$:
• Output size will be $$(32,32,10)$$ # We maintain the size
• Size of parameters per filter $$= 5*5*3 + 1 = 76$$
• All parameters: $$76 * 10 = 760$$
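• A quick check of this parameter count in code (filter weights plus one bias per filter):

```python
F, C, K = 5, 3, 10                       # filter size, input depth, number of filters
params_per_filter = F * F * C + 1        # +1 for the bias
total_params = params_per_filter * K
print(params_per_filter, total_params)   # 76 760
```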
• The number of filters is usually chosen to be a power of 2, to vectorize well.
• So here are the parameters for the Conv layer:
• Number of filters K.
• Usually a power of 2.
• Spatial extent (filter size) F.
• 3, 5, 7, …
• The stride S.
• Usually 1 or 2 (if the stride is big there will be downsampling, but it's different from pooling)
• Amount of Padding
• If we want the output shape to be the same as the input shape, the padding depends on F: if F is 3, pad with 1; if F is 5, pad with 2; and so on.
• Pooling makes the representation smaller and more manageable.
• Pooling operates over each activation map independently.
• An example of pooling is max pooling (a sketch follows this list).
• The parameters of max pooling are the size of the filter and the stride.
• Example: $$2x2$$ with stride $$2$$ # Usually the two parameters are the same: 2, 2
• Another example of pooling is average pooling.
• In this case it might be learnable.
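• A minimal NumPy sketch of $$2x2$$ max pooling with stride $$2$$; note how it pools each activation map (channel) independently (the function name `max_pool` is illustrative):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """x is (H, W, C); pools each channel (activation map) independently."""
    H, W, C = x.shape
    H_out, W_out = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((H_out, W_out, C))
    for i in range(H_out):
        for j in range(W_out):
            patch = x[i*stride:i*stride+size, j*stride:j*stride+size, :]
            out[i, j, :] = patch.max(axis=(0, 1))   # max over each 2x2 window
    return out

a = np.random.randn(28, 28, 6)
print(max_pool(a).shape)   # (14, 14, 6)
```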

Citation

If you found our work useful, please cite it as:

@article{Chadha2020ConvolutionalNeuralNetworks,
title   = {Convolutional Neural Networks},
author  = {Chadha, Aman},
journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
year    = {2020},
note    = {\url{https://aman.ai}}
}