CS231n • Training Neural Networks I
- Training Neural Networks I
- Activation Functions
- Data preprocessing
- Weight initialization
- Batch normalization
- Babysitting the learning process
- Hyperparameter Optimization
- Citation
Training Neural Networks I
- As a review, here are the steps of the mini-batch stochastic gradient descent algorithm (a minimal code sketch follows the list):
- Loop:
- Sample a batch of data.
- Forward prop it through the graph (network) and get the loss.
- Backprop to calculate the gradients.
- Update the parameters using the gradients.
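Below is a minimal runnable sketch of this loop, using a tiny linear softmax classifier on random data purely for illustration; the sizes, learning rate, and iteration count are arbitrary placeholders, not values from the lecture.

```python
import numpy as np

np.random.seed(0)
N, D, C = 1000, 20, 5                       # examples, features, classes (arbitrary)
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W = 0.01 * np.random.randn(D, C)

learning_rate, batch_size = 1e-1, 64
for it in range(200):
    # 1. Sample a batch of data.
    idx = np.random.choice(N, batch_size, replace=False)
    X_b, y_b = X[idx], y[idx]

    # 2. Forward prop through the (here: one-layer) network and get the loss.
    scores = X_b @ W
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(batch_size), y_b]).mean()

    # 3. Backprop to calculate the gradient.
    dscores = probs.copy()
    dscores[np.arange(batch_size), y_b] -= 1
    dW = X_b.T @ dscores / batch_size

    # 4. Update the parameters using the gradient.
    W -= learning_rate * dW
```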
Activation Functions
- Different choices for the activation function include Sigmoid, tanh, RELU, Leaky RELU, Maxout, and ELU.
- Sigmoid:
- Squashes numbers into the range \([0,1]\).
- Has a nice interpretation as the firing rate of a neuron, loosely like the human brain.
- Problems with sigmoid:
- Saturated neurons (very large or very small inputs) kill the gradients.
- The gradient is close to 0 for most inputs (large or small values), which kills the updates when the graph/network is large; a small numeric example is sketched below.
- Not zero-centered.
- It doesn't produce zero-mean outputs.
- exp() is a bit compute expensive, though this is minor compared to heavier operations such as convolution.
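A quick numeric illustration of the saturation problem (a minimal sketch; `sigmoid` is just the standard formula written out, not a library call):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
s = sigmoid(x)
local_grad = s * (1 - s)   # derivative of sigmoid w.r.t. its input

print(s)            # [~0.00005, 0.119, 0.5, 0.881, ~0.99995]
print(local_grad)   # [~0.00005, 0.105, 0.25, 0.105, ~0.00005]
# At |x| = 10 the local gradient is ~4.5e-5: almost no signal flows backward.
```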
- tanh:
- Squashes numbers into the range \([-1,1]\).
- Zero-centered.
- Saturated neurons still "kill" the gradients.
- tanh(x) is the equation.
- Proposed by Yann LeCun in 1991.
- RELU (Rectified Linear Unit):
RELU(x) = max(0, x)
- Doesn't saturate in the positive region, so it doesn't kill the gradients there.
- It only kills the gradient in the negative half (the gradient is exactly zero for x < 0).
- Computationally efficient.
- Converges much faster than sigmoid and tanh (roughly 6x in practice).
- More biologically plausible than sigmoid.
- Popularized by Alex Krizhevsky et al. in 2012 with AlexNet.
- Problems:
- Not zero-centered.
- If the weights aren't initialized well (or the learning rate is too high), a large fraction of the neurons can be "dead" (they never activate), which is wasted computation. The network still works, though; improving this is an active area of research.
- To mitigate this, some people initialize all the biases to a small positive value such as 0.01.
- Leaky RELU:
leaky_RELU(x) = max(0.01 * x, x)
- Doesn't kill the gradient on either side.
- Computationally efficient.
- Converges much faster than sigmoid and tanh (roughly 6x).
- Will not die.
- PRELU (Parametric RELU) replaces the fixed 0.01 slope with a parameter alpha that is learned during training.
- Exponential Linear Units (ELU):
ELU(x) = x if x > 0, alpha * (exp(x) - 1) if x <= 0   # alpha is a hyperparameter (often set to 1)
- Has all the benefits of RELU.
- Outputs are closer to zero-mean, and it adds some robustness to noise.
- Problems:
- exp() is a bit compute expensive.
- Maxout activations:
maxout(x) = max(w1.T*x + b1, w2.T*x + b2)
- Generalizes RELU and Leaky RELU
- Doesn’t die!
- Problems:
- Doubles the number of parameters per neuron.
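For reference, here is a minimal numpy sketch of these activations; the maxout version takes the two pre-activation vectors as inputs (since maxout needs two sets of weights and biases per neuron), and the test values are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def maxout(z1, z2):
    # z1 = W1.T @ x + b1 and z2 = W2.T @ x + b2 are the two linear branches.
    return np.maximum(z1, z2)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [ 0.     0.     0.     0.5    2.   ]
print(leaky_relu(x))  # [-0.02  -0.005  0.     0.5    2.   ]
print(elu(x))         # [-0.8647 -0.3935 0.     0.5    2.   ]
```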
- In practice:
- Use RELU. Be careful with your learning rates.
- Try out Leaky RELU/Maxout/ELU
- Try out tanh but don’t expect much.
- Don’t use sigmoid!
Data preprocessing
- Normalize the data:
# Zero-center the data by subtracting the per-feature mean (assuming X is [N x D]).
# One reason we do this is so that inputs take both positive and negative values,
# rather than being all positive or all negative.
X -= np.mean(X, axis = 0)
# Then divide by the standard deviation. Note: for images we usually skip this step.
X /= np.std(X, axis = 0)
- To normalize images:
- Subtract the mean image (e.g. AlexNet).
- The mean image has the same shape as the input images (e.g. [32, 32, 3]).
- Or subtract the per-channel mean (e.g. VGGNet).
- That is, compute the mean of each channel over all images; the result is just 3 numbers (one per channel). A sketch of both options follows.
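A minimal sketch of both options, assuming the training images are stored in an array `X_train` of shape [N, H, W, 3] (the random stand-in data below is only there to make the snippet runnable):

```python
import numpy as np

# X_train: hypothetical stack of training images, shape [N, H, W, 3], float dtype.
X_train = np.random.rand(50, 32, 32, 3).astype(np.float32)  # stand-in data

# Option 1: subtract the mean image (shape [H, W, 3]), as in AlexNet.
mean_image = X_train.mean(axis=0)
X_centered = X_train - mean_image

# Option 2: subtract the per-channel mean (just 3 numbers), as in VGGNet.
per_channel_mean = X_train.mean(axis=(0, 1, 2))   # shape (3,)
X_centered = X_train - per_channel_mean
```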
Weight initialization
- What happens when we initialize all the \(W\)'s with zeros?
- All the neurons will do exactly the same thing: they will get the same gradient and therefore the same update, so the symmetry is never broken.
- More generally, the same thing happens whenever all the \(W\)'s of a layer are equal to the same constant, as the small demo below shows.
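A small demo of the symmetry problem (a minimal sketch; the two-layer sigmoid net, the constant 0.01, and the random "upstream gradient" are arbitrary choices for illustration):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 4)            # 5 examples, 4 input features

W1 = np.full((4, 3), 0.01)           # every weight in layer 1 is the same constant
W2 = np.full((3, 2), 0.01)           # every weight in layer 2 is the same constant

h = 1.0 / (1.0 + np.exp(-(X @ W1)))  # sigmoid hidden layer: all 3 columns identical
scores = h @ W2

dscores = np.random.randn(5, 2)      # pretend upstream gradient from some loss
dW2 = h.T @ dscores                  # every row identical -> same update per hidden unit
dh = dscores @ W2.T                  # every column identical
dW1 = X.T @ (dh * h * (1 - h))       # every column identical -> hidden units stay clones

print(np.allclose(h[:, 0:1], h))     # True: all hidden units compute the same thing
print(np.allclose(dW1[:, 0:1], dW1)) # True: all hidden units get the same update
```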
- A first idea is to initialize the \(W\)'s with small random numbers:
W = 0.01 * np.random.randn(D, H) # Works OK for small networks, but causes problems in deeper networks!
- With small weights, the standard deviation of the activations shrinks toward zero layer by layer, so the gradients vanish sooner in deep networks.
W = 1.0 * np.random.randn(D, H) # The opposite problem: the initial weights are too large.
- With big numbers the network explodes: with tanh, almost all neurons saturate at -1 or 1, so the gradients again die.
- Xavier initialization:
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
- It works because we want the variance of each layer's output to match the variance of its input.
- But it has an issue: it breaks when you use RELU, because RELU zeroes out half of the activations and therefore halves the variance.
- He initialization (the solution for the RELU issue):
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
- Solves the issue with RELU; it's the recommended initialization when you use RELU. A small experiment comparing these initializations is sketched below.
- Proper initialization is an active area of research.
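A minimal sketch of the kind of experiment shown in the lecture: push random data through a stack of tanh layers and watch the standard deviation of the activations at each layer under the three initialization scales above (the 10-layer, 500-unit setup is an arbitrary choice for illustration):

```python
import numpy as np

def layer_activation_stds(init_fn, num_layers=10, size=500):
    """Forward random data through a deep tanh net; record activation stds per layer."""
    np.random.seed(0)
    x = np.random.randn(1000, size)
    stds = []
    for _ in range(num_layers):
        W = init_fn(size, size)
        x = np.tanh(x @ W)
        stds.append(float(x.std()))
    return stds

small  = lambda fan_in, fan_out: 0.01 * np.random.randn(fan_in, fan_out)
xavier = lambda fan_in, fan_out: np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
he     = lambda fan_in, fan_out: np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)

print(layer_activation_stds(small))   # stds collapse toward 0 -> vanishing gradients
print(layer_activation_stds(xavier))  # stds stay roughly constant across layers
print(layer_activation_stds(he))      # designed for RELU; with tanh it runs a bit hot
```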
Batch normalization
- Batch normalization is a technique that provides any layer in a neural network with inputs that are zero-mean/unit-variance.
- It speeds up training; in practice you want to use it a lot.
- Introduced by Sergey Ioffe and Christian Szegedy in 2015.
- We make the activations in each layer roughly unit Gaussian by normalizing with the mean and the variance of the batch.
- Usually inserted after fully connected or convolutional layers and before the nonlinearity.
- Steps (for each feature of a layer's output); a numpy sketch is given at the end of this section:
- First we compute the mean and the variance of the batch for each feature.
- We normalize by subtracting the mean and dividing by the square root of (variance + epsilon).
- epsilon is a small constant so we never divide by zero.
- Then we apply a learned scale and shift:
Result = gamma * normalizedX + beta
- gamma and beta are learnable parameters.
- This makes it possible for a layer to say: "Hey! I don't want zero-mean/unit-variance input, give me back the raw input; it's better for me."
- In other words, the layer can shift and scale to whatever distribution it wants, not just zero mean/unit variance.
- The algorithm makes each layer flexible (it chooses which distribution it wants).
- We initialize the BatchNorm parameters to transform the input to a zero-mean/unit-variance distribution, but during training they can learn that some other distribution might be better.
- During training we also keep a running (exponentially weighted) average of the mean and variance of each layer, which is used in place of the batch statistics at test time.
- Benefits of Batch Normalization
- Networks train faster.
- Allows higher learning rates.
- Helps reduce sensitivity to the initial weights.
- Makes more activation functions viable.
- Provides some regularization.
- Because we calculate the mean and variance per batch rather than over the whole dataset, the added noise gives a slight regularization effect.
- In conv layers, we will have one variance and one mean per activation map.
- Batch normalization has worked best for convolutional and regular deep NNs; for recurrent NNs and reinforcement learning it is still an active research area.
- It's challenging in reinforcement learning because the batches are small.
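A minimal numpy sketch of the batch-norm forward pass described above, in training mode for a fully connected layer; the function name, momentum value, and interface are illustrative, not any particular framework's API:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      eps=1e-5, momentum=0.9):
    """x: [N, D] batch of layer outputs; gamma, beta: [D] learnable scale/shift."""
    mu = x.mean(axis=0)                       # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize (eps avoids divide-by-zero)
    out = gamma * x_hat + beta                # learned scale and shift

    # Running statistics, used in place of the batch statistics at test time.
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return out, running_mean, running_var

# Usage sketch
D = 4
x = np.random.randn(32, D) * 3.0 + 5.0        # a batch that is far from zero mean/unit variance
gamma, beta = np.ones(D), np.zeros(D)
out, rm, rv = batchnorm_forward(x, gamma, beta, np.zeros(D), np.ones(D))
print(out.mean(axis=0), out.std(axis=0))      # ~0 and ~1 per feature
```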
Babysitting the learning process
- Preprocessing of data.
- Choose the architecture.
- Make a forward pass and check the loss with regularization disabled. Check that the loss is reasonable (for a softmax classifier over C classes, the initial loss should be about ln(C); a sanity-check sketch follows this list).
- Add regularization; the loss should go up!
- Disable the regularization again, take a small subset of the data, and train until you reach (near) zero loss.
- You should be able to overfit a small dataset perfectly.
- Then take your full training data with a small amount of regularization and try some values of the learning rate.
- If the loss is barely changing, the learning rate is too small.
- If you get NaN, the loss exploded and your learning rate is too high.
- Get a rough learning rate range from the smallest value that still changes the loss and the largest value that doesn't explode the network.
- Then do hyperparameter optimization to find the best hyperparameter values.
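A minimal sketch of the initial-loss sanity check; the CIFAR-10-like sizes and the tiny random linear "model" are stand-ins for your own untrained network:

```python
import numpy as np

N, D, C = 100, 3072, 10                      # e.g. CIFAR-10-sized inputs, 10 classes
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

W = 0.0001 * np.random.randn(D, C)           # small random weights, no regularization
scores = X @ W
scores -= scores.max(axis=1, keepdims=True)  # numerical stability
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(N), y]).mean()

print(loss)   # should be close to ln(10) ~= 2.3 before any training
```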
Hyperparameter Optimization
- Use a cross-validation strategy: start with coarse ranges, then refine.
- Run for only a few epochs first to get a feel for which ranges work.
- It's best to search in log space (e.g. sample the exponent of the learning rate uniformly).
- Adjust your ranges and try again.
- It's better to use random search instead of grid search (again in log space), as in the sketch below.
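A minimal sketch of random search in log space over the learning rate and regularization strength; the ranges and the dummy `train_and_validate` function are illustrative placeholders for your own training run:

```python
import numpy as np

def train_and_validate(lr, reg):
    """Hypothetical stand-in for a short training run that returns validation accuracy.
    Replace with your real training loop; here it just returns a random score."""
    return float(np.random.rand())

results = []
for _ in range(100):
    lr  = 10 ** np.random.uniform(-6, -3)   # sample the exponent, not the raw value
    reg = 10 ** np.random.uniform(-5, 5)
    val_acc = train_and_validate(lr, reg)
    results.append((val_acc, lr, reg))

# Inspect the best settings, then narrow the ranges and repeat.
for val_acc, lr, reg in sorted(results, reverse=True)[:10]:
    print(f"val_acc={val_acc:.3f}  lr={lr:.2e}  reg={reg:.2e}")
```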
Citation
If you found our work useful, please cite it as:
@article{Chadha2020TrainingNeuralNetworksI,
title = {Training Neural Networks I},
author = {Chadha, Aman},
journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
year = {2020},
note = {\url{https://aman.ai}}
}