CS231n • Training Neural Networks II
Optimization algorithms
Problems with stochastic gradient descent:
- If the loss changes quickly in one direction and slowly in another (picture just two parameters), SGD makes very slow progress along the shallow dimension while jittering along the steep direction. Neural networks have millions of parameters, so this problem becomes much worse.
- Local minima and saddle points:
- If SGD reaches a local minimum, it gets stuck there because the gradient is zero.
- The gradient is also zero at saddle points, so SGD gets stuck there as well.
- A saddle point is a point where the loss increases along some directions and decreases along others.
- Saddle points are far more common in high-dimensional spaces (e.g., 100 million parameters).
- For deep networks the problem is more about saddle points than local minima, because deep networks have very high-dimensional parameter spaces.
- Mini-batch gradients are noisy, because each gradient is estimated from a small subset of the data rather than the full dataset.
SGD + momentum
- Build up velocity as a running mean of gradients:

```python
# Momentum update: keep a running (exponentially weighted) mean of gradients.
# rho is the momentum coefficient; values in the range [0.9, 0.99] work best.
# The velocity v is initialized to zero (v[0] = 0).
v = rho * v + dx
x -= learning_rate * v
```

- Solves the saddle point and local minimum problems: the accumulated velocity carries the update through points where the gradient is zero.
- It can overshoot the minimum and then come back toward it.
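Below is a minimal, self-contained sketch of the same update on a toy quadratic loss; the loss, rho, learning rate, and step count are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Toy loss f(x) = 0.5 * x^T A x with very different curvature along the two dimensions,
# i.e., the "fast in one direction, slow in another" setting described above.
A = np.diag([10.0, 0.1])

def compute_gradient(x):
    return A @ x

x = np.array([1.0, 1.0])
v = np.zeros_like(x)
rho, learning_rate = 0.9, 0.01

for step in range(500):
    dx = compute_gradient(x)
    v = rho * v + dx            # running mean of gradients (velocity)
    x -= learning_rate * v      # step along the velocity instead of the raw gradient

print(x)  # moves toward the minimum at [0, 0]
```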
Nesterov momentum

```python
# Nesterov momentum: "look ahead" along the velocity before computing the step.
dx = compute_gradient(x)
old_v = v
v = rho * v - learning_rate * dx
x += -rho * old_v + (1 + rho) * v
```

- Doesn't overshoot as much as SGD + momentum, but it is a bit slower.
AdaGrad

```python
# AdaGrad: per-parameter scaling by the history of squared gradients.
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # Problem: grad_squared is never decayed, so it keeps growing and the
    # effective step size shrinks toward zero over time.
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```
RMSProp

```python
# RMSProp: fixes AdaGrad by decaying the squared-gradient accumulator.
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```

- People use RMSProp instead of AdaGrad in practice.
Adam
- Combines the two ideas above: it keeps a running mean of the gradients (momentum) and a running mean of the squared gradients (RMSProp); see the sketch below.
- It needs a bias correction to fix the estimates at the start of training, when the running averages are still near their zero initialization.
- It is the best technique so far and runs well on a lot of problems.
- `beta1 = 0.9`, `beta2 = 0.999`, and `learning_rate = 1e-3` or `5e-4` is a great starting point for many models!
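A minimal, self-contained sketch of the full-form Adam update with bias correction; the toy gradient function and the number of steps are illustrative assumptions.

```python
import numpy as np

def compute_gradient(x):
    # Toy quadratic loss f(x) = 0.5 * ||x||^2, used only to make the sketch runnable.
    return x

x = np.array([1.0, -2.0])
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
beta1, beta2, learning_rate = 0.9, 0.999, 1e-3

for t in range(1, 5001):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx           # momentum piece
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx    # RMSProp piece
    first_unbias = first_moment / (1 - beta1 ** t)                   # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)

print(x)  # moves toward the minimum at [0, 0]
```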
Learning rate decay
- E.g., decay the learning rate by half every few epochs (see the sketch below).
- Helps the updates stop bouncing around once training gets close to a minimum.
- Learning rate decay is common with SGD + momentum but less common with Adam.
- Don't turn on learning rate decay from the start when choosing your hyperparameters. Train without it first and check whether you need decay at all.
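A minimal sketch of step decay; the base rate, schedule, and epoch count are illustrative assumptions.

```python
base_learning_rate = 1e-3
decay_factor = 0.5      # halve the learning rate ...
decay_every = 5         # ... every 5 epochs

for epoch in range(30):
    learning_rate = base_learning_rate * decay_factor ** (epoch // decay_every)
    # ... run one epoch of training (e.g., SGD + momentum) with this learning_rate ...
    print(epoch, learning_rate)
```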
All the algorithms discussed above are first-order optimization methods.
Second-order optimization
- Use the gradient and the Hessian to form a quadratic approximation of the loss.
- Step to the minimum of that approximation (see the sketch after this list).
- What is nice about this update?
- Some versions don't need a learning rate at all.
- But it is impractical for deep learning:
- The Hessian has O(N^2) elements for N parameters.
- Inverting it takes O(N^3).
- L-BFGS is a practical second-order method:
- It works with full-batch optimization but not with mini-batches.
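A minimal sketch of a single Newton step on a toy quadratic; the loss, and therefore its gradient and Hessian, are illustrative assumptions.

```python
import numpy as np

# Toy quadratic loss f(x) = 0.5 * x^T A x - b^T x, so gradient = A x - b and Hessian = A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, 2.0])

x = np.zeros(2)
gradient = A @ x - b
hessian = A

# Newton step: jump straight to the minimum of the quadratic approximation; no learning rate needed.
x = x - np.linalg.solve(hessian, gradient)
print(x)  # for a quadratic loss this is the exact minimizer, A^{-1} b
```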
In practice, use Adam first; if it doesn't work, try L-BFGS.
Some say that most of the famous deep architectures use SGD + Nesterov momentum.
Regularization
- So far we have talked about reducing the training error, but what we care about most is how our model handles unseen data!
- What if the gap between the training error and the validation error is too large?
- This situation is called high variance (overfitting).
- Model ensembles:
- Algorithm:
- Train multiple independent models of the same architecture with different initializations.
- At test time, average their results.
- It can buy you an extra ~2% performance.
- It reduces the generalization error.
- You can also ensemble snapshots of your network taken during training and average their results.
- Regularization addresses the high variance problem. We have already talked about L1 and L2 regularization.
- Some regularization techniques are designed specifically for neural networks and can do better.
- Dropout (see the sketch after this list):
- In each forward pass, randomly set some of the activations to zero. The probability of dropping is a hyperparameter; 0.5 is used in most cases.
- So you pick some activations and set them to zero.
- It works because:
- It forces the network to have a redundant representation and prevents co-adaptation of features.
- If you think about it, dropout trains an ensemble of sub-networks within the same model.
- At test time we scale each dropout layer's output so it matches the expected activation seen during training (with a drop probability of 0.5, this means multiplying by 0.5).
- Alternatively, with "inverted dropout", the scaling is done during training, so at test time nothing needs to be multiplied.
- With dropout, training takes longer.
- Data augmentation:
- Another technique that acts as a regularizer.
- Change the data!
- For example, flip the image horizontally or rotate it.
- Example from ResNet:
- Training: sample random crops and scales:
- Pick a random L in the range [256, 480].
- Resize the training image so its short side equals L.
- Sample a random 224x224 patch.
- Testing: average over a fixed set of crops:
- Resize the image at 5 scales: {224, 256, 384, 480, 640}.
- For each scale, use 10 224x224 crops: 4 corners + center, plus their horizontal flips.
- Other augmentations: color jitter (including PCA-based color augmentation), translation, rotation, stretching.
- DropConnect:
- Like dropout, it acts as a regularizer.
- Instead of zeroing activations, we randomly zero out weights.
- Fractional max pooling:
- A cool regularization idea, though not commonly used.
- Randomize the regions over which we pool.
- Stochastic depth:
- A newer idea.
- Randomly drop entire layers instead of individual neurons.
- Has a similar effect to dropout, but at the level of layers.
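A minimal sketch of the inverted-dropout variant described above; the layer shape and the drop probability are illustrative assumptions.

```python
import numpy as np

p_drop = 0.5  # probability of dropping an activation (hyperparameter)

def dropout_forward(activations, train=True):
    """Inverted dropout: scale at training time so test time needs no change."""
    if train:
        mask = (np.random.rand(*activations.shape) >= p_drop) / (1 - p_drop)
        return activations * mask  # randomly zero activations, rescale the survivors
    return activations             # at test time, use the full activations as-is

# Usage on a toy batch of hidden activations.
h = np.random.randn(4, 10)
h_train = dropout_forward(h, train=True)
h_test = dropout_forward(h, train=False)
```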
Transfer learning
Sometimes your model overfits simply because your dataset is small, not because you lack regularization.
You need a lot of data if you want to train a CNN from scratch.
Steps of transfer learning (see the sketch after these steps):
- Train on a big dataset that shares common features with your dataset. This is called pretraining.
- Freeze all layers except the last one and feed your small dataset through the network, training only that last layer.
- You are not limited to retraining only the last layer; you can fine-tune any number of layers depending on how much data you have.
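A minimal sketch of these steps using a pretrained torchvision ResNet; PyTorch/torchvision and the class count are assumptions made for illustration, since the notes don't prescribe a framework.

```python
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # hypothetical number of classes in the small target dataset

# 1) Pretraining: start from a model trained on a large dataset (ImageNet weights).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# 2) Freeze everything so only the new head gets trained on the small dataset.
for param in model.parameters():
    param.requires_grad = False

# 3) Replace the last layer with a fresh one sized for the new task; only it will learn.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# 4) With more data, unfreeze a few more layers (e.g., the last residual block) and fine-tune them.
for param in model.layer4.parameters():
    param.requires_grad = True
```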
Guide to using transfer learning:

|                     | Very similar dataset                     | Very different dataset                                             |
|---------------------|------------------------------------------|--------------------------------------------------------------------|
| Very little data    | Use a linear classifier on the top layer | You're in trouble... try a linear classifier from different stages |
| Quite a lot of data | Finetune a few layers                    | Finetune a larger number of layers                                 |

Transfer learning is the norm, not the exception.
Citation
If you found our work useful, please cite it as:
@article{Chadha2020TrainingNeuralNetworksII,
title = {Training Neural Networks II},
author = {Chadha, Aman},
journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
year = {2020},
note = {\url{https://aman.ai}}
}