CS231n • Training Neural Networks II
Optimization algorithms

Problems with stochastic gradient descent:
 If the loss changes quickly in one direction and slowly in another (easy to picture with only two variables), SGD makes very slow progress along the shallow dimension and jitters along the steep one. A neural network has many parameters, so the problem gets worse.
 Local minima and saddle points:
 If SGD reaches a local minimum, it gets stuck there because the gradient is zero.
 At a saddle point the gradient is also zero, so SGD gets stuck there as well.
 At a saddle point:
 Moving in some directions makes the loss go up.
 Moving in other directions makes the loss go down.
 This happens much more often in high dimensions (100 million dimensions, for example).
 For deep NNs the problem is more about saddle points than local minima, because deep NNs have a high-dimensional parameter space.
 Minibatch gradients are noisy because each gradient is estimated from a minibatch rather than from the whole dataset.
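The first problem above can be reproduced on a toy loss. This is a minimal sketch, not code from the lecture: the loss f(x, y) = 0.5 * (x^2 + 50 * y^2), the helper names `grad` and `descend`, and the step size are all assumptions chosen for illustration, and the gradient is exact (no minibatch noise).

```python
def grad(x, y):
    # Gradient of the assumed toy loss f(x, y) = 0.5 * (x**2 + 50 * y**2):
    # shallow along x, steep along y.
    return x, 50.0 * y


def descend(x, y, learning_rate=0.035, steps=20):
    """Plain gradient descent; returns every visited point."""
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
        path.append((x, y))
    return path


path = descend(1.0, 1.0)
# y changes sign on every step (jitter along the steep direction),
# while x has covered only about half the distance to the minimum.
```

A smaller learning rate would remove the jitter in y but slow the progress in x even further, which is the trade-off the notes describe.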

SGD + momentum

Build up velocity as a running mean of gradients:

    # Computing weighted average. rho is best in the range [0.9, 0.99]
    v[t+1] = rho * v[t] + dx
    x[t+1] = x[t] - learningRate * v[t+1]

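 v[0] is zero.
 Helps with the saddle point and local minimum problems, because the built-up velocity carries the update through points where the gradient is zero.
 It may overshoot the minimum and then come back to it.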
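A minimal runnable sketch of the momentum update, under assumptions: the toy loss f(x, y) = 0.5 * (x^2 + 50 * y^2) and the names `grad` and `sgd_momentum` are hypothetical, and the gradient is exact rather than a minibatch estimate.

```python
def grad(params):
    # Gradient of the assumed toy loss f(x, y) = 0.5 * (x**2 + 50 * y**2).
    x, y = params
    return [x, 50.0 * y]


def sgd_momentum(params, rho=0.9, learning_rate=0.01, steps=100):
    velocity = [0.0] * len(params)  # the velocity starts at zero
    for _ in range(steps):
        dx = grad(params)
        # Velocity accumulates gradients with friction rho.
        velocity = [rho * v + g for v, g in zip(velocity, dx)]
        params = [p - learning_rate * v for p, v in zip(params, velocity)]
    return params


with_momentum = sgd_momentum([1.0, 1.0])
without_momentum = sgd_momentum([1.0, 1.0], rho=0.0)  # rho = 0 is plain SGD
```

With the same learning rate and step count, the momentum run should end much closer to the minimum along the shallow x direction than the run with rho = 0.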


Nesterov momentum

    dx = compute_gradient(x)
    old_v = v
    v = rho * v - learning_rate * dx
    x += -rho * old_v + (1 + rho) * v
 Overshoots the minimum less than SGD + momentum, but can be slower.
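The same update as a runnable sketch. The toy loss f(x, y) = 0.5 * (x^2 + 50 * y^2) and the names `grad` and `nesterov_momentum` are assumptions for illustration. The sign convention follows the snippet above: the velocity already points downhill, so it is added to the parameters.

```python
def grad(params):
    # Gradient of the assumed toy loss f(x, y) = 0.5 * (x**2 + 50 * y**2).
    x, y = params
    return [x, 50.0 * y]


def nesterov_momentum(params, rho=0.9, learning_rate=0.01, steps=100):
    velocity = [0.0] * len(params)
    for _ in range(steps):
        dx = grad(params)
        old_velocity = velocity
        velocity = [rho * v - learning_rate * g for v, g in zip(velocity, dx)]
        # Look-ahead form: undo part of the old velocity, apply the new one.
        params = [p - rho * ov + (1 + rho) * v
                  for p, ov, v in zip(params, old_velocity, velocity)]
    return params


final = nesterov_momentum([1.0, 1.0])
```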


AdaGrad

    grad_squared = 0
    while True:
        dx = compute_gradient(x)
        # Here is a problem: grad_squared is never decayed, so it keeps growing and the steps shrink.
        grad_squared += dx * dx
        x -= (learning_rate * dx) / (np.sqrt(grad_squared) + 1e-7)
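A small sketch of what the per-parameter scaling does, assuming a toy loss f(x, y) = 0.5 * (x^2 + 50 * y^2); `grad` and `adagrad` are hypothetical names. It is written with plain Python lists so it runs without NumPy.

```python
import math


def grad(params):
    # Gradient of the assumed toy loss f(x, y) = 0.5 * (x**2 + 50 * y**2).
    x, y = params
    return [x, 50.0 * y]


def adagrad(params, learning_rate=0.1, steps=1, eps=1e-7):
    grad_squared = [0.0] * len(params)
    for _ in range(steps):
        dx = grad(params)
        # The accumulator only ever grows.
        grad_squared = [s + g * g for s, g in zip(grad_squared, dx)]
        params = [p - learning_rate * g / (math.sqrt(s) + eps)
                  for p, g, s in zip(params, dx, grad_squared)]
    return params, grad_squared


first_params, first_stat = adagrad([1.0, 1.0], steps=1)
later_params, later_stat = adagrad([1.0, 1.0], steps=5)
```

On the first step both coordinates move by roughly `learning_rate`, even though the gradient along y is 50 times larger, which is the equalizing effect of dividing by the accumulated squared gradient.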


RMSProp

    grad_squared = 0
    while True:
        dx = compute_gradient(x)
        # Fixes the AdaGrad problem: the running average decays, so it does not grow without bound.
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
        x -= (learning_rate * dx) / (np.sqrt(grad_squared) + 1e-7)
 People use this instead of AdaGrad.
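To see the difference between the two accumulators in isolation, here is a hedged sketch that feeds the same constant gradient to both. The helper `accumulate` and the constant gradient of 2.0 are assumptions for illustration only.

```python
def accumulate(grads, decay_rate=None):
    """Squared-gradient statistic after seeing `grads`.

    decay_rate=None mimics AdaGrad's plain sum;
    a float mimics RMSProp's leaky running average.
    """
    grad_squared = 0.0
    for dx in grads:
        if decay_rate is None:
            grad_squared += dx * dx
        else:
            grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    return grad_squared


constant_grads = [2.0] * 100
adagrad_stat = accumulate(constant_grads)        # grows with the number of steps
rmsprop_stat = accumulate(constant_grads, 0.9)   # levels off near dx * dx
```

Because the RMSProp statistic levels off, the effective step size stays usable over long training runs, whereas the AdaGrad sum keeps shrinking it.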


Adam
 Combines momentum (a running average of the gradients) with RMSProp (a running average of the squared gradients).
 It needs a bias correction to fix the first steps, when both running averages start at zero.
 Works well on a lot of problems and is a good default choice.
 beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
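A minimal sketch of Adam with bias correction, using the starting hyperparameters above. The toy loss f(x, y) = 0.5 * (x^2 + 50 * y^2) and the names `grad` and `adam` are assumptions for illustration; the epsilon of 1e-7 matches the snippets earlier in these notes.

```python
import math


def grad(params):
    # Gradient of the assumed toy loss f(x, y) = 0.5 * (x**2 + 50 * y**2).
    x, y = params
    return [x, 50.0 * y]


def adam(params, learning_rate=1e-3, beta1=0.9, beta2=0.999, steps=1, eps=1e-7):
    first_moment = [0.0] * len(params)    # momentum-like average of gradients
    second_moment = [0.0] * len(params)   # RMSProp-like average of squared gradients
    for t in range(1, steps + 1):
        dx = grad(params)
        first_moment = [beta1 * m + (1 - beta1) * g
                        for m, g in zip(first_moment, dx)]
        second_moment = [beta2 * s + (1 - beta2) * g * g
                         for s, g in zip(second_moment, dx)]
        # Bias correction: both averages start at zero and are too small early on.
        first_unbiased = [m / (1 - beta1 ** t) for m in first_moment]
        second_unbiased = [s / (1 - beta2 ** t) for s in second_moment]
        params = [p - learning_rate * m / (math.sqrt(s) + eps)
                  for p, m, s in zip(params, first_unbiased, second_unbiased)]
    return params


after_one = adam([1.0, 1.0], steps=1)
after_hundred = adam([1.0, 1.0], steps=100)
```

With the correction in place, the very first step has a size of about `learning_rate` in every coordinate. Without it, the nearly zero second moment would make the first steps much larger.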

Learning rate decay
 Ex. decay the learning rate by half every few epochs.
 Helps the optimizer settle into a minimum instead of bouncing around it with steps that are too large.
 Learning rate decay is common with SGD+momentum but less common with Adam.
 Don't use learning rate decay from the start when choosing your hyperparameters. Train without it first and check whether you need decay.
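The "half every few epochs" example can be written as a small schedule function. This is a sketch: the name `step_decay` and the default of dropping every 10 epochs are assumptions, not values from the lecture.

```python
def step_decay(base_learning_rate, epoch, drop_every=10, factor=0.5):
    """Multiply the learning rate by `factor` once per `drop_every` epochs."""
    return base_learning_rate * factor ** (epoch // drop_every)


schedule = [step_decay(0.1, epoch) for epoch in (0, 9, 10, 25)]
```

The rate stays at its base value for epochs 0 to 9, halves at epoch 10, and halves again at epoch 20.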

All the algorithms discussed above are first-order optimization methods.

Second order optimization
 Use the gradient and the Hessian to form a quadratic approximation.
 Step to the minimum of the approximation.
 What is nice about this update?
 Some versions have no learning rate.
 But it is impractical for deep learning:
 The Hessian has O(N^2) elements.
 Inverting it takes O(N^3).
 L-BFGS is a quasi-Newton method that approximates the inverse Hessian instead of forming it.
 It works well with full-batch optimization but not with minibatches.
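A sketch of a single Newton step on a two-parameter quadratic loss, where the quadratic approximation is exact. The Hessian values and the helper names `grad`, `solve_2x2`, and `newton_step` are assumptions for illustration. A real network has far too many parameters for this, as the list above notes.

```python
# Assumed toy loss: f(p) = 0.5 * p^T H p with a fixed 2x2 Hessian H.
HESSIAN = [[3.0, 1.0], [1.0, 2.0]]


def grad(params):
    x, y = params
    return [HESSIAN[0][0] * x + HESSIAN[0][1] * y,
            HESSIAN[1][0] * x + HESSIAN[1][1] * y]


def solve_2x2(matrix, rhs):
    """Solve matrix @ step = rhs with the closed-form 2x2 inverse."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det,
            (a * rhs[1] - c * rhs[0]) / det]


def newton_step(params):
    # params - H^{-1} * gradient; note that there is no learning rate.
    step = solve_2x2(HESSIAN, grad(params))
    return [p - s for p, s in zip(params, step)]


after_one_step = newton_step([4.0, -3.0])
```

Because the loss is exactly quadratic, one step lands on the minimum at the origin. For N parameters the Hessian would have N^2 entries, which is why this does not scale.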

In practice, use Adam first; if it doesn't work and you can afford full-batch updates, try L-BFGS.

Some say that all the famous deep architectures use SGD + Nesterov momentum.
Regularization
 So far we have talked about reducing the training error, but what we care about most is how our model handles unseen data!
 What if the gap between the training error and the validation error is too large?
 This problem is called high variance.
 Model Ensembles:
 Algorithm:
 Train multiple independent models of the same architecture with different initializations.
 At test time, average their results.
 It can give you about 2% extra performance.
 It reduces the generalization error.
 You can also save snapshots of a single NN during training, ensemble them, and average their results.
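Test-time averaging can be sketched as follows. The function name `ensemble_predict` and the probability values are made up for illustration; in practice each row would come from one trained model or snapshot.

```python
def ensemble_predict(per_model_probs):
    """Average class probabilities over models and pick the best class."""
    num_models = len(per_model_probs)
    averaged = [sum(class_probs) / num_models
                for class_probs in zip(*per_model_probs)]
    predicted_class = max(range(len(averaged)), key=averaged.__getitem__)
    return predicted_class, averaged


# Hypothetical outputs of three models for one input with three classes.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
label, averaged = ensemble_predict(probs)
```

Here the second model alone would pick class 1, but the averaged prediction is class 0.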
 Regularization addresses the high variance problem. We have talked about L1 and L2 regularization.
 Some regularization techniques are designed specifically for NNs and can do better.
 Dropout:
 In each forward pass, randomly set some of the neurons to zero. The probability of dropping is a hyperparameter; 0.5 is a common choice.
 In other words, you choose some activations and set them to zero.
 It works because:
 It forces the network to have a redundant representation and prevents co-adaptation of features!
 If you think about it, dropout trains an ensemble of models that share parameters inside the same model!
 At test time we multiply the output of each dropout layer by the probability of keeping a unit (1 minus the dropout probability; with 0.5 the two are equal).
 Alternatively (inverted dropout), we divide by the keep probability during training and leave the activations unchanged at test time.
 With dropout it takes more time to train.
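A minimal sketch of the variant that leaves activations untouched at test time, commonly called inverted dropout. The function name `dropout_forward` and its parameters are assumptions for illustration. It works on a plain list of activations rather than a real layer.

```python
import random


def dropout_forward(activations, drop_prob=0.5, train=True, rng=random):
    """Inverted dropout on a list of activations."""
    if not train:
        # Test time: nothing is dropped and nothing is rescaled.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Training: zero each unit with probability drop_prob and scale the
    # survivors by 1 / keep_prob so the expected activation is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
train_out = dropout_forward([1.0] * 10000, drop_prob=0.5, train=True, rng=rng)
test_out = dropout_forward([1.0, 2.0, 3.0], train=False)
```

Scaling during training keeps the average activation the same in both modes, so the test-time forward pass needs no extra multiplication.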
 Data augmentation:
 Another technique that acts as a regularizer.
 Change the data!
 For example, flip the image or rotate it.
 Example in ResNet:
 Training: Sample random crops and scales:
 Pick random L in range [256,480]
 Resize training image, short side = L
 Sample a random 224x224 patch.
 Testing: average a fixed set of crops
 Resize image at 5 scales: {224, 256, 384, 480, 640}
 For each size, use 10 224x224 crops: 4 corners + center + flips
 Apply color jitter (for example, PCA-based color offsets).
 Other options: translation, rotation, stretching.
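Random cropping and flipping at training time can be sketched on a tiny image stored as nested lists. The name `random_crop_and_flip` and the 8x8 toy image are assumptions for illustration. A real pipeline would work on image arrays and also rescale the short side first.

```python
import random


def random_crop_and_flip(image, crop_size, rng=random):
    """image: list of rows, each row a list of pixel values."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_size)    # randint includes both ends
    left = rng.randint(0, width - crop_size)
    crop = [row[left:left + crop_size] for row in image[top:top + crop_size]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]      # horizontal flip
    return crop


rng = random.Random(0)
image = [[r * 8 + c for c in range(8)] for r in range(8)]
crop = random_crop_and_flip(image, 4, rng)
```

Each call returns a different 4x4 view of the same image, so the network rarely sees exactly the same input twice.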
 DropConnect
 Like dropout, it acts as a regularizer.
 Instead of dropping activations, we randomly zero out weights.
 Fractional max pooling
 Cool regularization idea. Not commonly used.
 Randomize the regions over which we pool.
 Stochastic depth
 New idea.
 Eliminate layers instead of neurons.
 Has a similar effect to dropout.
Transfer learning

Sometimes your model overfits because the dataset is small, not because of a lack of regularization.

You need a lot of data if you want to train/use CNNs.

Steps of transfer learning
 Train on a big dataset that has features in common with your dataset. This is called pretraining.
 Freeze all layers except the last one and train only the last layer on your small dataset.
 You are not limited to retraining the last layer; you can fine-tune any number of layers, depending on how much data you have.
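The freezing step can be sketched without any deep learning framework. Everything here is hypothetical: the layer names, the dictionary layout, and the helpers `freeze_all_but_last` and `sgd_step` stand in for what a framework does with per-parameter "requires gradient" flags.

```python
def freeze_all_but_last(layers, num_trainable=1):
    """Mark only the last `num_trainable` layers as trainable."""
    names = list(layers)  # dicts keep insertion order
    num_frozen = max(0, len(names) - num_trainable)
    for index, name in enumerate(names):
        layers[name]["trainable"] = index >= num_frozen


def sgd_step(layers, grads, learning_rate=0.1):
    """Update trainable layers only; frozen layers keep pretrained weights."""
    for name, layer in layers.items():
        if layer["trainable"]:
            layer["weights"] = [w - learning_rate * g
                                for w, g in zip(layer["weights"], grads[name])]


# Made-up "pretrained" network with two feature layers and a classifier.
layers = {
    "conv1": {"weights": [0.5, -0.2], "trainable": True},
    "conv2": {"weights": [0.1, 0.3], "trainable": True},
    "fc": {"weights": [0.0, 0.0], "trainable": True},
}
grads = {name: [1.0, 1.0] for name in layers}
freeze_all_but_last(layers, num_trainable=1)
sgd_step(layers, grads)
```

Raising `num_trainable` corresponds to fine-tuning more layers when more data is available.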

Guide to using transfer learning:

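|                     | Very similar dataset                     | Very different dataset                                               |
|---------------------|------------------------------------------|----------------------------------------------------------------------|
| Very little data    | Use a linear classifier on the top layer | You're in trouble... Try a linear classifier from different stages   |
| Quite a lot of data | Finetune a few layers                    | Finetune a larger number of layers                                   |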


Transfer learning is the norm, not the exception.
Citation
If you found our work useful, please cite it as:
@article{Chadha2020TrainingNeuralNetworksII,
title = {Training Neural Networks II},
author = {Chadha, Aman},
journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
year = {2020},
note = {\url{https://aman.ai}}
}