Optimization algorithms

  • Problems with stochastic gradient descent:

    • If the loss changes quickly in one direction and slowly in another (picture just two variables), you get very slow progress along the shallow dimension and jitter along the steep one. A NN has many parameters, so the problem only gets worse in high dimensions.
    • Local minimum or saddle points
      • If SGD reaches a local minimum, we get stuck there because the gradient is zero.
      • At a saddle point the gradient is also zero, so we get stuck there too.
      • A saddle point is a point where:
        • Moving in some directions makes the loss go up.
        • Moving in other directions makes the loss go down.
        • This happens far more often in high dimensions (100 million dimensions, for example).
      • The problem in deep NNs is more about saddle points than local minima, because deep NNs have very high-dimensional parameter spaces.
    • Mini-batch gradients are noisy because each one is estimated from a mini-batch rather than from the whole dataset.
  • SGD + momentum

    • Build up velocity as a running mean of gradients:

    • # Running accumulation of gradients; rho acts like friction, typically 0.9 or 0.99
      v[t+1] = rho * v[t] + dx    # dx is the gradient at x[t]
      x[t+1] = x[t] - learning_rate * v[t+1]
    • v[0] is zero.

    • Helps with the saddle point and local minimum problems: the accumulated velocity can carry the update through points where the gradient is zero.

    • It can overshoot the minimum and then come back to it.
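A minimal sketch of the momentum update above, compared with plain gradient descent. The toy loss (steep along one axis, shallow along the other), the helper names, and the constants are assumptions chosen for illustration, not part of the lecture.

```python
import numpy as np

# Hypothetical toy loss f(x) = 0.5 * (1 * x[0]^2 + 50 * x[1]^2):
# shallow along x[0], steep along x[1].
SCALES = np.array([1.0, 50.0])

def loss(x):
    return 0.5 * np.sum(SCALES * x ** 2)

def compute_gradient(x):
    return SCALES * x

def sgd(x0, learning_rate=0.02, steps=100):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= learning_rate * compute_gradient(x)
    return x

def sgd_momentum(x0, learning_rate=0.02, rho=0.9, steps=100):
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)              # v[0] is zero
    for _ in range(steps):
        dx = compute_gradient(x)
        v = rho * v + dx              # build up velocity
        x -= learning_rate * v
    return x

start = [10.0, 1.0]
loss_sgd = loss(sgd(start))
loss_momentum = loss(sgd_momentum(start))
```

On this toy problem the momentum run should end with a lower loss after the same number of steps, because the velocity keeps building along the shallow direction.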
  • Nesterov momentum

    • dx = compute_gradient(x)
      old_v = v
      v = rho * v - learning_rate * dx
      x += -rho * old_v + (1 + rho) * v
    • Tends to overshoot less than SGD + momentum, but can be slower.
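The Nesterov update above can be wrapped in a loop as follows. This is only a sketch on an assumed toy quadratic; `SCALES`, `nesterov`, and the step count are illustrative names and values.

```python
import numpy as np

# Hypothetical toy quadratic: gradient of 0.5 * sum(SCALES * x^2).
SCALES = np.array([1.0, 50.0])

def compute_gradient(x):
    return SCALES * x

def nesterov(x0, learning_rate=0.02, rho=0.9, steps=200):
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        dx = compute_gradient(x)
        old_v = v
        v = rho * v - learning_rate * dx
        # Same rearranged update as in the notes: the gradient is always
        # evaluated at the current x, with the look-ahead folded into the step.
        x += -rho * old_v + (1 + rho) * v
    return x

x_final = nesterov([10.0, 1.0])
```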
  • AdaGrad

    • grad_squared = 0
      while True:
        dx = compute_gradient(x)
        # Problem: grad_squared is never decayed, so it keeps growing and the step size shrinks toward zero
        grad_squared += dx * dx
        x -= (learning_rate * dx) / (np.sqrt(grad_squared) + 1e-7)
  • RMSProp

    • grad_squared = 0
      while True:
        dx = compute_gradient(x)
        # Fixes AdaGrad: the running average decays, so grad_squared stays bounded
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
        x -= (learning_rate * dx) / (np.sqrt(grad_squared) + 1e-7)
    • People use this instead of AdaGrad.
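To see why the decay matters, the sketch below feeds the same constant gradient to the AdaGrad and RMSProp accumulators and compares the multiplier each one ends up applying to the gradient. The constant gradient, the function name, and the hyperparameter values are assumptions made for the illustration.

```python
import numpy as np

def effective_step_sizes(steps=1000, learning_rate=1e-2, decay_rate=0.99):
    """Return the per-step multiplier each method applies after `steps` updates."""
    dx = 1.0                      # assumed constant gradient
    adagrad_sq, rmsprop_sq = 0.0, 0.0
    for _ in range(steps):
        adagrad_sq += dx * dx     # grows without bound
        rmsprop_sq = decay_rate * rmsprop_sq + (1 - decay_rate) * dx * dx
    adagrad_step = learning_rate / (np.sqrt(adagrad_sq) + 1e-7)
    rmsprop_step = learning_rate / (np.sqrt(rmsprop_sq) + 1e-7)
    return adagrad_step, rmsprop_step

adagrad_step, rmsprop_step = effective_step_sizes()
```

Under these assumptions AdaGrad's multiplier keeps shrinking as training goes on, while RMSProp's settles close to the base learning rate.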
  • Adam

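    • Combines the two ideas above: it keeps a momentum-style running average of the gradients and an RMSProp-style running average of the squared gradients.
    • It needs a bias correction, because both running averages start at zero and are biased toward zero during the first steps.
    • A strong default so far; it works well on a lot of problems.
    • beta1 = 0.9, beta2 = 0.999 and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!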
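A sketch of an Adam loop with bias correction, using the starting hyperparameters listed above. The toy loss, the helper names, and the `eps` value are assumptions; this is an illustration, not a reference implementation.

```python
import numpy as np

def adam(compute_gradient, x0, learning_rate=1e-3, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=1000):
    x = np.array(x0, dtype=float)
    first_moment = np.zeros_like(x)
    second_moment = np.zeros_like(x)
    for t in range(1, steps + 1):
        dx = compute_gradient(x)
        first_moment = beta1 * first_moment + (1 - beta1) * dx          # momentum
        second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # RMSProp
        # Bias correction: both moments start at zero and are too small early on.
        first_unbias = first_moment / (1 - beta1 ** t)
        second_unbias = second_moment / (1 - beta2 ** t)
        x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + eps)
    return x

def toy_gradient(x):
    # Hypothetical toy loss f(x) = sum((x - 3)^2), minimized at x = 3.
    return 2.0 * (x - 3.0)

start = np.array([0.0, 5.0])
x_one_step = adam(toy_gradient, start, steps=1)
x_final = adam(toy_gradient, start, steps=1000)
```

With the correction in place, the very first update has a magnitude of about `learning_rate` in each coordinate, whatever the scale of the gradient.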
  • Learning rate decay

    • Ex. decay the learning rate by half every few epochs.
    • Helps the optimizer settle into a minimum instead of bouncing around it when the learning rate is too large.
    • Learning rate decay is common with SGD + momentum but less common with Adam.
    • Don't use learning rate decay from the start when choosing your hyperparameters. Train without it first and check whether you need it.
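The "halve every few epochs" example can be written as a small step-decay schedule. The function name, the initial rate, and the 10-epoch interval are assumptions for illustration.

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

# Learning rate at a few sample epochs, starting from an assumed 0.1
schedule = [step_decay(0.1, epoch) for epoch in (0, 9, 10, 25)]
```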
  • All of the algorithms discussed above are first-order optimization methods.

  • Second order optimization

    • Use the gradient and the Hessian to form a quadratic approximation of the loss.
    • Step to the minimum of that approximation.
    • What is nice about this update?
      • Some versions have no learning rate at all.
    • But it is impractical for deep learning:
      • The Hessian has O(N^2) elements.
      • Inverting it takes O(N^3).
    • L-BFGS is a popular second-order method (a limited-memory quasi-Newton approximation)
      • Works well in the full-batch setting but not with mini-batches.
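A minimal sketch of a single Newton step. The loss is an assumed small quadratic (the matrix `A` and vector `b` are made up), where the quadratic approximation is exact, so one step with no learning rate lands on the minimum.

```python
import numpy as np

# Hypothetical quadratic loss: f(x) = 0.5 * x^T A x - b^T x
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def gradient(x):
    return A @ x - b

def hessian(x):
    return A  # constant for a quadratic loss

def newton_step(x):
    # Solve H * delta = grad instead of forming the inverse explicitly.
    return x - np.linalg.solve(hessian(x), gradient(x))

x1 = newton_step(np.array([10.0, -7.0]))
```

The linear solve is the part that scales badly: with N parameters the Hessian has N^2 entries, which is why this is not done for deep networks.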
  • In practice, use Adam first, and if it does not work well, try L-BFGS (provided you can afford full-batch updates).

  • Some say that all the famous deep architectures use SGD + Nesterov momentum.


  • So far we have talked about reducing the training error, but what we care about most is how our model handles unseen data!
  • What if the gap between the training error and the validation error is too large?
  • This problem is called high variance (overfitting).
  • Model Ensembles:
    • Algorithm:
      • Train multiple independent models of the same architecture with different initializations.
      • At test time average their results.
    • It can give you an extra 2% or so in performance.
    • It reduces the generalization error.
    • Instead of training separate models, you can save several snapshots of a single NN during training, ensemble them, and average their results.
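Averaging at test time can be sketched as below. The three probability arrays are made-up outputs from hypothetical models (or snapshots), and `ensemble_predict` is an illustrative name.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average class probabilities from several models, then take the argmax."""
    mean_probs = np.mean(np.stack(prob_list, axis=0), axis=0)
    return mean_probs.argmax(axis=1), mean_probs

# Made-up probabilities from three hypothetical models (2 examples, 3 classes)
model_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
model_b = np.array([[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]])
model_c = np.array([[0.5, 0.2, 0.3], [0.2, 0.2, 0.6]])

labels, mean_probs = ensemble_predict([model_a, model_b, model_c])
```

In this made-up example the individual models disagree on both inputs, and the averaged probabilities decide the final label.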
  • Regularization addresses the high variance problem. We have already talked about L1 and L2 regularization.
  • Some regularization techniques are designed specifically for NNs and can do better.
  • Dropout:
    • In each forward pass, randomly set some of the neurons to zero. The probability of dropping is a hyperparameter; 0.5 is used in most cases.
    • In other words, you pick some activations at random and set them to zero.
    • It works because:
      • It forces the network to have a redundant representation and prevents co-adaptation of features!
      • If you think about it, dropout trains an ensemble of many sub-networks that share weights inside a single model!
    • At test time nothing is dropped; instead we multiply the output of each dropout layer by the probability of keeping a unit, so the expected activations match those seen in training.
    • With inverted dropout this scaling is done during training (divide by the keep probability), so at test time we don't multiply anything and leave the layer as it is.
    • Training takes longer with dropout.
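A sketch of the inverted variant of dropout, where the rescaling happens during training so the test-time forward pass is left untouched. The function name, the `keep_prob` parameter, and the all-ones activations are assumptions for illustration.

```python
import numpy as np

def dropout_forward(activations, keep_prob=0.5, train=True, rng=None):
    """Inverted dropout: zero units at random and rescale during training only."""
    if not train:
        return activations                      # test time: nothing to do
    rng = np.random.default_rng() if rng is None else rng
    # Keep each unit with probability keep_prob, then divide by keep_prob
    # so the expected activation is the same as without dropout.
    mask = (rng.random(activations.shape) < keep_prob) / keep_prob
    return activations * mask

h = np.ones((1000, 100))
h_train = dropout_forward(h, keep_prob=0.5, train=True, rng=np.random.default_rng(0))
h_test = dropout_forward(h, train=False)
```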
  • Data augmentation:
    • Another technique that acts as a regularizer.
    • Change the data!
    • For example, flip the image or rotate it.
    • Example in ResNet:
      • Training: Sample random crops and scales:
        1. Pick random L in range [256,480]
        2. Resize training image, short side = L
        3. Sample a random 224x224 patch.
      • Testing: average a fixed set of crops
        1. Resize image at 5 scales: {224, 256, 384, 480, 640}
        2. For each size, use 10 224x224 crops: 4 corners + center + flips
    • Color jitter, optionally PCA-based.
    • Translation, rotation, stretching.
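The random-crop and flip steps can be sketched as follows. The resize step is left out, so the input is assumed to be already resized; the function name and the blank test image are illustrative.

```python
import numpy as np

def random_crop_and_flip(image, crop_size=224, rng=None):
    """image: H x W x C array, assumed already resized so that H, W >= crop_size."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop_size + 1)    # upper bound is exclusive
    left = rng.integers(0, width - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]                       # horizontal flip
    return patch

image = np.zeros((256, 320, 3), dtype=np.uint8)      # stand-in for a resized image
patch = random_crop_and_flip(image, rng=np.random.default_rng(0))
```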
  • DropConnect
    • Like dropout, it acts as a regularizer.
    • Instead of dropping activations, we randomly zero out weights.
  • Fractional Max Pooling
    • Cool regularization idea. Not commonly used.
    • Randomize the regions in which we pool.
  • Stochastic depth
    • A newer idea.
    • Randomly drop whole layers during training instead of individual neurons.
    • Has a similar effect to dropout.

Transfer learning

  • Sometimes your model overfits because the dataset is small, not because regularization is lacking.

  • A common belief is that you need a lot of data to train/use CNNs; transfer learning is a way around that.

  • Steps of transfer learning

    1. Train on a big dataset that has common features with your dataset. This is called pretraining.
    2. Freeze all layers except the last one and train only that last layer on your small dataset.
    3. You are not limited to the last layer: you can fine-tune as many layers as the amount of data you have allows.
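Step 2 can be sketched in plain NumPy. Everything here is a stand-in: the "pretrained" layers are a fixed random projection with ReLU, the small dataset is made up, and the last layer is a logistic regression trained by gradient descent. In a real setting the frozen part would be a network pretrained on a large dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pretrained layers (hypothetical): fixed projection + ReLU.
W_frozen = rng.normal(size=(20, 64))
W_frozen_before = W_frozen.copy()

def extract_features(x):
    return np.maximum(0.0, x @ W_frozen)

# Small made-up dataset: the label depends on the sign of the first input.
X = rng.normal(size=(200, 20))
y = (X[:, 0] > 0).astype(float)

def logistic_loss(features, w, b):
    logits = features @ w + b
    # Binary cross-entropy written in a numerically stable form.
    return np.mean(np.logaddexp(0.0, logits) - y * logits)

features = extract_features(X)       # frozen layers: computed once, never updated
w_last, b_last = np.zeros(64), 0.0   # only the last layer is trained
initial_loss = logistic_loss(features, w_last, b_last)

learning_rate = 0.002
for _ in range(300):
    probs = 1.0 / (1.0 + np.exp(-(features @ w_last + b_last)))
    grad_logits = (probs - y) / len(y)
    w_last -= learning_rate * (features.T @ grad_logits)
    b_last -= learning_rate * grad_logits.sum()

final_loss = logistic_loss(features, w_last, b_last)
```

Because the frozen weights receive no gradient, their features can be computed once and cached, which is part of what makes this setup cheap on a small dataset.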
  • Guide to use transfer learning:

    •                      | Very similar dataset                     | Very different dataset
      Very little data     | Use a linear classifier on the top layer | You're in trouble... Try a linear classifier from different stages
      Quite a lot of data  | Fine-tune a few layers                   | Fine-tune a larger number of layers
  • Transfer learning is the norm, not the exception.


If you found our work useful, please cite it as:

  title   = {Training Neural Networks II},
  author  = {Chadha, Aman},
  journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
  year    = {2020},
  note    = {\url{https://aman.ai}}