BatchNorm

  • Batch normalization (BN) and layer normalization (LN) accelerate neural network (NN) training and help models converge faster.

  • Sergey Ioffe and Christian Szegedy’s 2015 paper, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” introduced the idea of BN.

  • Internal covariate shift is just a fancy name for the change in the distribution of each layer’s activations caused by the change in the network’s parameters during training. (See the figure for how activations create this shift and how BN fixes it.)

What does it mean in practice?

  • Some classical algorithms and NNs expect normalized data, i.e., data with a mean of 0 and a standard deviation of 1. The authors showed that we should not only ensure the data is normalized before training, but also ensure the activations stay normalized across layers as training proceeds, with statistics computed over each mini-batch (see the sketch below).
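
To make the per-mini-batch normalization step concrete, here is a minimal sketch in NumPy (the function and variable names are illustrative, not from the paper): each feature is standardized using the mini-batch mean and variance, then scaled and shifted by the learnable parameters γ and β.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch feature-wise, then apply a learnable scale and shift.

    x:     (batch_size, num_features) activations from the previous layer
    gamma: (num_features,) learnable scale
    beta:  (num_features,) learnable shift
    """
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta             # restore representational capacity

# Example: a mini-batch of 32 samples with 4 badly scaled features.
x = 5.0 * np.random.randn(32, 4) + 3.0
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly 0s and 1s
```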

The above paper showed that fixing internal covariate shift with BN can:

  • Substantially decrease training time.
  • Reduce or remove the need for dropout (or allow BN to be combined with dropout).
  • Decrease the amount of other regularization needed.
  • Allow for increased learning rates (see the usage sketch below).
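
As a usage illustration (not from the paper; the layer sizes, batch size, and learning rate below are arbitrary), a BN layer in PyTorch typically sits between a linear or convolutional layer and its nonlinearity, and the resulting network usually tolerates a noticeably larger learning rate:

```python
import torch
import torch.nn as nn

# A small MLP with BatchNorm inserted before the nonlinearity (sizes are arbitrary).
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalizes activations over each mini-batch
    nn.ReLU(),
    nn.Linear(256, 10),
)

# BN typically allows a higher learning rate than the same network without it.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(64, 784)   # dummy mini-batch
logits = model(x)
print(logits.shape)        # torch.Size([64, 10])
```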

When not to use BN?

Auto-normalization using SELU

  • Wouldn’t it be great if deep neural networks just knew how best to normalize?
  • Enter the SELU (Scaled Exponential Linear Unit) activation:

\[\operatorname{selu}(x)=\lambda \begin{cases}x & \text { if } x>0 \\ \alpha e^{x}-\alpha & \text { if } x \leq 0\end{cases}\]
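
Translated directly into code (a sketch; λ and α are the fixed constants derived in Klambauer et al.’s 2017 SELU paper):

```python
import numpy as np

# Fixed constants from the SELU derivation (Klambauer et al., 2017).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """selu(x) = lambda * (x if x > 0 else alpha * exp(x) - alpha)."""
    return LAMBDA * np.where(x > 0, x, ALPHA * np.exp(x) - ALPHA)

print(selu(np.array([-2.0, 0.0, 2.0])))  # negative inputs saturate toward -lambda * alpha
```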

Benefits of SELU

  • Self-normalizing: activations automatically converge to zero mean and unit variance (see the sketch after this list).
  • Allows training of very deep networks.
  • Allows strong regularization schemes.
  • Ensures learning robustness.
  • Theoretically, makes vanishing/exploding gradients impossible.
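
To illustrate the self-normalizing property from the list above, here is a small numerical sketch (the depth, width, and batch size are arbitrary choices, not from any paper): activations pushed through many SELU layers with LeCun-normal-initialized weights stay close to zero mean and unit variance.

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * np.exp(x) - ALPHA)

rng = np.random.default_rng(0)
width, depth = 512, 50
x = rng.standard_normal((1024, width))   # a batch of standardized inputs

for _ in range(depth):
    # LeCun-normal init (std = 1/sqrt(fan_in)), as assumed by the SELU analysis.
    W = rng.standard_normal((width, width)) / np.sqrt(width)
    x = selu(x @ W)

# Activations remain roughly zero-mean, unit-variance even 50 layers deep.
print(round(x.mean(), 3), round(x.std(), 3))
```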

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledNormalizationMethods,
  title   = {Normalization Methods},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}