Primers • Batchnorm
 Introduction
 Problem of Training Deep Networks
 Standardize Layer Inputs
 How to Standardize Layer Inputs
 Examples of Using Batch Normalization
 Tips for Using Batch Normalization
 Further Reading
 Key takeaways
 Citation
Introduction

Training deep neural networks with tens of layers is challenging as they can be sensitive to the initial random weights and configuration of the learning algorithm.

One possible reason for this difficulty is that the distribution of the inputs to layers deep in the network may change after each minibatch when the weights are updated. This can cause the learning algorithm to forever chase a moving target. This change in the distribution of inputs to layers in the network is referred to by the technical name “internal covariate shift.”

Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each minibatch. This has the effect of stabilizing the learning process and dramatically reducing the number of training epochs required to train deep networks.

In this article, you will discover the batch normalization method used to accelerate the training of deep learning neural networks.
Problem of Training Deep Networks

Training deep neural networks, e.g. networks with tens of hidden layers, is challenging.

One aspect of this challenge is that the model is updated layer-by-layer backward from the output to the input using an estimate of error that assumes the weights in the layers prior to the current layer are fixed.
Very deep models involve the composition of several functions or layers. The gradient tells how to update each parameter, under the assumption that the other layers do not change. In practice, we update all of the layers simultaneously. — Page 317, Deep Learning, 2016.

Because all layers are changed during an update, the update procedure is forever chasing a moving target.

For example, the weights of a layer are updated given an expectation that the prior layer outputs values with a given distribution. This distribution is likely to change once the weights of the prior layer are updated.
Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. — Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.
The authors of the paper introducing batch normalization refer to the change in the distribution of inputs during training as “internal covariate shift.”
We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. — Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.
Standardize Layer Inputs
Batch normalization, or batchnorm for short, was proposed by Ioffe and Szegedy (2015) in “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” as a technique to help coordinate the update of multiple layers in the model.
Batch normalization provides an elegant way of reparametrizing almost any deep network. The reparametrization significantly reduces the problem of coordinating updates across many layers. — Page 318, Deep Learning, 2016.
It does this by scaling the output of the layer, specifically by standardizing the activations of each input variable per minibatch, such as the activations of a node from the previous layer. Recall that standardization refers to rescaling data to have a mean of zero and a standard deviation of one, e.g. a standard Gaussian.
Batch normalization reparametrizes the model to make some units always be standardized by definition. — Page 319, Deep Learning, 2016.
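This process is related to “whitening” as applied to images in computer vision, although full whitening also decorrelates the input variables, whereas batch normalization standardizes each variable independently.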
By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift. — Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.
Standardizing the activations of the prior layer means that assumptions the subsequent layer makes about the spread and distribution of inputs during the weight update will not change, at least not dramatically. This has the effect of stabilizing and speeding up the training process of deep neural networks.
Batch normalization acts to standardize only the mean and variance of each unit in order to stabilize learning, but allows the relationships between units and the nonlinear statistics of a single unit to change. — Page 320, Deep Learning, 2016.
 Normalizing the inputs to the layer has an effect on the training of the model, dramatically reducing the number of epochs required. It can also have a regularizing effect, reducing generalization error much like the use of activation regularization.
Batch normalization can have a dramatic effect on optimization performance, especially for convolutional networks and networks with sigmoidal nonlinearities. — Page 425, Deep Learning, 2016.
 Although reducing “internal covariate shift” was a motivation in the development of the method, there is some suggestion that instead batch normalization is effective because it smooths and, in turn, simplifies the optimization function that is being solved when training the network.
… BatchNorm impacts network training in a fundamental way: it makes the landscape of the corresponding optimization problem be significantly more smooth. This ensures, in particular, that the gradients are more predictive and thus allow for use of larger range of learning rates and faster network convergence. — How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift), 2018.
How to Standardize Layer Inputs

Batch normalization can be implemented during training by calculating the mean and standard deviation of each input variable to a layer per minibatch and using these statistics to perform the standardization.
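
The per-minibatch standardization described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it uses plain Python lists to stay dependency-free, the function name `batchnorm_train`, the toy minibatch, and the `eps` value are all assumptions made for the example, and the variance is the biased (divide-by-m) estimate.

```python
import math

def batchnorm_train(batch, eps=1e-5):
    """Standardize each input variable (column) using this minibatch's own statistics."""
    m = len(batch)
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / m for j in range(n_features)]
    # Biased (divide-by-m) variance of each column over the minibatch.
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / m for j in range(n_features)
    ]
    normalized = [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(n_features)]
        for row in batch
    ]
    return normalized, means, variances

# Hypothetical minibatch: 4 examples, 2 input variables on very different scales.
minibatch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
normalized, means, variances = batchnorm_train(minibatch)
```

After the call, both columns of `normalized` have approximately zero mean and unit variance, even though the raw columns differ in scale by a factor of 100.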

Alternatively, a running average of mean and standard deviation can be maintained across minibatches, but this may result in unstable training.
It is natural to ask whether we could simply use the moving averages […] to perform the normalization during training […]. This, however, has been observed to lead to the model blowing up. — Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models, 2017.
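
One way to maintain such running averages is an exponential moving average, sketched below for a single input variable. The class name `RunningStats`, the momentum of 0.9, and the initial values are assumptions for illustration; frameworks differ in their defaults and conventions. In line with the caution in the quote above, the averages here are only tracked alongside training and are used for normalization afterwards.

```python
import math

class RunningStats:
    """Exponential moving average of per-minibatch mean and variance for one variable."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum  # assumed value, not taken from the source
        self.mean = 0.0
        self.var = 1.0

    def update(self, values):
        # Statistics of the current minibatch (biased variance).
        batch_mean = sum(values) / len(values)
        batch_var = sum((v - batch_mean) ** 2 for v in values) / len(values)
        # Blend the new statistics into the running estimates.
        self.mean = self.momentum * self.mean + (1 - self.momentum) * batch_mean
        self.var = self.momentum * self.var + (1 - self.momentum) * batch_var
        return batch_mean, batch_var

    def normalize_for_inference(self, value, eps=1e-5):
        # The fixed running estimates replace the minibatch statistics here.
        return (value - self.mean) / math.sqrt(self.var + eps)

stats = RunningStats(momentum=0.9)
for minibatch in ([4.0, 6.0], [5.0, 7.0], [3.0, 5.0]):
    stats.update(minibatch)
```
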
After training, the mean and standard deviation of inputs for the layer can be set to the mean values observed over the training dataset. The batchnorm algorithm is as shown below (diagram taken from Ioffe and Szegedy, 2015).
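
For reference, the batch normalizing transform described by Ioffe and Szegedy (2015) can be written as follows for the values \(x_1, \dots, x_m\) of one activation over a minibatch of size \(m\). Here \(\epsilon\) is a small constant added for numerical stability, and \(\gamma\) and \(\beta\) are the learned scale and shift parameters discussed later in this section.

\[
\begin{aligned}
\mu_B &= \frac{1}{m}\sum_{i=1}^{m} x_i \\
\sigma_B^2 &= \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_B\right)^2 \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\
y_i &= \gamma\,\hat{x}_i + \beta
\end{aligned}
\]
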
 For small minibatch sizes or minibatches that do not contain a representative distribution of examples from the training dataset, the differences in the standardized inputs between training and inference (using the model after training) can result in noticeable differences in performance. This can be addressed with a modification of the method called Batch Renormalization (or BatchRenorm for short) that makes the estimates of the variable mean and standard deviation more stable across minibatches.
Batch Renormalization extends batchnorm with a per-dimension correction to ensure that the activations match between the training and inference networks. — Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models, 2017.
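
The per-dimension correction can be sketched for a single variable as follows. This is a simplified illustration under stated assumptions, not the paper's full training procedure: the function name `batch_renorm`, the clipping limits `r_max` and `d_max`, and the toy values are chosen for the example. The correction terms `r` and `d` pull the minibatch-normalized values toward what the running statistics would give.

```python
import math

def clip(value, low, high):
    return max(low, min(high, value))

def batch_renorm(values, running_mean, running_std, r_max=3.0, d_max=5.0, eps=1e-5):
    """Normalize with minibatch statistics, then correct toward the running statistics."""
    m = len(values)
    batch_mean = sum(values) / m
    batch_std = math.sqrt(sum((v - batch_mean) ** 2 for v in values) / m + eps)
    # r and d are treated as constants during backpropagation in the paper.
    r = clip(batch_std / running_std, 1.0 / r_max, r_max)
    d = clip((batch_mean - running_mean) / running_std, -d_max, d_max)
    return [((v - batch_mean) / batch_std) * r + d for v in values]

# Hypothetical minibatch and running statistics for one variable.
values = [2.0, 4.0, 6.0, 8.0]
running_mean, running_std = 4.5, 2.0

renormed = batch_renorm(values, running_mean, running_std)
# What inference would compute using the running statistics directly.
inference = [(v - running_mean) / running_std for v in values]
```

When neither `r` nor `d` hits its clipping limit, as in this example, the training-time output equals the inference-time output, which is the mismatch the method is meant to remove.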

This standardization of inputs may be applied to input variables for the first hidden layer or to the activations from a hidden layer for deeper layers.

In practice, it is common to allow the layer to learn two new parameters, namely a new mean and standard deviation, Beta and Gamma respectively, that allow the automatic scaling and shifting of the standardized layer inputs. These parameters are learned by the model as part of the training process.
Note that simply normalizing each input of a layer may change what the layer can represent. […] These parameters are learned along with the original model parameters, and restore the representation power of the network. — Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.
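
A short sketch of the learned scale and shift, assuming a single unit and hypothetical helper names `standardize` and `scale_and_shift`. It illustrates the point in the quote above: if gamma is set to the standard deviation of the activations and beta to their mean, the original activations are recovered, so the layer loses no representational power.

```python
import math

def standardize(values, eps=1e-5):
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    return [(v - mean) / math.sqrt(var + eps) for v in values], mean, var

def scale_and_shift(x_hat, gamma, beta):
    # gamma rescales the standardized values and beta shifts them.
    return [gamma * v + beta for v in x_hat]

# Hypothetical activations of one unit across a minibatch.
activations = [0.5, 1.5, 2.5, 3.5]
eps = 1e-5
x_hat, mean, var = standardize(activations, eps)

# With gamma = 1 and beta = 0 the standardized values pass through unchanged.
plain = scale_and_shift(x_hat, gamma=1.0, beta=0.0)

# With gamma = sqrt(var + eps) and beta = mean the original activations come back.
recovered = scale_and_shift(x_hat, gamma=math.sqrt(var + eps), beta=mean)
```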

Importantly, the backpropagation algorithm is updated to operate upon the transformed inputs, and the error is also used to update the new scale and shift parameters learned by the model.
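
As a partial illustration, the gradients for the scale and shift parameters alone are simple to write down for \(y = \gamma \hat{x} + \beta\). The sketch below assumes a squared-error loss, toy values, and a hypothetical function name `scale_shift_gradients`; it does not cover the more involved gradient that flows back through the minibatch mean and variance.

```python
def scale_shift_gradients(x_hat, upstream_grad):
    """Gradients of the loss with respect to gamma and beta for y = gamma * x_hat + beta."""
    grad_gamma = sum(g * xh for g, xh in zip(upstream_grad, x_hat))
    grad_beta = sum(upstream_grad)
    return grad_gamma, grad_beta

# Hypothetical standardized activations for one unit and targets for the loss.
x_hat = [-1.0, 0.0, 1.0]
targets = [0.0, 1.0, 2.0]
gamma, beta, learning_rate = 0.5, 0.0, 0.1

def loss(gamma, beta):
    return sum(0.5 * (gamma * xh + beta - t) ** 2 for xh, t in zip(x_hat, targets))

outputs = [gamma * xh + beta for xh in x_hat]
upstream_grad = [y - t for y, t in zip(outputs, targets)]  # dLoss/dy for squared error
grad_gamma, grad_beta = scale_shift_gradients(x_hat, upstream_grad)

# One plain gradient-descent step on the two learned parameters.
new_gamma = gamma - learning_rate * grad_gamma
new_beta = beta - learning_rate * grad_beta
```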

The standardization is applied to the inputs to the layer, namely the input variables or the output of the activation function from the prior layer. Given the choice of activation function, the distribution of the inputs to the layer may be quite non-Gaussian. In this case, there may be benefit in standardizing the summed activation before the activation function in the previous layer.
We add the BN transform immediately before the nonlinearity […] We could have also normalized the layer inputs \(u\), but since \(u\) is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. — Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.
Examples of Using Batch Normalization

This section provides a few examples of milestone papers and popular models that make use of batch normalization.

In the 2015 paper that introduced the technique titled “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” the authors Sergey Ioffe and Christian Szegedy from Google demonstrated a dramatic speedup of an Inception-based convolutional neural network for photo classification over a baseline method.
By only using Batch Normalization […], we match the accuracy of Inception in less than half the number of training steps.
Kaiming He, et al. in their 2015 paper titled “Deep Residual Learning for Image Recognition” used batch normalization after the convolutional layers in their very deep model referred to as ResNet and achieved then state-of-the-art results on the ImageNet dataset, a standard photo classification task.
We adopt batch normalization (BN) right after each convolution and before activation …
Christian Szegedy, et al. from Google in their 2016 paper titled “Rethinking the Inception Architecture for Computer Vision” used batch normalization in their updated Inception model referred to as GoogleNet Inception-v3, achieving then state-of-the-art results on the ImageNet dataset.
BN-auxiliary refers to the version in which the fully connected layer of the auxiliary classifier is also batch-normalized, not just the convolutions.
Dario Amodei, et al. from Baidu in their 2016 paper titled “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin” used a variation of batch normalization for recurrent neural networks in their end-to-end deep model for speech recognition.
… we find that when applied to very deep networks of RNNs on large data sets, the variant of BatchNorm we use substantially improves final generalization error in addition to accelerating training
Tips for Using Batch Normalization
 This section provides tips and suggestions for using batch normalization with your own neural networks.
Use With Different Network Types

Batch normalization is a general technique that can be used to normalize the inputs to a layer.

It can be used with most network types, such as Multilayer Perceptrons, Convolutional Neural Networks and Recurrent Neural Networks.
Probably Use Before the Activation

Batch normalization may be used on the inputs to the layer before or after the activation function in the previous layer.

It may be more appropriate after the activation function for s-shaped functions like the hyperbolic tangent and logistic function.

It may be appropriate before the activation function for activations that may result in non-Gaussian distributions like the rectified linear activation function (ReLU), the modern default for most network types.
The goal of Batch Normalization is to achieve a stable distribution of activation values throughout training, and in our experiments we apply it before the nonlinearity since that is where matching the first and second moments is more likely to result in a stable distribution. — Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.
 Perhaps test both approaches with your network.
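
The two placements can be compared side by side with a small sketch. It assumes a single unit with a ReLU activation, toy pre-activation values, and a simplified `batchnorm` helper without the learned scale and shift; it only shows how the ordering changes the values passed on, not which ordering trains better.

```python
import math

def batchnorm(values, eps=1e-5):
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(values):
    return [max(0.0, v) for v in values]

# Hypothetical summed activations (pre-activations) of one unit across a minibatch.
pre_activations = [-2.0, -1.0, 0.5, 3.0]

bn_before = relu(batchnorm(pre_activations))  # normalize, then apply the nonlinearity
bn_after = batchnorm(relu(pre_activations))   # apply the nonlinearity, then normalize
```

With normalization first, the next layer receives non-negative values. With normalization last, it receives zero-mean values that can be negative.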
Use Large Learning Rates

Using batch normalization makes the network more stable during training.

This may permit the use of much larger than normal learning rates, which in turn may further speed up the learning process.
In a batch-normalized model, we have been able to achieve a training speedup from higher learning rates, with no ill side effects — Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.
 The faster training also means that the decay rate used for the learning rate may be increased.
Less Sensitive to Weight Initialization

Deep neural networks can be quite sensitive to the technique used to initialize the weights prior to training.

The stability that batch normalization brings to training can make training deep networks less sensitive to the choice of weight initialization method.
Alternative to Data Preparation

Batch normalization could be used to standardize raw input variables that have differing scales.

If the mean and standard deviation for each input feature are calculated over the minibatch instead of over the entire training dataset, then the minibatch must be large enough to be representative of the range of each variable.

It may not be appropriate for variables that have a data distribution that is highly non-Gaussian, in which case it might be better to perform data scaling as a preprocessing step.
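
The difference between minibatch and dataset statistics can be seen in a small sketch. The training set, the choice of the first two rows as a minibatch, and the helper name `column_stats` are assumptions for illustration. The last step shows the preprocessing alternative, where every row is standardized once with the dataset-level statistics.

```python
import math

def column_stats(rows):
    """Per-column mean and standard deviation of a list of equal-length rows."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(n_cols)
    ]
    return means, stds

# Hypothetical training set: two raw input variables with very different scales.
train = [
    [1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0],
    [4.0, 6000.0], [5.0, 4000.0], [6.0, 5000.0],
]

dataset_means, dataset_stds = column_stats(train)
batch_means, batch_stds = column_stats(train[:2])  # a small, unrepresentative minibatch

# Preprocessing alternative: standardize every row once with dataset-level statistics.
scaled = [
    [(r[j] - dataset_means[j]) / dataset_stds[j] for j in range(2)] for r in train
]
```

Here the two-example minibatch has means well below the dataset means for both variables, so standardizing with its statistics would shift the inputs compared with the dataset-level scaling.
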
Don’t Use With Dropout
 Batch normalization offers some regularization effect, reducing generalization error, perhaps no longer requiring the use of dropout for regularization.
Removing Dropout from Modified BN-Inception speeds up training, without increasing overfitting. — Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.

Further, it may not be a good idea to use batch normalization and dropout in the same network.

The reason is that the statistics used to normalize the activations of the prior layer may become noisy given the random dropping out of nodes during the dropout procedure.
Batch normalization also sometimes reduces generalization error and allows dropout to be omitted, due to the noise in the estimate of the statistics used to normalize each variable. — Page 425, Deep Learning, 2016.
Further Reading
 This section provides more resources on the topic if you are looking to go deeper.
Books
 Section 8.7.1. Batch Normalization, Deep Learning, 2016
 Section 7.3.1. Advanced architecture patterns, Deep Learning With Python, 2017
Papers
 Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015
 Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models, 2017
 How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift), 2018
Articles
 Batch normalization, Wikipedia
 Why Does Batch Norm Work?, deeplearning.ai, Video
 Batch Normalization, OpenAI, 2016
 Batch Normalization before or after ReLU?, Reddit
Key takeaways

In this post, you discovered the batch normalization method used to accelerate the training of deep learning neural networks.

Specifically, you learned:
 Deep neural networks are challenging to train, not least because the input from prior layers can change after weight updates.
 Batch normalization is a technique to standardize the inputs to a network, applied to either the activations of a prior layer or inputs directly.
 Batch normalization accelerates training, in some cases by halving the epochs or better, and provides some regularization, reducing generalization error.
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledBatchNorm,
title = {Batchnorm},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}