Xavier Initialization and Regularization
Introduction
Training a deep neural network involves finding the right values for the weights: we would like to end up with a model with low generalization error. Our model should perform well on the full, real-world data distribution, and should neither underfit nor overfit.
In this section, we'll analyze two techniques, initialization and regularization, and show how they help us train models more effectively.
Xavier initialization
 All deep learning optimization methods involve an initialization of the weight parameters.
Let's build some intuition on the effect of different initializations. Two questions come to mind:
 What makes a good or bad initialization? How can different magnitudes of initializations lead to exploding and vanishing gradients?
 If we initialize weights to all zeros or the same value, what problem arises?
 Visualizing the effects of different initializations:
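To see why initializing every weight to the same value is a problem, consider the following minimal sketch (a hypothetical two-layer network, not this article's code): when all weights in a layer start identical, every hidden unit computes the same output and receives the same gradient, so the units can never differentiate from one another — the symmetry is never broken.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))   # one input sample with 4 features
y = np.array([[1.0]])         # one target value

# Every weight in each layer starts at the same constant value
W1 = np.full((3, 4), 0.5)     # 3 hidden units
W2 = np.full((1, 3), 0.5)

# Forward pass: all hidden units compute the identical quantity
h = np.tanh(W1 @ x)
y_hat = W2 @ h
dy = y_hat - y                # gradient of 0.5 * (y_hat - y)^2

# Backward pass
dW2 = dy @ h.T
dh = W2.T @ dy
dW1 = (dh * (1 - h**2)) @ x.T

# Every row of dW1 is identical, so after any number of gradient steps
# the hidden units remain exact copies of one another
print(dW1)
```

With an all-zeros initialization the situation is even worse: the hidden activations and the gradients for `W1` are all exactly zero, so the first layer never updates at all.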
The goal of Xavier initialization is to initialize the weights such that the variance of the activations is the same across every layer. This constant variance helps prevent the gradient from exploding or vanishing.
 To help derive our initialization values, we will make the following simplifying assumptions:
 Weights and inputs are centered at zero
 Weights and inputs are independent and identically distributed
 Biases are initialized as zeros
We use the tanh() activation function, which is approximately linear for small inputs: \(Var(a^{[l]}) \approx Var(z^{[l]})\)
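The effect of this constant-variance property can be checked numerically. The sketch below (an assumed setup, not this article's code) pushes zero-mean random inputs through a deep stack of tanh layers and compares the activation standard deviations for a too-small initialization against a Xavier-style \(Var(W) = 1/n\) initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 10
x = rng.normal(size=(n, 1000))  # 1000 zero-mean samples, n features each

def activation_stds(weight_std):
    """Standard deviation of activations at each tanh layer."""
    a, stds = x, []
    for _ in range(depth):
        W = rng.normal(scale=weight_std, size=(n, n))
        a = np.tanh(W @ a)
        stds.append(a.std())
    return stds

small = activation_stds(weight_std=0.01)              # activations vanish layer by layer
xavier = activation_stds(weight_std=1.0 / np.sqrt(n)) # variance stays roughly stable

print(small[-1], xavier[-1])
```

With the too-small initialization, the activation scale shrinks geometrically and is essentially zero after ten layers, while the Xavier initialization keeps it at a healthy magnitude.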
Derivation: Xavier initialization
Our full derivation gives us the following initialization rule, which we apply to all weights in layer \(l\):

\[W^{[l]}_{i,j} \sim \mathcal{N}\!\left(0, \frac{1}{n^{[l-1]}}\right)\]

where \(n^{[l-1]}\) is the number of neurons in layer \(l-1\).
 Xavier initialization is designed to work well with tanh or sigmoid activation functions. For ReLU activations, look into He initialization, which follows a very similar derivation.
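The two rules can be written side by side. The sketch below (function names are my own, not a standard API) draws weights for a layer with `fan_in` inputs and `fan_out` outputs:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    # Xavier: Var(W) = 1 / fan_in, keeps activation variance steady for tanh/sigmoid
    return rng.normal(scale=np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

def he_init(fan_in, fan_out, rng):
    # He: Var(W) = 2 / fan_in; the factor of 2 compensates for ReLU
    # zeroing out roughly half of its inputs
    return rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W_xavier = xavier_init(400, 300, rng)
W_he = he_init(400, 300, rng)
print(W_xavier.std(), W_he.std())  # ≈ 1/sqrt(400) = 0.05 and ≈ sqrt(2/400) ≈ 0.0707
```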
L1 and L2 Regularization

We know that \(L_1\) regularization encourages sparse weights (many zero values), and that \(L_2\) regularization encourages small weight values, but why does this happen?

Let's consider some cost function \(J(w_1,\dots,w_l)\), a function of the weights \(w_1,\dots,w_l\). Let's define the following two regularized cost functions:

\[J_{L_1}(w_1,\dots,w_l) = J(w_1,\dots,w_l) + \lambda\sum_{i=1}^{l}|w_i|\]
\[J_{L_2}(w_1,\dots,w_l) = J(w_1,\dots,w_l) + \lambda\sum_{i=1}^{l}|w_i|^2\]
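As a concrete sketch (assuming the standard element-wise penalties, with the base cost \(J\) passed in as a precomputed number), the two regularized costs can be written as:

```python
import numpy as np

def l1_cost(J, w, lam):
    # J_L1 = J + lambda * sum of absolute weight values
    return J + lam * np.sum(np.abs(w))

def l2_cost(J, w, lam):
    # J_L2 = J + lambda * sum of squared weight values
    return J + lam * np.sum(w**2)

w = np.array([0.5, -2.0, 0.0])
print(l1_cost(1.0, w, lam=0.1))  # 1.0 + 0.1 * 2.5  = 1.25
print(l2_cost(1.0, w, lam=0.1))  # 1.0 + 0.1 * 4.25 = 1.425
```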
Derivation: Update rule with L1 and L2 Regularization
 Let’s derive the update rules for L1 and L2 regularization based on their respective cost functions.
The update for \(w_i\) when using \(J_{L_1}\) is (with learning rate \(\alpha\)):

\[w_i \leftarrow w_i - \alpha \frac{\partial J_{L_1}}{\partial w_i} = w_i - \alpha\left(\frac{\partial J}{\partial w_i} + \lambda\,\mathrm{sign}(w_i)\right)\]
The update for \(w_i\) when using \(J_{L_2}\) is:

\[w_i \leftarrow w_i - \alpha \frac{\partial J_{L_2}}{\partial w_i} = w_i - \alpha\left(\frac{\partial J}{\partial w_i} + 2\lambda w_i\right)\]
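The difference between the two penalty terms can be seen by applying only the regularization part of each update to the same weight vector (a simplified sketch that assumes the data-gradient term is zero; the L1 step is clipped at zero, since a plain sign step would oscillate around it):

```python
import numpy as np

alpha, lam = 0.1, 0.5
w_l1 = np.array([0.8, -0.03, 0.4])
w_l2 = w_l1.copy()

for _ in range(100):
    # L1 subtracts a constant-magnitude step alpha*lambda*sign(w),
    # so weights reach exactly zero and stay there
    step = alpha * lam * np.sign(w_l1)
    w_l1 = np.where(np.abs(w_l1) <= np.abs(step), 0.0, w_l1 - step)
    # L2 shrinks each weight by a constant *fraction*, so weights
    # get small but never reach exactly zero
    w_l2 = w_l2 - alpha * (2 * lam * w_l2)

print(w_l1)  # exact zeros: L1 produces sparsity
print(w_l2)  # tiny but nonzero: L2 shrinks weights smoothly
```

The constant-size L1 step is what drives many weights to exactly zero (sparsity), while the proportional L2 step only ever shrinks weights toward zero; larger \(\lambda\) makes both effects stronger.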
 The next question is: what do you notice that is different between these two update rules, and how does it affect optimization? What effect does the hyperparameter \(\lambda\) have?
The figure below shows a histogram of weight values for an unregularized network (red) alongside an L1-regularized network (blue, left) and an L2-regularized network (blue, right):

The different effects of \(L_1\) and \(L_2\) regularization on the optimal parameters are an artifact of the different ways in which they change the original loss landscape. In the case of two parameters (\(w_1\) and \(w_2\)), we can visualize this.

The figure below shows the landscape of a two parameter loss function with L1 regularization (left) and L2 regularization (right):
Further Reading
 Here are some (optional) links you may find interesting for further reading:
Daniel Kunin's blog post for a deeper treatment of regularization.
 Chapter 3 of The Elements of Statistical Learning.
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledXavierInitandRegularization,
title = {Xavier Initialization and Regularization},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}