• To avoid overfitting the training set (and thus improve generalization), you can reduce the complexity of the model by removing layers, which decreases the number of parameters. As shown in “A Simple Weight Decay Can Improve Generalization” by Krogh and Hertz (1992), another way to constrain a network and lower its complexity is to:

Limit the growth of the weights through some kind of weight decay.

  • You want to prevent the weights from growing too large, unless it is really necessary. Intuitively, you are reducing the set of potential networks to choose from.

Weight sparsity and why it matters

  • Sparse vectors often contain many dimensions. Creating a feature-cross results in even more dimensions. Given such high-dimensional feature vectors, the model size may become huge and require huge amounts of RAM.

  • In a high-dimensional sparse vector, it would be nice to encourage weights to drop to exactly 0 where possible. A weight of exactly 0 essentially removes the corresponding feature from the model. Zeroing out features will save RAM and may reduce noise in the model.

  • For example, consider a housing dataset that covers the entire globe. Bucketing global latitude at the minute level (60 minutes per degree) gives about 10,000 dimensions in a sparse encoding; global longitude at the minute level gives about 20,000 dimensions. A feature-cross of these two features would result in roughly 200,000,000 dimensions. Many of those 200,000,000 dimensions represent areas of such limited residence (for example, the middle of the ocean) that it would be difficult to use that data to generalize effectively. It would be silly to pay the RAM cost of storing these unneeded dimensions. Therefore, it would be nice to encourage the weights for the meaningless dimensions to drop to exactly 0, which would allow us to avoid paying for the storage cost of these model coefficients at inference time.

  • We might be able to encode this idea into the optimization problem done at training time, by adding an appropriately chosen regularization term.

  • Would L2 regularization accomplish this task? Unfortunately not. L2 regularization encourages weights to be small, but doesn’t force them to exactly 0.0.

  • An alternative idea would be to try and create a regularization term that penalizes the count of non-zero coefficient values in a model. Increasing this count would only be justified if there was a sufficient gain in the model’s ability to fit the data. Unfortunately, while this count-based approach is intuitively appealing, it would turn our convex optimization problem into a non-convex optimization problem. So this idea, known as L0 regularization, isn’t something we can use effectively in practice.

  • However, there is a regularization term called L1 regularization that serves as an approximation to L0, but has the advantage of being convex and thus efficient to compute. So we can use L1 regularization to encourage many of the uninformative coefficients in our model to be exactly 0, and thus reap RAM savings at inference time.
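
The latitude/longitude example above can be checked with a few lines of arithmetic. This is a minimal sketch: the 4 bytes per weight (one 32-bit float per coefficient) is an assumption made for illustration and is not stated in the text.

```python
# Worked arithmetic for the latitude/longitude feature cross described above.
MINUTES_PER_DEGREE = 60

lat_buckets = 180 * MINUTES_PER_DEGREE   # latitude spans 180 degrees
lon_buckets = 360 * MINUTES_PER_DEGREE   # longitude spans 360 degrees
cross_buckets = lat_buckets * lon_buckets

# Assumption: one 32-bit float (4 bytes) is stored per weight.
BYTES_PER_WEIGHT = 4
dense_bytes = cross_buckets * BYTES_PER_WEIGHT

print(lat_buckets, lon_buckets, cross_buckets)
print(f"{dense_bytes / 1e6:.0f} MB if every weight is stored")
```

The exact counts (10,800 and 21,600 buckets, about 233 million crossed dimensions) match the rounded figures quoted above. Every weight that is driven to exactly 0 is one coefficient that no longer has to be stored.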

How does regularization work?

  • L1 and L2 regularization can be achieved by adding the corresponding norm term, which penalizes large weights, to the cost function. If you were training a network to minimize the cost \(J_{cross−entropy} = − \frac{1}{m} \sum\limits_{i = 1}^m {y^{(i)}}{\log(\hat{y}^{(i)})}\) with (let’s say) gradient descent, your weight update rule would be:
\[w = w − \alpha \frac{\partial J_{cross−entropy}}{\partial w}\]
  • Instead, you would now train a network to minimize the cost

    \[J_{regularized} = J_{cross−entropy} + \lambda J_{L1\,or\,L2}\]
    • where,
      • \(\lambda\) is the regularization strength, which is a hyperparameter.
      • \(J_{L1} = \sum\limits_{\text{all weights }w_k} | w_k |\).
      • \(J_{L2} = || w ||^2 = \sum\limits_{\text{all weights }w_k} | w_k |^2\) (here, \(|| \cdot ||^2\) is the L2 norm for vectors and the Frobenius norm for matrices).
  • Your modified weight update rule would thus be:

\[w = w − \alpha \frac{\partial J_{regularized}}{\partial w} = w − \alpha (\frac{\partial J_{cross−entropy}}{\partial w} + \lambda \frac{\partial J_{L1\,or\,L2}}{\partial w})\]
  • For L1 regularization, this would lead to the update rule:
\[w = w − \alpha \lambda sign(w) − \alpha \frac{\partial J_{cross−entropy}}{\partial w}\]
  • For L2 regularization, this would lead to the update rule:

    \[w = w − 2 \alpha \lambda w − \alpha \frac{\partial J_{cross−entropy}}{\partial w}\]
    • where,
      • \(\alpha \lambda sign(w)\) is the L1 penalty.
      • \(2 \alpha \lambda w\) is the L2 penalty.
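      • \(\alpha \frac{\partial J_{cross−entropy}}{\partial w}\) is the usual gradient step on the unregularized cost.
  • At every step, the penalty pushes the weight slightly toward zero, causing weight decay: L1 subtracts the constant \(\alpha \lambda\) from the weight’s magnitude, while L2 multiplies the weight by \(1 − 2 \alpha \lambda\), which is slightly below 1 because \(2 \alpha \lambda\,\ll\,1\).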

  • As seen above, the update rules for L1 and L2 regularization are different. While the L2 “weight decay” \(2w\) penalty is proportional to the value of the weight to be updated, the L1 “weight decay” \(sign(w)\) is not.
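
The two update rules above can be sketched in plain Python for a single scalar weight. The function names and the values of `alpha`, `lam`, and `grad` are hypothetical and chosen only for illustration. The data gradient is set to zero so that only the penalty acts.

```python
def sign(w):
    """Return +1.0, -1.0, or 0.0 according to the sign of w."""
    return float((w > 0) - (w < 0))

def l1_step(w, grad, alpha, lam):
    """One step with an L1 penalty: w - alpha*lam*sign(w) - alpha*grad."""
    return w - alpha * lam * sign(w) - alpha * grad

def l2_step(w, grad, alpha, lam):
    """One step with an L2 penalty: w - 2*alpha*lam*w - alpha*grad."""
    return w - 2 * alpha * lam * w - alpha * grad

alpha, lam = 0.1, 0.01   # hypothetical learning rate and regularization strength
grad = 0.0               # data gradient zeroed out to isolate the penalty

for w in (2.0, 0.02):
    print(w, l1_step(w, grad, alpha, lam), l2_step(w, grad, alpha, lam))
```

In this sketch the L1 step shrinks both weights by the same amount (`alpha * lam`), whereas the L2 step shrinks each weight by the same fraction (`2 * alpha * lam`) of its current value.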

L1 vs. L2 regularization

  • L2 and L1 penalize weights differently:

    • L2 penalizes \(weight^2\).
    • L1 penalizes \(|weight|\).
  • Consequently, L2 and L1 have different derivatives:

    • The derivative of L2 is \(2 * weight\).
    • The derivative of L1 is \(k\) (a constant, whose value is independent of the weight but depends on the sign of the weight).
  • You can think of the derivative of L2 as a force that removes \(x\%\) of the weight every time. As Zeno knew, even if you remove \(x\%\) of a number billions of times, the diminished number will still never quite reach zero. (Zeno was less familiar with floating-point precision limitations, which could possibly produce exactly zero.) At any rate, L2 does not normally drive weights to zero.
    • Simply put, for L2, the smaller the \(w\), the smaller the penalty during the update of \(w\) and vice-versa for larger \(w\).

    L2 regularization thus tends to spread the weight magnitude as evenly as possible over the entire weight matrix, favoring many small weights over a few large ones.

  • You can think of the derivative of L1 as a force that subtracts some constant from the weight every time, irrespective of the weight’s magnitude. However, because of the absolute value, the derivative of L1 is discontinuous at 0, so subtraction results that cross 0 are zeroed out. For example, if subtraction would have forced a weight from +0.1 to -0.2, L1 will set the weight to exactly 0. Eureka, L1 zeroed out the weight!
    • Simply put, for L1, the magnitude of the penalty is independent of the value of \(w\), but its direction (positive or negative) depends on the sign of \(w\).

    L1 regularization thus results in an effect called “feature selection” or “weight sparsity” since it makes the irrelevant weights 0.

  • L1 regularization – penalizing the absolute value of all the weights – turns out to be quite efficient for wide models.

  • Note that this description is true for a one-dimensional model.
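
The contrast between the two “forces” described above can be sketched for a single weight with no data gradient. The function names and hyperparameter values are hypothetical. The clipping rule in `l1_decay_clipped` is an assumption that mirrors the +0.1 to -0.2 example above; a plain subgradient step without clipping would overshoot zero instead.

```python
def l2_decay(w, alpha, lam, steps):
    """Repeatedly remove a fixed fraction (2*alpha*lam) of the weight."""
    for _ in range(steps):
        w = w - 2 * alpha * lam * w
    return w

def l1_decay_clipped(w, alpha, lam, steps):
    """Repeatedly subtract a constant (alpha*lam) from the weight's magnitude."""
    for _ in range(steps):
        shrunk = w - alpha * lam * ((w > 0) - (w < 0))
        # Assumption: a step that reaches or crosses zero is set to exactly 0.
        w = 0.0 if shrunk * w <= 0 else shrunk
    return w

alpha, lam = 0.1, 0.5   # hypothetical values, exaggerated to keep the run short
print(l2_decay(1.0, alpha, lam, steps=100))          # small but still non-zero
print(l1_decay_clipped(1.0, alpha, lam, steps=100))  # exactly 0.0
```

With these values, L2 multiplies the weight by 0.9 at each step and never reaches zero within 100 steps, while L1 removes 0.05 per step and lands on exactly 0.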

Effect of L1 and L2 regularization on weights

  • L1 and L2 regularization have a dramatic effect on the weight values during training.

  • For L1 regularization:
    • “Too small” \(\lambda\) constant: there’s no apparent effect.
    • “Appropriate” \(\lambda\) constant: many of the weights become zeros. This is called “sparsity of the weights”. Because the weight penalty is independent of the weight values, weights with value 0.001 are penalized just as much as weights with value 1000. The value of the penalty is \(\alpha \lambda\) (generally very small). It constrains the subset of weights that are “less useful to the model” to be equal to 0. For example, you could effectively end up with 200 non-zero weights out of 1000, which means that 800 weights were less useful for learning the task.
    • “Too large” \(\lambda\) constant: You can observe a plateau, which means that the weights uniformly take values around zero. In fact, because \(\lambda\) is large, the penalty −\(\alpha \lambda sign(w)\) is much higher than the gradient −\(\alpha \frac{\partial J_{cross−entropy}}{\partial w}\). Thus, at every update, \(w\) is pushed by \(\approx −\alpha\lambda sign(w)\) in the opposite direction of its sign. For instance, if w is around zero, but slightly positive, then it will be pushed towards −\(\alpha \lambda\) when the penalty is applied. Hence, the plateau’s width should be \(2 \times \alpha \lambda\).
  • For L2 regularization:
    • “Too small” \(\lambda\) constant: there’s no apparent effect.
    • “Appropriate” \(\lambda\) constant: the weight values decrease following a centered distribution that becomes more and more peaked throughout training.
    • “Too large” \(\lambda\) constant: all the weights rapidly collapse to zero, and the model underfits because the weight values are too constrained.
  • Note that the weight sparsity effect caused by L1 regularization makes your model more compact in theory, which yields storage-efficient models of the kind commonly deployed on mobile devices.
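
The plateau described for a “too large” L1 constant can be sketched by applying only the penalty step to a weight that starts near zero. This assumes the data gradient is negligible (it is set to zero here); the function name and the values of `alpha` and `lam` are hypothetical.

```python
def l1_penalty_only_steps(w, alpha, lam, steps):
    """Apply w -= alpha*lam*sign(w) repeatedly, with no data gradient."""
    history = [w]
    for _ in range(steps):
        w = w - alpha * lam * ((w > 0) - (w < 0))
        history.append(w)
    return history

alpha, lam = 0.1, 0.1   # hypothetical values, so alpha * lam is about 0.01
trace = l1_penalty_only_steps(0.003, alpha, lam, steps=6)

print([round(w, 4) for w in trace])
print("stays inside the band:", all(abs(w) <= alpha * lam for w in trace))
print("band width:", 2 * alpha * lam)
```

The weight hops back and forth across zero without ever leaving the interval from \(−\alpha \lambda\) to \(+\alpha \lambda\), which is where the plateau width of \(2 \times \alpha \lambda\) comes from.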

How does weight decay help the model generalize?

  • Weight decay suppresses any irrelevant components of the weight vector by driving the optimization to find the smallest vector that solves the learning problem.
  • Weight decay attenuates the influence of outliers on the weight optimization by constraining the weight values. There’s less risk for the weights to learn the sampling error (for example, when a subset of your data points comes from a wrong distribution or is mislabelled). In other words, the output of the model is less affected by changes in the input.
  • L1 and L2 regularizations have a dramatic effect on the geometry of the cost function: adding regularization results in a more convex cost landscape and diminishes the chance of converging to a non-desired local minimum.

Dropout regularization

  • Although L1 and L2 regularization are simple techniques for reducing overfitting, other methods, such as dropout regularization, have been shown to be more effective at regularizing larger and more complex networks. If you had unlimited computational power, you could improve generalization by averaging the predictions of several different neural networks trained on the same task. The combination of these models will likely perform better than a single neural network trained on this task. However, with deep neural networks, training many different architectures is expensive because:
    • Tuning the hyperparameters is time consuming.
    • Training the networks requires a lot of computations.
    • You need a large amount of data to train the models on different subsets of the data.
  • Dropout is a regularization technique, introduced in Srivastava et al. (2014), that allows you to combine many different architectures efficiently by randomly dropping some of the neurons of your network during training.
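
The idea of randomly dropping neurons during training can be sketched on a plain list of activations. This is a minimal illustration, not the procedure from the paper: the function name, the `keep_prob` parameter, and the rescaling of surviving activations by `1 / keep_prob` (often called inverted dropout) are assumptions introduced here.

```python
import random

def dropout(activations, keep_prob, rng, training=True):
    """Zero each activation with probability 1 - keep_prob during training.

    Assumption: survivors are scaled by 1 / keep_prob so that the expected
    value of each activation is unchanged. Outside training, the input is
    returned as-is.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if not training:
        return list(activations)
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)                       # seeded for repeatability
hidden = [0.5, 1.2, -0.7, 0.9, 0.3, -1.1]    # hypothetical layer output

print(dropout(hidden, keep_prob=0.5, rng=rng))
print(dropout(hidden, keep_prob=0.5, rng=rng, training=False))
```

Each training call draws a fresh random mask, so every step effectively trains a different thinned sub-network that shares weights with all the others.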


  • When applying regularization methods, you need a metric to track your model’s improvement and generalization ability. The bias/variance tradeoff enables you to measure the efficiency of your regularization.
  • Successfully training a model on complex tasks is difficult. You need to find a model architecture that can capture the complexity of the dataset. Once you find such an architecture, you can work on improving generalization. Exploring, and even combining, different regularization techniques is an essential step of the training process. It helps you build intuition about a model’s ability to generalize in the real world.


If you found our work useful, please cite it as:

  title   = {Regularization},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{}}