Introduction

  • In order to avoid overfitting the training set (and thus improve generalization), you can try to reduce the complexity of the model by removing layers, and consequently decreasing the number of parameters. As shown in “A Simple Weight Decay Can Improve Generalization” by Krogh and Hertz (1992), another way to constrain a network and lower its complexity is to:

Limit the growth of the weights through some kind of weight decay.

  • You want to prevent the weights from growing too large, unless it is really necessary. Intuitively, you are reducing the set of potential networks to choose from.

Weight sparsity and why it matters

  • Sparse vectors often contain many dimensions. Creating a feature-cross results in even more dimensions. Given such high-dimensional feature vectors, the model size may become huge and require huge amounts of RAM.

  • In a high-dimensional sparse vector, it would be nice to encourage weights to drop to exactly 0 where possible. A weight of exactly 0 essentially removes the corresponding feature from the model. Zeroing out features will save RAM and may reduce noise in the model.

  • For example, consider a housing dataset that covers the entire globe. Bucketing global latitude at the minute level (60 minutes per degree) gives about 10,000 dimensions in a sparse encoding; global longitude at the minute level gives about 20,000 dimensions. A feature-cross of these two features would result in roughly 200,000,000 dimensions. Many of those 200,000,000 dimensions represent areas of such limited residence (for example, the middle of the ocean) that it would be difficult to use that data to generalize effectively. It would be silly to pay the RAM cost of storing these unneeded dimensions. Therefore, it would be nice to encourage the weights for the meaningless dimensions to drop to exactly 0, which would allow us to avoid paying for the storage cost of these model coefficients at inference time.

  • We might be able to encode this idea into the optimization problem done at training time, by adding an appropriately chosen regularization term.

  • Would \(L_2\) regularization accomplish this task? Unfortunately not. \(L_2\) regularization encourages weights to be small, but doesn’t force them to exactly 0.0.

  • An alternative idea would be to try and create a regularization term that penalizes the count of non-zero coefficient values in a model. Increasing this count would only be justified if there was a sufficient gain in the model’s ability to fit the data. Unfortunately, while this count-based approach is intuitively appealing, it would turn our convex optimization problem into a non-convex optimization problem. So this idea, known as L0 regularization, isn’t something we can use effectively in practice.

  • However, there is a regularization term called \(L_1\) regularization that serves as an approximation to L0, but has the advantage of being convex and thus efficient to compute. So we can use \(L_1\) regularization to encourage many of the uninformative coefficients in our model to be exactly 0, and thus reap RAM savings at inference time.
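
  • As a quick illustration of this difference, the sketch below (not from the original text; it uses scikit-learn’s Lasso and Ridge estimators on made-up synthetic data) fits the same linear model with an \(L_1\) and an \(L_2\) penalty and counts how many weights end up exactly at 0:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))              # 50 features, only the first 5 matter
true_w = np.zeros(50)
true_w[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)          # L1-regularized linear regression
ridge = Ridge(alpha=0.1).fit(X, y)          # L2-regularized linear regression

# Typically most (often all) of the 45 uninformative weights are exactly zero for Lasso,
# while Ridge keeps them small but non-zero.
print("L1 (Lasso) weights exactly zero:", int(np.sum(lasso.coef_ == 0)))
print("L2 (Ridge) weights exactly zero:", int(np.sum(ridge.coef_ == 0)))
```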

How does regularization work?

  • \(L_1\) and \(L_2\) regularization can be applied by simply adding the corresponding norm term, which penalizes large weights, to the cost function. If you were training a network to minimize the cost $$J_{cross-entropy} = - \frac{1}{m} \sum\limits_{i = 1}^m y^{(i)} \log(\hat{y}^{(i)})$$ with (let’s say) gradient descent, your weight update rule would be:
\[w = w - \alpha \frac{\partial J_{cross-entropy}}{\partial w}\]
  • Instead, you would now train a network to minimize the cost

    \[J_{regularized} = J_{cross-entropy} + \lambda J_{L_1 \text{ or } L_2}\]

    • where,
      • $$\lambda$$ is the regularization strength, which is a hyperparameter.
      • $$J_{L_1} = \sum\limits_{\text{all weights }w_k} | w_k |$$.
      • $$J_{L_2} = || w ||^2 = \sum\limits_{\text{all weights }w_k} | w_k |^2$$ (here, $$|| \cdot ||$$ is the \(L_2\) norm for vectors and the Frobenius norm for matrices).
  • Your modified weight update rule would thus be:

\[w = w - \alpha \frac{\partial J_{regularized}}{\partial w} = w - \alpha \left(\frac{\partial J_{cross-entropy}}{\partial w} + \lambda \frac{\partial J_{L_1 \text{ or } L_2}}{\partial w}\right)\]

  • For \(L_1\) regularization, this would lead to the update rule:
\[w = w - \alpha \lambda \, sign(w) - \alpha \frac{\partial J_{cross-entropy}}{\partial w}\]
  • For \(L_2\) regularization, this would lead to the update rule:

    \[w = w - 2 \alpha \lambda w - \alpha \frac{\partial J_{cross-entropy}}{\partial w}\]
    • where,
      • $$\alpha \lambda sign(w)$$ is the \(L_1\) penalty.
      • $$2 \alpha \lambda w$$ is the \(L_2\) penalty.
      • $$\alpha \frac{\partial J_{cross-entropy}}{\partial w}$$ is the gradient of the cross-entropy loss.
  • At every step of \(L_1\) and \(L_2\) regularization, the penalty pushes the weight slightly closer to zero (the shrinkage per step is small, since $$2 \alpha \lambda \ll 1$$), which is why this is called weight decay.

  • As seen above, the update rules for \(L_1\) and \(L_2\) regularization are different. While the \(L_2\) “weight decay” $$2w$$ penalty is proportional to the value of the weight to be updated, the \(L_1\) “weight decay” $$sign(w)$$ is not.
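
  • The following minimal sketch applies the two update rules above to a toy one-dimensional loss (the quadratic loss, learning rate, and \(\lambda\) below are illustrative assumptions, not values from the text) to show the constant-offset versus proportional-shrinkage behavior:

```python
import numpy as np

# Hypothetical loss: a simple quadratic whose unregularized minimum is at w = 1.0.
def grad_J(w):
    return 2.0 * (w - 1.0)

alpha, lam = 0.1, 0.5     # illustrative learning rate and regularization strength
w_l1 = w_l2 = 2.0         # same starting point for both runs
for _ in range(200):
    w_l1 = w_l1 - alpha * lam * np.sign(w_l1) - alpha * grad_J(w_l1)   # L1 update rule
    w_l2 = w_l2 - 2 * alpha * lam * w_l2 - alpha * grad_J(w_l2)        # L2 update rule

print(f"unregularized optimum: 1.000, L1: {w_l1:.3f}, L2: {w_l2:.3f}")
# L1 lands a fixed offset (lambda/2 = 0.25) below the unregularized optimum, while
# L2 shrinks the optimum by a multiplicative factor (1 / (1 + lambda) ~ 0.67).
```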

Graphical treatment

  • We know that \(L_1\) regularization encourages sparse weights (many zero values), and that \(L_2\) regularization encourages small weight values, but why does this happen?

  • Let’s consider some cost function \(J(w_1,\dots,w_l)\), a function of weight matrices \(w_1,\dots,w_l\). Let’s define the following two regularized cost functions:

\[\begin{align} J_{L_1}(w_1,\dots,w_l) &= J(w_1,\dots,w_l) + \lambda\sum_{i=1}^l|w_i|\\ J_{L_2}(w_1,\dots,w_l) &= J(w_1,\dots,w_l) + \lambda\sum_{i=1}^l||w_i||^2 \end{align}\]
  • Now, let’s derive the update rules for \(L_1\) and \(L_2\) regularization based on their respective cost functions.
  • The update for \(w_i\) when using \(J_{L_1}\) is:
\[w_i^{k+1} = w_i^{k} - \underbrace{\alpha\lambda sign(w_i)}_{L_1 \text{ penalty}} - \alpha\frac{\partial J}{\partial w_i}\]
  • The update for \(w_i\) when using \(J_{L_2}\) is:
\[w_i^{k+1} = w_i^{k} - \underbrace{2\alpha\lambda w_i}_{L_2 \text{ penalty}}- \alpha\frac{\partial J}{\partial w_i}\]
  • The next question is: what do you notice that is different between these two update rules, and how does it affect optimization? What effect does the hyperparameter \(\lambda\) have?
  • The figure below shows a histogram of weight values for an unregularized network (red), an \(L_1\)-regularized network (blue, left), and an \(L_2\)-regularized network (blue, right). A small sketch that reproduces comparable statistics follows at the end of this section.

  • The different effects of \(L_1\) and \(L_2\) regularization on the optimal parameters are an artifact of the different ways in which they change the original loss landscape. In the case of two parameters (\(w_1\) and \(w_2\)), we can visualize this.

  • The figure below shows the landscape of a two parameter loss function with \(L_1\) regularization (left) and \(L_2\) regularization (right):
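
  • A small sketch along the same lines (with made-up data and hyperparameters) trains the same linear model with each penalty via plain (sub)gradient descent and summarizes how the learned weights concentrate around zero, mirroring the histogram comparison described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 30
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = [3.0, -2.0, 1.5, -1.0, 2.0]      # only 5 of the 30 features are informative
y = X @ true_w + rng.normal(size=n)

def train(penalty, lam=0.3, alpha=0.01, steps=2000):
    w = np.zeros(d)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / n                    # gradient of the MSE data loss
        reg = np.sign(w) if penalty == "l1" else 2 * w      # (sub)gradient of the penalty
        w -= alpha * (grad + lam * reg)
    return w

for name in ["l1", "l2"]:
    w = train(name)
    near_zero = int(np.sum(np.abs(w) < 0.01))
    print(f"{name}: {near_zero}/{d} weights within 0.01 of zero, "
          f"mean |w| over the 25 uninformative weights = {np.mean(np.abs(w[5:])):.4f}")
# Typically the L1 run parks most of the 25 uninformative weights (numerically) at zero,
# while the L2 run leaves them small but clearly non-zero.
```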

\(L_1\) vs. \(L_2\) regularization

  • \(L_2\) and \(L_1\) penalize weights differently:

    • \(L_2\) penalizes $$weight^2$$.
    • \(L_1\) penalizes $$|weight|$$.
  • Consequently, \(L_2\) and \(L_1\) have different derivatives:

    • The derivative of \(L_2\) is $$2 * weight$$.
    • The derivative of \(L_1\) is $$k$$ (a constant whose magnitude is independent of the weight’s value; only its sign depends on the sign of the weight).
  • You can think of the derivative of \(L_2\) as a force that removes $$x\%$$ of the weight every time. As Zeno knew, even if you remove $$x\%$$ of a number billions of times, the diminished number will still never quite reach zero. (Zeno was less familiar with floating-point precision limitations, which could possibly produce exactly zero.) At any rate, \(L_2\) does not normally drive weights to zero.
    • Simply put, for \(L_2\), the smaller the $$w$$, the smaller the penalty during the update of $$w$$ and vice-versa for larger $$w$$.

    \(L_2\) regularization thus results in a distribution of weights that is as uniform as possible over the entire weight matrix.

  • You can think of the derivative of \(L_1\) as a force that subtracts some constant from the weight every time, irrespective of the weight’s magnitude. However, because of the absolute value, the derivative of \(L_1\) is discontinuous at 0, and update results that cross 0 get zeroed out. For example, if the subtraction would have forced a weight from +0.1 to -0.2, \(L_1\) will set the weight to exactly 0. Eureka, \(L_1\) zeroed out the weight! (A soft-thresholding sketch of this behavior appears at the end of this section.)
    • Simply put, for \(L_1\), the magnitude of the penalty is independent of the value of $$w$$, but the direction of the penalty (positive or negative) depends on the sign of $$w$$.

    \(L_1\) regularization thus results in an effect called “feature selection” or “weight sparsity” since it makes the irrelevant weights 0.

  • \(L_1\) regularization – penalizing the absolute value of all the weights – turns out to be quite efficient for wide models.

  • Note that this description is true for a one-dimensional model.
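
  • The zero-crossing behavior described above is essentially what proximal (“soft-thresholding”) updates for \(L_1\) implement; a minimal sketch (the function names and values are illustrative):

```python
import numpy as np

def soft_threshold(w, t):
    # Shrink each weight toward zero by t; anything that would cross zero becomes exactly 0.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def l1_proximal_step(w, grad, alpha, lam):
    # Gradient step on the data loss, then apply the L1 shrinkage (ISTA-style update).
    return soft_threshold(w - alpha * grad, alpha * lam)

w = np.array([0.1, -0.8, 0.02])
print(soft_threshold(w, 0.15))   # -> [ 0.   -0.65  0.  ]: small weights snap to exactly zero
```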

Effect of \(L_1\) and \(L_2\) regularization on weights

  • \(L_1\) and \(L_2\) regularizations have a dramatic effect on the weights values during training.

  • For \(L_1\) regularization:
    • “Too small” $$\lambda$$ constant: there’s no apparent effect.
    • “Appropriate” $$\lambda$$ constant: many of the weights become zeros. This is called “sparsity of the weights”. Because the weight penalty is independent of the weight values, weights with value 0.001 are penalized just as much as weights with value 1000. The value of the penalty is $$\alpha \lambda$$ (generally very small). It constrains the subset of weights that are “less useful to the model” to be equal to 0. For example, you could effectively end up with 200 non-zero weights out of 1000, which means that 800 weights were less useful for learning the task.
    • “Too large” $$\lambda$$ constant: You can observe a plateau, which means that the weights uniformly take values around zero. In fact, because $$\lambda$$ is large, the penalty term $$-\alpha \lambda \, sign(w)$$ dominates the gradient term $$-\alpha \frac{\partial J_{cross-entropy}}{\partial w}$$. Thus, at every update, $$w$$ is pushed by $$\approx -\alpha\lambda \, sign(w)$$ in the opposite direction of its sign. For instance, if $$w$$ is around zero but slightly positive, it will be pushed towards $$-\alpha \lambda$$ when the penalty is applied. Hence, the plateau’s width should be about $$2 \alpha \lambda$$.
  • For \(L_2\) regularization:
    • “Too small” $$\lambda$$ constant: there’s no apparent effect.
    • “Appropriate” $$\lambda$$ constant: the weight values decrease following a centered distribution that becomes more and more peaked throughout training.
    • “Too large” $$\lambda$$ constant: All the weights rapidly collapse to zero, and the model clearly underfits because the weight values are too constrained.
  • Note that the weight sparsity effect caused by \(L_1\) regularization makes your model more compact in theory, and leads to storage-efficient compact models that are commonly used in smart mobile devices.
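
  • The three \(\lambda\) regimes above (“too small”, “appropriate”, “too large”) can be seen by sweeping the regularization strength. The sketch below is an illustrative example using scikit-learn’s Lasso on made-up synthetic data (its `alpha` argument plays the role of $$\lambda$$ here), reporting how many weights remain non-zero at each setting:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
true_w = np.zeros(40)
true_w[:8] = [3.0, -2.0, 1.5, -1.0, 2.0, -2.5, 1.2, -1.8]   # 8 informative features out of 40
y = X @ true_w + 0.5 * rng.normal(size=300)

for lam in [1e-4, 1e-2, 1e-1, 1.0, 10.0]:
    coef = Lasso(alpha=lam, max_iter=50000).fit(X, y).coef_
    print(f"lambda = {lam:<7}: non-zero weights = {int(np.sum(coef != 0)):>2} / 40")
# "Too small" lambda: nearly all 40 weights stay non-zero; an "appropriate" lambda keeps
# roughly the 8 informative weights; "too large" lambda drives everything to zero (underfit).
```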

How does weight decay help the model generalize?

  • Weight decay suppresses any irrelevant components of the weight vector by driving the optimization to find the smallest vector that solves the learning problem.
  • Weight decay attenuates the influence of outliers on the weight optimization by constraining the weight values. There’s less risk for the weights to learn the sampling error (for example, when a subset of your data points comes from a wrong distribution or is mislabelled). In other words, the output of the model is less affected by changes in the input.
  • \(L_1\) and \(L_2\) regularizations have a dramatic effect on the geometry of the cost function: adding regularization results in a more convex cost landscape and diminishes the chance of converging to a non-desired local minimum.

Dropout regularization

  • Although \(L_1\) and \(L_2\) regularization are simple techniques for reducing overfitting, there exist other methods, such as dropout regularization, that have been shown to be more effective at regularizing larger and more complex networks. If you had unlimited computational power, you could improve generalization by averaging the predictions of several different neural networks trained on the same task. The combination of these models will likely perform better than a single neural network trained on this task. However, with deep neural networks, training various architectures is expensive because:
    • Tuning the hyperparameters is time consuming.
    • Training the networks requires a lot of computations.
    • You need a large amount of data to train the models on different subsets of the data.
  • Dropout is a regularization technique, introduced in Srivastava et al. (2014), that allows you to combine many different architectures efficiently by randomly dropping some of the neurons of your network during training.
  • For more details, please refer to the Dropout primer.
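
  • For reference, here is a minimal sketch of the core dropout operation (“inverted” dropout, with an assumed keep probability; see the Dropout primer for the full treatment):

```python
import numpy as np

def dropout(a, keep_prob=0.8, training=True, seed=0):
    # Inverted dropout: during training, zero each activation with probability
    # 1 - keep_prob and scale the survivors by 1/keep_prob so the expected
    # activation is unchanged; at test time, use all neurons as-is.
    if not training:
        return a
    rng = np.random.default_rng(seed)
    mask = (rng.random(a.shape) < keep_prob) / keep_prob
    return a * mask

activations = np.ones((2, 5))
print(dropout(activations))   # roughly 20% of entries are zeroed; the rest become 1.25
```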

Choosing \(L_1\)-norm vs. \(L_2\)-norm as the regularizing function

  • In a typical setting, the \(L_2\)-norm is better at minimizing the prediction error than the \(L_1\)-norm. However, the \(L_1\)-norm is still used in practice despite the \(L_2\)-norm outperforming it on most tasks, primarily because the \(L_1\)-norm is capable of producing a sparser solution.

  • To understand why an \(L_1\)-norm produces sparser solutions over an \(L_2\)-norm during regularization, we just need to visualize the spaces for both the \(L_1\)-norm and \(L_2\)-norm as shown in the figure below (source).

  • In the diagram above, we can see that for the \(L_1\)-norm, the best way to reach the line from inside our diamond-shaped region is to maximize $$x_1$$ and leave $$x_2$$ at 0, whereas for the \(L_2\)-norm (reaching the line from inside our circular region), the solution is a combination of both $$x_1$$ and $$x_2$$. It is likely that the fit for \(L_2\) will be more precise. However, with the \(L_1\)-norm, our solution will be more sparse.

  • Since the \(L_2\)-norm penalizes larger errors more strongly, it will yield a solution which has fewer large residual values along with fewer very small residuals as well.

  • The \(L_1\)-norm, on the other hand, will give a solution with more large residuals, but it will also have a lot of zeros in the solution. Hence, we might want to use the \(L_1\)-norm when we have constraints on feature extraction: since it yields a solution in which the weights of a large set of features are set to 0, we can skip computing many expensive features at the cost of some accuracy. A use-case of \(L_1\) would be real-time detection or tracking of an object/face/material using a set of diverse handcrafted features with a large-margin classifier like an SVM in a sliding-window fashion, where you want feature computation to be as fast as possible.

  • Another way to interpret this is that the \(L_2\)-norm views all features similarly, since it treats them as “equidistant” (given its geometric representation as a circle), while the \(L_1\)-norm views different features differently, since it treats the “distance” between them differently (given its geometric representation as a diamond). Thus, if you are unsure of the kind of features available in your dataset and their relative importance, \(L_2\) regularization is the way to go. On the other hand, if you know that some features matter much more than others, use \(L_1\) regularization.

  • In summary,

    • Broadly speaking, \(L_1\) is more useful for answering “what?” (which features matter) and \(L_2\) for “how much?” (how strongly each feature contributes).
    • The \(L_1\) norm isn’t smooth, which is less about ignoring fine detail and more about generating sparse feature vectors. Sparsity is sometimes good, e.g., in high-dimensional classification problems.
    • The \(L_2\) norm is as smooth as your floats are precise. It captures energy and Euclidean distance, things you want when, for example, tracking features. It’s also more computation-heavy than the \(L_1\) norm.

How does adding a penalty term/component to the loss term prevent overfitting with regularization?

  • Adding a penalty term/component to the loss term for \(L_1\) (Lasso) and \(L_2\) (Ridge) regularization helps prevent overfitting by penalizing overly complex models, encouraging simpler ones that generalize better to unseen data. The penalty term in the loss function penalizes large weights, which discourages the model from relying too heavily on any one feature or fitting noise in the training data. By controlling model complexity, regularization encourages the model to focus on generalizable patterns rather than specific data points, making it less sensitive to the idiosyncrasies of the training data.
  • Here’s how this works in detail:

    1. Penalizing Large Weights:

      In models like neural networks or linear regressions, overfitting often involves assigning high weights to certain features, which makes the model sensitive to noise in the training data.

      • Regularization adds a term to the loss function (e.g., \(\lambda \sum_{i} w_i^2\) for \(L_2\) regularization) that penalizes large weights. By keeping weights smaller, the model learns a smoother mapping that’s less likely to fit noise.
    2. Balancing Fit with Complexity: The regularization term adds a trade-off in the loss function between fitting the training data well and keeping the model complexity in check. For instance, with a regularized loss function \(\text{Loss} = \text{MSE} + \lambda \sum_{i} w_i^2\), the regularization strength \(\lambda\) controls how much weight is given to reducing error versus limiting model complexity. A well-chosen \(\lambda\) helps the model to capture only the significant patterns, not the noise.

    3. Improved Generalization: By limiting the capacity of the model to learn very specific mappings, regularization helps the model to generalize better to new data. Overly flexible models tend to memorize training examples, while regularized models are forced to learn broader trends, which results in better performance on validation or test datasets.

    4. Types of Regularization Penalties:
      • \(L_1\) Regularization (Lasso): Adds an absolute-value penalty (e.g., \(\lambda \sum_{i} |w_i|\)), encouraging sparse models by pushing some weights to zero. This can also perform feature selection by removing less useful features.
      • \(L_2\) Regularization (Ridge): Penalizes the square of the weights, encouraging the model to spread weight values across many parameters rather than making any one parameter large. This results in smoother models.
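
  • The sketch below puts the pieces above together into a single regularized loss (the data, weights, and \(\lambda\) are arbitrary placeholders used purely for illustration):

```python
import numpy as np

def regularized_loss(w, X, y, lam, penalty="l2"):
    # Data-fit term (MSE) plus a complexity penalty, as in the trade-off described above.
    mse = np.mean((X @ w - y) ** 2)
    reg = np.sum(np.abs(w)) if penalty == "l1" else np.sum(w ** 2)
    return mse + lam * reg

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(100, 5)), rng.normal(size=100), rng.normal(size=5)
print(regularized_loss(w, X, y, lam=0.1, penalty="l1"),
      regularized_loss(w, X, y, lam=0.1, penalty="l2"))
```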

FAQ: Why does \(L_1\) regularization lead to feature selection or weight sparsity while \(L_2\) results in weight shrinkage (i.e., uniform distribution of weights)?

  • \(L_1\) and \(L_2\) regularization affect the weights of features differently, leading to distinct outcomes: feature selection in the case of \(L_1\) and uniform weight shrinkage in the case of \(L_2\). These differences arise from the mathematical and geometric properties of their respective penalty functions.
  • These distinctions can serve as a guideline to choose the appropriate regularization method based on the problem requirements, such as whether feature selection or weight smoothness is more important.

\(L_1\) Regularization: Feature Selection and Weight Sparsity

  • Penalty: The penalty term added to the loss function is proportional to the absolute value of the weights:

    \[\lambda \sum |w_i|\]
  • Effect on Weights: The absolute value function is non-differentiable at zero, creating a “sharp corner” in the optimization landscape. This sharpness allows gradient-based optimization methods (e.g., gradient descent) to drive some weights exactly to zero.

  • Geometric Interpretation: \(L_1\) regularization constrains the optimization problem within a diamond-shaped polytope (or more generally, an \(L_1\)-norm ball). The sharp vertices of this polytope align with the coordinate axes, making it more likely for the optimization process to find solutions where weights for some features are exactly zero. This sparsity is a natural consequence of the sharp corners in the constraint surface.

  • Feature Selection: By driving irrelevant or redundant feature weights to zero, \(L_1\) regularization effectively removes these features from the model. This makes \(L_1\) particularly useful in situations where feature selection is critical, such as high-dimensional datasets with many irrelevant or noisy features.

\(L_2\) Regularization: Weight Shrinkage (i.e., Uniform Distribution of Weights)

  • Penalty: The penalty term added to the loss function is proportional to the square of the weights:

    \[\lambda \sum w_i^2\]
  • Effect on Weights: The squared function is smooth and differentiable everywhere, including at zero. During optimization, this smoothness results in a uniform force that shrinks all weights toward zero. However, unlike \(L_1\), it does not eliminate any weight entirely.

  • Geometric Interpretation: \(L_2\) regularization constrains the optimization problem within a spherical region (or more generally, an \(L_2\)-norm ball) – a hypersphere in high dimensions, or a circle in two dimensions. The smooth, curved surface of this ball distributes the penalty evenly across all dimensions, and because larger weights are penalized more heavily than smaller ones, the model is encouraged to learn more balanced, generalizable patterns. In other words, all weights are reduced in proportion to their magnitude, rather than being selectively driven to zero.

  • Weight Shrinkage: The uniform shrinkage helps the model generalize better by reducing the influence of less important features while retaining all features in the model. This makes \(L_2\) regularization particularly effective when all features contribute meaningfully to the prediction, even if some are less significant than others.

Key Differences Between \(L_1\) and \(L_2\)

Mathematical Properties

  1. Penalty Function:
    • \(L_1\): Proportional to the absolute value of the weights (\(|w_i|\)).
    • \(L_2\): Proportional to the square of the weights (\(w_i^2\)).
  2. Behavior Around Zero:
    • \(L_1\): Sharp corners in the absolute value function create a discontinuity in its derivative at zero, favoring sparsity by driving weights exactly to zero.
    • \(L_2\): The squared function is smooth and differentiable everywhere, resulting in gradual weight reduction without setting weights to zero.

Geometric Interpretation

  1. \(L_1\): The diamond-shaped constraint region (from the \(L_1\)-norm ball) promotes sparse solutions by aligning weights with the coordinate axes, naturally driving some weights to zero.
  2. \(L_2\): The spherical constraint region (from the \(L_2\)-norm ball) distributes the penalty evenly across all dimensions, resulting in proportional weight reduction and smoother, more uniform weight distributions.
  • Both methods mitigate overfitting but suit different needs: \(L_1\) for sparse models and feature selection, and \(L_2\) for smoother, more distributed weight adjustments.

Conclusion

  • When applying regularization methods, you need a metric to track your model’s improvement and generalization ability. The bias/variance tradeoff enables you to measure the efficiency of your regularization.
  • Successfully training a model on complex tasks is complicated. You need to find a model architecture that can encompass the complexity of the dataset. Once you find such an architecture, you can work on improving generalization. Exploring, and even combining, different regularization techniques is an essential step of the training process. It helps you build intuition about a model’s ability to generalize in the real world.

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledRegularization,
  title   = {Regularization},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}