Regularization

  • In order to avoid overfitting the training set (and thus improve generalization), you can try to reduce the complexity of the model by removing layers, and consequently decreasing the number of parameters. As shown in “A Simple Weight Decay Can Improve Generalization” by Krogh and Hertz (1992), another way to constrain a network and lower its complexity is to:

Limit the growth of the weights through some kind of weight decay.

  • You want to prevent the weights from growing too large, unless it is really necessary. Intuitively, you are reducing the set of potential networks to choose from.

Weight sparsity and why it matters

  • Sparse vectors often contain many dimensions. Creating a feature cross results in even more dimensions. Given such high-dimensional feature vectors, model size may become huge and require huge amounts of RAM.

  • In a high-dimensional sparse vector, it would be nice to encourage weights to drop to exactly \(0\) where possible. A weight of exactly \(0\) essentially removes the corresponding feature from the model. Zeroing out features will save RAM and may reduce noise in the model.

  • For example, consider a housing dataset that covers the entire globe. Bucketing global latitude at the minute level (\(60\) minutes per degree) gives about \(10,000\) dimensions in a sparse encoding; global longitude at the minute level gives about \(20,000\) dimensions. A feature cross of these two features would result in roughly \(200,000,000\) dimensions. Many of those \(200,000,000\) dimensions represent areas of such limited residence (for example, the middle of the ocean) that it would be difficult to use that data to generalize effectively. It would be silly to pay the RAM cost of storing these unneeded dimensions. Therefore, it would be nice to encourage the weights for the meaningless dimensions to drop to exactly 0, which would allow us to avoid paying for the storage cost of these model coefficients at inference time.
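To make the dimension counts concrete, here is a quick back-of-the-envelope calculation in Python (the figures quoted above are rounded versions of these exact bucket counts):

```python
# Dimensionality of a minute-level latitude x longitude feature cross.
lat_buckets = 180 * 60   # 180 degrees of latitude, 60 minutes per degree -> 10,800 (~10,000)
lon_buckets = 360 * 60   # 360 degrees of longitude, 60 minutes per degree -> 21,600 (~20,000)

cross_dims = lat_buckets * lon_buckets
print(f"{cross_dims:,}")  # 233,280,000 -> roughly 200,000,000 dimensions
```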

  • We can encode this idea into the optimization problem solved at training time by adding an appropriately chosen regularization term.

  • Would L2 regularization accomplish this task? Unfortunately not. L2 regularization encourages weights to be small, but doesn’t force them to exactly \(0.0\).

  • An alternative idea would be to try to create a regularization term that penalizes the count of non-zero coefficient values in a model. Increasing this count would only be justified if there was a sufficient gain in the model’s ability to fit the data. Unfortunately, while this count-based approach is intuitively appealing, it would turn our convex optimization problem into a non-convex optimization problem. So this idea, known as L0 regularization, isn’t something we can use effectively in practice.

  • However, there is a regularization term called L1 regularization that serves as an approximation to L0, but has the advantage of being convex and thus efficient to compute. So we can use L1 regularization to encourage many of the uninformative coefficients in our model to be exactly \(0\), and thus reap RAM savings at inference time.
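As a quick illustration of this difference, here is a minimal sketch using scikit-learn’s Lasso (L1-penalized) and Ridge (L2-penalized) linear regression on synthetic data; the dataset and the regularization strengths are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic regression problem: 100 features, only 10 of which are informative.
n_samples, n_features = 200, 100
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:10] = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1-penalized linear regression
ridge = Ridge(alpha=0.1).fit(X, y)   # L2-penalized linear regression

# L1 drives most of the 90 uninformative coefficients to exactly 0;
# L2 merely makes them small, so the count below is typically 0 for Ridge.
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```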

How does regularization work?

  • L1 and L2 regularization can be achieved by simply adding the corresponding norm term that penalizes large weights to the cost function. If you were training a network to minimize the cost \(J_{cross-entropy} = -\frac{1}{m} \sum\limits_{i = 1}^m y^{(i)} \log(\hat{y}^{(i)})\) (where \(\hat{y}^{(i)}\) is the model’s prediction for the \(i^{th}\) example) with (let’s say) gradient descent, your weight update rule would be:
\[w = w − \alpha \frac{\partial J_{cross−entropy}}{\partial w}\]
  • Instead, you would now train a network to minimize the cost

    \[J_{regularized} = J_{cross−entropy} + \lambda J_{L1\,or\,L2}\]
    • where,
      • \(\lambda\) is the regularization strength, which is a hyperparameter.
      • \(J_{L1} = \sum\limits_{\text{all weights }w_k} | w_k |\).
      • \(J_{L2} = || w ||^2 = \sum\limits_{\text{all weights }w_k} | w_k |^2\) (here, \(|| \cdot ||^2\) is the L2 norm for vectors and the Frobenius norm for matrices).
  • Your modified weight update rule would thus be:

\[w = w - \alpha \frac{\partial J_{regularized}}{\partial w} = w - \alpha \left(\frac{\partial J_{cross-entropy}}{\partial w} + \lambda \frac{\partial J_{L1\,or\,L2}}{\partial w}\right)\]
  • For L1 regularization, this would lead to the update rule:
\[w = w − \alpha \lambda sign(w) − \alpha \frac{\partial J_{cross−entropy}}{\partial w}\]
  • For L2 regularization, this would lead to the update rule:

    \[w = w − 2 \alpha \lambda w − \alpha \frac{\partial J_{cross−entropy}}{\partial w}\]
    • where,
      • \(\alpha \lambda sign(w)\) is the L1 penalty.
      • \(2 \alpha \lambda w\) is the L2 penalty.
      • \(\alpha \frac{\partial J_{cross-entropy}}{\partial w}\) is the usual gradient descent term (the gradient of the cross-entropy cost).
  • At every step of L1 and L2 regularization, the penalty pushes the weight toward a slightly smaller magnitude; for L2, this is equivalent to first scaling the weight by \(1 - 2\alpha\lambda\), a factor slightly below \(1\) since \(2\alpha\lambda \ll 1\). This shrinkage is what causes weight decay.

  • As seen above, the update rules for L1 and L2 regularization are different. While the L2 “weight decay” penalty \(2\alpha\lambda w\) is proportional to the value of the weight being updated, the L1 “weight decay” penalty \(\alpha\lambda\,sign(w)\) is not; the sketch below shows both update rules in code.
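Here is a minimal NumPy sketch of the two update rules; `grad_cross_entropy` stands in for the data-dependent gradient \(\frac{\partial J_{cross-entropy}}{\partial w}\), which is assumed to be computed elsewhere (e.g., by backpropagation):

```python
import numpy as np

def sgd_step(w, grad_cross_entropy, alpha, lam, reg="l2"):
    """One gradient-descent step with an L1 or L2 penalty.

    w                  : weight vector (np.ndarray)
    grad_cross_entropy : dJ_cross-entropy/dw, computed from the data
    alpha              : learning rate
    lam                : regularization strength (lambda)
    """
    if reg == "l1":
        # w = w - alpha*lam*sign(w) - alpha * dJ/dw
        return w - alpha * lam * np.sign(w) - alpha * grad_cross_entropy
    if reg == "l2":
        # w = w - 2*alpha*lam*w - alpha * dJ/dw
        return w - 2 * alpha * lam * w - alpha * grad_cross_entropy
    raise ValueError(f"unknown regularizer: {reg}")
```

Note that the L2 step can equivalently be written as first scaling the weight by \(1 - 2\alpha\lambda\) and then taking the plain gradient step, which is exactly the classical “weight decay” formulation.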

L1 vs. L2 regularization

  • L2 and L1 penalize weights differently:

    • L2 penalizes \(weight^2\).
    • L1 penalizes \(|weight|\).
  • Consequently, L2 and L1 have different derivatives:

    • The derivative of L2 is \(2 * weight\).
    • The derivative of L1 is \(k\) (a constant whose magnitude is independent of the weight’s value, but whose sign matches the sign of the weight).
  • You can think of the derivative of L2 as a force that removes \(x\%\) of the weight every time. As Zeno knew, even if you remove \(x\%\) of a number billions of times, the diminished number will still never quite reach zero. (Zeno was less familiar with floating-point precision limitations, which could possibly produce exactly zero.) At any rate, L2 does not normally drive weights to zero.
    • Simply put, for L2, the smaller the \(w\), the smaller the penalty applied during the update of \(w\); the larger the \(w\), the larger the penalty.

    L2 regularization thus results in a distribution of weights that is as uniform as possible over the entire weight matrix.

  • You can think of the derivative of L1 as a force that subtracts some constant from the weight every time, irrespective of the weight’s magnitude. However, thanks to absolute values, L1 has a discontinuity at \(0\), which causes subtraction results that cross \(0\) to become zeroed out. For example, if subtraction would have forced a weight from \(+0.1\) to \(-0.2\), L1 will set the weight to exactly \(0\). Eureka, L1 zeroed out the weight.
    • Simply put, for L1, the magnitude of the penalty is independent of the value of \(w\), but the direction of the penalty (positive or negative) depends on the sign of \(w\).

    L1 regularization thus results in an effect called “feature selection” or “weight sparsity” since it makes the non-relevant weights \(0\).

  • L1 regularization – penalizing the absolute value of all the weights – turns out to be quite efficient for wide models.

  • Note that this description is true for a one-dimensional model; the numerical sketch below shows the two behaviors side by side.
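The contrast is easy to verify numerically. The sketch below applies only the penalty terms (no data gradient) to a single weight; the explicit clamp to \(0\) mimics the “subtraction results that cross \(0\) become zeroed out” behavior described above (\(\alpha\) and \(\lambda\) are illustrative):

```python
import numpy as np

alpha, lam = 0.1, 0.5
w_l1 = w_l2 = 0.3

for _ in range(100):
    # L2: multiplicative shrink -- proportional to the weight, never exactly 0.
    w_l2 = w_l2 * (1 - 2 * alpha * lam)

    # L1: subtract a constant; clamp to 0 if the update would cross zero.
    step_size = alpha * lam
    w_l1 = 0.0 if abs(w_l1) <= step_size else w_l1 - step_size * np.sign(w_l1)

print(f"after 100 penalty-only steps: L2 weight = {w_l2:.2e}, L1 weight = {w_l1}")
# L2 weight is tiny but non-zero (Zeno); L1 weight is exactly 0.0.
```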

Effect of L1 and L2 regularization on weights

  • L1 and L2 regularization have a dramatic effect on the weight values during training.

  • For L1 regularization:
    • “Too small” \(\lambda\) constant: there’s no apparent effect.
    • “Appropriate” \(\lambda\) constant: many of the weights become zeros. This is called “sparsity of the weights”. Because the weight penalty is independent of the weight values, weights with value \(0.001\) are penalized just as much as weights with value \(1000\). The value of the penalty is \(\alpha \lambda\) (generally very small). It constrains the subset of weights that are “less useful to the model” to be equal to \(0\). For example, you could effectively end up with \(200\) non-zero weights out of \(1000\), which means that \(800\) weights were less useful for learning the task.
    • “Too large” \(\lambda\) constant: you can observe a plateau, which means that the weights uniformly take values around zero. In fact, because \(\lambda\) is large, the penalty \(-\alpha \lambda\,sign(w)\) dominates the gradient term \(-\alpha \frac{\partial J_{cross-entropy}}{\partial w}\). Thus, at every update, \(w\) is pushed by \(\approx -\alpha\lambda\,sign(w)\) in the opposite direction of its sign. For instance, if \(w\) is around zero, but slightly positive, then it will be pushed towards \(-\alpha \lambda\) when the penalty is applied. Hence, the plateau’s width should be \(2 \times \alpha \lambda\) (see the simulation after this list).
  • For L2 regularization:
    • “Too small” \(\lambda\) constant: there’s no apparent effect.
    • “Appropriate” \(\lambda\) constant: the weight values decrease following a centered distribution that becomes more and more peaked throughout training.
    • “Too large” \(\lambda\) constant: all the weights rapidly collapse to zero, and the model obviously underfits.
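These regimes can be reproduced with a toy simulation. The sketch below runs the L1 update rule from earlier on a quadratic loss \(\frac{1}{2}||w - w^*||^2\) (so the data gradient is simply \(w - w^*\)), with \(800\) out of \(1000\) target weights chosen to be nearly useless; the \(\lambda\) values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: quadratic loss 0.5 * ||w - w_star||^2, so dJ/dw = (w - w_star).
# Only 200 of the 1,000 target weights are meaningfully non-zero.
n = 1000
w_star = np.zeros(n)
w_star[:200] = rng.normal(loc=0.0, scale=1.0, size=200)    # "useful" weights
w_star[200:] = rng.normal(loc=0.0, scale=0.01, size=800)   # "less useful" weights

alpha = 0.1
for lam in (1e-4, 1e-1, 10.0):  # "too small" / "appropriate" / "too large" (illustrative)
    w = rng.normal(size=n)
    for _ in range(1000):
        grad = w - w_star                                  # stand-in for the data gradient
        w = w - alpha * lam * np.sign(w) - alpha * grad    # L1 update rule from above
        w[np.abs(w) < alpha * lam] = 0.0                   # zero-crossing clamp
    print(f"lambda={lam:g}: {np.mean(w == 0.0):.0%} of weights exactly zero")
```

With these settings you should see roughly no zeros for the smallest \(\lambda\), on the order of \(80\%\) zeros (i.e., about \(200\) surviving weights) for the middle \(\lambda\), and essentially all weights zeroed out, hence underfitting, for the largest \(\lambda\).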

References

  • Krogh, A. and Hertz, J. A. (1992). “A Simple Weight Decay Can Improve Generalization.” In Advances in Neural Information Processing Systems 4 (NIPS 1991).

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledRegularization,
  title   = {Regularization},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}