CS231n • L1 vs. L2 Norm
When would you choose the L1 norm vs. the L2 norm?

In a typical setting, the L2 norm is better at minimizing the prediction error than the L1 norm. However, the L1 norm is still used despite the L2 norm outperforming it on almost every task, primarily because the L1 norm is capable of producing a sparser solution.

To understand why the L1 norm produces sparser solutions than the L2 norm during regularization, we just need to visualize the constraint regions of both norms.

In the diagram above, to reach the line from inside the diamond-shaped L1 region, it is best to maximize \(x_1\) and leave \(x_2\) at \(0\), whereas to reach the line from inside the circular L2 region, the solution is a combination of both \(x_1\) and \(x_2\). The L2 fit will likely be more precise; with the L1 norm, however, our solution will be more sparse.
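This geometric picture can be checked numerically. The minimal sketch below (using NumPy; the vectors are made up for illustration) compares two weight vectors of equal L2 norm: the axis-aligned, sparse one is cheaper under the L1 norm, which is why an L1 penalty pushes solutions toward the corners of the diamond.

```python
import numpy as np

# Two weight vectors with the same L2 norm (both lie on the unit circle):
# one sparse (axis-aligned), one dense (spread over both coordinates).
sparse = np.array([1.0, 0.0])
dense = np.array([1.0, 1.0]) / np.sqrt(2)

# Both are equally "large" under the L2 norm...
print(np.linalg.norm(sparse, 2), np.linalg.norm(dense, 2))  # both 1.0

# ...but the L1 norm charges the dense vector more (sqrt(2) vs. 1),
# so an L1-regularized objective prefers the sparse, axis-aligned one.
print(np.linalg.norm(sparse, 1), np.linalg.norm(dense, 1))
```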

Since the L2 norm penalizes larger errors more strongly, it yields a solution with fewer large residuals, but also with fewer very small residuals.
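A quick way to see this asymmetry is the one-dimensional case of fitting a single constant \(c\) to data: the L2 loss \(\sum_i (x_i - c)^2\) is minimized by the mean, which chases outliers to avoid any large residual, while the L1 loss \(\sum_i |x_i - c|\) is minimized by the median, which tolerates one large residual. A minimal sketch with made-up data:

```python
import numpy as np

# Fitting a single constant c to data: the L2 loss sum((x - c)^2) is
# minimized by the mean, the L1 loss sum(|x - c|) by the median.
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one large outlier

c_l2 = np.mean(data)    # pulled toward the outlier: 22.0
c_l1 = np.median(data)  # insensitive to the outlier's magnitude: 3.0

print(c_l2, c_l1)
```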

The L1 norm, on the other hand, gives a solution with more large residuals, but also with a lot of exact zeros. Hence, we might want to use the L1 norm when we have constraints on feature extraction: since the L1 norm gives us a solution in which the weights for a large set of features are set to \(0\), we can avoid computing many expensive features at the cost of some accuracy. A use case for L1 would be real-time detection or tracking of an object/face/material using a set of diverse handcrafted features with a large-margin classifier such as an SVM in a sliding-window fashion, where you would want feature computation to be as fast as possible.
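To illustrate this sparsity, here is a minimal sketch (the synthetic data, \(\lambda\), and iteration count are all made up for illustration) comparing ridge (L2-penalized) regression, solved in closed form, against lasso (L1-penalized) regression solved with proximal gradient descent (ISTA), whose soft-thresholding step sets small weights exactly to zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: only 3 of 20 features truly matter.
n, d = 100, 20
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 5.0  # regularization strength (illustrative value)

# Ridge (L2 penalty): closed form; shrinks weights toward zero but
# almost never makes any of them exactly zero.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Lasso (L1 penalty): proximal gradient descent (ISTA). The
# soft-thresholding proximal step zeroes out small weights exactly.
step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
w_lasso = np.zeros(d)
for _ in range(2000):
    grad = X.T @ (X @ w_lasso - y)
    z = w_lasso - step * grad
    w_lasso = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

print("ridge zero weights:", np.sum(np.abs(w_ridge) < 1e-8))
print("lasso zero weights:", np.sum(np.abs(w_lasso) < 1e-8))
```

In a run like this, the lasso solution zeroes out most of the 17 irrelevant weights while keeping the 3 informative ones, whereas the ridge solution keeps all 20 weights small but nonzero.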

Another way to interpret this is that the L2 norm views all features similarly, since it assumes them to be “equidistant” (given its geometric representation as a circle), while the L1 norm views different features differently, since it treats the “distance” between them differently (given its geometric representation as a diamond). Thus, if you are unsure of the kind of features in your dataset and their relative importance, L2 regularization is the way to go. On the other hand, if you know that some features matter much more than others, use L1 regularization.

In summary,
- Broadly speaking, L1 is more useful for “what?” and L2 for “how much?”.
- The L2 norm is as smooth as your floats are precise. It captures energy and Euclidean distance, which is what you want when, e.g., tracking features. It is also computationally heavier than the L1 norm.
- The L1 norm isn’t smooth, which is less about ignoring fine detail and more about generating sparse feature vectors. Sparsity is sometimes desirable, e.g., in high-dimensional classification problems.
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledNorm,
  title   = {L1 vs. L2 Norm},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}