Primers • Loss Functions
 Overview
 Tasks
 Output Activation Functions
 Classification Loss Functions
 Regression Loss Functions
 Further Reading
 References
 Citation
Overview
 Loss functions (also called cost or error functions) compute the distance between the current output of the algorithm and the expected output. They are a way to evaluate how well your algorithm models the data. Put informally, they are a window into your model’s heart.
 We will first look at typical modeling tasks and then go over common loss functions and their respective use cases. For ease of understanding, the loss functions have been grouped into classification and regression.
Tasks
 Let’s consider a neural network setup with a CNN model, an activation function (softmax or sigmoid), and a cross-entropy loss:
Binary classification
 In machine learning, binary classification is a supervised learning task that categorizes new observations into one of two classes. The model has a single output (which is fed as input to the sigmoid function) in the range \([0, 1]\). If the output is \(> 0.5\), the sample is assigned to class 1 (the positive class); otherwise, to class 0 (the negative class).
 Typical binary classification problems include:
 Medical testing to determine if a patient has a certain disease or not;
 Quality control in industry, deciding whether a specification has been met;
 In information retrieval, deciding whether a page should be in the result set of a search or not.
Multi-Class Classification
 One-of-many classification. Each sample can belong to only one of \(C\) classes. The model has \(C\) output neurons that can be gathered in a scores vector \(s\) (which is fed as input to the softmax function). The target (ground truth) vector \(t\) will be a one-hot vector with one positive (1) class and \(C-1\) negative (0) classes.
 This task is treated as a single classification problem of samples in one of \(C\) classes.
Multi-Label Classification
 Each sample can belong to more than one class. The model has \(C\) output neurons (similar to multi-class classification). The target vector \(t\) can have more than one positive class, so it will be a multi-hot vector of 0s and 1s with dimensionality \(C\) (as opposed to a one-hot vector in the multi-class case). This task is treated as \(C\) independent binary classification problems (\(C' = 2\), \(t' = 0\) or \(t' = 1\)), where each output neuron decides if a sample belongs to a class or not.
Output Activation Functions
 These functions are transformations we apply to vectors coming out from the model before the loss computation.
Sigmoid
 The Sigmoid function is used for binary classification. It squashes a vector in the range (0, 1). It is applied independently to each element of \(s\). It is also called the logistic function (since it is used in logistic regression for binary classification).
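As a quick sketch (assuming NumPy purely for illustration; the function itself is framework-agnostic), the sigmoid can be computed as:

```python
import numpy as np

def sigmoid(s):
    # Squashes each element of s independently into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-s))
```

For large positive inputs the output approaches 1, and for large negative inputs it approaches 0.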
Softmax
 The Softmax function is a generalization of the sigmoid function for multi-class classification. In other words, use sigmoid for binary classification and softmax for multi-class classification. Softmax is a function, not a loss. It squashes a vector into the range (0, 1) such that the resulting elements sum up to 1. It is applied to the output scores \(s\). Since the elements each represent a class, they can be interpreted as class probabilities.

The Softmax function cannot be applied independently to each \(s_i\), since it depends on all elements of \(\boldsymbol{s}\). For a given class \(s_i\), the Softmax function can be computed as:
\[f(s)_i=\frac{e^{s_i}}{\sum_j^C e^{s_j}}\] where \(s_j\) are the scores inferred by the net for each class in \(C\). Note that the Softmax activation for a class \(s_i\) depends on all the scores in \(s\).
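The formula above translates to a few lines of code (a minimal NumPy sketch; subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax(s):
    # Shift by the max for numerical stability; the result is unchanged.
    e = np.exp(s - np.max(s))
    # Normalize so the outputs sum to 1 and can be read as class probabilities.
    return e / e.sum()
```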
 Activation functions are used to transform vectors before computing the loss in the training phase. In testing, when the loss is no longer applied, activation functions are also used to get the CNN outputs.
Classification Loss Functions
Cross Entropy / Negative Log Likelihood
Binary classification
 Cross-entropy loss, or (negative) log loss, measures the performance of a classification model whose output is a probability value between 0 and 1.
 Cross-entropy loss increases as the predicted probability moves further away from the actual label. A perfect model would have a loss of 0 because the predicted value would match the actual value.
 Let’s look at the formula for crossentropy loss:

First we look at binary classification where the number of classes \(M\) equals 2:
\[\text{CrossEntropyLoss}=-\big(y \log(p) + (1-y)\log(1-p)\big)\]
 Note that some literature in the field denotes the prediction as \(\hat{y}\), in which case the same equation becomes:
\[\text{CrossEntropyLoss}=-\big(y \log(\hat{y}) + (1-y)\log(1-\hat{y})\big)\]
 Below we see the formula for when our number of classes \(M\) is greater than 2:
\[-\sum_{c=1}^{M} y_{o, c} \log \left(p_{o, c}\right)\]
 Note the variables and their meanings:
 \(M\): The number of classes or output we want to predict (Red, Black, Blue)
 \(y\): 0 or 1, binary indicator if the class \(c\) is the correct classification for observation \(o\)
 \(p\): predicted probability that observation \(o\) is of class \(c\)
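Putting the binary formula into code, here is a minimal NumPy sketch (the clipping constant is an illustrative choice to avoid \(\log(0)\)):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0).
    p = np.clip(p, eps, 1.0 - eps)
    # -(y log(p) + (1 - y) log(1 - p)), averaged over samples.
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

A confident correct prediction (e.g., \(y = 1\), \(p \approx 1\)) yields a small loss, while a confident wrong one yields a large loss.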
Multi-class Classification / Categorical Cross-Entropy Loss
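As a minimal sketch (assuming NumPy and one-hot targets; names are illustrative), categorical cross-entropy averages \(-\sum_c y_c \log(p_c)\) over observations:

```python
import numpy as np

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    # Clip to avoid log(0); each row of p holds per-class probabilities.
    p = np.clip(p, eps, 1.0)
    # Only the true class (where y_onehot is 1) contributes per observation.
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))
```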
Kullback–Leibler (KL) Divergence
 The Kullback–Leibler divergence, denoted \(D_{\text{KL}}(P \parallel Q)\), is a type of statistical distance: a measure of how one probability distribution \(P\) differs from a second, reference probability distribution \(Q\).
 A simple interpretation of the KL divergence of \(P\) from \(Q\) is the expected excess surprise from using \(Q\) as a model when the actual distribution is \(P\).

Note that KL divergence is commonly used as a difference (loss) and not a metric since it is not symmetric in the two distributions, i.e., \(D_{\mathrm{KL}}(P \parallel Q) \neq D_{\mathrm{KL}}(Q \parallel P)\).

For discrete probability distributions \(P\) and \(Q\) defined on the same probability space \(\mathcal{X}\), the relative entropy from \(Q\) to \(P\) is defined to be:
\[D_{\mathrm{KL}}(P \parallel Q)=\sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right),\] which is equivalent to
\[-\sum_{x \in \mathcal{X}} P(x) \log \left(\frac{Q(x)}{P(x)}\right).\]
 In other words, it is the expectation of the logarithmic difference between the probabilities \(P\) and \(Q\), where the expectation is taken using the probabilities \(P\).
 Kullback–Leibler Divergence Explained offers a walkthrough of KL divergence using an example.
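For discrete distributions stored as arrays, the definition translates directly (a NumPy sketch assuming both distributions are strictly positive, so there is no division by zero):

```python
import numpy as np

def kl_divergence(p, q):
    # Sum of p(x) * log(p(x) / q(x)); assumes p and q have full support.
    return np.sum(p * np.log(p / q))
```

Note the asymmetry: swapping the arguments generally changes the value, which is why KL divergence is not a metric.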
KL Divergence vs. Cross-Entropy Loss

Explanation 1:

You will need some conditions to claim the equivalence between minimizing cross entropy and minimizing KL divergence. Let us put this in the context of classification problems using cross entropy as the loss function.

Let us first recall that entropy is used to measure the uncertainty of a system, which is defined as:
\[S(v)=-\sum_i p\left(v_i\right) \log p\left(v_i\right)\] where \(p\left(v_i\right)\) are the probabilities of the different states \(v_i\) of the system. From an information theory point of view, \(S(v)\) is the amount of information needed to remove the uncertainty.
 For instance, the event \(I\), “I will die within 200 years”, is almost certain (we may solve the aging problem for the word “almost”), therefore it has low uncertainty and requires only the single piece of information “the aging problem cannot be solved” to make it certain. However, the event \(II\), “I will die within 50 years”, is more uncertain than event \(I\), and thus needs more information to remove its uncertainty. Here, entropy quantifies the uncertainty of the distribution “When will I die?”, which can be regarded as the expectation of the uncertainties of individual events like \(I\) and \(II\).

Now look at the definition of the KL divergence between distributions \(A\) and \(B\):
\[D_{KL}(A \parallel B)=\sum_i p_A\left(v_i\right) \log p_A\left(v_i\right)-p_A\left(v_i\right) \log p_B\left(v_i\right)\]
where the first term of the right hand side is the (negative) entropy of distribution \(A\), and the second term can be interpreted as the expectation of distribution \(B\) in terms of \(A\). \(D_{KL}\) describes how different \(B\) is from \(A\) from the perspective of \(A\). It is worth noting that \(A\) usually stands for the data, i.e., the measured distribution, and \(B\) is the theoretical or hypothetical distribution. That means, you always start from what you observed.

To relate cross entropy to entropy and KL divergence, we formalize the cross entropy in terms of distributions \(A\) and \(B\) as:
\[H(A, B)=-\sum_i p_A\left(v_i\right) \log p_B\left(v_i\right)\]
From the definitions, we can easily see:
\[H(A, B)=D_{KL}(A \parallel B)+S_A\]
If \(S_A\) is a constant, then minimizing \(H(A, B)\) is equivalent to minimizing \(D_{KL}(A \parallel B)\).

 A further question follows naturally: how can the entropy be a constant? In a machine learning task, we start with a dataset (denoted \(P(\mathcal{D})\)) which represents the problem to be solved, and the learning purpose is to make the model's estimated distribution (denoted \(P(model)\)) as close as possible to the true distribution of the problem (denoted \(P(truth)\)). \(P(truth)\) is unknown and is represented by \(P(\mathcal{D})\). Therefore, in an ideal world, we expect
\[P(\text{model}) \approx P(\mathcal{D}) \approx P(\text{truth})\] and minimize \(D_{KL}(P(\mathcal{D}) \parallel P(\text{model}))\). Luckily, in practice \(\mathcal{D}\) is given, which means its entropy \(S(\mathcal{D})\) is fixed as a constant.


Explanation 2:

Considering models usually work with samples packed in minibatches, for KL divergence and cross-entropy, their relation can be written as:
\[H(p, q)=D_{KL}(p \parallel q)+H(p)=-\sum_i p_i \log \left(q_i\right)\]
which gives:
\[D_{KL}(p \parallel q)=H(p, q)-H(p)\]
 From the equation, we can see that KL divergence decomposes into the cross-entropy of \(p\) and \(q\) (\(H(p, q)\), the first part), minus the entropy of the ground truth \(p\) (\(H(p)\), the second part).
 In many machine learning projects, minibatching is used to expedite training, and the \(p^{\prime}\) of a minibatch may differ from the global \(p\). In such a case, cross-entropy is relatively more robust in practice, while KL divergence needs a more stable \(H(p)\) to do its job.

Hinge Loss / Multiclass SVM Loss
 The hinge loss is used for “maximummargin” classification, most notably for support vector machines (SVMs).
 The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it.
 For an intended output \(t = \pm1\) and a classifier score \(y\), the hinge loss of the prediction \(y\) is defined as:
\[\ell(y)=\max (0, 1-t \cdot y)\]
 The hinge loss is a specific type of cost function that incorporates a margin or distance from the classification boundary into the cost calculation.
 Even if new observations are classified correctly, they can incur a penalty if the margin from the decision boundary is not large enough. The hinge loss increases linearly.
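A minimal sketch of the hinge loss for a single prediction (pure Python; \(t \in \{-1, +1\}\) is the label and \(y\) is the raw score):

```python
def hinge_loss(t, y):
    # Zero loss only when the score is on the correct side with margin >= 1.
    return max(0.0, 1.0 - t * y)
```

A correctly classified point with margin at least 1 incurs zero loss; anything closer to (or past) the decision boundary is penalized linearly.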
Focal Loss
 Proposed in Focal Loss for Dense Object Detection by Lin et al. in 2017.
 One of the most common choices when training deep neural networks for object detection and classification problems in general.
 Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples. It is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases.
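A sketch of the binary variant (assuming NumPy; the focusing parameter \(\gamma = 2\) is a common setting, and the clipping constant is an illustrative choice):

```python
import numpy as np

def focal_loss(y, p, gamma=2.0, eps=1e-12):
    # p_t is the model's probability for the true class.
    p_t = np.where(y == 1, p, 1.0 - p)
    p_t = np.clip(p_t, eps, 1.0 - eps)
    # (1 - p_t)^gamma decays to zero as confidence in the correct class grows.
    return -np.mean((1.0 - p_t) ** gamma * np.log(p_t))
```

With \(\gamma = 0\) this reduces to plain cross-entropy; larger \(\gamma\) down-weights easy examples more aggressively.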
PolyLoss
 Proposed in PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions by Leng et al. in 2022.
 Cross-entropy loss and focal loss are the most common choices when training deep neural networks for classification problems.
 Generally speaking, however, a good loss function can take on much more flexible forms, and should be tailored for different tasks and datasets.
 PolyLoss is a generalized form of Cross Entropy loss.
 The paper proposes a framework to view and design loss functions as a linear combination of polynomial functions, motivated by how functions can be approximated via Taylor expansion. Under polynomial expansion, focal loss is a horizontal shift of the polynomial coefficients compared to the cross-entropy loss.
 Motivated by this new insight, they explore an alternative dimension, i.e., vertically modify the polynomial coefficients.
Generalized EndtoEnd Loss
 Proposed in Generalized EndtoEnd Loss for Speaker Verification by Wan et al. in ICASSP 2018.
 GE2E makes the training of speaker verification models more efficient than the previous tuple-based end-to-end (TE2E) loss function.
 Unlike TE2E, the GE2E loss function updates the network in a way that emphasizes examples that are difficult to verify at each step of the training process.
 Additionally, the GE2E loss does not require an initial stage of example selection.
Additive Angular Margin Loss
 Proposed in ArcFace: Additive Angular Margin Loss for Deep Face Recognition by Deng et al. in 2018.
 AAM has been predominantly utilized for face recognition but has recently found applications in other areas such as speaker verification.
 One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for largescale face recognition is the design of appropriate loss functions that enhance discriminative power.
 Centre loss penalises the distance between the deep features and their corresponding class centres in Euclidean space to achieve intra-class compactness.
 SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in an angular space and penalises the angles between the deep features and their corresponding weights in a multiplicative way.
 Recently, a popular line of research is to incorporate margins in wellestablished loss functions in order to maximise face class separability.
 Additive Angular Margin (AAM) Loss (ArcFace) obtains highly discriminative features with a clear geometric interpretation (better than other loss functions) due to the exact correspondence to the geodesic distance on the hypersphere.
 ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. The authors released all refined training data, training code, pretrained models, and training logs to help reproduce the results in the paper.

Specifically, the proposed ArcFace \(\cos(\theta + m)\) directly maximises the decision boundary in angular (arc) space based on the L2 normalised weights and features.
\[L=-\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s\left(\cos \left(\theta_{y_{i}}+m\right)\right)}}{e^{s\left(\cos \left(\theta_{y_{i}}+m\right)\right)}+\sum_{j=1, j \neq y_{i}}^{n} e^{s \cos \theta_{j}}}\] where,
 \(\theta_{j}\) is the angle between the weight \(W_{j}\) and the feature \(x_{i}\)
 \(s\): feature scale, the hypersphere radius
 \(m\): angular margin penalty
Triplet Loss
 Proposed in FaceNet: A Unified Embedding for Face Recognition and Clustering by Schroff et al. in CVPR 2015.
 Triplet loss was originally used to learn face recognition of the same person at different poses and angles.
 Triplet loss is a loss function for machine learning algorithms where a reference input (called anchor) is compared to a matching input (called positive) and a nonmatching input (called negative).
 Formally, the triplet loss is defined as:
\[\mathcal{L}(A, P, N)=\max \left(\|f(A)-f(P)\|^{2}-\|f(A)-f(N)\|^{2}+\alpha, 0\right)\]
 where,
 \(A\) is an anchor input
 \(P\) is a positive input of the same class as \(A\)
 \(N\) is a negative input of a different class from \(A\)
 \(\alpha\) is a margin between positive and negative pairs
 \(f\) is an embedding
 Consider the task of training a neural network to recognize faces (e.g. for admission to a high security zone).
 A classifier trained to classify an instance would have to be retrained every time a new person is added to the face database.
 This can be avoided by posing the problem as a similarity learning problem instead of a classification problem.
 Here the network is trained (using a contrastive loss) to output a distance which is small if the image belongs to a known person and large if the image belongs to an unknown person.
 However, if we want to output the closest images to a given image, we would like to learn a ranking and not just a similarity.
 A triplet loss is used in this case.
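A minimal sketch with precomputed embeddings (assuming NumPy; the margin value \(\alpha = 0.2\) is illustrative):

```python
import numpy as np

def triplet_loss(a, p, n, alpha=0.2):
    # Squared Euclidean distances: anchor-positive and anchor-negative.
    d_ap = np.sum((a - p) ** 2)
    d_an = np.sum((a - n) ** 2)
    # Loss is zero once the negative is further than the positive by alpha.
    return max(0.0, d_ap - d_an + alpha)
```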
InfoNCE Loss
 Proposed in Representation Learning with Contrastive Predictive Coding by van den Oord et al. in 2018.
 InfoNCE, where NCE stands for NoiseContrastive Estimation, is a type of contrastive loss function used for selfsupervised learning.
 The InfoNCE loss, inspired by NCE, uses categorical crossentropy loss to identify the positive sample amongst a set of unrelated noise samples.
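A sketch of the idea for a single query (assuming NumPy; using cosine similarity as the score and a temperature of 0.1 are illustrative choices):

```python
import numpy as np

def info_nce(q, pos, negs, tau=0.1):
    # Candidates: the positive at index 0, followed by the noise samples.
    cands = np.vstack([pos] + list(negs))
    # Cosine similarities between the query and each candidate, scaled by tau.
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    logits = cands @ q / tau
    # Categorical cross-entropy with the positive as the correct "class".
    logits = logits - logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```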
Dice Loss
 Proposed in Rethinking Dice Loss for Medical Image Segmentation by Zhao et al. in ICDM 2020.
 Dice loss originates from the Sørensen–Dice coefficient, a statistic developed in the 1940s to gauge the similarity between two samples.
 It was brought to the computer vision community by Milletari et al. in 2016 for 3D medical image segmentation.
 From the perspective of set theory, the Dice coefficient (DSC) is a measure of overlap between two sets.
 For example, if two sets \(A\) and \(B\) overlap perfectly, DSC attains its maximum value of 1. Otherwise, DSC decreases, reaching its minimum value of 0 if the two sets do not overlap at all.
 Therefore, the range of DSC is between 0 and 1, the larger the better. Thus we can use \(1 - DSC\) as the Dice loss to maximize the overlap between two sets.
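A minimal sketch for soft (probabilistic) masks (assuming NumPy; the smoothing constant is an illustrative choice to avoid division by zero):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    # Dice coefficient: 2 * |A intersect B| / (|A| + |B|), smoothed by eps.
    intersection = np.sum(pred * target)
    dsc = (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)
    # Loss decreases as the overlap between prediction and target grows.
    return 1.0 - dsc
```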
Margin Ranking Loss
 Proposed in Adaptive Margin Ranking Loss for Knowledge Graph Embeddings via a Correntropy Objective Function by Nayyeri et al. in 2019.
 As the name suggests, Margin Ranking Loss (MRL) is used for ranking problems.
 MRL calculates the loss given two inputs \(x_1\) and \(x_2\), as well as a label tensor \(y\) containing 1 or -1.
 When \(y = 1\), the first input is assumed to be the larger value and is ranked higher than the second input.
 Similarly, if \(y = -1\), the second input is ranked higher.
 Let’s look at the code for this below from analyticsindiamag:
import torch
import torch.nn as nn

first_input = torch.randn(3, requires_grad=True)
second_input = torch.randn(3, requires_grad=True)
target = torch.randn(3).sign()
ranking_loss = nn.MarginRankingLoss()
output = ranking_loss(first_input, second_input, target)
output.backward()
print('input one: ', first_input)
print('input two: ', second_input)
print('target: ', target)
print('output: ', output)
Contrastive Loss
 Proposed in Dimensionality Reduction by Learning an Invariant Mapping by Hadsell et al. (with Yann LeCun) in IEEE CVPR 2006.
 Contrastive loss is a distancebased loss as opposed to more conventional errorprediction losses. This loss is used to learn embeddings in which two “similar” points have a low Euclidean distance and two “dissimilar” points have a large Euclidean distance.
 Contrastive loss takes the output of the network for a positive example and calculates its distance to an example of the same class and contrasts that with the distance to negative examples.
 Two samples are either similar or dissimilar. This binary similarity can be determined using several approaches:
 In this work, the \(N\) closest neighbors of a sample in input space (e.g. pixel space) are considered similar; all others are considered dissimilar. (This approach yields a smooth latent space; e.g. the latent vectors for two similar views of an object are close)
 To the group of similar samples to a sample, we can add transformed versions of the sample (e.g. using data augmentation). This allows the latent space to be invariant to one or more transformations.
 We can use a manually obtained label determining if two samples are similar. (For example, we could use the class label. However, there can be cases where two samples from the same class are relatively dissimilar, or where two samples from different classes are relatively similar. Using classes alone does not encourage a smooth latent space.)
 Put simply, clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes. In other words, contrastive loss calculates the distance between positive example (example of the same class) and negative example (example not of the same class). So loss can be expected to be low if the positive examples are encoded (in this embedding space) to similar examples and the negative ones are further away encoded to different representations. This behavior is illustrated in the image below:
 Formally, if we consider \(\vec{X}\) as the input data and \(G_W(\vec{X})\) the output of a neural network, the interpoint distance is given by,
\[D_W\left(\vec{X}_1, \vec{X}_2\right)=\left\|G_W\left(\vec{X}_1\right)-G_W\left(\vec{X}_2\right)\right\|_2\]
The contrastive loss is simply,
\[\begin{aligned} \mathcal{L}(W) &=\sum_{i=1}^P L\left(W,\left(Y, \vec{X}_1, \vec{X}_2\right)^i\right) \\ L\left(W,\left(Y, \vec{X}_1, \vec{X}_2\right)^i\right) &=(1-Y) L_S\left(D_W^i\right)+Y L_D\left(D_W^i\right) \end{aligned}\] where \(Y=0\) when \(X_1\) and \(X_2\) are similar and \(Y=1\) otherwise, \(L_S\) is a loss for similar points, and \(L_D\) is a loss for dissimilar points.

More formally, the contrastive loss is given by,
\[\begin{aligned} &L\left(W, Y, \vec{X}_1, \vec{X}_2\right)= \\ &\quad(1-Y) \frac{1}{2}\left(D_W\right)^2+(Y) \frac{1}{2}\left\{\max \left(0, m-D_W\right)\right\}^2 \end{aligned}\] where \(m\) is a predefined margin.

The gradient is given by the simple equations:
\[\frac{\partial L_S}{\partial W}=D_W \frac{\partial D_W}{\partial W}, \qquad \frac{\partial L_D}{\partial W}=-\left(m-D_W\right) \frac{\partial D_W}{\partial W} \text{ (when } D_W<m \text{; } 0 \text{ otherwise)}\]
Contrastive Loss is often used in image retrieval tasks to learn discriminative features for images. During training, an image pair is fed into the model with their ground truth relationship: equals 1 if the two images are similar and 0 otherwise. The loss function for a single pair is:
\[y d^2+(1-y) \max (\operatorname{margin}-d, 0)^2\] where \(d\) is the Euclidean distance between the two image features (suppose their features are \(f_1\) and \(f_2\)): \(d=\left\|f_1-f_2\right\|_2\). The \(margin\) term is used to “tighten” the constraint: if two images in a pair are dissimilar, then their distance should be at least \(margin\), or a loss will be incurred.
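The pairwise formula can be sketched as follows (assuming NumPy and the convention above, where \(y = 1\) marks a similar pair):

```python
import numpy as np

def contrastive_loss(y, f1, f2, margin=1.0):
    # Euclidean distance between the two feature vectors.
    d = np.linalg.norm(f1 - f2)
    # Similar pairs (y = 1) are pulled together; dissimilar pairs (y = 0)
    # are pushed at least `margin` apart.
    return y * d ** 2 + (1.0 - y) * max(margin - d, 0.0) ** 2
```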

 Note that while this is one of the earliest of the contrastive losses, this is not the only one. For instance, the contrastive loss used in SimCLR is quite different.
Multiple Negative Ranking Loss
 Proposed in Efficient Natural Language Response Suggestion for Smart Reply by Henderson et al. from Google in 2017.
 Multiple Negative Ranking (MNR) Loss is a great loss function if you only have positive pairs, for example, pairs of similar texts such as paraphrase pairs, duplicate question pairs, (query, response) pairs, or (source_language, target_language) pairs.
 This loss function works great for training embeddings in retrieval setups where you have positive pairs (e.g., (query, relevant_doc)), as it will randomly sample \(n-1\) negative docs in each batch. The performance usually increases with increasing batch sizes.
 This is because with MNR loss, we drop all rows with neutral or contradiction labels, keeping only the positive entailment pairs (source).
 Models trained with MNR loss outperform those trained with softmax loss on sentence embedding tasks.
 Below is a code sample referenced from sbert.net:
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader
model = SentenceTransformer('distilbert-base-uncased')
train_examples = [InputExample(texts=['Anchor 1', 'Positive 1']),
InputExample(texts=['Anchor 2', 'Positive 2'])]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model=model)
 For more, please refer to Next-Gen Sentence Embeddings with Multiple Negatives Ranking Loss and SBert MNR Loss.
Regression Loss Functions
Mean Absolute Error or L1 loss
 As the name suggests, MAE takes the average of the absolute differences between the actual and the predicted values.
 Regression problems may have variables that are not strictly Gaussian in nature due to the presence of outliers (values that are very different from the rest of the data).

Mean Absolute Error would be an ideal option in such cases because it does not take into account the direction of the outliers (unrealistically high positive or negative values).
\[M A E=\frac{1}{m} \sum_{i=1}^{m}\left|h\left(x^{(i)}\right)-y^{(i)}\right|\] where,
 MAE: mean absolute error
 \(\mathrm{m}\): number of samples
 \(x^{(i)}\): \(i^{th}\) sample from dataset
 \(h\left(x^{(i)}\right)\): prediction for the \(i^{th}\) sample (the hypothesis)
 \(y^{(i)}\): ground truth label for the \(i^{th}\) sample
 A quick note here on L1 and L2: both of these norms are also used for regularization.
 L1 Loss Function is used to minimize the error which is the sum of the all the absolute differences between the true value and the predicted value.
 L1 is less affected by outliers and is thus preferable if the dataset contains outliers.
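The MAE formula translates directly to code (a NumPy sketch):

```python
import numpy as np

def mae(y_true, y_pred):
    # Average of the absolute differences; the direction of errors is ignored.
    return np.mean(np.abs(y_true - y_pred))
```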
Mean Squared Error or L2 loss
\[M S E=\frac{1}{m} \sum_{i=1}^{m}\left(y^{(i)}-\hat{y}^{(i)}\right)^{2}\] where,
 MSE: mean square error
 \(\mathrm{m}\): number of samples
 \(y^{(i)}\): ground truth label for ith sample
 \(\hat{y}^{(i)}\): predicted label for ith sample
 Mean Squared Error is the average of the squared differences between the actual and the predicted values.
 L2 Loss Function is used to minimize the error, which is the sum of all the squared differences between the true value and the predicted value. It is also generally preferred over L1.
 However, when outliers are present in the dataset, L2 will not perform as well because the squared differences will lead to a much larger error.
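The MSE formula as a NumPy sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    # Average of squared differences; large errors dominate due to squaring.
    return np.mean((y_true - y_pred) ** 2)
```

Compared to MAE, a single outlier with error 10 contributes 100 to the sum here, which is why L2 is more sensitive to outliers.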
Huber Loss / Smooth Mean Absolute Error
 Huber loss is a loss function used in regression, that is less sensitive to outliers in data than the squared error loss.
 Huber loss is the combination of MSE and MAE. It takes the good properties of both the loss functions by being less sensitive to outliers and differentiable at minima.
 When the error is smaller, the MSE part of the Huber is utilized and when the error is large, the MAE part of Huber loss is used.
 A new hyperparameter \(\delta\) is introduced which tells the loss function where to switch from MSE to MAE.
 Additional \(\delta\) terms are introduced in the loss function to smoothen the transition from MSE to MAE.
 The Huber loss function describes the penalty incurred by an estimation procedure \(f\). Huber loss defines the loss function piecewise by:
\[L_\delta(a)= \begin{cases}\frac{1}{2} a^2 & \text { for }|a| \leq \delta \\ \delta\left(|a|-\frac{1}{2} \delta\right) & \text { otherwise }\end{cases}\]
 This function is quadratic for small values of \(a\), and linear for large values, with equal values and slopes of the different sections at the two points where \(|a|=\delta\). The variable \(a\) often refers to the residuals, that is, the difference between the observed and predicted values, \(a=y-f(x)\), so the former can be expanded to:
\[L_\delta(y, f(x))= \begin{cases}\frac{1}{2}(y-f(x))^2 & \text { for }|y-f(x)| \leq \delta \\ \delta\left(|y-f(x)|-\frac{1}{2} \delta\right) & \text { otherwise }\end{cases}\]
 The below diagram (source) compares Huber loss with squared loss and absolute loss:
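The piecewise definition can be sketched as follows (assuming NumPy; \(a\) is the residual \(y - f(x)\) and \(\delta = 1\) is an illustrative default):

```python
import numpy as np

def huber(a, delta=1.0):
    a = np.asarray(a)
    # Quadratic (MSE-like) for |a| <= delta, linear (MAE-like) beyond it.
    return np.where(np.abs(a) <= delta,
                    0.5 * a ** 2,
                    delta * (np.abs(a) - 0.5 * delta))
```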
Further Reading
References
 Machine Learning Mastery
 ML CheatSheet
 Neptune.ai
 Section.ai
 After Academy
 Programmathically
 Aman.ai
 Papers with code
 Research Gate
 Medium Analytics Vidhya
 PolyLoss
 Generalized End-to-End Loss
 Wikipedia article on Huber loss
 Wikipedia article on Triplet loss
 Towards Data Science
 Papers With Code infoNCE
 Lilian Weng
 Dice Loss Shuchen Du
 Margin Ranking Loss
 Margin Ranking Loss Official Paper
 Wikipedia: Kullback–Leibler divergence
 Kullback–Leibler Divergence Explained
 What is the difference between Cross-entropy and KL divergence?
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledLossFunctions,
title = {Loss Functions},
author = {Chadha, Aman and Jain, Vinija},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}