Primers • Evaluation Metrics
 Introduction
 Classification problem
 What is Precision and Recall?
 Precision and Recall Formulae
 F1 Score
 Calculating Precision and Recall
 Precision and Recall vs. F1-score
 Precision and Recall vs. Sensitivity and Specificity
 Calculating precision, sensitivity and specificity
 Precision and Recall vs. ROC curve and AUC
 Applications of Precision and Recall
 History
 Object Detection: IoU, AP, and mAP
 Frechet Inception Distance
 Evaluation Metrics for NLP Tasks
 Evaluation Metrics for GANs
 Overview of evaluation metrics
 Further reading
 References
 Citation
Introduction
 Deep learning tasks can be complex and hard to measure: how do we know whether one network is better than another? In some simpler cases such as regression, the loss function used to train a network can be a good measurement of the network’s performance.
 However, for many real-world tasks, there are evaluation metrics that encapsulate, in a single number, how well a network is doing in terms of real-world performance. These evaluation metrics allow us to quickly see the quality of a model, and easily compare different models on the same tasks.
 Next, we’ll go through some case studies of different tasks and their metrics.
Classification problem
 Let’s consider a simple binary classification problem, where we are trying to predict if a patient is healthy or has pneumonia. We have a test set with 10 patients, where 9 patients are healthy (shown as green squares) and 1 patient has pneumonia (shown as a red square). The ground truth for your test set is shown below:
 We’ve trained three models for this task (Model1, Model2, Model3), and we’d like to compare the performance of these models. The predictions from each model on the test set are shown below:
Accuracy
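 To compare the models, we could first use accuracy, which is the number of correctly classified examples divided by the total:
\[\text{accuracy} = \frac{\text{number of correctly classified examples}}{\text{total number of examples}}\]
 If we use accuracy as our evaluation metric, it seems that the best model is Model1.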
 In general, when you have class imbalance (which is most of the time!), accuracy is not a good metric to use.
Confusion Matrix
 Accuracy doesn’t discriminate between errors (i.e., it treats misclassifying a patient with pneumonia as healthy the same as misclassifying a healthy patient as having pneumonia). A confusion matrix is a tabular format for showing a more detailed breakdown of a model’s correct and incorrect classifications.
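 A confusion matrix for binary classification is shown below:
Actual class  Predicted positive  Predicted negative 
Positive  True positive  False negative 
Negative  False positive  True negative 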
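 As a minimal sketch of how the four cells of a binary confusion matrix can be counted, the Python snippet below tallies them from two label lists. The function name `confusion_counts`, the 0/1 label encoding (1 = pneumonia), and the "always healthy" predictions are assumptions made for illustration; they are not the predictions of Model1, Model2, or Model3.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


# Test set from the text: 9 healthy patients (0) and 1 with pneumonia (1).
y_true = [0] * 9 + [1]
# Hypothetical classifier that labels every patient as healthy.
y_pred = [0] * 10

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(tp, fp, fn, tn, accuracy)
```

 Under these assumptions the classifier reaches 90% accuracy while missing the only pneumonia case, which is the class-imbalance problem described above.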
What is Precision and Recall?
 Precision and recall are two numbers which together are used to evaluate the performance of classification or information retrieval systems. Precision is defined as the fraction of relevant instances among all retrieved instances. Recall, sometimes referred to as ‘sensitivity’, is the fraction of retrieved instances among all relevant instances. A perfect classifier has precision and recall both equal to \(1\).
 It is often possible to calibrate the number of results returned by a model and improve precision at the expense of recall, or vice versa.
 Precision and recall should always be reported together.
Precision and Recall Formulae

Formally, the mathematical definition of precision is given by,
\[\begin{aligned} \text { precision } &=\frac{t_p}{t_p+f_p} \\ &=\frac{\text { retrieved and relevant documents }}{\text { all retrieved documents }} \end{aligned}\] 
Similarly, the formal definition of recall is given by,
\[\begin{aligned} \operatorname{recall} &=\frac{t_p}{t_p+f_n} \\ &=\frac{\text { retrieved and relevant documents }}{\text { all relevant documents }} \end{aligned}\] where,
 \(t_p\) is the number of true positives, that is the number of instances which are relevant and which the model correctly identified as relevant.
 \(f_p\) is the number of false positives, that is the number of instances which are not relevant but which the model incorrectly identified as relevant.
 \(f_n\) is the number of false negatives, that is the number of instances which are relevant and which the model incorrectly identified as not relevant.
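 The two formulae translate directly into code. The sketch below is illustrative only: the function names, the example counts, and the convention of returning 0.0 when a denominator is zero are assumptions, not part of the definitions above.

```python
def precision(tp, fp):
    """t_p / (t_p + f_p); returns 0.0 if nothing was retrieved (assumed convention)."""
    retrieved = tp + fp
    return tp / retrieved if retrieved else 0.0


def recall(tp, fn):
    """t_p / (t_p + f_n); returns 0.0 if there are no relevant instances (assumed convention)."""
    relevant = tp + fn
    return tp / relevant if relevant else 0.0


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
print(precision(8, 2))  # 8 of the 10 retrieved instances are relevant
print(recall(8, 8))     # 8 of the 16 relevant instances were retrieved
```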
F1 Score
 Precision and recall are both useful, but having multiple evaluation metrics makes it difficult to directly compare models. Precision and recall are sometimes combined into the F1-score, if a single numerical measurement of a system’s performance is required. From Andrew Ng’s machine learning book:
“Having multiple-number evaluation metrics makes it harder to compare algorithms. Better to combine them to a single evaluation metric. Having a single-number evaluation metric speeds up your ability to make a decision when you are selecting among a large number of classifiers. It gives a clear preference ranking among all of them, and therefore a clear direction for progress.”  Machine Learning Yearning
 F1 score is a metric that combines recall and precision by taking their harmonic mean:
\[F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\]
Calculating Precision and Recall
Example #1: Disease diagnosis
 Consider our classification problem of pneumonia detection. It is crucial that we find all the patients that are suffering from pneumonia. Predicting patients with pneumonia as healthy is not acceptable (since the patients will be left untreated).
 Thus, a natural question to ask when evaluating our models is: Out of all the patients with pneumonia, how many did the model predict as having pneumonia? The answer to this question is given by the recall.
 The recall for each model is given by:
 Imagine that the treatment for pneumonia is very costly and therefore you would also like to make sure only patients with pneumonia receive treatment.
 A natural question to ask would be: Out of all the patients that are predicted to have pneumonia, how many actually have pneumonia? This metric is the precision.
 The precision for each model is given by:
 Next, let’s calculate the F1 score for each model,
Example #2: Search engine
 Imagine that you are searching for information about cats on your favorite search engine. You type ‘cat’ into the search bar.
 The search engine finds four web pages for you. Three pages are about cats, the topic of interest, and one page is about something entirely different, and the search engine gave it to you by mistake. In addition, there are four relevant documents on the internet, which the search engine missed.

In this case we have three true positives, so \(t_p=3\). There is one false positive, \(f_p=1\). And there are four false negatives, so \(f_n=4\). Note that to calculate precision and recall, we do not need to know the total number of true negatives (the irrelevant documents which were not retrieved).

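The precision is given by,
\[\text{precision} = \frac{t_p}{t_p+f_p} = \frac{3}{3+1} = 0.75\]
 and the recall is,
\[\text{recall} = \frac{t_p}{t_p+f_n} = \frac{3}{3+4} \approx 0.43\]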
Precision and Recall vs. F1-score
 Usually, precision and recall scores are given together and are not quoted individually. This is because it is easy to vary the sensitivity of a model to improve precision at the expense of recall, or vice versa.
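 If a single number is required to describe the performance of a model, the most convenient figure is the F-score, which is the harmonic mean of the precision and recall:
\[F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\]
 This allows us to combine the precision and recall into a single number.
 If we consider either precision or recall to be more important than the other, then we can use the \(F_{\beta}\) score, which is a weighted harmonic mean of precision and recall. This is useful, for example, in the case of a medical test, where a false negative may be extremely costly compared to a false positive. The \(F_{\beta}\) score formula is more complex:
\[F_{\beta} = (1+\beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}\]
 Here \(\beta > 1\) gives more weight to recall, \(\beta < 1\) gives more weight to precision, and \(\beta = 1\) recovers the F-score above.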
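 As a rough illustration of the weighted harmonic mean, the sketch below implements the standard \(F_{\beta}\) definition and applies it to the search engine example (\(t_p=3\), \(f_p=1\), \(f_n=4\)). The function name `f_beta` and the choice to return 0.0 when both inputs are zero are assumptions made for this example.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives the F1 score)."""
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention to avoid division by zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Search engine example from the text: t_p = 3, f_p = 1, f_n = 4.
p = 3 / (3 + 1)
r = 3 / (3 + 4)

f1 = f_beta(p, r)          # precision and recall weighted equally
f2 = f_beta(p, r, beta=2)  # recall weighted more heavily than precision
print(round(f1, 3), round(f2, 3))
```

 Because recall is the weaker of the two numbers in this example, weighting it more heavily (\(\beta = 2\)) gives a lower score than \(F_1\).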
Calculating Precision, Recall and the F-score
 For the above example of the search engine, we obtained precision of \(0.75\) and recall of \(0.43\).
 Imagine that we consider precision and recall to be of equal importance for our purposes. In this case, we will use the F-score to summarize precision and recall together.
 Putting the figures for the precision and recall into the formula for the F-score, we obtain:
\[F = 2 \cdot \frac{0.75 \cdot 0.43}{0.75 + 0.43} \approx 0.55\]
 Note that the F-score of \(0.55\) lies between the recall and precision values (\(0.43\) and \(0.75\)). This illustrates how the F-score can be a convenient way of averaging the precision and recall in order to condense them into a single number.
Precision and Recall vs. Sensitivity and Specificity
 When we need to express model performance in two numbers, an alternative two-number metric to precision and recall is sensitivity and specificity. These are commonly used for medical tests: the stated sensitivity and specificity for a device or testing kit are often printed on the side of the box, or in the instruction leaflet.
 Sensitivity and specificity are defined as follows. Note that sensitivity is equivalent to recall:
\[\begin{aligned} \text{sensitivity} &= \text{recall} = \frac{t_p}{t_p+f_n} \\ \text{specificity} &= \frac{t_n}{t_n+f_p} \end{aligned}\]
 Specificity also uses \(t_n\), the number of true negatives. This means that sensitivity and specificity use all four numbers in the confusion matrix, as opposed to precision and recall which only use three.
 The number of true negatives corresponds to the number of patients identified by the test as not having the disease when they indeed did not have the disease, or alternatively the number of irrelevant documents which the search engine did not retrieve.
 Taking a probabilistic interpretation, we can view specificity as the probability of a negative test given that the patient is well, while the sensitivity is the probability of a positive test given that the patient has the disease.
 Sensitivity and specificity are preferred to precision and recall in the medical domain, while precision and recall are the most commonly used metrics for information retrieval. This initially seems strange, since both pairs of metrics are measuring the same thing: the performance of a binary classifier.
 The reason for this discrepancy is that when we are measuring the performance of a search engine, the true negatives (the vast number of irrelevant documents that were correctly not retrieved) are of little interest, so both precision and recall are defined without \(t_n\). However, if we are testing a medical device, it is important to take into account the number of true negatives, since these represent the large number of patients who do not have the disease and were correctly categorized by the device.
Calculating precision, sensitivity and specificity
 Let us calculate the precision, sensitivity and specificity for the below example of disease diagnosis.
 Suppose we have a medical test which is able to identify patients with a certain disease.
 We test 20 patients and the test identifies 8 of them as having the disease.
 Of the 8 identified by the test, 5 actually had the disease (true positives), while the other 3 did not (false positives).
 We later find out that the test missed 4 additional patients who turned out to really have the disease (false negatives).
 We can represent the 20 patients using the following confusion matrix:
Test result  True state: Disease  True state: No disease 
Alert  5  3 
No alert  4  8 
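 The relevant values for calculating precision and recall are \(t_p=5\), \(f_p=3\), and \(f_n=4\). Putting these values into the formulae for precision and recall, we obtain:
\[\begin{aligned} \text{precision} &= \frac{t_p}{t_p+f_p} = \frac{5}{5+3} = 0.625 \\ \text{recall} &= \frac{t_p}{t_p+f_n} = \frac{5}{5+4} \approx 0.56 \end{aligned}\]
 For sensitivity and specificity, we additionally need the number of true negatives, \(t_n=8\).

Sensitivity of course comes out as the same value as recall:
\[\text{sensitivity} = \frac{t_p}{t_p+f_n} = \frac{5}{5+4} \approx 0.56\]
 whereas specificity gives:
\[\text{specificity} = \frac{t_n}{t_n+f_p} = \frac{8}{8+3} \approx 0.73\]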
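 The same calculation for the 20-patient example can be sketched in a few lines of Python. The function name `diagnostic_metrics` and the dictionary layout are illustrative assumptions; the counts come from the confusion matrix above.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute precision, sensitivity (recall) and specificity from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
    }


# Counts from the example: 5 true positives, 3 false positives,
# 4 false negatives and 8 true negatives (20 patients in total).
metrics = diagnostic_metrics(tp=5, fp=3, fn=4, tn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```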
Precision and Recall vs. ROC curve and AUC
 Let us imagine that the manufacturer of a pregnancy test needed to reach a certain level of precision, or of specificity, for FDA approval. The pregnancy test shows one line if it is moderately confident of the pregnancy, and a double line if it is very sure. If the manufacturer decides to only count the double lines as positives, the test will return far fewer positives overall, but the precision will improve, while the recall will go down. This shows why precision and recall should always be reported together.
 Adjusting threshold values like this enables us to improve either precision or recall at the expense of the other. For this reason, it is useful to have a clear view of how the false positive rate and true positive rate vary together.
 A common visualization of this is the ROC curve, or Receiver Operating Characteristic curve. The ROC curve plots the true positive rate against the false positive rate for all values of the manually-defined threshold.
 For example, if a search engine assigns a score to all candidate documents that it has retrieved, we can set the search engine to display all documents with a score greater than \(10\), or \(11\), or \(12\). The freedom to set this threshold value generates a smooth curve as below. The figure below shows a ROC curve for a binary classifier with AUC = \(0.93\). The orange line shows the model’s true positive rate plotted against its false positive rate as the threshold varies, and the dotted blue line is the baseline of a random classifier with zero predictive power, achieving AUC = \(0.5\).
 The area under the ROC curve (AUC) is a good metric for measuring the classifier’s performance. This value is normally between \(0.5\) (for a useless classifier) and \(1.0\) (a perfect classifier). The better the classifier, the closer the ROC curve will be to the top left corner.
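 To make the threshold sweep concrete, here is a minimal pure-Python sketch that builds ROC points from scored examples and integrates them with the trapezoidal rule. The function names, the toy labels and scores, and the requirement that both classes are present are assumptions for illustration; library implementations may handle edge cases differently.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, sweeping the threshold from high to low.

    Assumes labels are 0/1 and that both classes occur at least once.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores for six examples (1 = positive, 0 = negative).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 3))
```

 In this toy data one negative example outranks one positive example, so 8 of the 9 positive-negative pairs are ordered correctly and the AUC is 8/9.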
Applications of Precision and Recall
Information Retrieval
 Precision and recall are best known for their use in evaluating search engines and other information retrieval systems.
 Search engines must index large numbers of documents, and display a small number of relevant results to a user on demand. It is important for the user experience that all relevant results are identified and that as few irrelevant documents as possible are displayed to the user. For this reason, precision and recall are the natural choice for quantifying the performance of a search engine, with some small modifications.
 Over 90% of users do not look past the first page of results. This means that the results on the second and third pages are not very relevant for evaluating a search engine in practice. For this reason, rather than calculating the standard precision and recall, we often calculate the precision for the first 10 results and call this precision @ 10. This allows us to have a measure of the precision that is more relevant to the user experience, for a user who is unlikely to look past the first page. Generalizing this, the precision for the first k results is called the precision @ \(k\).
 In fact, search engine overall performance is often expressed as mean average precision, which is the average of precision @ k, for a number of k values, and for a large set of search queries. This allows an evaluation of the search precision taking into account a variety of different user queries, and the possibility of users remaining on the first results page, vs. scrolling through to the subsequent results pages.
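 As an illustrative sketch, precision @ \(k\) and mean average precision can be computed from ranked relevance judgements as follows. The code uses one common formulation of average precision (the mean of precision @ \(k\) over the ranks \(k\) at which a relevant result appears); the function names and the two toy queries are assumptions, and other formulations exist.

```python
def precision_at_k(relevance, k):
    """Fraction of the first k results that are relevant (missing results count as irrelevant)."""
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of precision@k taken at each rank k where a relevant result appears."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Average of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Hypothetical relevance judgements (1 = relevant) for two queries, in ranked order.
query_a = [1, 0, 1, 0, 0]
query_b = [0, 1, 0, 0, 1]

print(precision_at_k(query_a, 3))
print(mean_average_precision([query_a, query_b]))
```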
History
 This section is optional and offers a historical walkthrough of how precision, recall and F1-score came about, so you may skip to the next section if so desired.
 Precision and recall were first defined by the American scientist Allen Kent and his colleagues in their 1955 paper Machine literature searching VIII. Operational criteria for designing information retrieval systems.
 Kent served in the US Army Air Corps in World War II, and was assigned after the war by the US military to a classified project at MIT in mechanized document encoding and search.
 In 1955, Kent and his colleagues Madeline Berry, Fred Luehrs, and J.W. Perry were working on a project in information retrieval using punch cards and reel-to-reel tapes. The team found a need to be able to quantify the performance of an information retrieval system objectively, allowing improvements in a system to be measured consistently, and so they published their definition of precision and recall.
 They described their ideas as a theory underlying the field of information retrieval, just as the second law of thermodynamics “underlies the design of a steam engine, regardless of its type or power rating”.
 Since then, the definitions of precision and recall have remained fundamentally the same, although for search engines the definitions have been modified to take into account certain nuances of human behavior, giving rise to the modified metrics precision @ \(k\) and mean average precision, which are the values normally quoted in information retrieval contexts today.
 In 1979 the Dutch computer science professor Cornelis Joost van Rijsbergen recognized the problems of defining search engine performance in terms of two numbers and decided on a convenient scalar function that combines the two. He called this metric the Effectiveness function and assigned it the letter E. This was later modified to the F-score, or \(F_{\beta}\) score, which is still used today to summarize precision and recall.
Object Detection: IoU, AP, and mAP
 In object detection, two primary metrics are used: intersection over union (IoU) and mean average precision (mAP).
 Let’s walk through a small example (figure credit to J. Hui’s excellent post).
Intersection over Union (IoU)
 Object detection involves finding objects, classifying them, and localizing them by drawing bounding boxes around them. Intersection over union is an intuitive metric that measures the goodness of fit of a bounding box:
\[\text{IoU} = \frac{\text{area of overlap}}{\text{area of union}}\]
 The higher the IoU, the better the fit. IoU is a great metric since it works well for any size and shape of object. This per-object metric, along with precision and recall, forms the basis for the full object detection metric, mean average precision (mAP).
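 For axis-aligned bounding boxes, IoU can be sketched in a few lines. The `(x_min, y_min, x_max, y_max)` box format and the function name `iou` are assumptions chosen for this example; detection libraries use a variety of box conventions.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two hypothetical 2x2 boxes that overlap in a 1x1 square.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```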
Average Precision (AP): the Area Under Curve (AUC)

Object detectors create multiple predictions: each image can have multiple predicted objects, and there are many images to run inference on. Each predicted object has a confidence associated with it: this is how confident the detector is in its prediction.

We can choose different confidence thresholds to use, to decide which predictions to accept from the detector. For instance, if we set the threshold to 0.7, then any predictions with confidence greater than 0.7 are accepted, and the low confidence predictions are discarded. Since there are so many different thresholds to choose, how do we summarize the performance of the detector?

The answer uses a precision-recall curve. At each confidence threshold, we can measure the precision and recall of the detector, giving us one data point. If we connect these points together, one for each threshold, we get a precision-recall curve like the following:
 The better the model, the higher the precision and recall at its points: this pushes the boundary of the curve (the dark line) towards the top and right. We can summarize the performance of the model with one metric, by taking the area under the curve (shown in blue). This gives us a number between 0 and 1, where higher is better. This metric is commonly known as average precision (AP).
Mean Average Precision (mAP)

Object detection is a complex task: we want to accurately detect all the objects in an image, draw accurate bounding boxes around each one, and accurately predict each object’s class. We can actually encapsulate all of this into one metric: mean average precision (mAP).

To start, let’s compute AP for a single image and class. Imagine our network predicts 10 objects of some class in an image: each prediction is a single bounding box, predicted class, and predicted confidence (how confident the network is in its prediction).

We start with IoU to decide if each prediction is correct or not. For a ground truth object and nearby prediction, if,
 the predicted class matches the actual class, and
 the IoU is greater than a threshold,
 … we say that the network got that prediction right (true positive). Otherwise, the prediction is a false positive.

We can now sort our predictions by their confidence, descending, resulting in the following table. Table of predictions, from most confident to least confident. Cumulative precision and recall shown on the right:
 For each confidence level (starting from largest to smallest), we compute the precision and recall up to that point. If we graph this, we get the raw precision-recall curve for this image and class:
 Notice how our precision-recall curve is jagged: this is due to some predictions being correct (increasing recall) and others being incorrect (decreasing precision). We smooth out the kinks in this graph to produce our network’s final PR curve for this image and class. The smoothed precision-recall curve used to calculate average precision (area under the curve):

The average precision (AP) for this image and class is the area under this smoothed curve.
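 The procedure above can be sketched as follows. The smoothing used here replaces each precision value with the maximum precision found at any equal or higher recall, which is one common choice and is an assumption about the smoothing meant in the text. The function name, the five hypothetical predictions, and the ground-truth count of 4 are also illustrative.

```python
def detection_average_precision(is_correct, num_ground_truth):
    """Area under the smoothed precision-recall curve.

    is_correct: True/False per prediction, already sorted by confidence (descending).
    num_ground_truth: number of ground truth objects of this class.
    """
    precisions, recalls = [], []
    true_positives = 0
    for rank, correct in enumerate(is_correct, start=1):
        if correct:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Smooth the jagged curve: each precision becomes the best precision to its right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the rectangles under the smoothed step curve.
    area = 0.0
    previous_recall = 0.0
    for p, r in zip(precisions, recalls):
        area += (r - previous_recall) * p
        previous_recall = r
    return area


# Hypothetical ranked predictions: correct, correct, wrong, correct, wrong.
ap = detection_average_precision([True, True, False, True, False], num_ground_truth=4)
print(ap)
```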

 To compute the mean average precision over the whole dataset, we average the AP for each image and class, giving us one single metric of our network’s performance on object detection! This is the metric that is used for common object detection benchmarks such as Pascal VOC and COCO. (In practice, these benchmarks compute AP for each class over all images in the dataset and then average across classes.)
Frechet Inception Distance
 This metric compares the statistics of the generated samples and real samples. It models both distributions as multivariate Gaussians. Thus, these two distributions can be compactly represented by their mean \(\mu\) and covariance matrix \(\Sigma\) exclusively. That is:
\[X_r \sim \mathcal{N}(\mu_r, \Sigma_r), \qquad X_g \sim \mathcal{N}(\mu_g, \Sigma_g)\]

These two distributions are estimated with 2048-dimensional activations of the Inception-v3 pool3 layer for real and generated samples respectively.

Finally, the FID between the real image distribution (\(X_r\)) and the generated image distribution (\(X_g\)) is computed as:
\[\text{FID}(X_r, X_g) = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)\]
 Therefore, lower FID corresponds to more similar real and generated samples as measured by the distance between their activation distributions.
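 To build intuition for the distance, the sketch below computes the Fréchet distance between two Gaussians under a simplifying assumption that both covariance matrices are diagonal, so the matrix square root reduces to an element-wise square root. This is not the full FID computation: the real metric uses full covariance matrices estimated from Inception-v3 activations. The function name and the toy statistics are hypothetical.

```python
import math


def frechet_distance_diagonal(mu_r, var_r, mu_g, var_g):
    """Frechet distance between two Gaussians with DIAGONAL covariances (simplifying assumption).

    mu_*: per-dimension means; var_*: per-dimension variances (the covariance diagonals).
    """
    # Squared Euclidean distance between the means.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # Trace term: for diagonal matrices, sqrt(Sigma_r Sigma_g) is sqrt(var_r * var_g) element-wise.
    trace_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g))
    return mean_term + trace_term


# Toy 2-dimensional statistics standing in for real and generated activations.
same = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
different = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 2.0], [4.0, 1.0])
print(same, different)
```

 Identical statistics give a distance of zero, and the distance grows as the means or variances drift apart.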
Evaluation Metrics for NLP Tasks
 Generative language models
 Perplexity
 Cross entropy
 Machine Translation
 Multitask learning
 Manual evaluation by humans for fluency, grammar, comparative ranking, etc.
Evaluation Metrics for GANs
 Inception Score
 Wasserstein distance
 Fréchet Inception Distance
 Traditional metrics such as precision, recall, and F1-score
Overview of evaluation metrics
 In A survey on news recommender systems (2020), Raza and Ding offer a summary of the definitions for a range of evaluation metrics in the context of recommender systems:
Metric  Description  Type 

Precision  The proportion of relevant recommended items over total recommended items.  Accuracy 
Recall  The proportion of relevant recommended items over total relevant items.  Accuracy 
F1-score  Weighted average of the precision and recall.  Accuracy 
Customer Satisfaction Index  The satisfaction degree of a user on the recommendations (Xia et al. 2010).  Beyond-accuracy 
Mean Reciprocal Rank (MRR)  The multiplicative inverse of the rank of the first correct item.  Ranking accuracy 
Mean Average Precision (MAP)  The average precisions across all relevant queries.  Ranking accuracy 
\(\overline{Rank}\)  The percentile-ranking of article within the ordered list of all articles.  Ranking accuracy 
Cumulative rating  The total relevance of all documents at or above each rank position in the top \(k\).  Ranking accuracy 
Success @ \(k\)  A current news item that is in sequence and in a set of recommended news items.  Ranking accuracy 
Personalized @ \(k\)  A current news item that is in a given sequence and in a set of recommended news items without popular items (Garcin et al. 2013).  Personalization accuracy 
Novelty @ \(k\)  The ratio of unseen and recommended items over the recommended items.  Novelty, beyond-accuracy 
Diversity  The degree of how much dissimilar recommended items are for a user.  Diversity, beyond-accuracy 
Binary Hit rate  The number of hits in an n-sized list of ranked items over the number of users for whom the recommendations are produced.  Ranking accuracy 
Log-loss  To measure the performance of a classification model where the prediction input is a probability value between 0 and 1.  Accuracy 
Average Reciprocal Hit rate  Each hit is inversely weighted relative to its position in top-N recommendations.  Ranking accuracy 
Root-mean-square error (RMSE)  Difference between the predicted and the actual rating.  Accuracy 
Click-through rate (CTR)  The likelihood of a news item that will be clicked.  Accuracy 
Discounted Cumulative Gain (DCG)  The gain of an item according to its position in the result list of a recommender.  Ranking accuracy 
Area under curve (AUC)  A ROC curve plots recall (true positive rate) against fallout (false positive rate).  Accuracy 
Saliency  To evaluate if a news entity is relevant for a text document (Cucchiarelli et al. 2018).  Beyond-accuracy 
FutureImpact  To evaluate how much user attention (views or shares) each news story may receive in the future and is measured between recency and relevancy (Chakraborty et al. 2019).  Beyond-accuracy 
Further reading
Here are some (optional) links you may find interesting for further reading:
 An empirical study on evaluation metrics of generative adversarial networks (paper)
 How to measure GAN performance? (blog post)
 Are GANs Created Equal? A Large-Scale Study (paper)
 Pros and Cons of GAN Evaluation Measures (paper)
 How to Evaluate GANs (blog post)
 Evaluation Metrics for Language Modeling (article)
 Evaluation Metrics for Language Models (paper)
References
 Speech and Language Processing (2019) by Jurafsky and Martin.
 A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation (2005) by Goutte and Gaussier.
 Information Retrieval (2nd ed.) (1979) by Van Rijsbergen.
 Machine literature searching VIII. Operational criteria for designing information retrieval systems (1955) by Kent et al.
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledEvaluationMetrics,
title = {Evaluation Metrics},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}