Accuracy, precision, recall, and F1-score are metrics used to assess automatic classifiers. These metrics are calculated from a confusion matrix. Say we have a spam filter that tells whether a mail is spam or not. With a test dataset of 100 spam emails and 900 non-spam emails, we get the following results.
| | Actual class: Spam | Actual class: Non-spam |
|---|---|---|
| Predicted class: Spam | 85 | 100 |
| Predicted class: Non-spam | 15 | 800 |
This confusion matrix shows the mistakes the system makes in its predictions. But what if we could summarize this matrix in one number? A number that would tell us whether the system is any good?
How "accurate" are our system's predictions
Let's say our system actually finished its graduation exam, and it is now time to grade it. We can count how many good answers the system gave: that is \(85 + 800 = 885\) good answers. Then we give the system a grade, which is its number of good answers divided by the number of questions. Sounds like what would be done in a common exam, right? Well, this is the metric called accuracy in machine learning. \[ \text{Accuracy} = \frac{\text{\# of good predictions}}{\text{\# of samples}} \] In the above example with the spam filter, \(\text{Accuracy} = \frac{85 + 800}{85 + 100 + 15 + 800} = 0.885\).
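To make this concrete, here is a minimal Python sketch; the function name and the tp/fp/fn/tn argument names are my own convention for the four confusion-matrix counts, not something defined above.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of correct predictions over all samples."""
    return (tp + tn) / (tp + fp + fn + tn)

# Counts taken from the confusion matrix above:
# tp = spams predicted as spam, fp = non-spams predicted as spam,
# fn = spams predicted as non-spam, tn = non-spams predicted as non-spam.
print(accuracy(tp=85, fp=100, fn=15, tn=800))  # 0.885
```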
Accuracy sums all answers regardless of their class. This would be fine if our classes were balanced, that is, if they contained roughly equal numbers of samples. However, in the case of spam filtering (and this is the case for many other applications, commonly denoted as anomaly detection), the "negative" class (non-spam) is usually a lot more populated than the "positive" one (spam), which should be viewed as the class of outliers of the distribution of "non-spam" emails. Let's consider the following confusion matrix.
| | Actual class: Spam | Actual class: Non-spam |
|---|---|---|
| Predicted class: Spam | 0 | 15 |
| Predicted class: Non-spam | 100 | 885 |
Can you see how incredibly useless this system is at predicting spam? It never identifies a single spam email, yet accuracy is unchanged: \(\frac{0 + 885}{0 + 15 + 100 + 885} = 0.885\). Ironically, this shows why accuracy is not a reliable metric for assessing how accurate a system really is. Precision and recall are more fine-grained metrics, in that they measure the performance of the system with respect to one chosen class, the "spam" class in our example. Let's look at them.
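As a quick sanity check (same hypothetical tp/fp/fn/tn naming as before), both confusion matrices indeed yield exactly the same accuracy:

```python
# (tp, fp, fn, tn) counts for the two confusion matrices above.
matrices = {
    "first filter":   (85, 100, 15, 800),
    "useless filter": (0, 15, 100, 885),
}
for name, (tp, fp, fn, tn) in matrices.items():
    print(name, (tp + tn) / (tp + fp + fn + tn))  # 0.885 in both cases
```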
Precision and recall
We just have to refine the way we grade the exam. Let's consider only one class, the positive one, and look at how well our system performs on it. This is the idea behind recall: it concentrates the metric on one class only and measures how well the system predicts it.
There is a bias in how we choose which class to consider as positive. In the case of spam filtering, the system should be able to recognize spam among normal emails, so we take spam as the positive class and want its recall to be high. The recall metric does not care how you handle non-spam, as long as you correctly recognize spam. \[ \text{Recall} = \frac{\text{\# of correctly classified spams}}{\text{\# of spams}} \] Ok. But what if your system thinks everything is spam?
| | Actual class: Spam | Actual class: Non-spam |
|---|---|---|
| Predicted class: Spam | 100 | 900 |
| Predicted class: Non-spam | 0 | 0 |
It gets \(\text{Recall} = 1\), the maximum score! So we also need a metric telling us how precise the system is when annotating a mail as "spam". For this we restrict ourselves to the set of mails predicted as "spam" by the system. \[ \text{Precision} = \frac{\text{\# of correctly classified spams}}{\text{\# of mails classified as "spams"}} \] So the system has maximum precision when it selects spams without errors. For the previous example, the precision would be \(\frac{100}{100 + 900} = 0.1\).
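Here is a small sketch of both metrics, again using my own tp/fp/fn naming for the counts, applied to the "everything is spam" system above:

```python
def recall(tp: int, fn: int) -> float:
    """Correctly classified spams over the total number of spams."""
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    """Correctly classified spams over the mails classified as spam."""
    return tp / (tp + fp)

# The "everything is spam" system above.
print(recall(tp=100, fn=0))       # 1.0 -> perfect recall
print(precision(tp=100, fp=900))  # 0.1 -> very poor precision
```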
We have constructed two metrics, recall and precision, which are more refined than accuracy, by selecting a "positive" class ("spam") as the target of our classification problem.
F1-score, because one number is better than two
The optimal system for our dataset would have maximum recall and precision scores. By taking the harmonic mean of precision and recall, the F1-score gives a single measure of overall performance that we can use to grade our system's answers to the exam. \[ \text{F1-score} = 2\cdot \frac{\text{precision}\cdot \text{recall}}{\text{precision}+\text{recall}} \] The harmonic mean is preferred over the arithmetic mean because it penalizes bad scores: if recall is high but precision is low, the harmonic mean will be low, since it is dominated by the smaller of its two arguments.
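As a final sketch, here is the F1-score computed for the two systems we have seen; the small helper below is my own, and libraries such as scikit-learn offer an f1_score function that works directly on predicted and true labels instead.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# "Everything is spam" system: precision = 0.1, recall = 1.0.
print(f1(0.1, 1.0))            # ~0.18 (the arithmetic mean would say 0.55)

# First spam filter: precision = 85/185 ~ 0.46, recall = 85/100 = 0.85.
print(f1(85 / 185, 85 / 100))  # ~0.60
```

The degenerate "everything is spam" system now scores far below the first filter, which matches our intuition much better than accuracy did.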