
Dive into Confusion Matrix and F1 Score

Learn what the confusion matrix and the F1 Score are and how to interpret them.


Careful evaluation is essential when judging computer vision models. How accurate is a particular model, and how effectively does it handle the task at hand? To gauge performance, we use several metrics, especially the confusion matrix and the F1 score.

Both metrics are tailored for binary classification tasks, where the model decides between “Yes” and “No.” For instance, an email might be spam or legitimate, a tumor might be harmful or benign, and a vehicle might break parking regulations or remain compliant. Having these two outcomes means the confusion matrix not only shows the total accuracy but also pinpoints how the model makes mistakes. Picture it as a coin: one side shows how accurate the model is, and the other side details its errors. Meanwhile, the F1 score offers a single measure that captures both precision and recall.

By the time you finish reading, you will understand:

  • Confusion Matrix in the context of email spam detection
  • True Positive, True Negative, False Positive, and False Negative
  • Accuracy, Precision, Recall, and Specificity
  • The F1 Score

Confusion Matrix

A confusion matrix is a simple 2×2 table that measures how well a binary classification model performs. Instead of only telling you how many predictions are correct, it breaks them down into true positives, false positives, false negatives, and true negatives.

For email spam detection, the confusion matrix looks like this:

Confusion Matrix For Email Spam

True Positive (TP): The model accurately flags spam emails as spam.

True Positive | Mark Spam as Spam

False Positive (FP): The model wrongly marks a legitimate email as spam.

False Positive | Mark Non-Spam as Spam

True Negative (TN): The model accurately identifies a legitimate email as legitimate.

True Negative | Mark Non-Spam as Non-Spam

False Negative (FN): The model overlooks a spam email and labels it as legitimate.

False Negative | Mark Spam as Non-Spam

Understanding these four outcomes is vital for calculating the performance metrics discussed next. Suppose, for the sake of example, that we run our AI model on 100 emails and get these results:

Results:

                       Predicted: Spam    Predicted: Non-Spam
  Actual: Spam         75 (TP)            5 (FN)
  Actual: Non-Spam     5 (FP)             15 (TN)

We can see that we had 100 emails in total. Our model classified them as shown in the table: 75 spam emails correctly flagged (TP), 5 legitimate emails wrongly flagged as spam (FP), 5 spam emails missed (FN), and 15 legitimate emails correctly identified (TN). Now, we can run different analyses to find out how well our model is doing.
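To make the counting concrete, here is a minimal Python sketch (not from the original post) that tallies the four outcomes from lists of actual and predicted labels, assuming "spam" is treated as the positive class:

    # Tally confusion-matrix cells for a binary spam classifier.
    # "spam" is the positive class; labels are plain strings.
    def confusion_counts(actual, predicted, positive="spam"):
        tp = fp = fn = tn = 0
        for a, p in zip(actual, predicted):
            if p == positive and a == positive:
                tp += 1   # spam correctly flagged as spam (TP)
            elif p == positive:
                fp += 1   # legitimate email wrongly flagged as spam (FP)
            elif a == positive:
                fn += 1   # spam that slipped through as legitimate (FN)
            else:
                tn += 1   # legitimate email correctly left alone (TN)
        return tp, fp, fn, tn

    actual    = ["spam", "spam", "legit", "legit", "spam"]   # made-up labels for illustration
    predicted = ["spam", "legit", "legit", "spam", "spam"]
    print(confusion_counts(actual, predicted))               # (2, 1, 1, 1)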

Key Metrics Derived from the Confusion Matrix

From TP, FP, FN, and TN, we can compute several metrics that run from 0 to 1, with 1 being perfect. In an ideal world, each metric would be close to 1, signifying a highly effective model.

  1. Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Our Accuracy: (75+15)/(75+5+15+5) = 90/100 = 90%, a high accuracy score.
  • Our model is, in general, accurate.
  • Accuracy shines in datasets where the positive and negative classes are fairly balanced.
  • In imbalanced datasets, accuracy alone can be deceptive. For example, if 95% of samples fall under one class, always predicting that class yields 95% accuracy but offers little practical value.
  2. Precision
Precision = TP / (TP + FP)
  • Our Precision: 75/(75+5) = 75/80 = 93.75%, a very high precision.
  • Most of the emails that the model flags as spam are actually spam.
  • Precision matters most where false positives are expensive, such as fraud detection or spam filtering.
  3. Recall
Recall = TP / (TP + FN)
  • Our Recall: 75/(75+5) = 75/80 = 93.75%, a very high recall.
  • Most of the actual spam emails are correctly caught by the model.
  • Recall is essential where missing a positive case has severe consequences, like in diagnosing diseases or detecting security breaches. (A short Python sketch after this list reproduces all three calculations.)
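Here is a minimal sketch of those three calculations in Python, plugging in the counts assumed from the example table above (TP = 75, FP = 5, FN = 5, TN = 15):

    # Accuracy, precision, and recall from the four confusion-matrix cells.
    tp, fp, fn, tn = 75, 5, 5, 15   # counts from the example table above

    accuracy  = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
    precision = tp / (tp + fp)                    # of everything flagged as spam, how much really is spam
    recall    = tp / (tp + fn)                    # of all actual spam, how much was caught

    print(f"Accuracy:  {accuracy:.2%}")    # 90.00%
    print(f"Precision: {precision:.2%}")   # 93.75%
    print(f"Recall:    {recall:.2%}")      # 93.75%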

F1 Score

One major drawback of depending solely on accuracy is that it can distort performance in imbalanced datasets, where one class is much larger than the other. For instance, in spam detection, legitimate emails typically outnumber spam by a significant margin.

Imagine a dataset with 90 legitimate emails and 10 spam emails. A model labeling all 100 as legitimate reaches 90% accuracy but catches no spam at all, making it practically ineffective.

Accuracy is misleading in imbalanced datasets
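A few lines of Python make the trap obvious. The sketch below uses the hypothetical 90/10 split from above and a "model" that simply predicts "legit" for every email:

    # 90 legitimate emails, 10 spam emails; the model labels everything as legitimate.
    actual    = ["legit"] * 90 + ["spam"] * 10
    predicted = ["legit"] * 100

    accuracy    = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
    spam_caught = sum(a == "spam" and p == "spam" for a, p in zip(actual, predicted))

    print(f"Accuracy: {accuracy:.0%}")        # 90% -- looks impressive on paper
    print(f"Spam caught: {spam_caught}/10")   # 0/10 -- useless for its actual job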

To tackle this limitation, analysts frequently use the F1 Score, which is the harmonic mean of precision and recall. It’s better suited for situations with uneven class distributions. The F1 Score goes from 0 to 1, hitting 1 only if both precision and recall are perfect. Although an ideal score is rare, data scientists aim to get as close to 1 as possible.

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Our F1 Score is then 2 * (0.9375 * 0.9375) / (0.9375 + 0.9375) = 0.9375. This means that our model is, in general, effective at detecting spam emails.
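The harmonic mean is easy to verify in a couple of lines (a quick sketch using the precision and recall values computed earlier):

    # F1 is the harmonic mean of precision and recall.
    precision, recall = 0.9375, 0.9375
    f1 = 2 * (precision * recall) / (precision + recall)
    print(f"F1 Score: {f1:.4f}")   # 0.9375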

Example: Fraud Detection

  • High precision lowers the risk of tagging legitimate users as fraudsters.
  • High recall decreases the chances of overlooking actual fraudulent activities.
  • The F1 Score finds a balanced midpoint between these factors.

Typical Use Cases

  • For balanced datasets, accuracy usually works well.
  • For cancer screening, recall is crucial—missing even a single positive can be dangerous.
  • In spam detection, precision is vital—misclassifying key emails as spam is problematic.
  • In many real-world applications, the F1 Score is prized for balancing both precision and recall.

Conclusion

Although the confusion matrix offers a thorough breakdown of predictions, the F1 Score combines precision and recall into one convenient metric:

  • Choose accuracy for balanced data.
  • Focus on precision when false positives have serious consequences.
  • Emphasize recall when false negatives pose major risks.
  • Opt for the F1 Score when both precision and recall are equally important.

Knowing these metrics is essential for picking the right evaluation framework in machine learning.


Explore More

  1. YOLOv8-Seg Deployment using TensorRT and ONNX
  2. Data Annotation with Segment Anything Model (SAM)
  3. Unitlab AI: Data Collection and Annotation for LLMs and Generative AI
