Accuracy, Precision, Recall & F1 — Choosing the Right Metric
A scroll-driven visual deep dive into classification metrics. Learn why accuracy misleads, what precision and recall actually measure, and when to use F1, F2, or something else entirely.
99% accuracy.
Completely useless.
A model that predicts “no cancer” for every patient achieves 99% accuracy if only 1% have cancer. Accuracy hides what matters. Precision tells you how trustworthy positive predictions are. Recall tells you how many positives you actually catch. F1 balances both. Choosing the right metric is choosing what errors you can afford.
The Accuracy Paradox
Accuracy and its limits
Accuracy measures the fraction of all predictions that are correct — both positive and negative combined. It's the most intuitive metric but can be dangerously misleading on imbalanced datasets.
Accuracy = (TP + TN) / (TP + TN + FP + FN)

On a 99/1 class split, predicting ALL samples as negative yields 99% accuracy while doing nothing useful. Accuracy hides the failure because it weights every prediction equally, so the overwhelming majority class dominates the score.
To evaluate performance on the minority (positive) class, you need precision (how trustworthy are positive predictions?) and recall (how many actual positives did we catch?).
A fraud detection model processes 10,000 transactions: 9,900 legitimate, 100 fraudulent. It predicts ALL transactions as legitimate. What's its accuracy?
💡 How many of the 10,000 predictions are correct if you always predict legitimate?
9,900/10,000 = 99% accuracy. But recall for fraud = 0/100 = 0%. This model is useless for its intended purpose despite 'high accuracy.' In fraud detection, catching the 100 bad transactions is the ENTIRE point. This is why you must use precision, recall, and F1 for imbalanced problems — accuracy hides the failures that matter most.
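You can replay this quiz in a few lines. The snippet below is a minimal sketch, assuming scikit-learn is installed, scoring the always-legitimate "model" on the same 10,000 transactions:

```python
# The fraud quiz above, replayed in code: 10,000 transactions, 100 fraudulent,
# and a "model" that labels every transaction as legitimate (class 0).
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1] * 100 + [0] * 9_900   # 1 = fraud, 0 = legitimate
y_pred = [0] * 10_000              # always predict "legitimate"

print(accuracy_score(y_true, y_pred))                    # 0.99 -> looks impressive
print(recall_score(y_true, y_pred))                      # 0.0  -> catches zero fraud
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0  -> it never predicts fraud at all
```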
The Two Questions Every Classifier Must Answer
A cancer screening test has 95% recall and 10% precision. What does this mean in plain English?
💡 In medicine, what's worse: telling a healthy person to get a follow-up test, or telling a cancer patient they're fine?
In plain English: the test catches 95% of people who actually have cancer, but only 10% of the people it flags turn out to have it. For screening, recall is king: you want to catch EVERY cancer, even if it means many false alarms. Those false positives go through follow-up testing (biopsy, etc.). Missing a cancer (false negative) could be fatal. So 95% recall + 10% precision is actually a reasonable screening test! The follow-up test should have HIGH PRECISION to confirm which of those positives are real. This two-stage pipeline (high-recall screen → high-precision confirmation) is standard in medicine.
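To make those two numbers concrete, here is a small back-of-the-envelope check. The counts are hypothetical, chosen only so that they reproduce 95% recall and 10% precision:

```python
# Hypothetical screening counts that produce 95% recall and 10% precision.
tp = 950     # people with cancer who were flagged
fn = 50      # people with cancer who were missed
fp = 8_550   # healthy people flagged for follow-up

recall = tp / (tp + fn)       # 950 / 1000 = 0.95 -> we catch 95% of cancers
precision = tp / (tp + fp)    # 950 / 9500 = 0.10 -> only 10% of flags are real
print(f"recall={recall:.2f}, precision={precision:.2f}")
```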
F1: Balancing Precision and Recall
The F1 score and its variants
F1 is the harmonic mean of precision and recall. Unlike the arithmetic mean, it heavily penalizes imbalance between the two: if either metric is near zero, F1 collapses to near zero.
F1 = 2 × (Precision × Recall) / (Precision + Recall)

With Precision = 1.0 and Recall = 0.0, the arithmetic mean would be 0.5 (looks okay), but F1 = 0 (reveals the failure). The harmonic mean forces BOTH metrics to be high.
F-beta lets you weight precision vs recall asymmetrically. β=1 is standard F1 (balanced), β=2 weights recall twice as much, β=0.5 weights precision twice as much.
F_β = (1 + β²) × (P × R) / (β² × P + R)

Use F2 when missing positives is dangerous (cancer screening — catch every case even at the cost of false alarms). Use F0.5 when false positives are costly (spam filtering — don't lose real emails).
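The sketch below implements the F-beta formula above directly in plain Python and shows how β shifts the score between a recall-heavy and a precision-heavy classifier (the two precision/recall pairs are illustrative):

```python
# Direct implementation of the F-beta formula above.
def f_beta(p: float, r: float, beta: float) -> float:
    if p == 0 and r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

# A recall-heavy classifier (P=0.3, R=0.9) vs a precision-heavy one (P=0.9, R=0.3).
for p, r in [(0.3, 0.9), (0.9, 0.3)]:
    print(f"P={p}, R={r}: "
          f"F0.5={f_beta(p, r, 0.5):.3f}, F1={f_beta(p, r, 1):.3f}, F2={f_beta(p, r, 2):.3f}")
# F1 scores both classifiers identically (0.450); F2 favors the recall-heavy one
# (0.643 vs 0.346) and F0.5 favors the precision-heavy one (0.346 vs 0.643).
```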
Precision = 0.6, Recall = 0.6. What's the F1 score? Now: Precision = 0.9, Recall = 0.3. What's the F1?
💡 Harmonic mean = 2ab/(a+b). Try both pairs...
F1(0.6, 0.6) = 2(0.6)(0.6)/(0.6+0.6) = 0.72/1.2 = 0.6. F1(0.9, 0.3) = 2(0.9)(0.3)/(0.9+0.3) = 0.54/1.2 = 0.45. The arithmetic mean of 0.9 and 0.3 is 0.6 — but F1 is 0.45 because the harmonic mean heavily penalizes the imbalance. This is by design: a model with 90% precision but only 30% recall is NOT as good as one with 60%/60%. The harmonic mean captures this intuition.
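The same arithmetic in code, comparing the harmonic mean (F1) against the arithmetic mean for both pairs from the question:

```python
# Harmonic mean (F1) vs arithmetic mean for the two precision/recall pairs above.
for p, r in [(0.6, 0.6), (0.9, 0.3)]:
    arithmetic = (p + r) / 2
    f1 = 2 * p * r / (p + r)
    print(f"P={p}, R={r}: arithmetic mean={arithmetic:.2f}, F1={f1:.2f}")
# P=0.6, R=0.6: arithmetic mean=0.60, F1=0.60
# P=0.9, R=0.3: arithmetic mean=0.60, F1=0.45
```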
You Can’t Maximize Both
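Precision and recall are coupled through the decision threshold: flag more cases and recall rises while precision falls; flag fewer and the reverse happens. The sketch below illustrates the tradeoff on synthetic, imbalanced data (assumptions: scikit-learn is available, and the dataset and logistic regression model are placeholders, not a recommendation):

```python
# A minimal sketch of the precision/recall tradeoff as the threshold moves.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predicted probability of the positive class for each test sample.
scores = LogisticRegression(max_iter=1_000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Sweep the decision threshold: a low threshold flags more positives (recall up,
# precision down); a high threshold flags fewer (precision up, recall down).
for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    preds = (scores >= t).astype(int)
    tp = int(np.sum((preds == 1) & (y_test == 1)))
    fp = int(np.sum((preds == 1) & (y_test == 0)))
    fn = int(np.sum((preds == 0) & (y_test == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"threshold={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```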
Which Metric for Which Problem?
🎓 What You Now Know
✓ Accuracy lies on imbalanced data — Always check class distribution first.
✓ Precision = trust in positive predictions — Minimize false alarms (FP).
✓ Recall = coverage of actual positives — Minimize missed cases (FN).
✓ F1 = harmonic mean, punishes imbalance — Use F2 for recall-heavy, F0.5 for precision-heavy.
✓ Match metric to business cost — Missing cancer ≠ missing spam. Choose accordingly.
The metric you choose IS your optimization objective. A model optimized for accuracy on imbalanced data will learn to ignore the minority class. A model optimized for recall will find every positive at the cost of false alarms. There’s no “best” metric — only the right metric for YOUR problem. 📏
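In practice, this choice shows up as the scoring argument you hand to model selection. A minimal sketch, assuming scikit-learn and using a placeholder model and dataset:

```python
# The metric passed as `scoring` is what cross-validation reports and what
# model selection (e.g., a grid search) would optimize.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)
model = LogisticRegression(max_iter=1_000)

for metric in ("accuracy", "precision", "recall", "f1"):
    score = cross_val_score(model, X, y, scoring=metric, cv=5).mean()
    print(f"{metric:>9}: {score:.3f}")
# On imbalanced data, accuracy looks flattering while recall and F1 expose
# how much of the minority class the model actually catches.
```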
↗ Keep Learning
Confusion Matrix Deep Dive — What Your Model Gets Wrong and Why
A scroll-driven deep dive into the confusion matrix. Master TP, TN, FP, FN, and learn to derive every classification metric from a single 2×2 table.
ROC Curves & AUC — Measuring Classifier Performance Visually
A scroll-driven visual deep dive into ROC curves and AUC. Learn TPR vs FPR, why AUC is threshold-independent, and when to use ROC vs PR curves.
Logistic Regression — The Classifier That's Not Really Regression
A scroll-driven visual deep dive into logistic regression. Learn how a regression model becomes a classifier, why the sigmoid is the key, and how log-loss trains it.