13 min deep dive · machine-learning · metrics

Accuracy, Precision, Recall & F1 — Choosing the Right Metric

A scroll-driven visual deep dive into classification metrics. Learn why accuracy misleads, what precision and recall actually measure, and when to use F1, F2, or something else entirely.

Introduction

Classification Metrics

99% accuracy.
Completely useless.

A model that predicts “no cancer” for every patient achieves 99% accuracy if only 1% have cancer. Accuracy hides what matters. Precision tells you how trustworthy positive predictions are. Recall tells you how many positives you actually catch. F1 balances both. Choosing the right metric is choosing what errors you can afford.

Accuracy Trap

The Accuracy Paradox

Accuracy and its limits

🎯 Accuracy Formula

Accuracy measures the fraction of all predictions that are correct — both positive and negative combined. It's the most intuitive metric but can be dangerously misleading on imbalanced datasets.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
⚠️ The Imbalance Trap

On a 99/1 class split, predicting ALL samples as negative yields 99% accuracy while doing nothing useful. Accuracy hides the failure because it treats every prediction the same, so the majority class dominates the score.
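To see the trap in numbers, here is a minimal sketch using scikit-learn (the 1-in-100 positive rate and the always-negative "model" mirror the example above; the library choice is an assumption, not something the article prescribes):

```python
# Minimal sketch of the imbalance trap: 1 positive in 100 samples,
# and a "model" that always predicts negative. Requires scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] + [0] * 99)   # 1% positive class
y_pred = np.zeros_like(y_true)      # always predict negative

print(accuracy_score(y_true, y_pred))                 # 0.99 -- looks great
print(recall_score(y_true, y_pred, zero_division=0))  # 0.0  -- catches no positives
```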

🔍 Class-Focused Metrics

To evaluate performance on the minority (positive) class, you need precision (how trustworthy are positive predictions?) and recall (how many actual positives did we catch?).

🟢 Quick Check

A fraud detection model processes 10,000 transactions: 9,900 legitimate, 100 fraudulent. It predicts ALL transactions as legitimate. What's its accuracy?

Precision & Recall

The Two Questions Every Classifier Must Answer

Precision: ”When I say YES, am I right?”
TP / (TP + FP)
High precision = few false alarms
Matters when: FP is costly
e.g., spam filter (don’t lose real email); criminal sentencing

Recall: ”Of all actual YES, how many did I find?”
TP / (TP + FN)
High recall = few missed positives
Matters when: FN is costly
e.g., cancer screening (don’t miss cases); airport security
Precision: of predicted positives, how many are correct? Recall: of actual positives, how many did we find?
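As a quick sketch, both metrics fall straight out of the confusion-matrix counts in the formulas above (the TP/FP/FN numbers here are made up for illustration):

```python
# Precision and recall from raw confusion-matrix counts.
# The counts are illustrative, not from the article.
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)   # "When I say YES, am I right?"             -> 0.80
recall    = tp / (tp + fn)   # "Of all actual YES, how many did I find?" -> ~0.67

print(f"precision={precision:.2f}  recall={recall:.2f}")
```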
🟡 Checkpoint

A cancer screening test has 95% recall and 10% precision. In plain English, this means:

F1 Score

F1: Balancing Precision and Recall

The F1 score and its variants

⚖️ F1 Score

The harmonic mean of precision and recall. Unlike the arithmetic mean, it heavily penalizes extreme imbalances — if either metric is near zero, F1 collapses to near zero.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
📉 Harmonic Mean Effect

With Precision=1.0 and Recall=0.0, the arithmetic mean would be 0.5 (looks okay), but F1 = 0 (reveals the failure). The harmonic mean forces BOTH metrics to be high.
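A two-line check of that claim (the extreme precision/recall pair is the one from the text):

```python
# Arithmetic vs. harmonic mean for precision=1.0, recall=0.0.
precision, recall = 1.0, 0.0

arithmetic = (precision + recall) / 2               # 0.5 -- looks okay
f1 = 2 * precision * recall / (precision + recall)  # 0.0 -- reveals the failure

print(arithmetic, f1)
```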

🎛️ F-beta Generalization

F-beta lets you weight precision vs recall asymmetrically. β=1 is standard F1 (balanced), β=2 weights recall twice as much, β=0.5 weights precision twice as much.

F_β = (1 + β²) × (P × R) / (β²P + R)
💡 Choosing Your Beta

Use F2 when missing positives is dangerous (cancer screening — catch every case even at the cost of false alarms). Use F0.5 when false positives are costly (spam filtering — don't lose real emails).
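The formula is easy to evaluate directly; the sketch below uses an illustrative recall-heavy operating point (precision 0.5, recall 0.9, numbers not from the article) to show how β shifts the score:

```python
# F-beta from the formula above. beta > 1 favors recall, beta < 1 favors precision.
def f_beta(p, r, beta):
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p, r = 0.5, 0.9                 # illustrative: low precision, high recall
print(f_beta(p, r, 1.0))        # F1   ~0.64 -- balanced
print(f_beta(p, r, 2.0))        # F2   ~0.78 -- rewards the high recall
print(f_beta(p, r, 0.5))        # F0.5 ~0.55 -- penalizes the low precision
```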

🟡 Checkpoint

Precision = 0.6, Recall = 0.6. What's the F1 score? Now: Precision = 0.9, Recall = 0.3. What's the F1?

The Tradeoff

You Can’t Maximize Both
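The usual reason: most classifiers output a score, and the decision threshold slides precision and recall in opposite directions. The sketch below assumes a synthetic imbalanced dataset and a logistic regression (all illustrative, not from the article) and sweeps that threshold:

```python
# Precision/recall tradeoff: sweep one model's decision threshold and watch
# one metric rise as the other falls. Dataset and model are synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    y_pred = (scores >= t).astype(int)
    p = precision_score(y, y_pred, zero_division=0)
    r = recall_score(y, y_pred, zero_division=0)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```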

Choosing Metrics

Which Metric for Which Problem?

FP is costly → Optimize PRECISION (spam filter, drug approval, criminal conviction)
FN is costly → Optimize RECALL (cancer screening, fraud detection, security)
Both matter equally → F1 (general classification, search relevance)
Balanced classes → Accuracy is fine (50/50 split, equal error costs)
⚠ Never report accuracy alone on imbalanced data: always report precision, recall, F1, and the class distribution (see the reporting sketch below the caption).
Match your metric to your error cost structure
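One convenient way to follow that advice (assuming scikit-learn; the labels below are made up) is classification_report, which prints precision, recall, F1, and per-class support in a single call:

```python
# Report precision, recall, F1, and support (class counts) together.
from sklearn.metrics import classification_report

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # imbalanced toy labels
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # one false alarm, one miss

print(classification_report(y_true, y_pred, zero_division=0))
```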

🎓 What You Now Know

Accuracy lies on imbalanced data — Always check class distribution first.

Precision = trust in positive predictions — Minimize false alarms (FP).

Recall = coverage of actual positives — Minimize missed cases (FN).

F1 = harmonic mean, punishes imbalance — Use F2 for recall-heavy, F0.5 for precision-heavy.

Match metric to business cost — Missing cancer ≠ missing spam. Choose accordingly.

The metric you choose IS your optimization objective. A model optimized for accuracy on imbalanced data will learn to ignore the minority class. A model optimized for recall will find every positive at the cost of false alarms. There’s no “best” metric — only the right metric for YOUR problem. 📏
