
Confusion Matrix Deep Dive — What Your Model Gets Wrong and Why

A scroll-driven deep dive into the confusion matrix. Master TP, TN, FP, FN, and learn to derive every classification metric from a single 2×2 table.

Introduction

Model Evaluation

Four cells.
Every classification metric.

Accuracy, precision, recall, F1, specificity, balanced accuracy, MCC — every single one of these metrics is derived from just four numbers arranged in a 2×2 table. Master the confusion matrix and you’ll never be confused by metrics again.

Four Quadrants

The 2×2 Table

                  Predicted Positive              Predicted Negative
Actual Positive   TP (True Positive)              FN (False Negative)
                  Correctly caught ✓              Missed! Type II error ✗
Actual Negative   FP (False Positive)             TN (True Negative)
                  False alarm! Type I error ✗     Correctly ignored ✓

The confusion matrix: actual labels (rows) vs. predicted labels (columns). Every sample lands in exactly one cell.
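To make the four cells concrete, here's a minimal Python sketch that tallies them by hand from made-up labels. In practice you'd typically reach for something like sklearn.metrics.confusion_matrix, which returns the same counts with actual labels as rows and predicted labels as columns (ordered with the negative class first for 0/1 labels).

```python
# A minimal sketch: counting the four cells by hand for a binary problem.
# y_true / y_pred are made-up example labels (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # caught
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed (Type II)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarm (Type I)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correctly ignored

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=3 FN=2 FP=1 TN=4
```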
🟢 Knowledge Check: Quick Check

A cancer screening test gives 100 results: TP=40, FN=10, FP=5, TN=45. How many actual cancer cases were there?

Derived Metrics

Every Metric From Four Numbers

Metrics derived from the confusion matrix

🎯 Accuracy

Overall correctness — the fraction of all predictions that were right. Sounds good, but it's misleading on imbalanced data where always predicting the majority class gets high accuracy.

(TP + TN) / (TP + TN + FP + FN)
🔍 Precision

Of all the positive PREDICTIONS, how many were actually correct? Measures how trustworthy your alarms are — high precision means few false alarms.

TP / (TP + FP)
🎣 Recall

Of all the actual POSITIVES, how many did the model find? Measures detection thoroughness — high recall means few missed cases.

TP / (TP + FN)
🛡️ Specificity

Of all the actual NEGATIVES, how many were correctly identified as negative? The 'recall for the negative class' — important when false positives are costly.

TN / (TN + FP)
⚖️ F1 Score

The harmonic mean of precision and recall, balancing both into a single number. Useful when you can't afford to ignore either false positives or false negatives.

2 × (Precision × Recall) / (Precision + Recall)
📊 MCC

Matthews Correlation Coefficient: the most balanced metric, using all four quadrants symmetrically. Ranges from −1 to +1, where 0 means no better than random — robust even on imbalanced data.

(TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
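All six formulas above reduce to a few lines of arithmetic once you have the four cells. A minimal sketch, reusing the made-up counts from the earlier example (TP=3, FN=2, FP=1, TN=4) and guarding against division by zero:

```python
import math

def classification_metrics(tp, fn, fp, tn):
    """Derive the standard metrics from the four confusion-matrix cells."""
    total = tp + tn + fp + fn
    accuracy    = (tp + tn) / total
    precision   = tp / (tp + fp) if (tp + fp) else 0.0
    recall      = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return dict(accuracy=accuracy, precision=precision, recall=recall,
                specificity=specificity, f1=f1, mcc=mcc)

# Made-up counts from the sketch above.
print(classification_metrics(tp=3, fn=2, fp=1, tn=4))
# accuracy=0.70, precision=0.75, recall=0.60, specificity=0.80, f1≈0.67, mcc≈0.41
```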
🟡 Knowledge Check: Checkpoint

A model predicts 'not fraud' for ALL 10,000 transactions. 9,900 are legitimate, 100 are fraud. What's its accuracy and recall?

Multi-Class

Beyond 2×2: Multi-Class Confusion Matrices

For K classes, the confusion matrix is K×K. Each cell (i, j) shows how many samples from class i were predicted as class j.

K×K matrix → one-vs-all per class → per-class precision, recall, F1 → aggregate: micro-avg (sum cells), macro-avg (mean of metrics), or weighted-avg (by support)
Multi-class strategy: compute per-class metrics, then aggregate.
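To make the K×K case concrete, here's a minimal sketch that tallies a 3-class confusion matrix by hand; the class names and predictions are invented for illustration.

```python
# A minimal sketch of a K×K confusion matrix for 3 classes (labels are made up).
classes = ["cat", "dog", "bird"]
y_true = ["cat", "cat", "dog", "dog", "bird", "cat", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat", "cat", "bird"]

idx = {c: i for i, c in enumerate(classes)}
K = len(classes)
cm = [[0] * K for _ in range(K)]   # cm[i][j]: samples of actual class i predicted as class j
for t, p in zip(y_true, y_pred):
    cm[idx[t]][idx[p]] += 1

for name, row in zip(classes, cm):
    print(f"{name:>5}: {row}")
#   cat: [2, 1, 0]
#   dog: [1, 2, 0]
#  bird: [0, 0, 2]
```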

Averaging strategies

🔢 Micro-Average

Pool all TP, FP, and FN across classes globally, then compute the metric once. Treats every sample equally, giving more weight to larger classes.

⚖️ Macro-Average

Compute the metric independently for each class, then take the unweighted mean. Treats every CLASS equally, so poor performance on rare classes drags down the average.

📊 Weighted-Average

Like macro-averaging but each class's metric is weighted by its support (number of samples). A compromise between micro and macro that accounts for class frequency.
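Here's a sketch of how the three averages can differ on the same hypothetical 3-class confusion matrix; in practice, sklearn.metrics.precision_recall_fscore_support exposes the same choices through its average parameter, but the manual version makes the weighting explicit.

```python
# Hypothetical 3-class confusion matrix (rows = actual, cols = predicted).
cm = [
    [950, 30, 20],   # actual A (support 1000)
    [ 25, 20,  5],   # actual B (support 50)
    [  5,  0, 45],   # actual C (support 50)
]

def f1(tp, fp, fn):
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * p * r / (p + r) if (p + r) else 0.0

# One-vs-all counts per class i: TP = diagonal, FN = rest of row, FP = rest of column.
tp = [cm[i][i] for i in range(3)]
fn = [sum(cm[i]) - cm[i][i] for i in range(3)]
fp = [sum(cm[r][i] for r in range(3)) - cm[i][i] for i in range(3)]
support = [sum(row) for row in cm]

per_class_f1 = [f1(tp[i], fp[i], fn[i]) for i in range(3)]

macro_f1 = sum(per_class_f1) / 3                                        # every class equal
weighted_f1 = sum(f * s for f, s in zip(per_class_f1, support)) / sum(support)
micro_f1 = f1(sum(tp), sum(fp), sum(fn))                                # every sample equal

print(f"per-class F1: {[round(f, 3) for f in per_class_f1]}")
print(f"macro={macro_f1:.3f}  weighted={weighted_f1:.3f}  micro={micro_f1:.3f}")
# macro≈0.703 is dragged down by the rare class B; micro≈0.923 is dominated by class A.
```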

🔴 Knowledge Check: Challenge

In a 3-class problem (A:1000, B:50, C:50), Class B has F1=0.2, the others have F1=0.95. Macro-F1 vs Micro-F1: which is lower?

Common Mistakes

Pitfalls to Avoid

Common pitfalls checklist:

  • Reporting accuracy on imbalanced data — use the confusion matrix to show the FULL picture
  • Ignoring off-diagonal patterns — in multi-class, which classes get confused with each other? The confusion matrix reveals systematic errors (e.g., always confusing “cat” with “dog”)
  • Forgetting about prevalence — precision depends on class balance; compare models on the SAME test set
  • Using a fixed threshold — the confusion matrix is threshold-dependent; consider sweeping thresholds (ROC curve), as in the sketch below
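To illustrate that last point, the sketch below sweeps the decision threshold over hypothetical predicted probabilities and shows how the four cells (and with them precision and recall) shift; libraries such as scikit-learn wrap the same idea in roc_curve and precision_recall_curve.

```python
# Hypothetical ground-truth labels and predicted probabilities of the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.95, 0.80, 0.55, 0.30, 0.70, 0.45, 0.20, 0.15, 0.10, 0.05]

for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"t={threshold:.1f}  TP={tp} FN={fn} FP={fp} TN={tn}  "
          f"precision={precision:.2f} recall={recall:.2f}")
# Lower thresholds trade false negatives for false positives; the matrix is not fixed.
```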
In Practice

Practical Usage
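In day-to-day work, printing the matrix and a per-class report takes only a couple of lines. A minimal sketch assuming scikit-learn is available; y_test and y_pred are placeholder names standing in for your held-out labels and model predictions:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Placeholders for your held-out labels and model predictions.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_test, y_pred))       # rows = actual, cols = predicted (labels sorted, so 0 first)
print(classification_report(y_test, y_pred))  # per-class precision/recall/F1 plus macro and weighted averages
```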

🟡 Knowledge Check: Checkpoint

You built a model to detect rare disease (1% prevalence). Your confusion matrix shows TP=8, FN=2, FP=50, TN=940. What's the precision and what does it mean practically?

🎓 What You Now Know

The confusion matrix has 4 cells — TP, FN, FP, TN. Every sample goes in exactly one.

Every metric derives from these 4 numbers — Accuracy, precision, recall, F1, specificity, MCC.

Accuracy can be misleading — Always check the confusion matrix on imbalanced data.

Multi-class: use one-vs-all — Then aggregate with micro, macro, or weighted averaging.

MCC is the most balanced metric — Uses all 4 cells symmetrically. Report it.

The confusion matrix is the foundation of classification evaluation. Before computing ANY metric, print the confusion matrix. It tells you not just HOW MUCH the model is wrong, but HOW it’s wrong — and that’s what matters for improvement. 🔍
