Confusion Matrix Deep Dive — What Your Model Gets Wrong and Why
A scroll-driven deep dive into the confusion matrix. Master TP, TN, FP, FN, and learn to derive every classification metric from a single 2×2 table.
Model Evaluation
Four cells.
Every classification metric.
Accuracy, precision, recall, F1, specificity, balanced accuracy, MCC — every single one of these metrics is derived from just four numbers arranged in a 2×2 table. Master the confusion matrix and you’ll never be confused by metrics again.
The 2×2 Table
A cancer screening test gives 100 results: TP=40, FN=10, FP=5, TN=45. How many actual cancer cases were there?
💡 Actual positives include those caught (TP) and those missed (FN).
Actual positives = TP + FN. These are all the people who ACTUALLY have cancer, regardless of what the model predicted. TP (40) were correctly identified, FN (10) were missed. Total = 50 actual cancer cases. Similarly, actual negatives = FP + TN = 5 + 45 = 50.
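If you want to see those row and column sums in code, here's a minimal Python sketch using the quiz numbers above (the variable names are just for illustration):

```python
# The cancer-screening counts from the quiz above.
TP, FN, FP, TN = 40, 10, 5, 45

actual_positives = TP + FN     # truly have cancer: caught plus missed
actual_negatives = FP + TN     # truly healthy: false alarms plus correct negatives
predicted_positives = TP + FP  # everything the model flagged
predicted_negatives = FN + TN  # everything the model cleared

print(actual_positives, actual_negatives)        # 50 50
print(predicted_positives, predicted_negatives)  # 45 55
```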
Every Metric From Four Numbers
Metrics derived from the confusion matrix
- Accuracy = (TP + TN) / (TP + TN + FP + FN). Overall correctness — the fraction of all predictions that were right. Sounds good, but it's misleading on imbalanced data, where always predicting the majority class gets high accuracy.
- Precision = TP / (TP + FP). Of all the positive PREDICTIONS, how many were actually correct? Measures how trustworthy your alarms are — high precision means few false alarms.
- Recall = TP / (TP + FN). Of all the actual POSITIVES, how many did the model find? Measures detection thoroughness — high recall means few missed cases.
- Specificity = TN / (TN + FP). Of all the actual NEGATIVES, how many were correctly identified as negative? The 'recall for the negative class' — important when false positives are costly.
- F1 = 2 × (Precision × Recall) / (Precision + Recall). The harmonic mean of precision and recall, balancing both into a single number. Useful when you can't afford to ignore either false positives or false negatives.
- MCC = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Matthews Correlation Coefficient: the most balanced metric, using all four quadrants symmetrically. Ranges from −1 to +1, where 0 means no better than random — robust even on imbalanced data.

A model predicts 'not fraud' for ALL 10,000 transactions. 9,900 are legitimate, 100 are fraud. What's its accuracy and recall?
💡 If the model never predicts 'fraud', how many true positives does it have?
If the model predicts everything as 'not fraud': TP=0, FN=100, FP=0, TN=9900. Accuracy = (0+9900)/10000 = 99% — looks great! But Recall = TP/(TP+FN) = 0/(0+100) = 0% — it caught ZERO fraud cases. This is the classic accuracy trap on imbalanced data. The confusion matrix instantly reveals the problem that accuracy hides.
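Those definitions translate directly into code. The sketch below is a minimal, self-contained version (returning 0.0 when a denominator is zero is a simplifying convention for this sketch, not a standard), applied to the all-'not fraud' model:

```python
from math import sqrt

def metrics_from_counts(TP, FN, FP, TN):
    """Every metric above, derived from the four cells of the confusion matrix."""
    def safe(num, den):
        # Guard degenerate cases (e.g. a model that never predicts positive)
        # instead of raising ZeroDivisionError.
        return num / den if den else 0.0

    precision = safe(TP, TP + FP)
    recall    = safe(TP, TP + FN)
    return {
        "accuracy":    safe(TP + TN, TP + TN + FP + FN),
        "precision":   precision,
        "recall":      recall,
        "specificity": safe(TN, TN + FP),
        "f1":          safe(2 * precision * recall, precision + recall),
        "mcc":         safe(TP * TN - FP * FN,
                            sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))),
    }

# The all-'not fraud' model: 99% accuracy, 0% recall.
print(metrics_from_counts(TP=0, FN=100, FP=0, TN=9900))
```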
Beyond 2×2: Multi-Class Confusion Matrices
For K classes, the confusion matrix is K×K. Each cell (i, j) shows how many samples from class i were predicted as class j.
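As a quick illustration, scikit-learn's confusion_matrix returns exactly this K×K layout. The labels and predictions below are made up, and scikit-learn is assumed to be installed:

```python
from sklearn.metrics import confusion_matrix

# Toy 3-class example: rows are the actual class, columns the predicted class,
# in the order given by `labels`. The diagonal holds the correct predictions.
y_true = ["cat", "cat", "cat", "dog", "dog", "bird", "bird", "bird", "bird"]
y_pred = ["cat", "cat", "dog", "dog", "cat", "bird", "bird", "bird", "cat"]

print(confusion_matrix(y_true, y_pred, labels=["cat", "dog", "bird"]))
```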
Averaging strategies
- Micro-averaging: pool all TP, FP, and FN across classes globally, then compute the metric once. Treats every sample equally, giving more weight to larger classes.
- Macro-averaging: compute the metric independently for each class, then take the unweighted mean. Treats every CLASS equally, so poor performance on rare classes drags down the average.
- Weighted averaging: like macro-averaging, but each class's metric is weighted by its support (number of samples). A compromise between micro and macro that accounts for class frequency.
In a 3-class problem (A:1000, B:50, C:50), Class B has F1=0.2, the others have F1=0.95. Macro-F1 vs Micro-F1: which is lower?
💡 Macro = unweighted mean across classes. Micro = global pool of TP/FP/FN.
Macro-F1 = (0.95 + 0.2 + 0.95) / 3 = 0.70. Micro-F1 will be dominated by Class A (1000 samples) where F1=0.95, pulling the micro-average up toward ~0.93. Macro-average penalizes poor performance on minority classes much more harshly. If you care about every class equally (regardless of size), report macro. If you care proportionally, report weighted or micro.
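Here's a hedged sketch of the three averaging modes using scikit-learn's f1_score; the toy labels are illustrative only and don't reproduce the exact class sizes from the quiz:

```python
from sklearn.metrics import f1_score

# Toy 3-class labels: class A dominates, class B is where the model struggles.
y_true = ["A", "A", "A", "A", "A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "A", "A", "A", "B", "A", "B", "C", "C"]

for average in ("micro", "macro", "weighted"):
    print(average, f1_score(y_true, y_pred, average=average))
# micro is pulled up by dominant class A; macro drops because class B's F1 is poor.
```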
Pitfalls to Avoid
Common pitfalls checklist:
- Reporting accuracy on imbalanced data — use the confusion matrix to show the FULL picture
- Ignoring off-diagonal patterns — in multi-class, which classes get confused with each other? The confusion matrix reveals systematic errors (e.g., always confusing “cat” with “dog”)
- Forgetting about prevalence — precision depends on class balance; compare models on the SAME test set
- Using a fixed threshold — the confusion matrix is threshold-dependent; consider sweeping thresholds (ROC curve)
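On that last point, here's a hedged sketch of a threshold sweep. It assumes you already have a fitted binary classifier clf with predict_proba and a held-out set (X_test, y_test); those names are placeholders:

```python
from sklearn.metrics import confusion_matrix

def confusion_by_threshold(clf, X_test, y_test, thresholds=(0.3, 0.5, 0.7)):
    """Print the confusion matrix at each decision threshold, not just 0.5."""
    scores = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
    for t in thresholds:
        y_pred = (scores >= t).astype(int)     # higher threshold, fewer positives
        print(f"threshold={t}")
        print(confusion_matrix(y_test, y_pred))
```

sklearn.metrics.roc_curve performs the same sweep implicitly when it traces the true-positive rate against the false-positive rate across thresholds.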
Practical Usage
You built a model to detect rare disease (1% prevalence). Your confusion matrix shows TP=8, FN=2, FP=50, TN=940. What's the precision and what does it mean practically?
💡 Precision = TP/(TP+FP). Count the false positives.
Precision = TP/(TP+FP) = 8/(8+50) = 8/58 = 13.8%. Despite catching 80% of actual cases (recall = 8/10 = 80%), the model generates ~6 false alarms per true case. In a medical setting, each false positive means unnecessary anxiety, follow-up tests, and costs. The confusion matrix reveals this tradeoff that accuracy (948/1000 = 94.8%) completely hides. This is why precision and recall matter more than accuracy on imbalanced data.
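To double-check the arithmetic, here's a small sketch (scikit-learn assumed) that rebuilds label arrays matching those four counts and recomputes the metrics; the ordering of the samples is arbitrary:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Reconstruct labels consistent with TP=8, FN=2, FP=50, TN=940.
y_true = [1] * 10 + [0] * 990                          # 10 actual cases, 990 healthy
y_pred = ([1] * 8 + [0] * 2) + ([1] * 50 + [0] * 940)  # predictions in the same order

print(accuracy_score(y_true, y_pred))   # 0.948
print(precision_score(y_true, y_pred))  # ~0.138
print(recall_score(y_true, y_pred))     # 0.8
```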
🎓 What You Now Know
✓ The confusion matrix has 4 cells — TP, FN, FP, TN. Every sample goes in exactly one.
✓ Every metric derives from these 4 numbers — Accuracy, precision, recall, F1, specificity, MCC.
✓ Accuracy can be misleading — Always check the confusion matrix on imbalanced data.
✓ Multi-class: use one-vs-all — Then aggregate with micro, macro, or weighted averaging.
✓ MCC is the most balanced metric — Uses all 4 cells symmetrically. Report it.
The confusion matrix is the foundation of classification evaluation. Before computing ANY metric, print the confusion matrix. It tells you not just HOW OFTEN the model is wrong, but HOW it’s wrong — and that’s what matters for improvement. 🔍
↗ Keep Learning
Accuracy, Precision, Recall & F1 — Choosing the Right Metric
A scroll-driven visual deep dive into classification metrics. Learn why accuracy misleads, what precision and recall actually measure, and when to use F1, F2, or something else entirely.
ROC Curves & AUC — Measuring Classifier Performance Visually
A scroll-driven visual deep dive into ROC curves and AUC. Learn TPR vs FPR, why AUC is threshold-independent, and when to use ROC vs PR curves.
Logistic Regression — The Classifier That's Not Really Regression
A scroll-driven visual deep dive into logistic regression. Learn how a regression model becomes a classifier, why the sigmoid is the key, and how log-loss trains it.