Cross-Validation & Hyperparameter Tuning — How to Actually Evaluate Models
A scroll-driven visual deep dive into cross-validation and hyperparameter tuning. Learn K-fold CV, stratified splitting, grid search, random search, and Bayesian optimization.
Your test accuracy is a lie.
Cross-validation tells the truth.
One train/test split gives you one number. That number could be lucky or unlucky. Cross-validation gives you a distribution — mean accuracy AND confidence interval. Combined with smart hyperparameter search, it’s how professionals evaluate and tune models.
The Problem with a Single Split
What could go wrong?
Split 1: test accuracy = 92%. Easy test examples land in the held-out set by chance, inflating the accuracy to 92%. You'd think your model is great — but it was just a favorable draw.
Split 2: test accuracy = 81%. Hard examples concentrate in the test set, making accuracy drop to 81%. Same model, same data — wildly different result due to the random partition.
Split 3: test accuracy = 86%. A more typical partition gives 86% — somewhere between lucky and unlucky, but you still wouldn't know that without trying other splits.
True performance ≈ 86% ± 5%. Only by evaluating on ALL splits can you estimate the mean (≈86%) AND the uncertainty (±5%). A single number without a confidence interval is almost meaningless.
You train a model once, test it once, and get 94% accuracy. You report '94% accuracy' in your paper. What's wrong?
💡 How confident can you be in a single measurement?
A single test result is a data point, not a conclusion. Cross-validation (say, 5-fold) would give you 5 test scores: maybe 94%, 89%, 92%, 91%, 90%. Now you can report 91.2% ± 1.8% — which is more honest and useful than a single 94%. Reviewers and practitioners trust CV results over single splits because they show both the estimate AND the uncertainty.
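To make this concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset from make_classification (the logistic-regression model is just a placeholder): a single train/test split yields one number, while 5-fold cross-validation yields five scores you can summarize as a mean and a spread.

```python
# Minimal sketch: a single split vs. 5-fold cross-validation.
# Dataset and model are placeholders (make_classification + LogisticRegression).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# One split -> one number, heavily dependent on the random partition.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
single_score = model.fit(X_tr, y_tr).score(X_te, y_te)

# 5-fold CV -> five numbers, so you can report a mean AND a spread.
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"single split: {single_score:.3f}")
print(f"5-fold CV:    {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```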
K-Fold Cross-Validation
K-Fold CV math
The cross-validation score is the average of scores from each fold. This gives a more robust estimate than any single train/test split.
CV score = (1/K) Σᵢ score(foldᵢ)
K=5 is the most common default: fast, with 80% of the data available for training in each fold. K=10 gives lower variance. LOOCV (K=N) uses the maximum training data per fold but is expensive and high-variance.
Every data point appears in the test set exactly once across all folds. No data is wasted — you get a test prediction for every single sample.
In 5-fold CV, how much of the data does each model train on?
💡 If there are 5 folds and 1 is held out for testing, how many folds are used for training?
In K-fold CV, each model trains on (K−1)/K of the data. For K=5: 4/5 = 80%. For K=10: 9/10 = 90%. Higher K → more training data per fold → lower bias but higher variance (folds overlap more) and more compute. K=5 is the most common default: 80% training is substantial, and 5 fits are fast enough for most pipelines.
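A small sketch to make the fold arithmetic concrete, assuming scikit-learn's KFold and a toy array of 100 samples: with K=5, every model trains on 80 of the 100 samples, and each sample lands in a test fold exactly once.

```python
# Sketch: fold sizes and test coverage for K=5 (scikit-learn KFold).
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)  # 100 toy samples
kf = KFold(n_splits=5, shuffle=True, random_state=0)

times_tested = np.zeros(len(X), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each model trains on (K-1)/K of the data: 80 of 100 samples here.
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
    times_tested[test_idx] += 1

# Every sample appears in a test fold exactly once across the K folds.
assert (times_tested == 1).all()
```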
Choosing the Right CV Strategy
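The summary at the end of this article calls out two of the most common matches: stratified splits for classification and forward-chaining splits for time series. Here is a minimal sketch of both, assuming scikit-learn's StratifiedKFold and TimeSeriesSplit on synthetic data.

```python
# Sketch: matching the splitter to the data structure (scikit-learn, toy data).
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Imbalanced classification: stratification preserves the class ratio per fold.
y = np.array([0] * 90 + [1] * 10)            # 10% positive class
X = np.random.randn(len(y), 3)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    print("positives in test fold:", y[test_idx].sum())  # 2 per fold

# Temporal data: always train on the past and test on the future (no shuffling).
X_time = np.arange(60).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X_time):
    print(f"train ends at t={train_idx[-1]}, test spans t={test_idx[0]}..{test_idx[-1]}")
```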
Hyperparameter Tuning: Finding the Best Settings
You use GridSearchCV to find the best hyperparameters and report the best CV score. Is there a subtle problem?
💡 Selection bias: if you pick the maximum of 100 random numbers, it's higher than average...
This is a critical pitfall! If you try 100 hyperparameter combinations and pick the one with the best CV score, that score is optimistically biased — you've effectively 'fit' the validation data by selecting the best combination. The solution is a three-way split: training and validation data used for CV and tuning, plus a final held-out TEST set that is NEVER touched until the very end. That test set gives you an unbiased estimate of performance.
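A hedged sketch of that workflow, assuming scikit-learn's GridSearchCV with an SVC and a synthetic dataset (both placeholders): tune with CV inside the development portion, then score the untouched test set exactly once.

```python
# Sketch: tune with CV on the development data, then touch the final test set once.
# SVC, the grid, and the synthetic dataset are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Carve out the final test set BEFORE any tuning happens.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_dev, y_dev)

print("best CV score (optimistically biased):", search.best_score_)
print("estimate on the untouched test set:   ", search.score(X_test, y_test))
```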
Beyond Grid Search
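As a sketch of the random-search alternative highlighted in the summary below, here is scikit-learn's RandomizedSearchCV sampling a fixed budget of configurations from integer and continuous distributions (the random-forest model and scipy distributions are illustrative choices, not the article's). Bayesian optimization would replace the random sampler with a surrogate model and is not shown here.

```python
# Sketch: random search samples a fixed budget of configurations instead of
# enumerating a full grid. Model and distributions are illustrative choices.
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 20),
    "min_samples_leaf": randint(1, 10),
    "max_features": loguniform(0.1, 1.0),  # fraction of features per split
}

# 30 sampled configurations, each scored with 5-fold CV.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=30,
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```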
🎓 What You Now Know
✓ Single splits are unreliable — Use K-fold CV (K=5 or 10) for robust estimates.
✓ Stratified for classification, time-series for temporal — Match CV to your data structure.
✓ Grid search is exhaustive but expensive — Use random search for 4+ hyperparameters.
✓ Always keep a final held-out test set — Never use it for tuning or selection.
✓ Bayesian optimization for expensive models — Learns which regions of hyperspace are promising.
Model evaluation isn’t glamorous, but it’s what separates rigorous ML from guesswork. A mediocre model properly evaluated is more trustworthy than a brilliant model tested on one lucky split. Build the habit: every result you report should come from cross-validation. 📊
📄 Random Search for Hyper-Parameter Optimization (Bergstra & Bengio, 2012)
↗ Keep Learning
Bias-Variance Tradeoff — The Most Important Concept in ML
A scroll-driven visual deep dive into the bias-variance tradeoff. Learn why every model makes errors, how underfitting and overfitting emerge, and how to balance them.
Accuracy, Precision, Recall & F1 — Choosing the Right Metric
A scroll-driven visual deep dive into classification metrics. Learn why accuracy misleads, what precision and recall actually measure, and when to use F1, F2, or something else entirely.