14 min deep dive · machine-learning · fundamentals

Cross-Validation & Hyperparameter Tuning — How to Actually Evaluate Models

A scroll-driven visual deep dive into cross-validation and hyperparameter tuning. Learn K-fold CV, stratified splitting, grid search, random search, and Bayesian optimization.

Introduction

Model Evaluation

Your test accuracy is a lie.
Cross-validation tells the truth.

One train/test split gives you one number. That number could be lucky or unlucky. Cross-validation gives you a distribution — mean accuracy AND confidence interval. Combined with smart hyperparameter search, it’s how professionals evaluate and tune models.

Hold-Out Problem

The Problem with a Single Split

What could go wrong?

🍀 Lucky Split

Easy test examples land in the held-out set by chance, inflating the accuracy to 92%. You'd think your model is great — but it was just a favorable draw.

Split 1: test accuracy = 92%
🎲 Unlucky Split

Hard examples concentrate in the test set, making accuracy drop to 81%. Same model, same data — wildly different result due to the random partition.

Split 2: test accuracy = 81%
📊 Average Split

A more typical partition gives 86% — somewhere between lucky and unlucky, but you still wouldn't know that without trying other splits.

Split 3: test accuracy = 86%
🎯 True Performance

Only by evaluating on ALL splits can you estimate the mean (≈86%) AND the uncertainty (±5%). A single number without a confidence interval is almost meaningless.

True performance ≈ 86% ± 5%
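To see this for yourself, re-run a single hold-out evaluation with nothing changed but the random seed. A minimal sketch in scikit-learn; the breast-cancer dataset and logistic-regression model are illustrative choices, not the article's, and your exact numbers will differ:

```python
# Same model, same data: only the random train/test partition changes.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed  # the only thing that varies
    )
    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print("single-split accuracies:", np.round(scores, 3))
print(f"mean ± std: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```

The spread across seeds is exactly the lucky/unlucky effect described above.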
🟢 Knowledge Check (Quick Check)

You train a model once, test it once, and get 94% accuracy. You report '94% accuracy' in your paper. What's wrong?

K-Fold CV

K-Fold Cross-Validation

[Figure: 5-fold cross-validation. Fold test scores: 92%, 89%, 91%, 88%, 90%. Mean: 90.0% ± 1.4%]
5-Fold CV: each fold serves as the test set once and trains on the other four folds, so every point gets tested.

K-Fold CV math

📊 CV Score

The cross-validation score is the average of scores from each fold. This gives a more robust estimate than any single train/test split.

CV score = (1/K) Σᵢ score(foldᵢ)
⚙️ Choosing K

K=5 is the most common default: fast and 80% training data per fold. K=10 gives lower variance. LOOCV (K=N) uses maximum training data but is expensive and high-variance.

Complete Coverage

Every data point appears in the test set exactly once across all folds. No data is wasted — you get a test prediction for every single sample.
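In scikit-learn the whole procedure is a single call. A minimal sketch, again with an illustrative dataset and model rather than anything from this article:

```python
# 5-fold cross-validation: fit 5 models, collect 5 test-fold scores.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)  # one score per fold

print("fold scores:", scores.round(3))
print(f"CV score: {scores.mean():.3f} ± {scores.std():.3f}")
```

Reporting the mean together with the standard deviation is what gives you the "±" part of the estimate.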

🟢 Knowledge Check (Quick Check)

In 5-fold CV, how much of the data does each model train on?

CV Variants

Choosing the Right CV Strategy

Stratified K-Fold: Each fold preserves the class distribution. Essential for imbalanced datasets (e.g., a 90/10 split stays 90/10 in every fold). DEFAULT for classification.

Time Series Split: Train on the past, test on the future only; never lets future data leak into training. Expanding window: gradually more training data. REQUIRED for temporal data.

Group K-Fold: Keeps all samples from the same group together (e.g., the same patient's data never appears in both train and test). Prevents data leakage from correlated samples. REQUIRED for grouped/clustered data.

LOOCV (Leave-One-Out): K = N, testing on a single point each time. Nearly unbiased but high variance, and very slow (N models to train). Only for very small datasets (N < 100).
Different CV strategies for different data types
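Each strategy above corresponds to an off-the-shelf splitter in scikit-learn. A sketch with made-up data shapes, just to show which object you would reach for in each case:

```python
# Match the splitter to the data structure (data here is synthetic/illustrative).
import numpy as np
from sklearn.model_selection import (
    GroupKFold, LeaveOneOut, StratifiedKFold, TimeSeriesSplit
)

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)   # class labels
groups = np.repeat(np.arange(20), 5)    # e.g., 20 patients, 5 samples each

# Classification default: every fold keeps the class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    ...

# Temporal data: train on the past, test on the future (expanding window).
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    ...

# Grouped data: a group (patient) never appears in both train and test.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    ...

# Tiny datasets only: K = N models.
loo = LeaveOneOut()
```

Any of these splitters can be passed directly as the `cv=` argument of `cross_val_score` or of the search objects below.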
Grid Search

Hyperparameter Tuning: Finding the Best Settings

📋 Param grid (all combinations) → ⚙️ Combo i (e.g., η=0.1, d=3) → 📊 5-fold CV (average score) → 🏆 Best combo (highest CV score)
GridSearchCV: try every combination, evaluate each with K-fold CV
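A minimal GridSearchCV sketch. The gradient-boosting model and the grid values (learning rate η and depth d, echoing the diagram) are illustrative choices, not prescriptions:

```python
# Grid search: every combination in the grid is scored with 5-fold CV.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "learning_rate": [0.01, 0.1, 0.3],  # η
    "max_depth": [2, 3, 4],             # d
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,        # each of the 9 combos gets a 5-fold CV score
    n_jobs=-1,   # parallelize across cores
)
search.fit(X, y)

print("best params:", search.best_params_)
print(f"best CV score: {search.best_score_:.3f}")
```

Note the cost: 3 × 3 combinations × 5 folds = 45 model fits, and the grid grows multiplicatively with every hyperparameter you add.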
🔴 Knowledge Check (Challenge)

You use GridSearchCV to find the best hyperparameters and report the best CV score. Is there a subtle problem?

Smart Search
Grid Search: Try every combination. Exhaustive but exponential, with cost O(m₁ × m₂ × … × mₖ). Good for ≤3 hyperparameters. sklearn: GridSearchCV.

Random Search: Sample N random combos. Often as good as grid in 1/10 the time, with better coverage of the important dimensions. Cost O(N), where YOU choose N. sklearn: RandomizedSearchCV.

Bayesian Opt: Model the objective function and choose the next point intelligently. Best for expensive evaluations; fewest evals to find the optimum. Tools: Optuna, hyperopt, Ax.
Grid vs Random vs Bayesian search strategies
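A sketch of random search that also keeps a final held-out test set, so the number you report is not the (optimistically selected) best CV score. The distributions, budget, and model are illustrative choices:

```python
# Random search on the training split, final check on untouched test data.
from scipy.stats import loguniform, randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

param_distributions = {
    "learning_rate": loguniform(1e-3, 0.5),
    "max_depth": randint(2, 6),
    "n_estimators": randint(50, 300),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=20,        # the search budget: you choose it
    cv=5,
    random_state=0,
    n_jobs=-1,
)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print(f"best CV score (slightly optimistic): {search.best_score_:.3f}")
print(f"held-out test score (report this):   {search.score(X_te, y_te):.3f}")
```

For expensive models, Bayesian tools such as Optuna choose each next trial based on the results of previous ones instead of sampling at random.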

🎓 What You Now Know

Single splits are unreliable — Use K-fold CV (K=5 or 10) for robust estimates.

Stratified for classification, time-series for temporal — Match CV to your data structure.

Grid search is exhaustive but expensive — Use random search for 4+ hyperparameters.

Always keep a final held-out test set — Never use it for tuning or selection.

Bayesian optimization for expensive models — Learns which regions of hyperspace are promising.

Model evaluation isn’t glamorous, but it’s what separates rigorous ML from guesswork. A mediocre model properly evaluated is more trustworthy than a brilliant model tested on one lucky split. Build the habit: every result you report should come from cross-validation. 📊

📄 Random Search for Hyper-Parameter Optimization (Bergstra & Bengio, 2012)
