14 min deep dive · machine-learning · ensemble

Random Forests — Why 1000 Bad Models Beat 1 Good One

A scroll-driven visual deep dive into Random Forests. Learn bagging, feature randomness, out-of-bag error, and why ensembles are the most reliable ML technique.


Ensemble Methods

One tree is unstable.
A forest is unshakeable.

Train 500 decision trees on random subsets of data and features. Average their predictions. The errors cancel out, the signal adds up. That’s the Random Forest — the most reliable algorithm in ML.
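Here is what that looks like in code: a minimal sketch using scikit-learn's RandomForestClassifier on a synthetic dataset (the data and settings are illustrative, not from this article).

```python
# Minimal sketch: a 500-tree Random Forest on synthetic tabular data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is grown on a bootstrap sample, with a random feature subset at every split.
forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```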

Bagging

Step 1: Bootstrap Aggregating (Bagging)

Bagging: draw B bootstrap samples of N points with replacement, train one independent tree per sample, then average or vote for the final prediction
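To see the mechanics rather than call a library, here is a from-scratch sketch of bagging using numpy and plain decision trees. The function names (bagged_trees, bagged_predict) are my own, and it assumes numpy arrays with binary 0/1 labels.

```python
# From-scratch bagging sketch (assumes X, y are numpy arrays and y holds 0/1 labels).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=100, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        # Draw N indices WITH replacement: some points repeat, ~37% are left out.
        idx = rng.integers(0, n, size=n)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    # Aggregate by averaging the trees' 0/1 votes (majority vote).
    votes = np.stack([tree.predict(X) for tree in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```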

Why averaging reduces variance

1. Var(single tree) = σ² — one tree is noisy: high variance.
2. Var(average of B independent trees) = σ²/B — if the trees were truly independent, averaging B of them would shrink the variance by a factor of B.
3. But trees aren't fully independent — they're trained on the same data, and that correlation reduces the benefit.
4. Solution: make the trees MORE different — that's where feature randomness comes in → Random Forest.
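Step 2 is easy to verify numerically. Here is a toy simulation (pure numpy, no trees involved) that treats each "tree" as an independent noisy prediction with standard deviation σ; the specific numbers are arbitrary choices of mine.

```python
# Toy check that averaging B independent noisy estimates shrinks variance to ~sigma^2 / B.
import numpy as np

rng = np.random.default_rng(0)
sigma, B, trials = 2.0, 50, 200_000

single = rng.normal(0.0, sigma, size=trials)                       # one noisy "tree"
averaged = rng.normal(0.0, sigma, size=(trials, B)).mean(axis=1)   # average of B independent "trees"

print(single.var())    # close to sigma^2 = 4.0
print(averaged.var())  # close to sigma^2 / B = 0.08
```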
🟢 Quick Check

Each bootstrap sample draws N points from a dataset of N points WITH replacement. What fraction of original points are typically left out?

Feature Randomness

Step 2: Random Feature Selection

How Random Forest differs from Bagged Trees

🌲 Bagged Trees

Each split considers ALL p features. Since every tree splits on the same dominant features first, the trees end up highly correlated — reducing the benefit of averaging.

Each split considers all p features
🌳 Random Forest

Each split only considers √p random features. This forces trees to use different features and find different patterns, making them decorrelated and far more diverse.

Each split considers √p random features
📐 Variance Formula

The ensemble variance depends on the average correlation ρ between trees. Lower correlation means lower variance — this is why Random Forest beats plain bagging.

Var(RF) = ρσ² + (1-ρ)σ²/B
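In scikit-learn, the difference between plain bagged trees and a Random Forest comes down to the max_features setting. A comparison sketch on synthetic data (scores will vary with the dataset, so treat the printout as illustrative):

```python
# Same ensemble, different per-split feature pool: all p features vs sqrt(p) random features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=100, n_informative=10, random_state=0)

bagged = RandomForestClassifier(n_estimators=200, max_features=None, random_state=0)    # every split sees all 100 features
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)  # every split sees sqrt(100) = 10 features

print("bagged trees :", cross_val_score(bagged, X, y).mean())
print("random forest:", cross_val_score(forest, X, y).mean())
```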
🟡 Checkpoint

You have 100 features. At each split in a Random Forest, how many features does a tree consider?

OOB Error

Free Validation: Out-of-Bag Error

Training data | Tree 1 | Tree 2 | Tree 3 | OOB pred
Point A       | ✓ used | ✗ OOB  | ✗ OOB  | avg(T2, T3)
Point B       | ✗ OOB  | ✓ used | ✗ OOB  | avg(T1, T3)
Point C       | ✗ OOB  | ✗ OOB  | ✓ used | avg(T1, T2)

OOB Error ≈ Test Error — no separate validation set needed!
Each point is predicted by trees that never saw it — free cross-validation
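Because each bootstrap sample leaves out roughly 1/e ≈ 37% of the points, every tree comes with its own built-in holdout set. scikit-learn exposes this as oob_score_; a sketch on synthetic data (the split and forest size are arbitrary choices of mine):

```python
# OOB error: each training point is scored only by the trees that never saw it.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

print("OOB accuracy :", forest.oob_score_)             # free estimate, no data spent on validation
print("test accuracy:", forest.score(X_test, y_test))  # should land close to the OOB number
```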
🟡 Checkpoint

Why is OOB error nearly as good as cross-validation error?

Feature Importance

Which Features Matter Most?

Two ways to measure feature importance

📊 Impurity-Based

Sum up the Gini or entropy reduction from every split that uses feature f, across all trees. Fast and built-in, but biased toward high-cardinality features (like IDs) that get artificially inflated importance.

Importance(f) = Σ(gain from splits on feature f)
🔀 Permutation

Shuffle a feature's values randomly and measure how much accuracy drops. If accuracy drops a lot, the feature was important. This measures actual prediction impact, not just split frequency — more reliable than impurity-based.

Importance(f) = accuracy_original − accuracy_shuffled
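Both measures are one call away in scikit-learn. A sketch on synthetic data (the feature counts and forest size are arbitrary choices of mine):

```python
# Impurity-based vs permutation importance for a fitted Random Forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Impurity-based: total Gini reduction contributed by each feature, computed at training time.
print(forest.feature_importances_)

# Permutation: shuffle one feature at a time on held-out data and measure the accuracy drop.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print(perm.importances_mean)
```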
In Practice

Random Forest: The Reliable Default

✓ Advantages
• No feature scaling needed
• Handles missing values
• Works out-of-the-box
• Rarely overfits (more trees = better)
• Built-in feature importance
• Parallelizable (trees are independent)
• Free OOB error estimate
Best default for tabular data

✗ Limitations
• Slower than single tree at prediction
• Not as accurate as XGBoost (usually)
• Can't extrapolate beyond training range
• Large models (500 trees × many nodes)
• Less interpretable than single tree
• Bad for images, text, sequences
Use XGBoost for max accuracy on tabular
Why Random Forest is often the first algorithm to try
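One of the bullets above, "No feature scaling needed", is easy to check: tree splits only depend on the ordering of feature values, so a monotone rescaling should leave the forest's accuracy essentially unchanged. A sketch on synthetic data (illustrative, not a benchmark):

```python
# Trees split on thresholds, so monotone feature rescaling should not change the forest's behavior.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

scaler = StandardScaler().fit(X_train)
scaled = RandomForestClassifier(n_estimators=200, random_state=0).fit(
    scaler.transform(X_train), y_train)

print("raw features   :", raw.score(X_test, y_test))
print("scaled features:", scaled.score(scaler.transform(X_test), y_test))  # expect essentially the same score
```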
🔴 Challenge

Adding more trees to a Random Forest (e.g., going from 100 to 10,000 trees):

🎓 What You Now Know

Bagging averages noisy models to reduce variance — Train B trees on bootstrap samples, average predictions.

Feature randomness decorrelates trees — Only √p random features per split forces diversity.

OOB error = free validation — ~37% of points are left out of each tree, giving unbiased error estimates.

More trees never hurts — Can’t overfit by adding trees. Diminishing returns after ~500.

Best default for tabular data — No scaling, no tuning, just works. Start here.

Random Forest is the AK-47 of machine learning: reliable, robust, hard to misuse. Every ML practitioner should have it in their toolkit. And understanding it prepares you for Gradient Boosting — the technique that takes ensemble methods to the next level. 🚀

📄 Random Forests (Breiman, 2001)
