Random Forests — Why 1000 Bad Models Beat 1 Good One
A scroll-driven visual deep dive into Random Forests. Learn bagging, feature randomness, out-of-bag error, and why ensembles are the most reliable ML technique.
Ensemble Methods
One tree is unstable.
A forest is unshakeable.
Train 500 decision trees on random subsets of data and features. Average their predictions. The errors cancel out, the signal adds up. That’s the Random Forest — the most reliable algorithm in ML.
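Want to see that in code before we dissect it? A minimal sketch, assuming scikit-learn and its built-in breast cancer dataset (our choice of library and data, not part of the lesson):

```python
# Train a 500-tree Random Forest and score the averaged ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)                 # each tree sees its own bootstrap sample
print(forest.score(X_test, y_test))          # accuracy of the combined (averaged) predictions
```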
Step 1: Bootstrap Aggregating (Bagging)
Why averaging reduces variance
Var(single tree) = σ²
Var(average of B independent trees) = σ²/B
But trees aren't fully independent... Solution: make trees MORE different.
Each bootstrap sample draws N points from a dataset of N points WITH replacement. What fraction of original points are typically left out?
💡 What's the probability of NOT picking a specific card from a deck of N cards, N times in a row?
The probability of a specific point NOT being selected in one draw is (1 - 1/N). Over N draws, the probability of never being selected is (1 - 1/N)^N → 1/e ≈ 0.368 as N grows. So about 37% of points are left out of each bootstrap sample. These 'out-of-bag' points become a free validation set — no need for a separate train/test split!
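You can verify the ~37% figure yourself with a one-line simulation. A quick sketch, assuming NumPy:

```python
# Draw N indices with replacement and count how many originals never appear.
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
sample = rng.integers(0, N, size=N)      # one bootstrap sample of size N
left_out = N - np.unique(sample).size    # points that were never drawn
print(left_out / N)                      # ≈ 0.368, i.e. about 1/e
```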
Step 2: Random Feature Selection
How Random Forest differs from Bagged Trees
Bagged trees: each split considers ALL p features. Since every tree splits on the same dominant features first, the trees end up highly correlated — reducing the benefit of averaging.
Random Forest: each split only considers √p random features. This forces trees to use different features and find different patterns, making them decorrelated and far more diverse.
The ensemble variance depends on the average correlation ρ between trees. Lower correlation means lower variance — this is why Random Forest beats plain bagging.
Var(RF) = ρσ² + (1−ρ)σ²/B
You have 100 features. At each split in a Random Forest, how many features does a tree consider?
💡 p = 100, and the default for classification is √p...
The default for classification is √p = √100 = 10 random features per split. Crucially, a NEW random subset is drawn at EVERY split, not just once per tree. This means even within a single tree, different nodes use different feature subsets. For regression, the default is p/3 ≈ 33 features per split.
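In code this knob is usually called something like max_features. A hedged sketch using scikit-learn's names (note that recent scikit-learn versions default the regressor to all features rather than p/3, so we set it explicitly):

```python
# max_features controls how many features each split considers;
# a fresh random subset is drawn at EVERY split, not once per tree.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(max_features="sqrt")   # √p features per split (classification default)
reg = RandomForestRegressor(max_features=1/3)       # a fraction: roughly p/3 features per split
```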
Free Validation: Out-of-Bag Error
Why is OOB error nearly as good as cross-validation error?
💡 What's the defining property of good model evaluation? The model hasn't seen the test data during training...
In k-fold cross-validation, each point is evaluated by a model that didn't train on it. OOB does the same thing naturally: each point is evaluated by the ~37% of trees that didn't include it in their bootstrap sample. The key insight is that these trees have never 'seen' the point — so their predictions are unbiased estimates of generalization performance. And you get this for free, without the computational cost of multiple train/test splits!
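If you use scikit-learn, this estimate is one flag away. A minimal sketch (library and dataset are assumptions):

```python
# oob_score=True asks the forest to evaluate each point using only
# the trees whose bootstrap samples did not contain it.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
forest.fit(X, y)
print(forest.oob_score_)   # accuracy estimated from out-of-bag predictions only
```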
Which Features Matter Most?
Two ways to measure feature importance
Impurity-based importance: sum up the Gini or entropy reduction from every split that uses feature f, across all trees. Fast and built-in, but biased toward high-cardinality features (like IDs) that get artificially inflated importance.
Importance(f) = Σ(gain from splits on feature f)
Permutation importance: shuffle a feature's values randomly and measure how much accuracy drops. If accuracy drops a lot, the feature was important. This measures actual prediction impact, not just split frequency — more reliable than the impurity-based measure.
Importance(f) = accuracy_original − accuracy_shuffled
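Both measures are available off the shelf in scikit-learn. A hedged sketch (feature_importances_ and permutation_importance are scikit-learn's names, not something the forest math requires):

```python
# Compare impurity-based and permutation importance for the same forest.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

impurity_imp = forest.feature_importances_            # Gini-based, computed during training
perm = permutation_importance(forest, X_test, y_test, # shuffle each feature on held-out data
                              n_repeats=10, random_state=0)
perm_imp = perm.importances_mean                      # average accuracy drop per feature
```

Measuring permutation importance on held-out data (here the test split) is what keeps it honest: a feature that only helped the trees memorize the training set won't hurt accuracy when shuffled.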
Random Forest: The Reliable Default
Adding more trees to a Random Forest (e.g., going from 100 to 10,000 trees):
💡 Does averaging MORE independent estimates make predictions worse?
This is one of Random Forest's best properties: you CANNOT overfit by adding more trees. Each tree is an independent estimate, and averaging more estimates only reduces variance. However, the returns diminish: going from 10 to 100 trees is huge, 100 to 500 is noticeable, 500 to 5000 is marginal. The only cost is compute time. In practice, 300-500 trees is usually sufficient.
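You can watch the diminishing returns yourself by growing one forest incrementally and tracking its OOB error. A sketch assuming scikit-learn (warm_start keeps the already-grown trees when n_estimators increases):

```python
# Grow the same forest from 10 to 2000 trees and watch the OOB error flatten out.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

for n in (10, 100, 500, 2000):
    forest.set_params(n_estimators=n)   # warm_start: only the new trees are trained
    forest.fit(X, y)
    print(n, 1 - forest.oob_score_)     # error drops sharply at first, then barely moves
    # (with very few trees, scikit-learn may warn that some points have no OOB prediction yet)
```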
🎓 What You Now Know
✓ Bagging averages noisy models to reduce variance — Train B trees on bootstrap samples, average predictions.
✓ Feature randomness decorrelates trees — Only √p random features per split forces diversity.
✓ OOB error = free validation — ~37% of points are left out of each tree, giving unbiased error estimates.
✓ More trees never hurts — Can’t overfit by adding trees. Diminishing returns after ~500.
✓ Best default for tabular data — No scaling, no tuning, just works. Start here.
Random Forest is the AK-47 of machine learning: reliable, robust, hard to misuse. Every ML practitioner should have it in their toolkit. And understanding it prepares you for Gradient Boosting — the technique that takes ensemble methods to the next level. 🚀
↗ Keep Learning
Decision Trees — How Machines Learn to Ask Questions
A scroll-driven visual deep dive into decision trees. Learn how trees split data, what Gini impurity and information gain mean, and why trees overfit like crazy.
Gradient Boosting & XGBoost — The Kaggle King
A scroll-driven visual deep dive into gradient boosting. Learn how weak learners combine sequentially, how XGBoost optimizes the process, and why it dominates tabular ML competitions.
Bagging vs Boosting — The Two Philosophies of Ensemble Learning
A scroll-driven visual deep dive comparing bagging and boosting. Learn when to average independent models vs sequentially correct errors, and why ensembles dominate ML.