Decision Trees — How Machines Learn to Ask Questions
A scroll-driven visual deep dive into decision trees. Learn how trees split data, what Gini impurity and information gain mean, and why trees overfit like crazy.
Tree-Based Learning
If this, then that.
Learned from data.
Decision trees learn a flowchart of yes/no questions from data. They're among the most interpretable ML models, the building block of random forests and XGBoost, and one of the few models a non-engineer can actually read.
A Tree Is a Flowchart
What makes decision trees uniquely interpretable compared to other ML models?
💡 Can you explain a neural network's prediction as a series of yes/no questions?
Decision trees are one of the few ML models where you can point to a prediction and explain it completely to a non-technical person: 'The model predicted NO because outlook was sunny AND humidity was high.' Try doing that with a neural network or an SVM. This interpretability makes trees essential in regulated industries (healthcare, finance) where you must be able to explain every decision.
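To make that concrete, here is a minimal sketch, assuming scikit-learn; the tiny weather-style dataset and the feature names are made up for illustration. It trains a small tree and prints its rules as plain if/else text:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical 0/1 features: [outlook_sunny, humidity_high]; labels: play tennis?
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = ["no", "yes", "yes", "yes", "no", "yes"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text turns the fitted tree into nested if/else rules anyone can read.
print(export_text(tree, feature_names=["outlook_sunny", "humidity_high"]))
```

Every prediction corresponds to exactly one root-to-leaf path in that printout, and that path is the whole explanation.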
How Does the Tree Choose Questions?
Measuring impurity
Gini impurity measures the probability of misclassifying a randomly chosen sample. Gini = 0 means perfectly pure (all one class). It's fast to compute and scikit-learn's default.

Gini(t) = 1 - Σ pᵢ²

Entropy measures the information content (disorder) in a node. Entropy = 0 means pure. Higher entropy means more mixed classes — borrowed from information theory.

Entropy(t) = -Σ pᵢ · log₂(pᵢ)

A 50/50 class split is the worst case — maximum uncertainty, like flipping a fair coin. Both metrics peak here.

50/50 split → Gini = 0.5, Entropy = 1.0

A 90/10 split is much purer — the node is dominated by one class. Both metrics drop significantly, reflecting the reduced uncertainty.

90/10 split → Gini = 0.18, Entropy = 0.47
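To make the two formulas concrete, here is a tiny sketch in plain Python that reproduces the numbers above:

```python
from math import log2

def gini(p):
    """Gini impurity for a list of class proportions."""
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    """Entropy in bits; terms with probability 0 are skipped."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # -> 0.5, 1.0
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))   # -> ~0.18, ~0.47
```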
Information gain: picking the best split

Information gain measures how much purity improves after splitting. Subtract the weighted child impurity from the parent's impurity — bigger gain means a better split.

Gain = Impurity(parent) - Σ (nⱼ/N) · Impurity(childⱼ)

The algorithm exhaustively tries every feature and every possible threshold value (e.g., age ≤ 20, age ≤ 25, ...) to find which split produces the highest information gain.
Select the split with the highest gain — this is a greedy choice, optimizing locally at each node rather than planning the entire tree structure ahead.
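Here is a rough sketch of that search for a single numeric feature, in plain Python with made-up age/label data. (Real implementations typically test midpoints between consecutive sorted values, but the idea is the same.)

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Try every candidate threshold for one feature; keep the highest gain."""
    parent = gini(labels)
    best = (None, 0.0)                                   # (threshold, gain)
    for t in sorted(set(values)):
        left  = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        if not left or not right:
            continue                                     # split must produce two non-empty children
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best

ages   = [18, 22, 25, 31, 40, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
print(best_threshold(ages, bought))                      # -> (25, 0.5), a perfect split here
```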
A node contains 100 samples: 50 cats and 50 dogs. After splitting on 'has_whiskers', the left child has 45 cats and 5 dogs, the right child has 5 cats and 45 dogs. Is this a good split?
💡 Calculate parent Gini, then weighted child Gini. Is the drop significant?
Parent Gini: 1 - (0.5² + 0.5²) = 0.5 (maximum impurity). Left child Gini: 1 - (0.9² + 0.1²) = 0.18. Right child Gini: same = 0.18. Weighted average child Gini: 0.5(0.18) + 0.5(0.18) = 0.18. Information gain: 0.5 - 0.18 = 0.32. That's a massive gain! The children don't need to be perfectly pure — they just need to be significantly purer than the parent.
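The same arithmetic in a few lines of plain Python, if you want to check it yourself:

```python
def gini(proportions):
    return 1 - sum(p ** 2 for p in proportions)

parent = gini([0.5, 0.5])                  # 50 cats / 50 dogs -> 0.5
left   = gini([45 / 50, 5 / 50])           # 45 cats / 5 dogs  -> 0.18
right  = gini([5 / 50, 45 / 50])           # 5 cats / 45 dogs  -> 0.18

weighted_children = (50 / 100) * left + (50 / 100) * right
print(round(parent - weighted_children, 2))   # information gain -> 0.32
```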
The Recursive Algorithm
CART uses a greedy algorithm to build trees. Why is this a problem?
💡 What does 'greedy' mean in algorithms? Best local choice vs best global plan...
Finding the globally optimal decision tree is NP-hard (would require trying all possible split orderings). CART's greedy approach picks the best split at each node independently. This can lead to suboptimal trees — maybe splitting on Feature B first and Feature A second would be better overall, but CART chose Feature A first because it had better immediate gain. Despite this, greedy works well in practice, and ensembles (Random Forest, XGBoost) mitigate suboptimality by averaging many trees.
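A stripped-down sketch of that greedy recursion looks like the following. This is illustrative pure Python (the function names and toy data are my own), not how scikit-learn is actually implemented:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Greedy step: the best (feature, threshold) for THIS node only."""
    parent, n, best = gini(y), len(y), (None, None, 0.0)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left  = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if gain > best[2]:
                best = (f, t, gain)
    return best

def build(X, y, depth=0, max_depth=3):
    f, t, gain = best_split(X, y)
    if f is None or gain == 0 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]           # leaf: majority class
    left_idx  = [i for i, row in enumerate(X) if row[f] <= t]
    right_idx = [i for i, row in enumerate(X) if row[f] > t]
    return {"feature": f, "threshold": t,
            "left":  build([X[i] for i in left_idx],  [y[i] for i in left_idx],  depth + 1, max_depth),
            "right": build([X[i] for i in right_idx], [y[i] for i in right_idx], depth + 1, max_depth)}

X = [[2, 7], [3, 8], [1, 3], [6, 1], [7, 2], [8, 6]]
y = ["a", "a", "a", "b", "b", "b"]
print(build(X, y))   # -> {'feature': 0, 'threshold': 3, 'left': 'a', 'right': 'b'}
```

Notice that build() never revisits a split once it is made; each recursive call only sees the rows that reached its node.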
The Overfitting Monster
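One quick way to watch the monster at work: train an unpruned tree and a depth-limited tree on the same noisy data and compare train vs. test accuracy. A sketch, assuming scikit-learn and a synthetic dataset:

```python
# An unpruned tree fits the training set perfectly, but with noisy labels it
# typically does worse on held-out data than a depth-limited tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):          # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_tr, y_tr):.2f}, test={tree.score(X_te, y_te):.2f}")
```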
Trees for Regression Too
Regression tree differences
Instead of majority-class voting, regression tree leaves output the mean of all target values in that region. Each leaf represents a constant prediction for its partition of the feature space.

Leaf prediction = mean(y values in leaf)

Instead of Gini or entropy, regression trees use variance reduction to choose splits. The best split minimizes the weighted variance of the child nodes — same greedy idea, different metric.

The final model produces a piecewise-constant (step) function — flat within each leaf's region of feature space. More leaves mean finer steps, but too many leaves lead to overfitting.
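A small sketch of that step function, assuming scikit-learn and NumPy with synthetic data: a depth-3 regression tree fit to a noisy sine curve can output at most 8 distinct values, one per leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)    # at most 2^3 = 8 leaves

# Predictions over a fine grid take only a handful of distinct values:
# one constant (the mean of y in that leaf) per region of feature space.
grid = np.linspace(0, 6, 200).reshape(-1, 1)
print(np.unique(tree.predict(grid).round(3)))
```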
Why are single decision trees rarely used alone in production?
💡 Train a tree on slightly different data subsets. Do you get the same tree?
Decision trees are 'high variance' learners. Add or remove a few training points and the entire tree structure can change — different root, different splits, different predictions. This instability is their biggest weakness. Random Forests fix this by training hundreds of trees on random subsets and averaging. XGBoost fixes it by building trees sequentially, each correcting the previous one's errors.
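You can see the instability directly by training on slightly different 90% subsets of the same data and comparing what each tree learns. A sketch, assuming scikit-learn's bundled breast-cancer dataset and NumPy:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for seed in range(3):
    # Keep a different random 90% of the rows each time.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=int(0.9 * len(X)), replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    print(f"subset {seed}: root feature {tree.tree_.feature[0]}, "
          f"threshold {tree.tree_.threshold[0]:.2f}, "
          f"{tree.get_n_leaves()} leaves, depth {tree.get_depth()}")
```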
🎓 What You Now Know
✓ Trees learn a flowchart of questions — Each node splits on one feature + threshold. Leaves make predictions.
✓ Gini impurity measures node purity — Lower = purer. The tree greedily picks splits that maximize purity gain.
✓ Unpruned trees overfit catastrophically — Always set max_depth, min_samples_leaf, or use post-pruning.
✓ High variance is the fatal flaw — Small data changes = completely different tree. Fix: ensembles.
✓ Trees are the foundation of Random Forest and XGBoost — Learn trees, and you understand the building block of the most powerful tabular-data models.
Decision trees are the gateway to ensemble methods. On their own, they overfit. But combine hundreds of them intelligently and you get Random Forests and Gradient Boosting — the most dominant algorithms for tabular data in the real world. 🚀
↗ Keep Learning
Random Forests — Why 1000 Bad Models Beat 1 Good One
A scroll-driven visual deep dive into Random Forests. Learn bagging, feature randomness, out-of-bag error, and why ensembles are the most reliable ML technique.
Gradient Boosting & XGBoost — The Kaggle King
A scroll-driven visual deep dive into gradient boosting. Learn how weak learners combine sequentially, how XGBoost optimizes the process, and why it dominates tabular ML competitions.
Bias-Variance Tradeoff — The Most Important Concept in ML
A scroll-driven visual deep dive into the bias-variance tradeoff. Learn why every model makes errors, how underfitting and overfitting emerge, and how to balance them.