13 min deep dive · machine-learning · classification

Decision Trees — How Machines Learn to Ask Questions

A scroll-driven visual deep dive into decision trees. Learn how trees split data, what Gini impurity and information gain mean, and why trees overfit like crazy.

Introduction

Tree-Based Learning

If this, then that.
Learned from data.

Decision trees learn a flowchart of yes/no questions from data. They're among the most interpretable ML models, the building block of random forests and XGBoost, and one of the few models a non-engineer can actually read.
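To see what that looks like in practice, here's a minimal sketch (assuming scikit-learn and its bundled Iris dataset, both illustrative choices): fit a shallow tree and print the learned flowchart as plain-text rules.

```python
# A minimal sketch: fit a small tree and print its learned rules.
# Assumes scikit-learn is installed; the Iris dataset is just an example.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# export_text renders the fitted tree as nested if/else rules a human can read.
print(export_text(tree, feature_names=list(data.feature_names)))
```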

How Trees Split

A Tree Is a Flowchart

[Tree diagram] Root question: ☀️ Outlook? Sunny → 💧 Humidity? (High → NO, too humid; Normal → YES, play). Overcast → YES, always play. Rainy → 💨 Wind? (Strong → NO, too windy; Weak → YES, play).
A decision tree for 'Should I play tennis?' — learned from data
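Written out by hand, the tree above is nothing more than nested if/else statements. A quick illustrative sketch (the function name and string values are invented for this example):

```python
# The play-tennis tree from the diagram, hand-written as nested if/else.
# This is all a decision tree is once it has been learned.
def should_play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    if outlook == "overcast":
        return True                      # overcast → always play
    if outlook == "sunny":
        return humidity == "normal"      # sunny → play only if humidity is normal
    return wind == "weak"                # rainy → play only if wind is weak

print(should_play_tennis("sunny", "high", "weak"))    # False (too humid)
print(should_play_tennis("rainy", "normal", "weak"))  # True
```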
🟢 Quick Check

What makes decision trees uniquely interpretable compared to other ML models?

Gini & Entropy

How Does the Tree Choose Questions?

Measuring impurity

🎲 Gini Impurity

Measures the probability of misclassifying a randomly chosen sample. Gini = 0 means perfectly pure (all one class). Fast to compute and scikit-learn's default.

Gini(t) = 1 - Σ pᵢ²
📡 Entropy

Measures the information content (disorder) in a node. Entropy = 0 means pure. Higher entropy means more mixed classes — borrowed from information theory.

Entropy(t) = -Σ pᵢ · log₂(pᵢ)
⚖️ Maximum Impurity

A 50/50 class split is the worst case — maximum uncertainty, like flipping a fair coin. Both metrics peak here: Gini = 0.5, Entropy = 1.0.

50/50 split → Gini = 0.5, Entropy = 1.0
Near-Pure Node

A 90/10 split is much purer — the node is dominated by one class. Both metrics drop significantly, reflecting the reduced uncertainty.

90/10 split → Gini = 0.18, Entropy = 0.47
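To make those numbers concrete, here's a small sketch (assuming NumPy) that computes both metrics from class probabilities and reproduces the 50/50 and 90/10 figures above.

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - Σ pᵢ² over class probabilities p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy in bits: -Σ pᵢ·log₂(pᵢ), skipping zero probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5, 1.0 (maximum impurity)
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))  # 0.18, ≈0.47 (near-pure node)
```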

Information gain: picking the best split

📐 Compute Gain

Information gain measures how much purity improves after splitting. Subtract the weighted child impurity from the parent's impurity — bigger gain means a better split.

Gain = Impurity(parent) - Σ(nⱼ/N) · Impurity(childⱼ)
🔍 Try All Splits

The algorithm exhaustively tries every feature and every possible threshold value (e.g., age ≤ 20, age ≤ 25, ...) to find which split produces the highest information gain.

🏆 Pick the Best

Select the split with the highest gain — this is a greedy choice, optimizing locally at each node rather than planning the entire tree structure ahead.
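Here's a short sketch of the gain computation under the Gini criterion; the parent and child counts are invented purely for illustration.

```python
import numpy as np

def gini_from_counts(counts):
    """Gini impurity computed from raw class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def information_gain(parent_counts, children_counts):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(sum(child) for child in children_counts)
    weighted = sum((sum(child) / n) * gini_from_counts(child) for child in children_counts)
    return gini_from_counts(parent_counts) - weighted

# Illustrative split: a 60/40 parent divided into two children.
print(round(information_gain([60, 40], [[50, 10], [10, 30]]), 3))  # ≈ 0.163
```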

🟡 Checkpoint

A node contains 100 samples: 50 cats and 50 dogs. After splitting on 'has_whiskers', the left child has 45 cats and 5 dogs, the right child has 5 cats and 45 dogs. Is this a good split?

Building a Tree

The Recursive Algorithm

🌳 Start: all data at the root → 🔍 Find the best split (feature + threshold with maximum info gain) → ✂️ Split the data into left and right children → 🤔 Stop? (node is pure or max depth reached). If no, recurse on each child; if yes, 🍃 make a leaf that predicts the majority class.
CART: the algorithm behind decision trees
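Below is a heavily simplified sketch of that recursion in Python. It is not the production CART implementation, just the greedy find-split-then-recurse structure with a Gini criterion and a max-depth stop.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Try every feature and every threshold; return the split with the highest gain."""
    best_feature, best_threshold, best_gain = None, None, 0.0
    parent_impurity = gini(y)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left = X[:, feature] <= threshold
            if left.all() or not left.any():
                continue  # a useful split needs two non-empty children
            weighted = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if parent_impurity - weighted > best_gain:
                best_feature, best_threshold = feature, threshold
                best_gain = parent_impurity - weighted
    return best_feature, best_threshold, best_gain

def build_tree(X, y, depth=0, max_depth=3):
    """Greedy recursion: split while there is gain and depth remains, else make a leaf."""
    feature, threshold, gain = best_split(X, y)
    if depth == max_depth or gain == 0.0:
        classes, counts = np.unique(y, return_counts=True)
        return {"leaf": classes[np.argmax(counts)]}  # majority-class vote
    left = X[:, feature] <= threshold
    return {"feature": feature, "threshold": threshold,
            "left": build_tree(X[left], y[left], depth + 1, max_depth),
            "right": build_tree(X[~left], y[~left], depth + 1, max_depth)}
```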
🟡 Checkpoint

CART uses a greedy algorithm to build trees. Why is this a problem?

Overfitting

The Overfitting Monster

Unpruned (overfit) • Depth: unlimited • 1 sample per leaf • Train accuracy: 100% • Test accuracy: 65% • Memorized noise, outliers, and measurement errors.

Pruned (good fit) • max_depth: 5 • min_samples_leaf: 10 • Train accuracy: 88% • Test accuracy: 85% • Captured signal, ignored noise. Generalizes well.
Unpruned trees memorize noise. Pruned trees capture the real pattern.
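In scikit-learn, the pruning knobs are constructor arguments. A hedged sketch on synthetic data follows; the exact accuracies will differ from the figures above, but the train/test gap shrinks in the same way.

```python
# Compare an unpruned tree with a pruned one on noisy synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10,
                                random_state=0).fit(X_train, y_train)

for name, model in [("unpruned", unpruned), ("pruned", pruned)]:
    print(name,
          "train:", round(model.score(X_train, y_train), 2),
          "test:", round(model.score(X_test, y_test), 2))
```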
Regression Trees

Trees for Regression Too

Regression tree differences

📊 Leaf Prediction

Instead of majority-class voting, regression tree leaves output the mean of all target values in that region. Each leaf represents a constant prediction for its partition of the feature space.

Leaf prediction = mean(y values in leaf)
✂️ Split Criterion

Instead of Gini or entropy, regression trees use variance reduction to choose splits. The best split minimizes the weighted variance of the child nodes — same greedy idea, different metric.

📈 Output Shape

The final model produces a piecewise-constant (step) function — flat within each leaf's region of feature space. More leaves mean finer steps, but too many leads to overfitting.
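A minimal sketch with scikit-learn's DecisionTreeRegressor follows. The noisy sine data is invented for illustration, but it shows the mean-per-leaf, piecewise-constant behavior described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# Splits are chosen by variance (squared-error) reduction, the default criterion.
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Each leaf predicts the mean of its training targets, so the output is a
# step function: a depth-3 tree can produce at most 2**3 = 8 distinct values.
predictions = tree.predict(np.linspace(0, 6, 100).reshape(-1, 1))
print(np.unique(predictions.round(3)))
```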

🟡 Checkpoint

Why are single decision trees rarely used alone in production?

🎓 What You Now Know

Trees learn a flowchart of questions — Each node splits on one feature + threshold. Leaves make predictions.

Gini impurity measures node purity — Lower = purer. The tree greedily picks splits that maximize purity gain.

Unpruned trees overfit catastrophically — Always set max_depth, min_samples_leaf, or use post-pruning.

High variance is the fatal flaw — Small data changes = completely different tree. Fix: ensembles.

Trees are the foundation of Random Forest and XGBoost — Learn trees, and you understand the building block of the most powerful tabular-data models.

Decision trees are the gateway to ensemble methods. On their own, they overfit. But combine hundreds of them intelligently and you get Random Forests and Gradient Boosting — the most dominant algorithms for tabular data in the real world. 🚀
