Decision Trees — How Machines Learn to Ask Questions
A scroll-driven visual deep dive into decision trees. Learn how trees split data, what Gini impurity and information gain mean, and why trees overfit like crazy.
Tree-Based Learning
If this, then that.
Learned from data.
Decision trees learn a flowchart of yes/no questions from data. They're among the most interpretable ML models, the building block of random forests and XGBoost, and one of the few models a non-engineer can actually read.
A Tree Is a Flowchart
What makes decision trees uniquely interpretable compared to other ML models?
💡 Can you explain a neural network's prediction as a series of yes/no questions?
Decision trees are one of the few ML models where you can point to a prediction and explain it completely to a non-technical person: 'The model predicted NO because outlook was sunny AND humidity was high.' Try doing that with a neural network or an SVM. This interpretability makes trees essential in regulated industries (healthcare, finance) where you must be able to explain every decision.
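To make that concrete, here is a minimal sketch, assuming scikit-learn; the tiny weather-style dataset and the feature names are made up for illustration. It trains a small tree and prints its rules as plain if/else text:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical 0/1 features: [outlook_sunny, humidity_high]; labels: play tennis?
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = ["no", "yes", "yes", "yes", "no", "yes"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text turns the fitted tree into nested if/else rules anyone can read.
print(export_text(tree, feature_names=["outlook_sunny", "humidity_high"]))
```

Every prediction corresponds to exactly one root-to-leaf path in that printout, and that path is the whole explanation.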
How Does the Tree Choose Questions?
Measuring impurity
Gini impurity measures the probability of misclassifying a randomly chosen sample. Gini = 0 means perfectly pure (all one class). It's fast to compute and scikit-learn's default.

Gini(t) = 1 - Σ pᵢ²

Entropy measures the information content (disorder) in a node. Entropy = 0 means pure. Higher entropy means more mixed classes — borrowed from information theory.

Entropy(t) = -Σ pᵢ · log₂(pᵢ)

A 50/50 class split is the worst case — maximum uncertainty, like flipping a fair coin. Both metrics peak here.

50/50 split → Gini = 0.5, Entropy = 1.0

A 90/10 split is much purer — the node is dominated by one class. Both metrics drop significantly, reflecting the reduced uncertainty.

90/10 split → Gini = 0.18, Entropy = 0.47
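To make the two formulas concrete, here is a tiny sketch in plain Python that reproduces the numbers above:

```python
from math import log2

def gini(p):
    """Gini impurity for a list of class proportions."""
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    """Entropy in bits; terms with probability 0 are skipped."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # -> 0.5, 1.0
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))   # -> ~0.18, ~0.47
```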
Information gain: picking the best split

Information gain measures how much purity improves after splitting. Subtract the weighted child impurity from the parent's impurity — bigger gain means a better split.

Gain = Impurity(parent) - Σ (nⱼ/N) · Impurity(childⱼ)

The algorithm exhaustively tries every feature and every possible threshold value (e.g., age ≤ 20, age ≤ 25, ...) to find which split produces the highest information gain.
Select the split with the highest gain — this is a greedy choice, optimizing locally at each node rather than planning the entire tree structure ahead.
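Here is a rough sketch of that search for a single numeric feature, in plain Python with made-up age/label data. (Real implementations typically test midpoints between consecutive sorted values, but the idea is the same.)

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Try every candidate threshold for one feature; keep the highest gain."""
    parent = gini(labels)
    best = (None, 0.0)                                   # (threshold, gain)
    for t in sorted(set(values)):
        left  = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        if not left or not right:
            continue                                     # split must produce two non-empty children
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best

ages   = [18, 22, 25, 31, 40, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
print(best_threshold(ages, bought))                      # -> (25, 0.5), a perfect split here
```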
A node contains 100 samples: 50 cats and 50 dogs. After splitting on 'has_whiskers', the left child has 45 cats and 5 dogs, the right child has 5 cats and 45 dogs. Is this a good split?
💡 Calculate parent Gini, then weighted child Gini. Is the drop significant?
Parent Gini: 1 - (0.5² + 0.5²) = 0.5 (maximum impurity). Left child Gini: 1 - (0.9² + 0.1²) = 0.18. Right child Gini: same = 0.18. Weighted average child Gini: 0.5(0.18) + 0.5(0.18) = 0.18. Information gain: 0.5 - 0.18 = 0.32. That's a massive gain! The children don't need to be perfectly pure — they just need to be significantly purer than the parent.
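The same arithmetic in a few lines of plain Python, if you want to check it yourself:

```python
def gini(proportions):
    return 1 - sum(p ** 2 for p in proportions)

parent = gini([0.5, 0.5])                  # 50 cats / 50 dogs -> 0.5
left   = gini([45 / 50, 5 / 50])           # 45 cats / 5 dogs  -> 0.18
right  = gini([5 / 50, 45 / 50])           # 5 cats / 45 dogs  -> 0.18

weighted_children = (50 / 100) * left + (50 / 100) * right
print(round(parent - weighted_children, 2))   # information gain -> 0.32
```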
The Recursive Algorithm
CART uses a greedy algorithm to build trees. Why is this a problem?
💡 What does 'greedy' mean in algorithms? Best local choice vs best global plan...
Finding the globally optimal decision tree is NP-hard (would require trying all possible split orderings). CART's greedy approach picks the best split at each node independently. This can lead to suboptimal trees — maybe splitting on Feature B first and Feature A second would be better overall, but CART chose Feature A first because it had better immediate gain. Despite this, greedy works well in practice, and ensembles (Random Forest, XGBoost) mitigate suboptimality by averaging many trees.
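A stripped-down sketch of that greedy recursion looks like the following. This is illustrative pure Python (the function names and toy data are my own), not how scikit-learn is actually implemented:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Greedy step: the best (feature, threshold) for THIS node only."""
    parent, n, best = gini(y), len(y), (None, None, 0.0)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left  = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if gain > best[2]:
                best = (f, t, gain)
    return best

def build(X, y, depth=0, max_depth=3):
    f, t, gain = best_split(X, y)
    if f is None or gain == 0 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]           # leaf: majority class
    left_idx  = [i for i, row in enumerate(X) if row[f] <= t]
    right_idx = [i for i, row in enumerate(X) if row[f] > t]
    return {"feature": f, "threshold": t,
            "left":  build([X[i] for i in left_idx],  [y[i] for i in left_idx],  depth + 1, max_depth),
            "right": build([X[i] for i in right_idx], [y[i] for i in right_idx], depth + 1, max_depth)}

X = [[2, 7], [3, 8], [1, 3], [6, 1], [7, 2], [8, 6]]
y = ["a", "a", "a", "b", "b", "b"]
print(build(X, y))   # -> {'feature': 0, 'threshold': 3, 'left': 'a', 'right': 'b'}
```

Notice that build() never revisits a split once it is made; each recursive call only sees the rows that reached its node.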
The Overfitting Monster
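One quick way to watch the monster at work: train an unpruned tree and a depth-limited tree on the same noisy data and compare train vs. test accuracy. A sketch, assuming scikit-learn and a synthetic dataset:

```python
# An unpruned tree fits the training set perfectly, but with noisy labels it
# typically does worse on held-out data than a depth-limited tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):          # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_tr, y_tr):.2f}, test={tree.score(X_te, y_te):.2f}")
```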
Trees for Regression Too
Regression tree differences
Instead of majority-class voting, regression tree leaves output the mean of all target values in that region. Each leaf represents a constant prediction for its partition of the feature space.

Leaf prediction = mean(y values in leaf)

Instead of Gini or entropy, regression trees use variance reduction to choose splits. The best split minimizes the weighted variance of the child nodes — same greedy idea, different metric.

The final model produces a piecewise-constant (step) function — flat within each leaf's region of feature space. More leaves mean finer steps, but too many leaves lead to overfitting.
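A small sketch of that step function, assuming scikit-learn and NumPy with synthetic data: a depth-3 regression tree fit to a noisy sine curve can output at most 8 distinct values, one per leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)    # at most 2^3 = 8 leaves

# Predictions over a fine grid take only a handful of distinct values:
# one constant (the mean of y in that leaf) per region of feature space.
grid = np.linspace(0, 6, 200).reshape(-1, 1)
print(np.unique(tree.predict(grid).round(3)))
```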
Why are single decision trees rarely used alone in production?
💡 Train a tree on slightly different data subsets. Do you get the same tree?
Decision trees are 'high variance' learners. Add or remove a few training points and the entire tree structure can change — different root, different splits, different predictions. This instability is their biggest weakness. Random Forests fix this by training hundreds of trees on random subsets and averaging. XGBoost fixes it by building trees sequentially, each correcting the previous one's errors.
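You can see the instability directly by training on slightly different 90% subsets of the same data and comparing what each tree learns. A sketch, assuming scikit-learn's bundled breast-cancer dataset and NumPy:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for seed in range(3):
    # Keep a different random 90% of the rows each time.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=int(0.9 * len(X)), replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    print(f"subset {seed}: root feature {tree.tree_.feature[0]}, "
          f"threshold {tree.tree_.threshold[0]:.2f}, "
          f"{tree.get_n_leaves()} leaves, depth {tree.get_depth()}")
```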
🎓 What You Now Know
✓ Trees learn a flowchart of questions — Each node splits on one feature + threshold. Leaves make predictions.
✓ Gini impurity measures node purity — Lower = purer. The tree greedily picks splits that maximize purity gain.
✓ Unpruned trees overfit catastrophically — Always set max_depth, min_samples_leaf, or use post-pruning.
✓ High variance is the fatal flaw — Small data changes = completely different tree. Fix: ensembles.
✓ Trees are the foundation of Random Forest and XGBoost — Learn trees, and you understand the building block of the most powerful tabular-data models.
Decision trees are the gateway to ensemble methods. On their own, they overfit. But combine hundreds of them intelligently and you get Random Forests and Gradient Boosting — the most dominant algorithms for tabular data in the real world. 🚀
↗ Keep Learning
Random Forests — Why 1000 Bad Models Beat 1 Good One
A scroll-driven visual deep dive into Random Forests. Learn bagging, feature randomness, out-of-bag error, and why ensembles are the most reliable ML technique.
Gradient Boosting & XGBoost — The Kaggle King
A scroll-driven visual deep dive into gradient boosting. Learn how weak learners combine sequentially, how XGBoost optimizes the process, and why it dominates tabular ML competitions.
Bias-Variance Tradeoff — The Most Important Concept in ML
A scroll-driven visual deep dive into the bias-variance tradeoff. Learn why every model makes errors, how underfitting and overfitting emerge, and how to balance them.