10 min deep dive · machine-learning · ensemble

Gradient Boosting & XGBoost — The Kaggle King

A scroll-driven visual deep dive into gradient boosting. Learn how weak learners combine sequentially, how XGBoost optimizes the process, and why it dominates tabular ML competitions.

Introduction

Ensemble Methods · Sequential

Each tree fixes the mistakes of the one before it.

Random Forest trains trees independently. Gradient Boosting trains them sequentially — each tree learns from the errors of the previous ensemble. This greedy, additive strategy wins over 70% of ML competitions on tabular data.

Fitting Residuals

Step 1: Learn the Errors

Figure: Sequential trees. Tree 0 predicts the mean, each following tree fits the residuals of the ensemble so far (y − Fₘ(x)), and the final model F_M(x) is the sum of all trees.

Gradient Boosting: additive model

1. F₀(x) = ȳ
   Start with the simplest prediction: the mean of the training labels.
2. F₁(x) = F₀(x) + η · h₁(x)
   Add a small tree h₁ fitted to the residuals y − F₀(x).
3. Fₘ(x) = Fₘ₋₁(x) + η · hₘ(x)
   Each step adds a tree fitted to the current residuals, scaled by the learning rate η.
4. η (learning rate) is typically 0.01–0.3.
   Small η = more trees needed but better generalization: the classic bias-variance tradeoff.
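To make the loop concrete, here is a minimal from-scratch sketch of this additive model, assuming NumPy arrays and scikit-learn's DecisionTreeRegressor as the weak learner (the fit_gbm/predict_gbm names are illustrative, not from any library):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=100, eta=0.1, max_depth=3):
    """Plain gradient boosting for squared error: each tree fits the current residuals."""
    f0 = float(np.mean(y))                      # F0(x) = mean of the training labels
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                    # errors of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred = pred + eta * tree.predict(X)     # F_m = F_{m-1} + eta * h_m
        trees.append(tree)
    return f0, eta, trees

def predict_gbm(model, X):
    f0, eta, trees = model
    return f0 + eta * sum(tree.predict(X) for tree in trees)
```

Notice that η scales every tree's contribution, which is exactly the knob the quick check below asks about.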
🟢 Quick Check

Why do we multiply each tree's prediction by a learning rate η < 1?

Gradient View

The ‘Gradient’ in Gradient Boosting

Gradient descent in function space

1. L(y, F) = ½(y − F(x))²
   Start from the MSE loss...
2. −∂L/∂F = y − F(x) = residuals
   The negative gradient of MSE is exactly the residuals!
3. For log-loss: −∂L/∂F = y − p(x)
   For classification, the gradient is (true label − predicted probability).
4. hₘ(x) ≈ −∂L/∂F
   Each tree approximates the negative gradient → steepest descent in function space.
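The same loop works for binary classification once the pseudo-residuals are swapped in. A sketch under the assumption that F(x) is the model's log-odds and p(x) = sigmoid(F(x)); real implementations also apply a per-leaf Newton-style correction, which is omitted here:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_logloss(X, y, n_trees=100, eta=0.1, max_depth=3):
    """Boosting on log-loss: trees fit pseudo-residuals y - p(x) in log-odds space."""
    p0 = float(np.clip(np.mean(y), 1e-6, 1 - 1e-6))
    f0 = np.log(p0 / (1 - p0))                   # F0 = log-odds of the base rate
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        p = 1.0 / (1.0 + np.exp(-F))             # current predicted probability
        pseudo_residuals = y - p                 # negative gradient of log-loss w.r.t. F
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, pseudo_residuals)
        F = F + eta * tree.predict(X)            # step along the negative gradient
        trees.append(tree)
    return f0, eta, trees
```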
🟡 Checkpoint

For MSE loss, fitting a tree to residuals is equivalent to fitting a tree to negative gradients. For cross-entropy loss, what are the 'pseudo-residuals'?

XGBoost

XGBoost: Gradient Boosting, Engineered

XGBoost's regularized objective

1. Obj = Σ L(yᵢ, ŷᵢ) + Σ Ω(fₖ)
   Loss + regularization over all trees.
2. Ω(f) = γT + ½λ||w||²
   T = number of leaves (penalizes complex trees), w = leaf weights (L2 penalty).
3. Uses a 2nd-order Taylor expansion of the loss.
   Needs both the gradient gᵢ and the hessian hᵢ → more precise split finding.
4. Gain = ½[G²_L/(H_L+λ) + G²_R/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ)] − γ
   Split gain formula: if gain < 0, don't split → built-in pruning.
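The gain formula translates almost directly into code. A minimal sketch, assuming NumPy arrays g_left/h_left and g_right/h_right hold the per-sample gradients and hessians for a candidate split (for squared error, gᵢ = ŷᵢ − yᵢ and hᵢ = 1):

```python
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """XGBoost-style split gain: improvement from splitting a node into left/right children."""
    def leaf_score(G, H):
        return G * G / (H + lam)                 # structure score of a single leaf
    G_L, H_L = g_left.sum(), h_left.sum()
    G_R, H_R = g_right.sum(), h_right.sum()
    return 0.5 * (leaf_score(G_L, H_L) + leaf_score(G_R, H_R)
                  - leaf_score(G_L + G_R, H_L + H_R)) - gamma
```

A negative return value means the γ penalty outweighs the loss reduction, so the split is rejected: the built-in pruning described above.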
Key Parameters

The Parameters That Matter

• n_estimators (100–10000): Number of trees. More trees + lower learning rate = better. Use early stopping!
• learning_rate / eta (0.01–0.3): Shrinks each tree's contribution. Lower = more trees needed, better generalization.
• max_depth (3–8): Unlike RF, boosted trees should be SHALLOW (weak learners). Depth 3–6 is common.
• subsample + colsample_bytree (0.5–1.0): Random subsets of data and features per tree. Adds stochasticity → reduces overfitting. This is "Stochastic Gradient Boosting", combining boosting with bagging ideas.
Key hyperparameters and their effects
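Put together, the usual recipe is a small eta, a large round budget, and early stopping on a validation set. A sketch using the xgboost package's native training API on synthetic data (the parameter values are illustrative, not recommendations):

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

params = {
    "objective": "reg:squarederror",
    "eta": 0.05,                 # small learning rate...
    "max_depth": 4,              # shallow weak learners
    "subsample": 0.8,            # stochastic gradient boosting
    "colsample_bytree": 0.8,
    "lambda": 1.0,               # L2 penalty on leaf weights
    "gamma": 0.0,                # minimum gain required to split
}

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

booster = xgb.train(
    params,
    dtrain,
    num_boost_round=5000,        # ...so allow many trees
    evals=[(dval, "val")],
    early_stopping_rounds=50,    # stop once validation error plateaus
    verbose_eval=False,
)
print("best iteration:", booster.best_iteration)
```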
🔴 Challenge

In XGBoost, individual trees are typically much shallower (depth 3-6) than in Random Forests (often grown to full depth). Why?

RF vs GB

Random Forest vs Gradient Boosting

🌲 Random Forest
• Trees trained INDEPENDENTLY
• Deep trees (strong learners)
• Reduces VARIANCE
• Parallelizable
• Hard to overfit
• Fewer hyperparameters
Best for: reliable baseline, quick results

🚀 Gradient Boosting
• Trees trained SEQUENTIALLY
• Shallow trees (weak learners)
• Reduces BIAS
• Sequential (harder to parallelize)
• CAN overfit (needs tuning)
• Many hyperparameters
Best for: max accuracy, competitions
The two philosophies of ensemble learning
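As a quick (and deliberately untuned) illustration of the two philosophies, a sketch that cross-validates both scikit-learn ensembles on synthetic data; the scores it prints depend entirely on the dataset and settings:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=300, random_state=0),  # deep, independent trees
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=300, learning_rate=0.05, max_depth=3, random_state=0      # shallow, sequential trees
    ),
}

for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```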

🎓 What You Now Know

Boosting fits residuals sequentially — Each tree corrects the errors of the ensemble so far.

It’s gradient descent in function space — Residuals = negative gradient of the loss.

XGBoost adds regularization + 2nd-order optimization — Penalizes tree complexity, uses hessians.

Key: learning rate + early stopping — Small η, many shallow trees, stop when validation plateaus.

Wins competitions, but needs tuning — Not as “set-and-forget” as Random Forest.

XGBoost is the tool that turned “good” into “winning” for tabular machine learning. But it’s also a tool that rewards understanding: the better you understand the bias-variance tradeoff, the better you’ll tune it. 🏆

📄 XGBoost: A Scalable Tree Boosting System (Chen & Guestrin, 2016)
