10 min deep dive · machine-learning · ensemble

Gradient Boosting & XGBoost — The Kaggle King

A scroll-driven visual deep dive into gradient boosting. Learn how weak learners combine sequentially, how XGBoost optimizes the process, and why it dominates tabular ML competitions.

Introduction

Ensemble Methods · Sequential

Each tree fixes the mistakes of the one before it.

Random Forest trains trees independently. Gradient Boosting trains them sequentially — each tree learns from the errors of the previous ensemble. This greedy, additive strategy wins over 70% of ML competitions on tabular data.

Fitting Residuals

Step 1: Learn the Errors

Figure: Sequential trees. Tree 0 predicts the mean, each following tree fits the residuals of the ensemble so far (y − Fₘ(x)), and the final model F_M(x) is the sum of all trees.

Gradient Boosting: additive model

1. F₀(x) = ȳ
   Start with the simplest prediction: the mean of the training labels.
2. F₁(x) = F₀(x) + η · h₁(x)
   Add a small tree h₁ fitted to the residuals y − F₀(x).
3. Fₘ(x) = Fₘ₋₁(x) + η · hₘ(x)
   Each step adds a tree fitted to the current residuals, scaled by the learning rate η.
4. η (learning rate) is typically 0.01–0.3.
   Small η = more trees needed but better generalization: the classic bias-variance tradeoff.
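To make the loop concrete, here is a minimal from-scratch sketch of this additive model, assuming NumPy arrays and scikit-learn's DecisionTreeRegressor as the weak learner (the fit_gbm/predict_gbm names are illustrative, not from any library):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=100, eta=0.1, max_depth=3):
    """Plain gradient boosting for squared error: each tree fits the current residuals."""
    f0 = float(np.mean(y))                      # F0(x) = mean of the training labels
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                    # errors of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred = pred + eta * tree.predict(X)     # F_m = F_{m-1} + eta * h_m
        trees.append(tree)
    return f0, eta, trees

def predict_gbm(model, X):
    f0, eta, trees = model
    return f0 + eta * sum(tree.predict(X) for tree in trees)
```

Notice that η scales every tree's contribution, which is exactly the knob the quick check below asks about.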
🟢 Quick Check

Why do we multiply each tree's prediction by a learning rate η < 1?

Gradient View

The ‘Gradient’ in Gradient Boosting

Gradient descent in function space

1. L(y, F) = ½(y − F(x))²
   Start from the MSE loss...
2. −∂L/∂F = y − F(x) = residuals
   The negative gradient of MSE is exactly the residuals!
3. For log-loss: −∂L/∂F = y − p(x)
   For classification, the gradient is (true label − predicted probability).
4. hₘ(x) ≈ −∂L/∂F
   Each tree approximates the negative gradient → steepest descent in function space.
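The same loop works for binary classification once the pseudo-residuals are swapped in. A sketch under the assumption that F(x) is the model's log-odds and p(x) = sigmoid(F(x)); real implementations also apply a per-leaf Newton-style correction, which is omitted here:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_logloss(X, y, n_trees=100, eta=0.1, max_depth=3):
    """Boosting on log-loss: trees fit pseudo-residuals y - p(x) in log-odds space."""
    p0 = float(np.clip(np.mean(y), 1e-6, 1 - 1e-6))
    f0 = np.log(p0 / (1 - p0))                   # F0 = log-odds of the base rate
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        p = 1.0 / (1.0 + np.exp(-F))             # current predicted probability
        pseudo_residuals = y - p                 # negative gradient of log-loss w.r.t. F
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, pseudo_residuals)
        F = F + eta * tree.predict(X)            # step along the negative gradient
        trees.append(tree)
    return f0, eta, trees
```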
🟡 Checkpoint

For MSE loss, fitting a tree to residuals is equivalent to fitting a tree to negative gradients. For cross-entropy loss, what are the 'pseudo-residuals'?

XGBoost

XGBoost: Gradient Boosting, Engineered

XGBoost's regularized objective

1. Obj = Σ L(yᵢ, ŷᵢ) + Σ Ω(fₖ)
   Loss + regularization over all trees.
2. Ω(f) = γT + ½λ||w||²
   T = number of leaves (penalizes complex trees), w = leaf weights (L2 penalty).
3. Uses a 2nd-order Taylor expansion of the loss.
   Needs both the gradient gᵢ and the hessian hᵢ → more precise split finding.
4. Gain = ½[G²_L/(H_L+λ) + G²_R/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ)] − γ
   Split gain formula: if gain < 0, don't split → built-in pruning.
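The gain formula translates almost directly into code. A minimal sketch, assuming NumPy arrays g_left/h_left and g_right/h_right hold the per-sample gradients and hessians for a candidate split (for squared error, gᵢ = ŷᵢ − yᵢ and hᵢ = 1):

```python
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """XGBoost-style split gain: improvement from splitting a node into left/right children."""
    def leaf_score(G, H):
        return G * G / (H + lam)                 # structure score of a single leaf
    G_L, H_L = g_left.sum(), h_left.sum()
    G_R, H_R = g_right.sum(), h_right.sum()
    return 0.5 * (leaf_score(G_L, H_L) + leaf_score(G_R, H_R)
                  - leaf_score(G_L + G_R, H_L + H_R)) - gamma
```

A negative return value means the γ penalty outweighs the loss reduction, so the split is rejected: the built-in pruning described above.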
Key Parameters

The Parameters That Matter

• n_estimators (100–10000): Number of trees. More trees + lower learning rate = better. Use early stopping!
• learning_rate / eta (0.01–0.3): Shrinks each tree's contribution. Lower = more trees needed, better generalization.
• max_depth (3–8): Unlike RF, boosted trees should be SHALLOW (weak learners). Depth 3–6 is common.
• subsample + colsample_bytree (0.5–1.0): Random subsets of data and features per tree. Adds stochasticity → reduces overfitting. This is "Stochastic Gradient Boosting", combining boosting with bagging ideas.
Key hyperparameters and their effects
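Put together, the usual recipe is a small eta, a large round budget, and early stopping on a validation set. A sketch using the xgboost package's native training API on synthetic data (the parameter values are illustrative, not recommendations):

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

params = {
    "objective": "reg:squarederror",
    "eta": 0.05,                 # small learning rate...
    "max_depth": 4,              # shallow weak learners
    "subsample": 0.8,            # stochastic gradient boosting
    "colsample_bytree": 0.8,
    "lambda": 1.0,               # L2 penalty on leaf weights
    "gamma": 0.0,                # minimum gain required to split
}

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

booster = xgb.train(
    params,
    dtrain,
    num_boost_round=5000,        # ...so allow many trees
    evals=[(dval, "val")],
    early_stopping_rounds=50,    # stop once validation error plateaus
    verbose_eval=False,
)
print("best iteration:", booster.best_iteration)
```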
🔴 Challenge

In XGBoost, individual trees are typically much shallower (depth 3-6) than in Random Forests (often grown to full depth). Why?

RF vs GB

Random Forest vs Gradient Boosting

🌲 Random Forest
• Trees trained INDEPENDENTLY
• Deep trees (strong learners)
• Reduces VARIANCE
• Parallelizable
• Hard to overfit
• Fewer hyperparameters
Best for: reliable baseline, quick results

🚀 Gradient Boosting
• Trees trained SEQUENTIALLY
• Shallow trees (weak learners)
• Reduces BIAS
• Sequential (harder to parallelize)
• CAN overfit (needs tuning)
• Many hyperparameters
Best for: max accuracy, competitions
The two philosophies of ensemble learning
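As a quick (and deliberately untuned) illustration of the two philosophies, a sketch that cross-validates both scikit-learn ensembles on synthetic data; the scores it prints depend entirely on the dataset and settings:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=300, random_state=0),  # deep, independent trees
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=300, learning_rate=0.05, max_depth=3, random_state=0      # shallow, sequential trees
    ),
}

for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```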

🎓 What You Now Know

Boosting fits residuals sequentially — Each tree corrects the errors of the ensemble so far.

It’s gradient descent in function space — Residuals = negative gradient of the loss.

XGBoost adds regularization + 2nd-order optimization — Penalizes tree complexity, uses hessians.

Key: learning rate + early stopping — Small η, many shallow trees, stop when validation plateaus.

Wins competitions, but needs tuning — Not as “set-and-forget” as Random Forest.

XGBoost is the tool that turned “good” into “winning” for tabular machine learning. But it’s also a tool that rewards understanding: the better you understand the bias-variance tradeoff, the better you’ll tune it. 🏆

📄 XGBoost: A Scalable Tree Boosting System (Chen & Guestrin, 2016)
