14 min deep-dive · machine-learning · regularization

Ridge & Lasso — Taming Overfitting with Regularization

A scroll-driven visual deep dive into Ridge and Lasso regression. Learn why models overfit, how penalizing large weights fixes it, and why Lasso kills features.

Regularization Fundamentals

Your model is too eager to please
It memorizes training data instead of learning patterns. The fix? Put a leash on the weights. That’s regularization.

The Overfitting Problem

When models overfit, their weight values become extremely large. Why?

  • Large weights create sharp, wiggly curves that pass through every training point (see the sketch after this list)
  • Small weights create smooth, gentle curves that capture the general trend
  • Regularization adds a penalty for large weights to the loss function
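
To see this concretely, here is a minimal sketch: the same noisy data fit with a low-degree and a high-degree polynomial, comparing coefficient magnitudes. The dataset and the degrees are illustrative choices, not something prescribed above.

```python
# Minimal sketch: fit the same noisy points with a gentle and a wiggly polynomial
# and compare coefficient magnitudes. Data and degrees are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

gentle = np.polyfit(x, y, deg=3)    # smooth fit: modest coefficients
wiggly = np.polyfit(x, y, deg=10)   # near-interpolating fit: coefficients balloon

print("degree 3,  largest |w|:", np.abs(gentle).max())
print("degree 10, largest |w|:", np.abs(wiggly).max())
```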

The regularization idea

🎯 Regularized Loss = 📏 Prediction Error + 🛡️ Weight Penalty
📏 Prediction Error

The standard loss function measures how far predictions are from actual values. Without regularization, the model is free to use any weight values — often resulting in huge, overfitting weights.

Σ(yᵢ − ŷᵢ)²
🛡️ Weight Penalty

A cost for having large weights, controlled by λ. When λ = 0, there's no penalty (standard regression). As λ → ∞, all weights shrink to zero (flat line). The sweet spot balances accuracy and simplicity.

λ · Penalty(w)
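
To make the combination concrete, here is a minimal NumPy sketch of the regularized loss. The function name, array shapes, and the choice between an L2 and an L1 penalty are illustrative assumptions, not part of any library API.

```python
# Sketch: regularized loss = prediction error + λ · weight penalty.
import numpy as np

def regularized_loss(w, X, y, lam, penalty="l2"):
    """Compute Σ(yᵢ − ŷᵢ)² plus λ times a weight penalty (illustrative helper)."""
    error = np.sum((y - X @ w) ** 2)        # prediction error: Σ(yᵢ − ŷᵢ)²
    if penalty == "l2":
        pen = np.sum(w ** 2)                # Ridge-style penalty: Σwⱼ²
    else:
        pen = np.sum(np.abs(w))             # Lasso-style penalty: Σ|wⱼ|
    return error + lam * pen
```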
🟢 Quick Check

Why do overfit models tend to have very large weight values?

Ridge Regression (L2 Penalty)

Ridge regression math

📐 L2 Loss Function

Ridge adds the sum of squared weights to the loss. A weight of 10 is penalized 100× more than a weight of 1, so large weights get aggressively shrunk while small weights are barely affected.

Loss = Σ(yᵢ − ŷᵢ)² + λΣwⱼ²
🧮 Closed-Form Solution

Unlike Lasso, Ridge has an exact analytical solution. The λI term makes the matrix invertible even when features are correlated (multicollinear), which is one of Ridge's key practical advantages.

w* = (XᵀX + λI)⁻¹Xᵀy
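
A small NumPy sketch of that closed-form solution, assuming the intercept is handled separately (for example by centering X and y) and that λ > 0; the helper name is hypothetical.

```python
# Sketch: Ridge closed-form solution w* = (XᵀX + λI)⁻¹Xᵀy.
import numpy as np

def ridge_closed_form(X, y, lam):
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)  # XᵀX + λI is invertible for any λ > 0
    return np.linalg.solve(A, X.T @ y)       # solve the linear system, no explicit inverse
```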
🔥 λ Small

Weights stay large and the model fits the training data closely — more flexibility but higher risk of overfitting.

❄️ λ Large

Weights shrink toward zero, producing a smoother and simpler model — better generalization but potentially underfitting if too extreme.
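
One way to feel the effect of λ is to refit at several strengths and watch the weight norm shrink. A quick sketch with scikit-learn's Ridge on synthetic data; the λ grid is arbitrary, and note that scikit-learn calls the penalty strength alpha.

```python
# Sketch: refit Ridge at increasing penalty strengths and watch ||w|| shrink.
# scikit-learn's `alpha` plays the role of this article's λ.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

for lam in [0.01, 1.0, 10.0, 100.0, 1000.0]:
    w = Ridge(alpha=lam).fit(X, y).coef_
    print(f"λ = {lam:8.2f}   ||w||₂ = {np.linalg.norm(w):8.2f}")
```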

[Figure: the L2 constraint region w₁² + w₂² ≤ t (a circle) in the (w₁, w₂) plane, with the OLS optimum outside it and the Ridge solution sitting on the circle's edge with smaller weights.]
Ridge constrains weights to a circle. The optimal point is where the loss contours touch the circle.
🟡 Checkpoint

In Ridge regression, what happens as λ increases?

Lasso Regression (L1 Penalty)

Lasso regression math

💎 L1 Loss Function

Lasso adds the sum of absolute weight values to the loss. The linear penalty treats all weight magnitudes proportionally — unlike Ridge's quadratic penalty, it doesn't disproportionately punish large weights.

Loss = Σ(yᵢ − ŷᵢ)² + λΣ|wⱼ|
🔄 No Closed-Form

The absolute value function isn't differentiable at zero, so there's no neat analytical solution. Lasso requires iterative optimization like coordinate descent or subgradient methods.
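
For illustration, here is a minimal coordinate-descent sketch for the Lasso loss above, built on the soft-thresholding operator. It assumes no intercept, runs a fixed number of sweeps instead of checking convergence, and is meant as a sketch of the idea, not a production solver.

```python
# Sketch: coordinate descent for Loss = Σ(yᵢ − ŷᵢ)² + λΣ|wⱼ| (no intercept).
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) · max(|z| − t, 0): the operator that creates exact zeros."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Partial residual: remove feature j's current contribution
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            # λ/2 appears because the squared loss above carries no 1/2 factor
            w[j] = soft_threshold(rho, lam / 2.0) / (X[:, j] @ X[:, j])
    return w
```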

✂️ Sparse Solutions

Lasso's killer feature: it drives many weights to exactly zero, effectively removing features from the model. This gives you automatic feature selection — the model only uses the features that truly matter.

[Figure: the L1 constraint region |w₁| + |w₂| ≤ t (a diamond) in the (w₁, w₂) plane, with the OLS optimum outside it and the Lasso solution at a corner where w₂ = 0, i.e. that feature is killed.]
Lasso constrains weights to a diamond. The corners touch the axes — that's where weights hit exactly zero.
🟡 Checkpoint

Why does Lasso (L1) drive weights to exactly zero while Ridge (L2) doesn't?

Head-to-Head Comparison

Ridge (L2)
  • Penalty: λΣwⱼ²
  • Geometry: circle
  • ✓ Shrinks all weights
  • ✗ No feature selection
  • ✓ Closed-form solution
  • ✓ Handles correlated features
  • Best when: all features matter and you suspect multicollinearity
  • Output: dense weights (all nonzero)

Lasso (L1)
  • Penalty: λΣ|wⱼ|
  • Geometry: diamond
  • ✓ Drives weights to zero
  • ✓ Automatic feature selection
  • ✗ No closed-form (needs iterative optimization)
  • ✗ Arbitrarily picks among correlated features
  • Best when: many irrelevant features and you want a sparse model
  • Output: sparse weights (many zeros)
Same data, different regularizers — fundamentally different behaviors
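
The difference is easy to verify in code. A small sketch (synthetic data, arbitrary λ = 1.0) fits scikit-learn's Ridge and Lasso on data where only 5 of 50 features are informative, then counts the exactly-zero coefficients.

```python
# Sketch: Ridge vs Lasso on the same data, where only 5 of 50 features matter.
# scikit-learn's `alpha` is this article's λ; the value 1.0 is an arbitrary choice.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge coefficients exactly zero:", np.sum(ridge.coef_ == 0))  # typically 0 (dense)
print("Lasso coefficients exactly zero:", np.sum(lasso.coef_ == 0))  # typically most of the 50 (sparse)
```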
🟡 Checkpoint

You're building a model with 500 features but suspect only ~20 actually matter. Which regularizer should you use?

Elastic Net: The Best of Both Worlds

Elastic Net

💎 L1 Penalty (Lasso) + 🛡️ L2 Penalty (Ridge) + 🎛️ Mixing Ratio α
💎 L1 Penalty (Lasso)

The L1 component provides sparsity — it can drive weights to exactly zero for automatic feature selection, keeping only the most relevant features in the model.

λ₁Σ|wⱼ|
🛡️ L2 Penalty (Ridge)

The L2 component provides stability with correlated features — it groups related features together instead of arbitrarily picking one, giving more reliable and reproducible results.

λ₂Σwⱼ²
🎛️ Mixing Ratio α

Controls the balance: α = 1 is pure Lasso, α = 0 is pure Ridge. In scikit-learn, the overall penalty strength λ is called alpha and the L1-vs-L2 mix α is called l1_ratio; both are tuned with cross-validation.

α = λ₁/(λ₁ + λ₂), α ∈ [0, 1]
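
A quick cross-validated sketch with scikit-learn's ElasticNetCV (synthetic data; the l1_ratio grid is an arbitrary choice). Recall the naming swap: scikit-learn's alpha is this article's λ, and l1_ratio is this article's α.

```python
# Sketch: cross-validated Elastic Net picking both the penalty strength and the mix.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print("best l1_ratio (this article's α):", model.l1_ratio_)
print("best alpha    (this article's λ):", model.alpha_)
```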
🤔 Start: do you need regularization?
  • None: N ≫ p and no sign of overfitting
  • Ridge: all features matter (dense signal)
  • 💎 Lasso: many irrelevant features (sparse signal)
  • 🔗 Elastic Net: a sparse signal with correlated features
Decision guide for choosing regularization
🔴 Challenge

You have 100 features that are highly correlated (e.g., gene expression data). Many are irrelevant. Which regularizer works best?

🎓 What You Now Know

Regularization penalizes large weights — It trades a small increase in training error for a big decrease in test error.

Ridge (L2) shrinks all weights — Circle constraint. No feature selection. Great for correlated features.

Lasso (L1) kills features — Diamond constraint. Drives weights to exactly zero. Automatic feature selection.

Elastic Net combines both — L1 for sparsity + L2 for stability with correlated features.

λ controls the trade-off — Use cross-validation to find the sweet spot between underfitting and overfitting.
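
For instance, scikit-learn's RidgeCV and LassoCV run that search for you; here is a minimal sketch on synthetic data with an arbitrary λ grid.

```python
# Sketch: letting cross-validation pick λ (scikit-learn's `alpha`).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

X, y = make_regression(n_samples=200, n_features=30, n_informative=8,
                       noise=5.0, random_state=0)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)   # builds its own λ grid automatically
print("Ridge λ chosen by CV:", ridge.alpha_)
print("Lasso λ chosen by CV:", lasso.alpha_)
```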

Regularization is not optional in modern ML. Every production model uses some form of it — L2 weight decay in neural networks is just Ridge regression in disguise. Master this concept and you understand half of what prevents ML models from being useless. 🚀

📄 Regularization Paths for Generalized Linear Models via Coordinate Descent (Friedman, Hastie, Tibshirani, 2010)
