Ridge & Lasso — Taming Overfitting with Regularization
A scroll-driven visual deep dive into Ridge and Lasso regression. Learn why models overfit, how penalizing large weights fixes it, and why Lasso kills features.
Regularization Fundamentals
Your model is too eager to please. It memorizes training data instead of learning patterns. The fix? Put a leash on the weights. That’s regularization.
The Overfitting Problem
When models overfit, their weight values become extremely large. Why?
- Large weights create sharp, wiggly curves that pass through every training point
- Small weights create smooth, gentle curves that capture the general trend
- Regularization adds a penalty for large weights to the loss function
The regularization idea
The standard loss function measures how far predictions are from actual values. Without regularization, the model is free to use any weight values — often resulting in huge, overfitting weights.
Σ(yᵢ − ŷᵢ)²
A cost for having large weights, controlled by λ. When λ = 0, there's no penalty (standard regression). As λ → ∞, all weights shrink to zero (flat line). The sweet spot balances accuracy and simplicity.
λ · Penalty(w)
Why do overfit models tend to have very large weight values?
💡 Think about what a curve looks like when it passes through every single point...
To pass exactly through scattered training points, the model needs dramatic swings between consecutive points. These swings require huge positive and negative coefficients. For example, a degree-10 polynomial might need weights like +50,000 and -48,000 to create the wiggles that hit every point. Regularization caps this by making large weights expensive.
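To see this concretely, here is a minimal sketch (NumPy + scikit-learn; the data, degree, and λ value are illustrative): fit a degree-10 polynomial to a handful of noisy points with and without a penalty and compare the coefficient magnitudes.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # a few noisy samples

# Degree-10 polynomial features for a 12-point dataset: plenty of room to overfit
X_poly = PolynomialFeatures(degree=10).fit_transform(x.reshape(-1, 1))

plain = LinearRegression().fit(X_poly, y)
ridge = Ridge(alpha=1.0).fit(X_poly, y)

print("largest |weight| without penalty:", np.abs(plain.coef_).max())  # typically enormous
print("largest |weight| with Ridge:     ", np.abs(ridge.coef_).max())  # far smaller
```

The exact numbers depend on the noise, but the unpenalized fit routinely needs coefficients orders of magnitude larger than the penalized one.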
Ridge Regression (L2 Penalty)
Ridge regression math
Ridge adds the sum of squared weights to the loss. A weight of 10 is penalized 100× more than a weight of 1, so large weights get aggressively shrunk while small weights are barely affected.
Loss = Σ(yᵢ − ŷᵢ)² + λΣwⱼ²
Unlike Lasso, Ridge has an exact analytical solution. The λI term makes the matrix invertible even when features are correlated (multicollinear), which is one of Ridge's key practical advantages.
w* = (XᵀX + λI)⁻¹Xᵀy
Small λ: weights stay large and the model fits the training data closely — more flexibility but higher risk of overfitting.
Large λ: weights shrink toward zero, producing a smoother and simpler model — better generalization but potentially underfitting if too extreme.
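As a sketch of that closed-form solution (assuming X is the design matrix and y is centered so the intercept can be dropped; the helper name is mine):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """w* = (XᵀX + λI)⁻¹ Xᵀy, computed with a linear solve instead of an explicit inverse."""
    A = X.T @ X + lam * np.eye(X.shape[1])   # the λI term keeps A invertible even with collinear columns
    return np.linalg.solve(A, X.T @ y)

# Should match scikit-learn's Ridge(alpha=lam, fit_intercept=False) on the same X, y.
```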
In Ridge regression, what happens as λ increases?
💡 What's the penalty term? λΣwⱼ². If λ gets bigger, what must happen to wⱼ to keep the total loss manageable?
As λ increases, the penalty for large weights increases. The model responds by shrinking ALL weights toward zero (but never exactly to zero). This makes the predictions smoother and less sensitive to individual data points. Training error goes UP (worse fit) but test error usually goes DOWN (better generalization) — up to a point.
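A quick sketch of that behavior (scikit-learn; the dataset and λ grid are illustrative): as λ grows, the largest Ridge weight keeps shrinking, but no weight ever becomes exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

for lam in [0.01, 1, 100, 10_000]:
    w = Ridge(alpha=lam).fit(X, y).coef_
    print(f"λ = {lam:>7}: max |w| = {np.abs(w).max():8.2f}, exact zeros = {np.sum(w == 0)}")
```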
Lasso Regression (L1 Penalty)
Lasso regression math
Lasso adds the sum of absolute weight values to the loss. The linear penalty treats all weight magnitudes proportionally — unlike Ridge's quadratic penalty, it doesn't disproportionately punish large weights.
Loss = Σ(yᵢ − ŷᵢ)² + λΣ|wⱼ|
The absolute value function isn't differentiable at zero, so there's no neat analytical solution. Lasso requires iterative optimization like coordinate descent or subgradient methods.
Lasso's killer feature: it drives many weights to exactly zero, effectively removing features from the model. This gives you automatic feature selection — the model only uses the features that truly matter.
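Mechanically, the exact zeros come from the soft-thresholding step inside coordinate descent. A minimal sketch, assuming the objective ½·Σ(yᵢ − ŷᵢ)² + λΣ|wⱼ| (real solvers add convergence checks, intercept handling, and feature scaling):

```python
import numpy as np

def soft_threshold(rho, lam):
    """S(ρ, λ) = sign(ρ) · max(|ρ| − λ, 0): anything smaller in magnitude than λ is clipped to exactly 0."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_coordinate_pass(X, y, w, lam):
    """One sweep of coordinate descent; repeat sweeps until the weights stop changing."""
    for j in range(X.shape[1]):
        residual_without_j = y - X @ w + X[:, j] * w[j]      # prediction error ignoring feature j
        rho = X[:, j] @ residual_without_j                   # how much feature j could still explain
        w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return w
```

Whenever a feature's remaining correlation with the residual falls below λ, its weight is set to exactly zero, which is the mechanism behind Lasso's automatic feature selection.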
Why does Lasso (L1) drive weights to exactly zero while Ridge (L2) doesn't?
💡 Look at the shapes: circle vs diamond. Where on each shape are the loss contours most likely to intersect?
This is a geometric insight. Ridge's L2 constraint is a circle (no corners). The elliptical loss contours almost always touch the circle at a point where BOTH weights are nonzero. Lasso's L1 constraint is a diamond with sharp corners sitting on the axes. The loss contours are much more likely to first touch a corner — where one or more weights are exactly zero. In high dimensions, this effect is even more pronounced.
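To see the geometric argument play out numerically, fit Ridge and Lasso with the same penalty strength on the same data and count the exact zeros (a sketch; the dataset and λ are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

ridge_w = Ridge(alpha=1.0).fit(X, y).coef_
lasso_w = Lasso(alpha=1.0).fit(X, y).coef_

print("exact zeros, Ridge:", np.sum(ridge_w == 0))   # essentially always 0
print("exact zeros, Lasso:", np.sum(lasso_w == 0))   # typically most of the 25 noise features
```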
Head-to-Head Comparison
You're building a model with 500 features but suspect only ~20 actually matter. Which regularizer should you use?
💡 Which regularizer makes weights exactly zero?
Lasso (L1) is designed for exactly this scenario. With high λ, it will drive the weights of irrelevant features to exactly zero, effectively performing automatic feature selection. You'll end up with a sparse model using only the ~20 features that actually predict the target. Ridge would shrink all 500 weights but keep them all nonzero — not ideal when most features are noise.
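A sketch of exactly that scenario (scikit-learn; the sample size and noise level are illustrative): generate 500 features of which only 20 are informative, let LassoCV pick λ by cross-validation, and count the surviving features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=300, n_features=500, n_informative=20,
                       noise=10.0, random_state=0)

model = LassoCV(cv=5, random_state=0).fit(X, y)
kept = np.flatnonzero(model.coef_)
print(f"chosen λ: {model.alpha_:.3f}, features kept: {kept.size} of 500")
```

The kept count is usually close to the 20 informative features (often with a few noisy extras); the remaining several hundred weights are exactly zero.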
Elastic Net: The Best of Both Worlds
Elastic Net
The L1 component provides sparsity — it can drive weights to exactly zero for automatic feature selection, keeping only the most relevant features in the model.
λ₁Σ|wⱼ|
The L2 component provides stability with correlated features — it groups related features together instead of arbitrarily picking one, giving more reliable and reproducible results.
λ₂Σwⱼ²
α controls the balance: α = 1 is pure Lasso, α = 0 is pure Ridge. In scikit-learn these correspond to l1_ratio (the L1 vs L2 mix) and alpha (the overall penalty strength), both tuned with cross-validation.
α = λ₁/(λ₁ + λ₂), α ∈ [0, 1]
You have 100 features that are highly correlated (e.g., gene expression data). Many are irrelevant. Which regularizer works best?
💡 What goes wrong with pure Lasso when features are correlated?
Elastic Net inherits Lasso's ability to zero out irrelevant features AND Ridge's ability to handle correlated features gracefully. With pure Lasso, if features 5, 12, and 37 are all correlated, it might arbitrarily keep feature 5 and zero out 12 and 37 — unstable results. Elastic Net's L2 component encourages correlated features to have similar weights, giving more stable feature selection.
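A sketch of that instability (scikit-learn; the data generation is illustrative): build three nearly identical copies of one informative feature plus a few pure-noise columns, then compare how Lasso and Elastic Net distribute weight across the copies.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)

# three almost-duplicate columns carrying the same signal, plus five noise columns
X = np.column_stack([signal + 0.01 * rng.normal(size=n) for _ in range(3)] +
                    [rng.normal(size=n) for _ in range(5)])
y = 3.0 * signal + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # l1_ratio is scikit-learn's α

print("Lasso weights on the 3 copies:      ", np.round(lasso.coef_[:3], 2))  # often one large, two near 0
print("Elastic Net weights on the 3 copies:", np.round(enet.coef_[:3], 2))   # spread more evenly
```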
🎓 What You Now Know
✓ Regularization penalizes large weights — It trades a small increase in training error for a big decrease in test error.
✓ Ridge (L2) shrinks all weights — Circle constraint. No feature selection. Great for correlated features.
✓ Lasso (L1) kills features — Diamond constraint. Drives weights to exactly zero. Automatic feature selection.
✓ Elastic Net combines both — L1 for sparsity + L2 for stability with correlated features.
✓ λ controls the trade-off — Use cross-validation to find the sweet spot between underfitting and overfitting.
Regularization is not optional in modern ML. Every production model uses some form of it — L2 weight decay in neural networks is just Ridge regression in disguise. Master this concept and you understand half of what prevents ML models from being useless. 🚀
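To make the weight-decay remark concrete: the gradient of λΣwⱼ² is 2λw, so one gradient step with an L2 penalty literally scales the weights down each iteration (a sketch; real frameworks fold the constant 2 and the learning rate into their own decay parameter).

```python
import numpy as np

def sgd_step_with_l2(w, grad_data, lr, lam):
    """One step on Loss = data_loss + lam * sum(w**2):
    w_new = w - lr*(grad_data + 2*lam*w) = (1 - 2*lr*lam)*w - lr*grad_data,
    i.e. the weights are scaled ("decayed") by (1 - 2*lr*lam) every step."""
    return (1 - 2 * lr * lam) * w - lr * grad_data
```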
↗ Keep Learning
Linear Regression — The Foundation of Machine Learning
A scroll-driven visual deep dive into linear regression. From data points to loss functions to gradient descent — understand the building block behind all of ML.
Polynomial Regression — When Lines Aren't Enough
A scroll-driven visual deep dive into polynomial regression. See why straight lines fail, how curves capture nonlinear patterns, and when you're overfitting vs underfitting.
Bias-Variance Tradeoff — The Most Important Concept in ML
A scroll-driven visual deep dive into the bias-variance tradeoff. Learn why every model makes errors, how underfitting and overfitting emerge, and how to balance them.