Ridge & Lasso — Taming Overfitting with Regularization
A scroll-driven visual deep dive into Ridge and Lasso regression. Learn why models overfit, how penalizing large weights fixes it, and why Lasso kills features.
Regularization Fundamentals
Your model is too eager to please. It memorizes training data instead of learning patterns. The fix? Put a leash on the weights. That’s regularization.
The Overfitting Problem
When models overfit, their weight values become extremely large. Why?
- Large weights create sharp, wiggly curves that pass through every training point
- Small weights create smooth, gentle curves that capture the general trend
- Regularization adds a penalty for large weights to the loss function
The regularization idea
The standard loss function measures how far predictions are from actual values. Without regularization, the model is free to use any weight values — often resulting in huge, overfitting weights.
Σ(yᵢ − ŷᵢ)²
A cost for having large weights, controlled by λ. When λ = 0, there's no penalty (standard regression). As λ → ∞, all weights shrink to zero (flat line). The sweet spot balances accuracy and simplicity.
λ · Penalty(w)
Why do overfit models tend to have very large weight values?
💡 Think about what a curve looks like when it passes through every single point...
To pass exactly through scattered training points, the model needs dramatic swings between consecutive points. These swings require huge positive and negative coefficients. For example, a degree-10 polynomial might need weights like +50,000 and -48,000 to create the wiggles that hit every point. Regularization caps this by making large weights expensive.
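To see this concretely, here is a minimal sketch (NumPy + scikit-learn; the data, degree, and λ value are illustrative): fit a degree-10 polynomial to a handful of noisy points with and without a penalty and compare the coefficient magnitudes.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # a few noisy samples

# Degree-10 polynomial features for a 12-point dataset: plenty of room to overfit
X_poly = PolynomialFeatures(degree=10).fit_transform(x.reshape(-1, 1))

plain = LinearRegression().fit(X_poly, y)
ridge = Ridge(alpha=1.0).fit(X_poly, y)

print("largest |weight| without penalty:", np.abs(plain.coef_).max())  # typically enormous
print("largest |weight| with Ridge:     ", np.abs(ridge.coef_).max())  # far smaller
```

The exact numbers depend on the noise, but the unpenalized fit routinely needs coefficients orders of magnitude larger than the penalized one.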
Ridge Regression (L2 Penalty)
Ridge regression math
Ridge adds the sum of squared weights to the loss. A weight of 10 is penalized 100× more than a weight of 1, so large weights get aggressively shrunk while small weights are barely affected.
Loss = Σ(yᵢ − ŷᵢ)² + λΣwⱼ²
Unlike Lasso, Ridge has an exact analytical solution. The λI term makes the matrix invertible even when features are correlated (multicollinear), which is one of Ridge's key practical advantages.
w* = (XᵀX + λI)⁻¹Xᵀy
Small λ: weights stay large and the model fits the training data closely — more flexibility but higher risk of overfitting.
Large λ: weights shrink toward zero, producing a smoother and simpler model — better generalization but potentially underfitting if too extreme.
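As a sketch of that closed-form solution (assuming X is the design matrix and y is centered so the intercept can be dropped; the helper name is mine):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """w* = (XᵀX + λI)⁻¹ Xᵀy, computed with a linear solve instead of an explicit inverse."""
    A = X.T @ X + lam * np.eye(X.shape[1])   # the λI term keeps A invertible even with collinear columns
    return np.linalg.solve(A, X.T @ y)

# Should match scikit-learn's Ridge(alpha=lam, fit_intercept=False) on the same X, y.
```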
In Ridge regression, what happens as λ increases?
💡 What's the penalty term? λΣwⱼ². If λ gets bigger, what must happen to wⱼ to keep the total loss manageable?
As λ increases, the penalty for large weights increases. The model responds by shrinking ALL weights toward zero (but never exactly to zero). This makes the predictions smoother and less sensitive to individual data points. Training error goes UP (worse fit) but test error usually goes DOWN (better generalization) — up to a point.
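A quick sketch of that behavior (scikit-learn; the dataset and λ grid are illustrative): as λ grows, the largest Ridge weight keeps shrinking, but no weight ever becomes exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

for lam in [0.01, 1, 100, 10_000]:
    w = Ridge(alpha=lam).fit(X, y).coef_
    print(f"λ = {lam:>7}: max |w| = {np.abs(w).max():8.2f}, exact zeros = {np.sum(w == 0)}")
```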
Lasso Regression (L1 Penalty)
Lasso regression math
Lasso adds the sum of absolute weight values to the loss. The linear penalty treats all weight magnitudes proportionally — unlike Ridge's quadratic penalty, it doesn't disproportionately punish large weights.
Loss = Σ(yᵢ − ŷᵢ)² + λΣ|wⱼ|
The absolute value function isn't differentiable at zero, so there's no neat analytical solution. Lasso requires iterative optimization like coordinate descent or subgradient methods.
Lasso's killer feature: it drives many weights to exactly zero, effectively removing features from the model. This gives you automatic feature selection — the model only uses the features that truly matter.
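Mechanically, the exact zeros come from the soft-thresholding step inside coordinate descent. A minimal sketch, assuming the objective ½·Σ(yᵢ − ŷᵢ)² + λΣ|wⱼ| (real solvers add convergence checks, intercept handling, and feature scaling):

```python
import numpy as np

def soft_threshold(rho, lam):
    """S(ρ, λ) = sign(ρ) · max(|ρ| − λ, 0): anything smaller in magnitude than λ is clipped to exactly 0."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_coordinate_pass(X, y, w, lam):
    """One sweep of coordinate descent; repeat sweeps until the weights stop changing."""
    for j in range(X.shape[1]):
        residual_without_j = y - X @ w + X[:, j] * w[j]      # prediction error ignoring feature j
        rho = X[:, j] @ residual_without_j                   # how much feature j could still explain
        w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return w
```

Whenever a feature's remaining correlation with the residual falls below λ, its weight is set to exactly zero, which is the mechanism behind Lasso's automatic feature selection.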
Why does Lasso (L1) drive weights to exactly zero while Ridge (L2) doesn't?
💡 Look at the shapes: circle vs diamond. Where on each shape are the loss contours most likely to intersect?
This is a geometric insight. Ridge's L2 constraint is a circle (no corners). The elliptical loss contours almost always touch the circle at a point where BOTH weights are nonzero. Lasso's L1 constraint is a diamond with sharp corners sitting on the axes. The loss contours are much more likely to first touch a corner — where one or more weights are exactly zero. In high dimensions, this effect is even more pronounced.
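To see the geometric argument play out numerically, fit Ridge and Lasso with the same penalty strength on the same data and count the exact zeros (a sketch; the dataset and λ are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

ridge_w = Ridge(alpha=1.0).fit(X, y).coef_
lasso_w = Lasso(alpha=1.0).fit(X, y).coef_

print("exact zeros, Ridge:", np.sum(ridge_w == 0))   # essentially always 0
print("exact zeros, Lasso:", np.sum(lasso_w == 0))   # typically most of the 25 noise features
```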
Head-to-Head Comparison
You're building a model with 500 features but suspect only ~20 actually matter. Which regularizer should you use?
💡 Which regularizer makes weights exactly zero?
Lasso (L1) is designed for exactly this scenario. With high λ, it will drive the weights of irrelevant features to exactly zero, effectively performing automatic feature selection. You'll end up with a sparse model using only the ~20 features that actually predict the target. Ridge would shrink all 500 weights but keep them all nonzero — not ideal when most features are noise.
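A sketch of exactly that scenario (scikit-learn; the sample size and noise level are illustrative): generate 500 features of which only 20 are informative, let LassoCV pick λ by cross-validation, and count the surviving features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=300, n_features=500, n_informative=20,
                       noise=10.0, random_state=0)

model = LassoCV(cv=5, random_state=0).fit(X, y)
kept = np.flatnonzero(model.coef_)
print(f"chosen λ: {model.alpha_:.3f}, features kept: {kept.size} of 500")
```

The kept count is usually close to the 20 informative features (often with a few noisy extras); the remaining several hundred weights are exactly zero.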
Elastic Net: The Best of Both Worlds
Elastic Net
The L1 component provides sparsity — it can drive weights to exactly zero for automatic feature selection, keeping only the most relevant features in the model.
λ₁Σ|wⱼ|
The L2 component provides stability with correlated features — it groups related features together instead of arbitrarily picking one, giving more reliable and reproducible results.
λ₂Σwⱼ²
α controls the balance: α = 1 is pure Lasso, α = 0 is pure Ridge. In scikit-learn these correspond to l1_ratio (the L1 vs L2 mix) and alpha (the overall penalty strength), both tuned with cross-validation.
α = λ₁/(λ₁ + λ₂), α ∈ [0, 1]
You have 100 features that are highly correlated (e.g., gene expression data). Many are irrelevant. Which regularizer works best?
💡 What goes wrong with pure Lasso when features are correlated?
Elastic Net inherits Lasso's ability to zero out irrelevant features AND Ridge's ability to handle correlated features gracefully. With pure Lasso, if features 5, 12, and 37 are all correlated, it might arbitrarily keep feature 5 and zero out 12 and 37 — unstable results. Elastic Net's L2 component encourages correlated features to have similar weights, giving more stable feature selection.
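A sketch of that instability (scikit-learn; the data generation is illustrative): build three nearly identical copies of one informative feature plus a few pure-noise columns, then compare how Lasso and Elastic Net distribute weight across the copies.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)

# three almost-duplicate columns carrying the same signal, plus five noise columns
X = np.column_stack([signal + 0.01 * rng.normal(size=n) for _ in range(3)] +
                    [rng.normal(size=n) for _ in range(5)])
y = 3.0 * signal + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # l1_ratio is scikit-learn's α

print("Lasso weights on the 3 copies:      ", np.round(lasso.coef_[:3], 2))  # often one large, two near 0
print("Elastic Net weights on the 3 copies:", np.round(enet.coef_[:3], 2))   # spread more evenly
```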
🎓 What You Now Know
✓ Regularization penalizes large weights — It trades a small increase in training error for a big decrease in test error.
✓ Ridge (L2) shrinks all weights — Circle constraint. No feature selection. Great for correlated features.
✓ Lasso (L1) kills features — Diamond constraint. Drives weights to exactly zero. Automatic feature selection.
✓ Elastic Net combines both — L1 for sparsity + L2 for stability with correlated features.
✓ λ controls the trade-off — Use cross-validation to find the sweet spot between underfitting and overfitting.
Regularization is not optional in modern ML. Every production model uses some form of it — L2 weight decay in neural networks is just Ridge regression in disguise. Master this concept and you understand half of what prevents ML models from being useless. 🚀
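To make the weight-decay remark concrete: the gradient of λΣwⱼ² is 2λw, so one gradient step with an L2 penalty literally scales the weights down each iteration (a sketch; real frameworks fold the constant 2 and the learning rate into their own decay parameter).

```python
import numpy as np

def sgd_step_with_l2(w, grad_data, lr, lam):
    """One step on Loss = data_loss + lam * sum(w**2):
    w_new = w - lr*(grad_data + 2*lam*w) = (1 - 2*lr*lam)*w - lr*grad_data,
    i.e. the weights are scaled ("decayed") by (1 - 2*lr*lam) every step."""
    return (1 - 2 * lr * lam) * w - lr * grad_data
```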
↗ Keep Learning
Linear Regression — The Foundation of Machine Learning
A scroll-driven visual deep dive into linear regression. From data points to loss functions to gradient descent — understand the building block behind all of ML.
Polynomial Regression — When Lines Aren't Enough
A scroll-driven visual deep dive into polynomial regression. See why straight lines fail, how curves capture nonlinear patterns, and when you're overfitting vs underfitting.
Bias-Variance Tradeoff — The Most Important Concept in ML
A scroll-driven visual deep dive into the bias-variance tradeoff. Learn why every model makes errors, how underfitting and overfitting emerge, and how to balance them.