Linear Regression — The Foundation of Machine Learning
A scroll-driven visual deep dive into linear regression. From data points to loss functions to gradient descent — understand the building block behind all of ML.
📈
Every ML model started
with a line.
Linear regression is the “Hello World” of machine learning.
It’s simple, but it teaches you every concept you’ll use in deep learning:
loss functions, gradient descent, overfitting, regularization — all of it starts here.
↓ Scroll to learn — even if you know this, the quizzes might surprise you
The Problem: Predict a Number from Data
Imagine you have data about houses — square footage and sale price. You want to predict the price of a new house given its size. That’s regression.
The Model: y = wx + b
Linear regression asks: what’s the best straight line through these points?
The simplest model in all of ML
The weight w is the slope — it tells you how much the prediction changes per unit increase in x. For house prices, if w = 200, each extra square foot adds $200 to the predicted price.
The intercept b shifts the entire line up or down. It's the base prediction when x = 0 — in practice, just a vertical offset that anchors the line to the data.
In the equation ŷ = wx + b, what does 'w' represent?
💡 Think about what the slope of a line means geometrically...
w is the weight (slope) of the line. It tells you: for every 1 unit increase in x, ŷ changes by w units. In our house example, if w = 200, that means every extra square foot adds $200 to the predicted price.
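If you prefer code to equations, here is a minimal Python sketch of the prediction step. The values of w and b (and the 1,500 sq ft house) are made-up illustrations, not fitted parameters:

```python
# A tiny sketch of ŷ = w·x + b for the house-price example.
# w, b, and the house size are made-up numbers, not fitted values.

def predict(x, w, b):
    """Predicted price for a house of x square feet."""
    return w * x + b

w = 200.0     # slope: extra dollars per extra square foot
b = 50_000.0  # intercept: the prediction at x = 0
print(predict(1500, w, b))  # 200 * 1500 + 50000 = 350000.0
```

Fitting the model is just the search for the w and b that make these predictions match the data.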
Try It Yourself: Fit the Line
Now it’s your turn! Use the sliders below to adjust the slope (w) and intercept (b). Watch the green regression line and red residual lines update in real time. Try to minimize the MSE (Mean Squared Error). Can you get it below 100?
How Do You Know If Your Line Is Good?
You need a way to measure how wrong your predictions are. That’s what a loss function does.
Mean Squared Error (MSE) — the standard loss
error_i = ŷ_i - y_i
error_i² = (ŷ_i - y_i)²
MSE = (1/n) Σ (ŷ_i - y_i)²
Goal: find w, b that minimize MSE

Your model predicts [3, 5, 7] but the actual values are [2, 5, 10]. What is the MSE?
💡 Calculate each squared error, sum them, divide by n...
Errors: (3-2)²=1, (5-5)²=0, (7-10)²=9. Sum = 10. MSE = 10/3 = 3.33. Notice that the third point (error of 3) dominates the loss because squaring amplifies large errors.
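You can check the quiz arithmetic with a few lines of NumPy (assuming NumPy is installed; the arrays are just the quiz numbers):

```python
import numpy as np

# The quiz numbers: predictions [3, 5, 7] vs. actual values [2, 5, 10].
y_pred = np.array([3.0, 5.0, 7.0])
y_true = np.array([2.0, 5.0, 10.0])

errors = y_pred - y_true      # [ 1.,  0., -3.]
mse = np.mean(errors ** 2)    # (1 + 0 + 9) / 3
print(mse)                    # 3.333...
```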
Finding the Best Line: Gradient Descent
Now the big question: how do you find the w and b that minimize the loss? Gradient descent — the same algorithm used to train GPT-4.
The gradient descent update rule
∂MSE/∂w = (2/n) Σ (ŷ_i - y_i) · x_i
∂MSE/∂b = (2/n) Σ (ŷ_i - y_i)
w ← w - α · ∂MSE/∂w
b ← b - α · ∂MSE/∂b

Explore: How Learning Rate Affects Convergence
Adjust the learning rate slider and click Run Step to watch gradient descent in action. Try a very small learning rate vs. a larger one. What happens when it’s too large?
During gradient descent, you compute ∂MSE/∂w = 5.0 and your learning rate α = 0.1. How do you update w?
💡 The update formula is w ← w - α · gradient. The minus sign means 'go opposite to the gradient'...
The update rule is: w = w - α × gradient = w - 0.1 × 5.0 = w - 0.5. You move in the OPPOSITE direction of the gradient (because you want to decrease the loss, not increase it). The learning rate (0.1) controls how big each step is.
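Here is a rough sketch of the full loop in NumPy, applying the update rules above to toy data generated around y = 3x + 4 (the data, learning rate, and step count are arbitrary choices for illustration):

```python
import numpy as np

# Toy data drawn around y = 3x + 4; the noise, learning rate, and
# step count are arbitrary illustration choices.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 4.0 + rng.normal(0, 1.0, size=50)

w, b = 0.0, 0.0
alpha = 0.01  # learning rate

for _ in range(2000):
    y_hat = w * x + b
    error = y_hat - y
    grad_w = (2 / len(x)) * np.sum(error * x)  # ∂MSE/∂w
    grad_b = (2 / len(x)) * np.sum(error)      # ∂MSE/∂b
    w -= alpha * grad_w                        # step against the gradient
    b -= alpha * grad_b

print(w, b)  # should end up close to 3 and 4
```

With these numbers the loop converges; try alpha = 0.001 (painfully slow) or alpha = 0.05 (the updates blow up) to reproduce the behaviour from the interactive demo.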
The Shortcut: Normal Equation
Here’s a cool fact: for linear regression (unlike most of the models you’ll meet later), you can skip gradient descent entirely and solve for the optimal w and b in one shot.
The Normal Equation — instant solution
Instead of iterating with gradient descent, directly compute the optimal weights in one matrix operation. No learning rate to tune, no convergence to worry about.
w* = (XᵀX)⁻¹Xᵀy

Gives the exact analytical solution in a single computation step. Zero hyperparameters, no risk of overshooting or getting stuck in local minima.
Requires inverting the matrix XᵀX, which costs O(d³). When d (number of features) exceeds ~10,000, this becomes impractically slow — use gradient descent instead.
Cost: O(d³)

Why don't we use the Normal Equation for training neural networks?
💡 What shape is the loss surface of a linear regression? What about a neural network?
The Normal Equation only works because linear regression has a convex, quadratic loss surface with exactly one minimum. Neural networks have highly non-linear, non-convex loss surfaces with millions of parameters — there's no formula to solve for the optimal weights directly. Gradient descent is the only viable approach.
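For completeness, here's a sketch of the Normal Equation on the same kind of toy data as the gradient descent example (the data and variable names are mine; note the column of ones that folds the intercept b into the weight vector):

```python
import numpy as np

# Same toy data as the gradient descent sketch above.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 4.0 + rng.normal(0, 1.0, size=50)

# Append a column of ones so the intercept b is learned as part of w*.
X = np.column_stack([x, np.ones_like(x)])     # shape (50, 2)
w_star = np.linalg.solve(X.T @ X, X.T @ y)    # solves (XᵀX) w* = Xᵀy
print(w_star)                                 # ≈ [3., 4.]  (slope, intercept)
```

np.linalg.solve is used here instead of explicitly computing (XᵀX)⁻¹; it evaluates the same closed-form answer but is numerically safer.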
How Good Is Your Model? R² Score
MSE tells you the loss, but it’s hard to interpret in isolation. The R² score gives you a percentage: how much of the variance in y does your model explain?
R² — the goodness-of-fit metric
SS_res = Σ (y_i - ŷ_i)²
SS_tot = Σ (y_i - ȳ)²
R² = 1 - SS_res / SS_tot

Your linear regression model has R² = 0.85 on training data but R² = 0.40 on test data. What's happening?
💡 When a student aces practice tests but fails the real exam, what went wrong?
A big gap between training R² (0.85) and test R² (0.40) is the classic sign of overfitting. The model learned patterns specific to the training data (including noise) that don't generalize. Solutions: fewer features, regularization (L1/L2), or more training data.
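And a quick sketch of R² straight from the definition, reusing the numbers from the MSE quiz (so this R² is for a three-point toy example, nothing more):

```python
import numpy as np

# R² from its definition, reusing the MSE quiz numbers.
y_true = np.array([2.0, 5.0, 10.0])
y_pred = np.array([3.0, 5.0, 7.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # 10.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # ≈ 32.67
r2 = 1 - ss_res / ss_tot
print(r2)  # ≈ 0.69: the model explains about 69% of the variance
```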
🎓 What You Now Know
✓ Linear regression fits a line: ŷ = wx + b — Find the weight and bias that minimize prediction errors.
✓ MSE measures how wrong you are — Average squared errors. Penalizes big mistakes more.
✓ Gradient descent finds the minimum — Follow the slope downhill. Learning rate controls step size.
✓ Normal equation gives an exact solution — One formula for small problems. Gradient descent for big ones.
✓ R² tells you how good the fit is — 1.0 = perfect, 0.0 = useless. Watch for overfitting with train/test gaps.
These concepts — loss functions, gradient descent, overfitting — are the exact same ideas used in deep learning. You just learned the foundation of all of ML. 🚀
↗ Keep Learning
Polynomial Regression — When Lines Aren't Enough
A scroll-driven visual deep dive into polynomial regression. See why straight lines fail, how curves capture nonlinear patterns, and when you're overfitting vs underfitting.
Ridge & Lasso — Taming Overfitting with Regularization
A scroll-driven visual deep dive into Ridge and Lasso regression. Learn why models overfit, how penalizing large weights fixes it, and why Lasso kills features.
MSE, MAE, R-Squared and Beyond — Regression Metrics That Actually Matter
A scroll-driven deep dive into regression metrics. Understand MSE, RMSE, MAE, MAPE, R-squared, and Adjusted R-squared — when to use each, their gotchas, and how to report results properly.