12 min deep dive · machine-learning · math

Linear Regression — The Foundation of Machine Learning

A scroll-driven visual deep dive into linear regression. From data points to loss functions to gradient descent — understand the building block behind all of ML.

Introduction


Every ML model started with a line.

Linear regression is the “Hello World” of machine learning.
It’s simple, but it teaches you every concept you’ll use in deep learning:
loss functions, gradient descent, overfitting, regularization — all of it starts here.

↓ Scroll to learn — even if you know this, the quizzes might surprise you

The Setup

The Problem: Predict a Number from Data

Imagine you have data about houses — square footage and sale price. You want to predict the price of a new house given its size. That’s regression.

[Scatter plot: Square Feet (x-axis) vs. Price in $K (y-axis). What line best fits these points?]
Each dot is a house. Can you see the pattern?
The Line

The Model: y = wx + b

Linear regression asks: what’s the best straight line through these points?

The simplest model in all of ML

🎯 Prediction ŷ = 📐 Weight × Input + 📍 Bias
📐 Weight × Input

The weight w is the slope — it tells you how much the prediction changes per unit increase in x. For house prices measured in dollars, if w = 200, each extra square foot adds $200 to the predicted price.

w · x
📍 Bias

The intercept b shifts the entire line up or down. It's the base prediction when x = 0 — in practice, just a vertical offset that anchors the line to the data.

b
ŷ = wx + b. Red lines = errors (residuals) — we want to minimize these.
The best-fit line minimizes the total squared vertical distance to all points.
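Here's a minimal sketch of the model in code, with invented house data and a hand-picked w and b (the numbers are illustrative, not taken from the chart above):

```python
import numpy as np

# Invented toy data: house size (sq ft) and sale price ($K) -- illustrative only
x = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
y = np.array([190, 245, 300, 340, 405, 460], dtype=float)

# A candidate line, picked by eye: slope w and intercept b
w, b = 0.16, 60.0

y_hat = w * x + b        # the model's predictions, one per house
residuals = y_hat - y    # the "red lines": predicted minus actual

print(y_hat)
print(residuals)
```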
🟢 Quick Check

In the equation ŷ = wx + b, what does 'w' represent?

Try It Yourself: Fit the Line

Now it’s your turn! Use the sliders below to adjust the slope (w) and intercept (b). Watch the green regression line and red residual lines update in real time. Try to minimize the MSE (Mean Squared Error). Can you get it below 100?

✦ Interactive
Drag the sliders to fit the line yourself — aim for the lowest MSE!
Loss Function

How Do You Know If Your Line Is Good?

You need a way to measure how wrong your predictions are. That’s what a loss function does.

Mean Squared Error (MSE) — the standard loss

1. error_i = ŷ_i - y_i
The error for one data point: predicted minus actual. Can be positive or negative.
2. error_i² = (ŷ_i - y_i)²
Square it! This makes all errors positive and penalizes big errors more than small ones.
3. MSE = (1/n) Σ(ŷ_i - y_i)²
Average the squared errors over all n data points. This is your loss.
4. Goal: find w, b that minimize MSE
The 'best' line is the one with the smallest average squared error.
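In code, MSE is a one-liner over the residuals. The data and the candidate line below are the same illustrative numbers as in the earlier sketch:

```python
import numpy as np

def mse(y_hat, y):
    """Mean Squared Error: the average of the squared residuals."""
    return float(np.mean((y_hat - y) ** 2))

# Same illustrative toy data and hand-picked line as before
x = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
y = np.array([190, 245, 300, 340, 405, 460], dtype=float)
w, b = 0.16, 60.0

print(mse(w * x + b, y))   # the loss for this particular (w, b)
```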
🟡 Checkpoint

Your model predicts [3, 5, 7] but the actual values are [2, 5, 10]. What is the MSE?

Gradient Descent

Finding the Best Line: Gradient Descent

Now the big question: how do you find the w and b that minimize the loss? Gradient descent — the same algorithm used to train GPT-4.

[Plot: w (weight) on the x-axis, Loss on the y-axis, showing Start → Step 1 → Step 2 → Step 3 → Minimum!]
Gradient descent: follow the slope downhill to the minimum

The gradient descent update rule

1. ∂MSE/∂w = (2/n) Σ (ŷ_i - y_i) · x_i
The gradient tells you: which direction should w move to reduce the loss?
2. ∂MSE/∂b = (2/n) Σ (ŷ_i - y_i)
Same idea for the bias — which direction reduces the loss?
3. w ← w - α · ∂MSE/∂w
Update w: step in the opposite direction of the gradient. α (alpha) is the learning rate — how big each step is.
4. b ← b - α · ∂MSE/∂b
Same update for b. Repeat until loss converges!
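Here's a minimal gradient-descent loop for w and b, again on invented toy data. Standardizing x first is an illustrative choice so that a single learning rate works for both parameters:

```python
import numpy as np

# Illustrative toy data (house size in sq ft, price in $K)
x = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
y = np.array([190, 245, 300, 340, 405, 460], dtype=float)

# Standardize x so one learning rate suits both w and b (illustrative choice)
x = (x - x.mean()) / x.std()

w, b = 0.0, 0.0   # start somewhere
alpha = 0.1       # learning rate

for step in range(200):
    y_hat = w * x + b
    error = y_hat - y
    grad_w = 2 * np.mean(error * x)   # dMSE/dw
    grad_b = 2 * np.mean(error)       # dMSE/db
    w -= alpha * grad_w               # step against the gradient
    b -= alpha * grad_b

print(w, b, np.mean((w * x + b - y) ** 2))   # fitted parameters and final MSE
```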

Explore: How Learning Rate Affects Convergence

Adjust the learning rate slider and click Run Step to watch gradient descent in action. Try a very small learning rate vs. a larger one. What happens when it’s too large?

✦ Interactive
Adjust α and step through gradient descent — can you reach the minimum?
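If you want to reproduce the experiment outside the widget, a rough sketch is to rerun the same loop with a few values of α and compare the final loss (the toy data and the specific α values are illustrative):

```python
import numpy as np

x = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
y = np.array([190, 245, 300, 340, 405, 460], dtype=float)
x = (x - x.mean()) / x.std()   # standardized, as in the previous sketch

def final_mse(alpha, steps=50):
    """Run gradient descent for a fixed number of steps and return the MSE."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        error = w * x + b - y
        w -= alpha * 2 * np.mean(error * x)
        b -= alpha * 2 * np.mean(error)
    return float(np.mean((w * x + b - y) ** 2))

for alpha in (0.001, 0.1, 1.5):
    # too small: barely moves; moderate: converges; too large: the loss blows up
    print(alpha, final_mse(alpha))
```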
🟡 Checkpoint

During gradient descent, you compute ∂MSE/∂w = 5.0 and your learning rate α = 0.1. How do you update w?

Closed Form

The Shortcut: Normal Equation

Here’s a cool fact: for linear regression (unlike most ML models), you can skip gradient descent entirely and solve for the optimal w and b in one shot.

The Normal Equation — instant solution

Closed-Form Solution

Instead of iterating with gradient descent, directly compute the optimal weights in one matrix operation. No learning rate to tune, no convergence to worry about.

w* = (XᵀX)⁻¹Xᵀy
Strengths

Gives the exact analytical solution in a single computation step. Zero hyperparameters, no risk of overshooting or getting stuck in local minima.

⚠️ Limitation

Requires inverting the matrix XᵀX, which costs O(d³). When d (number of features) exceeds ~10,000, this becomes impractically slow — use gradient descent instead.

Cost: O(d³)
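As a sketch of how this looks in NumPy: the column of ones absorbs the bias b, the data is the same illustrative toy set as before, and np.linalg.solve is used instead of an explicit inverse for numerical stability.

```python
import numpy as np

x = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
y = np.array([190, 245, 300, 340, 405, 460], dtype=float)

# Design matrix: a column of ones (absorbs the bias b) next to the feature
X = np.column_stack([np.ones_like(x), x])

# w* = (XᵀX)⁻¹Xᵀy, computed with solve() rather than an explicit inverse
b, w = np.linalg.solve(X.T @ X, X.T @ y)
print(w, b)
```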
🟡 Checkpoint

Why don't we use the Normal Equation for training neural networks?

Evaluation

How Good Is Your Model? R² Score

MSE tells you the loss, but it’s hard to interpret in isolation. The R² score gives you a percentage: how much of the variance in y does your model explain?

R² — the goodness-of-fit metric

1. SS_res = Σ(y_i - ŷ_i)²
Residual sum of squares — how much error your model makes
2. SS_tot = Σ(y_i - ȳ)²
Total sum of squares — how much y varies around its mean (baseline)
3. R² = 1 - SS_res / SS_tot
The fraction of variance your model explains. R²=1 → perfect fit. R²=0 → no better than predicting the mean.
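A minimal sketch of R² in code, reusing the illustrative toy data and the hand-picked line from the earlier sketches:

```python
import numpy as np

def r2_score(y, y_hat):
    """R² = 1 - SS_res / SS_tot: the fraction of variance explained."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative toy data and the hand-picked line from earlier
x = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
y = np.array([190, 245, 300, 340, 405, 460], dtype=float)
y_hat = 0.16 * x + 60.0

print(r2_score(y, y_hat))   # close to 1.0: the line explains most of the variance
```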
Underfitting (too simple) → Good Fit (just right ✓) → Overfitting (too complex)
🔴 Challenge

Your linear regression model has R² = 0.85 on training data but R² = 0.40 on test data. What's happening?

🎓 What You Now Know

Linear regression fits a line: ŷ = wx + b — Find the weight and bias that minimize prediction errors.

MSE measures how wrong you are — Average squared errors. Penalizes big mistakes more.

Gradient descent finds the minimum — Follow the slope downhill. Learning rate controls step size.

Normal equation gives an exact solution — One formula for small problems. Gradient descent for big ones.

R² tells you how good the fit is — 1.0 = perfect, 0.0 = useless. Watch for overfitting with train/test gaps.

These concepts — loss functions, gradient descent, overfitting — are the exact same ideas used in deep learning. You just learned the foundation of all of ML. 🚀
