12 min deep dive · machine-learning · math

Linear Regression — The Foundation of Machine Learning

A scroll-driven visual deep dive into linear regression. From data points to loss functions to gradient descent — understand the building block behind all of ML.

Introduction


Every ML model started with a line.

Linear regression is the “Hello World” of machine learning.
It’s simple, but it teaches you every concept you’ll use in deep learning:
loss functions, gradient descent, overfitting, regularization — all of it starts here.

↓ Scroll to learn — even if you know this, the quizzes might surprise you

The Setup

The Problem: Predict a Number from Data

Imagine you have data about houses — square footage and sale price. You want to predict the price of a new house given its size. That’s regression.

[Scatter plot: Square Feet (x-axis) vs. Price in $K (y-axis). What line best fits these points?]
Each dot is a house. Can you see the pattern?
The Line

The Model: y = wx + b

Linear regression asks: what’s the best straight line through these points?

The simplest model in all of ML

🎯 Prediction ŷ = 📐 Weight × Input + 📍 Bias
📐 Weight × Input

The weight w is the slope — it tells you how much the prediction changes per unit increase in x. For house prices measured in dollars, if w = 200, each extra square foot adds $200 to the predicted price.

w · x
📍 Bias

The intercept b shifts the entire line up or down. It's the base prediction when x = 0 — in practice, just a vertical offset that anchors the line to the data.

b
ŷ = wx + b. Red lines = errors (residuals) — we want to minimize these.
The best-fit line minimizes the total squared vertical distance to all points.
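Here's a minimal sketch of the model in code, with invented house data and a hand-picked w and b (the numbers are illustrative, not taken from the chart above):

```python
import numpy as np

# Invented toy data: house size (sq ft) and sale price ($K) -- illustrative only
x = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
y = np.array([190, 245, 300, 340, 405, 460], dtype=float)

# A candidate line, picked by eye: slope w and intercept b
w, b = 0.16, 60.0

y_hat = w * x + b        # the model's predictions, one per house
residuals = y_hat - y    # the "red lines": predicted minus actual

print(y_hat)
print(residuals)
```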
🟢 Quick Check

In the equation ŷ = wx + b, what does 'w' represent?

Try It Yourself: Fit the Line

Now it’s your turn! Use the sliders below to adjust the slope (w) and intercept (b). Watch the green regression line and red residual lines update in real time. Try to minimize the MSE (Mean Squared Error). Can you get it below 100?

✦ Interactive
Drag the sliders to fit the line yourself — aim for the lowest MSE!
Loss Function

How Do You Know If Your Line Is Good?

You need a way to measure how wrong your predictions are. That’s what a loss function does.

Mean Squared Error (MSE) — the standard loss

1. error_i = ŷ_i - y_i
The error for one data point: predicted minus actual. Can be positive or negative.
2. error_i² = (ŷ_i - y_i)²
Square it! This makes all errors positive and penalizes big errors more than small ones.
3. MSE = (1/n) Σ(ŷ_i - y_i)²
Average the squared errors over all n data points. This is your loss.
4. Goal: find w, b that minimize MSE
The 'best' line is the one with the smallest average squared error.
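In code, MSE is a one-liner over the residuals. The data and the candidate line below are the same illustrative numbers as in the earlier sketch:

```python
import numpy as np

def mse(y_hat, y):
    """Mean Squared Error: the average of the squared residuals."""
    return float(np.mean((y_hat - y) ** 2))

# Same illustrative toy data and hand-picked line as before
x = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
y = np.array([190, 245, 300, 340, 405, 460], dtype=float)
w, b = 0.16, 60.0

print(mse(w * x + b, y))   # the loss for this particular (w, b)
```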
🟡 Checkpoint

Your model predicts [3, 5, 7] but the actual values are [2, 5, 10]. What is the MSE?

Gradient Descent

Finding the Best Line: Gradient Descent

Now the big question: how do you find the w and b that minimize the loss? Gradient descent — the same algorithm used to train GPT-4.

[Plot: w (weight) on the x-axis, Loss on the y-axis, showing Start → Step 1 → Step 2 → Step 3 → Minimum!]
Gradient descent: follow the slope downhill to the minimum

The gradient descent update rule

1. ∂MSE/∂w = (2/n) Σ (ŷ_i - y_i) · x_i
The gradient tells you: which direction should w move to reduce the loss?
2. ∂MSE/∂b = (2/n) Σ (ŷ_i - y_i)
Same idea for the bias — which direction reduces the loss?
3. w ← w - α · ∂MSE/∂w
Update w: step in the opposite direction of the gradient. α (alpha) is the learning rate — how big each step is.
4. b ← b - α · ∂MSE/∂b
Same update for b. Repeat until loss converges!
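Here's a minimal gradient-descent loop for w and b, again on invented toy data. Standardizing x first is an illustrative choice so that a single learning rate works for both parameters:

```python
import numpy as np

# Illustrative toy data (house size in sq ft, price in $K)
x = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
y = np.array([190, 245, 300, 340, 405, 460], dtype=float)

# Standardize x so one learning rate suits both w and b (illustrative choice)
x = (x - x.mean()) / x.std()

w, b = 0.0, 0.0   # start somewhere
alpha = 0.1       # learning rate

for step in range(200):
    y_hat = w * x + b
    error = y_hat - y
    grad_w = 2 * np.mean(error * x)   # dMSE/dw
    grad_b = 2 * np.mean(error)       # dMSE/db
    w -= alpha * grad_w               # step against the gradient
    b -= alpha * grad_b

print(w, b, np.mean((w * x + b - y) ** 2))   # fitted parameters and final MSE
```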

Explore: How Learning Rate Affects Convergence

Adjust the learning rate slider and click Run Step to watch gradient descent in action. Try a very small learning rate vs. a larger one. What happens when it’s too large?

✦ Interactive
Adjust α and step through gradient descent — can you reach the minimum?
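If you want to reproduce the experiment outside the widget, a rough sketch is to rerun the same loop with a few values of α and compare the final loss (the toy data and the specific α values are illustrative):

```python
import numpy as np

x = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
y = np.array([190, 245, 300, 340, 405, 460], dtype=float)
x = (x - x.mean()) / x.std()   # standardized, as in the previous sketch

def final_mse(alpha, steps=50):
    """Run gradient descent for a fixed number of steps and return the MSE."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        error = w * x + b - y
        w -= alpha * 2 * np.mean(error * x)
        b -= alpha * 2 * np.mean(error)
    return float(np.mean((w * x + b - y) ** 2))

for alpha in (0.001, 0.1, 1.5):
    # too small: barely moves; moderate: converges; too large: the loss blows up
    print(alpha, final_mse(alpha))
```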
🟡 Checkpoint

During gradient descent, you compute ∂MSE/∂w = 5.0 and your learning rate α = 0.1. How do you update w?

Closed Form

The Shortcut: Normal Equation

Here’s a cool fact: for linear regression (unlike most ML models), you can skip gradient descent entirely and solve for the optimal w and b in one shot.

The Normal Equation — instant solution

Closed-Form Solution

Instead of iterating with gradient descent, directly compute the optimal weights in one matrix operation. No learning rate to tune, no convergence to worry about.

w* = (XᵀX)⁻¹Xᵀy
Strengths

Gives the exact analytical solution in a single computation step. Zero hyperparameters, no risk of overshooting or getting stuck in local minima.

⚠️ Limitation

Requires inverting the matrix XᵀX, which costs O(d³). When d (number of features) exceeds ~10,000, this becomes impractically slow — use gradient descent instead.

Cost: O(d³)
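As a sketch of how this looks in NumPy: the column of ones absorbs the bias b, the data is the same illustrative toy set as before, and np.linalg.solve is used instead of an explicit inverse for numerical stability.

```python
import numpy as np

x = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
y = np.array([190, 245, 300, 340, 405, 460], dtype=float)

# Design matrix: a column of ones (absorbs the bias b) next to the feature
X = np.column_stack([np.ones_like(x), x])

# w* = (XᵀX)⁻¹Xᵀy, computed with solve() rather than an explicit inverse
b, w = np.linalg.solve(X.T @ X, X.T @ y)
print(w, b)
```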
🟡 Checkpoint

Why don't we use the Normal Equation for training neural networks?

Evaluation

How Good Is Your Model? R² Score

MSE tells you the loss, but it’s hard to interpret in isolation. The R² score gives you a percentage: how much of the variance in y does your model explain?

R² — the goodness-of-fit metric

1. SS_res = Σ(y_i - ŷ_i)²
Residual sum of squares — how much error your model makes
2. SS_tot = Σ(y_i - ȳ)²
Total sum of squares — how much y varies around its mean (baseline)
3. R² = 1 - SS_res / SS_tot
The fraction of variance your model explains. R²=1 → perfect fit. R²=0 → no better than predicting the mean.
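A minimal sketch of R² in code, reusing the illustrative toy data and the hand-picked line from the earlier sketches:

```python
import numpy as np

def r2_score(y, y_hat):
    """R² = 1 - SS_res / SS_tot: the fraction of variance explained."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative toy data and the hand-picked line from earlier
x = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
y = np.array([190, 245, 300, 340, 405, 460], dtype=float)
y_hat = 0.16 * x + 60.0

print(r2_score(y, y_hat))   # close to 1.0: the line explains most of the variance
```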
Underfitting (too simple) → Good Fit (just right ✓) → Overfitting (too complex)
🔴 Challenge

Your linear regression model has R² = 0.85 on training data but R² = 0.40 on test data. What's happening?

🎓 What You Now Know

Linear regression fits a line: ŷ = wx + b — Find the weight and bias that minimize prediction errors.

MSE measures how wrong you are — Average squared errors. Penalizes big mistakes more.

Gradient descent finds the minimum — Follow the slope downhill. Learning rate controls step size.

Normal equation gives an exact solution — One formula for small problems. Gradient descent for big ones.

R² tells you how good the fit is — 1.0 = perfect, 0.0 = useless. Watch for overfitting with train/test gaps.

These concepts — loss functions, gradient descent, overfitting — are the exact same ideas used in deep learning. You just learned the foundation of all of ML. 🚀
