Support Vector Machines — Finding the Perfect Boundary
A scroll-driven visual deep dive into SVMs. Learn about maximum margin, the kernel trick, support vectors, and why SVMs dominated ML before deep learning.
Maximum Margin Classification
Not just any boundary.
The best one.
Logistic regression finds a boundary. SVMs find the boundary with the maximum margin — the widest possible gap between classes. This makes them more robust and theoretically elegant.
What Is a Margin?
Why does maximizing the margin lead to better generalization?
💡 If a new data point is slightly different from the training data, which boundary — wide or narrow margin — is more likely to classify it correctly?
With a wide margin, even if a new data point is slightly unusual (shifted from the training distribution), it's still likely to fall on the correct side of the boundary. A narrow margin means even tiny shifts could push a point to the wrong side. This is formalized in statistical learning theory: wider margins correspond to lower effective model complexity and tighter generalization bounds.
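To make the margin concrete, here is a minimal sketch, assuming scikit-learn and a synthetic two-blob dataset, that fits a linear SVM and reports the margin width 2/||w|| from the learned weight vector. The dataset and the large C value are illustrative choices, not part of the article.

```python
# Minimal sketch: fit a linear SVM and compute the margin width 2/||w||.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy, well-separated two-class data (illustrative only)
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.8)

clf = SVC(kernel="linear", C=1e6)   # very large C ~ (nearly) hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # weight vector of the separating hyperplane
margin_width = 2.0 / np.linalg.norm(w)
print(f"||w|| = {np.linalg.norm(w):.3f}, margin width = {margin_width:.3f}")
```

A smaller ||w|| means a wider margin, which is exactly why the optimization below minimizes ||w||².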
Support Vectors: The Points That Matter
The SVM optimization problem
Maximize: margin = 2/||w||
Subject to: yᵢ(w·xᵢ + b) ≥ 1 for all i
Equivalent: Minimize ||w||²/2
Only support vectors have non-zero dual coefficients αᵢ.

You train an SVM on 10,000 data points. Only 50 are support vectors. If you remove one non-support-vector point and retrain, what happens?
💡 What determines the SVM boundary? Only the support vectors. So what happens when you remove a non-support vector?
This is one of SVM's most elegant properties. The decision boundary is entirely determined by the support vectors — the points closest to the boundary. Removing any of the 9,950 non-support-vector points has ZERO effect on the boundary. This means SVMs are robust to noise in the 'easy' regions of feature space.
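You can check this property directly. Here is a small sketch, assuming scikit-learn and a synthetic dataset chosen only for illustration: fit a linear SVM, drop one point that is not a support vector, refit, and compare the learned weights.

```python
# Sketch: removing a non-support-vector point leaves the boundary unchanged.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=1, cluster_std=1.0)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:", len(svm.support_), "out of", len(X))

# Pick any point that is NOT a support vector and delete it.
non_sv = np.setdiff1d(np.arange(len(X)), svm.support_)[0]
X2, y2 = np.delete(X, non_sv, axis=0), np.delete(y, non_sv)

svm2 = SVC(kernel="linear", C=1.0).fit(X2, y2)

# Differences should be ~0: the solution depends only on the support vectors.
print("max |Δw| =", np.abs(svm.coef_ - svm2.coef_).max())
print("max |Δb| =", np.abs(svm.intercept_ - svm2.intercept_).max())
```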
Nonlinear Boundaries: The Kernel Trick
Common kernels
Linear kernel: K(x, y) = x · y
The simplest kernel — no transformation at all. It finds a straight hyperplane boundary in the original feature space. Use when your data is already linearly separable.

Polynomial kernel: K(x, y) = (x · y + c)ᵈ
Maps data into a polynomial feature space, allowing curved decision boundaries of degree d. Higher degrees capture more complex patterns but risk overfitting.

RBF (Gaussian) kernel: K(x, y) = exp(-γ||x-y||²)
The most powerful and popular kernel — it implicitly maps data to infinite dimensions, enabling the SVM to learn any smooth boundary. The γ parameter controls how far the influence of each point reaches.

The kernel trick: K(x, y) = φ(x)·φ(y)
The mathematical magic behind all kernels: compute the inner product in high-dimensional space without ever actually transforming the data. This makes SVMs computationally tractable even with infinite-dimensional feature spaces.

The RBF kernel maps data to infinite dimensions. Why doesn't this cause overfitting?
💡 What regularizes the SVM? Think about what the optimization objective minimizes...
This is key to understanding SVMs. Even though the feature space is infinite, the SVM's margin maximization constrains the model: it can only use the support vectors to define the boundary, and it penalizes complex boundaries (small margins). The number of support vectors, not the dimension of the feature space, controls the effective complexity — this is why SVMs generalize well even with RBF kernels.
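To see the kernel choice and the support-vector count in action, here is a short sketch assuming scikit-learn, using concentric circles, a classic dataset that is not linearly separable. The specific parameters (noise level, C, polynomial degree) are illustrative.

```python
# Sketch: compare kernels on non-linearly-separable data and
# report the support-vector count as a measure of effective complexity.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.4, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale", degree=3, C=1.0).fit(X_tr, y_tr)
    print(f"{kernel:>6}: test acc = {clf.score(X_te, y_te):.2f}, "
          f"support vectors = {clf.n_support_.sum()}")
```

The linear kernel fails on this data, while RBF handles it easily, and what grows with boundary complexity is the number of support vectors, not any explicit feature dimension.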
Real Data Isn’t Clean: Soft Margins
Soft margin SVM
Objective: minimize ||w||²/2 + C·Σξᵢ

Margin term: ||w||²/2
Encourages a wide margin by minimizing the weight vector's magnitude. A smaller ||w|| means a wider gap between classes, leading to a simpler and more generalizable boundary.

Slack variables: ξᵢ ≥ 0 for each point i
Each training point gets a slack variable ξᵢ measuring how much it violates the margin. Points correctly classified outside the margin have ξᵢ = 0 (no penalty); points inside the margin or misclassified have ξᵢ > 0, proportional to the violation distance.

Violation penalty: C·Σξᵢ
The total cost of all margin violations, weighted by the hyperparameter C. This term pulls the boundary toward correct classification at the expense of margin width.

High C punishes violations harshly — the optimizer prioritizes correct classification over margin width. This creates a complex boundary that fits the training data tightly, risking overfitting.
Low C tolerates violations — the optimizer prioritizes a wide margin over perfect classification. This creates a simpler boundary that may misclassify some training points but generalizes better to new data.
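As a sanity check on the objective above, here is a rough sketch, assuming scikit-learn, that fits a soft-margin linear SVM and recomputes the two terms by hand, using ξᵢ = max(0, 1 − yᵢ(w·xᵢ + b)) for the slack. The synthetic dataset and the C value are arbitrary.

```python
# Sketch: recompute the soft-margin objective ||w||²/2 + C·Σξᵢ by hand.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=2, random_state=3, cluster_std=1.5)
y_pm = np.where(y == 1, 1, -1)          # labels as ±1, matching the constraints

C = 1.0
svm = SVC(kernel="linear", C=C).fit(X, y_pm)
w, b = svm.coef_[0], svm.intercept_[0]

# Slack: zero for points outside the margin, positive for violators
slack = np.maximum(0.0, 1.0 - y_pm * (X @ w + b))
objective = 0.5 * np.dot(w, w) + C * slack.sum()

print(f"margin term ||w||²/2 = {0.5 * np.dot(w, w):.3f}")
print(f"violation term C·Σξᵢ = {C * slack.sum():.3f}")
print(f"total objective      = {objective:.3f}")
```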
In a soft margin SVM, what does the hyperparameter C control?
💡 The objective is ||w||²/2 + C·Σξᵢ. What happens when you increase C?
C is the misclassification penalty. Large C says 'classify everything correctly, even if the margin is narrow' → complex, potentially overfit boundary. Small C says 'keep the margin wide, even if some points are misclassified' → simple, potentially underfit boundary. It's the SVM's regularization parameter — analogous to λ in Ridge/Lasso (but inverted: large C = less regularization).
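To watch this trade-off directly, here is a small sketch assuming scikit-learn that sweeps C on an overlapping two-class dataset and prints training accuracy, margin width, and support-vector count at each setting. The exact numbers depend on the synthetic data, so treat them as indicative.

```python
# Sketch: sweep C and observe the margin-width vs. training-fit trade-off.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes so that some margin violations are unavoidable
X, y = make_blobs(n_samples=200, centers=2, random_state=2, cluster_std=2.0)

for C in (0.01, 0.1, 1.0, 10.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: train acc = {clf.score(X, y):.2f}, "
          f"margin width = {width:.2f}, SVs = {clf.n_support_.sum()}")
```

Typically, larger C shrinks the margin and squeezes out training errors, while smaller C keeps the margin wide and accepts a few mistakes.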
SVMs Today
🎓 What You Now Know
✓ SVMs maximize the margin — Not just any boundary, but the one with the widest gap between classes.
✓ Only support vectors matter — Remove any other point and the boundary stays the same.
✓ Kernel trick enables nonlinear boundaries — Map to higher dimensions implicitly. RBF kernel = infinite dimensions.
✓ C controls the bias-variance tradeoff — Large C approaches a hard margin (overfit risk). Small C allows a softer margin (underfit risk).
✓ SVMs are elegant but niche today — Still great for small datasets and theoretical insights.
SVMs are one of the most beautiful algorithms in ML. They teach you margins, kernels, optimization, and regularization — all in one framework. Even if you use XGBoost or neural nets in practice, understanding SVMs makes you a better ML engineer. 🚀
↗ Keep Learning
Logistic Regression — The Classifier That's Not Really Regression
A scroll-driven visual deep dive into logistic regression. Learn how a regression model becomes a classifier, why the sigmoid is the key, and how log-loss trains it.
Decision Trees — How Machines Learn to Ask Questions
A scroll-driven visual deep dive into decision trees. Learn how trees split data, what Gini impurity and information gain mean, and why trees overfit like crazy.
Ridge & Lasso — Taming Overfitting with Regularization
A scroll-driven visual deep dive into Ridge and Lasso regression. Learn why models overfit, how penalizing large weights fixes it, and why Lasso kills features.