Support Vector Machines — Finding the Perfect Boundary
A scroll-driven visual deep dive into SVMs. Learn about maximum margin, the kernel trick, support vectors, and why SVMs dominated ML before deep learning.
Maximum Margin Classification
Not just any boundary.
The best one.
Logistic regression finds a boundary. SVMs find the boundary with the maximum margin — the widest possible gap between classes. This makes them more robust and theoretically elegant.
What Is a Margin?
Why does maximizing the margin lead to better generalization?
💡 If a new data point is slightly different from the training data, which boundary — wide or narrow margin — is more likely to classify it correctly?
With a wide margin, even if a new data point is slightly unusual (shifted from the training distribution), it's still likely to fall on the correct side of the boundary. A narrow margin means even tiny shifts could push a point to the wrong side. This is formalized in statistical learning theory: wider margins correspond to lower effective model complexity and tighter generalization bounds.
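To make the margin concrete, here is a minimal sketch, assuming scikit-learn and a synthetic two-blob dataset, that fits a linear SVM and reports the margin width 2/||w|| from the learned weight vector. The dataset and the large C value are illustrative choices, not part of the article.

```python
# Minimal sketch: fit a linear SVM and compute the margin width 2/||w||.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy, well-separated two-class data (illustrative only)
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.8)

clf = SVC(kernel="linear", C=1e6)   # very large C ~ (nearly) hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # weight vector of the separating hyperplane
margin_width = 2.0 / np.linalg.norm(w)
print(f"||w|| = {np.linalg.norm(w):.3f}, margin width = {margin_width:.3f}")
```

A smaller ||w|| means a wider margin, which is exactly why the optimization below minimizes ||w||².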
Support Vectors: The Points That Matter
The SVM optimization problem
Maximize: margin = 2/||w||
Subject to: yᵢ(w·xᵢ + b) ≥ 1 for all i
Equivalent: Minimize ||w||²/2
Only support vectors have non-zero dual coefficients αᵢ.

You train an SVM on 10,000 data points. Only 50 are support vectors. If you remove one non-support-vector point and retrain, what happens?
💡 What determines the SVM boundary? Only the support vectors. So what happens when you remove a non-support vector?
This is one of SVM's most elegant properties. The decision boundary is entirely determined by the support vectors — the points closest to the boundary. Removing any of the 9,950 non-support-vector points has ZERO effect on the boundary. This means SVMs are robust to noise in the 'easy' regions of feature space.
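You can check this property directly. Here is a small sketch, assuming scikit-learn and a synthetic dataset chosen only for illustration: fit a linear SVM, drop one point that is not a support vector, refit, and compare the learned weights.

```python
# Sketch: removing a non-support-vector point leaves the boundary unchanged.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=1, cluster_std=1.0)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:", len(svm.support_), "out of", len(X))

# Pick any point that is NOT a support vector and delete it.
non_sv = np.setdiff1d(np.arange(len(X)), svm.support_)[0]
X2, y2 = np.delete(X, non_sv, axis=0), np.delete(y, non_sv)

svm2 = SVC(kernel="linear", C=1.0).fit(X2, y2)

# Differences should be ~0: the solution depends only on the support vectors.
print("max |Δw| =", np.abs(svm.coef_ - svm2.coef_).max())
print("max |Δb| =", np.abs(svm.intercept_ - svm2.intercept_).max())
```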
Nonlinear Boundaries: The Kernel Trick
Common kernels
Linear kernel: K(x, y) = x · y
The simplest kernel — no transformation at all. It finds a straight hyperplane boundary in the original feature space. Use when your data is already linearly separable.

Polynomial kernel: K(x, y) = (x · y + c)ᵈ
Maps data into a polynomial feature space, allowing curved decision boundaries of degree d. Higher degrees capture more complex patterns but risk overfitting.

RBF (Gaussian) kernel: K(x, y) = exp(-γ||x-y||²)
The most powerful and popular kernel — it implicitly maps data to infinite dimensions, enabling the SVM to learn any smooth boundary. The γ parameter controls how far the influence of each point reaches.

The kernel trick: K(x, y) = φ(x)·φ(y)
The mathematical magic behind all kernels: compute the inner product in high-dimensional space without ever actually transforming the data. This makes SVMs computationally tractable even with infinite-dimensional feature spaces.

The RBF kernel maps data to infinite dimensions. Why doesn't this cause overfitting?
💡 What regularizes the SVM? Think about what the optimization objective minimizes...
This is key to understanding SVMs. Even though the feature space is infinite, the SVM's margin maximization constrains the model: it can only use the support vectors to define the boundary, and it penalizes complex boundaries (small margins). The number of support vectors, not the dimension of the feature space, controls the effective complexity — this is why SVMs generalize well even with RBF kernels.
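To see the kernel choice and the support-vector count in action, here is a short sketch assuming scikit-learn, using concentric circles, a classic dataset that is not linearly separable. The specific parameters (noise level, C, polynomial degree) are illustrative.

```python
# Sketch: compare kernels on non-linearly-separable data and
# report the support-vector count as a measure of effective complexity.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.4, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale", degree=3, C=1.0).fit(X_tr, y_tr)
    print(f"{kernel:>6}: test acc = {clf.score(X_te, y_te):.2f}, "
          f"support vectors = {clf.n_support_.sum()}")
```

The linear kernel fails on this data, while RBF handles it easily, and what grows with boundary complexity is the number of support vectors, not any explicit feature dimension.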
Real Data Isn’t Clean: Soft Margins
Soft margin SVM
Objective: minimize ||w||²/2 + C·Σξᵢ

Margin term: ||w||²/2
Encourages a wide margin by minimizing the weight vector's magnitude. A smaller ||w|| means a wider gap between classes, leading to a simpler and more generalizable boundary.

Slack variables: ξᵢ ≥ 0 for each point i
Each training point gets a slack variable ξᵢ measuring how much it violates the margin. Points correctly classified outside the margin have ξᵢ = 0 (no penalty); points inside the margin or misclassified have ξᵢ > 0, proportional to the violation distance.

Violation penalty: C·Σξᵢ
The total cost of all margin violations, weighted by the hyperparameter C. This term pulls the boundary toward correct classification at the expense of margin width.

High C punishes violations harshly — the optimizer prioritizes correct classification over margin width. This creates a complex boundary that fits the training data tightly, risking overfitting.
Low C tolerates violations — the optimizer prioritizes a wide margin over perfect classification. This creates a simpler boundary that may misclassify some training points but generalizes better to new data.
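As a sanity check on the objective above, here is a rough sketch, assuming scikit-learn, that fits a soft-margin linear SVM and recomputes the two terms by hand, using ξᵢ = max(0, 1 − yᵢ(w·xᵢ + b)) for the slack. The synthetic dataset and the C value are arbitrary.

```python
# Sketch: recompute the soft-margin objective ||w||²/2 + C·Σξᵢ by hand.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=2, random_state=3, cluster_std=1.5)
y_pm = np.where(y == 1, 1, -1)          # labels as ±1, matching the constraints

C = 1.0
svm = SVC(kernel="linear", C=C).fit(X, y_pm)
w, b = svm.coef_[0], svm.intercept_[0]

# Slack: zero for points outside the margin, positive for violators
slack = np.maximum(0.0, 1.0 - y_pm * (X @ w + b))
objective = 0.5 * np.dot(w, w) + C * slack.sum()

print(f"margin term ||w||²/2 = {0.5 * np.dot(w, w):.3f}")
print(f"violation term C·Σξᵢ = {C * slack.sum():.3f}")
print(f"total objective      = {objective:.3f}")
```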
In a soft margin SVM, what does the hyperparameter C control?
💡 The objective is ||w||²/2 + C·Σξᵢ. What happens when you increase C?
C is the misclassification penalty. Large C says 'classify everything correctly, even if the margin is narrow' → complex, potentially overfit boundary. Small C says 'keep the margin wide, even if some points are misclassified' → simple, potentially underfit boundary. It's the SVM's regularization parameter — analogous to λ in Ridge/Lasso (but inverted: large C = less regularization).
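To watch this trade-off directly, here is a small sketch assuming scikit-learn that sweeps C on an overlapping two-class dataset and prints training accuracy, margin width, and support-vector count at each setting. The exact numbers depend on the synthetic data, so treat them as indicative.

```python
# Sketch: sweep C and observe the margin-width vs. training-fit trade-off.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes so that some margin violations are unavoidable
X, y = make_blobs(n_samples=200, centers=2, random_state=2, cluster_std=2.0)

for C in (0.01, 0.1, 1.0, 10.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: train acc = {clf.score(X, y):.2f}, "
          f"margin width = {width:.2f}, SVs = {clf.n_support_.sum()}")
```

Typically, larger C shrinks the margin and squeezes out training errors, while smaller C keeps the margin wide and accepts a few mistakes.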
SVMs Today
🎓 What You Now Know
✓ SVMs maximize the margin — Not just any boundary, but the one with the widest gap between classes.
✓ Only support vectors matter — Remove any other point and the boundary stays the same.
✓ Kernel trick enables nonlinear boundaries — Map to higher dimensions implicitly. RBF kernel = infinite dimensions.
✓ C controls the bias-variance tradeoff — Large C approaches a hard margin (overfit risk). Small C allows a softer margin (underfit risk).
✓ SVMs are elegant but niche today — Still great for small datasets and theoretical insights.
SVMs are one of the most beautiful algorithms in ML. They teach you margins, kernels, optimization, and regularization — all in one framework. Even if you use XGBoost or neural nets in practice, understanding SVMs makes you a better ML engineer. 🚀
↗ Keep Learning
Logistic Regression — The Classifier That's Not Really Regression
A scroll-driven visual deep dive into logistic regression. Learn how a regression model becomes a classifier, why the sigmoid is the key, and how log-loss trains it.
Decision Trees — How Machines Learn to Ask Questions
A scroll-driven visual deep dive into decision trees. Learn how trees split data, what Gini impurity and information gain mean, and why trees overfit like crazy.
Ridge & Lasso — Taming Overfitting with Regularization
A scroll-driven visual deep dive into Ridge and Lasso regression. Learn why models overfit, how penalizing large weights fixes it, and why Lasso kills features.