Logistic Regression — The Classifier That's Not Really Regression
A scroll-driven visual deep dive into logistic regression. Learn how a regression model becomes a classifier, why the sigmoid is the key, and how log-loss trains it.
Classification Fundamentals
How do you predict yes or no?
Will this email be spam? Will this patient have diabetes? Will this user click? Regression gives numbers. Classification gives answers.
From Numbers to Probabilities
Logistic regression in three steps
z = w₁x₁ + w₂x₂ + ... + b
σ(z) = 1 / (1 + e⁻ᶻ)
ŷ = 1 if σ(z) ≥ 0.5, else 0
What's the output range of the sigmoid function σ(z)?
💡 What happens to 1/(1+e⁻ᶻ) as z → +∞? The denominator becomes 1+0 = 1...
The sigmoid σ(z) = 1/(1+e⁻ᶻ) asymptotically approaches 0 as z→-∞ and approaches 1 as z→+∞, but never actually reaches either. Its output is always strictly between 0 and 1 — open interval (0,1). This makes it perfect for modeling probabilities.
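To make these three steps concrete, here is a minimal NumPy sketch; the weights, bias, and feature values are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real-valued score into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a 2-feature model
w = np.array([0.8, -1.2])
b = 0.5

def predict_proba(x):
    z = np.dot(w, x) + b      # step 1: linear score
    return sigmoid(z)         # step 2: squash to a probability

def predict(x, threshold=0.5):
    return int(predict_proba(x) >= threshold)  # step 3: threshold the probability

x = np.array([2.0, 1.0])
print(predict_proba(x))  # probability of class 1 (≈ 0.71 here)
print(predict(x))        # the hard yes/no answer (1)
```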
The Decision Boundary
Logistic regression's decision boundary is always what shape?
💡 Set σ(z) = 0.5. What does z equal? What shape does w·x + b = 0 define?
The decision boundary is where σ(z) = 0.5, which means z = 0, which means w₁x₁ + w₂x₂ + ... + b = 0. That's the equation of a hyperplane (a line in 2D, a plane in 3D). Logistic regression can ONLY learn linear boundaries. If the true boundary is curved, you need polynomial features, SVMs with kernels, or neural networks.
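To see the linearity directly, this small NumPy sketch (with made-up 2D weights and bias) solves w₁x₁ + w₂x₂ + b = 0 for x₂ and confirms that every point on the resulting line gets probability exactly 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2D model: boundary is w1*x1 + w2*x2 + b = 0
w1, w2, b = 1.5, -2.0, 0.4

# Solving for x2 gives a straight line: x2 = -(w1*x1 + b) / w2
x1 = np.linspace(-3, 3, 7)
x2 = -(w1 * x1 + b) / w2

# Every point on that line scores z = 0, i.e. sigmoid(z) = 0.5
z = w1 * x1 + w2 * x2 + b
print(np.allclose(z, 0.0))           # True
print(np.allclose(sigmoid(z), 0.5))  # True
```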
How We Train It: Log-Loss
Binary cross-entropy (log-loss)
−y · log(ŷ)
If the true class is 1, the loss is −log(ŷ). A confident correct prediction (ŷ ≈ 0.99) costs nearly zero, but a confident wrong prediction (ŷ ≈ 0.01) costs ~4.6, and the penalty grows without bound as ŷ → 0.
−(1−y) · log(1−ŷ)
If the true class is 0, the loss is −log(1−ŷ). The model is penalized for assigning high probability to the wrong class; the more confident the mistake, the steeper the penalty.
Together, the per-sample loss is −[y · log(ŷ) + (1−y) · log(1−ŷ)]; for any given sample only one of the two terms is nonzero.
A logistic regression model predicts 0.99 probability for a sample that is actually class 0. What's the log-loss for this sample?
💡 Plug in: y=0, ŷ=0.99. Loss = -log(1-0.99) = -log(0.01) = ?
When y=0, loss = -log(1-ŷ) = -log(1-0.99) = -log(0.01) ≈ 4.6. The model is 99% confident about the WRONG class. Log-loss punishes confident mistakes severely: the penalty grows without bound as the predicted probability approaches the wrong extreme, while MSE would charge only (0.99)² ≈ 0.98 for the same mistake. That pressure forces the model to be well-calibrated rather than overconfident.
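A quick NumPy sanity check of these numbers; the predicted probabilities below are just the illustrative values from above:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-15):
    # Log-loss for a single sample; clip to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy(1, 0.99))  # confident and correct: ~0.01
print(binary_cross_entropy(0, 0.99))  # confident and wrong:   ~4.6
print(binary_cross_entropy(0, 0.50))  # unsure:                ~0.69
```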
Beyond Binary: Multi-class Classification
Softmax for multi-class
z₁, z₂, ..., zₖ = raw scores for K classes
P(class k) = eᶻᵏ / Σⱼ eᶻʲ
Prediction = argmax(P(class 1), ..., P(class K))
In logistic regression with 5 classes, how many weight vectors does the model learn?
💡 Look at the softmax formula. Where do z₁, z₂, ..., z₅ come from?
Each class k has its own weight vector wₖ and bias bₖ. For input x, the raw score for class k is zₖ = wₖᵀx + bₖ. These K scores are fed through softmax to produce K probabilities. With 5 classes and d features, the model has 5(d+1) parameters total.
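A minimal NumPy sketch of this setup, using hypothetical sizes K = 5 and d = 3 so the 5(d+1) parameter count is easy to verify:

```python
import numpy as np

K, d = 5, 3                       # 5 classes, 3 features (illustrative sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))       # one weight vector per class
b = rng.normal(size=K)            # one bias per class
print(W.size + b.size)            # 5 * (3 + 1) = 20 parameters

def softmax(z):
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.array([1.0, -0.5, 2.0])    # a single input with d = 3 features
z = W @ x + b                     # K raw scores
p = softmax(z)                    # K probabilities
print(p.sum())                    # 1.0
print(p.argmax())                 # index of the predicted class
```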
When to Use Logistic Regression
🎓 What You Now Know
✓ Logistic regression = linear model + sigmoid — Squash z ∈ (-∞,∞) into p ∈ (0,1).
✓ Decision boundary is always linear — A hyperplane where σ(z) = 0.5.
✓ Train with log-loss, not MSE — Cross-entropy is convex and punishes confident mistakes.
✓ Multi-class uses softmax — K scores → K probabilities that sum to 1.
✓ Always start with logistic regression — It’s fast, interpretable, and sets a strong baseline (see the sketch after this list).
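As a concrete starting point, here is a minimal baseline sketch with scikit-learn; the synthetic dataset and the settings are placeholders, not taken from any real problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for a real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("log-loss:", log_loss(y_test, probs))
print("weights:", clf.coef_.shape)       # one weight per feature
```

The fitted coef_ and intercept_ are exactly the w and b from earlier, so the baseline is not just fast to train but also directly interpretable.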
Logistic regression is the workhorse of classification. It’s used everywhere — spam detection, medical diagnosis, click-through prediction, credit scoring. And the sigmoid + cross-entropy combination is the same output layer used in neural networks. You just learned a neural net’s final layer. 🚀
↗ Keep Learning
Linear Regression — The Foundation of Machine Learning
A scroll-driven visual deep dive into linear regression. From data points to loss functions to gradient descent — understand the building block behind all of ML.
Accuracy, Precision, Recall & F1 — Choosing the Right Metric
A scroll-driven visual deep dive into classification metrics. Learn why accuracy misleads, what precision and recall actually measure, and when to use F1, F2, or something else entirely.
ROC Curves & AUC — Measuring Classifier Performance Visually
A scroll-driven visual deep dive into ROC curves and AUC. Learn TPR vs FPR, why AUC is threshold-independent, and when to use ROC vs PR curves.
Naive Bayes — Why 'Stupid' Assumptions Work Brilliantly
A scroll-driven visual deep dive into Naive Bayes. Learn Bayes' theorem, why the 'naive' independence assumption is wrong but works anyway, and why it dominates spam filtering.