Text Classification — Teaching Machines to Sort Your Inbox
A scroll-driven visual deep dive into text classification. From spam filters to Gmail's categories — learn how ML models read text, extract features, and assign labels at scale.
📬
How does Gmail know this is a promotion?
Every email you receive is silently read by a text classifier. Spam or not spam. Primary, Social, Promotions, Updates. Important or ignorable. Behind every label is the same fundamental task: text classification.
↓ Scroll to learn how machines read and categorize text at scale
The Text Classification Pipeline
Feature Extraction: How Models “Read”
Gmail classifies billions of emails daily into Primary, Social, Promotions, Updates, and Spam. Which feature approach makes sense at this scale?
💡 Think about cost per email. Can you afford a GPU inference for every single email?
At Gmail's scale (300B+ emails/year), you can't run a BERT model on every email — it's too expensive. The real-world approach is cascaded classification: (1) Simple rules catch obvious spam (known bad senders, blocked domains), (2) TF-IDF + lightweight ML handles most categorization (cheap, fast), (3) Neural models handle ambiguous cases that simpler models aren't confident about. This cascade reduces expensive GPU inference by 90%+ while maintaining accuracy.
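To make the cascade concrete, here is a minimal Python sketch using scikit-learn. The blocked-sender list, toy training data, confidence threshold, and the neural fallback are all hypothetical placeholders, not Gmail's actual system:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

BLOCKED_SENDERS = {"deals@known-spammer.example"}   # stage 1: cheap rules

# Stage 2 model: TF-IDF + logistic regression, trained on a toy dataset
# purely so the sketch runs end to end.
train_texts = [
    "50% off everything, shop the sale now",
    "team standup moved to 10am tomorrow",
    "your march invoice is attached",
]
train_labels = ["Promotions", "Primary", "Updates"]
cheap_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
cheap_model.fit(train_texts, train_labels)

def expensive_neural_model(text):
    # Placeholder for a fine-tuned transformer (stage 3); see the BERT
    # discussion further down.
    return "Primary"

def classify_email(sender, text, confidence_threshold=0.8):
    # Stage 1: rules catch obvious spam with zero model cost.
    if sender in BLOCKED_SENDERS:
        return "Spam"
    # Stage 2: the cheap model handles the bulk of traffic.
    probs = cheap_model.predict_proba([text])[0]
    best = probs.argmax()
    if probs[best] >= confidence_threshold:
        return cheap_model.classes_[best]
    # Stage 3: only low-confidence cases pay for GPU inference.
    return expensive_neural_model(text)

print(classify_email("friend@example.com", "lunch tomorrow?"))
```

The key design point is that the expensive model only ever sees the small fraction of traffic the cheap stages cannot decide confidently.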
Choosing the Right Classifier
Multi-Class & Multi-Label: Beyond Binary
Gmail needs to classify an email as exactly ONE of: Primary, Social, Promotions, Updates. What's the output layer?
💡 The categories are mutually exclusive — an email can't be in both Primary AND Social...
Since an email goes into exactly ONE category (mutually exclusive), softmax is the right choice. Softmax forces all 4 probabilities to sum to 1: P(Primary) + P(Social) + P(Promotions) + P(Updates) = 1. The email goes to the category with the highest probability. Sigmoid would allow multiple categories simultaneously (multi-label), which isn't what Gmail's inbox tabs do.
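A quick numerical illustration of the difference, in plain NumPy (the logits are made up):

```python
import numpy as np

labels = ["Primary", "Social", "Promotions", "Updates"]
logits = np.array([2.1, -0.3, 1.4, 0.2])            # raw model scores (made up)

softmax = np.exp(logits) / np.exp(logits).sum()      # multi-class: sums to 1
sigmoid = 1 / (1 + np.exp(-logits))                  # multi-label: each independent

print(dict(zip(labels, softmax.round(2))), "total:", round(softmax.sum(), 2))
print(dict(zip(labels, sigmoid.round(2))), "total:", round(sigmoid.sum(), 2))

# Softmax: pick exactly one tab. Sigmoid: threshold each label independently.
inbox_tab = labels[int(softmax.argmax())]             # -> "Primary"
applied_labels = [l for l, p in zip(labels, sigmoid) if p > 0.5]
print(inbox_tab, applied_labels)
```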
Deep Learning for Text Classification
You have 500 labeled emails and need to build a classifier. Which approach is most likely to give the best accuracy?
💡 Which approach starts with the most 'prior knowledge' about language?
With only 500 labeled examples, training a neural network from scratch would badly overfit. TF-IDF + Naive Bayes is a solid baseline (~85% on many tasks). But fine-tuned BERT consistently outperforms both, even with very few examples, because it brings knowledge from pretraining on billions of words. BERT already 'understands' language — you just need to teach it your specific label scheme. Studies show BERT matches 100K-example traditional ML accuracy with as few as 500 examples.
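A minimal fine-tuning sketch, assuming the Hugging Face transformers and datasets libraries are installed; the two example emails and the label indices are placeholders standing in for your ~500 labeled messages:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["Primary", "Social", "Promotions", "Updates"]
data = Dataset.from_dict({
    "text": ["team standup moved to 10am", "flash sale ends tonight"],
    "label": [0, 2],                          # indices into `labels`
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                        padding="max_length"), batched=True)

# Pretrained BERT plus a fresh 4-way classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="email-bert", num_train_epochs=3),
    train_dataset=data,
)
trainer.train()   # with a few hundred real examples this typically beats a TF-IDF baseline
```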
Production Considerations
Your email classifier labels messages as Primary, Social, Promotions, Updates, and Forums with 94% overall accuracy. But users complain important emails keep going to Promotions. What should you optimize?
💡 Which type of misclassification error has the highest real-world cost for the user?
Overall accuracy hides per-class performance in multi-class settings. The cost matrix is asymmetric: missing an important email (false negative for Primary) has far higher real-world cost than showing a promotional email in Primary (false positive). Gmail uses a confidence threshold: only classify as non-Primary if the model is >95% confident. Borderline cases default to Primary as a safety measure. This is a design decision, not a model limitation — asymmetric costs demand asymmetric thresholds.
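One way to encode that design decision is a post-processing rule on the model's probabilities. A hypothetical sketch, with an illustrative 95% threshold:

```python
import numpy as np

LABELS = ["Primary", "Social", "Promotions", "Updates", "Forums"]

def route_to_tab(probs, non_primary_threshold=0.95):
    """Default to Primary unless the model is very confident the email belongs elsewhere."""
    best = int(np.argmax(probs))
    if LABELS[best] != "Primary" and probs[best] < non_primary_threshold:
        return "Primary"                      # borderline cases go to the safe tab
    return LABELS[best]

print(route_to_tab([0.30, 0.05, 0.60, 0.03, 0.02]))    # only 60% sure -> "Primary"
print(route_to_tab([0.01, 0.01, 0.97, 0.005, 0.005]))  # confident     -> "Promotions"
```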
🎓 What You Now Know
✓ Text classification is a pipeline — raw text → preprocessing → feature extraction → model → label. Each step matters.
✓ Naive Bayes → LogReg → SVM → Neural Nets — always start simple. You’ll be surprised how far linear models go with text (a minimal baseline sketch follows this list).
✓ Fine-tuned BERT revolutionized text classification — matching 100K-example performance with just 500 labeled examples via transfer learning.
✓ Production systems use cascaded classifiers — cheap models for easy cases, expensive models for hard cases. This is how Gmail, Outlook, and Yahoo Mail work at scale.
✓ The hardest part isn’t the model — it’s getting good labeled data, handling concept drift, and choosing the right evaluation metric.
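As a reference for the "start simple" advice above, here is a minimal end-to-end pipeline sketch with scikit-learn, on toy data with illustrative label names:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# raw text -> preprocessing (lowercasing, tokenization, stop-word removal,
# all inside TfidfVectorizer) -> TF-IDF features -> Naive Bayes -> label
emails = [
    "limited time offer, huge sale on all shoes",
    "your package has shipped, track it here",
    "are we still on for dinner friday?",
]
tabs = ["Promotions", "Updates", "Primary"]

baseline = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                         MultinomialNB())
baseline.fit(emails, tabs)
print(baseline.predict(["huge clearance sale this weekend"]))   # -> "Promotions"
```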
Text classification is the workhorse of NLP in production. Every time Gmail sorts your inbox, every time a review site shows star ratings, every time a news app categorizes articles — a text classifier is quietly doing its job. 📬
↗ Keep Learning
Spam Detection — The Original ML Success Story
A scroll-driven visual deep dive into spam detection. From Bayesian filters to modern adversarial ML — learn how email services block 15 billion spam messages daily and why spammers keep finding ways around it.
Bag of Words & TF-IDF — How Search Engines Ranked Before AI
A scroll-driven visual deep dive into Bag of Words and TF-IDF. Learn how documents become vectors, why term frequency alone fails, and how IDF rescues relevance — the backbone of search before neural models.
Naive Bayes — Why 'Stupid' Assumptions Work Brilliantly
A scroll-driven visual deep dive into Naive Bayes. Learn Bayes' theorem, why the 'naive' independence assumption is wrong but works anyway, and why it dominates spam filtering.
Sentiment Analysis — Reading Between the Lines at Scale
A scroll-driven visual deep dive into sentiment analysis. Learn how machines detect opinion, sarcasm, and emotion in text — from star ratings to brand monitoring to Gmail's tone detection.