16 min deep dive · NLP · Classification

Text Classification — Teaching Machines to Sort Your Inbox

A scroll-driven visual deep dive into text classification. From spam filters to Gmail's categories — learn how ML models read text, extract features, and assign labels at scale.

Introduction

📬 How does Gmail know this is a promotion?

Every email you receive is silently read by a text classifier. Spam or not spam. Primary, Social, Promotions, Updates. Important or ignorable. Behind every label is the same fundamental task: text classification.



The Text Classification Pipeline

📧 Raw Text (email / doc) → 🧹 Preprocess (clean & tokenize) → 🔢 Features (TF-IDF / embed) → 🧠 Classifier (NB / SVM / NN) → 🏷️ Label (spam / ham)
Every text classifier follows this pipeline: raw text → features → model → label
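
To make the pipeline concrete, here is a minimal sketch in Python using scikit-learn. The four-email dataset is a made-up toy example; a real system would train on thousands of labeled messages.

```python
# Minimal text-classification pipeline: raw text -> TF-IDF features
# -> Naive Bayes -> label. The tiny dataset is hypothetical.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "WIN a FREE iPhone, click now!!!",      # spam
    "Meeting moved to 3pm tomorrow",        # ham
    "Limited offer: 90% off, buy today",    # spam
    "Can you review my pull request?",      # ham
]
labels = ["spam", "ham", "spam", "ham"]

clf = Pipeline([
    # Preprocess + featurize: lowercase, tokenize, build TF-IDF vectors
    ("features", TfidfVectorizer(stop_words="english")),
    # Classify: Naive Bayes only needs word counts, so it trains instantly
    ("model", MultinomialNB()),
])
clf.fit(emails, labels)

print(clf.predict(["Click now for a FREE offer"]))  # -> ['spam']
```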

Feature Extraction: How Models “Read”

Era 1: Counting (1990s–2010s)
• Bag of Words • TF-IDF vectors • N-gram features • Character patterns
✓ Simple & fast ✓ Interpretable ✓ Works with few labels
✗ No word meaning ✗ High-dimensional

Era 2: Embeddings (2013–2018)
• Word2Vec averages • GloVe vectors • Doc2Vec • FastText
✓ Captures meaning ✓ Dense, compact ✓ Transfer learning
✗ One vector per word ✗ No context

Era 3: Transformers (2018–present)
• BERT embeddings • Fine-tuned LLMs • Sentence-BERT • Zero-shot classification
✓ Context-dependent ✓ State-of-the-art ✓ Few labels needed
✗ Needs GPU ✗ Expensive inference
Three eras of text features — from counting to understanding
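
As a sketch of the embedding era, the snippet below swaps TF-IDF for dense sentence vectors via the sentence-transformers library (the model name is just a common lightweight choice) and keeps a simple linear classifier on top. The two-email dataset is hypothetical.

```python
# Era 2/3-style features: encode each document as one dense vector,
# then train any classic classifier on top. Toy data is hypothetical.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast sentence encoder

texts = ["50% off everything this weekend", "Your package was delivered"]
labels = ["promotions", "updates"]

X = encoder.encode(texts)                  # one dense 384-dim vector per doc
clf = LogisticRegression().fit(X, labels)

# Unlike bag-of-words, semantically related wording can still match
# even when no surface words are shared with the training examples
print(clf.predict(encoder.encode(["Flash sale ends tonight"])))
```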
🟡 Knowledge check: Gmail classifies billions of emails daily into Primary, Social, Promotions, Updates, and Spam. Which feature approach makes sense at this scale?


Choosing the Right Classifier

Naive Bayes
• Fastest to train (just count words)
• Works with tiny datasets (100 examples!)
• Great baseline — always start here
Use for: quick prototypes, spam, SMS classification

Logistic Regression
• Linear boundary — very interpretable
• Regularization handles high dimensions
• Probability outputs for confidence
Use for: when you need interpretable decisions

SVM (Linear Kernel)
• Maximum margin → generalizes well
• Excellent with TF-IDF features
• Often best for medium-sized datasets
Use for: document classification, topic labeling

Random Forest / XGBoost
• Non-linear decision boundaries
• Feature importance for free
• Handles mixed features well
Use for: when you have text plus metadata (sender, time, etc.)
Classic ML models for text classification — each with different strengths
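
A quick way to follow the "start simple" advice is to benchmark the classic families on identical TF-IDF features. A sketch, using scikit-learn's built-in 20 Newsgroups data as a stand-in for your own labeled corpus:

```python
# Benchmark classic classifiers on the same TF-IDF features.
# 20 Newsgroups stands in for your own labeled emails/documents.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = fetch_20newsgroups(subset="train",
                          categories=["rec.autos", "sci.space", "misc.forsale"])
X = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit_transform(data.data)

models = {
    "naive_bayes":   MultinomialNB(),
    "logreg":        LogisticRegression(max_iter=1000),
    "linear_svm":    LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=200),
}
for name, model in models.items():
    scores = cross_val_score(model, X, data.target, cv=5, scoring="f1_macro")
    print(f"{name:14s} macro-F1 = {scores.mean():.3f}")
```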

Multi-Class & Multi-Label: Beyond Binary

Multi-Class: one label per document
📧 Email → Primary | Social | Promo | Update
📰 News → Sports | Politics | Tech | Entertainment
Uses: softmax output (probabilities sum to 1)

Multi-Label: multiple labels per document
📧 Email → [Important] + [Finance] + [Urgent]
📰 News → [Sports] + [Business] (if it's both)
Uses: sigmoid per label (independent probabilities)

Strategies for multi-class with binary classifiers:
• One-vs-Rest (OvR): train K binary classifiers, one per class
• One-vs-One (OvO): train K(K-1)/2 pairwise classifiers
Naive Bayes and Logistic Regression are naturally multi-class; SVM uses OvR/OvO.
Gmail's inbox categories are a multi-class problem — each email gets exactly one label
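
For the multi-label case, one common recipe is the one-vs-rest wrapper: binarize the label sets, then train an independent binary classifier per label. A sketch with made-up emails:

```python
# Multi-label sketch: MultiLabelBinarizer turns label sets into 0/1
# columns; OneVsRestClassifier fits one independent classifier per
# column (the sigmoid-per-label pattern). Toy data is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "Quarterly earnings report attached, respond today",
    "Team lunch on Friday",
    "Invoice overdue, payment needed immediately",
]
labels = [["important", "finance", "urgent"], [], ["finance", "urgent"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)          # one 0/1 column per label

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

pred = clf.predict(vec.transform(["Overdue invoice: urgent payment"]))
print(mlb.inverse_transform(pred))     # e.g. [('finance', 'urgent')]
```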
🟡 Knowledge check: Gmail needs to classify an email as exactly one of Primary, Social, Promotions, or Updates. What's the output layer?


Deep Learning for Text Classification

📧 Input Text (email body) → ✂️ Tokenizer (WordPiece) → 🧠 BERT (pretrained) → 🎯 [CLS] Head (classification) → 🏷️ Label (Promo ✓)
Fine-tuning BERT for text classification — the modern approach
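
In code, the fine-tuning loop looks roughly like the sketch below, using the Hugging Face transformers Trainer. Here `train_ds` is a placeholder for your tokenized, labeled email dataset.

```python
# Fine-tuning sketch: pretrained BERT + a fresh 4-way classification
# head on top of the [CLS] token. `train_ds` is a placeholder dataset
# whose rows carry input_ids / attention_mask / labels.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=4,            # Primary, Social, Promotions, Updates
)

def tokenize(batch):
    # WordPiece tokenization, truncated/padded to a fixed length
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

# train_ds = your labeled emails, mapped through `tokenize`
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-inbox",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()
```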
🔴 Challenge: You have 500 labeled emails and need to build a classifier. Which approach is most likely to give the best accuracy?


Production Considerations

⚡ Latency vs Accuracy
• Naive Bayes: ~0.1 ms per email
• LogReg + TF-IDF: ~0.5 ms
• BERT: ~50 ms (GPU)
• At 1M emails/sec, BERT would need ~50K GPUs

🔄 Concept Drift
• Spam evolves to evade filters
• New topics emerge (COVID in 2020)
• User behavior changes over time
• Models need continuous retraining

📊 Evaluation Beyond Accuracy
• Spam: optimize for recall (catch all spam)
• Ham: optimize for precision (never block legitimate mail)
• Use a confusion matrix per class
• A false positive (legitimate mail marked spam) costs more than a false negative

🏗️ Model Distillation
• Train a big BERT model (teacher)
• Train a small model to mimic it (student)
• Get ~95% of BERT's accuracy at ~10% of the cost
• DistilBERT: 40% smaller, 60% faster, retains 97% of BERT's performance
Deploying text classifiers at scale introduces challenges beyond accuracy
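
The "beyond accuracy" point is easy to operationalize: inspect per-class precision/recall and the confusion matrix instead of one overall number. A sketch with tiny made-up predictions standing in for held-out data:

```python
# Per-class metrics expose exactly the failure users complain about:
# overall accuracy can look high while one class leaks into another.
# y_true / y_pred are hypothetical stand-ins for held-out evaluations.
from sklearn.metrics import classification_report, confusion_matrix

classes = ["primary", "social", "promotions", "updates", "forums"]
y_true = ["primary", "primary", "promotions", "updates", "social"]
y_pred = ["promotions", "primary", "promotions", "updates", "social"]

# Low precision on "promotions" = legit mail wrongly routed there;
# low recall on "primary" = important mail leaking out of the inbox.
print(classification_report(y_true, y_pred, labels=classes, zero_division=0))
print(confusion_matrix(y_true, y_pred, labels=classes))
```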
🔴 Challenge: Your email classifier labels messages as Primary, Social, Promotions, Updates, and Forums with 94% overall accuracy. But users complain important emails keep going to Promotions. What should you optimize?

🎓 What You Now Know

Text classification is a pipeline — raw text → preprocessing → feature extraction → model → label. Each step matters.

Naive Bayes → LogReg → SVM → Neural Nets — always start simple. You’ll be surprised how far linear models go with text.

Fine-tuned BERT revolutionized text classification — via transfer learning, it can match models trained on 100K examples with just 500 labeled examples.

Production systems use cascaded classifiers — cheap models for easy cases, expensive models for hard cases. This is how Gmail, Outlook, and Yahoo Mail work at scale.

The hardest part isn’t the model — it’s getting good labeled data, handling concept drift, and choosing the right evaluation metric.

Text classification is the workhorse of NLP in production. Every time Gmail sorts your inbox, every time a review site shows star ratings, every time a news app categorizes articles — a text classifier is quietly doing its job. 📬
