Spam Detection — The Original ML Success Story
A scroll-driven visual deep dive into spam detection. From Bayesian filters to modern adversarial ML — learn how email services block 15 billion spam messages daily and why spammers keep finding ways around it.
🛡️
45% of all email
is spam.
That’s ~15 billion spam messages per day. Gmail alone blocks 99.9% of them before they reach your inbox. Spam detection was the first massive commercial success of machine learning — and it’s still one of the hardest adversarial ML problems.
↓ Scroll to understand the arms race between spammers and filters
A Brief History of Spam Fighting
Beyond Words: The Feature Engineering Arsenal
A spammer sends an email with zero text, just a large image containing the spam message. What will a text-based Naive Bayes filter do?
💡 If the model only reads text, and there IS no text...
A pure text-based Naive Bayes filter reads the email body as text. If the spam content is embedded in an image, there's NO text for the model to analyze — it sees an essentially empty email and classifies it as ham. This is the 'image spam' attack that was devastatingly effective around 2006-2007. Defenses include: (1) OCR on images to extract text, (2) flagging image-only emails as suspicious, (3) using structural features (image-to-text ratio, attachment size). This is why multi-signal spam detection is essential.
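To make defense (3) concrete, here is a minimal sketch of structural-feature extraction using Python's standard email module. The function and feature names are illustrative assumptions, not any real provider's pipeline:

from email import message_from_string

def structural_features(raw_message: str) -> dict:
    # Walk all MIME parts, tallying visible text vs. embedded images.
    msg = message_from_string(raw_message)
    text_chars = image_bytes = image_parts = 0
    for part in msg.walk():
        payload = part.get_payload(decode=True) or b""   # None for containers
        ctype = part.get_content_type()
        if ctype.startswith("text/"):
            text_chars += len(payload)
        elif ctype.startswith("image/"):
            image_parts += 1
            image_bytes += len(payload)
    return {
        "text_chars": text_chars,
        "image_parts": image_parts,
        "image_bytes": image_bytes,
        # Near-zero text plus a large image is the classic image-spam signature.
        "image_to_text_ratio": image_bytes / max(text_chars, 1),
    }

A downstream classifier can treat a high image_to_text_ratio as a strong suspicion signal, or route such messages to OCR (defense 1).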
The Bayesian Revolution (1998)
Per-user Bayesian spam score
For each word w, compute P(spam|w) from YOUR email history:

P(spam|w) = (spam_count(w) / total_spam) / ((spam_count(w) / total_spam) + (ham_count(w) / total_ham))

Combine the top 15 most 'interesting' words (those furthest from 0.5):

P(spam|email) = p₁p₂...p₁₅ / (p₁p₂...p₁₅ + (1-p₁)(1-p₂)...(1-p₁₅))

If P(spam|email) > 0.9 → SPAM
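A minimal sketch of this per-user scoring scheme in Python; the function names, count dictionaries, and toy numbers are made up for illustration, not any production filter's code:

def word_spamminess(word, spam_counts, ham_counts, total_spam, total_ham):
    # P(spam|w) from your own mail history, as defined above.
    s = spam_counts.get(word, 0) / max(total_spam, 1)
    h = ham_counts.get(word, 0) / max(total_ham, 1)
    return 0.5 if s + h == 0 else s / (s + h)   # unseen words are neutral

def email_spam_probability(words, spam_counts, ham_counts,
                           total_spam, total_ham, top_n=15):
    probs = [word_spamminess(w, spam_counts, ham_counts, total_spam, total_ham)
             for w in set(words)]
    # Keep only the top_n most 'interesting' words: furthest from 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:top_n]
    prod_spam = prod_ham = 1.0
    for p in probs:
        prod_spam *= p
        prod_ham *= 1 - p
    return prod_spam / (prod_spam + prod_ham)

# Toy usage with made-up counts (100 spam / 100 ham in the history):
spam_counts = {"viagra": 80, "free": 60, "meeting": 1}
ham_counts = {"viagra": 1, "free": 20, "meeting": 50}
p = email_spam_probability(["viagra", "free", "now"], spam_counts, ham_counts,
                           total_spam=100, total_ham=100)
print(f"P(spam|email) = {p:.3f}")   # ≈ 0.996 > 0.9 → SPAM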
The Adversarial Arms Race
A spammer adds 'baseball weather family vacation sunshine garden recipe' at the bottom of a spam email. What is this an attack on?
💡 What happens to the Bayesian probability when you multiply in lots of 'innocent' word probabilities?
This is 'Bayesian poisoning' — injecting high-P(ham) words to counteract the spam signals. If P(ham|'baseball') is very high (baseball is a very 'hammy' word), adding it to a spam email can pull the overall P(spam|email) below the detection threshold. Modern defenses: (1) weight the top N most discriminative words, not all words, (2) detect unnatural word combinations (spam text + random ham words together is itself a signal), (3) use non-textual features that poisoning can't affect.
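A tiny numeric demonstration of the poisoning effect, using a naive combine-everything version of the score (no top-N truncation, which is exactly the weakness defense (1) addresses). The per-word probabilities are invented for illustration:

def combine(probs):
    # Naive all-words version of the Bayesian combination (no top-N cut).
    prod_spam = prod_ham = 1.0
    for p in probs:
        prod_spam *= p
        prod_ham *= 1 - p
    return prod_spam / (prod_spam + prod_ham)

spam_words = [0.99, 0.95, 0.90]           # e.g. 'viagra', 'free', 'winner'
ham_padding = [0.05] * 5                  # e.g. 'baseball', 'garden', ...

print(combine(spam_words))                # ≈ 0.9999 → flagged as spam
print(combine(spam_words + ham_padding))  # ≈ 0.007 → slips past a 0.9 threshold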
How Gmail’s Spam Filter Actually Works
Gmail blocks 99.9% of spam. That sounds amazing. But with ~15 billion spam emails sent daily, how many spam emails still get through?
💡 Calculate: 0.001 × 15,000,000,000 = ?
0.1% × 15,000,000,000 = 15,000,000 spam emails reaching inboxes globally per day! At scale, tiny error rates produce huge absolute numbers. This is why spam detection can never be 'solved': even 99.99% accuracy means 1.5 million misses. And false positives are even worse: at a 0.01% FP rate, roughly 1.8 million legitimate emails (of the ~18 billion sent daily) would be incorrectly blocked. This is the asymmetry problem in spam detection.
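The same arithmetic as a short sketch, using the article's volume estimates (~15 billion spam/day, spam ≈ 45% of all mail):

daily_spam = 15_000_000_000                        # ~15B spam/day (article estimate)
daily_ham = round(daily_spam / 0.45) - daily_spam  # spam ≈ 45% of all mail → ~18.3B ham

print(daily_spam * 0.001)    # 99.9% block rate  → 15,000,000 spam get through
print(daily_spam * 0.0001)   # 99.99% block rate → still 1,500,000 misses
print(daily_ham * 0.0001)    # 0.01% FP rate     → ~1.8M real emails blocked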
Evaluating Spam Filters: It’s Not About Accuracy
Why accuracy is misleading for spam detection
Dataset: 90% ham, 10% spam. A model that predicts EVERYTHING as 'ham' gets 90% accuracy!

What matters: precision and recall for the SPAM class.

Spam Precision = TP / (TP + FP)
Spam Recall = TP / (TP + FN)

A spam filter has 99.5% recall and 99.8% precision. Processing 10 million emails/day (~10% spam), approximately how many legitimate emails are incorrectly blocked daily?
💡 Precision tells you about flagged emails, but how many legitimate emails are in the total pool?
Precision of 99.8% means 0.2% of emails FLAGGED as spam are legitimate. But the absolute count depends on how many are flagged. With 1M spam (recall 99.5% catches ~995K) plus false positives from 9M legitimate emails: ~2,000 legitimate emails blocked daily at this performance level. At Gmail's scale (billions of emails), even 99.99% precision means tens of thousands of misclassified emails. This is why false positive review queues and user-reported 'not spam' feedback loops are critical components, not optional extras.
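Working the quiz numbers step by step (a sketch; the volumes are the hypothetical ones from the question):

total_emails = 10_000_000         # per day, from the question
spam = total_emails * 0.10        # 1,000,000 spam
recall, precision = 0.995, 0.998

tp = spam * recall                # ≈ 995,000 spam correctly caught
flagged = tp / precision          # precision = tp / flagged → flagged ≈ 996,994
fp = flagged - tp                 # ≈ 1,994 legitimate emails blocked per day
print(round(fp))                  # ~2,000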
🎓 What You Now Know
✓ Spam detection was ML’s first killer app — Bayesian filters in 1998–2002 proved that ML could solve real-world problems at scale.
✓ It’s an adversarial problem — spammers actively evolve to evade filters, creating a perpetual arms race.
✓ Content is only part of the signal — sender reputation, authentication, behavioral data, and network analysis are equally important.
✓ Modern systems use multi-layer ensembles — cheap filters first, expensive ML for ambiguous cases, crowd-sourced signals from billions of users.
✓ False positives are costlier than false negatives — blocking a legitimate email is worse than letting spam through. Thresholds must be asymmetric.
Spam detection is where ML meets adversarial intelligence. It taught the industry that models must continuously evolve, that feature engineering trumps model sophistication, and that at billion-scale, even 99.9% accuracy leaves millions of errors. 🛡️
↗ Keep Learning
Text Classification — Teaching Machines to Sort Your Inbox
A scroll-driven visual deep dive into text classification. From spam filters to Gmail's categories — learn how ML models read text, extract features, and assign labels at scale.
Naive Bayes — Why 'Stupid' Assumptions Work Brilliantly
A scroll-driven visual deep dive into Naive Bayes. Learn Bayes' theorem, why the 'naive' independence assumption is wrong but works anyway, and why it dominates spam filtering.
Bag of Words & TF-IDF — How Search Engines Ranked Before AI
A scroll-driven visual deep dive into Bag of Words and TF-IDF. Learn how documents become vectors, why term frequency alone fails, and how IDF rescues relevance — the backbone of search before neural models.
Text Preprocessing — Turning Messy Words into Clean Features
A scroll-driven visual deep dive into text preprocessing. Learn tokenization, stemming, lemmatization, stopword removal, and normalization — the essential first step of every NLP pipeline.