Bag of Words & TF-IDF — How Search Engines Ranked Before AI
A scroll-driven visual deep dive into Bag of Words and TF-IDF. Learn how documents become vectors, why term frequency alone fails, and how IDF rescues relevance — the backbone of search before neural models.
🔍
How did Google rank pages
before deep learning?
For decades, search engines used a beautifully simple idea: count words, weight them by rarity, and rank by relevance. TF-IDF has powered search systems since the 1970s and ranked web pages well into the 2010s — and it still runs inside Elasticsearch, Lucene, and Solr today.
↓ Scroll to understand the math behind document ranking
Step 1: Bag of Words — Counting What Matters
In a Bag of Words representation, what do 'Dog bites man' and 'Man bites dog' produce?
💡 Does BoW care about the position of words in the sentence?
Both sentences contain exactly the same words with the same frequencies: {dog: 1, bites: 1, man: 1}. Bag of Words throws away all word order — that's its fundamental limitation. Despite this, BoW works surprisingly well for document classification because topic is determined more by WHICH words appear than by their order. 'Election results polls voting' is clearly about politics regardless of word order.
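To see this concretely, here's a minimal Bag of Words sketch in Python using only the standard library. The tokenization (lowercasing and splitting on whitespace) is a simplifying assumption; real pipelines also strip punctuation and may remove stopwords.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Simplistic tokenization: lowercase, then split on whitespace.
    return Counter(text.lower().split())

print(bag_of_words("Dog bites man"))   # Counter({'dog': 1, 'bites': 1, 'man': 1})
print(bag_of_words("Man bites dog"))   # Counter({'man': 1, 'bites': 1, 'dog': 1})

# The two counters compare equal: word order is completely discarded.
print(bag_of_words("Dog bites man") == bag_of_words("Man bites dog"))  # True
```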
Step 2: Term Frequency — How Important Is a Word in This Document?
Term Frequency (TF)
Term Frequency measures how often a specific word appears in a document, normalized by document length. It captures local importance — how much of THIS document is about THIS term.
TF(t, d) = count(t in d) / total_words(d)
If 'spam' appears 5 times in a 100-word email, TF = 5/100 = 0.05 — meaning 5% of the email is the word 'spam'. That's a strong signal this email is about spam!
TF('spam', email) = 5/100 = 0.05
TF alone is misleading because common words like 'the' dominate every document with high frequency, yet tell you nothing about the document's topic. This is why we need IDF to balance things out.
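A minimal sketch of the TF formula above, using the same raw-count-over-length variant (other TF weightings exist); the 100-word 'spam' email is the article's hypothetical example.

```python
def term_frequency(term: str, doc_tokens: list[str]) -> float:
    # TF(t, d) = count(t in d) / total_words(d)
    return doc_tokens.count(term) / len(doc_tokens)

# A toy 100-word email where 'spam' appears 5 times.
email = ["spam"] * 5 + ["filler"] * 95
print(term_frequency("spam", email))  # 0.05

# High TF does not mean informative: 'the' can dominate any document.
print(term_frequency("the", ["the"] * 7 + ["other"] * 93))  # 0.07
```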
Step 3: Inverse Document Frequency — Rewarding Rarity
Inverse Document Frequency (IDF)
IDF rewards rarity: words that appear in fewer documents get higher weight. It's a mathematical stopword detector — common words are automatically suppressed without needing a manual list.
IDF(t) = log(N / df(t))
'the' appears in every document, so IDF('the') = log(10000/10000) = log(1) = 0. A word in every document carries zero distinguishing information — it's completely worthless for ranking.
IDF('the') = log(10000 / 10000) = 0
'cancer' appears in only 50 out of 10,000 documents, giving it IDF ≈ 5.3. Rare words are informative because they help distinguish one document from another.
IDF('cancer') = log(10000 / 50) ≈ 5.3
'xylophagous' appears in just 2 documents, yielding IDF ≈ 8.5. Extremely rare words get extremely high IDF weight — they're practically unique identifiers for the documents that contain them.
IDF('xylophagous') = log(10000 / 2) ≈ 8.5
In a corpus of 1,000,000 web pages, a word appears in 999,999 of them. What is its IDF score, approximately?
💡 If a word appears everywhere, how much does it help you find a SPECIFIC document?
IDF = log(1,000,000 / 999,999) = log(1.000001) ≈ 0.000001. Nearly zero! A word that appears in almost every document can't help distinguish one document from another. This is the genius of IDF: it automatically discovers that 'the', 'is', 'and' are useless for search, without anyone having to manually create a stopword list. IDF is a mathematical stopword detector.
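The numbers above use the natural logarithm; here is a small sketch that reproduces them (the 10,000-document and 1,000,000-page corpora are the hypothetical ones from the examples).

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    # IDF(t) = log(N / df(t)), natural log
    return math.log(n_docs / doc_freq)

print(idf(10_000, 10_000))      # 0.0       -> 'the': appears everywhere, worthless for ranking
print(idf(10_000, 50))          # ~5.30     -> 'cancer': rare, informative
print(idf(10_000, 2))           # ~8.52     -> 'xylophagous': almost a unique identifier
print(idf(1_000_000, 999_999))  # ~0.000001 -> near-ubiquitous word, near-zero weight
```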
The Magic: TF × IDF
TF-IDF: the complete formula
TF-IDF multiplies local importance (how frequent in THIS document) by global rarity (how rare across ALL documents). The product captures what makes a word uniquely relevant to a specific document.
TF-IDF(t, d) = TF(t, d) × IDF(t)
A word that's frequent in THIS document and rare overall is extremely relevant — this is the sweet spot. Example: 'mitochondria' appearing 8 times in a biology paper.
A word that's frequent everywhere ('the', 'is', 'and') contributes nothing to ranking, no matter how often it appears in a single document. IDF crushes it to zero.
A rare word that appears even once provides moderate signal — it's informative precisely because it's unusual in the corpus.
A common word appearing rarely is doubly useless — low TF and low IDF combine into a near-zero contribution to the document's relevance score.
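Putting the two together: a sketch that scores every word of one document against a tiny made-up corpus. It reproduces the two extreme cases above: the rare-and-frequent words win, the ubiquitous words are crushed to zero.

```python
import math

def tf_idf(term: str, doc_tokens: list[str], corpus: list[list[str]]) -> float:
    # TF-IDF(t, d) = TF(t, d) × IDF(t)
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    "the mitochondria is the powerhouse of the cell".split(),
    "the election is the talk of the town".split(),
    "the recipe is the pride of the chef".split(),
]
doc = corpus[0]
for word in sorted(set(doc)):
    print(f"{word:14s} {tf_idf(word, doc, corpus):.3f}")
# 'mitochondria', 'powerhouse', and 'cell' carry all the weight (~0.137 each);
# 'the', 'is', and 'of' appear in every document and score exactly 0.0.
```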
A document contains the word 'algorithm' 10 times. The TF component is high. But the document scores LOW with TF-IDF. What's the most likely explanation?
💡 IDF depends on how many documents contain the word. In a CS corpus, how common is 'algorithm'?
If 'algorithm' appears in 95% of documents in a computer science corpus, IDF('algorithm') = log(N / 0.95N) = log(1.053) ≈ 0.05. Even using the raw count as the TF component (TF = 10), the score is 10 × 0.05 = 0.5 — very low. This makes sense: in a CS corpus, 'algorithm' is as common as 'the' and can't distinguish one paper from another. The same word would have HIGH IDF in a cooking recipe corpus, where it's very rare. IDF is always relative to your corpus.
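The same word, two corpora: a quick sketch of just how corpus-dependent IDF is (the document frequencies below are invented for illustration).

```python
import math

# Hypothetical document frequencies for the word 'algorithm'.
cs_corpus     = {"n_docs": 10_000, "df": 9_500}  # 95% of CS papers mention it
recipe_corpus = {"n_docs": 10_000, "df": 3}      # almost no recipes do

for name, c in [("CS corpus", cs_corpus), ("recipe corpus", recipe_corpus)]:
    print(name, round(math.log(c["n_docs"] / c["df"]), 2))
# CS corpus 0.05      -> 'algorithm' is nearly a stopword here
# recipe corpus 8.11  -> the same word is highly distinctive here
```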
What TF-IDF Can’t Do
A user searches for 'affordable running shoes'. A document about 'budget jogging sneakers' is highly relevant but scores 0 in TF-IDF. Why?
💡 Does TF-IDF know that synonyms exist?
TF-IDF treats each word as an independent dimension with no notion of meaning. 'Running' and 'jogging' are as unrelated as 'running' and 'quantum' — just different strings. This is called the 'vocabulary mismatch problem' and it's the #1 reason search engines adopted semantic search (word embeddings, vector databases). TF-IDF can only find documents with the exact words you typed.
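A sketch of the mismatch in action: cosine similarity between bag-of-words vectors is zero whenever no exact term is shared, no matter how close the meanings are (the query and documents below are made up for illustration).

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = Counter("affordable running shoes".split())

print(cosine(query, Counter("budget jogging sneakers".split())))  # 0.0   -> synonyms are invisible
print(cosine(query, Counter("running shoes on sale".split())))    # ~0.58 -> exact word overlap is all that counts
```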
🎓 What You Now Know
✓ Bag of Words turns documents into vectors — by counting word occurrences. Simple, interpretable, but ignores word order and meaning.
✓ TF measures local importance — how frequent a word is in a specific document.
✓ IDF measures global rarity — words appearing everywhere get zero weight. IDF is a mathematical stopword detector.
✓ TF-IDF = TF × IDF — ranks documents by how much they match a query, considering both frequency and rarity.
✓ BM25 improves on TF-IDF — with term-frequency saturation and document-length normalization (see the sketch after this list). Still the default in Elasticsearch and Solr.
✓ The vocabulary mismatch problem — TF-IDF can’t match synonyms. This drove the invention of word embeddings and semantic search.
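For the curious, here is a minimal sketch of the classic BM25 scoring formula with its usual tuning parameters k1 and b. This is the textbook version, not Lucene's exact production implementation, which differs in details such as its IDF smoothing.

```python
import math

def bm25(term: str, doc_tokens: list[str], corpus: list[list[str]], k1: float = 1.2, b: float = 0.75) -> float:
    n = len(corpus)
    df = sum(1 for d in corpus if term in d)
    idf = math.log((n - df + 0.5) / (df + 0.5) + 1)           # smoothed IDF, always non-negative
    tf = doc_tokens.count(term)
    avgdl = sum(len(d) for d in corpus) / n
    length_norm = k1 * (1 - b + b * len(doc_tokens) / avgdl)  # longer-than-average docs get penalized
    return idf * tf * (k1 + 1) / (tf + length_norm)           # TF saturates as it grows

corpus = [
    "cheap spam offer".split(),
    "meeting notes for tuesday".split(),
    "spam spam spam spam spam spam spam spam spam filter".split(),
]
print(round(bm25("spam", corpus[0], corpus), 2))  # ~0.58 with one occurrence
print(round(bm25("spam", corpus[2], corpus), 2))  # ~0.85 with nine occurrences: not 9x higher
```

Notice how the ninth occurrence of 'spam' adds far less than the first; that's the saturation the bullet above refers to, and the length term keeps long documents from winning by sheer size.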
TF-IDF is one of the most elegant ideas in CS: a simple multiplication that captures both local and global importance. It powered search for 40 years and still runs behind the scenes today. 🔍
↗ Keep Learning
Text Preprocessing — Turning Messy Words into Clean Features
A scroll-driven visual deep dive into text preprocessing. Learn tokenization, stemming, lemmatization, stopword removal, and normalization — the essential first step of every NLP pipeline.
Word Embeddings — When Words Learned to Be Vectors
A scroll-driven visual deep dive into word embeddings. Learn how Word2Vec, GloVe, and FastText turn words into dense vectors where meaning becomes geometry — and why 'king - man + woman = queen' actually works.
Information Retrieval — How Search Engines Find Your Needle in a Billion Haystacks
A scroll-driven visual deep dive into information retrieval. From inverted indices to BM25 to learning-to-rank — learn how Google, Bing, and enterprise search find the most relevant documents in milliseconds.
Naive Bayes — Why 'Stupid' Assumptions Work Brilliantly
A scroll-driven visual deep dive into Naive Bayes. Learn Bayes' theorem, why the 'naive' independence assumption is wrong but works anyway, and why it dominates spam filtering.