15 min deep-dive · NLP · information-retrieval

Bag of Words & TF-IDF — How Search Engines Ranked Before AI

A scroll-driven visual deep dive into Bag of Words and TF-IDF. Learn how documents become vectors, why term frequency alone fails, and how IDF rescues relevance — the backbone of search before neural models.

Introduction

🔍

How did Google rank pages before deep learning?

For decades, search engines used a beautifully simple idea: count words, weight them by rarity, and rank by relevance. TF-IDF powered text search from the 1970s into the 2010s, and it still runs inside Elasticsearch, Lucene, and Solr today in its refined BM25 form.



Step 1: Bag of Words — Counting What Matters

Doc 1: "the cat sat on the mat"
Doc 2: "the dog sat on the log"
Vocabulary: [the, cat, sat, on, mat, dog, log]
Doc 1 vector: [2, 1, 1, 1, 1, 0, 0]
Doc 2 vector: [2, 0, 1, 1, 0, 1, 1]
Now they're just numbers! We can compute distances, train classifiers, rank documents…
Two documents become vectors by counting word occurrences
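If you want to see the counting step in code, here is a minimal Python sketch of the figure above (plain Python, no libraries; the bag_of_words helper name is just illustrative):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Vocabulary = every distinct word, in order of first appearance.
vocab = list(dict.fromkeys(word for doc in docs for word in doc.split()))
# ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'log']

def bag_of_words(doc, vocab):
    """Turn a document into a vector of raw word counts."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

print(bag_of_words(docs[0], vocab))  # [2, 1, 1, 1, 1, 0, 0]
print(bag_of_words(docs[1], vocab))  # [2, 0, 1, 1, 0, 1, 1]
```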
🟢 Quick Check

In a Bag of Words representation, 'Dog bites man' and 'Man bites dog' produce:


Step 2: Term Frequency — How Important Is a Word in This Document?

Term Frequency (TF)

📊 TF Formula

Term Frequency measures how often a specific word appears in a document, normalized by document length. It captures local importance — how much of THIS document is about THIS term.

TF(t, d) = count(t in d) / total_words(d)
📧 Worked Example

If 'spam' appears 5 times in a 100-word email, TF = 5/100 = 0.05 — meaning 5% of the email is the word 'spam'. That's a strong signal this email is about spam!

TF('spam', email) = 5/100 = 0.05
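The formula is a one-liner in code. A minimal sketch, using an invented 100-word toy email rather than real data:

```python
def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the document's length."""
    return doc_tokens.count(term) / len(doc_tokens)

# A toy 100-word email in which 'spam' appears 5 times.
email = ["spam"] * 5 + ["filler"] * 95
print(tf("spam", email))  # 0.05
```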
⚠️ The Limitation

TF alone is misleading because common words like 'the' dominate every document with high frequency, yet tell you nothing about the document's topic. This is why we need IDF to balance things out.

TF scores for a medical research paper:
"the": TF = 0.07 (highest!)
"of": TF = 0.04
"and": TF = 0.03
"cancer": TF = 0.008 ← THIS is the important word!
"tumor": TF = 0.005 ← And this one!
High TF doesn't mean high relevance — 'the' dominates every document

Step 3: Inverse Document Frequency — Rewarding Rarity

Inverse Document Frequency (IDF)

📏 IDF Formula

IDF rewards rarity: words that appear in fewer documents get higher weight. It's a mathematical stopword detector — common words are automatically suppressed without needing a manual list.

IDF(t) = log(N / df(t))
🚫 Common Word → Zero

'the' appears in every document, so IDF('the') = log(10000/10000) = log(1) = 0. A word in every document carries zero distinguishing information — it's completely worthless for ranking.

IDF('the') = log(10000 / 10000) = 0
💎 Rare Word → High

'cancer' appears in only 50 out of 10,000 documents, giving it IDF ≈ 5.3. Rare words are informative because they help distinguish one document from another.

IDF('cancer') = log(10000 / 50) ≈ 5.3
🦄 Ultra-Rare → Maximum

'xylophagous' appears in just 2 documents, yielding IDF ≈ 8.5. Extremely rare words get extremely high IDF weight — they're practically unique identifiers for the documents that contain them.

IDF('xylophagous') = log(10000 / 2) ≈ 8.5
the, is, a, of → IDF ≈ 0 (useless)
computer, data → somewhere in between
xylophagous → IDF ≈ 8+ (very rare)
IDF = log(total_docs / docs_containing_word)
Words in every doc → IDF = 0 | Words in 1 doc → IDF = max
The IDF spectrum: common words get crushed, rare words get boosted
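These numbers come from the natural logarithm. A tiny sketch that reproduces them, assuming the same toy corpus of 10,000 documents and the document frequencies quoted above:

```python
import math

N = 10_000  # documents in the toy corpus used in the examples above
for term, df in [("the", 10_000), ("cancer", 50), ("xylophagous", 2)]:
    print(term, round(math.log(N / df), 1))
# the 0.0 · cancer 5.3 · xylophagous 8.5
```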
🟡 Checkpoint

In a corpus of 1,000,000 web pages, a word appears in 999,999 of them. Its IDF score is approximately:


The Magic: TF × IDF

TF-IDF: the complete formula

TF × IDF

TF-IDF multiplies local importance (how frequent in THIS document) by global rarity (how rare across ALL documents). The product captures what makes a word uniquely relevant to a specific document.

TF-IDF(t, d) = TF(t, d) × IDF(t)
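Putting the two pieces together, here is a minimal, self-contained sketch (the helper names are illustrative, and a real system would also lowercase, tokenize properly, and smooth the IDF):

```python
import math

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, corpus):
    df = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / df) if df > 0 else 0.0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
print(tf_idf("the", corpus[0], corpus))  # 0.0 -- 'the' is in every document
print(tf_idf("cat", corpus[0], corpus))  # > 0 -- frequent here, absent elsewhere
```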
🏆 High TF + High IDF

A word that's frequent in THIS document and rare overall is extremely relevant — this is the sweet spot. Example: 'mitochondria' appearing 8 times in a biology paper.

🚫 High TF + Low IDF

A word that's frequent everywhere ('the', 'is', 'and') contributes nothing to ranking, no matter how often it appears in a single document. IDF crushes it to zero.

💡 Low TF + High IDF

A rare word that appears even once provides moderate signal — it's informative precisely because it's unusual in the corpus.

📉 Low TF + Low IDF

A common word appearing rarely is doubly useless — low frequency AND low rarity means near-zero contribution to the document's relevance score.

🔍 Query: 'cancer treatment' → 🔢 Compute TF-IDF for each doc × term → 📊 Score docs: sum TF-IDF matches → 🏆 Return top K ranked results
How a search engine uses TF-IDF to rank documents for a query
🥇 Doc A: ML textbook chapter (score 8.45)
🥈 Doc B: AI research paper (score 5.21)
🥉 Doc C: General tech blog (score 1.03)
Doc A wins because "machine", "learning", and "algorithms" all have HIGH TF in Doc A AND high IDF (they don't appear in every document in the corpus).
TF-IDF in action: ranking documents for 'machine learning algorithms'
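The whole pipeline fits in a few lines. A sketch with invented documents and a naive whitespace tokenizer, so the scores won't match the 8.45 / 5.21 / 1.03 in the figure, but the ranking logic is the same:

```python
import math

def tf_idf(term, doc, corpus):
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return (doc.count(term) / len(doc)) * math.log(len(corpus) / df)

docs = {
    "Doc A (ML textbook)": "machine learning algorithms machine learning models".split(),
    "Doc B (AI paper)":    "research on learning systems and search algorithms".split(),
    "Doc C (tech blog)":   "a general blog about gadgets phones and apps".split(),
}
corpus = list(docs.values())
query = "machine learning algorithms".split()

# Score each document by summing the TF-IDF of every query term, then rank.
scores = {name: sum(tf_idf(t, d, corpus) for t in query) for name, d in docs.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```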
🟡 Checkpoint

A document contains the word 'algorithm' 10 times. The TF component is high. But the document scores LOW with TF-IDF. What's the most likely explanation?


What TF-IDF Can’t Do

No Semantics: "car" and "automobile" are UNRELATED in TF-IDF, while "bank" (river) and "bank" (money) are IDENTICAL. A search for "cheap flights" won't find "affordable airfare" because the words differ.
No Word Order: "dog bites man" and "man bites dog" are IDENTICAL vectors, and "not good" becomes {"not": 1, "good": 1}, which looks positive! Fix: n-grams capture some word order.
Sparse & Huge: 50K+ dimensions per document, 99.9% of values are zero, and distance metrics break down in high dimensions. → This led to word embeddings (next!)
Three fundamental limitations that led to neural approaches
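Two of these failure modes are easy to demonstrate with a tiny sketch (the sentences are made up):

```python
from collections import Counter

# No word order: reordering the words leaves the count vector unchanged.
print(Counter("dog bites man".split()) == Counter("man bites dog".split()))  # True

# No semantics: the query and a relevant document share zero terms,
# so every per-term TF-IDF score (and therefore the total) is zero.
query = set("cheap flights".split())
doc = set("affordable airfare deals".split())
print(query & doc)  # set() -> no overlap, TF-IDF score 0
```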
🔴 Challenge

A user searches for 'affordable running shoes'. A document about 'budget jogging sneakers' is highly relevant but scores 0 in TF-IDF. Why?

🎓 What You Now Know

Bag of Words turns documents into vectors — by counting word occurrences. Simple, interpretable, but ignores word order and meaning.

TF measures local importance — how frequent a word is in a specific document.

IDF measures global rarity — words appearing everywhere get zero weight. IDF is a mathematical stopword detector.

TF-IDF = TF × IDF — ranks documents by how much they match a query, considering both frequency and rarity.

BM25 improves on TF-IDF — with saturation and length normalization (a minimal sketch follows after this list). Still the default in Elasticsearch and Solr.

The vocabulary mismatch problem — TF-IDF can’t match synonyms. This drove the invention of word embeddings and semantic search.
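A minimal sketch of one common BM25 variant, assuming typical textbook parameter values (k1 = 1.5, b = 0.75) rather than any particular engine's defaults:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one document against a query with a common BM25 variant.

    k1 controls term-frequency saturation; b controls length normalization.
    """
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        f = doc.count(term)                             # raw term frequency in this doc
        df = sum(1 for d in corpus if term in d)         # documents containing the term
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF, never negative
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```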

TF-IDF is one of the most elegant ideas in CS: a simple multiplication that captures both local and global importance. It powered search for 40 years and still runs behind the scenes today. 🔍
