Bag of Words & TF-IDF — How Search Engines Ranked Before AI
A scroll-driven visual deep dive into Bag of Words and TF-IDF. Learn how documents become vectors, why term frequency alone fails, and how IDF rescues relevance — the backbone of search before neural models.
🔍
How did Google rank pages
before deep learning?
For decades, search engines used a beautifully simple idea: count words, weight them by rarity, and rank by relevance. TF-IDF has powered search systems since the 1970s and ranked web pages well into the 2010s — and it still runs inside Elasticsearch, Lucene, and Solr today.
↓ Scroll to understand the math behind document ranking
Step 1: Bag of Words — Counting What Matters
In a Bag of Words representation, what do 'Dog bites man' and 'Man bites dog' produce?
💡 Does BoW care about the position of words in the sentence?
Both sentences contain exactly the same words with the same frequencies: {dog: 1, bites: 1, man: 1}. Bag of Words throws away all word order — that's its fundamental limitation. Despite this, BoW works surprisingly well for document classification because topic is determined more by WHICH words appear than by their order. 'Election results polls voting' is clearly about politics regardless of word order.
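To see this concretely, here's a minimal Bag of Words sketch in Python using only the standard library. The tokenization (lowercasing and splitting on whitespace) is a simplifying assumption; real pipelines also strip punctuation and may remove stopwords.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Simplistic tokenization: lowercase, then split on whitespace.
    return Counter(text.lower().split())

print(bag_of_words("Dog bites man"))   # Counter({'dog': 1, 'bites': 1, 'man': 1})
print(bag_of_words("Man bites dog"))   # Counter({'man': 1, 'bites': 1, 'dog': 1})

# The two counters compare equal: word order is completely discarded.
print(bag_of_words("Dog bites man") == bag_of_words("Man bites dog"))  # True
```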
Step 2: Term Frequency — How Important Is a Word in This Document?
Term Frequency (TF)
Term Frequency measures how often a specific word appears in a document, normalized by document length. It captures local importance — how much of THIS document is about THIS term.
TF(t, d) = count(t in d) / total_words(d)
If 'spam' appears 5 times in a 100-word email, TF = 5/100 = 0.05 — meaning 5% of the email is the word 'spam'. That's a strong signal this email is about spam!
TF('spam', email) = 5/100 = 0.05
TF alone is misleading because common words like 'the' dominate every document with high frequency, yet tell you nothing about the document's topic. This is why we need IDF to balance things out.
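A minimal sketch of the TF formula above, using the same raw-count-over-length variant (other TF weightings exist); the 100-word 'spam' email is the article's hypothetical example.

```python
def term_frequency(term: str, doc_tokens: list[str]) -> float:
    # TF(t, d) = count(t in d) / total_words(d)
    return doc_tokens.count(term) / len(doc_tokens)

# A toy 100-word email where 'spam' appears 5 times.
email = ["spam"] * 5 + ["filler"] * 95
print(term_frequency("spam", email))  # 0.05

# High TF does not mean informative: 'the' can dominate any document.
print(term_frequency("the", ["the"] * 7 + ["other"] * 93))  # 0.07
```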
Step 3: Inverse Document Frequency — Rewarding Rarity
Inverse Document Frequency (IDF)
IDF rewards rarity: words that appear in fewer documents get higher weight. It's a mathematical stopword detector — common words are automatically suppressed without needing a manual list.
IDF(t) = log(N / df(t))
'the' appears in every document, so IDF('the') = log(10000/10000) = log(1) = 0. A word in every document carries zero distinguishing information — it's completely worthless for ranking.
IDF('the') = log(10000 / 10000) = 0
'cancer' appears in only 50 out of 10,000 documents, giving it IDF ≈ 5.3. Rare words are informative because they help distinguish one document from another.
IDF('cancer') = log(10000 / 50) ≈ 5.3
'xylophagous' appears in just 2 documents, yielding IDF ≈ 8.5. Extremely rare words get extremely high IDF weight — they're practically unique identifiers for the documents that contain them.
IDF('xylophagous') = log(10000 / 2) ≈ 8.5
In a corpus of 1,000,000 web pages, a word appears in 999,999 of them. What is its IDF score, approximately?
💡 If a word appears everywhere, how much does it help you find a SPECIFIC document?
IDF = log(1,000,000 / 999,999) = log(1.000001) ≈ 0.000001. Nearly zero! A word that appears in almost every document can't help distinguish one document from another. This is the genius of IDF: it automatically discovers that 'the', 'is', 'and' are useless for search, without anyone having to manually create a stopword list. IDF is a mathematical stopword detector.
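The numbers above use the natural logarithm; here is a small sketch that reproduces them (the 10,000-document and 1,000,000-page corpora are the hypothetical ones from the examples).

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    # IDF(t) = log(N / df(t)), natural log
    return math.log(n_docs / doc_freq)

print(idf(10_000, 10_000))      # 0.0       -> 'the': appears everywhere, worthless for ranking
print(idf(10_000, 50))          # ~5.30     -> 'cancer': rare, informative
print(idf(10_000, 2))           # ~8.52     -> 'xylophagous': almost a unique identifier
print(idf(1_000_000, 999_999))  # ~0.000001 -> near-ubiquitous word, near-zero weight
```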
The Magic: TF × IDF
TF-IDF: the complete formula
TF-IDF multiplies local importance (how frequent in THIS document) by global rarity (how rare across ALL documents). The product captures what makes a word uniquely relevant to a specific document.
TF-IDF(t, d) = TF(t, d) × IDF(t)
A word that's frequent in THIS document and rare overall is extremely relevant — this is the sweet spot. Example: 'mitochondria' appearing 8 times in a biology paper.
A word that's frequent everywhere ('the', 'is', 'and') contributes nothing to ranking, no matter how often it appears in a single document. IDF crushes it to zero.
A rare word that appears even once provides moderate signal — it's informative precisely because it's unusual in the corpus.
A common word appearing rarely is doubly useless — low TF and low IDF combine into a near-zero contribution to the document's relevance score.
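Putting the two together: a sketch that scores every word of one document against a tiny made-up corpus. It reproduces the two extreme cases above: the rare-and-frequent words win, the ubiquitous words are crushed to zero.

```python
import math

def tf_idf(term: str, doc_tokens: list[str], corpus: list[list[str]]) -> float:
    # TF-IDF(t, d) = TF(t, d) × IDF(t)
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    "the mitochondria is the powerhouse of the cell".split(),
    "the election is the talk of the town".split(),
    "the recipe is the pride of the chef".split(),
]
doc = corpus[0]
for word in sorted(set(doc)):
    print(f"{word:14s} {tf_idf(word, doc, corpus):.3f}")
# 'mitochondria', 'powerhouse', and 'cell' carry all the weight (~0.137 each);
# 'the', 'is', and 'of' appear in every document and score exactly 0.0.
```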
A document contains the word 'algorithm' 10 times. The TF component is high. But the document scores LOW with TF-IDF. What's the most likely explanation?
💡 IDF depends on how many documents contain the word. In a CS corpus, how common is 'algorithm'?
If 'algorithm' appears in 95% of documents in a computer science corpus, IDF('algorithm') = log(N / 0.95N) = log(1.053) ≈ 0.05. Even using the raw count as the TF component (TF = 10), the score is 10 × 0.05 = 0.5 — very low. This makes sense: in a CS corpus, 'algorithm' is as common as 'the' and can't distinguish one paper from another. The same word would have HIGH IDF in a cooking recipe corpus, where it's very rare. IDF is always relative to your corpus.
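The same word, two corpora: a quick sketch of just how corpus-dependent IDF is (the document frequencies below are invented for illustration).

```python
import math

# Hypothetical document frequencies for the word 'algorithm'.
cs_corpus     = {"n_docs": 10_000, "df": 9_500}  # 95% of CS papers mention it
recipe_corpus = {"n_docs": 10_000, "df": 3}      # almost no recipes do

for name, c in [("CS corpus", cs_corpus), ("recipe corpus", recipe_corpus)]:
    print(name, round(math.log(c["n_docs"] / c["df"]), 2))
# CS corpus 0.05      -> 'algorithm' is nearly a stopword here
# recipe corpus 8.11  -> the same word is highly distinctive here
```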
What TF-IDF Can’t Do
A user searches for 'affordable running shoes'. A document about 'budget jogging sneakers' is highly relevant but scores 0 in TF-IDF. Why?
💡 Does TF-IDF know that synonyms exist?
TF-IDF treats each word as an independent dimension with no notion of meaning. 'Running' and 'jogging' are as unrelated as 'running' and 'quantum' — just different strings. This is called the 'vocabulary mismatch problem' and it's the #1 reason search engines adopted semantic search (word embeddings, vector databases). TF-IDF can only find documents with the exact words you typed.
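A sketch of the mismatch in action: cosine similarity between bag-of-words vectors is zero whenever no exact term is shared, no matter how close the meanings are (the query and documents below are made up for illustration).

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = Counter("affordable running shoes".split())

print(cosine(query, Counter("budget jogging sneakers".split())))  # 0.0   -> synonyms are invisible
print(cosine(query, Counter("running shoes on sale".split())))    # ~0.58 -> exact word overlap is all that counts
```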
🎓 What You Now Know
✓ Bag of Words turns documents into vectors — by counting word occurrences. Simple, interpretable, but ignores word order and meaning.
✓ TF measures local importance — how frequent a word is in a specific document.
✓ IDF measures global rarity — words appearing everywhere get zero weight. IDF is a mathematical stopword detector.
✓ TF-IDF = TF × IDF — ranks documents by how much they match a query, considering both frequency and rarity.
✓ BM25 improves on TF-IDF — with term-frequency saturation and document-length normalization (see the sketch after this list). Still the default in Elasticsearch and Solr.
✓ The vocabulary mismatch problem — TF-IDF can’t match synonyms. This drove the invention of word embeddings and semantic search.
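For the curious, here is a minimal sketch of the classic BM25 scoring formula with its usual tuning parameters k1 and b. This is the textbook version, not Lucene's exact production implementation, which differs in details such as its IDF smoothing.

```python
import math

def bm25(term: str, doc_tokens: list[str], corpus: list[list[str]], k1: float = 1.2, b: float = 0.75) -> float:
    n = len(corpus)
    df = sum(1 for d in corpus if term in d)
    idf = math.log((n - df + 0.5) / (df + 0.5) + 1)           # smoothed IDF, always non-negative
    tf = doc_tokens.count(term)
    avgdl = sum(len(d) for d in corpus) / n
    length_norm = k1 * (1 - b + b * len(doc_tokens) / avgdl)  # longer-than-average docs get penalized
    return idf * tf * (k1 + 1) / (tf + length_norm)           # TF saturates as it grows

corpus = [
    "cheap spam offer".split(),
    "meeting notes for tuesday".split(),
    "spam spam spam spam spam spam spam spam spam filter".split(),
]
print(round(bm25("spam", corpus[0], corpus), 2))  # ~0.58 with one occurrence
print(round(bm25("spam", corpus[2], corpus), 2))  # ~0.85 with nine occurrences: not 9x higher
```

Notice how the ninth occurrence of 'spam' adds far less than the first; that's the saturation the bullet above refers to, and the length term keeps long documents from winning by sheer size.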
TF-IDF is one of the most elegant ideas in CS: a simple multiplication that captures both local and global importance. It powered search for 40 years and still runs behind the scenes today. 🔍
↗ Keep Learning
Text Preprocessing — Turning Messy Words into Clean Features
A scroll-driven visual deep dive into text preprocessing. Learn tokenization, stemming, lemmatization, stopword removal, and normalization — the essential first step of every NLP pipeline.
Word Embeddings — When Words Learned to Be Vectors
A scroll-driven visual deep dive into word embeddings. Learn how Word2Vec, GloVe, and FastText turn words into dense vectors where meaning becomes geometry — and why 'king - man + woman = queen' actually works.
Information Retrieval — How Search Engines Find Your Needle in a Billion Haystacks
A scroll-driven visual deep dive into information retrieval. From inverted indices to BM25 to learning-to-rank — learn how Google, Bing, and enterprise search find the most relevant documents in milliseconds.
Naive Bayes — Why 'Stupid' Assumptions Work Brilliantly
A scroll-driven visual deep dive into Naive Bayes. Learn Bayes' theorem, why the 'naive' independence assumption is wrong but works anyway, and why it dominates spam filtering.