Information Retrieval — How Search Engines Find Your Needle in a Billion Haystacks
A scroll-driven visual deep dive into information retrieval. From inverted indices to BM25 to learning-to-rank — learn how Google, Bing, and enterprise search find the most relevant documents in milliseconds.
🔍
Google searches
100 billion web pages
in 0.3 seconds.
That’s reading through 100 petabytes of text faster than you can blink. No ML model can score every document for every query. The secret? Don’t score everything — use an index to narrow billions to thousands, then rank only those.
↓ Scroll to learn the architecture behind every search engine
The Inverted Index: The Data Structure That Enables Search
Google's index is estimated at ~100 petabytes. If you had to scan every document for every query, at 10 GB/s SSD read speed, how long would one query take?
💡 100 PB = 100,000,000 GB. At 10 GB/s...
100 PB = 100,000,000 GB. At 10 GB/s, that's 10,000,000 seconds = ~116 days for a SINGLE query. Even with 10,000 machines in parallel, it's still ~17 minutes — far too slow for Google's 0.3-second target. This is why the inverted index is the single most important data structure in search: it turns an O(N) scan into an O(1) lookup + small intersection, making sub-second search over 100B docs possible.
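A toy version of that index in plain Python (three hand-written documents, purely for illustration): the build step maps each term to the set of documents containing it, and answering a query becomes a couple of dictionary lookups plus a set intersection instead of a full scan.

```python
from collections import defaultdict

# Three hand-written documents, purely for illustration.
docs = {
    "D1": "machine learning algorithms for text classification",
    "D2": "gardening tips for spring",
    "D7": "a survey of machine learning algorithms",
}

# Build: term -> set of document IDs containing it (the "posting list").
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Query: a few dictionary lookups plus a set intersection, no document is scanned.
def search(query):
    postings = [index[t] for t in query.lower().split() if t in index]
    return set.intersection(*postings) if postings else set()

print(search("machine learning algorithms"))   # {'D1', 'D7'}
```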
From Matching to Ranking: TF-IDF Scores
Scoring documents with TF-IDF
Query: 'machine learning algorithms'
Step 1: Find matching docs via the inverted index
Step 2: Intersect posting lists → candidate set = {D1, D7, ...}
Step 3: Score each candidate with TF-IDF: Score(Q,D) = Σ_{t ∈ Q} TF(t,D) × IDF(t)
Step 4: Sort by score, return the top 10
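To make the four steps concrete, here is a minimal sketch in plain Python; the toy corpus and the hard-coded candidate set are illustrative assumptions, not part of any real engine.

```python
import math
from collections import Counter

# Same toy corpus as the index example above.
docs = {
    "D1": "machine learning algorithms for text classification",
    "D2": "gardening tips for spring",
    "D7": "a survey of classic machine learning and machine learning algorithms",
}
N = len(docs)

# Document frequency: how many documents contain each term at least once.
df = Counter(t for text in docs.values() for t in set(text.lower().split()))

def tf_idf_score(query, doc_id):
    """Score(Q,D) = sum over query terms t of TF(t,D) * IDF(t)."""
    tf = Counter(docs[doc_id].lower().split())
    score = 0.0
    for t in query.lower().split():
        if df[t]:
            idf = math.log(N / df[t])   # rare terms contribute more
            score += tf[t] * idf        # terms frequent in the doc contribute more
    return score

candidates = ["D1", "D7"]               # from the inverted-index intersection (Step 2)
query = "machine learning algorithms"
ranked = sorted(candidates, key=lambda d: tf_idf_score(query, d), reverse=True)
print(ranked)                           # D7 first: it repeats the query terms
```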
BM25: The Gold Standard of Lexical Retrieval
The BM25 ranking formula
BM25 sums a weighted score for each query term found in the document. Each term's contribution combines its rarity (IDF), its frequency with saturation, and a document length correction.
BM25(Q,D) = Σ_{t ∈ Q} IDF(t) · (TF(t,D) · (k₁ + 1)) / (TF(t,D) + k₁ · (1 - b + b · |D|/avg_dl))
k₁ = 1.2 (default): Controls how much credit repeated terms get. Higher k₁ rewards repetition more; k₁ = 0 treats terms as binary (present or not).
b = 0.75 (default): Compensates for document length, since longer documents naturally contain more terms. b = 1 fully normalizes for length; b = 0 ignores it entirely.
|D| / avg_dl: Measures each document's length against the corpus average. Long documents are penalized so they aren't unfairly boosted just for containing more words.
When TF → ∞, score → (k₁ + 1): No matter how many times a term appears, the TF component converges to (k₁ + 1). This ceiling prevents keyword stuffing from gaming the rankings.
A spammer creates a page that repeats 'best pizza near me' 1000 times. Under raw TF-IDF, this page would rank very high. Under BM25 with k₁ = 1.2:
💡 What happens to TF(t,D)/(TF(t,D) + k₁·...) when TF(t,D) → 1000?
With k₁=1.2, the BM25 TF component converges to k₁+1 = 2.2. After about 5-10 occurrences, adding more barely increases the score. A page with 'pizza' appearing 5 times vs 1000 times would score almost identically under BM25. BUT the length normalization (b parameter) would also penalize the spammy page: repeating the phrase 1000 times makes |D| >> avg_dl, reducing the score further. BM25's design elegantly handles keyword stuffing without explicit rules.
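A quick numerical sketch of that saturation; the document lengths and the IDF value below are made-up assumptions, chosen only to show the shape of the curve.

```python
def bm25_term_score(tf, doc_len, avg_dl, idf=2.0, k1=1.2, b=0.75):
    """Contribution of one query term to BM25(Q, D)."""
    length_norm = 1 - b + b * (doc_len / avg_dl)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

avg_dl = 500    # assumed corpus-average document length

# Saturation: 5 vs 1000 occurrences in a 500-word page score almost the same.
# Length penalty: the same 5 occurrences in a 4000-word page score far lower.
for tf, doc_len in [(1, 500), (5, 500), (1000, 500), (5, 4000)]:
    print(f"tf={tf:4d}, |D|={doc_len:4d} -> {bm25_term_score(tf, doc_len, avg_dl):.2f}")
```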
Learning to Rank: Beyond Static Formulas
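BM25 is one formula with two knobs; modern engines instead combine hundreds of signals per query-document pair (BM25, PageRank, click-through rate, freshness, embedding similarity) and let a learned model, typically a gradient-boosted tree ensemble such as LambdaMART, decide how to weigh them. Below is a minimal pointwise sketch with scikit-learn; every feature value and relevance label is fabricated for illustration, and production systems train listwise objectives like LambdaMART rather than a plain regressor.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# One row per (query, document) pair; columns are ranking features:
# [bm25, pagerank, click_through_rate, doc_age_days, embedding_cosine]
X_train = np.array([
    [12.3, 0.80, 0.31,   2, 0.72],
    [ 9.1, 0.20, 0.05, 400, 0.41],
    [15.0, 0.50, 0.22,  30, 0.65],
    [ 4.2, 0.90, 0.12,  10, 0.35],
])
y_train = np.array([3, 0, 2, 1])   # graded relevance labels (3 = perfect ... 0 = bad)

ranker = GradientBoostingRegressor(n_estimators=100).fit(X_train, y_train)

# At query time: featurize the candidates from the retrieval stage, score, sort.
candidates = np.array([
    [11.0, 0.60, 0.18,   5, 0.70],
    [ 6.0, 0.10, 0.02, 900, 0.30],
])
print(ranker.predict(candidates))   # higher predicted score = rank higher
```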
Dense Retrieval: Beyond Keywords
A search engine uses only dense retrieval (no BM25). A user searches for 'RFC 7231 section 6.5.1'. Dense retrieval will probably:
💡 Think about what happens to 'RFC 7231' in an embedding space. Is the exact number preserved?
Dense retrieval maps queries to a continuous semantic space. 'RFC 7231 section 6.5.1' (a subsection of the HTTP/1.1 semantics spec) would likely land near other HTTP/REST API documents, but the exact RFC number and section are just tokens with no special meaning to the encoder. BM25 would trivially find documents containing the exact string 'RFC 7231 section 6.5.1'. This is why hybrid search (BM25 + dense) is the gold standard: BM25 handles exact matching, dense handles semantic matching.
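For contrast with BM25's exact matching, here is the dense side in miniature: a bi-encoder embeds query and documents independently, and retrieval is nearest-neighbor search over those vectors. The sketch assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; both are illustrative choices, not something the architecture requires.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed bi-encoder choice

docs = [
    "Client error status codes such as 400 Bad Request in the HTTP/1.1 spec",
    "How to design a RESTful API",
    "Spring gardening tips",
]
query = "RFC 7231 section 6.5.1"

doc_vecs = model.encode(docs)        # shape: (num_docs, dim)
q_vec = model.encode(query)          # shape: (dim,)

# Cosine similarity = dot product of L2-normalized vectors.
doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
q_vec = q_vec / np.linalg.norm(q_vec)

for doc, score in sorted(zip(docs, doc_vecs @ q_vec), key=lambda p: -p[1]):
    print(f"{score:.3f}  {doc}")
```

Notice that 'RFC 7231' gets no special treatment here: it is just more tokens flowing into the encoder, which is exactly why dense-only retrieval struggles with exact identifiers.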
The Modern Search Stack: Hybrid Retrieval
A user searches 'How to fix a flat tire'. The top BM25 result is 'Flat Rate Tire Service Pricing' (high word overlap with 'flat' and 'tire'). How does the cross-encoder reranker fix this?
💡 BM25 counts word matches. What can the cross-encoder understand that BM25 can't?
This is a failure mode of pure lexical matching: 'flat' and 'tire' appear frequently in the pricing page, giving it a high BM25 score, but the intent is completely different. The cross-encoder processes '[CLS] query [SEP] document' as a single input, computing cross-attention between all token pairs. It learns that 'how to fix' signals repair/tutorial intent that doesn't match pricing content. This is why retrieve-then-rerank works: BM25 efficiently finds lexical candidates, then the cross-encoder applies expensive but accurate semantic understanding to reorder them.
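A minimal retrieve-then-rerank sketch along those lines, assuming sentence-transformers and its ms-marco-MiniLM-L-6-v2 cross-encoder (illustrative choices; any first-stage retriever plus any trained cross-encoder fits the same pattern):

```python
from sentence_transformers import CrossEncoder

query = "How to fix a flat tire"

# Stage 1: cheap retrieval (BM25 and/or dense) has already narrowed billions
# of documents to a small candidate list; hard-coded here for illustration.
candidates = [
    "Flat Rate Tire Service Pricing",
    "Step-by-step guide: repairing a flat bicycle tire at home",
    "Tire store opening hours and locations",
]

# Stage 2: the cross-encoder reads each (query, document) pair as one sequence
# and scores how well the document actually answers the query.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # assumed model
scores = reranker.predict([(query, doc) for doc in candidates])

for doc, score in sorted(zip(candidates, scores), key=lambda p: -p[1]):
    print(f"{score:+.2f}  {doc}")
```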
🎓 What You Now Know
✓ Inverted indices make search possible — converting O(N) scans to O(1) lookups. Without them, a single Google query would take 116 days.
✓ BM25 is TF-IDF done right — with term frequency saturation (k₁) and length normalization (b). Still competitive after 30 years.
✓ Learning to Rank combines hundreds of signals — BM25, PageRank, CTR, embeddings, freshness all feed into a gradient boosted model (LambdaMART).
✓ Dense retrieval finds meaning, not just keywords — but it can’t handle exact matches. That’s why hybrid (BM25 + dense) is the modern standard.
✓ Two-stage retrieve-then-rerank — cheap retrieval narrows billions to thousands, then expensive reranking scores those precisely. This is how every modern search engine works.
Information retrieval is the backbone of search engines, email search, enterprise knowledge bases, and RAG systems. Every time you hit “search,” this architecture is silently working to find your answer. 🔍
↗ Keep Learning
Bag of Words & TF-IDF — How Search Engines Ranked Before AI
A scroll-driven visual deep dive into Bag of Words and TF-IDF. Learn how documents become vectors, why term frequency alone fails, and how IDF rescues relevance — the backbone of search before neural models.
Query Understanding — What Did the User Actually Mean?
A scroll-driven visual deep dive into query understanding. From spell correction to query expansion to intent classification — learn how search engines interpret ambiguous, misspelled, and complex queries.
Text Similarity — From Jaccard to Neural Matching
A scroll-driven visual deep dive into text similarity. Learn how search engines detect duplicates, match queries to documents, and measure how 'close' two texts really are — from set overlap to cosine similarity to learned embeddings.
Word Embeddings — When Words Learned to Be Vectors
A scroll-driven visual deep dive into word embeddings. Learn how Word2Vec, GloVe, and FastText turn words into dense vectors where meaning becomes geometry — and why 'king - man + woman = queen' actually works.