
Information Retrieval — How Search Engines Find Your Needle in a Billion Haystacks

A scroll-driven visual deep dive into information retrieval. From inverted indices to BM25 to learning-to-rank — learn how Google, Bing, and enterprise search find the most relevant documents in milliseconds.

Introduction


Google searches
100 billion web pages
in 0.3 seconds.

That’s reading through 100 petabytes of text faster than you can blink. No ML model can score every document for every query. The secret? Don’t score everything — use an index to narrow billions to thousands, then rank only those.


Inverted Index
Documents (forward index):
Doc 1: "the cat sat on the mat"
Doc 2: "the dog sat on the log"
Doc 3: "the cat and the dog"

Inverted index:
"cat" → [Doc1:pos2, Doc3:pos2]
"sat" → [Doc1:pos3, Doc2:pos3]
"dog" → [Doc2:pos2, Doc3:pos5]
"mat" → [Doc1:pos6]
"the" → [Doc1:1,5, Doc2:1,5, Doc3:1,4]

Query: "cat sat" → look up "cat" ∩ "sat" = {Doc1} → done in O(1) lookups!
Without the index: scan all 3 docs. With the index: 2 dictionary lookups + 1 intersection.
At Google's scale: 100 billion docs → a handful of O(1) lookups instead of an O(100B) scan.
An inverted index maps each word to the list of documents (and positions) containing it
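A minimal inverted index fits in a few lines of Python. This is a toy sketch (no tokenization beyond whitespace, no stemming or stopwords) that builds the postings shown above and answers a conjunctive query with set intersection:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} (1-based positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    """AND query: intersect the posting lists of every query term."""
    postings = [set(index.get(t, {})) for t in query.split()]
    return set.intersection(*postings) if postings else set()

docs = {
    "Doc1": "the cat sat on the mat",
    "Doc2": "the dog sat on the log",
    "Doc3": "the cat and the dog",
}
index = build_inverted_index(docs)
print(search(index, "cat sat"))  # {'Doc1'}
```

Each lookup is a dictionary access, so query cost depends on the number of query terms and posting-list sizes, not on the total number of documents.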
🟢 Quick Check

Google's index is estimated at ~100 petabytes. If you had to scan every document for every query, at 10 GB/s SSD read speed, how long would one query take?

TF-IDF Ranking

From Matching to Ranking: TF-IDF Scores

Scoring documents with TF-IDF

1. Query: "machine learning algorithms" — the user types a search query.
2. Find matching docs via the inverted index: "machine" → {D1,D3,D7,...}, "learning" → {D1,D2,D7,...}, "algorithms" → {D1,D5,D7,...}.
3. Intersect the posting lists → candidate set = {D1, D7, ...}. Only docs containing all (or most) query terms are considered.
4. Score each candidate using TF-IDF: Score(Q,D) = Σ_{t ∈ Q} TF(t,D) × IDF(t) — sum the TF-IDF weight of each query term in the document.
5. Sort by score and return the top 10. Documents with higher TF-IDF scores for the query terms rank higher.
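The scoring step above can be sketched directly. This is a toy illustration with three tiny documents and a smoothed IDF of log((N+1)/(df+1)); real systems use various TF and IDF variants, but the shape of the computation is the same:

```python
import math
from collections import Counter

def tf_idf_score(query, doc_tokens, corpus):
    """Score(Q,D) = sum over query terms of TF(t,D) * IDF(t).

    TF is the raw count of t in D; IDF is log((N+1)/(df+1)),
    smoothed so a term appearing in every doc scores near zero.
    """
    N = len(corpus)
    tf = Counter(doc_tokens)
    score = 0.0
    for t in query.split():
        df = sum(1 for d in corpus if t in d)   # document frequency
        score += tf[t] * math.log((N + 1) / (df + 1))
    return score

corpus = [d.split() for d in [
    "machine learning algorithms for search",
    "learning to cook pasta",
    "machine learning and ranking algorithms explained",
]]
ranked = sorted(range(len(corpus)),
                key=lambda i: tf_idf_score("machine learning algorithms",
                                           corpus[i], corpus),
                reverse=True)
```

Note that "learning", which appears in every document, contributes nothing: its IDF is log(1) = 0, which is exactly the point of the IDF weight.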
BM25

BM25: The Gold Standard of Lexical Retrieval

The BM25 ranking formula

📐 Full BM25 Formula

BM25 sums a weighted score for each query term found in the document. Each term's contribution combines its rarity (IDF), its frequency with saturation, and a document length correction.

BM25(Q,D) = Σ IDF(t) · (TF · (k₁+1)) / (TF + k₁ · (1 - b + b · |D|/avg_dl))
📈 TF Saturation (k₁)

Controls how much credit repeated terms get. Higher k₁ rewards repetition more; k₁=0 treats terms as binary (present or not). Typically set to 1.2.

k₁ = 1.2 (default)
📏 Length Normalization (b)

Compensates for document length — longer documents naturally contain more terms. b=1 fully normalizes for length; b=0 ignores it entirely. Typically 0.75.

b = 0.75 (default)
📄 Relative Doc Length

Measures each document's length against the corpus average. Long documents are penalized so they aren't unfairly boosted just for containing more words.

|D| / avg_dl
🚫 Saturation Ceiling

No matter how many times a term appears, the TF component converges to (k₁ + 1). This elegant ceiling prevents keyword stuffing from gaming the rankings.

When TF → ∞, score → (k₁ + 1)
BM25 vs raw TF: BM25 saturates — more repetitions don't help past a point
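The saturation is easy to verify numerically. Below is the contribution of a single term with the defaults k₁ = 1.2 and b = 0.75; pushing TF from 1 to 1000 barely moves the score once it nears the k₁ + 1 ceiling:

```python
def bm25_term(tf, idf, doc_len, avg_dl, k1=1.2, b=0.75):
    """One term's BM25 contribution:
    IDF * TF*(k1+1) / (TF + k1*(1 - b + b*|D|/avg_dl))."""
    norm = k1 * (1 - b + b * doc_len / avg_dl)
    return idf * (tf * (k1 + 1)) / (tf + norm)

# With doc_len == avg_dl and idf = 1.0, the ceiling here is k1 + 1 = 2.2.
for tf in (1, 5, 100, 1000):
    print(tf, round(bm25_term(tf, idf=1.0, doc_len=100, avg_dl=100), 3))
```

Going from 1 to 5 occurrences moves the score noticeably; going from 100 to 1000 moves it by hundredths. That asymptote is what defeats keyword stuffing.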
🟡 Checkpoint

A spammer creates a page that repeats 'best pizza near me' 1000 times. Under raw TF-IDF, this page would rank very high. Under BM25 with k₁=1.2:

Learning to Rank

Learning to Rank: Beyond Static Formulas

🔍 Query + docs → 📊 Features (BM25, PageRank, ...) → 🧠 LTR model (LambdaMART) → 🔢 Score per doc → 📋 Ranked top 10
Learning to Rank: train an ML model to combine hundreds of features into a single relevance score
Text features: BM25 score, TF-IDF cosine similarity, query–title word overlap.
Authority features: PageRank, domain authority, inbound link count.
Engagement features: click-through rate, dwell time (long clicks), bounce rate.
LambdaMART (gradient-boosted trees) combines all features into a single relevance score, trained on human relevance judgments and optimizing NDCG directly. Used by Bing, Yahoo, and most commercial search engines.
Hundreds of features are combined — BM25 is just one of many inputs
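Training LambdaMART is beyond a short snippet, but the feature-combination idea can be shown with a deliberately simplified pointwise stand-in. The feature values and weights below are hypothetical; in a real system the combining function is a gradient-boosted tree ensemble learned from relevance judgments, not hand-set linear weights:

```python
# Hypothetical per-document features and weights, purely to illustrate
# the "features in, one relevance score out, sort" flow of LTR.
FEATURES = ("bm25", "pagerank", "ctr")
WEIGHTS = {"bm25": 0.6, "pagerank": 0.3, "ctr": 0.1}  # stand-in for a trained model

def ltr_score(doc_features):
    """Collapse many relevance signals into a single scalar score."""
    return sum(WEIGHTS[f] * doc_features[f] for f in FEATURES)

candidates = {
    "D1": {"bm25": 12.0, "pagerank": 0.2, "ctr": 0.05},
    "D7": {"bm25": 9.5, "pagerank": 0.9, "ctr": 0.30},
}
ranked = sorted(candidates, key=lambda d: ltr_score(candidates[d]), reverse=True)
```

Swapping the linear combiner for a learned model changes nothing downstream: retrieval still produces candidates, the model still maps feature vectors to scores, and sorting still produces the final ranking.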
Dense Retrieval

Dense Retrieval: Beyond Keywords

Online: 🔍 query (natural language) → 🧠 BERT-based query encoder → 📐 query vector [0.2, -0.8, ...] → ANN search (HNSW/FAISS). Offline: 📄 billions of documents → 🧠 BERT-based doc encoder → 💾 pre-computed doc vectors. Result: 🏆 top-K documents by vector similarity.
Dense retrieval encodes both query and documents into embedding vectors, then finds nearest neighbors
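Stripped of the neural encoders and ANN structures, the core retrieval step is just nearest-neighbor search over vectors. The 3-d vectors below are hand-picked stand-ins for BERT embeddings, and the brute-force scan stands in for an HNSW/FAISS index:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy "embeddings": in production these come from a document encoder,
# are pre-computed offline, and are searched with an ANN index, not a scan.
doc_vectors = {
    "D1": [0.9, 0.1, 0.0],
    "D2": [0.1, 0.8, 0.3],
    "D3": [0.7, 0.2, 0.1],
}
query_vec = [0.8, 0.2, 0.1]  # output of the query encoder
top_k = sorted(doc_vectors,
               key=lambda d: cosine(query_vec, doc_vectors[d]),
               reverse=True)[:2]
```

Because similarity is computed in embedding space, a query and a document can match with zero words in common — which is both dense retrieval's strength and, for exact identifiers, its weakness.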
🔴 Challenge

A search engine uses only dense retrieval (no BM25). A user searches for 'RFC 7231 section 6.5.1'. Dense retrieval will probably:

Hybrid Systems

The Modern Search Stack: Hybrid Retrieval
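One widely used way to fuse a BM25 ranking with a dense ranking is Reciprocal Rank Fusion (RRF), which needs only each system's ranked list, not its raw (incomparable) scores. The document IDs below are illustrative:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(k + rank(d)).

    k=60 is the commonly cited default; it damps the advantage of
    being rank 1 in a single list versus rank 2-3 in several lists.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["D5", "D1", "D9"]    # lexical list: exact keyword matches
dense_top = ["D1", "D3", "D5"]   # semantic list: embedding neighbors
fused = rrf([bm25_top, dense_top])
```

D1, which appears high in both lists, wins over D5, which tops only the lexical list. That is the hybrid intuition: agreement between lexical and semantic evidence is the strongest relevance signal.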

🔴 Challenge

A user searches 'How to fix a flat tire'. The top BM25 result is 'Flat Rate Tire Service Pricing' (high word overlap with 'flat' and 'tire'). How does the cross-encoder reranker fix this?

🎓 What You Now Know

Inverted indices make search possible — converting O(N) scans to O(1) lookups. Without them, a single Google query would take 116 days.

BM25 is TF-IDF done right — with term frequency saturation (k₁) and length normalization (b). Still competitive after 30 years.

Learning to Rank combines hundreds of signals — BM25, PageRank, CTR, embeddings, freshness all feed into a gradient boosted model (LambdaMART).

Dense retrieval finds meaning, not just keywords — but it struggles with exact matches like identifiers, codes, and rare strings. That’s why hybrid (BM25 + dense) is the modern standard.

Two-stage retrieve-then-rerank — cheap retrieval narrows billions to thousands, then expensive reranking scores those precisely. This is how every modern search engine works.

Information retrieval is the backbone of search engines, email search, enterprise knowledge bases, and RAG systems. Every time you hit “search,” this architecture is silently working to find your answer. 🔍
