Information Retrieval — How Search Engines Find Your Needle in a Billion Haystacks
A scroll-driven visual deep dive into information retrieval. From inverted indices to BM25 to learning-to-rank — learn how Google, Bing, and enterprise search find the most relevant documents in milliseconds.
🔍
Google searches
100 billion web pages
in 0.3 seconds.
That’s reading through 100 petabytes of text faster than you can blink. No ML model can score every document for every query. The secret? Don’t score everything — use an index to narrow billions to thousands, then rank only those.
↓ Scroll to learn the architecture behind every search engine
The Inverted Index: The Data Structure That Enables Search
Google's index is estimated at ~100 petabytes. If you had to scan every document for every query, at 10 GB/s SSD read speed, how long would one query take?
💡 100 PB = 100,000,000 GB. At 10 GB/s...
100 PB = 100,000,000 GB. At 10 GB/s, that's 10,000,000 seconds = ~116 days for a SINGLE query. Even with 10,000 machines in parallel, it's still ~17 minutes — far too slow for Google's 0.3-second target. This is why the inverted index is the single most important data structure in search: it turns an O(N) scan into an O(1) lookup + small intersection, making sub-second search over 100B docs possible.
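A toy version of that index in plain Python (three hand-written documents, purely for illustration): the build step maps each term to the set of documents containing it, and answering a query becomes a couple of dictionary lookups plus a set intersection instead of a full scan.

```python
from collections import defaultdict

# Three hand-written documents, purely for illustration.
docs = {
    "D1": "machine learning algorithms for text classification",
    "D2": "gardening tips for spring",
    "D7": "a survey of machine learning algorithms",
}

# Build: term -> set of document IDs containing it (the "posting list").
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Query: a few dictionary lookups plus a set intersection, no document is scanned.
def search(query):
    postings = [index[t] for t in query.lower().split() if t in index]
    return set.intersection(*postings) if postings else set()

print(search("machine learning algorithms"))   # {'D1', 'D7'}
```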
From Matching to Ranking: TF-IDF Scores
Scoring documents with TF-IDF
Query: 'machine learning algorithms'
Step 1: Find matching docs via the inverted index
Step 2: Intersect posting lists → candidate set = {D1, D7, ...}
Step 3: Score each candidate with TF-IDF: Score(Q,D) = Σ_{t ∈ Q} TF(t,D) × IDF(t)
Step 4: Sort by score, return the top 10
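To make the four steps concrete, here is a minimal sketch in plain Python; the toy corpus and the hard-coded candidate set are illustrative assumptions, not part of any real engine.

```python
import math
from collections import Counter

# Same toy corpus as the index example above.
docs = {
    "D1": "machine learning algorithms for text classification",
    "D2": "gardening tips for spring",
    "D7": "a survey of classic machine learning and machine learning algorithms",
}
N = len(docs)

# Document frequency: how many documents contain each term at least once.
df = Counter(t for text in docs.values() for t in set(text.lower().split()))

def tf_idf_score(query, doc_id):
    """Score(Q,D) = sum over query terms t of TF(t,D) * IDF(t)."""
    tf = Counter(docs[doc_id].lower().split())
    score = 0.0
    for t in query.lower().split():
        if df[t]:
            idf = math.log(N / df[t])   # rare terms contribute more
            score += tf[t] * idf        # terms frequent in the doc contribute more
    return score

candidates = ["D1", "D7"]               # from the inverted-index intersection (Step 2)
query = "machine learning algorithms"
ranked = sorted(candidates, key=lambda d: tf_idf_score(query, d), reverse=True)
print(ranked)                           # D7 first: it repeats the query terms
```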
BM25: The Gold Standard of Lexical Retrieval
The BM25 ranking formula
BM25 sums a weighted score for each query term found in the document. Each term's contribution combines its rarity (IDF), its frequency with saturation, and a document length correction.
BM25(Q,D) = Σ_{t ∈ Q} IDF(t) · (TF(t,D) · (k₁ + 1)) / (TF(t,D) + k₁ · (1 - b + b · |D|/avg_dl))
k₁ = 1.2 (default): Controls how much credit repeated terms get. Higher k₁ rewards repetition more; k₁ = 0 treats terms as binary (present or not).
b = 0.75 (default): Compensates for document length, since longer documents naturally contain more terms. b = 1 fully normalizes for length; b = 0 ignores it entirely.
|D| / avg_dl: Measures each document's length against the corpus average. Long documents are penalized so they aren't unfairly boosted just for containing more words.
When TF → ∞, score → (k₁ + 1): No matter how many times a term appears, the TF component converges to (k₁ + 1). This ceiling prevents keyword stuffing from gaming the rankings.
A spammer creates a page that repeats 'best pizza near me' 1000 times. Under raw TF-IDF, this page would rank very high. Under BM25 with k₁ = 1.2:
💡 What happens to TF(t,D)/(TF(t,D) + k₁·...) when TF(t,D) → 1000?
With k₁=1.2, the BM25 TF component converges to k₁+1 = 2.2. After about 5-10 occurrences, adding more barely increases the score. A page with 'pizza' appearing 5 times vs 1000 times would score almost identically under BM25. BUT the length normalization (b parameter) would also penalize the spammy page: repeating the phrase 1000 times makes |D| >> avg_dl, reducing the score further. BM25's design elegantly handles keyword stuffing without explicit rules.
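A quick numerical sketch of that saturation; the document lengths and the IDF value below are made-up assumptions, chosen only to show the shape of the curve.

```python
def bm25_term_score(tf, doc_len, avg_dl, idf=2.0, k1=1.2, b=0.75):
    """Contribution of one query term to BM25(Q, D)."""
    length_norm = 1 - b + b * (doc_len / avg_dl)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

avg_dl = 500    # assumed corpus-average document length

# Saturation: 5 vs 1000 occurrences in a 500-word page score almost the same.
# Length penalty: the same 5 occurrences in a 4000-word page score far lower.
for tf, doc_len in [(1, 500), (5, 500), (1000, 500), (5, 4000)]:
    print(f"tf={tf:4d}, |D|={doc_len:4d} -> {bm25_term_score(tf, doc_len, avg_dl):.2f}")
```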
Learning to Rank: Beyond Static Formulas
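BM25 is one formula with two knobs; modern engines instead combine hundreds of signals per query-document pair (BM25, PageRank, click-through rate, freshness, embedding similarity) and let a learned model, typically a gradient-boosted tree ensemble such as LambdaMART, decide how to weigh them. Below is a minimal pointwise sketch with scikit-learn; every feature value and relevance label is fabricated for illustration, and production systems train listwise objectives like LambdaMART rather than a plain regressor.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# One row per (query, document) pair; columns are ranking features:
# [bm25, pagerank, click_through_rate, doc_age_days, embedding_cosine]
X_train = np.array([
    [12.3, 0.80, 0.31,   2, 0.72],
    [ 9.1, 0.20, 0.05, 400, 0.41],
    [15.0, 0.50, 0.22,  30, 0.65],
    [ 4.2, 0.90, 0.12,  10, 0.35],
])
y_train = np.array([3, 0, 2, 1])   # graded relevance labels (3 = perfect ... 0 = bad)

ranker = GradientBoostingRegressor(n_estimators=100).fit(X_train, y_train)

# At query time: featurize the candidates from the retrieval stage, score, sort.
candidates = np.array([
    [11.0, 0.60, 0.18,   5, 0.70],
    [ 6.0, 0.10, 0.02, 900, 0.30],
])
print(ranker.predict(candidates))   # higher predicted score = rank higher
```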
Dense Retrieval: Beyond Keywords
A search engine uses only dense retrieval (no BM25). A user searches for 'RFC 7231 section 6.5.1'. Dense retrieval will probably:
💡 Think about what happens to 'RFC 7231' in an embedding space. Is the exact number preserved?
Dense retrieval maps queries to a continuous semantic space. 'RFC 7231 section 6.5.1' (a subsection of the HTTP/1.1 semantics spec) would likely land near other HTTP/REST API documents, but the exact RFC number and section are just tokens with no special meaning to the encoder. BM25 would trivially find documents containing the exact string 'RFC 7231 section 6.5.1'. This is why hybrid search (BM25 + dense) is the gold standard: BM25 handles exact matching, dense handles semantic matching.
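For contrast with BM25's exact matching, here is the dense side in miniature: a bi-encoder embeds query and documents independently, and retrieval is nearest-neighbor search over those vectors. The sketch assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; both are illustrative choices, not something the architecture requires.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed bi-encoder choice

docs = [
    "Client error status codes such as 400 Bad Request in the HTTP/1.1 spec",
    "How to design a RESTful API",
    "Spring gardening tips",
]
query = "RFC 7231 section 6.5.1"

doc_vecs = model.encode(docs)        # shape: (num_docs, dim)
q_vec = model.encode(query)          # shape: (dim,)

# Cosine similarity = dot product of L2-normalized vectors.
doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
q_vec = q_vec / np.linalg.norm(q_vec)

for doc, score in sorted(zip(docs, doc_vecs @ q_vec), key=lambda p: -p[1]):
    print(f"{score:.3f}  {doc}")
```

Notice that 'RFC 7231' gets no special treatment here: it is just more tokens flowing into the encoder, which is exactly why dense-only retrieval struggles with exact identifiers.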
The Modern Search Stack: Hybrid Retrieval
A user searches 'How to fix a flat tire'. The top BM25 result is 'Flat Rate Tire Service Pricing' (high word overlap with 'flat' and 'tire'). How does the cross-encoder reranker fix this?
💡 BM25 counts word matches. What can the cross-encoder understand that BM25 can't?
This is a failure mode of pure lexical matching: 'flat' and 'tire' appear frequently in the pricing page, giving it a high BM25 score, but the intent is completely different. The cross-encoder processes '[CLS] query [SEP] document' as a single input, computing cross-attention between all token pairs. It learns that 'how to fix' signals repair/tutorial intent that doesn't match pricing content. This is why retrieve-then-rerank works: BM25 efficiently finds lexical candidates, then the cross-encoder applies expensive but accurate semantic understanding to reorder them.
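A minimal retrieve-then-rerank sketch along those lines, assuming sentence-transformers and its ms-marco-MiniLM-L-6-v2 cross-encoder (illustrative choices; any first-stage retriever plus any trained cross-encoder fits the same pattern):

```python
from sentence_transformers import CrossEncoder

query = "How to fix a flat tire"

# Stage 1: cheap retrieval (BM25 and/or dense) has already narrowed billions
# of documents to a small candidate list; hard-coded here for illustration.
candidates = [
    "Flat Rate Tire Service Pricing",
    "Step-by-step guide: repairing a flat bicycle tire at home",
    "Tire store opening hours and locations",
]

# Stage 2: the cross-encoder reads each (query, document) pair as one sequence
# and scores how well the document actually answers the query.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # assumed model
scores = reranker.predict([(query, doc) for doc in candidates])

for doc, score in sorted(zip(candidates, scores), key=lambda p: -p[1]):
    print(f"{score:+.2f}  {doc}")
```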
🎓 What You Now Know
✓ Inverted indices make search possible — converting O(N) scans to O(1) lookups. Without them, a single Google query would take 116 days.
✓ BM25 is TF-IDF done right — with term frequency saturation (k₁) and length normalization (b). Still competitive after 30 years.
✓ Learning to Rank combines hundreds of signals — BM25, PageRank, CTR, embeddings, freshness all feed into a gradient boosted model (LambdaMART).
✓ Dense retrieval finds meaning, not just keywords — but it can’t handle exact matches. That’s why hybrid (BM25 + dense) is the modern standard.
✓ Two-stage retrieve-then-rerank — cheap retrieval narrows billions to thousands, then expensive reranking scores those precisely. This is how every modern search engine works.
Information retrieval is the backbone of search engines, email search, enterprise knowledge bases, and RAG systems. Every time you hit “search,” this architecture is silently working to find your answer. 🔍
↗ Keep Learning
Bag of Words & TF-IDF — How Search Engines Ranked Before AI
A scroll-driven visual deep dive into Bag of Words and TF-IDF. Learn how documents become vectors, why term frequency alone fails, and how IDF rescues relevance — the backbone of search before neural models.
Query Understanding — What Did the User Actually Mean?
A scroll-driven visual deep dive into query understanding. From spell correction to query expansion to intent classification — learn how search engines interpret ambiguous, misspelled, and complex queries.
Text Similarity — From Jaccard to Neural Matching
A scroll-driven visual deep dive into text similarity. Learn how search engines detect duplicates, match queries to documents, and measure how 'close' two texts really are — from set overlap to cosine similarity to learned embeddings.
Word Embeddings — When Words Learned to Be Vectors
A scroll-driven visual deep dive into word embeddings. Learn how Word2Vec, GloVe, and FastText turn words into dense vectors where meaning becomes geometry — and why 'king - man + woman = queen' actually works.