8 min deep dive · search · information-retrieval · systems

Search Reranking — The Two-Stage Pipeline That Powers Production Search

A visual deep dive into the retrieve + rerank pipeline. How BM25, dense retrieval, and learned sparse retrieval feed into Reciprocal Rank Fusion, then cross-encoder reranking — with a full latency budget breakdown.

Introduction

🏆 Retrieve 1000. Rerank 100. Return the best 10.

No single retrieval method is both fast enough AND accurate enough for production search. The solution: a two-stage pipeline. Stage 1 uses cheap methods (BM25, bi-encoder ANN) to retrieve ~1000 candidates in ~25ms. Stage 2 uses an expensive cross-encoder to rerank the top 100 in ~100ms. Result: the accuracy of a cross-encoder at the speed of ANN search.

This is the pipeline that powers Google, Bing, and every modern search system.
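The shape of the pipeline fits in a dozen lines. A minimal sketch, assuming each retriever is a function from (query, top_k) to a ranked list of doc ids, and rerank_score is any cross-encoder scoring function; all names here are illustrative, not a real library API:

```python
from typing import Callable, Sequence

def two_stage_search(
    query: str,
    retrievers: Sequence[Callable[[str, int], list[str]]],
    fuse: Callable[[list[list[str]]], list[str]],
    doc_text: Callable[[str], str],
    rerank_score: Callable[[str, str], float],
    k_retrieve: int = 1000,  # stage 1: cast a wide, cheap net
    k_rerank: int = 100,     # stage 2: all the expensive model can afford
    k_final: int = 10,
) -> list[str]:
    # Stage 1: each cheap retriever (BM25, ANN, SPLADE) returns ~1000 ids.
    ranked_lists = [retrieve(query, k_retrieve) for retrieve in retrievers]

    # Fuse the rankings (e.g. with RRF, sketched in the next section) and
    # keep only the slice the cross-encoder can score within budget.
    candidates = fuse(ranked_lists)[:k_rerank]

    # Stage 2: the cross-encoder reads query and document together,
    # which is what makes it both accurate and expensive.
    scored = [(d, rerank_score(query, doc_text(d))) for d in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [d for d, _ in scored[:k_final]]
```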

Reranking Pipeline

The Two-Stage Retrieve + Rerank Pipeline

🔍 Query → 📝 BM25 (inverted index) · 🧠 dense bi-encoder + ANN · 📊 learned sparse (SPLADE) → ~1000 candidates each → 🔄 RRF fusion → ~100–500 → 🎯 cross-encoder reranker → 🏆 top 10
Modern production search: cheap retrieval narrows the field, expensive reranking perfects the order
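The fusion step, Reciprocal Rank Fusion, is only a few lines. A sketch using the conventional constant k = 60; doc ids are plain strings here for illustration:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).

    Only ranks are used, so BM25 scores and cosine similarities never have
    to live on the same scale. Documents ranked well by several retrievers
    (consensus) accumulate the largest scores.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: "b" is ranked well by both retrievers, so it wins.
print(rrf_fuse([["a", "b", "c"], ["b", "d", "a"]]))  # ['b', 'a', 'd', 'c']
```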
🟢 Quick Check

Why does the modern search pipeline use MULTIPLE retrievers (BM25 + dense + learned sparse) instead of just one?

Latency Budget

Where Every Millisecond Goes

200ms latency budget breakdown:
Network: 30ms
Query understanding (QU): 15ms
BM25 retrieval: 10ms
Dense retrieval: 15ms
RRF fusion: 5ms
Cross-encoder reranking: 100ms (50% of budget!)
Render: 25ms

Reranking takes 50% of the latency budget but contributes ~70% of quality. This is why reranking the top 100 docs (not 1000) is the production sweet spot.
Latency budget for a 200ms end-to-end query — every millisecond is allocated
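To make the 100ms stage concrete: a hedged sketch using the sentence-transformers CrossEncoder class with a distilled MS MARCO MiniLM checkpoint (the same model family mentioned in the takeaways below). Actual timings depend entirely on your hardware and batch size:

```python
import time

from sentence_transformers import CrossEncoder

# Distilled 6-layer cross-encoder trained on MS MARCO passage ranking;
# small enough that ~100 (query, doc) pairs can fit a ~100ms GPU budget.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how does reciprocal rank fusion work"
candidates = [  # stand-ins for the ~100 fused candidates from stage 1
    "RRF sums 1/(k + rank) across the rankings of several retrievers.",
    "BM25 scores documents with term frequencies over an inverted index.",
]

start = time.perf_counter()
scores = model.predict([(query, doc) for doc in candidates])
elapsed_ms = (time.perf_counter() - start) * 1000

ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
print(f"reranked {len(candidates)} docs in {elapsed_ms:.0f}ms")
print(ranked[0][0])  # the RRF passage should score highest for this query
```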
🟡 Checkpoint

Your search system reranks the top 1000 results with a cross-encoder, taking 5 seconds per query. Users complain about latency. What's the best fix?

🟡 Checkpoint

Your search latency budget is 200ms. The cross-encoder reranker currently takes 100ms on 100 documents. Product wants you to ALSO add a second reranking pass with a different model. How do you fit it in?
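One way to make two passes fit, sketched under assumptions: use the fast distilled model to prune the candidate set before the strong model runs, so the expensive pass sees far fewer documents. fast_score and strong_score are hypothetical scoring functions, not a real library API:

```python
from typing import Callable

def cascade_rerank(
    query: str,
    candidates: list[str],
    fast_score: Callable[[str, str], float],    # distilled model, e.g. MiniLM-L6
    strong_score: Callable[[str, str], float],  # larger, slower model
    keep_after_fast: int = 25,
    k_final: int = 10,
) -> list[str]:
    # Pass 1: the fast model scores all ~100 candidates cheaply.
    pass1 = sorted(candidates, key=lambda d: fast_score(query, d), reverse=True)

    # Pass 2: the strong model only sees the survivors, so its cost drops
    # roughly 4x and the two passes can share the original ~100ms budget.
    survivors = pass1[:keep_after_fast]
    pass2 = sorted(survivors, key=lambda d: strong_score(query, d), reverse=True)
    return pass2[:k_final]
```

The cascade is the same retrieve-then-rerank idea applied recursively: each stage is cheaper per document than the next and sees fewer documents.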

🎓 What You Now Know

The modern search stack is BM25 + Dense + Rerank — retrieve 1000 candidates cheaply, fuse with RRF, rerank top 100 with a cross-encoder.

RRF combines multiple retriever rankings — summing 1/(k + rank) across lists elegantly rewards consensus without score calibration.

Reranking takes 50% of the latency budget — but contributes ~70% of quality. The sweet spot is reranking the top 100, not 1000.

Distilled rerankers trade size for speed — MiniLM-L6 is 3× faster than BERT-large with only 1-2% NDCG loss.

Every search query triggers a symphony of indexes, algorithms, and models — each operating at a different scale and cost. Understanding which tool to use at which stage is the essence of search engineering. ⚡
