Search Reranking — The Two-Stage Pipeline That Powers Production Search
A visual deep dive into the retrieve + rerank pipeline. How BM25, dense retrieval, and learned sparse retrieval feed into Reciprocal Rank Fusion, then cross-encoder reranking — with a full latency budget breakdown.
🏆
Retrieve 1000. Rerank 100.
Return the best 10.
No single retrieval method is both fast enough AND accurate enough for production search. The solution: a two-stage pipeline. Stage 1 uses cheap methods (BM25, bi-encoder ANN) to retrieve ~1000 candidates in ~25ms. Stage 2 uses an expensive cross-encoder to rerank the top 100 in ~100ms. Result: the accuracy of a cross-encoder at the speed of ANN search.
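To make the shape of the pipeline concrete, here is a minimal Python sketch of the two stages. The function name, the retriever/fusion/reranker callables, and the candidate counts are illustrative assumptions, not any particular library's API.

```python
def two_stage_search(query, retrievers, fuse, rerank, docs, top_n=10):
    """Hypothetical two-stage retrieve + rerank pipeline (illustrative sketch)."""
    # Stage 1: cheap retrieval (~25 ms). Each retriever returns ~1000 candidate doc IDs.
    rankings = [retrieve(query, k=1000) for retrieve in retrievers]

    # Fuse the ranked lists (e.g., with Reciprocal Rank Fusion, shown below)
    # and keep only the 100 most promising candidates.
    candidates = fuse(rankings)[:100]

    # Stage 2: expensive reranking (~100 ms). One cross-encoder forward pass
    # per (query, document) pair.
    scores = rerank([(query, docs[i]) for i in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

    # Return the best 10 to the user.
    return [doc_id for doc_id, _ in reranked[:top_n]]
```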
↓ Scroll to understand the pipeline that powers Google, Bing, and every modern search system
The Two-Stage Retrieve + Rerank Pipeline
Why does the modern search pipeline use MULTIPLE retrievers (BM25 + dense + learned sparse) instead of just one?
💡 Think about what kinds of queries each retriever type handles best.
Consider the query 'python memory leak debugging'. BM25 excels here because 'memory leak' is a precise term. But for 'how to fix slow program', dense retrieval finds semantically relevant docs even without exact matches. Learned sparse models (SPLADE) expand queries with related terms, bridging both worlds. RRF fusion rewards documents that rank well across multiple retrievers — a document ranked #5 by BM25 AND #3 by dense retrieval is likely more relevant than one ranked #1 by only one retriever. This 'ensemble retrieval' approach consistently outperforms any single retriever by 5-15% NDCG.
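The fusion step is simple enough to show in full. Below is a small sketch of Reciprocal Rank Fusion; k=60 is the constant commonly used in the RRF literature, and the input format (lists of doc IDs, best first) is an assumption for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs (best first) into one ranking.

    Each retriever contributes 1 / (k + rank) per document, so documents that
    rank well across multiple retrievers accumulate the highest fused scores.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: a doc ranked #3 by dense retrieval and #5 by BM25 beats a doc
# ranked #1 by only one retriever: 1/63 + 1/65 ≈ 0.031 > 1/61 ≈ 0.016.
fused = reciprocal_rank_fusion([
    ["d7", "d2", "d9"],   # BM25 ranking
    ["d2", "d9", "d7"],   # dense retrieval ranking
])
```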
Where Every Millisecond Goes
Your search system reranks the top 1000 results with a cross-encoder, taking 5 seconds per query. Users complain about latency. What's the best fix?
💡 Cross-encoder latency scales linearly with the number of documents...
Cross-encoder cost is O(n) in the number of documents to rerank. Each (query, doc) pair requires a full BERT forward pass (~5ms on GPU). 1000 × 5ms = 5 seconds. Reducing to 100 → 500ms. Reducing to 50 → 250ms. The key insight: Stage 1 retrieval already puts the most relevant docs in the top 100. Reranking top 100 vs top 1000 typically loses only 1-2% NDCG but is 10× faster. Some systems also use model distillation — training a smaller, faster cross-encoder (e.g., MiniLM-L6 instead of BERT-large) to get 3× speedup with minimal quality loss.
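To see the O(n) cost directly, you can time a cross-encoder at different rerank depths. A rough sketch using the sentence-transformers CrossEncoder class; the model name, placeholder documents, and per-pair timings are assumptions that will vary with your hardware.

```python
import time
from sentence_transformers import CrossEncoder

# A distilled MS MARCO cross-encoder; swap in whichever reranker you use.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "python memory leak debugging"
candidates = [f"candidate document {i} ..." for i in range(1000)]  # placeholder docs

for depth in (1000, 100, 50):
    pairs = [(query, doc) for doc in candidates[:depth]]
    start = time.perf_counter()
    scores = model.predict(pairs, batch_size=32)   # one forward pass per pair
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"rerank top {depth:>4}: {elapsed_ms:.0f} ms")

# Cost is O(depth): cutting 1000 -> 100 is roughly a 10x latency reduction,
# typically losing only 1-2% NDCG because Stage 1 already puts the most
# relevant documents near the top.
```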
Your search latency budget is 200ms. The cross-encoder reranker currently takes 100ms on 100 documents. Product wants you to ALSO add a second reranking pass with a different model. How do you fit it in?
💡 How can you make a cross-encoder faster without reducing the number of documents?
Model distillation trains a small 'student' model to mimic a large 'teacher' cross-encoder. MiniLM-L6 (22M params) achieves ~98% of BERT-large quality at 3× the speed. Your new budget: Stage 1 retrieval (25ms) + fusion (5ms) + first reranker via MiniLM-L6 (33ms) + second reranker (33ms) + network/render (55ms) = 151ms. Cutting the rerank window to 50 docs would also fit, but it costs recall; distillation preserves the full 100-doc window while gaining speed. This is why companies like Cohere and Jina ship specialized distilled reranker models.
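The budget arithmetic is worth writing down explicitly. A tiny sketch using the stage timings quoted above (assumed targets, not measurements):

```python
# Hypothetical 200 ms latency budget after swapping in a distilled reranker.
budget_ms = 200
stages_ms = {
    "stage 1 retrieval (BM25 + ANN)":     25,
    "RRF fusion":                          5,
    "first rerank (MiniLM-L6, 100 docs)": 33,
    "second rerank (100 docs)":           33,
    "network + render":                   55,
}

total = sum(stages_ms.values())
print(f"total: {total} ms of {budget_ms} ms budget "
      f"({budget_ms - total} ms headroom)")   # total: 151 ms (49 ms headroom)
```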
🎓 What You Now Know
✓ The modern search stack is BM25 + Dense + Rerank — retrieve 1000 candidates cheaply, fuse with RRF, rerank top 100 with a cross-encoder.
✓ RRF combines multiple retriever rankings — 1/(k + rank) elegantly rewards consensus without score calibration.
✓ Reranking takes 50% of the latency budget — but contributes ~70% of quality. The sweet spot is reranking the top 100, not 1000.
✓ Distilled rerankers trade size for speed — MiniLM-L6 is 3× faster than BERT-large with only 1-2% NDCG loss.
Every search query triggers a symphony of indexes, algorithms, and models — each operating at a different scale and cost. Understanding which tool to use at which stage is the essence of search engineering. ⚡
↗ Keep Learning
Cross-Encoders vs Bi-Encoders — Why Accuracy Costs 1000× More Compute
A visual deep dive into the architectural difference between bi-encoders and cross-encoders. Why cross-attention produces higher-quality relevance scores, and why cross-encoders can only be used for reranking — never for retrieval.
BM25 — The 30-Year-Old Algorithm That Still Wins at Search
A visual deep dive into BM25 scoring. Understand every term in the formula — IDF, TF saturation via k₁, length normalization via b — and why BM25 still outperforms neural retrievers on specialized vocabularies.
Approximate Nearest Neighbor Search — Trading 1% Accuracy for 1000× Speed
A visual deep dive into ANN search. Why brute-force nearest neighbor fails at scale, how approximate methods achieve 99% recall with logarithmic query time, and the fundamental accuracy-speed tradeoff behind every vector search system.
Vector Databases — Search by Meaning, Not Keywords
A visual deep dive into vector databases. From embeddings to ANN search to HNSW — understand how AI-powered search finds what you actually mean, not just what you typed.