Search Reranking — The Two-Stage Pipeline That Powers Production Search
A visual deep dive into the retrieve + rerank pipeline. How BM25, dense retrieval, and learned sparse retrieval feed into Reciprocal Rank Fusion, then cross-encoder reranking — with a full latency budget breakdown.
🏆
Retrieve 1000. Rerank 100.
Return the best 10.
No single retrieval method is both fast enough AND accurate enough for production search. The solution: a two-stage pipeline. Stage 1 uses cheap methods (BM25, bi-encoder ANN) to retrieve ~1000 candidates in ~25ms. Stage 2 uses an expensive cross-encoder to rerank the top 100 in ~100ms. Result: the accuracy of a cross-encoder at the speed of ANN search.
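To make the shape of the pipeline concrete, here is a minimal Python sketch of the two stages. The function name, the retriever/fusion/reranker callables, and the candidate counts are illustrative assumptions, not any particular library's API.

```python
def two_stage_search(query, retrievers, fuse, rerank, docs, top_n=10):
    """Hypothetical two-stage retrieve + rerank pipeline (illustrative sketch)."""
    # Stage 1: cheap retrieval (~25 ms). Each retriever returns ~1000 candidate doc IDs.
    rankings = [retrieve(query, k=1000) for retrieve in retrievers]

    # Fuse the ranked lists (e.g., with Reciprocal Rank Fusion, shown below)
    # and keep only the 100 most promising candidates.
    candidates = fuse(rankings)[:100]

    # Stage 2: expensive reranking (~100 ms). One cross-encoder forward pass
    # per (query, document) pair.
    scores = rerank([(query, docs[i]) for i in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

    # Return the best 10 to the user.
    return [doc_id for doc_id, _ in reranked[:top_n]]
```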
↓ Scroll to understand the pipeline that powers Google, Bing, and every modern search system
The Two-Stage Retrieve + Rerank Pipeline
Why does the modern search pipeline use MULTIPLE retrievers (BM25 + dense + learned sparse) instead of just one?
💡 Think about what kinds of queries each retriever type handles best.
Consider the query 'python memory leak debugging'. BM25 excels here because 'memory leak' is a precise term. But for 'how to fix slow program', dense retrieval finds semantically relevant docs even without exact matches. Learned sparse models (SPLADE) expand queries with related terms, bridging both worlds. RRF fusion rewards documents that rank well across multiple retrievers — a document ranked #5 by BM25 AND #3 by dense retrieval is likely more relevant than one ranked #1 by only one retriever. This 'ensemble retrieval' approach consistently outperforms any single retriever by 5-15% NDCG.
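The fusion step is simple enough to show in full. Below is a small sketch of Reciprocal Rank Fusion; k=60 is the constant commonly used in the RRF literature, and the input format (lists of doc IDs, best first) is an assumption for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs (best first) into one ranking.

    Each retriever contributes 1 / (k + rank) per document, so documents that
    rank well across multiple retrievers accumulate the highest fused scores.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: a doc ranked #3 by dense retrieval and #5 by BM25 beats a doc
# ranked #1 by only one retriever: 1/63 + 1/65 ≈ 0.031 > 1/61 ≈ 0.016.
fused = reciprocal_rank_fusion([
    ["d7", "d2", "d9"],   # BM25 ranking
    ["d2", "d9", "d7"],   # dense retrieval ranking
])
```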
Where Every Millisecond Goes
Your search system reranks the top 1000 results with a cross-encoder, taking 5 seconds per query. Users complain about latency. What's the best fix?
💡 Cross-encoder latency scales linearly with the number of documents...
Cross-encoder cost is O(n) in the number of documents to rerank. Each (query, doc) pair requires a full BERT forward pass (~5ms on GPU). 1000 × 5ms = 5 seconds. Reducing to 100 → 500ms. Reducing to 50 → 250ms. The key insight: Stage 1 retrieval already puts the most relevant docs in the top 100. Reranking top 100 vs top 1000 typically loses only 1-2% NDCG but is 10× faster. Some systems also use model distillation — training a smaller, faster cross-encoder (e.g., MiniLM-L6 instead of BERT-large) to get 3× speedup with minimal quality loss.
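To see the O(n) cost directly, you can time a cross-encoder at different rerank depths. A rough sketch using the sentence-transformers CrossEncoder class; the model name, placeholder documents, and per-pair timings are assumptions that will vary with your hardware.

```python
import time
from sentence_transformers import CrossEncoder

# A distilled MS MARCO cross-encoder; swap in whichever reranker you use.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "python memory leak debugging"
candidates = [f"candidate document {i} ..." for i in range(1000)]  # placeholder docs

for depth in (1000, 100, 50):
    pairs = [(query, doc) for doc in candidates[:depth]]
    start = time.perf_counter()
    scores = model.predict(pairs, batch_size=32)   # one forward pass per pair
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"rerank top {depth:>4}: {elapsed_ms:.0f} ms")

# Cost is O(depth): cutting 1000 -> 100 is roughly a 10x latency reduction,
# typically losing only 1-2% NDCG because Stage 1 already puts the most
# relevant documents near the top.
```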
Your search latency budget is 200ms. The cross-encoder reranker currently takes 100ms on 100 documents. Product wants you to ALSO add a second reranking pass with a different model. How do you fit it in?
💡 How can you make a cross-encoder faster without reducing the number of documents?
Model distillation trains a small 'student' model to mimic a large 'teacher' cross-encoder. MiniLM-L6 (22M params) achieves ~98% of BERT-large quality at 3× the speed. Your new budget: Stage 1 retrieval (25ms) + fusion (5ms) + first reranker via MiniLM-L6 (33ms) + second reranker (33ms) + network/render (55ms) = 151ms. Cutting the rerank window to 50 docs would also fit, but it costs recall; distillation preserves the full 100-doc window while gaining speed. This is why companies like Cohere and Jina ship specialized distilled reranker models.
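The budget arithmetic is worth writing down explicitly. A tiny sketch using the stage timings quoted above (assumed targets, not measurements):

```python
# Hypothetical 200 ms latency budget after swapping in a distilled reranker.
budget_ms = 200
stages_ms = {
    "stage 1 retrieval (BM25 + ANN)":     25,
    "RRF fusion":                          5,
    "first rerank (MiniLM-L6, 100 docs)": 33,
    "second rerank (100 docs)":           33,
    "network + render":                   55,
}

total = sum(stages_ms.values())
print(f"total: {total} ms of {budget_ms} ms budget "
      f"({budget_ms - total} ms headroom)")   # total: 151 ms (49 ms headroom)
```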
🎓 What You Now Know
✓ The modern search stack is BM25 + Dense + Rerank — retrieve 1000 candidates cheaply, fuse with RRF, rerank top 100 with a cross-encoder.
✓ RRF combines multiple retriever rankings — 1/(k + rank) elegantly rewards consensus without score calibration.
✓ Reranking takes 50% of the latency budget — but contributes ~70% of quality. The sweet spot is reranking the top 100, not 1000.
✓ Distilled rerankers trade size for speed — MiniLM-L6 is 3× faster than BERT-large with only 1-2% NDCG loss.
Every search query triggers a symphony of indexes, algorithms, and models — each operating at a different scale and cost. Understanding which tool to use at which stage is the essence of search engineering. ⚡
↗ Keep Learning
Cross-Encoders vs Bi-Encoders — Why Accuracy Costs 1000× More Compute
A visual deep dive into the architectural difference between bi-encoders and cross-encoders. Why cross-attention produces higher-quality relevance scores, and why cross-encoders can only be used for reranking — never for retrieval.
BM25 — The 30-Year-Old Algorithm That Still Wins at Search
A visual deep dive into BM25 scoring. Understand every term in the formula — IDF, TF saturation via k₁, length normalization via b — and why BM25 still outperforms neural retrievers on specialized vocabularies.
Approximate Nearest Neighbor Search — Trading 1% Accuracy for 1000× Speed
A visual deep dive into ANN search. Why brute-force nearest neighbor fails at scale, how approximate methods achieve 99% recall with logarithmic query time, and the fundamental accuracy-speed tradeoff behind every vector search system.
Vector Databases — Search by Meaning, Not Keywords
A visual deep dive into vector databases. From embeddings to ANN search to HNSW — understand how AI-powered search finds what you actually mean, not just what you typed.