Cross-Encoders vs Bi-Encoders — Why Accuracy Costs 1000× More Compute
A visual deep dive into the architectural difference between bi-encoders and cross-encoders. Why cross-attention produces higher-quality relevance scores, and why cross-encoders can only be used for reranking — never for retrieval.
🔬
Same BERT model.
5-15% NDCG difference.
A bi-encoder encodes query and document separately, then compares vectors with cosine similarity. A cross-encoder concatenates query and document together, letting every query token attend to every document token. Same BERT, dramatically different accuracy — but the cross-encoder is 1000× more expensive at query time. This tradeoff defines modern search architecture.
↓ Scroll to understand the architecture that makes reranking possible
Bi-Encoders vs. Cross-Encoders: The Critical Difference
A bi-encoder and cross-encoder use the SAME BERT model. Why does the cross-encoder produce more accurate relevance scores?
💡 Think about what happens inside the BERT self-attention layers when tokens from query and document are together vs. separate.
Both use identical BERT architectures with the same number of parameters. The difference is purely architectural: the bi-encoder runs BERT twice (once on query, once on doc) and compares output vectors. The cross-encoder runs BERT once on [CLS] query [SEP] document, letting self-attention create CROSS-interactions between query and document tokens. When BERT sees 'apple stock price [SEP] Apple Inc. reported quarterly earnings', the attention mechanism links 'apple' in the query with 'Inc.' and 'earnings' in the document — disambiguating that this is about the company, not the fruit. The bi-encoder can't do this because it must embed 'apple' without seeing the document.
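To make the input difference concrete, here is a minimal sketch using the Hugging Face transformers tokenizer (bert-base-uncased is just an example checkpoint). The point is the shape of the input: the bi-encoder tokenizes query and document as two separate sequences for two separate BERT passes, while the cross-encoder tokenizes them as one joint sequence that self-attention processes together.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

query = "apple stock price"
doc = "Apple Inc. reported strong quarterly earnings..."

# Bi-encoder: two independent sequences, encoded in two separate BERT passes.
bi_query_inputs = tok(query, return_tensors="pt")
bi_doc_inputs = tok(doc, return_tensors="pt")

# Cross-encoder: one joint sequence -- [CLS] query tokens [SEP] document tokens [SEP] --
# so every query token can attend to every document token in a single BERT pass.
cross_inputs = tok(query, doc, return_tensors="pt")
print(tok.convert_ids_to_tokens(cross_inputs["input_ids"][0]))
# ['[CLS]', 'apple', 'stock', 'price', '[SEP]', 'apple', 'inc', '.', 'reported', ...]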
Why cross-encoders are more accurate
Bi-encoder:    score = cosine(BERT(q), BERT(d))
Cross-encoder: score = sigmoid(BERT([CLS] q [SEP] d))

Example: query = 'apple stock price', doc = 'Apple Inc. reported strong quarterly earnings...'
Bi-encoder:    encodes 'apple' without knowing you mean the COMPANY
Cross-encoder: sees 'apple' + 'stock' + 'Inc.' + 'earnings' together → knows it's about the company

A cross-encoder is far more accurate than a bi-encoder. Why not just use a cross-encoder for everything?
💡 How many BERT forward passes does each architecture need per query?
This is the fundamental limitation. A bi-encoder encodes each document ONCE offline and stores the vector. At query time, you encode only the query (1 BERT call) and do ANN lookup (~1ms). A cross-encoder can't pre-compute anything — the score for 'apple stock price' vs document D depends on BOTH inputs concatenated. So for each query, you'd need N forward passes through BERT. That's why cross-encoders are only used as Stage 2 rerankers on the top ~100-1000 candidates from Stage 1.
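A rough sketch of that asymmetry, assuming the sentence-transformers library (the checkpoint names below are illustrative, not a recommendation): the document embeddings are computed once offline, the bi-encoder then spends one forward pass per query, while the cross-encoder spends one forward pass per (query, document) pair it scores.

from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Illustrative checkpoints -- any bi-encoder / cross-encoder pair behaves the same way.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

docs = [
    "Apple Inc. reported strong quarterly earnings...",
    "Apple pie recipes for the fall season.",
    "How to plant and prune an apple orchard.",
]

# Offline, once per document: the only place the bi-encoder pays a per-document cost.
doc_embs = bi_encoder.encode(docs, convert_to_tensor=True)

query = "apple stock price"

# Bi-encoder at query time: ONE forward pass, then cheap vector math over stored embeddings.
q_emb = bi_encoder.encode(query, convert_to_tensor=True)
bi_scores = util.cos_sim(q_emb, doc_embs)

# Cross-encoder at query time: one forward pass PER (query, document) pair -- nothing can be cached.
cross_scores = cross_encoder.predict([(query, d) for d in docs])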
You want to build a semantic search system over 50M documents with the HIGHEST possible accuracy. Which architecture do you choose?
💡 Neither pure bi-encoder nor pure cross-encoder is optimal alone. What if you combined them?
This is the retrieve-then-rerank paradigm that powers modern search. Stage 1 (bi-encoder + ANN): encode the query once, retrieve top 100 candidates in ~5ms via ANN index. Stage 2 (cross-encoder): run BERT on each of the 100 (query, candidate) pairs in ~500ms total. The result combines bi-encoder scalability with cross-encoder accuracy. Running a cross-encoder over all 50M documents would take 50M × 5ms = ~69 hours per query. The two-stage approach is the only way to get cross-encoder quality at interactive speed.
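Here is a minimal sketch of the two-stage loop, again assuming sentence-transformers with illustrative checkpoint names. Exhaustive cosine search over a toy corpus stands in for the ANN index you would use over 50M documents; the structure of the pipeline is the same.

from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Illustrative checkpoints; swap in whichever bi-encoder / cross-encoder you actually deploy.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Apple Inc. reported strong quarterly earnings driven by services revenue.",
    "AAPL shares rose after the earnings call.",
    "Apple pie recipes for the fall season.",
    "How to plant and prune an apple orchard.",
]

# Offline: embed the corpus once. At 50M documents these vectors would live in an ANN
# index (HNSW, IVF, etc.); exhaustive cosine search stands in for that lookup here.
corpus_embs = bi_encoder.encode(corpus, convert_to_tensor=True)

def search(query, retrieve_k=3, final_k=2):
    # Stage 1 -- bi-encoder retrieval: one query encoding + fast vector lookup.
    q_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_embs, top_k=retrieve_k)[0]

    # Stage 2 -- cross-encoder reranking: one BERT pass per surviving candidate only.
    pairs = [(query, corpus[h["corpus_id"]]) for h in hits]
    rerank_scores = cross_encoder.predict(pairs)

    reranked = sorted(zip(hits, rerank_scores), key=lambda x: x[1], reverse=True)
    return [(corpus[h["corpus_id"]], float(score)) for h, score in reranked[:final_k]]

print(search("apple stock price"))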
🎓 What You Now Know
✓ Bi-encoders encode query and doc separately — docs can be pre-computed offline, enabling ANN lookup at query time. Fast but lossy.
✓ Cross-encoders see query+doc together — full cross-attention produces 5-15% NDCG improvement. But requires a BERT pass per pair.
✓ The information bottleneck is the key — bi-encoders compress entire documents to 768 numbers. Cross-encoders have no such constraint.
✓ Cross-encoders can only rerank, never retrieve — 1B docs × 50ms = 1.5 years per query. Use them on the top 50-100 candidates only.
The two-stage retrieve + rerank pipeline combines the best of both: bi-encoder speed with cross-encoder accuracy. ⚡
↗ Keep Learning
Search Reranking — The Two-Stage Pipeline That Powers Production Search
A visual deep dive into the retrieve + rerank pipeline. How BM25, dense retrieval, and learned sparse retrieval feed into Reciprocal Rank Fusion, then cross-encoder reranking — with a full latency budget breakdown.
BM25 — The 30-Year-Old Algorithm That Still Wins at Search
A visual deep dive into BM25 scoring. Understand every term in the formula — IDF, TF saturation via k₁, length normalization via b — and why BM25 still outperforms neural retrievers on specialized vocabularies.
Transformers — The Architecture That Changed AI
A scroll-driven visual deep dive into the Transformer architecture. From RNNs to self-attention to GPT — understand the engine behind every modern AI model.
Approximate Nearest Neighbor Search — Trading 1% Accuracy for 1000× Speed
A visual deep dive into ANN search. Why brute-force nearest neighbor fails at scale, how approximate methods achieve 99% recall with logarithmic query time, and the fundamental accuracy-speed tradeoff behind every vector search system.