8 min deep-dive · search · NLP · transformers

Cross-Encoders vs Bi-Encoders — Why Accuracy Costs 1000× More Compute

A visual deep dive into the architectural difference between bi-encoders and cross-encoders. Why cross-attention produces higher-quality relevance scores, and why cross-encoders can only be used for reranking — never for retrieval.

Introduction

Same BERT model.
5-15% NDCG difference.

A bi-encoder encodes query and document separately, then compares vectors with cosine similarity. A cross-encoder concatenates query and document together, letting every query token attend to every document token. Same BERT, dramatically different accuracy — but the cross-encoder is 1000× more expensive at query time. This tradeoff defines modern search architecture.


Architecture

Bi-Encoders vs. Cross-Encoders: The Critical Difference

Bi-Encoder: Query → BERT₁ → v_q; Doc → BERT₂ → v_d; cosine(v_q, v_d) → score
Cross-Encoder: [CLS] Q [SEP] D → single BERT → q-d attention → direct score (e.g. 0.87)

Bi-Encoder properties:
✓ Docs encoded ONCE offline
✓ Query time: ~5ms (ANN lookup)
✓ Scales to billions of docs
✗ No query-doc cross-attention
✗ Loses fine-grained interactions
Use for: Stage 1 (retrieval)

Cross-Encoder properties:
✓ Full cross-attention q↔d
✓ Highest accuracy (NDCG +5-15%)
✓ Captures subtle relevance signals
✗ Must run PER (query, doc) pair
✗ ~50ms per pair → 1000 pairs = 50s
Use for: Stage 2 (reranking top 100)
Bi-encoders encode independently (fast, separate). Cross-encoders read query+doc together (slow, accurate).
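
To make the two scoring paths concrete, here is a minimal sketch assuming the sentence-transformers library (the article doesn't name one) and illustrative checkpoint names:

```python
# Minimal sketch of both scoring paths (sentence-transformers assumed; model names illustrative).
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "apple stock price"
doc = "Apple Inc. reported strong quarterly earnings..."

# Bi-encoder: encode query and document SEPARATELY, then compare vectors.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
v_q = bi_encoder.encode(query, convert_to_tensor=True)
v_d = bi_encoder.encode(doc, convert_to_tensor=True)   # in production: pre-computed offline
bi_score = util.cos_sim(v_q, v_d).item()                # cosine(v_q, v_d)

# Cross-encoder: feed query and doc through ONE model together, get a relevance score directly.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_score = cross_encoder.predict([(query, doc)])[0]

print(f"bi-encoder cosine: {bi_score:.3f} | cross-encoder score: {cross_score:.3f}")
```
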
🟢 Quick Check

A bi-encoder and cross-encoder use the SAME BERT model. Why does the cross-encoder produce more accurate relevance scores?

Cross-Encoder Math

Why cross-encoders are more accurate

1. Bi-encoder: score = cosine(BERT(q), BERT(d)). Query and doc NEVER see each other inside the model; all meaning is compressed into a single vector BEFORE comparing.
2. Cross-encoder: score = sigmoid(BERT([CLS] q [SEP] d)). Query and doc are CONCATENATED and fed together, so every query token can attend to every doc token (both formulas are sketched in code after this list).
3. Example: query = 'apple stock price', doc = 'Apple Inc. reported strong quarterly earnings...'
4. Bi-encoder: encodes 'apple' without knowing you mean the COMPANY. The query vector for 'apple' is the same whether you mean fruit or company.
5. Cross-encoder: sees 'apple' + 'stock' + 'Inc.' + 'earnings' together → knows it's about the company. Cross-attention lets query words disambiguate based on document content; this is WHY it's more accurate.
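
Both formulas from steps 1-2 can be written out directly. Here is a sketch assuming the Hugging Face transformers library; the checkpoints are illustrative, and the mean pooling on the bi-encoder side is a simplification of what trained bi-encoders actually do:

```python
# Sketch of the two scoring formulas with raw Hugging Face transformers.
# Checkpoints are illustrative; mean pooling is a simplified stand-in for trained pooling.
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

query = "apple stock price"
doc = "Apple Inc. reported strong quarterly earnings..."

# --- Bi-encoder: score = cosine(BERT(q), BERT(d)) ---
tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
enc = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed(text):
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state   # (1, seq_len, hidden)
    return hidden.mean(dim=1)                      # pool to ONE vector per text

bi_score = torch.cosine_similarity(embed(query), embed(doc)).item()

# --- Cross-encoder: score = sigmoid(BERT([CLS] q [SEP] d)) ---
ce_tok = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce = AutoModelForSequenceClassification.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

inputs = ce_tok(query, doc, return_tensors="pt", truncation=True)  # builds [CLS] q [SEP] d [SEP]
with torch.no_grad():
    logit = ce(**inputs).logits.squeeze()
cross_score = torch.sigmoid(logit).item()          # direct relevance score

print(f"bi-encoder: {bi_score:.3f}  cross-encoder: {cross_score:.3f}")
```
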
🟡 Checkpoint

A cross-encoder is far more accurate than a bi-encoder. Why not just use a cross-encoder for everything?

🟡 Checkpoint

You want to build a semantic search system over 50M documents with the HIGHEST possible accuracy. Which architecture do you choose?

🎓 What You Now Know

Bi-encoders encode query and doc separately — docs can be pre-computed offline, enabling ANN lookup at query time. Fast but lossy.

Cross-encoders see query+doc together — full cross-attention produces 5-15% NDCG improvement. But requires a BERT pass per pair.

The information bottleneck is the key — bi-encoders compress entire documents to 768 numbers. Cross-encoders have no such constraint.

Cross-encoders can only rerank, never retrieve — 1B docs × 50ms = 1.5 years per query. Use them on the top 50-100 candidates only.
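
A quick back-of-envelope check of that figure, assuming a flat 50ms per pair and strictly sequential scoring:

```python
# Back-of-envelope: cross-encoding 1B docs for a single query, sequentially.
docs = 1_000_000_000
seconds_per_pair = 0.05                     # ~50ms per (query, doc) pair
total_seconds = docs * seconds_per_pair     # 50,000,000 s
print(total_seconds / (365 * 24 * 3600))    # ≈ 1.6 years for ONE query
```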

The two-stage retrieve + rerank pipeline combines the best of both: bi-encoder speed with cross-encoder accuracy. ⚡
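
As a closing sketch, here is what that two-stage pipeline can look like with the sentence-transformers library (an assumption; the corpus, model names, and top-k values are illustrative):

```python
# Minimal two-stage retrieve + rerank sketch (sentence-transformers assumed).
# Stage 1: bi-encoder retrieval over pre-computed embeddings.
# Stage 2: cross-encoder reranking of the top candidates only.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = [
    "Apple Inc. reported strong quarterly earnings...",
    "How to grow apple trees in your backyard",
    "AAPL shares rose 3% after the earnings call",
]  # illustrative documents

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Offline: encode the whole corpus ONCE (in production, stored in an ANN index).
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

def search(query, retrieve_k=100, rerank_k=10):
    # Stage 1: cheap vector lookup, milliseconds even at large scale.
    q_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=retrieve_k)[0]

    # Stage 2: expensive cross-attention, but only on the retrieved candidates.
    pairs = [(query, corpus[h["corpus_id"]]) for h in hits]
    scores = cross_encoder.predict(pairs)
    reranked = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)
    return [(corpus[h["corpus_id"]], float(s)) for h, s in reranked[:rerank_k]]

print(search("apple stock price"))
```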
