BM25 — The 30-Year-Old Algorithm That Still Wins at Search
A visual deep dive into BM25 scoring. Understand every term in the formula — IDF, TF saturation via k₁, length normalization via b — and why BM25 still outperforms neural retrievers on specialized vocabularies.
Published in 1994.
Still beating neural models.
BM25 (Best Match 25) is a bag-of-words scoring function that ranks documents by term frequency, inverse document frequency, and document length. It powers Elasticsearch, Solr, and Lucene. On the BEIR benchmark, BM25 outperformed fine-tuned dense retrieval models on 6 out of 18 datasets — especially on medical, legal, and scientific domains. The secret? Exact keyword matching wins when vocabulary is specialized.
↓ Scroll to understand every term in the BM25 formula
BM25: Every Term Explained
BM25 scores a document by summing over all query terms: each term's contribution is its IDF weight multiplied by a saturated, length-normalized term frequency.
BM25(q, d) = Σ IDF(t) × [tf×(k₁+1)] / [tf + k₁×(1−b+b×|d|/avgdl)]

IDF(t): Inverse Document Frequency gives rare terms higher weight. 'quantum' appears in few documents and gets a high IDF; 'the' appears everywhere and gets near-zero IDF.
IDF(t) = log[(N − df(t) + 0.5) / (df(t) + 0.5)]

tf: Raw count of how many times a query term appears in the document. More occurrences signal higher relevance — but BM25 applies saturation so the 30th mention barely matters more than the 3rd.

k₁: Controls how quickly additional term occurrences stop mattering. At k₁=0, frequency is ignored entirely. At k₁=∞, TF grows linearly. The typical k₁=1.2 means mentioning a term 3× helps a lot, but 30× barely adds more.

b: Controls whether longer documents are penalized. At b=0, document length is ignored. At b=1, fully normalized. The typical b=0.75 penalizes long docs (where terms appear more by chance) without being too aggressive.

|d|/avgdl: The ratio compares this document's length to the corpus average. A document twice as long as average gets its TF discounted — occurrences in longer documents are more likely to happen by chance.
|d|/avgdl = document length / average document length

In the BM25 formula, what does the IDF component reward?
💡 IDF stands for Inverse DOCUMENT Frequency — think about what 'inverse' means here.
IDF = log[(N - df(t) + 0.5) / (df(t) + 0.5)]. When df(t) is small (rare term), the numerator is large relative to the denominator, producing a high IDF score. A term appearing in 10 out of 1 million documents gets a much higher IDF than a term appearing in 500,000 documents. This is why BM25 naturally prioritizes distinctive, informative terms — a query for 'mitochondria function' will weight 'mitochondria' (rare) far more than 'function' (common).
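Putting the pieces together, here is a minimal pure-Python sketch of the full scoring function over a toy corpus. The whitespace tokenization, the four example documents, and the defaults k₁=1.2 and b=0.75 are illustrative assumptions, not a production index.

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """BM25 score of one document for a query (toy sketch, not an optimized index)."""
    N = len(corpus)                                  # total number of documents
    avgdl = sum(len(d) for d in corpus) / N          # average document length
    tf = Counter(doc_terms)                          # term frequencies in this document
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)        # how many documents contain t
        if df == 0 or tf[t] == 0:
            continue                                 # term contributes nothing
        idf = math.log((N - df + 0.5) / (df + 0.5))  # the IDF formula above (can dip below zero for very common terms)
        norm = 1 - b + b * len(doc_terms) / avgdl    # length normalization
        score += idf * (tf[t] * (k1 + 1)) / (tf[t] + k1 * norm)
    return score

corpus = [
    "quantum computing uses qubits for computation".split(),
    "the cat sat on the mat".split(),
    "classical computers use bits and logic gates".split(),
    "the dog chased the cat around the garden".split(),
]
query = "quantum computing".split()
for doc in corpus:
    print(f"{bm25_score(query, doc, corpus):6.3f}  {' '.join(doc)}")

Only the first document scores above zero for this query; 'quantum' and 'computing' are rare in the toy corpus, so each matched term carries a high IDF weight.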
TF Saturation: The Key Insight
You're building a search engine for medical research papers. A user queries 'BRAF p.V600E mutation pathogenicity'. Which retrieval method would you use as Stage 1?
💡 Think about what happens when a neural model embeds 'p.V600E' — does it distinguish it from 'p.V600K'?
Medical, legal, and scientific search have specialized vocabularies where exact term matching is critical. 'p.V600E' is a specific point mutation — a dense retriever might confuse it with 'p.V600K' or 'p.V600R' since they're semantically similar. BM25 matches the exact string. In the BEIR benchmark, BM25 outperformed dense models specifically on BioASQ (medical) and SciFact (scientific) datasets. The best approach: use BM25 as Stage 1 retrieval to capture exact keyword matches, then rerank the top results with a cross-encoder for quality.
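Here is a hedged sketch of what that two-stage setup could look like, assuming the rank_bm25 and sentence-transformers packages and the public ms-marco MiniLM cross-encoder checkpoint; the tiny corpus, the lowercase whitespace tokenizer, and the top-100 cutoff are placeholders.

from rank_bm25 import BM25Okapi                  # assumed: pip install rank_bm25
from sentence_transformers import CrossEncoder   # assumed: pip install sentence-transformers

corpus = [
    "BRAF p.V600E is a pathogenic activating mutation common in melanoma.",
    "BRAF p.V600K is a distinct variant with different clinical behavior.",
    "BRCA1 frameshift variants are linked to hereditary breast cancer.",
    # ... thousands more abstracts in a real index
]
query = "BRAF p.V600E mutation pathogenicity"

# Stage 1: BM25 over whitespace tokens, so exact strings like 'p.V600E' survive intact.
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)
scores = bm25.get_scores(query.lower().split())
candidates = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:100]

# Stage 2: cross-encoder reranking of the BM25 candidates for semantic quality.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint name
pairs = [(query, corpus[i]) for i in candidates]
rerank_scores = reranker.predict(pairs)
reranked = [i for _, i in sorted(zip(rerank_scores, candidates), reverse=True)]
for i in reranked[:3]:
    print(corpus[i])

Only documents that literally contain the token 'p.V600E' get credit for it, so BM25 ranks the exact-match abstract first rather than a semantically close p.V600K paper; the cross-encoder then reorders the candidates by deeper relevance.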
Now that you’ve seen what BM25 does and why it works, let’s test your intuition about the edge cases. The k₁ parameter is the single knob that controls TF saturation — and pushing it to its extremes reveals exactly how flexible the formula really is.
You set k₁=0 in your BM25 configuration. What happens to the scoring behavior?
💡 Plug k₁=0 into the formula: tf × (k₁ + 1) / (tf + k₁ × ...) and simplify.
When k₁=0, the TF saturation formula becomes: tf × (0 + 1) / (tf + 0 × ...) = tf / tf = 1 for any non-zero tf. This means term frequency is completely flattened — whether a term appears once or a thousand times, the TF component contributes the same score (1.0). Only IDF distinguishes query terms. This is sometimes useful for boolean-style search where you only care about term presence, not frequency. At the other extreme, k₁=∞ makes TF grow linearly with no saturation — which is vulnerable to keyword stuffing.
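The saturation behavior is easy to see by evaluating just the TF factor of the formula at a few values of k₁. A minimal sketch (b is set to 0 here so length normalization drops out):

def tf_factor(tf, k1, b=0.0, dl_ratio=1.0):
    """The TF component of BM25: tf*(k1+1) / (tf + k1*(1 - b + b*dl_ratio))."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl_ratio))

for k1 in (0.0, 1.2, 100.0):    # k1=100 stands in for the k1 -> infinity regime
    values = [round(tf_factor(tf, k1), 2) for tf in (1, 3, 10, 30)]
    print(f"k1={k1:>5}: tf=1,3,10,30 -> {values}")

# k1=0.0  : every non-zero tf collapses to 1.0 (pure presence/absence)
# k1=1.2  : fast saturation; tf=30 scores only slightly above tf=3
# k1=100  : keeps growing with tf, approaching linear growth, which invites keyword stuffing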
🎓 What You Now Know
✓ BM25 = IDF × saturated TF with length normalization — a deceptively simple formula that’s been the backbone of search for 30 years.
✓ k₁ controls TF saturation — at k₁=1.2, mentioning a term 3× is great but 30× is barely better. This prevents keyword stuffing.
✓ b controls length normalization — at b=0.75, long documents are penalized because words appear more by chance in longer text.
✓ BM25 beats neural models on specialized vocabularies — exact keyword matching wins when terms like “BRAF p.V600E” must be matched precisely.
Modern production search always includes BM25 alongside dense retrieval — the two approaches are complementary, not competing. ⚡
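One standard way to combine the two result lists is Reciprocal Rank Fusion: each system contributes 1/(k + rank) for every document it returns, so anything ranked highly by either BM25 or the dense retriever rises to the top of the fused list. A minimal sketch, assuming you already have the two ranked lists of document IDs (the constant k=60 is the conventional default, not a requirement):

from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; a higher fused score means a better document."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical ranked outputs from the two retrievers for one query
bm25_hits  = ["doc7", "doc2", "doc9", "doc4"]    # exact keyword matches
dense_hits = ["doc2", "doc5", "doc7", "doc1"]    # semantic matches
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# doc2 and doc7 come out on top because both systems rank them well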
↗ Keep Learning
Cross-Encoders vs Bi-Encoders — Why Accuracy Costs 1000× More Compute
A visual deep dive into the architectural difference between bi-encoders and cross-encoders. Why cross-attention produces higher-quality relevance scores, and why cross-encoders can only be used for reranking — never for retrieval.
Search Reranking — The Two-Stage Pipeline That Powers Production Search
A visual deep dive into the retrieve + rerank pipeline. How BM25, dense retrieval, and learned sparse retrieval feed into Reciprocal Rank Fusion, then cross-encoder reranking — with a full latency budget breakdown.
Approximate Nearest Neighbor Search — Trading 1% Accuracy for 1000× Speed
A visual deep dive into ANN search. Why brute-force nearest neighbor fails at scale, how approximate methods achieve 99% recall with logarithmic query time, and the fundamental accuracy-speed tradeoff behind every vector search system.
Naive Bayes — Why 'Stupid' Assumptions Work Brilliantly
A scroll-driven visual deep dive into Naive Bayes. Learn Bayes' theorem, why the 'naive' independence assumption is wrong but works anyway, and why it dominates spam filtering.