
Text Similarity — From Jaccard to Neural Matching

A scroll-driven visual deep dive into text similarity. Learn how search engines detect duplicates, match queries to documents, and measure how 'close' two texts really are — from set overlap to cosine similarity to learned embeddings.

Introduction

These texts are 87% similar. Says who?

“The cat sat on the mat” and “A kitten rested on the rug” share no content words, yet they describe the same scene. How do machines quantify textual similarity? The answer depends entirely on what you mean by “similar.”

Below, we explore the spectrum from character-level to semantic similarity.


Level 0: Exact & Near-Exact Matching

The simplest similarity: do the strings match character by character?

Exact match: string_a == string_b. O(n): fast, but useless for fuzzy matching.
Edit distance (Levenshtein): minimum number of edits to transform A into B. O(nm): good for typo detection and spell check.
Normalized edit-distance similarity: sim(a,b) = 1 - edit_dist(a,b) / max(|a|, |b|)
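The normalized formula above can be implemented directly. A minimal sketch in Python (standard dynamic-programming Levenshtein, no external libraries):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b. O(nm) time."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if chars match)
            ))
        prev = curr
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Normalized similarity: 1 - edit_dist(a,b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

For example, `levenshtein("kitten", "sitting")` is the classic 3-edit case, so `edit_similarity("kitten", "sitting")` gives 1 - 3/7.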

Level 1: Set-Based Similarity

Jaccard Similarity — overlap of word sets

1. A = {the, cat, sat, on, mat} (document A as a set of words)
2. B = {the, dog, sat, on, log} (document B as a set of words)
3. A ∩ B = {the, sat, on} (words appearing in BOTH documents = 3)
4. A ∪ B = {the, cat, sat, on, mat, dog, log} (all unique words across both = 7)
5. J(A,B) = |A ∩ B| / |A ∪ B| = 3/7 ≈ 0.43 (the fraction of shared words out of all words used)
[Venn diagram] A only: {cat, mat} · B only: {dog, log} · A ∩ B: {the, sat, on} · J = 3/7 ≈ 0.43
Jaccard similarity = the intersection area / the total area of the Venn diagram
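The computation above is a few lines of Python, treating each document as a lowercase word set:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts: treat as identical
    return len(set_a & set_b) / len(set_a | set_b)

# The running example: 3 shared words out of 7 total
score = jaccard("the cat sat on the mat", "the dog sat on the log")
```

Here `score` is exactly 3/7: the two occurrences of “the” in document A collapse into one set element, which is why Jaccard ignores word frequency entirely.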
🟢 Quick Check

Two news articles about the same event share only 15% of their words (Jaccard = 0.15). But they describe the identical event. How is this possible?


Level 2: Vector Space Similarity

Cosine Similarity — the angle between document vectors

1. Represent each document as a TF-IDF vector (each dimension = one word in the vocabulary; value = TF-IDF weight)
2. cos(θ) = (A⃗ · B⃗) / (‖A⃗‖ × ‖B⃗‖) (dot product divided by the product of magnitudes)
3. cos(θ) = 1 → identical direction, same content (vectors point the same way → maximum similarity)
4. cos(θ) = 0 → orthogonal, unrelated content (no word overlap → vectors are perpendicular)
5. cos(θ) = -1 → opposite direction (impossible with TF-IDF: weights are non-negative, so cosine stays in [0, 1])
❌ Euclidean distance: sensitive to magnitude (length). Long docs → larger vectors → farther away, so two docs about the same topic look different.
✓ Cosine similarity: only measures direction (angle). Normalized, so length doesn’t matter: a tweet and a book chapter can match!
Why cosine wins for text: documents vary wildly in length. Cosine measures the PROPORTION of words, not the COUNT, so a 10-word abstract and a 10,000-word paper about the same topic score high.
Cosine similarity normalizes for document length — a 10-page and 1-page document can be equally similar
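A minimal sketch of the cosine formula in Python. For brevity it weights terms by raw counts rather than full TF-IDF, which is enough to demonstrate the length-invariance property:

```python
from collections import Counter
import math

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over raw term-count vectors.
    (Real retrieval systems would weight each count by TF-IDF first.)"""
    vec_a, vec_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Length invariance in action: `cosine_sim("cat dog", "cat dog cat dog cat dog")` returns 1.0, because tripling the document scales the vector's magnitude without changing its direction.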
🟡 Checkpoint

Two documents have cosine similarity of 0.95 using TF-IDF vectors. But one is about dogs and the other is about cats. How?


Scalable Similarity: MinHash & SimHash

Document A: word set {the, cat, sat, …} with 50,000 unique words.
Apply 100 hash functions, keeping the minimum of each → signature A = [42, 17, 93, 5, …]: just 100 integers! 50K dims → 100 dims (500× compression).
Compare signatures: the fraction of matching positions out of 100 between SigA and SigB ≈ Jaccard similarity!
Key insight, Locality Sensitive Hashing (LSH): split each signature into bands; if ANY band matches, the pair becomes a candidate. Similar documents are likely to share at least one band (high recall); dissimilar documents almost never share a band (high precision).
MinHash approximates Jaccard similarity using compact signatures — no pairwise comparison needed
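A toy MinHash sketch in Python. The seeded hash functions here (blake2b with a salt prefix) are one illustrative choice, not the scheme any particular production system uses:

```python
import hashlib

def word_hash(word: str, seed: int) -> int:
    """Deterministic 64-bit hash of a word under a given seed (salt)."""
    digest = hashlib.blake2b(f"{seed}:{word}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(words: set, num_hashes: int = 100) -> list:
    """For each of num_hashes seeded hash functions, keep the minimum
    hash over the word set. Assumes a non-empty set."""
    return [min(word_hash(w, seed) for w in words) for seed in range(num_hashes)]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions ≈ true Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def lsh_band_keys(sig: list, bands: int = 20) -> list:
    """Split a signature into bands; two docs whose signatures share
    ANY band key become a candidate pair (no all-pairs comparison)."""
    rows = len(sig) // bands
    return [tuple(sig[i * rows:(i + 1) * rows]) for i in range(bands)]
```

With 100 hash functions the estimate for the cat/dog example fluctuates around the true 3/7; more hash functions tighten it at the cost of a longer signature.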

Level 3: Semantic Similarity with Embeddings

📝 Text A → 🧠 Encoder (Sentence-BERT) → 📐 Vector A (384 dims)
📝 Text B → 🧠 Encoder (same model) → 📐 Vector B (384 dims)
📊 Then compare: cosine similarity cos(A, B)
Sentence embeddings compress meaning into dense vectors — then cosine similarity measures semantic closeness
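The pipeline above can be sketched as follows. The commented-out lines show how real embeddings would be produced with the sentence-transformers library (the model "all-MiniLM-L6-v2" yields 384-dim vectors); the hand-written toy vectors below are stand-ins for encoder output, so only the shape of the computation is illustrated:

```python
import math

# Real embeddings would come from a sentence encoder (not run here,
# since it requires downloading a model):
#
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
#   vec_a, vec_b = model.encode(["my account is locked", "I cannot log in"])

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

vec_a = [0.8, 0.1, 0.3]   # toy stand-in embedding for text A
vec_b = [0.7, 0.2, 0.4]   # toy stand-in embedding for text B
score = cosine(vec_a, vec_b)
```

The key difference from the TF-IDF level is where the vectors come from: dimensions no longer correspond to individual words, so texts with zero word overlap can still point in nearly the same direction.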
Character: edit distance (typo detection)
Word set: Jaccard, MinHash (dedup, plagiarism)
Weighted vector: TF-IDF cosine (document retrieval)
Semantic: Sentence-BERT (meaning matching)
← surface form … deep meaning →
The similarity spectrum — each level captures different aspects of textual closeness
🟡 Checkpoint

An email service wants to group similar customer support tickets. Tickets use wildly different language for the same issue: 'my account is locked', 'can't log in', 'password not working', 'access denied'. Which similarity approach works best?


Real-World Applications

🔍 Search & Retrieval: query-document matching; “related searches” suggestions; RAG (finding relevant context for LLMs).
📧 Email Intelligence: thread detection (grouping related emails); template detection (one template, many emails); duplicate email dedup.
📰 Content Platforms: near-duplicate news detection; plagiarism detection (Turnitin); “you might also like” recommendations.
⚖️ Legal & Compliance: contract clause comparison; patent prior-art search; case law similarity matching.
Text similarity powers features across search, email, legal, and content platforms
🔴 Challenge

You're building plagiarism detection for a university: 50,000 submissions/semester. Comparing every pair = 1.25 billion comparisons. How do you make this feasible?

🎓 What You Now Know

Similarity is a spectrum — from character-level (edit distance) to word-set (Jaccard) to weighted vector (TF-IDF cosine) to semantic (embeddings).

Cosine similarity normalizes for document length — measuring direction (topic) not magnitude (length), making it the default for text.

MinHash and SimHash enable web-scale dedup — compressing documents into compact signatures for approximate similarity without pairwise comparison.

Semantic similarity captures meaning — Sentence-BERT finds that “car” and “automobile” are similar even though they share no characters.

Choose the right level — typo detection needs edit distance, dedup needs MinHash, search needs TF-IDF/BM25, and understanding needs embeddings.

Text similarity is the fundamental operation powering search, deduplication, recommendation, and retrieval. What “similar” means depends entirely on your task — and choosing the right measure is half the battle. 🔗
