Text Similarity — From Jaccard to Neural Matching
A scroll-driven visual deep dive into text similarity. Learn how search engines detect duplicates, match queries to documents, and measure how 'close' two texts really are — from set overlap to cosine similarity to learned embeddings.
🔗
These texts are
87% similar.
Says who?
“The cat sat on the mat” and “A kitten rested on the rug” share no content words, yet they describe the same scene. How do machines quantify textual similarity? The answer depends entirely on what you mean by “similar.”
↓ Scroll to explore the spectrum from character-level to semantic similarity
Level 0: Exact & Near-Exact Matching
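Exact matching is simple string or hash equality; near-exact matching is what the summary at the end calls edit (Levenshtein) distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Here is a minimal dynamic-programming sketch in Python (the function name and example strings are illustrative, not taken from the article's demo):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character edits turning a into b."""
    # At the start of row i, prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca from a
                            curr[j - 1] + 1,      # insert cb into a
                            prev[j - 1] + cost))  # substitute (or keep a match)
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))        # 3
print(edit_distance("similarity", "similarty"))  # 1: a single dropped character (typo)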
Level 1: Set-Based Similarity
Jaccard Similarity — overlap of word sets
A = {the, cat, sat, on, mat}
B = {the, dog, sat, on, log}
A ∩ B = {the, sat, on}
A ∪ B = {the, cat, sat, on, mat, dog, log}
J(A,B) = |A ∩ B| / |A ∪ B| = 3/7 ≈ 0.43

Two news articles about the same event share only 15% of their words (Jaccard = 0.15). But they describe the identical event. How is this possible?
💡 Can two sentences mean the same thing while using completely different words?
This is the fundamental limitation of surface-level similarity: lexical diversity. Reporters choose different words, use different phrasings, and structure sentences differently even when describing the same event. 'Blaze engulfs tower block' and 'Fire destroys apartment building' have Jaccard ≈ 0 (no shared words at all) but are semantically identical. This is why pure word-overlap measures fail for tasks like paraphrase detection and grouping stories about the same event, and why we need vector-based and semantic similarity.
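To make the formula concrete, here is a minimal word-level Jaccard similarity in pure Python (lowercasing and whitespace tokenization are simplifying assumptions, not the article's exact tokenizer):

```python
def jaccard(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the two texts' word sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(jaccard("the cat sat on the mat", "the dog sat on the log"))               # 3/7 ≈ 0.43
print(jaccard("Blaze engulfs tower block", "Fire destroys apartment building"))  # 0.0
```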
Level 2: Vector Space Similarity
Cosine Similarity — the angle between document vectors
Represent each document as a TF-IDF vector.
cos(θ) = (A⃗ · B⃗) / (‖A⃗‖ × ‖B⃗‖)
cos(θ) = 1 → identical direction (same content)
cos(θ) = 0 → orthogonal (unrelated content)
cos(θ) = -1 → opposite direction (impossible with TF-IDF, whose components are all ≥ 0)

Two documents have cosine similarity of 0.95 using TF-IDF vectors. But one is about dogs and the other is about cats. How?
💡 How much vocabulary do two pet care guides actually share?
TF-IDF cosine similarity measures word usage patterns, not semantic topics. Two pet care guides — one about dogs, one about cats — share 90%+ of their vocabulary: 'pet,' 'veterinarian,' 'grooming,' 'feeding,' 'breed,' 'training,' etc. Only a few dimensions (dog/cat, bark/meow, leash/litter) differ. With thousands of shared words and only a handful different, cosine similarity will be very high. This is both a strength (they ARE similar documents) and a limitation (you might want to distinguish them).
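A minimal sketch of TF-IDF cosine similarity using scikit-learn, assuming it is installed (the toy documents are illustrative, not the article's own examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Feeding, grooming and training your dog. Visit the veterinarian every year.",
    "Feeding, grooming and training your cat. Visit the veterinarian every year.",
    "Quarterly earnings rose as the company beat its revenue forecast.",
]

# Each row of X is an L2-normalized TF-IDF vector for one document.
X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X)

print(round(sim[0, 1], 2))  # high: the two pet guides share almost all their vocabulary
print(round(sim[0, 2], 2))  # near zero: the finance text shares almost no vocabulary
```

Only the dog/cat dimensions differ between the first two vectors, so the angle between them stays small, which is exactly the effect described above.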
Scalable Similarity: MinHash & SimHash
Level 3: Semantic Similarity with Embeddings
An email service wants to group similar customer support tickets. Tickets use wildly different language for the same issue: 'my account is locked', 'can't log in', 'password not working', 'access denied'. Which similarity approach works best?
💡 Do these four phrases share any words? What kind of similarity can see past the word differences?
These four phrases share almost no words, so Jaccard ≈ 0 and TF-IDF cosine would be very low. But they all mean the same thing: 'I can't access my account.' Sentence-BERT encodes meaning, not words — so 'account locked,' 'can't log in,' and 'access denied' would all map to nearby vectors in embedding space. This is exactly the type of problem semantic similarity was built for, and it's how modern helpdesk systems (Zendesk, Freshdesk) automatically group and route tickets.
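A minimal sketch with the sentence-transformers library, assuming it is installed; the all-MiniLM-L6-v2 model is an arbitrary small choice, not one prescribed by the article:

```python
from sentence_transformers import SentenceTransformer, util

tickets = [
    "my account is locked",
    "can't log in",
    "password not working",
    "access denied",
    "how do I export my invoices to CSV?",
]

# Encode each ticket as a dense vector that reflects meaning rather than surface words.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(tickets, convert_to_tensor=True)

# Pairwise cosine similarity between all ticket embeddings.
sim = util.cos_sim(embeddings, embeddings)
print(sim)

# The four access-related tickets should score noticeably higher with each other
# than any of them does with the unrelated invoice-export ticket.
```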
Real-World Applications
You're building plagiarism detection for a university: 50,000 submissions/semester. Comparing every pair = 1.25 billion comparisons. How do you make this feasible?
💡 Is there a way to avoid comparing ALL pairs while still finding the similar ones?
MinHash + LSH is the standard for near-duplicate detection at scale. MinHash compresses each document's shingle set into a compact signature preserving Jaccard similarity. LSH hashes signatures into buckets so similar documents land together with high probability. You only compare within-bucket pairs, reducing O(n²) to roughly O(n). Turnitin, Google News dedup, and web crawlers all use this. The trade-off: small probability of missing similar pairs, tunable by adjusting hash functions and band count.
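A toy MinHash sketch in pure Python showing how compact signatures approximate Jaccard similarity; the shingle size, signature length, and seeded-hash construction are illustrative choices, and the LSH banding step described above is omitted for brevity:

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """All k-word shingles (k consecutive words) of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """For each seeded hash function, keep the minimum hash value over the set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in items
        ))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions ≈ Jaccard similarity of the shingle sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox leaps over the lazy dog near the river bank"
sig1 = minhash_signature(shingles(doc1))
sig2 = minhash_signature(shingles(doc2))
print(round(estimated_jaccard(sig1, sig2), 2))  # close to the true Jaccard of the shingle sets
```

With LSH, these 128-value signatures would be split into bands and hashed into buckets, so only documents that collide in at least one bucket are ever compared directly.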
🎓 What You Now Know
✓ Similarity is a spectrum — from character-level (edit distance) to word-set (Jaccard) to weighted vector (TF-IDF cosine) to semantic (embeddings).
✓ Cosine similarity normalizes for document length — measuring direction (topic) not magnitude (length), making it the default for text.
✓ MinHash and SimHash enable web-scale dedup — compressing documents into compact signatures for approximate similarity without pairwise comparison.
✓ Semantic similarity captures meaning — Sentence-BERT finds that “car” and “automobile” are similar even though they have almost nothing in common at the surface level.
✓ Choose the right level — typo detection needs edit distance, dedup needs MinHash, search needs TF-IDF/BM25, and understanding needs embeddings.
Text similarity is the fundamental operation powering search, deduplication, recommendation, and retrieval. What “similar” means depends entirely on your task — and choosing the right measure is half the battle. 🔗
↗ Keep Learning
Word Embeddings — When Words Learned to Be Vectors
A scroll-driven visual deep dive into word embeddings. Learn how Word2Vec, GloVe, and FastText turn words into dense vectors where meaning becomes geometry — and why 'king - man + woman = queen' actually works.
Bag of Words & TF-IDF — How Search Engines Ranked Before AI
A scroll-driven visual deep dive into Bag of Words and TF-IDF. Learn how documents become vectors, why term frequency alone fails, and how IDF rescues relevance — the backbone of search before neural models.
Information Retrieval — How Search Engines Find Your Needle in a Billion Haystacks
A scroll-driven visual deep dive into information retrieval. From inverted indices to BM25 to learning-to-rank — learn how Google, Bing, and enterprise search find the most relevant documents in milliseconds.
Vector Databases — Search by Meaning, Not Keywords
A visual deep dive into vector databases. From embeddings to ANN search to HNSW — understand how AI-powered search finds what you actually mean, not just what you typed.