Text Similarity — From Jaccard to Neural Matching
A scroll-driven visual deep dive into text similarity. Learn how search engines detect duplicates, match queries to documents, and measure how 'close' two texts really are — from set overlap to cosine similarity to learned embeddings.
🔗
These texts are
87% similar.
Says who?
“The cat sat on the mat” and “A kitten rested on the rug” share no content words, yet they describe the same scene. How do machines quantify textual similarity? The answer depends entirely on what you mean by “similar.”
↓ Scroll to explore the spectrum from character-level to semantic similarity
Level 0: Exact & Near-Exact Matching
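Exact matching is simple string or hash equality; near-exact matching is what the summary at the end calls edit (Levenshtein) distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Here is a minimal dynamic-programming sketch in Python (the function name and example strings are illustrative, not taken from the article's demo):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character edits turning a into b."""
    # At the start of row i, prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca from a
                            curr[j - 1] + 1,      # insert cb into a
                            prev[j - 1] + cost))  # substitute (or keep a match)
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))        # 3
print(edit_distance("similarity", "similarty"))  # 1: a single dropped character (typo)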
Level 1: Set-Based Similarity
Jaccard Similarity — overlap of word sets
A = {the, cat, sat, on, mat}
B = {the, dog, sat, on, log}
A ∩ B = {the, sat, on}
A ∪ B = {the, cat, sat, on, mat, dog, log}
J(A,B) = |A ∩ B| / |A ∪ B| = 3/7 ≈ 0.43

Two news articles about the same event share only 15% of their words (Jaccard = 0.15). But they describe the identical event. How is this possible?
💡 Can two sentences mean the same thing while using completely different words?
This is the fundamental limitation of surface-level similarity: lexical diversity. Reporters choose different words, use different phrasings, and structure sentences differently even when describing the same event. 'Blaze engulfs tower block' and 'Fire destroys apartment building' have Jaccard ≈ 0 (no shared words at all) but are semantically identical. This is why pure word-overlap measures fail for tasks like paraphrase detection and grouping stories about the same event, and why we need vector-based and semantic similarity.
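To make the formula concrete, here is a minimal word-level Jaccard similarity in pure Python (lowercasing and whitespace tokenization are simplifying assumptions, not the article's exact tokenizer):

```python
def jaccard(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the two texts' word sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(jaccard("the cat sat on the mat", "the dog sat on the log"))               # 3/7 ≈ 0.43
print(jaccard("Blaze engulfs tower block", "Fire destroys apartment building"))  # 0.0
```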
Level 2: Vector Space Similarity
Cosine Similarity — the angle between document vectors
Represent each document as a TF-IDF vector.
cos(θ) = (A⃗ · B⃗) / (‖A⃗‖ × ‖B⃗‖)
cos(θ) = 1 → identical direction (same content)
cos(θ) = 0 → orthogonal (unrelated content)
cos(θ) = -1 → opposite direction (impossible with TF-IDF, whose components are all ≥ 0)

Two documents have cosine similarity of 0.95 using TF-IDF vectors. But one is about dogs and the other is about cats. How?
💡 How much vocabulary do two pet care guides actually share?
TF-IDF cosine similarity measures word usage patterns, not semantic topics. Two pet care guides — one about dogs, one about cats — share 90%+ of their vocabulary: 'pet,' 'veterinarian,' 'grooming,' 'feeding,' 'breed,' 'training,' etc. Only a few dimensions (dog/cat, bark/meow, leash/litter) differ. With thousands of shared words and only a handful different, cosine similarity will be very high. This is both a strength (they ARE similar documents) and a limitation (you might want to distinguish them).
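A minimal sketch of TF-IDF cosine similarity using scikit-learn, assuming it is installed (the toy documents are illustrative, not the article's own examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Feeding, grooming and training your dog. Visit the veterinarian every year.",
    "Feeding, grooming and training your cat. Visit the veterinarian every year.",
    "Quarterly earnings rose as the company beat its revenue forecast.",
]

# Each row of X is an L2-normalized TF-IDF vector for one document.
X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X)

print(round(sim[0, 1], 2))  # high: the two pet guides share almost all their vocabulary
print(round(sim[0, 2], 2))  # near zero: the finance text shares almost no vocabulary
```

Only the dog/cat dimensions differ between the first two vectors, so the angle between them stays small, which is exactly the effect described above.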
Scalable Similarity: MinHash & SimHash
Level 3: Semantic Similarity with Embeddings
An email service wants to group similar customer support tickets. Tickets use wildly different language for the same issue: 'my account is locked', 'can't log in', 'password not working', 'access denied'. Which similarity approach works best?
💡 Do these four phrases share any words? What kind of similarity can see past the word differences?
These four phrases share almost no words, so Jaccard ≈ 0 and TF-IDF cosine would be very low. But they all mean the same thing: 'I can't access my account.' Sentence-BERT encodes meaning, not words — so 'account locked,' 'can't log in,' and 'access denied' would all map to nearby vectors in embedding space. This is exactly the type of problem semantic similarity was built for, and it's how modern helpdesk systems (Zendesk, Freshdesk) automatically group and route tickets.
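A minimal sketch with the sentence-transformers library, assuming it is installed; the all-MiniLM-L6-v2 model is an arbitrary small choice, not one prescribed by the article:

```python
from sentence_transformers import SentenceTransformer, util

tickets = [
    "my account is locked",
    "can't log in",
    "password not working",
    "access denied",
    "how do I export my invoices to CSV?",
]

# Encode each ticket as a dense vector that reflects meaning rather than surface words.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(tickets, convert_to_tensor=True)

# Pairwise cosine similarity between all ticket embeddings.
sim = util.cos_sim(embeddings, embeddings)
print(sim)

# The four access-related tickets should score noticeably higher with each other
# than any of them does with the unrelated invoice-export ticket.
```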
Real-World Applications
You're building plagiarism detection for a university: 50,000 submissions/semester. Comparing every pair = 1.25 billion comparisons. How do you make this feasible?
💡 Is there a way to avoid comparing ALL pairs while still finding the similar ones?
MinHash + LSH is the standard for near-duplicate detection at scale. MinHash compresses each document's shingle set into a compact signature preserving Jaccard similarity. LSH hashes signatures into buckets so similar documents land together with high probability. You only compare within-bucket pairs, reducing O(n²) to roughly O(n). Turnitin, Google News dedup, and web crawlers all use this. The trade-off: small probability of missing similar pairs, tunable by adjusting hash functions and band count.
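A toy MinHash sketch in pure Python showing how compact signatures approximate Jaccard similarity; the shingle size, signature length, and seeded-hash construction are illustrative choices, and the LSH banding step described above is omitted for brevity:

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """All k-word shingles (k consecutive words) of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """For each seeded hash function, keep the minimum hash value over the set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in items
        ))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions ≈ Jaccard similarity of the shingle sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox leaps over the lazy dog near the river bank"
sig1 = minhash_signature(shingles(doc1))
sig2 = minhash_signature(shingles(doc2))
print(round(estimated_jaccard(sig1, sig2), 2))  # close to the true Jaccard of the shingle sets
```

With LSH, these 128-value signatures would be split into bands and hashed into buckets, so only documents that collide in at least one bucket are ever compared directly.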
🎓 What You Now Know
✓ Similarity is a spectrum — from character-level (edit distance) to word-set (Jaccard) to weighted vector (TF-IDF cosine) to semantic (embeddings).
✓ Cosine similarity normalizes for document length — measuring direction (topic) not magnitude (length), making it the default for text.
✓ MinHash and SimHash enable web-scale dedup — compressing documents into compact signatures for approximate similarity without pairwise comparison.
✓ Semantic similarity captures meaning — Sentence-BERT finds that “car” and “automobile” are similar even though they have almost nothing in common at the surface level.
✓ Choose the right level — typo detection needs edit distance, dedup needs MinHash, search needs TF-IDF/BM25, and understanding needs embeddings.
Text similarity is the fundamental operation powering search, deduplication, recommendation, and retrieval. What “similar” means depends entirely on your task — and choosing the right measure is half the battle. 🔗
↗ Keep Learning
Word Embeddings — When Words Learned to Be Vectors
A scroll-driven visual deep dive into word embeddings. Learn how Word2Vec, GloVe, and FastText turn words into dense vectors where meaning becomes geometry — and why 'king - man + woman = queen' actually works.
Bag of Words & TF-IDF — How Search Engines Ranked Before AI
A scroll-driven visual deep dive into Bag of Words and TF-IDF. Learn how documents become vectors, why term frequency alone fails, and how IDF rescues relevance — the backbone of search before neural models.
Information Retrieval — How Search Engines Find Your Needle in a Billion Haystacks
A scroll-driven visual deep dive into information retrieval. From inverted indices to BM25 to learning-to-rank — learn how Google, Bing, and enterprise search find the most relevant documents in milliseconds.
Vector Databases — Search by Meaning, Not Keywords
A visual deep dive into vector databases. From embeddings to ANN search to HNSW — understand how AI-powered search finds what you actually mean, not just what you typed.