Word Embeddings — When Words Learned to Be Vectors
A scroll-driven visual deep dive into word embeddings. Learn how Word2Vec, GloVe, and FastText turn words into dense vectors where meaning becomes geometry — and why 'king - man + woman = queen' actually works.
👑
king − man + woman
= queen
In 2013, Tomas Mikolov at Google showed that words could be represented as vectors where meaning becomes math. This single idea — word embeddings — sparked the entire modern NLP revolution, from BERT to GPT.
↓ Scroll to understand how words learned geometry
The Problem: Words Without Meaning
In one-hot encoding, what is the cosine similarity between 'happy' and 'joyful'?
💡 In one-hot, each word occupies its own unique dimension...
In one-hot encoding, each word gets its own dimension. 'Happy' is [0,0,...,1,...,0] and 'joyful' is [0,...,1,...,0,0] with the 1 in completely different positions. Their dot product = 0, so cosine similarity = 0. The model literally cannot tell that these words are synonyms. This is the fundamental failure that word embeddings solve: similar words get similar vectors, so cosine('happy', 'joyful') ≈ 0.9.
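A minimal sketch of this gap, assuming only NumPy. The vocabulary indices and the 4-dimensional dense vectors below are made up for illustration; they stand in for learned embeddings.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot: a toy 10-word vocabulary where 'happy' is index 3 and 'joyful' is index 7.
vocab_size = 10
happy_onehot = np.zeros(vocab_size)
joyful_onehot = np.zeros(vocab_size)
happy_onehot[3] = 1.0
joyful_onehot[7] = 1.0
print(cosine(happy_onehot, joyful_onehot))   # 0.0 — the model sees no similarity at all

# Dense embeddings (made-up values standing in for learned vectors):
happy_dense = np.array([0.8, 0.1, 0.6, -0.2])
joyful_dense = np.array([0.7, 0.2, 0.5, -0.1])
print(round(cosine(happy_dense, joyful_dense), 2))   # ~0.99: synonyms land close together
```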
Word2Vec: Learning Meaning from Context
Word2Vec Skip-Gram objective
maximize Σ log P(w_context | w_center)
P(w_context | w_center) = softmax(v_context · v_center)
Problem: softmax over 50,000 words is slow!
Fix: negative sampling — only update ~5 random 'wrong' words
Word2Vec learns that 'dog' and 'cat' have similar vectors. How does it learn this without any explicit labels?
💡 What's the training task? Does it require any human annotations?
Word2Vec never sees any explicit similarity labels. It learns from raw text using the distributional hypothesis: words with similar contexts get similar vectors. Because 'dog' and 'cat' both appear near words like 'pet', 'food', 'walks', 'cute', their embedding vectors gradually converge during training. It's purely self-supervised — the training signal comes from predicting context words from billions of sentences.
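Here is a short sketch of that self-supervised setup, assuming the gensim library (its Word2Vec class implements skip-gram with negative sampling). The five-sentence corpus is invented, so the printed similarities are only illustrative; real embeddings need millions to billions of tokens.

```python
from gensim.models import Word2Vec

# A toy corpus: no labels anywhere, just raw tokenized sentences.
# 'dog' and 'cat' appear in similar contexts (pet, cute, food, walks/naps).
sentences = [
    ["the", "dog", "is", "a", "cute", "pet"],
    ["the", "cat", "is", "a", "cute", "pet"],
    ["my", "dog", "loves", "food", "and", "walks"],
    ["my", "cat", "loves", "food", "and", "naps"],
    ["the", "stock", "market", "fell", "sharply", "today"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the dense vectors
    window=3,         # context window around each center word
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # skip-gram: predict context words from the center word
    negative=5,       # negative sampling: update only ~5 random 'wrong' words
    epochs=200,       # many passes, since the corpus is tiny
    seed=42,
)

# The only training signal was "predict nearby words", yet related words drift together.
print(model.wv.similarity("dog", "cat"))     # tends to be high on this toy data
print(model.wv.similarity("dog", "stock"))   # tends to be lower
```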
The Aha Moment: Word Arithmetic
Word2Vec learns vec('Tokyo') - vec('Japan') + vec('France') ≈ vec('Paris'). What geometric relationship does this reveal?
💡 What happens when you subtract a country from its capital? Do you always get the same direction?
The embedding space organizes concepts such that the vector offset country→capital is consistent across different countries. Tokyo-Japan ≈ Paris-France ≈ Berlin-Germany. This means the model has learned an abstract 'capital-of' relationship as a DIRECTION in high-dimensional space. This works for many relationships: male→female, present→past tense, singular→plural. The model was never told these relationships exist — it discovered them from word co-occurrence patterns.
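If you want to try this yourself, here is a sketch assuming gensim plus its gensim.downloader module, which fetches pretrained vectors over the network ('glove-wiki-gigaword-100' is one of its published datasets, with a lowercased vocabulary). The exact neighbors depend on the pretrained vectors, so the expected outputs are typical rather than guaranteed.

```python
import gensim.downloader as api

# Download pretrained 100-dimensional GloVe vectors (lowercased vocabulary).
wv = api.load("glove-wiki-gigaword-100")

# vec('king') − vec('man') + vec('woman') ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is typically the top hit.

# The same offset trick for the capital-of direction:
# vec('tokyo') − vec('japan') + vec('france') ≈ ?
print(wv.most_similar(positive=["tokyo", "france"], negative=["japan"], topn=3))
# 'paris' typically ranks first.
```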
Beyond Word2Vec: GloVe and FastText
GloVe's elegant objective
X_ij = co-occurrence count of word i with word j
minimize Σ f(X_ij) × (v_i · v_j + b_i + b_j − log X_ij)²
f(X_ij) = weighting function (down-weights rare co-occurrences, caps the influence of very frequent ones)
A user types 'reccommendation systms' (two misspellings) into a search engine. Which embedding approach handles this best?
💡 What happens when you break 'reccommendation' into character n-grams?
FastText represents each word as a bag of character n-grams. 'reccommendation' shares n-grams like 'rec', 'com', 'men', 'dat', 'tion' with 'recommendation'. The misspelled version's embedding is an average of its n-gram embeddings — and most n-grams overlap with the correct spelling! Word2Vec and GloVe have no mechanism for this: a misspelled word is simply OOV (out of vocabulary). Google Search uses subword approaches for exactly this reason.
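A sketch of that subword robustness, assuming gensim's FastText class. The three-sentence corpus is invented and min_n/max_n are set to the usual 3-to-6 character range, so the printed similarity only illustrates the mechanism.

```python
from gensim.models import FastText

# Train subword-aware embeddings on a tiny invented corpus.
sentences = [
    ["recommendation", "systems", "suggest", "relevant", "items"],
    ["search", "engines", "rank", "recommendation", "results"],
    ["users", "expect", "good", "recommendation", "quality"],
]

model = FastText(
    sentences,
    vector_size=50,
    min_count=1,
    min_n=3,    # smallest character n-gram
    max_n=6,    # largest character n-gram
    epochs=50,
    seed=42,
)

# 'reccommendation' never appears in the corpus, so it is out of vocabulary...
print("reccommendation" in model.wv.key_to_index)   # False

# ...but FastText still builds a vector for it from its character n-grams,
# most of which overlap with the correctly spelled word.
print(model.wv.similarity("reccommendation", "recommendation"))   # high n-gram overlap, high similarity
```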
Real-World Applications
🎓 What You Now Know
✓ One-hot encoding is meaningless — every word is equidistant from every other word. No notion of similarity.
✓ Word2Vec learns from context — “you shall know a word by the company it keeps.” Words in similar contexts get similar vectors.
✓ Embeddings encode relationships as directions — king-queen, France-Paris, big-bigger all live in consistent geometric directions.
✓ GloVe uses global statistics; FastText handles unknown words — different strengths for different use cases.
✓ Embeddings have bias — they learn human stereotypes from training text. Debiasing is critical for fair systems.
✓ Static embeddings led to contextual embeddings — BERT and GPT produce context-dependent vectors, solving the polysemy problem.
Word embeddings were the bridge from classical NLP to the deep learning era. They proved that meaning could be captured as geometry — an insight that powers every modern language model. 👑
↗ Keep Learning
Bag of Words & TF-IDF — How Search Engines Ranked Before AI
A scroll-driven visual deep dive into Bag of Words and TF-IDF. Learn how documents become vectors, why term frequency alone fails, and how IDF rescues relevance — the backbone of search before neural models.
Vector Databases — Search by Meaning, Not Keywords
A visual deep dive into vector databases. From embeddings to ANN search to HNSW — understand how AI-powered search finds what you actually mean, not just what you typed.
Transformers — The Architecture That Changed AI
A scroll-driven visual deep dive into the Transformer architecture. From RNNs to self-attention to GPT — understand the engine behind every modern AI model.
Text Similarity — From Jaccard to Neural Matching
A scroll-driven visual deep dive into text similarity. Learn how search engines detect duplicates, match queries to documents, and measure how 'close' two texts really are — from set overlap to cosine similarity to learned embeddings.