16 min deep dive · NLP · representation learning

Word Embeddings — When Words Learned to Be Vectors

A scroll-driven visual deep dive into word embeddings. Learn how Word2Vec, GloVe, and FastText turn words into dense vectors where meaning becomes geometry — and why 'king - man + woman = queen' actually works.

Introduction

👑

king − man + woman = queen

In 2013, Tomas Mikolov and his colleagues at Google showed that words could be represented as vectors where meaning becomes math. This single idea — word embeddings — sparked the entire modern NLP revolution, from BERT to GPT.

↓ Scroll to understand how words learned geometry

Sparse vs Dense

The Problem: Words Without Meaning

One-Hot (Sparse)
king  = [0,0,0,1,0,0,…,0]
queen = [0,0,0,0,1,0,…,0]
dog   = [0,0,0,0,0,1,…,0]
50,000 dims • 99.998% zeros • no similarity

Embedding (Dense)
king  = [0.2, 0.9, -0.5, 0.1]
queen = [0.3, 0.8, -0.4, 0.7]
dog   = [-0.6, 0.1, 0.8, -0.3]
300 dims • all values used • similar = close

The Key Difference
One-hot: cosine(king, queen) = 0 (orthogonal — no relation!)
Embedding: cosine(king, queen) = 0.85 (very similar!)
Embeddings capture meaning as geometry
One-hot vectors are sparse and meaningless; embeddings are dense and meaningful
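To make the contrast concrete, here is a minimal NumPy sketch using the numbers from the figure above (the dense 4-d values are the figure's illustrative toy vectors, not output from a trained model):

```python
# Contrast one-hot and dense representations with cosine similarity.
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab_size = 50_000

# One-hot: each word owns exactly one dimension, so distinct words are orthogonal.
king_onehot = np.zeros(vocab_size)
queen_onehot = np.zeros(vocab_size)
king_onehot[3] = 1.0
queen_onehot[4] = 1.0

# Dense embeddings: every dimension is used, so similar words can end up close together.
king_dense = np.array([0.2, 0.9, -0.5, 0.1])
queen_dense = np.array([0.3, 0.8, -0.4, 0.7])

print(cosine(king_onehot, queen_onehot))  # 0.0   -> no notion of similarity
print(cosine(king_dense, queen_dense))    # ~0.85 -> "king" and "queen" are close
```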
🟢 Quick Check

In one-hot encoding, the cosine similarity between 'happy' and 'joyful' is:

Word2Vec

Word2Vec: Learning Meaning from Context

CBOW (Continuous Bag of Words)
Context words (input): “the”, “cat”, “on” → Neural Net → predict center word: “sat”

Skip-Gram (predict context from word)
Center word (input): “sat” → Neural Net → predict context words: “the”, “cat”, “on”, “the”
Two architectures for learning word embeddings: CBOW predicts the center word, Skip-gram predicts context words
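To see where the training signal comes from, here is a small Python sketch (the sentence and window size are just illustrative) that generates Skip-gram (center, context) pairs with a sliding window; CBOW would instead group each window's context words and predict the center word from them:

```python
# Generate Skip-gram training pairs by sliding a context window over a sentence.
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token in the list."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "the cat sat on the mat".split()
for center, context in skipgram_pairs(sentence):
    print(center, "->", context)
# For the center word "sat" this prints: sat -> the, sat -> cat, sat -> on, sat -> the
```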

Word2Vec Skip-Gram objective

1. maximize Σ log P(context | center)
   For each word in the corpus, predict its surrounding words.
2. P(w_context | w_center) = exp(v_context · v_center) / Σ_w exp(v_w · v_center)
   Dot product of embedding vectors, normalized by a softmax over the vocabulary, gives a probability.
3. Problem: the softmax over 50,000 words is slow!
   Computing the denominator requires summing over the entire vocabulary.
4. Fix: negative sampling — only update ~5 random 'wrong' words.
   Instead of computing over all 50K words, just distinguish the real context word from ~5 random noise words.
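Here is a simplified NumPy sketch of a single Skip-gram update with negative sampling. The vocabulary, dimensions, and hyperparameters are toy values; a real implementation loops over the whole corpus and draws negatives from a smoothed unigram distribution rather than uniformly:

```python
# One Skip-gram-with-negative-sampling gradient step (toy setup).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
word2id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8        # vocabulary size, embedding dimension
lr, k = 0.05, 5             # learning rate, number of negative samples

W_in = rng.normal(0, 0.1, (V, D))    # "center" (input) embeddings
W_out = rng.normal(0, 0.1, (V, D))   # "context" (output) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context):
    """Pull the real (center, context) pair together and push k random
    'noise' words away from the center word."""
    c, o = word2id[center], word2id[context]
    v_c = W_in[c].copy()
    # Uniform sampling for brevity; word2vec samples negatives from a
    # unigram^0.75 distribution and avoids drawing the true context word.
    negatives = rng.integers(0, V, size=k)

    # Positive pair: increase sigmoid(v_o . v_c).
    score = sigmoid(W_out[o] @ v_c)
    grad_c = (score - 1.0) * W_out[o]
    W_out[o] -= lr * (score - 1.0) * v_c

    # Negative pairs: decrease sigmoid(v_n . v_c).
    for n in negatives:
        score_n = sigmoid(W_out[n] @ v_c)
        grad_c += score_n * W_out[n]
        W_out[n] -= lr * score_n * v_c

    W_in[c] -= lr * grad_c

sgns_step("sat", "cat")   # one (center, context) pair from "the cat sat on the mat"
```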
🟡 Checkpoint

Word2Vec learns that 'dog' and 'cat' have similar vectors. How does it learn this without any explicit labels?

Word Arithmetic

The Aha Moment: Word Arithmetic

king − man + woman ≈ queen

What’s actually happening in vector space:
vec(“king”) = royalty + male
vec(“man”) = male
vec(“woman”) = female
(royalty + male) − male + female = royalty + female = queen ✓

More examples that work:
Paris − France + Italy = Rome
bigger − big + small = smaller
Vector arithmetic captures conceptual relationships — the most famous result in NLP
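If you want to try these analogies yourself, the gensim library exposes the query directly. A sketch, assuming gensim is installed and using one of its downloadable pretrained vector sets ("glove-wiki-gigaword-100"); exact results depend on the model you load:

```python
# Analogy queries over pretrained vectors via gensim (the model is fetched on first use).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# "king - man + woman": add the positives, subtract the negatives, then
# return the nearest remaining word by cosine similarity.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
# Expected (model-dependent): something close to "queen" and "rome".
```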
🟡 Checkpoint

Word2Vec learns vec('Tokyo') - vec('Japan') + vec('France') ≈ vec('Paris'). What geometric relationship does this reveal?

GloVe & FastText

Beyond Word2Vec: GloVe and FastText

Word2Vec (Google, 2013)
• Predictive model
• Sliding window
• Local context only
• Skip-gram or CBOW
✓ Simple & effective
✓ Scales to huge corpora
✗ No global statistics
✗ Can’t handle OOV

GloVe (Stanford, 2014)
• Count-based model
• Co-occurrence matrix
• Global + local context
• Matrix factorization
✓ Uses global statistics
✓ Great analogies
✗ Huge co-occurrence matrix
✗ Can’t handle OOV

FastText (Facebook, 2016)
• Subword model
• Character n-grams
• Morphology-aware
• Skip-gram + subwords
✓ Handles OOV words!
✓ Great for morphology
✗ Larger model size
✗ Slower training
Three embedding methods, three design philosophies
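FastText's ability to handle out-of-vocabulary words is easy to see in a few lines: a word's vector is built from its character n-grams, so a misspelled or unseen word still shares most n-grams with the word it resembles. A small Python sketch (the boundary markers and n = 3–6 follow FastText's defaults; the Jaccard overlap here just stands in for the shared-subword effect):

```python
# FastText-style subword decomposition: a word is represented by its
# character n-grams (with "<" and ">" boundary markers), n = 3..6.
def char_ngrams(word, n_min=3, n_max=6):
    token = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams

a = char_ngrams("recommendation")
b = char_ngrams("reccommendation")        # misspelled, likely out-of-vocabulary
overlap = len(a & b) / len(a | b)
print(f"shared n-grams (Jaccard): {overlap:.2f}")
# High overlap -> the misspelling's summed n-gram vector lands near the correct word.
```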

GloVe's elegant objective

1. X_ij = co-occurrence count of word i with word j
   Build a global matrix counting how often words appear near each other.
2. minimize Σ f(X_ij) × (v_i · v_j + b_i + b_j − log X_ij)²
   Find vectors whose dot product (plus two bias terms) approximates the log of the co-occurrence count.
3. f(X_ij) = weighting function
   Down-weights extremely frequent pairs like ('the', 'the').
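A toy NumPy sketch of this loss, using a random co-occurrence matrix and the weighting-function constants from the GloVe paper (x_max = 100, α = 0.75); note that GloVe keeps separate word and context vectors and biases:

```python
# GloVe weighted least-squares loss over a toy co-occurrence matrix.
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 16                                     # toy vocab size, embedding dim
X = rng.integers(0, 50, (V, V)).astype(float)    # toy co-occurrence counts

W = rng.normal(0, 0.1, (V, D))    # word vectors v_i
Wc = rng.normal(0, 0.1, (V, D))   # context vectors v_j
b = np.zeros(V)                   # word biases b_i
bc = np.zeros(V)                  # context biases b_j

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights very frequent pairs, caps at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss():
    loss = 0.0
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:       # only observed co-occurrences contribute
                diff = W[i] @ Wc[j] + b[i] + bc[j] - np.log(X[i, j])
                loss += f(X[i, j]) * diff ** 2
    return loss

print(glove_loss())   # in practice minimized with AdaGrad over the nonzero entries
```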
🔴 Challenge

A user types 'reccommendation systms' (two misspellings) into a search engine. Which embedding approach handles this best?

Applications

Real-World Applications

🧠 Word Embeddings: dense vectors
🔍 Semantic Search: find meaning, not words
🚫 Spam Detection: semantic features
📄 Related Content: similar articles
🏷️ Text Classification: sentiment, topics
💡 Query Suggestion: "Did you mean...?"
Word embeddings power features across search and email systems
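As a concrete example of how semantic search uses embeddings, here is a toy sketch: represent each text as the average of its word vectors, then rank documents by cosine similarity to the query. The tiny 3-d embedding table is hand-made for illustration only:

```python
# Toy semantic search: average word vectors, rank documents by cosine similarity.
import numpy as np

emb = {  # hypothetical 3-d word vectors
    "cheap":   np.array([0.9, 0.1, 0.0]),
    "flights": np.array([0.0, 0.9, 0.2]),
    "low":     np.array([0.8, 0.2, 0.1]),
    "cost":    np.array([0.7, 0.1, 0.1]),
    "airfare": np.array([0.1, 0.8, 0.3]),
    "pasta":   np.array([0.1, 0.0, 0.8]),
    "recipes": np.array([0.0, 0.1, 0.9]),
}

def embed(text):
    """Average the vectors of in-vocabulary words."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = ["low cost airfare", "pasta recipes"]
query = "cheap flights"
ranked = sorted(docs, key=lambda d: cosine(embed(query), embed(d)), reverse=True)
print(ranked)  # "low cost airfare" ranks first despite sharing no words with the query
```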

🎓 What You Now Know

One-hot encoding is meaningless — every word is equidistant from every other word. No notion of similarity.

Word2Vec learns from context — “you shall know a word by the company it keeps.” Words in similar contexts get similar vectors.

Embeddings encode relationships as directions — king-queen, France-Paris, big-bigger all live in consistent geometric directions.

GloVe uses global statistics; FastText handles unknown words — different strengths for different use cases.

Embeddings have bias — they learn human stereotypes from training text. Debiasing is critical for fair systems.

Static embeddings led to contextual embeddings — BERT and GPT produce context-dependent vectors, solving the polysemy problem.

Word embeddings were the bridge from classical NLP to the deep learning era. They proved that meaning could be captured as geometry — an insight that powers every modern language model. 👑
