Language Models & N-grams — Predicting the Next Word Since 1948
A scroll-driven visual deep dive into statistical language models. From Shannon's information theory to N-grams to the bridge to transformers — learn how autocomplete, spell check, and speech recognition all predict what comes next.
🔮
“I want to _____”
Your phone knows what’s next.
Every time your keyboard suggests the next word, every time Google completes your search query, every time Siri transcribes your speech — a language model is predicting what word comes next. This fundamental idea — assigning probabilities to word sequences — is the foundation of all modern NLP, from n-grams to GPT.
↓ Scroll to understand the idea that launched the AI revolution
What Is a Language Model?
The fundamental equation of language modeling
P(w₁, w₂, ..., wₙ) = ?

Apply the chain rule of probability:

P(w₁, w₂, ..., wₙ) = P(w₁) × P(w₂|w₁) × P(w₃|w₁,w₂) × ... × P(wₙ|w₁,...,wₙ₋₁)

P('the cat sat') = P('the') × P('cat'|'the') × P('sat'|'the cat')

Problem: P(wₙ|w₁,...,wₙ₋₁) requires remembering ALL previous words!

N-gram Language Models
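N-gram models escape this problem with the Markov assumption: approximate the full history with only the last N−1 words, so a bigram model uses P(wₙ|wₙ₋₁) in place of P(wₙ|w₁,...,wₙ₋₁). A minimal Python sketch of that approximation, using made-up illustrative probabilities rather than values from any real corpus:

```python
# A minimal sketch of the bigram (Markov) approximation: condition each word on only
# the previous word instead of the full history. Probabilities are illustrative only.
p_first = {"the": 0.05}                                   # P(w₁)
p_bigram = {("the", "cat"): 0.01, ("cat", "sat"): 0.02}   # P(wᵢ | wᵢ₋₁)

# Chain rule under the bigram approximation:
# P('the cat sat') ≈ P('the') × P('cat'|'the') × P('sat'|'cat')
p_sentence = p_first["the"] * p_bigram[("the", "cat")] * p_bigram[("cat", "sat")]
print(p_sentence)  # ≈ 1e-05
```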
Building a bigram model by counting
Training corpus: 'the cat sat on the mat the cat ate'

Count bigrams: (the,cat)=2, (cat,sat)=1, (cat,ate)=1, (the,mat)=1, ...

P(cat|the) = Count(the,cat) / Count(the) = 2/3 ≈ 0.67
P(mat|the) = Count(the,mat) / Count(the) = 1/3 ≈ 0.33
P(sat|cat) = Count(cat,sat) / Count(cat) = 1/2 = 0.50

You build a bigram model from a training corpus. A user types 'I love quantum _____'. The word 'quantum' appeared 10 times in training, followed by 'physics' 8 times and 'computing' 2 times. 'quantum mechanics' never appeared. P('mechanics'|'quantum') = ?
💡 What happens when the count of a bigram is exactly zero?
This is the 'zero probability problem' — the most critical flaw of maximum-likelihood (MLE) n-gram models. If a bigram was never observed in training, P('mechanics'|'quantum') = Count(quantum,mechanics) / Count(quantum) = 0/10 = 0. The entire sentence 'I love quantum mechanics' then gets P = 0 because of ONE missing bigram. This is catastrophic: (1) the model declares perfectly valid English impossible, and (2) the sentence's log-probability becomes −∞, so perplexity can't even be computed. This is why smoothing is essential.
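A minimal Python sketch of the counting above — it rebuilds the bigram table from the toy corpus, reproduces the three probabilities, and the last line shows the zero-probability problem for an unseen continuation:

```python
from collections import Counter

# Build a bigram model by counting, using the toy corpus from this section.
corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_mle(word, prev):
    """Maximum-likelihood estimate: P(word | prev) = Count(prev, word) / Count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(round(p_mle("cat", "the"), 2))   # 0.67  = 2/3
print(round(p_mle("mat", "the"), 2))   # 0.33  = 1/3
print(round(p_mle("sat", "cat"), 2))   # 0.5   = 1/2
print(p_mle("mechanics", "the"))       # 0.0 — any unseen bigram gets probability zero
```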
Smoothing: Solving the Zero Problem
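One minimal sketch of a fix is add-one (Laplace) smoothing: pretend every bigram was seen one extra time, so nothing ends up with probability zero. The vocabulary size V below is an assumed toy value and the counts continue the 'quantum' quiz above; Kneser-Ney (mentioned in the summary) is the stronger choice in practice, but the core idea of reserving probability mass for unseen events is the same.

```python
# Add-one (Laplace) smoothing: P(w | prev) = (Count(prev, w) + 1) / (Count(prev) + V)
# V (vocabulary size) and the counts below are assumed values for illustration.
V = 10_000                       # assumed vocabulary size
count_quantum = 10               # 'quantum' appeared 10 times in training
count_quantum_physics = 8        # followed by 'physics' 8 times
count_quantum_mechanics = 0      # 'quantum mechanics' never appeared

def p_laplace(bigram_count, prev_count, vocab_size):
    return (bigram_count + 1) / (prev_count + vocab_size)

print(p_laplace(count_quantum_physics, count_quantum, V))    # ≈ 0.000899 — discounted, but still the favorite
print(p_laplace(count_quantum_mechanics, count_quantum, V))  # ≈ 0.0000999 — tiny, but no longer zero
```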
Perplexity: How Good Is Your Language Model?
Perplexity — the standard evaluation metric for language models
PP(W) = P(w₁, w₂, ..., wₙ)^(-1/n)

PP = 2^H, where H = cross-entropy of the model on test data

Intuition: perplexity = weighted average branching factor
PP = 50 → model is as uncertain as choosing from 50 equally likely words
Lower perplexity = better model. PP = 1 would mean perfect prediction.

Model A has perplexity 100 and Model B has perplexity 50 on the same test set. What does this mean?
💡 Perplexity = how 'perplexed' the model is. Is being MORE perplexed better or worse?
Lower perplexity = better. Perplexity of 50 means the model's uncertainty at each position is equivalent to randomly choosing from 50 equally likely words. Perplexity of 100 means the effective number of choices is twice as large — one extra bit of uncertainty per word. Model B assigns higher probability to the actual next words in the test set, meaning it better captures the statistical patterns of language. Think of it like a multiple-choice test: would you rather face 50 options per question or 100?
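A minimal sketch of the calculation: given the per-word probabilities a model assigns on a test sentence (the numbers below are hypothetical model outputs, not real measurements), perplexity is the exponentiated average negative log probability — equivalently P(W)^(-1/n):

```python
import math

# Perplexity from per-word probabilities on a test sentence.
probs = [0.2, 0.1, 0.05, 0.1]       # P(wᵢ | context) for each of the n test words
n = len(probs)

cross_entropy = -sum(math.log2(p) for p in probs) / n    # H, in bits per word
perplexity = 2 ** cross_entropy                          # PP = 2^H
print(round(perplexity, 2))                              # 10.0 — like choosing among 10 equally likely words

# Same result via the direct definition PP(W) = P(w₁..wₙ)^(-1/n):
print(round(math.prod(probs) ** (-1 / n), 2))            # 10.0
```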
Where N-gram Models Still Shine
From N-grams to Neural Language Models
GPT-4 and a trigram model both predict the next word. What is the fundamental difference?
💡 Think about what CONTEXT each model has access to when predicting the next word...
Both compute P(next_word | context). But: (1) Context: trigram sees 2 words; GPT-4 sees up to 128,000 tokens. 'The cat that I saw at the park yesterday while walking my dog...' — a trigram only sees 'my dog', losing everything else. GPT-4 sees it all. (2) Representation: trigram treats words as unrelated symbols; GPT-4 uses learned embeddings where similar words have similar representations. (3) Compositionality: GPT-4 learns complex patterns (grammar, semantics, even reasoning); trigrams capture only local word adjacency. Same task, astronomically different capability.
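A minimal sketch of point (1): a trigram model keys its probability table on only the last two words, so any longer context is thrown away before prediction. The probability table below is a hypothetical stand-in for a trained model, not real estimates.

```python
# A trigram model conditions on only the final two words of the context.
# p_trigram is a hypothetical lookup table standing in for a trained model.
p_trigram = {("my", "dog"): {"barked": 0.4, "slept": 0.3, "ran": 0.1}}

def trigram_next_word(context):
    """Everything before the last two words is discarded."""
    key = tuple(context[-2:])
    return p_trigram.get(key, {})

long_context = "The cat that I saw at the park yesterday while walking my dog".split()
short_context = "my dog".split()

print(trigram_next_word(long_context) == trigram_next_word(short_context))  # True — the extra context changed nothing
```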
A trigram model assigns 'Colorless green ideas sleep furiously' a perplexity of 8,200; GPT-2 assigns it 180. What explains GPT-2's much lower perplexity on this nonsensical sentence?
💡 Is the sentence grammatically correct, even though it's meaningless?
Chomsky's 1957 example shows that grammar and meaning are separate. A trigram model has no concept of grammar — it memorizes P(word|prev_2_words), and word sequences like 'green ideas' and 'ideas sleep' essentially never appear in training data, so its estimates collapse and perplexity explodes. GPT-2, by contrast, learned that adjective-adjective-noun-verb-adverb is a valid syntactic pattern and assigns a reasonable probability at each position, even though the semantics are absurd. This demonstrates that different language models capture different levels of linguistic structure: n-grams memorize local co-occurrences, while neural models learn patterns that generalize to word combinations they have never seen.
🎓 What You Now Know
✓ A language model assigns probabilities to word sequences — P(“the cat sat”) > P(“sat the cat”). This one idea underlies all of modern NLP.
✓ N-gram models approximate history with N-1 words — bigrams use 1 word of context, trigrams use 2. Simple, fast, but limited context.
✓ Smoothing is essential — without it, a single unseen bigram makes P(sentence) = 0. Kneser-Ney smoothing is the gold standard.
✓ Perplexity measures how “surprised” the model is — lower is better. PP=50 means the model is as uncertain as choosing among 50 equally likely words at each position.
✓ N-grams → NNLM → RNN → Transformer → GPT — all solve the same task (predict next word) with increasing expressiveness. N-grams still power autocomplete and speech recognition because they’re small, fast, and private.
Language models are the beating heart of NLP. From Shannon’s 1948 paper to GPT-4, the mission has never changed: predict the next word. Understanding n-grams means understanding the foundation that everything else — including LLMs — was built upon. 🔮
↗ Keep Learning
Bag of Words & TF-IDF — How Search Engines Ranked Before AI
A scroll-driven visual deep dive into Bag of Words and TF-IDF. Learn how documents become vectors, why term frequency alone fails, and how IDF rescues relevance — the backbone of search before neural models.
Word Embeddings — When Words Learned to Be Vectors
A scroll-driven visual deep dive into word embeddings. Learn how Word2Vec, GloVe, and FastText turn words into dense vectors where meaning becomes geometry — and why 'king - man + woman = queen' actually works.
Transformers — The Architecture That Changed AI
A scroll-driven visual deep dive into the Transformer architecture. From RNNs to self-attention to GPT — understand the engine behind every modern AI model.
Text Preprocessing — Turning Messy Words into Clean Features
A scroll-driven visual deep dive into text preprocessing. Learn tokenization, stemming, lemmatization, stopword removal, and normalization — the essential first step of every NLP pipeline.