17 min deep dive · NLP · language models · foundations

Language Models & N-grams — Predicting the Next Word Since 1948

A scroll-driven visual deep dive into statistical language models: from Shannon's information theory, through N-grams, to the bridge to transformers. Learn how autocomplete, spell check, and speech recognition all predict what comes next.

Introduction

🔮

“I want to _____”
Your phone knows what’s next.

Every time your keyboard suggests the next word, every time Google completes your search query, every time Siri transcribes your speech — a language model is predicting what word comes next. This fundamental idea — assigning probabilities to word sequences — is the foundation of all modern NLP, from n-grams to GPT.

↓ Scroll to understand the idea that launched the AI revolution


What Is a Language Model?

The fundamental equation of language modeling

1. P(w₁, w₂, ..., wₙ) = ? What is the probability of this specific word sequence?
2. Apply the chain rule of probability:
3. P(w₁, w₂, ..., wₙ) = P(w₁) × P(w₂|w₁) × P(w₃|w₁,w₂) × ... × P(wₙ|w₁,...,wₙ₋₁), the probability of each word GIVEN all previous words.
4. Example: P('the cat sat') = P('the') × P('cat'|'the') × P('sat'|'the cat').
5. Problem: P(wₙ|w₁,...,wₙ₋₁) requires remembering ALL previous words! For long sentences the number of possible histories explodes and almost every one is unique, so these probabilities can't be estimated directly → we need approximations.
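To make the chain rule concrete, here is a minimal Python sketch. The conditional probabilities are the toy figures used later in this article (7%, 2%, 5%), not estimates from any real corpus.

```python
# Chain rule: P(w1, ..., wn) = P(w1) * P(w2|w1) * ... * P(wn|w1, ..., wn-1)
# Toy conditional probabilities, keyed by the full history up to each word.
cond_probs = {
    ("the",): 0.07,               # P('the')
    ("the", "cat"): 0.02,         # P('cat' | 'the')
    ("the", "cat", "sat"): 0.05,  # P('sat' | 'the cat')
}

def sentence_probability(words):
    """Multiply P(word | all previous words) across the sentence."""
    prob = 1.0
    for i in range(len(words)):
        prob *= cond_probs[tuple(words[: i + 1])]
    return prob

print(sentence_probability(["the", "cat", "sat"]))  # 0.07 * 0.02 * 0.05 = 7e-05
```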

N-gram Language Models

Unigram (N=1): P(wₙ), each word is independent. "the" = 7%, "cat" = 0.01%, "sat" = 0.005%
Bigram (N=2): P(wₙ|wₙ₋₁), depends on the previous word. "cat" after "the" = 2%, "dog" after "the" = 1.5%
Trigram (N=3): P(wₙ|wₙ₋₂, wₙ₋₁), depends on the 2 previous words. "sat" after "the cat" = 5%
4-gram / 5-gram: more context → better predictions. But: exponentially more parameters needed!

The big trade-off: higher N = better predictions, but exponentially more data needed. Vocabulary size V = 50,000 → bigram: 50K² = 2.5 billion parameters | trigram: 50K³ = 125 TRILLION parameters! (A quick calculation follows below.)
N-gram models approximate the full history with just the last N-1 words
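A quick sanity check of those parameter counts, in Python (nothing assumed beyond the vocabulary size V = 50,000 quoted above):

```python
V = 50_000  # vocabulary size from the example above

for n in (1, 2, 3):
    print(f"{n}-gram table: V^{n} = {V ** n:,} possible entries")

# 1-gram table: V^1 = 50,000 possible entries
# 2-gram table: V^2 = 2,500,000,000 possible entries        (2.5 billion)
# 3-gram table: V^3 = 125,000,000,000,000 possible entries  (125 trillion)
```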

Building a bigram model by counting

1. Training corpus: 'the cat sat on the mat the cat ate'. Our training data is tiny, but it illustrates the concept.
2. Count bigrams: (the,cat)=2, (cat,sat)=1, (cat,ate)=1, (the,mat)=1, ... Count how many times each word pair appears.
3. P(cat|the) = Count(the,cat) / Count(the) = 2/3 ≈ 0.67. 67% of the time 'the' is followed by 'cat'.
4. P(mat|the) = Count(the,mat) / Count(the) = 1/3 ≈ 0.33. 33% of the time 'the' is followed by 'mat'.
5. P(sat|cat) = Count(cat,sat) / Count(cat) = 1/2 = 0.50. 50% of the time 'cat' is followed by 'sat'.
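The whole counting procedure fits in a few lines of Python. A minimal sketch using the toy corpus from step 1 (maximum-likelihood estimates only, no smoothing yet):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)                  # Count(w)
bigram_counts = Counter(zip(corpus, corpus[1:]))  # Count(prev, w)

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: Count(prev, word) / Count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))  # 2/3 ≈ 0.67
print(bigram_prob("the", "mat"))  # 1/3 ≈ 0.33
print(bigram_prob("cat", "sat"))  # 1/2 = 0.50
```

Any bigram that never appears in training gets probability 0 under this estimate, which is exactly the zero problem the next section addresses.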
🟢 Quick Check

You build a bigram model from a training corpus. A user types 'I love quantum _____'. The word 'quantum' appeared 10 times in training, followed by 'physics' 8 times and 'computing' 2 times. 'quantum mechanics' never appeared. P('mechanics'|'quantum') = ?


Smoothing: Solving the Zero Problem

1. Laplace (Add-1) Smoothing: add 1 to every bigram count: P(w|prev) = (Count(prev,w) + 1) / (Count(prev) + V). ✓ Simple | ✗ Gives too much mass to unseen events: "quantum cat" gets the same boost as "quantum mechanics". (See the sketch after this list.)
2. Kneser-Ney Smoothing (the gold standard): subtract a fixed discount d from every seen count and redistribute that mass to unseen events in proportion to each word's versatility. ✓ Best performing | ✓ Considers how many different contexts a word appears in.
3. Backoff / Interpolation: if the trigram is not found → fall back to the bigram → fall back to the unigram. ✓ Uses the best available context | ✓ Combines with other smoothing methods.
Smoothing techniques redistribute probability mass from seen to unseen events
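As a concrete illustration of the simplest of the three, here is a minimal add-1 sketch built on the same toy corpus as the counting example above (an illustration only, not a recommendation to use add-1 in practice):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
V = len(set(corpus))  # vocabulary size: 6 word types

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def laplace_bigram_prob(prev, word):
    """Add-1 smoothing: (Count(prev, word) + 1) / (Count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram_prob("the", "cat"))  # seen:   (2 + 1) / (3 + 6) ≈ 0.33
print(laplace_bigram_prob("the", "ate"))  # unseen: (0 + 1) / (3 + 6) ≈ 0.11, no longer zero
```

Notice how much mass add-1 takes from the seen bigram (P(cat|the) drops from 0.67 to 0.33); that over-correction is what Kneser-Ney's more careful discounting avoids.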

Perplexity: How Good Is Your Language Model?

Perplexity — the standard evaluation metric for language models

1. PP(W) = P(w₁, w₂, ..., wₙ)^(-1/n). The inverse probability of the test set, normalized by the number of words.
2. PP = 2^H, where H is the model's cross-entropy on the test data (in bits per word). Equivalent to 2 raised to the cross-entropy.
3. Intuition: perplexity is a weighted average branching factor. At each position, how many words is the model effectively 'choosing between'?
4. PP = 50 → the model is as uncertain as if it were choosing among 50 equally likely words at every step.
5. Lower perplexity = better model. PP = 1 would mean perfect prediction. GPT-2: PP ≈ 35 on WebText. A unigram model: PP ≈ 1000. Human estimate: PP ≈ 20.
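Given the per-word probabilities a model assigns to a test sequence, perplexity takes only a few lines of Python. A minimal sketch (the probabilities below are made-up placeholders), computed in log space to avoid underflow on long texts:

```python
import math

def perplexity(word_probs):
    """PP = 2^H, where H is the average negative log2 probability per word."""
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n  # bits per word
    return 2 ** cross_entropy

# A model that assigns every word probability 1/50 has perplexity exactly 50:
print(perplexity([1 / 50] * 10))     # 50.0
# Higher per-word probabilities -> lower perplexity (better model):
print(perplexity([0.25, 0.1, 0.5]))  # ≈ 4.3
```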
🟡 Checkpoint

Model A has perplexity 100 and Model B has perplexity 50 on the same test set. What does this mean?


Where N-gram Models Still Shine

📱 Keyboard Autocomplete
• On-device trigram model
• Personalizes from YOUR typing
• Runs in <1 ms per suggestion

🔍 Search Autocomplete
• Suggests completions as you type
• "how to" → "how to tie a tie"
• Based on query-log n-grams

🗣️ Speech Recognition
• Disambiguates homophones
• "their/there/they're" → resolved by context
• A 3-gram LM runs alongside the acoustic model

✏️ Spell Correction
• "thier" → "their" or "thief"?
• The LM score picks the best correction
• P("their friends") >> P("thief friends")
N-gram language models power everyday features you use without thinking

From N-grams to Neural Language Models

📊 N-gram LMs (1948–2003)
🧠 Neural LMs / NNLM (2003, Bengio): word embeddings (dense, not sparse)
🔄 RNN / LSTM LMs (2010–2017): variable-length context
Transformer LMs (2017–present): parallelization + attention
🚀 GPT / LLMs (2018–present): scale (billions of parameters)
The evolution of language models — each generation solves deeper limitations
N-gram limitation: a fixed context window of N−1 words. A trigram sees only "saw yesterday", so it can't use "cat" to complete "The cat that I saw yesterday ___".
Transformer fix: self-attention over the ENTIRE context. GPT-4: a 128K-token context window!
The same fundamental task: P(next word | previous words). N-grams, RNNs, and Transformers all model the same probability, just with different expressiveness.
Each step solves a fundamental limitation of the previous approach
🟡 Checkpoint

GPT-4 and a trigram model both predict the next word. What is the fundamental difference?

🔴 Challenge

A trigram model gives 'Colorless green ideas sleep furiously' perplexity 8,200. GPT-2 gives it perplexity 180. What explains GPT-2's much lower perplexity on this nonsensical sentence?

🎓 What You Now Know

A language model assigns probabilities to word sequences — P(“the cat sat”) > P(“sat the cat”). This one idea underlies all of modern NLP.

N-gram models approximate history with N-1 words — bigrams use 1 word of context, trigrams use 2. Simple, fast, but limited context.

Smoothing is essential — without it, a single unseen bigram makes P(sentence) = 0. Kneser-Ney smoothing is the gold standard.

Perplexity measures how “surprised” the model is — lower is better. PP=50 means the model is choosing from 50 options per word.

N-grams → NNLM → RNN → Transformer → GPT — all solve the same task (predict next word) with increasing expressiveness. N-grams still power autocomplete and speech recognition because they’re small, fast, and private.

Language models are the beating heart of NLP. From Shannon’s 1948 paper to GPT-4, the mission has never changed: predict the next word. Understanding n-grams means understanding the foundation that everything else — including LLMs — was built upon. 🔮
