17 min deep dive · NLP · language models · foundations

Language Models & N-grams — Predicting the Next Word Since 1948

A scroll-driven visual deep dive into statistical language models: from Shannon's information theory, through N-grams, to the bridge to transformers. Learn how autocomplete, spell check, and speech recognition all predict what comes next.

Introduction

🔮

“I want to _____”
Your phone knows what’s next.

Every time your keyboard suggests the next word, every time Google completes your search query, every time Siri transcribes your speech — a language model is predicting what word comes next. This fundamental idea — assigning probabilities to word sequences — is the foundation of all modern NLP, from n-grams to GPT.

↓ Scroll to understand the idea that launched the AI revolution


What Is a Language Model?

The fundamental equation of language modeling

1. P(w₁, w₂, ..., wₙ) = ? What is the probability of this specific word sequence?
2. Apply the chain rule of probability:
3. P(w₁, w₂, ..., wₙ) = P(w₁) × P(w₂|w₁) × P(w₃|w₁,w₂) × ... × P(wₙ|w₁,...,wₙ₋₁), the probability of each word GIVEN all previous words.
4. Example: P('the cat sat') = P('the') × P('cat'|'the') × P('sat'|'the cat').
5. Problem: P(wₙ|w₁,...,wₙ₋₁) requires remembering ALL previous words! For long sentences the number of possible histories explodes and almost every one is unique, so these probabilities can't be estimated directly → we need approximations.
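To make the chain rule concrete, here is a minimal Python sketch. The conditional probabilities are the toy figures used later in this article (7%, 2%, 5%), not estimates from any real corpus.

```python
# Chain rule: P(w1, ..., wn) = P(w1) * P(w2|w1) * ... * P(wn|w1, ..., wn-1)
# Toy conditional probabilities, keyed by the full history up to each word.
cond_probs = {
    ("the",): 0.07,               # P('the')
    ("the", "cat"): 0.02,         # P('cat' | 'the')
    ("the", "cat", "sat"): 0.05,  # P('sat' | 'the cat')
}

def sentence_probability(words):
    """Multiply P(word | all previous words) across the sentence."""
    prob = 1.0
    for i in range(len(words)):
        prob *= cond_probs[tuple(words[: i + 1])]
    return prob

print(sentence_probability(["the", "cat", "sat"]))  # 0.07 * 0.02 * 0.05 = 7e-05
```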

N-gram Language Models

Unigram (N=1): P(wₙ), each word is independent. "the" = 7%, "cat" = 0.01%, "sat" = 0.005%
Bigram (N=2): P(wₙ|wₙ₋₁), depends on the previous word. "cat" after "the" = 2%, "dog" after "the" = 1.5%
Trigram (N=3): P(wₙ|wₙ₋₂, wₙ₋₁), depends on the 2 previous words. "sat" after "the cat" = 5%
4-gram / 5-gram: more context → better predictions. But: exponentially more parameters needed!

The big trade-off: higher N = better predictions, but exponentially more data needed. Vocabulary size V = 50,000 → bigram: 50K² = 2.5 billion parameters | trigram: 50K³ = 125 TRILLION parameters! (A quick calculation follows below.)
N-gram models approximate the full history with just the last N-1 words
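A quick sanity check of those parameter counts, in Python (nothing assumed beyond the vocabulary size V = 50,000 quoted above):

```python
V = 50_000  # vocabulary size from the example above

for n in (1, 2, 3):
    print(f"{n}-gram table: V^{n} = {V ** n:,} possible entries")

# 1-gram table: V^1 = 50,000 possible entries
# 2-gram table: V^2 = 2,500,000,000 possible entries        (2.5 billion)
# 3-gram table: V^3 = 125,000,000,000,000 possible entries  (125 trillion)
```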

Building a bigram model by counting

1. Training corpus: 'the cat sat on the mat the cat ate'. Our training data is tiny, but it illustrates the concept.
2. Count bigrams: (the,cat)=2, (cat,sat)=1, (cat,ate)=1, (the,mat)=1, ... Count how many times each word pair appears.
3. P(cat|the) = Count(the,cat) / Count(the) = 2/3 ≈ 0.67. 67% of the time 'the' is followed by 'cat'.
4. P(mat|the) = Count(the,mat) / Count(the) = 1/3 ≈ 0.33. 33% of the time 'the' is followed by 'mat'.
5. P(sat|cat) = Count(cat,sat) / Count(cat) = 1/2 = 0.50. 50% of the time 'cat' is followed by 'sat'.
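The whole counting procedure fits in a few lines of Python. A minimal sketch using the toy corpus from step 1 (maximum-likelihood estimates only, no smoothing yet):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)                  # Count(w)
bigram_counts = Counter(zip(corpus, corpus[1:]))  # Count(prev, w)

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: Count(prev, word) / Count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))  # 2/3 ≈ 0.67
print(bigram_prob("the", "mat"))  # 1/3 ≈ 0.33
print(bigram_prob("cat", "sat"))  # 1/2 = 0.50
```

Any bigram that never appears in training gets probability 0 under this estimate, which is exactly the zero problem the next section addresses.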
🟢 Quick Check

You build a bigram model from a training corpus. A user types 'I love quantum _____'. The word 'quantum' appeared 10 times in training, followed by 'physics' 8 times and 'computing' 2 times. 'quantum mechanics' never appeared. P('mechanics'|'quantum') = ?


Smoothing: Solving the Zero Problem

1. Laplace (Add-1) Smoothing: add 1 to every bigram count: P(w|prev) = (Count(prev,w) + 1) / (Count(prev) + V). ✓ Simple | ✗ Gives too much mass to unseen events: "quantum cat" gets the same boost as "quantum mechanics". (See the sketch after this list.)
2. Kneser-Ney Smoothing (the gold standard): subtract a fixed discount d from every seen count and redistribute that mass to unseen events in proportion to each word's versatility. ✓ Best performing | ✓ Considers how many different contexts a word appears in.
3. Backoff / Interpolation: if the trigram is not found → fall back to the bigram → fall back to the unigram. ✓ Uses the best available context | ✓ Combines with other smoothing methods.
Smoothing techniques redistribute probability mass from seen to unseen events
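As a concrete illustration of the simplest of the three, here is a minimal add-1 sketch built on the same toy corpus as the counting example above (an illustration only, not a recommendation to use add-1 in practice):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
V = len(set(corpus))  # vocabulary size: 6 word types

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def laplace_bigram_prob(prev, word):
    """Add-1 smoothing: (Count(prev, word) + 1) / (Count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram_prob("the", "cat"))  # seen:   (2 + 1) / (3 + 6) ≈ 0.33
print(laplace_bigram_prob("the", "ate"))  # unseen: (0 + 1) / (3 + 6) ≈ 0.11, no longer zero
```

Notice how much mass add-1 takes from the seen bigram (P(cat|the) drops from 0.67 to 0.33); that over-correction is what Kneser-Ney's more careful discounting avoids.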

Perplexity: How Good Is Your Language Model?

Perplexity — the standard evaluation metric for language models

1. PP(W) = P(w₁, w₂, ..., wₙ)^(-1/n). The inverse probability of the test set, normalized by the number of words.
2. PP = 2^H, where H is the model's cross-entropy on the test data (in bits per word). Equivalent to 2 raised to the cross-entropy.
3. Intuition: perplexity is a weighted average branching factor. At each position, how many words is the model effectively 'choosing between'?
4. PP = 50 → the model is as uncertain as if it were choosing among 50 equally likely words at every step.
5. Lower perplexity = better model. PP = 1 would mean perfect prediction. GPT-2: PP ≈ 35 on WebText. A unigram model: PP ≈ 1000. Human estimate: PP ≈ 20.
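Given the per-word probabilities a model assigns to a test sequence, perplexity takes only a few lines of Python. A minimal sketch (the probabilities below are made-up placeholders), computed in log space to avoid underflow on long texts:

```python
import math

def perplexity(word_probs):
    """PP = 2^H, where H is the average negative log2 probability per word."""
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n  # bits per word
    return 2 ** cross_entropy

# A model that assigns every word probability 1/50 has perplexity exactly 50:
print(perplexity([1 / 50] * 10))     # 50.0
# Higher per-word probabilities -> lower perplexity (better model):
print(perplexity([0.25, 0.1, 0.5]))  # ≈ 4.3
```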
🟡 Checkpoint

Model A has perplexity 100 and Model B has perplexity 50 on the same test set. What does this mean?


Where N-gram Models Still Shine

📱 Keyboard Autocomplete
• On-device trigram model
• Personalizes from YOUR typing
• Runs in <1 ms per suggestion

🔍 Search Autocomplete
• Suggests completions as you type
• "how to" → "how to tie a tie"
• Based on query-log n-grams

🗣️ Speech Recognition
• Disambiguates homophones
• "their/there/they're" → resolved by context
• A 3-gram LM runs alongside the acoustic model

✏️ Spell Correction
• "thier" → "their" or "thief"?
• The LM score picks the best correction
• P("their friends") >> P("thief friends")
N-gram language models power everyday features you use without thinking

From N-grams to Neural Language Models

📊 N-gram LMs (1948–2003)
🧠 Neural LMs / NNLM (2003, Bengio): word embeddings (dense, not sparse)
🔄 RNN / LSTM LMs (2010–2017): variable-length context
Transformer LMs (2017–present): parallelization + attention
🚀 GPT / LLMs (2018–present): scale (billions of parameters)
The evolution of language models — each generation solves deeper limitations
N-gram limitation: a fixed context window of N−1 words. A trigram sees only "saw yesterday", so it can't use "cat" to complete "The cat that I saw yesterday ___".
Transformer fix: self-attention over the ENTIRE context. GPT-4: a 128K-token context window!
The same fundamental task: P(next word | previous words). N-grams, RNNs, and Transformers all model the same probability, just with different expressiveness.
Each step solves a fundamental limitation of the previous approach
🟡 Checkpoint

GPT-4 and a trigram model both predict the next word. What is the fundamental difference?

🔴 Challenge

A trigram model gives 'Colorless green ideas sleep furiously' perplexity 8,200. GPT-2 gives it perplexity 180. What explains GPT-2's much lower perplexity on this nonsensical sentence?

🎓 What You Now Know

A language model assigns probabilities to word sequences — P(“the cat sat”) > P(“sat the cat”). This one idea underlies all of modern NLP.

N-gram models approximate history with N-1 words — bigrams use 1 word of context, trigrams use 2. Simple, fast, but limited context.

Smoothing is essential — without it, a single unseen bigram makes P(sentence) = 0. Kneser-Ney smoothing is the gold standard.

Perplexity measures how “surprised” the model is — lower is better. PP=50 means the model is choosing from 50 options per word.

N-grams → NNLM → RNN → Transformer → GPT — all solve the same task (predict next word) with increasing expressiveness. N-grams still power autocomplete and speech recognition because they’re small, fast, and private.

Language models are the beating heart of NLP. From Shannon’s 1948 paper to GPT-4, the mission has never changed: predict the next word. Understanding n-grams means understanding the foundation that everything else — including LLMs — was built upon. 🔮
