Speculative Decoding — Making LLMs Think Ahead
A scroll-driven visual deep dive into speculative decoding. Learn why LLM inference is slow, how a small 'draft' model can speed up a big model by 2-3x, and why the output is mathematically identical.
🐢 → 🐇
What if a small model could make
a big model faster?
GPT-4 generates text one word at a time. That’s painfully slow.
Speculative decoding uses a tiny “draft” model to guess ahead, then lets the big model verify in parallel.
Same output. 2–3x faster.
↓ Scroll to learn — quizzes will test your understanding
Why Is ChatGPT So Slow at Typing?
Ever noticed how ChatGPT “types” one word at a time? That’s not a UI trick — the model literally generates one token at a time. And each token requires a full forward pass through the entire model.
The Autoregressive Bottleneck
Here’s the thing most people don’t realize: the GPU is barely working during text generation. The model is so big that most of the time is spent loading model weights from memory, not doing math.
Why each token is expensive
- 1 token = 1 full forward pass
- GPU utilization: ~1-5%
- 100 tokens at 30 ms each = 3 seconds

Why is autoregressive text generation slow even on powerful GPUs?
💡 Think about what takes longer: doing the math or loading the data...
During generation, each token requires loading the full model from memory. The GPU's compute units are incredibly fast, but they spend most of their time idle, waiting for weights to arrive. This is called being 'memory-bandwidth-bound.'
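To make that concrete, here's a rough back-of-envelope sketch in Python. The numbers (a 70B-parameter model in fp16, roughly 2 TB/s of GPU memory bandwidth) are illustrative assumptions, not measurements; the point is that just streaming the weights already costs tens of milliseconds per token.

```python
def min_ms_per_token(n_params_billion, bytes_per_param=2, bandwidth_tb_per_s=2.0):
    """Lower bound on per-token latency when decoding is memory-bandwidth-bound:
    at batch size 1, every generated token has to stream all model weights
    from GPU memory at least once (KV cache and compute ignored here)."""
    model_bytes = n_params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_per_s * 1e12
    return model_bytes / bandwidth_bytes_per_s * 1e3

# Illustrative: a 70B model in fp16 on a GPU with ~2 TB/s of memory bandwidth.
print(min_ms_per_token(70))  # ~70 ms per token, even with the compute units idle
```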
💡 The Big Idea
What if a tiny model guessed the next 5 words, and then the big model checked all 5 at once?
The small model is 100x faster but less accurate.
The big model can verify multiple tokens in one pass (same cost as generating one!).
If the guesses are right? Free speedup. If wrong? Just try again.
Here’s the key insight that makes this work: verification is parallel, generation is sequential.
When you give a language model a sequence like “The cat sat on the”, it computes the probability of every next token at every position in a single forward pass. That’s just how transformers work! So checking 5 guesses costs the same as generating 1 token.
Why can the big model verify K guessed tokens in the same time it takes to generate 1 token?
💡 How does a transformer process a prompt of 100 tokens — one at a time or all at once?
Transformers are inherently parallel — a single forward pass computes the next-token probability at every position simultaneously. So feeding in 5 draft tokens and checking them all costs roughly the same as generating 1 token from scratch.
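Here's a minimal sketch of that idea. The `target_model` callable is a hypothetical stand-in for a transformer forward pass that returns a next-token distribution at every input position; one call scores all K guesses.

```python
def verify_in_one_pass(target_model, prompt_ids, draft_ids):
    """Score K drafted tokens with a single big-model forward pass.

    target_model(token_ids) is assumed (hypothetical interface) to return a
    next-token probability distribution for EVERY position of its input:
    one forward pass, len(token_ids) distributions.
    """
    full = prompt_ids + draft_ids      # feed the prompt plus all K guesses at once
    dists = target_model(full)         # one forward pass over the whole sequence

    # dists[len(prompt_ids) - 1] is the target's prediction for draft_ids[0],
    # dists[len(prompt_ids)]     is its prediction for draft_ids[1], and so on.
    start = len(prompt_ids) - 1
    return [dists[start + i] for i in range(len(draft_ids))]
```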
The Algorithm: Draft, Verify, Accept
Step by Step
The speculative decoding loop
Step 1: Draft model generates K tokens
Step 2: Run big model on all K tokens at once
Step 3: Compare draft vs target probabilities
Step 4: Accept prefix, reject suffix
Result: Got 3-4 tokens for the cost of ~1 big-model pass!

The draft model generates 5 tokens. The big model verifies and finds token 3 is wrong. What happens?
💡 Think about it: if token 3 is wrong, can tokens 4-5 (which depend on token 3) be trusted?
Speculative decoding accepts the longest valid prefix. Tokens 1-2 are correct, so they're kept. Token 3 is wrong, so it gets resampled from the big model's distribution. Tokens 4-5 are after the rejection point, so they're discarded — they were conditioned on a wrong token.
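Putting the four steps together, here's a sketch of one round of the loop. The model-specific pieces (`draft_sample`, `draft_dists`, `target_dists`, `resample_residual`) are hypothetical callables you'd supply; the sketch only shows the control flow, and the resampling rule itself is spelled out in the next section.

```python
import random

def one_round(draft_sample, draft_dists, target_dists, resample_residual,
              prompt, k=5):
    """One draft -> verify -> accept round of speculative decoding.
    The four callables are hypothetical stand-ins for the two models;
    this sketch only shows the control flow of the loop."""
    # Step 1: the small draft model proposes k tokens autoregressively (cheap).
    drafted, ctx = [], list(prompt)
    for _ in range(k):
        tok = draft_sample(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Step 2: a SINGLE big-model forward pass over prompt + drafted tokens
    # gives the target distribution p[i] at every drafted position at once.
    p = target_dists(prompt, drafted)
    q = draft_dists(prompt, drafted)   # the draft model's own distributions

    # Steps 3-4: keep the longest accepted prefix; at the first rejection,
    # resample that position from (p - q)+ and discard everything after it.
    accepted = []
    for i, tok in enumerate(drafted):
        if random.random() < min(1.0, p[i][tok] / q[i][tok]):
            accepted.append(tok)
        else:
            accepted.append(resample_residual(p[i], q[i]))
            break
    # (The full algorithm also emits one extra "bonus" token when every draft
    # token is accepted; omitted here to keep the sketch short.)
    return accepted
```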
The Rejection Sampling Trick (Why the Output Is Identical)
This is the magical part: speculative decoding doesn’t produce “approximate” output. It produces the exact same distribution as running the big model alone.
How rejection sampling preserves the target distribution
p(x) = target model probability
q(x) = draft model probability
Accept if: rand() < min(1, p(x)/q(x))
If rejected: sample from (p(x) - q(x))⁺ / Z
Result: output distribution = p(x) exactly
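Here's that accept/resample rule as a minimal NumPy sketch: `p` and `q` are probability vectors over the vocabulary, and the toy numbers at the bottom are made up for illustration. It's just the formula above translated directly into code.

```python
import numpy as np

def accept_or_resample(p, q, x, rng):
    """Rejection-sampling step for one drafted token.

    p, q : target / draft probability vectors over the vocabulary
    x    : the token id the draft model sampled (from q)
    Returns a token whose distribution is exactly p: either the accepted
    draft token, or a resample from the residual (p - q)+ / Z.
    """
    # Accept the drafted token with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x

    # Rejected: resample from the leftover probability mass (p - q)+, renormalized.
    residual = np.maximum(p - q, 0.0)
    return rng.choice(len(p), p=residual / residual.sum())

# Toy usage with a made-up 4-token vocabulary:
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.1, 0.1])   # target distribution (assumed)
q = np.array([0.7, 0.1, 0.1, 0.1])   # draft distribution (assumed)
token = accept_or_resample(p, q, x=0, rng=rng)
```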
Expected Speedup
The speedup depends on the acceptance rate — how often the small model's guesses match the big model.
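To put a number on it: under the simplifying assumption that each drafted token is accepted independently with the same probability α (the assumption used in the paper's analysis), the expected number of tokens per big-model pass has a closed form. A quick sketch:

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens produced per big-model forward pass, assuming each of
    the k drafted tokens is accepted independently with probability alpha.
    Equals 1 + alpha + alpha^2 + ... + alpha^k: one token is always produced
    (a resample or the bonus token), plus every accepted draft token."""
    return sum(alpha ** i for i in range(k + 1))

# e.g. an 80% acceptance rate with k = 5 drafted tokens:
print(expected_tokens_per_pass(0.8, 5))   # ~3.69 tokens per big-model pass
```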
Does speculative decoding reduce the quality of the generated text?
💡 Remember the rejection sampling formula — what does it guarantee about the output distribution?
The rejection sampling mechanism mathematically guarantees that the output distribution is identical to the big model. If the draft token doesn't match, it gets rejected and resampled from the target distribution. Zero quality loss — only the speed changes.
Real-World Impact
Who Uses It?
- Google — Uses it in Gemini for faster inference
- Meta — LLaMA models support speculative decoding natively
- vLLM — The most popular LLM serving framework supports it out of the box
- Apple — Uses it for on-device LLM inference in Apple Intelligence
- Medusa, EAGLE — Variants that use multiple draft heads instead of a separate model
A draft model with 1B parameters is drafting for a 70B target model. The draft generates 6 tokens. The target verifies and accepts tokens 1-4 but rejects token 5. How many total tokens do you get from this round?
💡 What happens at the exact position where the rejection occurs?
You get 5 tokens: the 4 accepted draft tokens, plus 1 token resampled from the target model at the rejection point. Token 6 is discarded because it was conditioned on the wrong token 5. So you got 5 tokens for the cost of ~1 target forward pass — that's a great deal!
🎓 What You Now Know
✓ LLM generation is memory-bound — The GPU spends most of its time loading weights, not computing. Each token requires a full forward pass.
✓ A small model drafts, a big model verifies — The draft model is fast but imperfect. The big model checks all draft tokens in parallel.
✓ Rejection sampling preserves quality — The output distribution is mathematically identical to the big model alone.
✓ 2-3x speedup for free — No quality loss, no extra hardware, no approximations. Just clever scheduling.
Check your quiz score → How many did you nail? 🎯
📄 Read the paper: Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2022)
📄 Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al., 2023)
↗ Keep Learning
Transformers — The Architecture That Changed AI
A scroll-driven visual deep dive into the Transformer architecture. From RNNs to self-attention to GPT — understand the engine behind every modern AI model.
Flash Attention — Making Transformers Actually Fast
A scroll-driven visual deep dive into Flash Attention. Learn why standard attention is broken, how GPU memory works, and how tiling fixes everything — with quizzes to test your understanding.
Caching — The Art of Remembering What's Expensive to Compute
A visual deep dive into caching. From CPU caches to CDNs — understand cache strategies, eviction policies, and the hardest problem in computer science: cache invalidation.