12 min deep dive · transformers · inference

Speculative Decoding — Making LLMs Think Ahead

A scroll-driven visual deep dive into speculative decoding. Learn why LLM inference is slow, how a small 'draft' model can speed up a big model by 2-3x, and why the output is mathematically identical.

Introduction

🐢 → 🐇

What if a small model could make a big model faster?

GPT-4 generates text one word at a time. That’s painfully slow.
Speculative decoding uses a tiny “draft” model to guess ahead, then lets the big model verify in parallel.
Same output. 2–3x faster.


The Problem

Why Is ChatGPT So Slow at Typing?

Ever noticed how ChatGPT “types” one word at a time? That’s not a UI trick — the model literally generates one token at a time. And each token requires a full forward pass through the entire model.

Why It's Slow

The Autoregressive Bottleneck

Here’s the thing most people don’t realize: the GPU is barely working during text generation. The model is so big that most of the time is spent loading model weights from memory, not doing math.

Diagram: standard generation timeline. Token 1 "The" (full pass, then wait), token 2 "cat" (full pass, then wait), token 3 "sat" (full pass), ... still going at token N.
Standard generation: one token per forward pass, GPU mostly idle

Why each token is expensive

1. 1 token = 1 full forward pass. Every single token requires loading the entire model (billions of parameters) from memory.
2. GPU utilization: ~1-5%. The GPU spends most of its time loading weights, not doing math; it's memory-bandwidth-bound.
3. 100 tokens at 30 ms each = 3 seconds. A short reply takes seconds; a long essay can take 30+ seconds of pure waiting.
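
To make that arithmetic concrete, here is a minimal sketch of the standard autoregressive loop. The `big_model` callable and `sample` helper are hypothetical stand-ins for a real transformer and a sampling routine; the point is the shape of the loop and the latency math, not any particular framework.

```python
# Hypothetical sketch: standard autoregressive decoding.
# Assume big_model(tokens) returns a next-token probability distribution
# for every position in one full forward pass (~30 ms for a large model),
# and sample(dist) draws a token id from a distribution.

def generate_naive(big_model, sample, prompt_tokens, num_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        probs = big_model(tokens)         # one FULL forward pass per token...
        tokens.append(sample(probs[-1]))  # ...but only the last position is used
    return tokens

# Back-of-the-envelope latency: every new token pays for a full pass.
ms_per_pass, num_tokens = 30, 100
print(f"~{ms_per_pass * num_tokens / 1000:.1f} s of mostly-idle GPU time")  # ~3.0 s
```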
🟢 Quick Check

Why is autoregressive text generation slow even on powerful GPUs?

The Breakthrough

💡 The Big Idea

What if a tiny model guessed the next 5 words, and then the big model checked all 5 at once?



The small model is 100x faster but less accurate.
The big model can verify multiple tokens in one pass (same cost as generating one!).
If the guesses are right? Free speedup. If some are wrong? You still keep the correct prefix plus one corrected token, so you never fall behind.

Here’s the key insight that makes this work: verification is parallel, generation is sequential.

When you give a language model a sequence like “The cat sat on the”, it computes the probability of every next token at every position in a single forward pass. That’s just how transformers work! So checking 5 guesses costs the same as generating 1 token.
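
Here is a short sketch of that insight, using the same hypothetical `big_model(tokens)` interface as before (one call returns a next-token distribution at every position): scoring all K guesses costs a single forward pass.

```python
# Because a causal transformer predicts the next token at EVERY position,
# one pass over (prompt + K guesses) yields all the probabilities needed
# to check every guess at once.

def score_guesses(big_model, prompt_tokens, draft_tokens):
    all_probs = big_model(prompt_tokens + draft_tokens)  # single forward pass
    scores = []
    for i, tok in enumerate(draft_tokens):
        # the distribution predicted just before draft token i appears
        dist = all_probs[len(prompt_tokens) + i - 1]
        scores.append(dist[tok])  # how likely the big model finds this guess
    return scores  # K probabilities for roughly the cost of generating 1 token
```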

🟡 Checkpoint

Why can the big model verify K guessed tokens in the same time it takes to generate 1 token?

Draft & Verify

The Algorithm: Draft, Verify, Accept

Diagram: the draft model (small and fast, ~1B params) drafts K tokens, e.g. [sat, on, the, mat]; the target model (big and smart, ~70B params) verifies them all at once, accepts 3, rejects 1, and generation resumes from the rejection point.
Speculative decoding: the small model drafts, the big model verifies in parallel

Step by Step

The speculative decoding loop

1. Draft model generates K tokens. The small model quickly generates K tokens (typically K = 4-8). Each token is cheap because the model is tiny.
2. Run the big model on all K tokens at once. Feed the original prompt + K draft tokens into the big model. One forward pass gives you probabilities at every position.
3. Compare draft vs. target probabilities. At each position, check: would the big model have generated the same token? Use rejection sampling to decide.
4. Accept the prefix, reject the suffix. If the first 3 tokens match but token 4 doesn't, accept tokens 1-3 and sample token 4 from the big model instead.
5. Result: 3-4 tokens for the cost of ~1 big-model pass. Instead of 4 sequential big-model passes, you did 1 big-model pass plus a few cheap draft passes. That's a 2-3x speedup!
Draft: "sat" "on" "the" "bed" → verify → result: "sat" "on" "the" "mat" (last token resampled). 4 tokens verified in 1 forward pass: 3 accepted + 1 corrected = 4 tokens total.
Example: 4 draft tokens, 3 accepted, 1 rejected and resampled
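
Putting the four steps together, here is a hedged end-to-end sketch of one round. `draft_model` and `target_model` are hypothetical callables with the same interface as above (a list of token ids in, a next-token distribution at every position out), and acceptance is simplified to greedy matching (accept a draft token only if it equals the target model's top choice); the lossless stochastic rule is covered in the next section.

```python
# One round of draft-and-verify, simplified to greedy acceptance.
# The lossless version replaces the equality test with rejection sampling
# (see the math section below).

def argmax(dist):
    # index of the highest-probability token in a distribution
    return max(range(len(dist)), key=dist.__getitem__)

def speculative_round(draft_model, target_model, tokens, K=4):
    # Step 1: the small model drafts K tokens, one cheap pass each.
    draft = []
    for _ in range(K):
        probs = draft_model(tokens + draft)
        draft.append(argmax(probs[-1]))

    # Step 2: the big model scores the prompt + all K drafts in ONE pass.
    target_probs = target_model(tokens + draft)

    # Steps 3-4: accept the matching prefix, correct the first mismatch.
    new_tokens = []
    for i, tok in enumerate(draft):
        target_choice = argmax(target_probs[len(tokens) + i - 1])
        if tok == target_choice:
            new_tokens.append(tok)            # draft token accepted
        else:
            new_tokens.append(target_choice)  # corrected by the big model
            break
    else:
        # every draft accepted: the same pass also yields one bonus token
        new_tokens.append(argmax(target_probs[-1]))

    return tokens + new_tokens
```

Note that even in the worst case (the first draft token is rejected) the round still produces one correct token, so it never yields fewer tokens per target-model pass than standard decoding.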
🟡 Checkpoint

The draft model generates 5 tokens. The big model verifies and finds token 3 is wrong. What happens?

The Math

The Rejection Sampling Trick (Why the Output Is Identical)

This is the magical part: speculative decoding doesn’t produce “approximate” output. It produces the exact same distribution as running the big model alone.

How rejection sampling preserves the target distribution

1. p(x) = target model probability. The big model's probability for a token: the 'true' distribution we want.
2. q(x) = draft model probability. The small model's probability: a rough approximation.
3. Accept if rand() < min(1, p(x)/q(x)). If p(x) ≥ q(x), the token is always accepted; otherwise it is accepted with probability p(x)/q(x).
4. If rejected, sample from (p(x) - q(x))⁺ / Z. This 'residual' distribution is exactly the probability mass the draft model missed.
5. Result: the output distribution equals p(x) exactly. Mathematically proven: you get the same output as running the big model alone. Zero quality loss.
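
Here is a minimal sketch of that accept/resample rule for a single draft token, with `p` and `q` as plain Python lists of vocabulary probabilities (hypothetical stand-ins for the target and draft distributions at one position):

```python
import random

def verify_token(x, p, q):
    """Accept or replace draft token x so the result is distributed exactly as p.
    p: target-model probabilities at this position
    q: draft-model probabilities (q[x] > 0, since the draft actually sampled x)"""
    # Accept with probability min(1, p(x)/q(x)).
    if random.random() < min(1.0, p[x] / q[x]):
        return x, True
    # Rejected: resample from the residual (p - q)+, i.e. exactly the
    # probability mass the draft model missed (random.choices normalizes it).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return random.choices(range(len(p)), weights=residual)[0], False
```

All draft tokens after the first rejection are discarded, which is what keeps the overall output distribution identical to running the target model alone.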

Expected Speedup

The speedup depends on the acceptance rate — how often the small model’s guesses match the big model.

Acceptance rate → speedup: 50% → ~1.5x, 60% → ~1.8x, 70% → ~2.2x, 80% → ~2.8x, 90% → ~3.4x
Higher acceptance rate → more free tokens per round → bigger speedup
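
For a rough sense of where such numbers come from: if each draft token is accepted independently with probability α and K tokens are drafted per round, the expected number of tokens produced per target-model pass is (1 - α^(K+1)) / (1 - α), the form given in Leviathan et al. (2022) under an independence assumption. The sketch below computes that quantity; real wall-clock speedups, like the curve above, come out lower because each round also pays for the draft passes.

```python
# Expected tokens produced per target-model forward pass, assuming each
# draft token is accepted i.i.d. with probability alpha and K drafts per round.
# This ignores the draft model's own cost, so it is an upper bound on speedup.

def expected_tokens_per_round(alpha: float, K: int) -> float:
    return (1 - alpha ** (K + 1)) / (1 - alpha)

for alpha in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"acceptance {alpha:.0%}: ~{expected_tokens_per_round(alpha, K=4):.1f} "
          "tokens per big-model pass")
```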
🟡 Checkpoint

Does speculative decoding reduce the quality of the generated text?

Real Impact

Real-World Impact

  • 🚀 2-3x faster: same quality output, dramatically faster generation.
  • 💰 Same GPU cost: the draft model fits on the same GPU, no extra hardware needed.
  • 🎯 Zero quality loss: mathematically proven identical output distribution.
Speculative decoding is now standard in production LLM serving

Who Uses It?

  • Google — Uses it in Gemini for faster inference
  • Meta — LLaMA models support speculative decoding natively
  • vLLM — The most popular LLM serving framework supports it out of the box
  • Apple — Uses it for on-device LLM inference in Apple Intelligence
  • Medusa, EAGLE — Variants that use multiple draft heads instead of a separate model
🔴 Challenge

A draft model with 1B parameters is drafting for a 70B target model. The draft generates 6 tokens. The target verifies and accepts tokens 1-4 but rejects token 5. How many total tokens do you get from this round?

🎓 What You Now Know

LLM generation is memory-bound — The GPU spends most of its time loading weights, not computing. Each token requires a full forward pass.

A small model drafts, a big model verifies — The draft model is fast but imperfect. The big model checks all draft tokens in parallel.

Rejection sampling preserves quality — The output distribution is mathematically identical to the big model alone.

2-3x speedup for free — No quality loss, no extra hardware, no approximations. Just clever scheduling.

How many did you nail? 🎯

📄 Read the paper: Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2022)


📄 Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al., 2023)
