
Speculative Decoding in vLLM: A Practical Guide to Faster LLM Inference

A hands-on speculative decoding tutorial for vLLM: how it works, runnable n-gram and draft-model examples on Qwen3, EAGLE-3, and where the speedup disappears.

RayZ
Figure: A draft model proposes four tokens; the target model verifies them in one pass, accepting three and rejecting one.

If you serve an LLM and latency is your problem, the first thing to internalize is that your GPU is mostly idle during generation. Autoregressive decoding produces one token per forward pass, and each pass is bottlenecked on reading the model's weights out of memory, not on arithmetic. The matrix units sit largely unused while billions of parameters stream in to produce a single token. Speculative decoding exploits that idle compute to produce several tokens per pass instead of one, and it does so without changing the text your model would have generated. This guide shows how to turn speculative decoding on in vLLM, measure the speedup on real models, and recognize the one production condition under which it quietly stops helping.

The technique pairs naturally with the rest of the LLM inference optimization stack, and like quantization, it is a case where the demo number and the production number can differ by a lot if you measure the wrong scenario.

Why decoding is slow in the first place

A transformer forward pass during generation is memory-bandwidth bound, not compute bound. To produce one token, the GPU reads every weight in the model from high-bandwidth memory into the compute units, does a relatively small amount of math, and writes one token out. At batch size one, the arithmetic intensity is terrible: you move tens of gigabytes of weights to compute a single token's worth of multiply-accumulates. The GPU's floating-point throughput is almost irrelevant because you cannot feed it fast enough.

This is the key asymmetry that makes speculative decoding work. Verifying a guess is nearly free. If you already have a candidate sequence of k tokens, the target model can check all k in a single forward pass, because evaluating the model on k positions at once has almost the same cost as evaluating it on one position: you read the weights once either way, and the extra arithmetic for k positions is the cheap part. So the question becomes: can you produce a good guess cheaply, and then verify it in one pass? If the guess is mostly right, you get several tokens for the price of one forward pass.
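To put rough numbers on that asymmetry, here is a minimal sketch that compares the arithmetic intensity of scoring k positions against the point where a GPU stops being memory-bound. The bandwidth and throughput figures are illustrative assumptions, not measurements of any particular card, and the model ignores attention and KV-cache traffic.

```python
# Illustrative hardware numbers (assumptions, not a specific GPU).
BYTES_PER_PARAM = 2      # bf16 weights
MEM_BANDWIDTH = 2e12     # assumed: 2 TB/s of memory bandwidth
PEAK_FLOPS = 4e14        # assumed: 400 TFLOP/s of dense compute


def arithmetic_intensity(positions: int) -> float:
    """FLOPs performed per byte of weights read, at batch size one.

    Each parameter costs about 2 FLOPs (multiply and add) per position scored,
    and is read from memory once regardless of how many positions are scored.
    """
    return 2 * positions / BYTES_PER_PARAM


# Intensity the hardware needs before compute, not memory, becomes the limit.
RIDGE_POINT = PEAK_FLOPS / MEM_BANDWIDTH

for k in (1, 4, 8):
    ai = arithmetic_intensity(k)
    bound = "memory-bound" if ai < RIDGE_POINT else "compute-bound"
    print(f"{k} position(s): {ai:.0f} FLOPs/byte vs ridge {RIDGE_POINT:.0f} -> {bound}")
```

Under these assumptions, scoring eight positions is still far below the ridge point, which is why the extra positions ride along with the weight read at almost no added cost.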

The core idea: draft cheaply, verify in parallel

Speculative decoding splits generation into two roles. A cheap drafter proposes the next k tokens. The expensive target model verifies all k in parallel, accepts the longest correct prefix, and corrects the first wrong token. Then the cycle repeats from the new position.

Walk through one cycle with k = 4:

  1. The drafter proposes four tokens: "the", "cat", "sat", "on".
  2. The target runs one forward pass over those four positions, producing its own probability distribution at each.
  3. A verification rule accepts tokens left to right as long as they agree with the target, and stops at the first disagreement, where it samples a corrected token from the target instead.
  4. Say the target agrees with "the", "cat", "sat" but would not have chosen "on". You accept three drafted tokens, replace the fourth with the target's choice, and you have advanced four tokens using one target forward pass plus the cheap drafting.
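For greedy decoding (temperature 0), the cycle above can be sketched in a few lines. This is an illustration only: `verify_greedy` is a hypothetical helper, the replacement token "down" is made up for the example, and the sampling case uses the probabilistic rule described in the next section. The sketch also emits one extra "bonus" token when the whole draft is accepted, as in the original speculative sampling formulation, since the same target pass already scores the position after the draft.

```python
def verify_greedy(draft: list[str], target_choice: list[str]) -> list[str]:
    """Return the tokens emitted by one cycle under greedy (argmax) verification.

    draft[i] is the drafter's i-th proposed token. target_choice[i] is the token
    the target itself prefers at position i, given the context plus draft[:i].
    target_choice has one extra entry: the target's pick after the full draft.
    """
    emitted = []
    for i, proposed in enumerate(draft):
        if proposed != target_choice[i]:
            emitted.append(target_choice[i])   # first disagreement: take the target's token
            return emitted
        emitted.append(proposed)               # agreement: keep the drafted token
    emitted.append(target_choice[len(draft)])  # whole draft accepted: bonus token
    return emitted


draft = ["the", "cat", "sat", "on"]
target_choice = ["the", "cat", "sat", "down", "."]  # hypothetical target preferences
print(verify_greedy(draft, target_choice))
```

Three drafted tokens survive and the fourth is replaced, so the cycle advances four tokens for one target pass.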

The speedup is governed by the average number of tokens emitted per target pass. If you accept three drafted tokens plus the one correction, you produced four tokens in one target pass instead of four, a 4x reduction in target forward passes for that cycle. The realized wall-clock speedup is lower because drafting is not free and the per-pass cost rises slightly with k, but the structure is why 2x to 4x is achievable.

Why it is lossless

The objection writes itself: if a small drafter is choosing tokens, is the output worse? No, and this is the property that makes speculative decoding worth using rather than a quality tradeoff to negotiate. The verification step uses a modified rejection sampling scheme, introduced in the original speculative sampling work (Leviathan et al. and Chen et al., 2023), that provably preserves the target model's output distribution.

The rule: for a drafted token x with target probability p(x) and draft probability q(x), accept it with probability min(1, p(x) / q(x)). If rejected, sample a replacement from the normalized residual distribution proportional to max(0, p(x) − q(x)). Working through the algebra, the probability that this process emits any particular token x equals exactly p(x), the target's own probability. The drafter only affects how fast you sample, never what distribution you sample from. The generated text is identical in distribution to standard autoregressive sampling from the target, up to hardware floating-point numerics. A higher-quality drafter raises the acceptance rate and the speedup; it cannot change the answer.
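A quick way to convince yourself of the guarantee is to simulate the rule on a toy vocabulary. The sketch below is a single-token illustration with made-up distributions `p` and `q`, not vLLM's sampler; `speculative_sample` is a hypothetical name.

```python
import random
from collections import Counter


def speculative_sample(p: dict[str, float], q: dict[str, float], rng: random.Random) -> str:
    """Emit one token: draft from q, then accept or resample so the result follows p."""
    tokens = list(p)
    # Drafter proposes x ~ q.
    x = rng.choices(tokens, weights=[q[t] for t in tokens])[0]
    # Accept with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: resample from the residual max(0, p - q); choices() normalizes the weights.
    residual = [max(0.0, p[t] - q[t]) for t in tokens]
    return rng.choices(tokens, weights=residual)[0]


# Made-up target (p) and draft (q) distributions that disagree substantially.
p = {"a": 0.6, "b": 0.3, "c": 0.1}
q = {"a": 0.2, "b": 0.5, "c": 0.3}

rng = random.Random(0)
n = 100_000
counts = Counter(speculative_sample(p, q, rng) for _ in range(n))
for token in p:
    print(f"{token}: empirical {counts[token] / n:.3f}, target {p[token]:.3f}")
```

Even though the drafter here is badly mismatched, the empirical frequencies should land on the target's probabilities; only the fraction of accepted drafts suffers.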

That guarantee is the whole pitch. You are not trading quality for speed. You are spending idle compute to skip redundant memory reads.

The four methods, and when each wins

vLLM supports several drafting strategies. They differ in what produces the guess and therefore in what workloads they accelerate.

Method | Extra model needed | Best for | Catch
N-gram / prompt lookup | None | High input-output overlap: RAG, summarization, code editing, structured rewrites | Only helps when the output echoes the context
Draft model | A small model with the same tokenizer | General-purpose, easy to set up | Acceptance depends on draft quality; the draft costs memory and time
Medusa | Extra prediction heads trained on the target | Avoiding a separate model | Heads must be trained for the target
EAGLE-3 | A lightweight feature-level head, trained per target | Highest acceptance and speedup | The head is target-specific; you need the matching one

N-gram drafting deserves special attention because it needs no model at all. It proposes the next tokens by finding where the recent context repeats earlier in the prompt and copying what followed. For workloads where the output quotes the input heavily, which is most of RAG, summarization, and code editing, this is close to free acceleration. For open-ended generation where the output does not echo the input, it accepts almost nothing and you should use a model-based drafter instead.
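The lookup itself is simple enough to sketch. The function below is a simplified illustration of prompt-lookup drafting, not vLLM's implementation: `ngram_propose` and its parameters are hypothetical, and the choice to prefer the longest n-gram and the most recent match is an assumption made for the example.

```python
def ngram_propose(tokens: list[int], min_n: int = 2, max_n: int = 4, k: int = 5) -> list[int]:
    """Propose up to k tokens by copying what followed an earlier occurrence
    of the current suffix. Tries the longest n-gram first."""
    for n in range(min(max_n, len(tokens) - 1), min_n - 1, -1):
        suffix = tokens[-n:]
        # Scan earlier positions from most recent to oldest, excluding the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                follow = tokens[start + n:start + n + k]
                if follow:
                    return follow
    return []  # no match: nothing to propose this cycle


# The context ends in (2, 3), which also appeared earlier followed by 4, 5, 6, 7, 2.
print(ngram_propose([1, 2, 3, 4, 5, 6, 7, 2, 3]))  # [4, 5, 6, 7, 2]
# Nothing repeats here, so there is no draft.
print(ngram_propose([1, 2, 3, 4]))                 # []
```

When the output is quoting the prompt, the copied continuation is usually what the target would have produced anyway; when nothing repeats, the drafter proposes nothing and costs close to nothing.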

Turning it on in vLLM

The examples below use vLLM 0.10.1 or newer and the Qwen3 family, whose sizes share a tokenizer so the small models can draft for the large ones. Install with a version floor so the speculative_config schema matches:

bash
pip install "vllm>=0.10.1"

N-gram drafting (no extra model)

This is the lowest-effort win. Point vLLM at your target model and add a speculative config that uses n-gram lookup:

python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,   # propose up to 5 tokens per cycle
        "prompt_lookup_min": 2,        # shortest n-gram to match
        "prompt_lookup_max": 4,        # longest n-gram to match
    },
)

sampling = SamplingParams(temperature=0.0, max_tokens=256)
passage = "..."  # placeholder: the source text you want summarized
out = llm.generate(["Summarize the following passage:\n\n" + passage], sampling)
print(out[0].outputs[0].text)

On a summarization or RAG prompt where the answer reuses chunks of the source, this alone can deliver a meaningful speedup with zero extra model weights loaded.

Draft-model speculation (a small model drafts for a big one)

When the output does not echo the input, use a real model as the drafter. A small Qwen3 drafts for a larger Qwen3 target; the shared tokenizer is what makes them compatible:

python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",                  # target
    speculative_config={
        "model": "Qwen/Qwen3-0.6B",         # cheap drafter, same vocab
        "num_speculative_tokens": 5,
        "draft_tensor_parallel_size": 1,
    },
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Write a short product description for a mechanical keyboard."], sampling)
print(out[0].outputs[0].text)

The drafter should be small enough that proposing tokens is cheap relative to a target pass, but capable enough that the target accepts its proposals often. A 0.6B drafting for an 8B is a reasonable starting ratio; tune num_speculative_tokens (more is better when acceptance is high, worse when it is low) for your workload.

Measuring the speedup honestly

Do not trust the feature; measure it on your prompts. Compare tokens per second with and without speculation, holding sampling parameters fixed:

python
import gc
import time

import torch
from vllm import LLM, SamplingParams

PROMPTS = [...]  # a representative sample of YOUR real traffic
SP = SamplingParams(temperature=0.0, max_tokens=256)

def tok_per_s(llm):
    t = time.perf_counter()
    outs = llm.generate(PROMPTS, SP)
    dt = time.perf_counter() - t
    toks = sum(len(o.outputs[0].token_ids) for o in outs)
    return toks / dt

baseline = LLM(model="Qwen/Qwen3-8B")
print("baseline:", tok_per_s(baseline), "tok/s")

# Release the baseline before loading a second engine, or the two will
# compete for GPU memory. If memory is not reclaimed in-process, run each
# configuration in its own process instead.
del baseline
gc.collect()
torch.cuda.empty_cache()

spec = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={"model": "Qwen/Qwen3-0.6B", "num_speculative_tokens": 5},
)
print("speculative:", tok_per_s(spec), "tok/s")

vLLM also logs speculative metrics, including the acceptance rate and the mean number of accepted tokens per step. That acceptance length is the number to watch: it is the direct driver of the speedup, and it tells you immediately whether the drafter is well matched to your traffic. An acceptance length near 1 means the drafter is rarely right and you are paying drafting cost for nothing.
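If you only have raw counts rather than the summarized figures, both numbers are simple ratios. The sketch below shows the arithmetic; the counter names are hypothetical and are not vLLM's metric names, and the sample counts are made up.

```python
def acceptance_stats(num_drafts: int, drafted_tokens: int, accepted_tokens: int) -> dict[str, float]:
    """Summarize speculation from three raw counters (hypothetical names).

    num_drafts:      number of draft-and-verify cycles
    drafted_tokens:  total tokens the drafter proposed
    accepted_tokens: total proposed tokens the target kept
    """
    return {
        # Fraction of proposed tokens the target kept.
        "acceptance_rate": accepted_tokens / drafted_tokens,
        # Tokens emitted per target pass: accepted drafts plus the one token
        # the target always contributes (the correction or the bonus token).
        "acceptance_length": 1 + accepted_tokens / num_drafts,
    }


# Made-up example: 1,000 cycles at k = 5, with 2,600 drafted tokens accepted.
print(acceptance_stats(num_drafts=1000, drafted_tokens=5000, accepted_tokens=2600))
```

With these counts the drafter is right about half the time and each target pass yields about 3.6 tokens; if nothing were ever accepted, the acceptance length would sit at exactly 1.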

Tuning the number of speculative tokens

num_speculative_tokens, the k from the walkthrough, is the main dial once a drafter is in place. Larger k proposes more tokens per cycle. That helps when acceptance is high, because you ride a long correct draft and skip more passes, and it hurts when acceptance is low, because you spend drafting effort on tokens that get rejected and each verification pass costs slightly more compute. There is an optimum, and it is workload-dependent, so sweep it:

python
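import gc
import torch
from vllm import LLM

for k in [1, 3, 5, 7, 10]:  # reuses tok_per_s() from the benchmark above
    llm = LLM(
        model="Qwen/Qwen3-8B",
        speculative_config={"model": "Qwen/Qwen3-0.6B", "num_speculative_tokens": k},
    )
    print(f"k={k:2d}: {tok_per_s(llm):.1f} tok/s")
    del llm  # free the GPU before the next config
    gc.collect()
    torch.cuda.empty_cache()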

As a rule of thumb, predictable workloads (code, structured output) favor larger k, often 5 to 10, while high-entropy generation favors smaller k, 2 to 4, or no speculation at all. Watch for the curve to rise and then fall: past the optimum, throughput drops because you are drafting tokens you will not keep.
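The rise-then-fall shape falls out of a simple cost model, which can help you guess a sensible sweep range before touching a GPU. The sketch below follows the simplified analysis in Leviathan et al.: it assumes each drafted token is accepted independently with probability `alpha`, and that one drafter step costs a fixed fraction `draft_cost` of a target pass. Both are idealizations, and the values used here are made up.

```python
def modeled_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding under a simple model.

    alpha:      chance each drafted token is accepted (assumed independent)
    draft_cost: time of one drafter step as a fraction of one target pass
    """
    # Expected tokens emitted per cycle, including the target's own token.
    tokens = (k + 1) if alpha >= 1.0 else (1 - alpha ** (k + 1)) / (1 - alpha)
    cycle_time = k * draft_cost + 1.0  # k drafter steps plus one target pass
    return tokens / cycle_time


def best_k(alpha: float, draft_cost: float, candidates=range(1, 11)) -> int:
    """Return the candidate k with the highest modeled speedup."""
    return max(candidates, key=lambda k: modeled_speedup(alpha, k, draft_cost))


for alpha in (0.9, 0.5):
    k = best_k(alpha, draft_cost=0.1)
    print(f"alpha={alpha}: best k={k}, modeled speedup {modeled_speedup(alpha, k, 0.1):.2f}x")
```

With an assumed draft cost of 0.1, a predictable workload (alpha 0.9) keeps improving out to k = 10, while a high-entropy one (alpha 0.5) peaks at k = 2. Treat this as a guide for where to sweep, not a replacement for measuring.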

EAGLE-3, the higher-performance option

For the largest gains, EAGLE-3 (NeurIPS 2025) replaces the generic drafter with a lightweight head trained to predict the target's own next tokens from its internal features, using tri-layer feature fusion and token-level generation. The published results report up to roughly 5x speedups on large targets, with acceptance lengths that stay high even for tokens deep into a drafted span, where earlier methods fell off. In vLLM you select it with "method": "eagle3" and point it at the matching EAGLE-3 draft checkpoint for your target:

python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={
        "method": "eagle3",
        "model": "<eagle3-head-for-your-target>",  # target-specific; use the head trained for this model
        "num_speculative_tokens": 5,
        "draft_tensor_parallel_size": 1,
    },
)

The one catch is in that placeholder: EAGLE-3 heads are trained per target model, so you need the head that matches your exact target, either an official release or one you train. That coupling is the price of EAGLE-3's higher acceptance rate. When a matching head exists for your model, it is usually the best option; when it does not, n-gram or draft-model speculation are the fallbacks.

Drafting a tree, not a chain

One refinement explains why modern methods accept longer spans than the simple linear picture suggests. The drafter does not have to propose a single sequence. EAGLE-2 and Medusa propose a tree of candidate continuations, several possible next tokens, each with several possible successors, and the target verifies the whole tree in one forward pass using a specially masked attention that keeps the branches independent. The target then accepts the best root-to-leaf path through the tree. A tree hedges against the drafter's uncertainty: when the drafter is unsure whether the next token is "the" or "a", it proposes both and lets the target pick, so more tokens survive verification per cycle than a single guessed chain would yield. vLLM manages the tree internally, so you do not configure its shape directly, but it is the mechanism behind the high acceptance lengths that make EAGLE-3 worth the per-target head.
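The "specially masked attention" is easier to picture with a concrete mask. The sketch below builds one for a small draft tree: each drafted node may attend to itself and its ancestors, never to a sibling branch. It is a conceptual illustration with a hypothetical `tree_attention_mask` helper, not the layout any particular engine uses, and it leaves out attention to the already-accepted context, which every node can see.

```python
def tree_attention_mask(parents: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when draft node i may attend to draft node j.

    parents[i] is the index of node i's parent, or -1 for a node that hangs
    directly off the already-accepted context. Each node sees only itself and
    its ancestors, so sibling branches stay independent.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking every ancestor
            mask[i][j] = True
            j = parents[j]
    return mask


# Hypothetical tree: node 0 = "the", node 1 = "a",
# nodes 2 and 3 continue "the", node 4 continues "a".
parents = [-1, -1, 0, 0, 1]
for row in tree_attention_mask(parents):
    print("".join("x" if allowed else "." for allowed in row))
```

Node 4 sees node 1 and itself but nothing on the "the" branch, which is what lets one forward pass score every branch as if it were the only continuation.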

The batch-size catch nobody mentions

Here is the production reality that the benchmark charts bury, and the single most important thing in this guide. Speculative decoding is a low-concurrency latency optimization, and its benefit collapses as batch size grows.

The reason follows directly from why it works. Speculation trades spare compute for fewer memory reads. At batch size one, you have abundant spare compute, so converting it into skipped passes is a clear win. But as you batch more requests together, the GPU's compute fills up: with enough concurrent sequences, you are already doing useful arithmetic on every weight you load, and the decode step stops being memory-bound and starts being compute-bound. At that point the extra verification work of speculation competes for the same busy compute units, and the "free" capacity it relied on is gone.

The EAGLE-3 paper's own numbers show the shape of this cliff: a speedup around 2.3x at batch size 4 falls to roughly break-even by batch size 32. A method that looks like a 4x win in a single-stream latency benchmark can be worth nothing, or slightly negative, on a server saturated with concurrent requests. This is not a flaw in any particular implementation; it is the physics of the optimization.
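A toy roofline model reproduces the shape of that collapse. The sketch below assumes a forward pass takes whichever is longer, streaming the weights or doing the arithmetic, and that speculation emits a fixed number of tokens per cycle. The hardware figures, model size, and acceptance numbers are all illustrative assumptions, and the model ignores drafting cost, attention, and KV-cache traffic, so the batch size at which the curve bends is not a prediction for real hardware.

```python
# Illustrative assumptions, not measurements.
PARAMS = 8e9             # 8B-class model
BYTES_PER_PARAM = 2      # bf16 weights
MEM_BANDWIDTH = 2e12     # assumed: 2 TB/s
PEAK_FLOPS = 4e14        # assumed: 400 TFLOP/s


def step_seconds(batch: int, positions: int) -> float:
    """One forward pass: bounded by weight streaming or by arithmetic."""
    read = PARAMS * BYTES_PER_PARAM / MEM_BANDWIDTH
    math = 2 * PARAMS * batch * positions / PEAK_FLOPS
    return max(read, math)


def modeled_speedup(batch: int, k: int = 4, tokens_per_cycle: float = 3.0) -> float:
    """Plain time per token divided by speculative time per token.

    Drafting cost is ignored, so this is an optimistic bound.
    """
    plain = step_seconds(batch, 1)
    speculative = step_seconds(batch, k) / tokens_per_cycle
    return plain / speculative


for batch in (1, 8, 32, 64, 128, 256, 512):
    print(f"batch {batch:3d}: modeled speedup {modeled_speedup(batch):.2f}x")
```

While the pass is memory-bound, the modeled speedup equals the tokens emitted per cycle. Once the batch is large enough that verification becomes compute-bound, it decays toward tokens_per_cycle / k, which is below 1 whenever some drafted tokens are rejected.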

The practical consequence: decide based on your serving regime, not the headline number.

  • Latency-critical, low concurrency (interactive coding assistant, a single user's chat, on-device, agentic loops where one request blocks the next): speculation is a strong win. Use it.
  • Throughput-critical, high concurrency (a busy multi-tenant endpoint running large batches): measure carefully, and expect the gains to shrink or vanish. Continuous batching may already be saturating your compute.
  • In between: benchmark at your actual concurrency, not at batch size one, or you will ship a "4x speedup" that does nothing for your real load.

This is the same honest-measurement discipline that the evaluation crisis piece argues for everywhere: a number measured in the convenient regime is not a result. Speculative decoding measured at batch size one tells you almost nothing about a server running at batch size 32.

Acceptance also depends on the task

Beyond batch size, the second variable is the workload, because acceptance rate is not a model constant. It depends on how predictable your outputs are.

  • High acceptance: code, structured output, on-distribution continuations, anything where the next tokens are highly constrained. The drafter guesses right often, so you accept long spans.
  • Low acceptance: creative writing, out-of-distribution prompts, high-temperature sampling, anything where the next token is genuinely uncertain. The drafter is frequently wrong, acceptance length drops toward 1, and the speedup evaporates.

Temperature matters here too: greedy or low-temperature decoding is more predictable and accepts more; high-temperature sampling spreads the target distribution and rejects more drafted tokens. None of this changes output quality, because of the lossless guarantee, but it changes how much speed you get. The corollary is that you must benchmark on a representative sample of your real prompts at your real sampling settings. A speedup measured on greedy code completion will not transfer to high-temperature creative generation.

Gotchas and troubleshooting

A handful of failure modes turn an expected speedup into a slowdown or an outright error.

Tokenizer mismatch. Draft-model speculation requires the draft and target to share a tokenizer, because the draft proposes token IDs that the target must interpret directly. Pairing a drafter from one family with a target from another will fail or produce near-zero acceptance. Stay within a single model family, or use n-gram drafting (which copies the target's own tokens) or EAGLE (which works in the target's feature space), both of which sidestep the issue.
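A cheap pre-flight check is to compare the two vocabularies before loading either model onto a GPU. The helper below is a hypothetical sketch that works on plain token-to-ID dictionaries; it checks one reasonable notion of compatibility (identical token-to-ID mappings) and does not claim to mirror whatever validation vLLM performs.

```python
def vocab_mismatches(target_vocab: dict[str, int], draft_vocab: dict[str, int]) -> list[str]:
    """Return tokens whose IDs differ between the two vocabularies,
    or that exist in only one of them. An empty list means they match."""
    all_tokens = set(target_vocab) | set(draft_vocab)
    return sorted(t for t in all_tokens if target_vocab.get(t) != draft_vocab.get(t))


# Toy vocabularies for illustration.
target = {"the": 0, "cat": 1, "sat": 2}
same_family = {"the": 0, "cat": 1, "sat": 2}
other_family = {"the": 0, "cat": 2, "dog": 1}

print(vocab_mismatches(target, same_family))   # []
print(vocab_mismatches(target, other_family))  # ['cat', 'dog', 'sat']
```

With Hugging Face transformers, `AutoTokenizer.from_pretrained(name).get_vocab()` returns a dictionary of this shape for each model, so the same function can be pointed at a real target and drafter pair.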

The draft competes for memory. A separate draft model and its KV cache occupy GPU memory that would otherwise hold target KV cache, which can reduce the batch size or context length you can serve. On a memory-tight deployment, the draft's footprint can cost you more in lost batching than it saves in latency. EAGLE heads are far lighter than a full draft model, which is part of their appeal.
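A rough estimate of the drafter's footprint makes the trade-off concrete. The arithmetic below is a sketch: the function names are hypothetical, the drafter shape (0.6e9 parameters, 28 layers, 8 KV heads, head dimension 128) is an assumption you should check against the model's config.json, and how an engine actually allocates draft KV cache is implementation-specific.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache bytes one token occupies: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def draft_footprint_gb(params: float, layers: int, kv_heads: int, head_dim: int,
                       cached_tokens: int, bytes_per_param: int = 2) -> float:
    """Approximate GPU memory for the drafter: weights plus its KV cache, in GB."""
    weights = params * bytes_per_param
    kv_cache = cached_tokens * kv_bytes_per_token(layers, kv_heads, head_dim)
    return (weights + kv_cache) / 1e9


# Assumed drafter shape, bf16 everywhere, serving 8 sequences of 2,048 tokens.
footprint = draft_footprint_gb(0.6e9, layers=28, kv_heads=8, head_dim=128,
                               cached_tokens=8 * 2048)
print(f"approximate drafter footprint: {footprint:.2f} GB")
```

Under these assumptions the KV cache, not the weights, is the larger share of the drafter's memory, and it grows with every concurrent sequence you serve.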

Draft tensor parallelism. EAGLE heads and small drafters generally run with draft_tensor_parallel_size: 1 even when the target is sharded across several GPUs, because the draft is small and tensor-parallel communication overhead would dominate its tiny compute. Getting this wrong is a common cause of "speculation made it slower."

Acceptance length near 1 means stop. If vLLM's metrics report an acceptance length close to 1, the drafter is almost never right for your traffic. You are paying drafting cost for no benefit. Switch drafters, lower k, or turn speculation off for that workload.

Guided decoding interacts with acceptance. If you constrain generation with a grammar or JSON schema, the constraint already prunes the target's distribution, which changes how often drafted tokens are valid. Always measure speculation with your guided-decoding settings turned on, not off, or the benchmark will not match production.

When to use it

Speculative decoding is close to a default for single-stream, latency-sensitive serving, and it is one of the few inference optimizations that costs you nothing in output quality. Reach for it when:

  • Your decode latency matters and your concurrency is low to moderate.
  • Your workload is predictable: code, structured output, or RAG and summarization where the output reuses the input (start with n-gram for the last case).
  • You can pin a draft model with the same tokenizer, or you have a matching EAGLE-3 head.

Be skeptical when your server runs large batches, when your outputs are high-entropy, or when you have not measured at your actual concurrency. In those cases the honest answer is "benchmark it on your traffic," and the honest result may be "no measurable gain." That is not a failure of the technique; it is the technique telling you that your GPU was already busy, which is its own kind of good news.

Key Takeaways

  1. Decoding is memory-bound, so verification is nearly free. A target model can check k drafted tokens in one forward pass for almost the cost of producing one, because both read the weights once. Speculative decoding turns spare compute into skipped passes.
  2. It is mathematically lossless. The modified rejection sampling rule (accept with probability min(1, p/q), resample from the residual otherwise) provably emits the target's exact distribution. A better drafter changes the speed, never the output.
  3. The speedup tracks the acceptance length. Tokens emitted per target pass is the number that matters, and it caps the wall-clock gain once drafting cost is subtracted. Watch it in vLLM's speculative metrics; an acceptance length near 1 means the drafter is mismatched to your traffic.
  4. N-gram drafting is a free win for input-echoing workloads. RAG, summarization, and code editing reuse the context, so prompt-lookup speculation needs no extra model. Open-ended generation needs a model-based drafter.
  5. EAGLE-3 is the high end, at the cost of a per-target head. It reports up to ~5x on large targets with high acceptance deep into a span, but the head is trained for a specific target model, so you need the matching one.
  6. Batch size is the catch. Speculation is a low-concurrency latency optimization. The EAGLE-3 paper shows ~2.3x at batch 4 collapsing to break-even by batch 32, because a saturated server is already compute-bound and has no spare capacity to spend.
  7. Acceptance depends on the task and temperature. Predictable, low-temperature, structured outputs accept long spans; creative, high-temperature, out-of-distribution generation accepts little. Benchmark on your real prompts at your real settings.
  8. Measure in your serving regime, not at batch size one. A 4x single-stream speedup can be worth nothing under production batching. The convenient-regime number is not a result.
