What is Speculative Decoding?

    An inference acceleration technique in which a small, fast draft model proposes several tokens ahead and the larger target model verifies them all in a single forward pass.

    Definition

    Speculative decoding is an inference optimization technique that accelerates text generation from large language models without changing the output distribution. The key insight is that LLM inference is memory-bandwidth-bound — the bottleneck is reading model weights from memory, not computing with them. A single forward pass that processes 5 tokens costs nearly the same as one that processes 1 token, because the weight-loading cost dominates. Speculative decoding exploits this by using a small, fast draft model to predict several tokens ahead, then verifying all predictions in a single forward pass of the large target model.
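    As a rough illustration of the bandwidth argument, the sketch below estimates the weight-streaming time for a 70B-parameter model; the parameter count, precision, and bandwidth figures are illustrative assumptions, not measurements.

    ```python
    # Back-of-the-envelope estimate of why one forward pass costs roughly the same
    # whether it scores 1 token or 5: streaming the weights dominates the time.
    # All numbers below are illustrative assumptions.
    params = 70e9            # 70B-parameter target model
    bytes_per_param = 2      # fp16 weights
    bandwidth = 2e12         # ~2 TB/s of effective GPU memory bandwidth (assumed)

    weight_read_time = params * bytes_per_param / bandwidth
    print(f"~{weight_read_time * 1e3:.0f} ms per forward pass just to read weights")
    # The extra compute needed to score 5 tokens instead of 1 is tiny by comparison,
    # so verifying K draft tokens is nearly as cheap as decoding a single token.
    ```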

    The technique works in three steps: (1) the draft model generates K candidate tokens autoregressively (fast, because the draft model is small), (2) the target model processes all K candidates in a single forward pass (computing the probability it assigns to each), and (3) a verification step accepts the longest prefix of candidates that the target model agrees with, rejecting the rest. Crucially, the acceptance criterion is designed so that the final output distribution is mathematically identical to standard autoregressive decoding from the target model — speculative decoding is lossless.
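    A minimal sketch of one draft-verify round is below. It assumes greedy decoding, where "the target agrees" reduces to the draft token matching the target's argmax; the draft_next and target_argmax callables are hypothetical stand-ins for real models, and the exact acceptance rule for sampled decoding is described under How It Works.

    ```python
    # One speculative-decoding round, sketched for greedy decoding with toy
    # stand-in models (draft_next / target_argmax are hypothetical callables).
    from typing import Callable, List

    def speculative_round(
        context: List[int],
        draft_next: Callable[[List[int]], int],     # draft model's greedy next token
        target_argmax: Callable[[List[int]], int],  # target model's greedy next token
        k: int = 5,
    ) -> List[int]:
        # Step 1: the draft model proposes k tokens autoregressively (cheap).
        drafts: List[int] = []
        for _ in range(k):
            drafts.append(draft_next(context + drafts))

        # Step 2: a real system scores all k positions in ONE target forward pass;
        # this toy version simply queries the stand-in target at each prefix.
        targets = [target_argmax(context + drafts[:i]) for i in range(k)]

        # Step 3: accept the longest agreeing prefix, then emit the target's own
        # token at the first disagreement, so the output matches pure target decoding.
        out: List[int] = []
        for d, t in zip(drafts, targets):
            if d == t:
                out.append(d)
            else:
                out.append(t)
                break
        else:
            # All k drafts accepted: the target's last distribution yields a bonus token.
            out.append(target_argmax(context + drafts))
        return out

    # Toy usage: the draft always proposes token 1; the target wants 1, 1, then 2.
    print(speculative_round([0], lambda ctx: 1, lambda ctx: 1 if len(ctx) < 3 else 2))  # [1, 1, 2]
    ```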

    The speedup depends on the acceptance rate — how often the draft model's predictions match what the target model would have generated. When the draft model is a good approximation of the target (e.g., a 1B model drafting for a 70B model from the same family), acceptance rates of 70-90% are common, yielding 2-3x speedups in tokens per second with zero quality degradation.
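    Under the simplifying assumption of a constant, independent per-token acceptance rate (as in the standard speculative sampling analysis), the expected number of tokens produced per target forward pass has a closed form, sketched below. The realized wall-clock speedup is lower than this figure because drafting itself takes time.

    ```python
    # Expected tokens emitted per target forward pass, assuming each draft token is
    # accepted independently with probability alpha and K tokens are drafted.
    def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
        # Geometric series 1 + alpha + alpha^2 + ... + alpha^k: the accepted
        # prefix plus the one token the target always contributes itself.
        return (1 - alpha ** (k + 1)) / (1 - alpha)

    print(expected_tokens_per_target_pass(alpha=0.8, k=5))  # ~3.7 tokens per pass
    ```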

    Why It Matters

    LLM inference latency directly impacts user experience and cost. Users perceive delays beyond 200ms between tokens as sluggish, and long generation tasks (summarization, code generation) can take tens of seconds. Speculative decoding reduces this latency by 2-3x without any change to output quality, making it one of the few optimizations that is truly free in terms of quality trade-offs.

    For inference providers, speculative decoding reduces the GPU-hours required per request, directly lowering serving costs. Unlike quantization, which trades quality for speed, speculative decoding is mathematically guaranteed to produce the same output distribution as standard decoding. This makes it suitable for applications where output quality cannot be compromised.

    How It Works

    The draft model runs standard autoregressive decoding to generate K tokens (typically K=4-8). This is fast because the draft model is 10-50x smaller than the target model. The target model then processes the entire sequence (original context plus K draft tokens) in a single forward pass, producing probability distributions for each position.
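    The sketch below shows these two stages with small stand-in models (distilgpt2 drafting for gpt2, which share a tokenizer) so it can run on modest hardware; the 70B/8B pairing works the same way. It assumes the Hugging Face transformers library is installed.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    target = AutoModelForCausalLM.from_pretrained("gpt2")        # stand-in target
    draft = AutoModelForCausalLM.from_pretrained("distilgpt2")   # stand-in draft

    context = tok("The capital of France is", return_tensors="pt").input_ids  # [1, T]
    K = 5

    # Step 1: the draft model extends the context by K tokens autoregressively.
    drafted = draft.generate(context, max_new_tokens=K, do_sample=False)       # [1, T+K]

    # Step 2: ONE target forward pass over context + drafts. With causal attention,
    # the logits at position i give the target's distribution for position i+1, so
    # the slice below scores every drafted token (plus one extra position) at once.
    with torch.no_grad():
        logits = target(drafted).logits                                        # [1, T+K, vocab]
    target_probs = logits[:, context.shape[1] - 1 :, :].softmax(dim=-1)        # [1, K+1, vocab]
    ```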

    The verification step walks through the K draft tokens sequentially. For each position, it compares the draft model's chosen token to the target model's distribution. If the target model assigns sufficient probability to the draft token (using a modified rejection sampling scheme), the token is accepted. If rejected, the verification stops, and the target model's distribution at the rejection point is used to sample a replacement token. This guarantees the output distribution matches standard target-model decoding while typically accepting most draft tokens.
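    A minimal sketch of the acceptance rule is below, assuming p_target and q_draft are [K, vocab] tensors of the two models' probabilities at the K draft positions and draft_tokens holds the proposed token ids; these names are placeholders rather than a specific library's API.

    ```python
    import torch

    def verify(draft_tokens, q_draft, p_target):
        """Accept a prefix of the drafted tokens; resample at the first rejection."""
        out = []
        for i, x in enumerate(draft_tokens):
            # Accept token x with probability min(1, p_target(x) / q_draft(x)).
            if torch.rand(()) < torch.clamp(p_target[i, x] / q_draft[i, x], max=1.0):
                out.append(x)
            else:
                # Rejected: sample a replacement from the leftover distribution
                # max(0, p - q), renormalized. This correction is what makes the
                # overall output distribution identical to the target model's.
                residual = torch.clamp(p_target[i] - q_draft[i], min=0.0)
                out.append(int(torch.multinomial(residual / residual.sum(), 1)))
                break
        # (When every draft is accepted, implementations also sample one bonus
        # token from the target's distribution at the position after the last draft.)
        return out
    ```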

    Example Use Case

    An inference platform serves Llama 3 70B to users requesting long-form content generation. Average generation takes 45 seconds for 2,000 tokens. By deploying Llama 3 8B as a draft model with speculative decoding (K=5, average acceptance rate 78%), they reduce generation time to 18 seconds — a 2.5x speedup with identical output quality. Users report significantly better experience, and the platform's GPU cost per request drops by 60%.

    Key Takeaways

    • Speculative decoding uses a small draft model to propose tokens verified by the large target model.
    • It produces mathematically identical outputs to standard decoding — zero quality loss.
    • Typical speedups are 2-3x, depending on draft model quality and acceptance rate.
    • The technique exploits the fact that LLM inference is memory-bandwidth-bound, not compute-bound.
    • Draft models from the same model family as the target yield the highest acceptance rates.

    How Ertas Helps

    Models fine-tuned in Ertas Studio can be deployed with speculative decoding by pairing a fine-tuned large model with a smaller draft model from the same family, both exported as GGUF files for efficient local inference.
