What is Top-p (Nucleus Sampling)?

    A sampling strategy that selects from the smallest set of tokens whose cumulative probability exceeds a threshold p, balancing output quality with diversity.

    Definition

    Top-p sampling, also known as nucleus sampling, is a decoding strategy introduced by Holtzman et al. in 2019 that dynamically selects the pool of candidate tokens based on their cumulative probability. Instead of considering all tokens in the vocabulary (which includes many extremely unlikely options) or a fixed number of top tokens (top-k), top-p includes the smallest set of tokens whose combined probability mass exceeds the threshold p. For example, with top-p = 0.9, the model considers only the most probable tokens that together account for 90% of the total probability, discarding the tokens that make up the remaining 10% of probability mass.
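
    To make the selection step concrete, here is a minimal sketch on a toy five-token distribution (the tokens and probabilities are invented for illustration):

        # Hypothetical next-token distribution over a five-token vocabulary.
        probs = {"cat": 0.45, "dog": 0.25, "bird": 0.15, "fish": 0.08, "xylophone": 0.07}

        # Accumulate probabilities in descending order until the sum reaches p.
        p, cumulative, nucleus = 0.9, 0.0, []
        for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
            nucleus.append(token)
            cumulative += prob
            if cumulative >= p:
                break

        print(nucleus)  # ['cat', 'dog', 'bird', 'fish'] -- 0.93 total mass; 'xylophone' is cut

    With p = 0.9, the first four tokens reach the threshold, so "xylophone" never becomes a candidate no matter how the random draw falls.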

    The key insight behind top-p is that the number of "reasonable" next tokens varies dramatically depending on context. After the word "the," thousands of tokens are plausible. After "the capital of France is," essentially only one token ("Paris") is reasonable. A fixed top-k value cannot adapt to this variation — it either includes too many unlikely tokens in the first case or unnecessarily restricts options in the second. Top-p naturally adapts: when the model is confident, the nucleus is small; when the model is uncertain, the nucleus grows to include more options.
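
    A quick sketch with two made-up distributions shows this adaptation: the same p yields nuclei of very different sizes.

        import numpy as np

        def nucleus_size(probs: np.ndarray, p: float = 0.9) -> int:
            """Size of the smallest token set whose cumulative probability reaches p."""
            cumulative = np.cumsum(np.sort(probs)[::-1])
            return int(np.searchsorted(cumulative, p) + 1)

        confident = np.array([0.97, 0.01, 0.01, 0.005, 0.005])  # near-certain next token
        uncertain = np.full(1000, 1 / 1000)                     # 1,000 equally likely tokens

        print(nucleus_size(confident))  # 1    -- tight nucleus
        print(nucleus_size(uncertain))  # ~900 -- wide nucleus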

    Top-p is typically used in combination with temperature. Temperature shapes the overall probability distribution, while top-p truncates the long tail of improbable tokens. A common production configuration is temperature 0.7 with top-p 0.9, which allows moderate creativity while preventing the model from selecting truly off-the-wall tokens. Setting top-p to 1.0 effectively disables it (all tokens are candidates), while very low values like 0.1 make the model almost as deterministic as greedy decoding.
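
    In most toolkits, both settings are simply arguments on the generation call. For example, a sketch using the Hugging Face transformers library (gpt2 is used purely as a stand-in for any causal language model):

        from transformers import AutoModelForCausalLM, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2")

        inputs = tokenizer("The capital of France is", return_tensors="pt")
        outputs = model.generate(
            **inputs,
            do_sample=True,      # enable sampling (otherwise decoding is greedy)
            temperature=0.7,     # first reshape the distribution
            top_p=0.9,           # then truncate the tail to the 0.9 nucleus
            max_new_tokens=20,
        )
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))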

    Why It Matters

    Top-p is essential for generating high-quality text that is both coherent and non-repetitive. Pure greedy decoding (always picking the most likely token) produces dull, repetitive text. Random sampling from the full distribution produces incoherent text. Top-p strikes the balance by allowing variety within the bounds of plausibility. It is among the most widely used sampling strategies in production LLM deployments and APIs because it produces consistently natural-sounding text across diverse contexts. Understanding top-p helps practitioners tune their inference pipeline for the right balance of quality and creativity.

    How It Works

    After the model produces logits and temperature is applied, the softmax function generates a probability distribution over the entire vocabulary. The tokens are then sorted by probability in descending order. Starting from the most probable token, probabilities are accumulated until the running sum exceeds the threshold p. All tokens included in this cumulative sum form the "nucleus" — the set of candidates for sampling. Tokens outside the nucleus are masked (their probabilities set to zero), and the remaining probabilities are renormalized to sum to 1.0. The final token is then sampled from this renormalized distribution. This process repeats for each generated token.
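
    The whole step fits in a short function. Here is a minimal NumPy sketch of a single decoding step (the logits are random placeholders standing in for a real model's output):

        import numpy as np

        def top_p_sample(logits: np.ndarray, p: float = 0.9, temperature: float = 1.0) -> int:
            """Sample one token id from the nucleus of the temperature-scaled distribution."""
            # 1. Temperature scaling, then softmax over the full vocabulary.
            scaled = logits / temperature
            probs = np.exp(scaled - scaled.max())   # subtract max for numerical stability
            probs /= probs.sum()

            # 2. Sort token ids by probability, descending.
            order = np.argsort(probs)[::-1]
            sorted_probs = probs[order]

            # 3. Keep the smallest prefix whose cumulative mass reaches p (the nucleus).
            cutoff = int(np.searchsorted(np.cumsum(sorted_probs), p) + 1)
            nucleus_ids = order[:cutoff]

            # 4. Renormalize within the nucleus and sample.
            nucleus_probs = probs[nucleus_ids] / probs[nucleus_ids].sum()
            return int(np.random.choice(nucleus_ids, p=nucleus_probs))

        # Placeholder logits over a 10-token vocabulary; one call per generated token.
        token_id = top_p_sample(np.random.randn(10), p=0.9, temperature=0.7)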

    Example Use Case

    A chatbot development team notices their model occasionally produces bizarre non-sequiturs in customer conversations. Investigation reveals they are using temperature 0.8 with no top-p filtering, allowing the model to occasionally sample from the very tail of the distribution. They add top-p = 0.9, which eliminates the worst outlier tokens while preserving the natural conversational variety. The bizarre responses disappear, and customer satisfaction scores improve by 12% — all from a single parameter change at inference time.

    Key Takeaways

    • Top-p (nucleus sampling) dynamically selects the smallest token set whose cumulative probability exceeds p.
    • It adapts to context: tight nucleus when the model is confident, wider when uncertain.
    • Common production values are 0.85–0.95, often combined with temperature 0.5–0.8.
    • Top-p prevents sampling from the extreme tail of the distribution, reducing incoherent outputs.
    • Setting top-p to 1.0 disables it; very low values approach greedy decoding.

    How Ertas Helps

    Ertas supports top-p as a configurable inference parameter both in Studio's evaluation playground and in Ertas Cloud API deployments. When users test their fine-tuned models in Studio, they can experiment with different top-p values alongside temperature to find the optimal sampling configuration for their use case. These settings can then be baked into the deployment configuration, ensuring consistent behavior in production.
