What is Temperature?

    A sampling parameter that controls the randomness of a language model's output — lower values produce more deterministic responses, higher values increase creativity and variety.

    Definition

    Temperature is a scalar parameter applied during the token sampling phase of text generation that controls how "random" or "creative" the model's outputs are. Technically, temperature scales the logits (raw prediction scores) before the softmax function converts them into a probability distribution over the vocabulary. A temperature of 1.0 uses the model's raw probabilities as-is. Values below 1.0 sharpen the distribution, making the most likely tokens even more likely (more deterministic). Values above 1.0 flatten the distribution, giving less likely tokens a greater chance of being selected (more random and creative).
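
    The effect is easy to see numerically. Below is a minimal NumPy sketch of temperature-scaled softmax; the three logit values are made up purely for illustration.

        import numpy as np

        def softmax_with_temperature(logits, temperature=1.0):
            # Divide logits by temperature, then normalize with softmax.
            scaled = np.asarray(logits, dtype=np.float64) / temperature
            scaled -= scaled.max()          # subtract the max for numerical stability
            exps = np.exp(scaled)
            return exps / exps.sum()

        logits = [4.0, 2.0, 1.0]            # hypothetical scores for three tokens
        for t in (0.5, 1.0, 2.0):
            print(t, softmax_with_temperature(logits, t).round(3))
        # 0.5 [0.98  0.018 0.002]   <- sharper: top token dominates
        # 1.0 [0.844 0.114 0.042]   <- the model's raw probabilities
        # 2.0 [0.629 0.231 0.14 ]   <- flatter: more mass on the tail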

    At one extreme, a temperature of 0 (or near-zero) makes the model always select the single most probable next token; this is called greedy decoding and produces fully deterministic output. At the other, a temperature of 2.0 flattens the distribution substantially, giving obscure or unusual tokens a real chance of being selected and often producing incoherent text. Most practical applications use temperatures between 0.0 and 1.0, with the sweet spot depending on the task.
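
    One practical wrinkle: since dividing by zero is undefined, inference code typically special-cases temperature 0 as a plain argmax instead of scaling the logits. A minimal sketch:

        import numpy as np

        def greedy_token(logits):
            # Temperature 0 short-circuits sampling: always take the top logit.
            return int(np.asarray(logits).argmax())

        print(greedy_token([4.0, 2.0, 1.0]))  # -> 0, every time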

    Temperature interacts with other sampling parameters like top-p (nucleus sampling) and top-k to shape the final token selection process. These parameters work together: temperature adjusts the overall distribution shape, while top-p and top-k truncate the distribution to exclude the least likely tokens. Understanding how these parameters interact is key to achieving the right balance between consistency and variety in model outputs.
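
    The sketch below shows one common ordering of these steps (temperature first, then truncation); real inference stacks vary in the details, so treat it as illustrative rather than canonical.

        import numpy as np

        def sample_token(logits, temperature=1.0, top_k=None, top_p=None):
            # 1. Temperature shapes the whole distribution.
            logits = np.asarray(logits, dtype=np.float64) / temperature
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()

            # 2. top-k truncation: keep only the k most likely tokens.
            if top_k is not None:
                cutoff = np.sort(probs)[-top_k]
                probs = np.where(probs >= cutoff, probs, 0.0)
                probs /= probs.sum()

            # 3. top-p truncation: keep the smallest set of tokens whose
            #    cumulative probability reaches top_p.
            if top_p is not None:
                order = np.argsort(probs)[::-1]
                cumulative = np.cumsum(probs[order])
                keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
                mask = np.zeros_like(probs)
                mask[keep] = probs[keep]
                probs = mask / mask.sum()

            # 4. Sample from whatever distribution survives.
            return int(np.random.choice(len(probs), p=probs))

        print(sample_token([4.0, 2.0, 1.0, 0.5], temperature=0.8, top_k=3, top_p=0.95))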

    Why It Matters

    Temperature is one of the most user-facing parameters in LLM deployment because it directly affects the perceived personality of the model. Customer support chatbots typically use low temperatures (0.1–0.3) for consistent, reliable answers. Creative writing assistants use higher temperatures (0.7–1.0) for diverse and surprising prose. Getting the temperature wrong can make a factual Q&A system unreliable (too high) or a creative tool boringly repetitive (too low). For fine-tuned models, the optimal temperature also depends on how the model was trained — models trained on diverse outputs may need lower temperatures to stabilize, while models trained on uniform outputs can tolerate higher values.

    How It Works

    After the model's forward pass produces a logit vector (one score per vocabulary token), temperature is applied by dividing each logit by the temperature value: adjusted_logit = logit / temperature. These adjusted logits are then passed through the softmax function to produce probabilities. When temperature < 1, division amplifies the differences between logits, concentrating probability mass on the top tokens. When temperature > 1, division compresses the differences, spreading probability more evenly. The final token is then sampled from this adjusted probability distribution (or from a truncated version if top-p or top-k is also applied).
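
    A quick way to quantify this: under softmax, the probability ratio between two tokens is the exponential of their logit gap divided by the temperature. With a hypothetical gap of 2.0:

        import math

        gap = 2.0                       # hypothetical logit difference between two tokens
        for t in (0.5, 1.0, 2.0):
            print(t, round(math.exp(gap / t), 1))
        # 0.5 -> 54.6x, 1.0 -> 7.4x, 2.0 -> 2.7x more likely than the other token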

    Example Use Case

    A content generation platform offers users a "creativity slider" that maps to the model's temperature parameter. For product descriptions (factual, consistent), the slider sets temperature to 0.2. For marketing taglines (creative, varied), it sets temperature to 0.8. For brainstorming sessions (maximum variety), it goes up to 1.2. The same fine-tuned model serves all three use cases, with temperature being the only parameter that changes — demonstrating how this single parameter can dramatically shift model behavior.
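
    A sketch of how such a slider might be wired up; the preset names and the lookup function are hypothetical, with values taken from the example above.

        # Hypothetical slider presets mapping content modes to temperatures.
        TEMPERATURE_PRESETS = {
            "product_description": 0.2,  # factual, consistent
            "marketing_tagline": 0.8,    # creative, varied
            "brainstorming": 1.2,        # maximum variety
        }

        def temperature_for(mode: str) -> float:
            # Fall back to the model's raw distribution for unknown modes.
            return TEMPERATURE_PRESETS.get(mode, 1.0)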

    Key Takeaways

    • Temperature scales logits before softmax, controlling the randomness of token selection.
    • Lower values (0.0–0.3) produce consistent, near-deterministic outputs; higher values (0.7–1.2) increase creativity.
    • Temperature 0 gives greedy decoding (always picks the most likely token).
    • It works alongside top-p and top-k to shape the final sampling distribution.
    • The optimal temperature depends on the task: factual tasks need low values, creative tasks need higher ones.

    How Ertas Helps

    When users evaluate their fine-tuned models in Ertas Studio, they can adjust the temperature parameter in the inference playground to see how it affects output quality and variety. Ertas's deployment options — whether through Ertas Cloud APIs or local inference via GGUF exports — all support temperature as a configurable parameter, giving end users control over the creativity-consistency trade-off in production applications.
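
    For the local path, here is a hedged sketch using the open-source llama-cpp-python package to run a GGUF export with an explicit temperature; the file name is hypothetical, and Ertas's own tooling may expose this differently.

        from llama_cpp import Llama

        llm = Llama(model_path="./my-finetune.gguf")  # hypothetical GGUF export

        # Low temperature for consistent, factual completions.
        out = llm("Summarize the return policy:", max_tokens=64, temperature=0.2)
        print(out["choices"][0]["text"])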
