What is KV Cache?

    A memory buffer that stores previously computed key and value tensors from the attention mechanism, avoiding redundant computation during autoregressive text generation.

    Definition

    The KV cache (key-value cache) is a memory optimization used during autoregressive text generation with transformer models. When a model generates text token by token, each new token must attend to all previous tokens through the attention mechanism, which involves computing key (K) and value (V) tensors for every token in the sequence. Without caching, the model would recompute K and V tensors for all previous tokens at every generation step: redundant work that grows linearly per step and quadratically over the course of generation.

    The KV cache stores the K and V tensors computed for each token as it is generated, so subsequent generation steps only need to compute K and V for the new token and read the cached values for everything before it. This reduces the per-step K/V computation from O(n) to O(1), dramatically accelerating generation. The trade-off is memory: the KV cache grows linearly with sequence length, and for long contexts or large batch sizes it can consume more GPU memory than the model weights themselves.

    For a typical 7B parameter model in FP16 with full multi-head attention, each token in the KV cache requires approximately 0.5 MB of memory across all layers (2 tensors x 32 layers x 4096 hidden dimensions x 2 bytes). A 4096-token context therefore requires roughly 2 GB of KV cache memory, and a 128K context window would need about 64 GB for the cache alone, several times the roughly 14 GB of FP16 model weights. This is why efficient KV cache management is one of the most important challenges in LLM serving.
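    This sizing is easy to sanity-check directly. The sketch below assumes a generic Llama-2-7B-style configuration (32 layers, 32 K/V heads, head dimension 128, full multi-head attention); the dimensions are illustrative assumptions, not a reading of any specific checkpoint.

```python
# Back-of-the-envelope KV cache sizing. The architecture numbers are
# assumed Llama-2-7B-style values, not read from any real checkpoint.
num_layers = 32
num_kv_heads = 32        # with GQA this would be smaller, e.g. 8
head_dim = 128
bytes_per_elem = 2       # FP16

# K and V each store num_kv_heads * head_dim values per layer per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"per token: {kv_bytes_per_token / 2**20:.2f} MiB")       # 0.50 MiB

for context_len in (4_096, 131_072):
    gib = context_len * kv_bytes_per_token / 2**30
    print(f"{context_len:>7} tokens: {gib:.0f} GiB")            # 2 GiB, 64 GiB
```

    Note that batch size multiplies every one of these figures, which is why concurrent serving puts so much pressure on cache memory.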

    Why It Matters

    KV cache management is a critical bottleneck in LLM serving at scale. Once model weights are loaded, the cache consumes most of the remaining GPU memory, so its size determines how many concurrent requests a GPU can serve (batch size) and how long contexts can be (maximum sequence length). Efficient KV cache systems translate directly to higher throughput, lower latency, and reduced serving costs.

    Innovations in KV cache management, such as PagedAttention (used by vLLM), which manages cache memory the way an operating system manages virtual memory pages, have dramatically improved LLM serving efficiency. Before PagedAttention, serving systems wasted 60-80% of KV cache memory to fragmentation and over-reservation. These innovations are as impactful as model architecture improvements for real-world deployment economics.

    How It Works

    During the prefill phase (processing the initial prompt), the model computes K and V tensors for all input tokens and stores them in the cache. During the decode phase (generating new tokens), each generation step computes K and V only for the new token, appends them to the cache, and uses the full cache for the attention computation.
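    The two phases are easiest to see in code. Below is a minimal single-head toy in PyTorch; the weights, dimensions, and inputs are all made up for illustration, and the attention output simply stands in for the next token's embedding rather than running a full transformer block.

```python
import torch

# Toy single-head attention decode loop with an explicit KV cache.
# Illustrative sketch only; not any particular library's API.
torch.manual_seed(0)
d_model = 64
w_q = torch.randn(d_model, d_model) / d_model**0.5
w_k = torch.randn(d_model, d_model) / d_model**0.5
w_v = torch.randn(d_model, d_model) / d_model**0.5

def attend(q, k_cache, v_cache):
    # q: (1, d), caches: (t, d); standard scaled dot-product attention.
    scores = q @ k_cache.T / d_model**0.5           # (1, t)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v_cache                        # (1, d)

# Prefill: compute K/V for every prompt token once and store them.
prompt = torch.randn(10, d_model)                   # 10 prompt "embeddings"
k_cache = prompt @ w_k
v_cache = prompt @ w_v

# Decode: each step computes K/V only for the newest token and appends.
x = prompt[-1:]                                     # last token's embedding
for _ in range(5):
    q = x @ w_q
    k_cache = torch.cat([k_cache, x @ w_k], dim=0)  # O(1) new K/V work
    v_cache = torch.cat([v_cache, x @ w_v], dim=0)
    x = attend(q, k_cache, v_cache)                 # attends over full cache

print(k_cache.shape)  # torch.Size([15, 64]): 10 prefill + 5 decoded tokens
```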

    Advanced KV cache techniques include:

    • Quantized KV cache: storing cached tensors in INT8 or INT4 instead of FP16, reducing memory by 2-4x with minimal quality impact (see the sketch after this list).
    • Grouped-query attention (GQA): using fewer K/V heads than query heads, which shrinks the cache proportionally.
    • Sliding window attention: caching only the most recent N tokens instead of the full history.
    • PagedAttention: managing cache memory in fixed-size pages, eliminating fragmentation and enabling efficient memory sharing across requests.
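    As one concrete example, the sketch below quantizes a cached K tensor to INT8 with a per-token scale. The tensor shapes and the per-token scaling granularity are illustrative assumptions; production systems use fused kernels and their own scale layouts.

```python
import torch

# Illustrative symmetric INT8 quantization of a cached K tensor,
# with one scale per cached token (shapes are made up).
k_fp16 = torch.randn(4096, 128, dtype=torch.float16)   # (tokens, head_dim)

scale = k_fp16.abs().amax(dim=-1, keepdim=True).float() / 127.0
k_int8 = torch.clamp((k_fp16.float() / scale).round(), -127, 127).to(torch.int8)

def dequantize(k_int8, scale):
    # Recover an FP16 approximation when the cache entry is read.
    return (k_int8.float() * scale).to(torch.float16)

err = (dequantize(k_int8, scale) - k_fp16).abs().mean()
print(f"mean abs error: {err.item():.4f}")
ratio = k_fp16.element_size() / k_int8.element_size()
print(f"memory reduction vs FP16: {ratio:.0f}x")        # 2x (ignoring scales)
```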

    Example Use Case

    A serving platform handles 200 concurrent chat sessions, each with up to 8K token contexts, on 4 A100-80GB GPUs. Without PagedAttention, KV cache fragmentation limits them to 80 concurrent sessions. With vLLM's PagedAttention managing the KV cache, they serve all 200 sessions with 95% memory utilization, reducing their GPU requirements from 10 GPUs to 4 — a 60% infrastructure cost reduction.
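    The capacity arithmetic behind a scenario like this can be sketched as follows. Every number here is an assumption (roughly 0.5 MiB of cache per token for a 7B-class model, 14 GiB of FP16 weights per GPU); the point is that reserving the full 8K context per session strands most of the memory, while on-demand paging sizes the cache to the tokens actually live.

```python
# Hypothetical capacity math; all numbers are assumptions, not measurements.
kv_gib_per_token = 0.5 / 1024                 # ~0.5 MiB/token, 7B-class model
cache_budget_gib = 4 * 80 - 4 * 14            # 4 A100-80GB minus FP16 weights

# Reserving the full 8K context up front (pre-paging allocators):
reserved_gib = 8_192 * kv_gib_per_token       # 4 GiB held per session
print(int(cache_budget_gib // reserved_gib))  # ~66 sessions fit

# Paged allocation, if sessions average ~2K live tokens at any moment:
live_gib = 2_048 * kv_gib_per_token           # 1 GiB actually in use
print(int(0.95 * cache_budget_gib // live_gib))  # ~250 sessions fit
```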

    Key Takeaways

    • The KV cache stores previously computed attention key/value tensors to avoid redundant computation.
    • It is essential for efficient autoregressive generation — without it, computation grows quadratically.
    • KV cache memory grows linearly with sequence length and can exceed model weight memory for long contexts.
    • Techniques like PagedAttention, GQA, and cache quantization optimize memory efficiency.
    • KV cache management directly determines serving throughput, latency, and infrastructure costs.

    How Ertas Helps

    Models fine-tuned in Ertas Studio inherit the KV cache efficiency characteristics of their base architecture. Studio supports models with grouped-query attention (GQA) that reduce KV cache memory requirements, enabling longer contexts during local inference with GGUF exports.
