What is a Context Window?

    The maximum number of tokens a language model can process in a single input-output sequence, determining how much text the model can 'see' at once.

    Definition

    The context window (also called context length or sequence length) defines the upper limit on the total number of tokens — including both the input prompt and the generated output — that a language model can handle in a single interaction. A model with a 4,096-token context window can process roughly 3,000 words of combined input and output; a model with a 128,000-token window can handle the equivalent of a short novel.
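
    Token budgets like these are easiest to reason about with a tokenizer in hand. Below is a minimal counting sketch using the Hugging Face transformers library; the gpt2 tokenizer, the 4,096-token window, and the 512-token output budget are illustrative assumptions, not properties of any particular model:

        # A minimal token-counting sketch. The gpt2 tokenizer is used purely
        # for illustration; the 4,096-token window and 512-token output
        # budget are assumed example values, not properties of any model.
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("gpt2")

        prompt = "Summarize the attached contract, focusing on termination clauses."
        token_ids = tokenizer.encode(prompt)

        context_window = 4096   # assumed model limit for this example
        max_new_tokens = 512    # room reserved for the generated output

        print(f"Prompt uses {len(token_ids)} tokens")
        print(f"Tokens left for output: {context_window - len(token_ids)}")
        assert len(token_ids) + max_new_tokens <= context_window, "request would overflow"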

    The context window is an architectural constraint baked into the model during pre-training. It is determined by the positional encoding scheme and the size of the attention matrices. In the standard self-attention mechanism, memory and compute scale quadratically with sequence length (O(n²)), which is why context windows were historically limited. Modern techniques like rotary positional embeddings (RoPE), sliding-window attention, and FlashAttention have enabled context windows to grow from 2K–4K tokens in early models to 128K–1M tokens in recent architectures, while keeping resource usage manageable.
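
    To make the quadratic growth concrete, the back-of-the-envelope arithmetic below sizes the attention score matrices at a few sequence lengths. The head count, precision, and batch size are assumed values for illustration only:

        # Back-of-the-envelope sizing of the attention score matrices to show
        # the O(n^2) growth described above. Head count, precision, and batch
        # size (1) are illustrative assumptions, not values from a real model.
        BYTES_FP16 = 2
        NUM_HEADS = 32

        for seq_len in (2_048, 8_192, 32_768, 131_072):
            # One (seq_len x seq_len) score matrix per head, in fp16.
            matrix_bytes = NUM_HEADS * seq_len * seq_len * BYTES_FP16
            print(f"{seq_len:>7} tokens -> {matrix_bytes / 2**30:8.1f} GiB of attention scores")

    Each 4x increase in length multiplies the score-matrix memory by 16, which is why exact-attention implementations like FlashAttention compute these scores in tiles rather than materializing the full matrices.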

    For fine-tuning, the context window has practical implications for training data preparation. Every training example must fit within the model's context window — if an example exceeds the limit, it will be truncated, potentially losing critical information. The training context length can sometimes be shorter than the model's maximum to save memory (e.g., training at 2,048 tokens on a model that supports 8,192), with the trade-off that the fine-tuned model may perform less reliably at lengths beyond what it was trained on.
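
    A simple pre-flight check can catch oversized examples before training starts. The sketch below is illustrative: the examples list is a placeholder for your dataset, and the gpt2 tokenizer and 2,048-token training length are assumptions to swap for your own model's values:

        # Pre-flight length check for a fine-tuning dataset. The examples
        # list is a placeholder for real training records; the gpt2 tokenizer
        # and 2,048-token training length are assumed values to substitute.
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        train_max_len = 2048  # training length, possibly below the model maximum

        examples = [
            {"text": "..."},  # your training records would go here
        ]

        too_long = [
            i for i, ex in enumerate(examples)
            if len(tokenizer.encode(ex["text"])) > train_max_len
        ]
        print(f"{len(too_long)} of {len(examples)} examples would be truncated")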

    Why It Matters

    The context window determines what tasks a model can realistically perform. Summarizing a 50-page document requires a context window large enough to hold the entire document plus the summary. Multi-turn conversational assistants need a window large enough to retain the full conversation history. For enterprise applications involving long documents — legal contracts, medical records, codebases — context window length is often a deciding factor in model selection. Context window limits also affect fine-tuning economics: longer training examples consume more GPU memory per batch, increasing training cost.

    How It Works

    When text is submitted to a model, the tokenizer converts it to a sequence of token IDs. If this sequence exceeds the context window, it must be truncated or the request will fail. During processing, the attention mechanism computes relationships between all tokens within the window — each token can attend to every preceding token (in causal models). Positional embeddings encode the position of each token so the model can understand word order. At inference time, the key-value cache (KV cache) stores attention states for previously generated tokens to avoid redundant computation, but this cache grows linearly with context length and can dominate GPU memory usage for long sequences.
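
    Because the KV cache stores a key and a value vector per layer, per head, per token, its size grows linearly with context length. The estimate below uses dimensions loosely shaped like a 7B-parameter model; the exact numbers are assumptions, not the specification of any real architecture:

        # Rough KV-cache size estimate showing the linear growth described
        # above. The layer/head/dimension numbers are assumptions loosely
        # shaped like a 7B-class model, not any real architecture's spec.
        NUM_LAYERS = 32
        NUM_KV_HEADS = 32
        HEAD_DIM = 128
        BYTES_FP16 = 2

        def kv_cache_bytes(seq_len: int, batch_size: int = 1) -> int:
            # 2x for keys and values, stored per layer, per head, per token.
            return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * seq_len * batch_size * BYTES_FP16

        for seq_len in (4_096, 32_768, 131_072):
            print(f"{seq_len:>7} tokens -> {kv_cache_bytes(seq_len) / 2**30:6.2f} GiB of KV cache")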

    Example Use Case

    A legal technology company building a contract review assistant discovers that their average contract is 12,000 tokens long. They select a base model with a 32K context window to comfortably accommodate the full contract plus a system prompt and generated analysis. During fine-tuning, they set the maximum sequence length to 16,384 tokens (the longest example in their dataset plus a safety margin) to balance training memory usage against coverage. In production, the model processes entire contracts in a single pass without losing critical clauses to truncation.
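
    The length-selection step in this scenario reduces to a few lines of arithmetic. In the sketch below, the per-example token counts are fabricated placeholders standing in for a real tokenized dataset, and the 5% margin and rounding convention are assumptions rather than requirements:

        # Illustrative length-selection arithmetic. The per-example token
        # counts are fabricated placeholders standing in for a tokenized
        # dataset; the 5% margin and rounding convention are assumptions.
        import math

        example_lengths = [11_900, 12_400, 9_800, 15_400]  # tokens per example

        longest = max(example_lengths)
        with_margin = int(longest * 1.05)  # ~5% safety margin
        # Round up to the next multiple of 1,024, a common sequence-length convention.
        max_seq_len = math.ceil(with_margin / 1024) * 1024

        print(f"longest={longest}, chosen max_seq_len={max_seq_len}")  # -> 16384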

    Key Takeaways

    • The context window caps the total tokens (input + output) a model can process at once.
    • Modern models range from 4K to 1M tokens, with 8K–128K being most common.
    • Attention memory scales quadratically with sequence length, making long contexts expensive.
    • Training examples must fit within the context window or they will be truncated.
    • Context window length is a critical factor in model selection for document-heavy applications.

    How Ertas Helps

    Ertas Studio displays the context window of each base model in its model catalog, helping users select the right model for their use case. The platform's data validation step flags training examples that exceed the configured context length, preventing silent truncation. Users can configure the training sequence length in Studio's hyperparameter panel, and Ertas provides guidance on balancing context length against GPU memory to optimize training efficiency.
