What is Attention?
A mechanism in transformer models that allows each token to dynamically weigh and focus on the most relevant parts of the input sequence when computing its representation.
Definition
Attention is the core computational mechanism inside transformer-based language models. It enables the model to build context-aware representations by allowing each token in a sequence to "look at" and gather information from every other token. Rather than processing text left-to-right with a fixed-size memory (as RNNs do), attention computes a direct, weighted connection between every pair of tokens in the input, enabling the model to capture long-range dependencies and nuanced relationships.
The standard attention mechanism, called scaled dot-product attention, works by projecting each token's embedding into three vectors: a query (Q), a key (K), and a value (V). Attention scores are computed as the dot product of each query with all keys, divided by the square root of the key dimension, and then passed through a softmax to produce a probability distribution. This distribution determines how much each token's value contributes to the output representation of the query token. High attention scores mean two tokens are highly relevant to each other in context.
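As a concrete illustration, here is a minimal single-head sketch in NumPy; the function name, variable names, and shapes are illustrative, not taken from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns one output row per query token."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of every key to every query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                              # weighted sum of value vectors
```

Each row of the weight matrix is the probability distribution described above, and the output for a token is the value vectors averaged according to that distribution.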
Modern transformers use multi-head attention, which runs several independent attention computations ("heads") in parallel, each operating on a different learned subspace of the embedding. This allows the model to simultaneously attend to different types of relationships — for example, one head might capture syntactic dependencies while another captures semantic similarity. The outputs of all heads are concatenated and projected back to the model dimension. Variants like grouped-query attention (GQA) and multi-query attention (MQA) reduce the memory cost of attention by sharing key and value projections across heads.
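The sketch below is a minimal NumPy illustration of the head bookkeeping, using made-up sizes: the embedding is split into per-head subspaces, and under GQA a smaller set of key/value heads is repeated so that groups of query heads share them.

```python
import numpy as np

# Illustrative sizes only; real models are much larger.
d_model, n_heads, n_kv_heads, seq_len = 64, 8, 2, 5
d_head = d_model // n_heads                    # each head operates in a smaller subspace

rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
W_q = rng.standard_normal((d_model, n_heads * d_head))     # full set of query heads
W_k = rng.standard_normal((d_model, n_kv_heads * d_head))  # GQA: fewer key heads (V is analogous)

q = (x @ W_q).reshape(seq_len, n_heads, d_head)    # (5, 8, 8): 8 query heads
k = (x @ W_k).reshape(seq_len, n_kv_heads, d_head) # (5, 2, 8): 2 shared key heads
k = np.repeat(k, n_heads // n_kv_heads, axis=1)    # each key head now serves 4 query heads
```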
Why It Matters
Attention is what gives language models their remarkable ability to understand context, resolve ambiguity, and maintain coherence across long passages. Without attention, a model processing "The bank was flooded after the river overflowed" would not know which meaning of "bank" to use. Attention is also where fine-tuning has the most impact — LoRA adapters are typically applied to the attention projection matrices (Q, K, V, and output) because these are the components most responsible for how the model interprets and relates information. Understanding attention helps practitioners diagnose model behavior and make informed fine-tuning decisions.
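As a hedged, non-Ertas-specific illustration, a typical LoRA setup with the Hugging Face peft library targets exactly these projections. The module names q_proj, k_proj, v_proj, and o_proj are common in LLaMA-style models but vary by architecture, and the model identifier below is a placeholder.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Module names follow LLaMA-style conventions; check your model's architecture,
# since other models name their attention projections differently.
config = LoraConfig(
    r=8,                      # rank of the low-rank update matrices
    lora_alpha=16,            # scaling factor for the update
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("model-name-here")  # placeholder model id
model = get_peft_model(model, config)  # adapters are attached only to the targeted layers
```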
How It Works
For each token in the input sequence, the attention layer computes Q = W_q × x, K = W_k × x, and V = W_v × x, where x is the token's embedding and W_q, W_k, W_v are learned weight matrices. The attention score between token i and token j is computed as (Q_i · K_j) / √d_k, where d_k is the key dimension. These scores are passed through softmax to produce attention weights that sum to 1. The output for token i is the weighted sum of all V vectors according to these weights. In causal (decoder-only) models, a mask prevents tokens from attending to future positions, maintaining the autoregressive property. The entire computation is performed in parallel across all tokens and all heads using efficient matrix multiplication on GPUs.
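The sketch below (NumPy, single head, illustrative weight matrices) puts these steps together and adds the causal mask, which sets scores for future positions to negative infinity before the softmax so their weights become zero:

```python
import numpy as np

def causal_self_attention(x, W_q, W_k, W_v):
    """Single-head causal self-attention; x has shape (seq_len, d_model)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v                    # learned projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])                # (Q_i · K_j) / sqrt(d_k)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True where j is a future position
    scores = np.where(mask, -np.inf, scores)               # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over visible positions only
    return weights @ V                                     # outputs for all tokens in parallel
```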
Example Use Case
A developer analyzing why their fine-tuned model misinterprets certain queries uses attention visualization to inspect which tokens the model focuses on. They discover that the model attends heavily to generic stop words rather than domain-specific terminology in the prompt. After adding more diverse training examples that emphasize domain terms, the attention patterns shift appropriately, and the model's accuracy on those query types improves by 23%.
Key Takeaways
- Attention lets each token dynamically focus on the most relevant parts of the input sequence.
- Scaled dot-product attention uses query, key, and value projections to compute weighted relationships.
- Multi-head attention captures diverse relationship types in parallel across different subspaces.
- LoRA fine-tuning targets attention projection matrices because they most influence model behavior.
- Causal masking in decoder-only models prevents tokens from attending to future positions.
How Ertas Helps
When users fine-tune models in Ertas Studio using LoRA, the adapters are applied primarily to the attention layers — specifically the query, key, value, and output projection matrices. Ertas's default configuration targets these layers because research shows they provide the best return on trainable parameters. Advanced users can customize which attention layers to target through Studio's configuration panel, giving fine-grained control over the fine-tuning process.
Related Resources
Adapter
Context Window
Embedding
LoRA
Transformer
Getting Started with Ertas: Fine-Tune and Deploy Custom AI Models
Introducing Ertas Studio: A Visual Canvas for Fine-Tuning AI Models
Hugging Face