What is a Transformer?
The neural network architecture that underlies virtually all modern large language models, using self-attention mechanisms to process sequences in parallel.
Definition
The transformer is a neural network architecture introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. It replaced the recurrent neural networks (RNNs) and LSTMs that previously dominated natural language processing with a fully attention-based design that processes all tokens in a sequence simultaneously rather than sequentially. This parallelism enabled transformers to scale to vastly larger datasets and model sizes, directly leading to the large language model revolution.
A transformer consists of stacked layers, each containing two main sub-components: a multi-head self-attention mechanism and a position-wise feed-forward network. The self-attention mechanism allows every token in the input sequence to attend to every other token, computing weighted relevance scores that determine how much each token influences the representation of every other token. The feed-forward network then applies non-linear transformations to each token's representation independently. Layer normalization and residual connections stabilize training across many stacked layers.
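To make the layer structure concrete, here is a minimal PyTorch sketch of a single transformer layer following the post-norm ordering described above; the dimensions, head count, and GELU activation are illustrative choices rather than any particular model's configuration.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One transformer layer: multi-head self-attention plus a position-wise
    feed-forward network, each wrapped in a residual connection and layer norm."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every token attends to every other token in the sequence.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)          # residual connection + layer norm
        # Feed-forward applied to each token's representation independently.
        x = self.norm2(x + self.ffn(x))       # residual connection + layer norm
        return x

# Toy usage: a batch of 2 sequences of 16 tokens with 512-dimensional embeddings.
layer = TransformerLayer()
out = layer(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```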
Modern LLMs like GPT, Llama, Mistral, and Phi are decoder-only transformers — they are trained autoregressively to predict the next token given all previous tokens. Encoder-only transformers (like BERT) and encoder-decoder transformers (like T5) are used for other tasks like classification and translation. The decoder-only variant has proven most effective for generative tasks, which is why it dominates the current LLM landscape.
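A decoder-only model enforces this autoregressive constraint with a causal mask inside self-attention, so each position can attend only to itself and earlier positions. A small sketch of such a mask (the convention that True marks blocked positions is an assumption of this example):

```python
import torch

seq_len = 5
# Causal mask: position i may attend to positions 0..i, never to future tokens.
# True marks positions that are masked out (set to -inf before the softmax).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])
```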
Why It Matters
The transformer architecture is the foundation upon which the entire modern AI ecosystem is built. Understanding transformers is essential for making informed decisions about model selection, fine-tuning strategy, and deployment. Key architectural choices — such as the number of layers, the hidden dimension, the number of attention heads, and the context window length — directly determine a model's capabilities, memory requirements, and inference speed. When practitioners discuss model size (7B, 13B, 70B parameters), they are describing the scale of a transformer's weight matrices.
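As a rough illustration of how those choices translate into parameter counts, the back-of-the-envelope sketch below estimates the size of a decoder-only transformer from its depth and width. It ignores biases, normalization weights, and architecture-specific details such as gated feed-forward layers, and the example configuration is an assumed, Llama-like layout rather than any model's official specification.

```python
def estimate_params(n_layers: int, d_model: int, d_ff: int, vocab_size: int) -> int:
    """Rough parameter count for a decoder-only transformer (biases and norms ignored)."""
    attention = 4 * d_model * d_model      # Q, K, V and output projection matrices
    feed_forward = 2 * d_model * d_ff      # up-projection and down-projection
    per_layer = attention + feed_forward
    embeddings = vocab_size * d_model      # token embedding matrix
    return n_layers * per_layer + embeddings

# Illustrative (assumed) configuration in the ballpark of a 7B-class model.
print(estimate_params(n_layers=32, d_model=4096, d_ff=11008, vocab_size=32000))
# ~5.2 billion by this simplified count; gated feed-forward weights, norms, and
# an output head push real totals for such models toward 7B.
```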
How It Works
Input text is first tokenized and converted to embeddings. Positional encodings (or rotary positional embeddings in modern models like Llama) are added so the model can distinguish token order. The embeddings then pass through N identical transformer layers. In each layer, the self-attention mechanism computes query, key, and value projections for each token, calculates attention scores as the scaled dot product of queries and keys, applies softmax normalization, and produces a weighted sum of values. Multiple attention heads operate in parallel on different subspaces of the embedding, capturing different types of relationships. The attention output is combined with a residual connection, normalized, and passed through a feed-forward network before the next layer.
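The core attention computation is compact enough to write out in full. Below is a minimal single-head NumPy sketch of the scaled dot-product attention just described; multi-head attention repeats this over several smaller subspaces and concatenates the results. The shapes and random inputs are placeholders.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # relevance of every token to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the key dimension
    return weights @ V                                 # weighted sum of value vectors

# Toy example: 4 tokens with an 8-dimensional head (shapes are placeholders).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (4, 8)
```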
Example Use Case
A research team building a domain-specific assistant needs to choose between a 7B and a 13B transformer model. They analyze the architectural differences: the 13B model has more layers and wider hidden dimensions, giving it greater capacity to represent complex patterns. However, it also requires roughly twice the VRAM for inference. After benchmarking both on their domain tasks, they find the 13B model scores 8% higher on their evaluation suite — a meaningful improvement for their accuracy-critical medical application that justifies the additional infrastructure cost.
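The memory side of that trade-off can be sanity-checked with simple arithmetic. The sketch below estimates the VRAM needed just to hold the weights at 16-bit precision; it ignores the KV cache and activation memory, so real requirements are somewhat higher.

```python
def weight_vram_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM needed just to hold the weights (fp16/bf16 = 2 bytes per parameter)."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

for size in (7, 13):
    print(f"{size}B model: ~{weight_vram_gb(size):.0f} GB of weights in fp16")
# 7B model: ~13 GB of weights in fp16
# 13B model: ~24 GB of weights in fp16
```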
Key Takeaways
- Transformers use self-attention to process all tokens in parallel, enabling massive scaling.
- Modern LLMs (GPT, Llama, Mistral) are decoder-only transformers trained for next-token prediction.
- Each transformer layer contains multi-head self-attention and a feed-forward network.
- Model size (parameter count) is determined primarily by the transformer's depth (number of layers), its hidden and feed-forward widths, and its vocabulary size.
- The transformer architecture is the foundation for virtually all current large language models.
How Ertas Helps
Every model fine-tuned through Ertas Studio is built on the transformer architecture. Ertas abstracts away the architectural complexity, allowing users to select a model by name and size without needing to configure transformer-specific parameters. Under the hood, Ertas's training pipeline applies LoRA adapters to the transformer's attention layers — the components that benefit most from task-specific adaptation — ensuring efficient and effective fine-tuning for any domain.
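Ertas's internal training configuration is not spelled out here, but for readers curious what attaching LoRA adapters to attention projections looks like in general, the following is a generic sketch using the Hugging Face PEFT library; the model name, target module names, and rank are illustrative assumptions, not Ertas's actual settings.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a decoder-only transformer (the model name is a placeholder).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach low-rank adapters to the attention projections only; the rest of the
# transformer's weights stay frozen during fine-tuning.
config = LoraConfig(
    r=16,                                   # adapter rank (assumed value)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],    # attention query/value projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```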
Related Resources
Attention
Base Model
Context Window
Embedding
Tokenizer
Getting Started with Ertas: Fine-Tune and Deploy Custom AI Models
Introducing Ertas Studio: A Visual Canvas for Fine-Tuning AI Models
Hugging Face
llama.cpp
Ollama
Ertas for Healthcare
Ertas for SaaS Product Teams