What is a Token?
The fundamental unit of text that a language model processes — typically a word, subword, or character that maps to an integer ID in the model's vocabulary.
Definition
A token is the atomic unit of text processing in a language model. Before a language model can process text, the raw string must be converted into a sequence of tokens: discrete units drawn from a fixed vocabulary. Modern LLMs typically use subword tokenization schemes such as Byte Pair Encoding (BPE) or the unigram model (as implemented in libraries like SentencePiece), which split text into a mixture of whole words, subwords, and individual characters depending on frequency. Common words like 'the' map to a single token, while rare or technical terms are split into multiple subword tokens.
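As a quick illustration, the sketch below uses the open-source tiktoken library (an assumption on our part; any BPE tokenizer would show the same effect) to demonstrate how common words stay whole while rarer ones split into subword pieces:

```python
import tiktoken  # open-source tokenizer library; pip install tiktoken

# cl100k_base is one of tiktoken's built-in BPE encodings.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["the", "tokenization", "pneumonoultramicroscopic"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")

# Typically, a common word like 'the' stays a single token,
# while rare words split into several subword pieces.
```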
Tokenization is a critical but often overlooked stage of the LLM pipeline. The tokenizer's vocabulary size (typically 32,000 to 128,000 entries) determines the granularity of the model's text representation. Larger vocabularies represent text more efficiently (fewer tokens for the same passage) but increase the size of the model's embedding and output layers. Different models use different tokenizers, so the same text produces different token sequences for different models, a fact that matters when comparing context lengths or calculating costs.
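The following sketch (again assuming tiktoken is installed) encodes the same sentence with two built-in encodings of different vocabulary sizes. The exact counts will vary with the text, but the larger vocabulary usually needs fewer tokens:

```python
import tiktoken

text = "Large language models process text as sequences of integer token IDs."

# Two built-in encodings with different vocabulary sizes
# (~50K entries for r50k_base vs ~100K for cl100k_base).
for name in ["r50k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```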
Tokens directly determine LLM economics. API providers charge by the token (both input and output), so the tokenization efficiency of a model affects per-query cost. A tokenizer that requires 1.3 tokens per word is 30% more expensive per word than one that requires 1.0 tokens per word. Context window limits are also measured in tokens, and the number of tokens consumed by a prompt determines how much space remains for retrieved context or generated output.
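A back-of-envelope calculation makes this concrete; the per-token price below is a hypothetical placeholder, not any provider's actual rate:

```python
# Hypothetical rate for illustration only; real prices vary by provider.
PRICE_PER_1K_TOKENS = 0.01  # dollars

def cost_per_1k_words(tokens_per_word: float) -> float:
    """Approximate input cost for 1,000 words at the rate above."""
    tokens = tokens_per_word * 1_000
    return tokens / 1_000 * PRICE_PER_1K_TOKENS

print(cost_per_1k_words(1.0))  # 0.010
print(cost_per_1k_words(1.3))  # 0.013 -> 30% more per word
```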
Why It Matters
Understanding tokens is essential for working effectively with LLMs. Context window limits, training data sizing, inference costs, and generation speeds are all measured in tokens. A practitioner who thinks in words rather than tokens will systematically miscalculate costs, overestimate available context, and misjudge training data requirements.
Tokenization also affects model behavior in subtle ways. Languages with non-Latin scripts often tokenize less efficiently, requiring more tokens per semantic unit, which reduces effective context length and increases costs for multilingual applications. Code can be particularly inefficient to tokenize, with long variable names, indentation, and syntax consuming many tokens. These inefficiencies directly affect the feasibility of certain use cases.
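The sketch below (assuming tiktoken) compares token counts for an English sentence and a roughly equivalent Japanese one; on many tokenizers the non-Latin text consumes noticeably more tokens relative to its length:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Roughly equivalent sentences in two scripts.
samples = {
    "English": "Hello, how are you today?",
    "Japanese": "こんにちは、今日はお元気ですか？",
}
for lang, text in samples.items():
    print(f"{lang}: {len(text)} chars -> {len(enc.encode(text))} tokens")
```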
How It Works
Most modern LLMs use Byte Pair Encoding (BPE) or a variant. BPE starts with a base vocabulary of individual bytes (256 entries) and iteratively merges the most frequent adjacent pair of tokens in a training corpus into a new token. This process repeats for a fixed number of merge operations, building up a vocabulary of increasingly common subword units. At inference time, the tokenizer applies these learned merges to split input text into tokens.
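To make the merge loop concrete, here is a minimal character-level BPE trainer. Real implementations start from raw bytes and add pre-tokenization rules, but the core logic is the same:

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn merges by repeatedly fusing the most frequent adjacent pair."""
    tokens = list(corpus)  # character-level start (real BPE uses raw bytes)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the token stream with the new merged token.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merges

print(train_bpe("low lower lowest low low", num_merges=5))
```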
Each token maps to an integer ID in the vocabulary. The model's embedding layer converts each token ID into a dense vector representation, which flows through the transformer layers. At the output, the model produces a probability distribution over the entire vocabulary for the next token, and the token with the highest probability (or a sampled token, depending on the generation strategy) is selected and decoded back to text.
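The toy sketch below shows the shape of this flow with random NumPy weights, with the transformer layers replaced by a pass-through: an embedding lookup, a projection to vocabulary logits, a softmax, and a greedy pick. It is a sketch of the data flow, not a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, DIM = 1_000, 64  # toy sizes; real models are far larger

embedding = rng.normal(size=(VOCAB_SIZE, DIM))    # token ID -> vector
output_proj = rng.normal(size=(DIM, VOCAB_SIZE))  # hidden state -> logits

token_ids = [17, 42, 256]       # an example input sequence
vectors = embedding[token_ids]  # embedding lookup
hidden = vectors[-1]            # stand-in for the transformer layers

logits = hidden @ output_proj
probs = np.exp(logits - logits.max())
probs /= probs.sum()            # softmax over the whole vocabulary

next_id = int(probs.argmax())   # greedy decoding: pick the top token
print(next_id, float(probs[next_id]))
```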
Example Use Case
A team building a code assistant notices that their model's 4096-token context window can only fit about 200 lines of Python code, because the tokenizer splits many code elements (variable names, indentation, operators) into multiple tokens. They switch to a model with a code-optimized tokenizer that achieves 1.5x better tokenization efficiency for Python, effectively expanding their usable context to 300 lines — enough to include the surrounding functions and classes needed for accurate code completion.
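The arithmetic behind this scenario, with all figures taken from the example above:

```python
# All figures come from the scenario above; they are illustrative.
context_tokens = 4096
lines_that_fit = 200
tokens_per_line = context_tokens / lines_that_fit  # ~20.5 tokens/line

efficiency_gain = 1.5  # code-optimized tokenizer
print(int(lines_that_fit * efficiency_gain))  # ~300 usable lines
```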
Key Takeaways
- Tokens are the fundamental processing units of LLMs, typically subwords or whole words from a fixed vocabulary.
- Tokenization efficiency varies by language, domain, and tokenizer algorithm, affecting costs and context usage.
- Context windows, costs, and generation speeds are all measured in tokens, not words.
- Most modern models use BPE or SentencePiece tokenization with vocabularies of 32K-128K tokens.
- Different models have different tokenizers, making token counts non-comparable across models.
How Ertas Helps
Ertas Studio displays token counts for training datasets and validates that training examples fit within the model's context window. Ertas Data Suite uses tokenizer-aware processing to ensure data is properly segmented for the target model.