What is a Tokenizer?
The component that converts raw text into a sequence of numerical tokens that a language model can process, and vice versa.
Definition
A tokenizer is the preprocessing layer that bridges human-readable text and the numerical representations that neural networks operate on. It splits input text into tokens — which may be whole words, subwords, or individual characters — and maps each token to a unique integer ID from a fixed vocabulary. The model processes these integer sequences through its layers and produces output token IDs, which the tokenizer then decodes back into text.
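A minimal sketch of this round trip, using the Hugging Face transformers library; the GPT-2 tokenizer is chosen here purely for illustration, and any model's tokenizer exposes the same interface:

```python
from transformers import AutoTokenizer

# Load a tokenizer; "gpt2" is an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenizers bridge text and numbers."
ids = tokenizer.encode(text)      # text -> list of integer token IDs
decoded = tokenizer.decode(ids)   # integer IDs -> text
print(ids)
print(decoded)  # round-trips back to the original text
```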
Modern LLMs predominantly use subword tokenization schemes such as Byte-Pair Encoding (BPE), WordPiece, or the unigram model (often implemented via the SentencePiece library). These algorithms learn a vocabulary of commonly occurring character sequences from the training corpus. Common words like "the" get a single token, while rare words are decomposed into multiple subword tokens. For example, "unhappiness" might be tokenized as ["un", "happiness"] or ["un", "happ", "iness"] depending on the tokenizer. This approach balances vocabulary size (typically 32,000–128,000 tokens) against the ability to represent any input text without out-of-vocabulary errors.
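A quick way to see this in practice is the tokenize method, which returns the subword pieces before they are mapped to IDs; the exact splits depend on the tokenizer's learned vocabulary (GPT-2 is again an illustrative choice):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative choice

for word in ["the", "unhappiness", "acetaminophen"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word!r} -> {pieces} ({len(pieces)} tokens)")
# Common words tend to map to a single token; rarer words split
# into several subword pieces. Exact splits vary across tokenizers.
```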
The tokenizer is tightly coupled to its model — each model family (Llama, Mistral, GPT, etc.) has its own tokenizer with its own vocabulary. Using the wrong tokenizer with a model produces garbage outputs because the token ID mappings will not match. When fine-tuning, the tokenizer from the base model must be preserved exactly. The tokenizer also determines the model's context window utilization: more efficient tokenization means more text fits within the same context window length.
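One way to see the coupling is to encode the same text with two unrelated tokenizers; a sketch, using two tokenizers that are publicly available on the Hugging Face Hub:

```python
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "fine-tuning"
print(gpt2_tok.encode(text, add_special_tokens=False))
print(bert_tok.encode(text, add_special_tokens=False))
# The ID sequences differ: feeding one model IDs produced by the
# other tokenizer would index the wrong vocabulary entries.
```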
Why It Matters
Tokenization directly affects model performance, cost, and capability. A tokenizer that fragments common domain-specific terms into many subwords forces the model to "waste" context window capacity and may reduce comprehension. Token count also drives API costs for cloud-hosted models (charged per token) and determines whether a given input fits within the model's context window. Understanding tokenization helps practitioners estimate costs, debug unexpected model behavior, and make informed decisions about data formatting and prompt design.
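As a sketch of estimating cost from token counts (the per-token rate below is a hypothetical placeholder, not real pricing):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative choice

PRICE_PER_1K_INPUT_TOKENS = 0.0005  # hypothetical rate in USD

prompt = "Summarize the following clinical note for a discharge report."
n_tokens = len(tokenizer.encode(prompt))
cost = n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
print(f"{n_tokens} tokens -> ~${cost:.6f} per call")
```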
How It Works
The tokenization pipeline typically works in stages: first, the raw text is normalized (lowercasing, Unicode normalization, etc., depending on the tokenizer). Then, a pre-tokenization step splits the text into rough chunks (usually by whitespace and punctuation). Finally, the subword algorithm (e.g., BPE) applies learned merge rules to break each chunk into tokens from the vocabulary. Each token is mapped to its integer ID, and special tokens (like beginning-of-sequence, end-of-sequence, or padding tokens) are added as needed. The reverse process (decoding) maps IDs back to their text representations and joins them into a readable string.
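A toy sketch of the final stage, applying BPE merge rules in learned priority order. The merge table below is invented so that "unhappiness" reproduces the split from the Definition section; real implementations are more careful about bytes, Unicode, and merge selection:

```python
# Invented merge rules, ordered by priority (earlier = learned first).
MERGES = [
    ("u", "n"), ("i", "n"), ("in", "e"), ("s", "s"),
    ("ine", "ss"), ("h", "a"), ("ha", "p"), ("hap", "p"),
]

def bpe_encode(word: str) -> list[str]:
    symbols = list(word)  # start from individual characters
    for left, right in MERGES:  # apply each rule everywhere it matches
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # merge the adjacent pair
            else:
                i += 1
    return symbols

print(bpe_encode("unhappiness"))  # ['un', 'happ', 'iness']
```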
Example Use Case
A team preparing training data for a medical fine-tuning project discovers that the Llama tokenizer splits "acetaminophen" into 4 subword tokens and "ibuprofen" into 3. This means medical text consumes more tokens per word than general English text, reducing the effective context window for their use case. They factor this into their prompt design, keeping system prompts concise to maximize the context available for clinical content. They also use token counting in their data pipeline to ensure no training example exceeds the model's context length.
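A sketch of the token-length check such a pipeline might run; the tokenizer choice, context limit, and examples below are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # use the base model's own tokenizer
MAX_CONTEXT = 1024  # set to the target model's context length

examples = [
    "Patient reports relief after 400 mg ibuprofen.",
    "Acetaminophen 500 mg administered at 08:00; no adverse events.",
]

for ex in examples:
    n = len(tokenizer.encode(ex))
    status = "ok" if n <= MAX_CONTEXT else "too long"
    print(f"{n:5d} tokens  [{status}]  {ex[:40]}")
```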
Key Takeaways
- Tokenizers convert text to numerical IDs using learned subword vocabularies (BPE, WordPiece, SentencePiece).
- Each model family has its own tokenizer — using the wrong one produces garbage outputs.
- Token count determines context window usage, API costs, and effective input length.
- Domain-specific text may tokenize less efficiently, consuming more tokens per word.
- The tokenizer must be preserved exactly when fine-tuning a base model.
How Ertas Helps
Ertas Studio automatically loads the correct tokenizer for whichever base model the user selects, eliminating a common source of fine-tuning errors. The platform's data preview feature shows token counts per example, helping users identify overly long examples that might exceed the model's context window. During training, Ertas handles all tokenization, padding, and special token insertion transparently, so users can focus on their data quality rather than low-level preprocessing details.
Related Resources
Base Model
Chat Template
Context Window
Embedding
Training Data
Getting Started with Ertas: Fine-Tune and Deploy Custom AI Models
Privacy-Conscious AI Development: Fine-Tune in the Cloud, Run on Your Terms
Hugging Face