What is an Adapter?
A small set of trainable parameters inserted into a frozen pre-trained model, enabling efficient fine-tuning without modifying the original model weights.
Definition
In the context of large language models, an adapter is a lightweight module — typically a pair of small matrices forming a low-rank decomposition, or a small bottleneck feed-forward layer — that is injected into specific layers of a pre-trained model. During fine-tuning, the original (base) model weights are frozen, and only the adapter parameters are updated through backpropagation. This approach, known as parameter-efficient fine-tuning (PEFT), reduces the number of trainable parameters from billions to a few million, cutting memory requirements and training time by an order of magnitude or more.
The most widely used adapter architecture is LoRA (Low-Rank Adaptation), which decomposes weight updates into two small matrices whose product approximates the full-rank update. Other adapter designs include prefix tuning (prepending learned vectors to attention keys and values), prompt tuning (learning soft prompt embeddings), and bottleneck adapters (inserting small feed-forward layers within each transformer layer, after the attention and feed-forward sublayers). Each approach offers a different trade-off between expressiveness, memory efficiency, and ease of merging with the base model.
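The low-rank idea behind LoRA can be sketched in a few lines of NumPy. This is an illustration of the math, not any particular library's API; the dimensions, rank, and scaling below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 512, 8, 16             # hidden size, low rank (r << d), LoRA scaling

W = rng.normal(size=(d, d))          # frozen base weight (never updated)
B = rng.normal(size=(d, r)) * 0.01   # trainable up-projection
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection

delta_W = (alpha / r) * (B @ A)      # the low-rank weight update
assert np.linalg.matrix_rank(delta_W) <= r  # rank is capped at r by construction

def adapted_forward(x):
    # Base path plus adapter path; only B and A would receive gradients.
    return W @ x + delta_W @ x
```

Because `B @ A` has rank at most `r`, the adapter stores `r * 2d` numbers instead of the `d * d` a full-rank update would need.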
A key advantage of adapters is composability. Because adapters are separate from the base model, an organization can maintain a single base model and swap different adapters in and out at inference time for different tasks — customer support, content generation, code review — without duplicating the full model weights for each use case. Adapters can also be shared, merged, or stacked, enabling a modular ecosystem of specialized capabilities built on a common foundation.
Why It Matters
Full fine-tuning of a large model requires storing and updating every parameter, which for a 70B model means hundreds of gigabytes of GPU memory and a separate full-size copy of the model for every task. Adapters solve this scaling problem elegantly: the base model is loaded once, and each task only requires a small adapter file (often just 10–100 MB). This makes multi-task deployment economically viable and allows rapid experimentation with different fine-tuning strategies. For teams working with limited hardware budgets, adapters are often the difference between being able to fine-tune at all and being priced out.
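To make the savings concrete, here is a back-of-the-envelope count for a single 4096×4096 projection matrix — a size typical of models around the 7B scale; the figures are illustrative:

```python
d_in = d_out = 4096   # a typical attention projection size (assumed figure)
r = 8                 # a commonly used LoRA rank

full_params = d_in * d_out          # trainable params for a full-rank update
lora_params = r * (d_in + d_out)    # trainable params for the LoRA pair

print(full_params)                  # → 16777216
print(lora_params)                  # → 65536
print(full_params // lora_params)   # → 256 (256x fewer per matrix)
```

Multiplied across every targeted matrix in every layer, this ratio is what keeps adapter files in the tens of megabytes rather than the tens of gigabytes.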
How It Works
During adapter-based fine-tuning, the training framework identifies target layers in the model (typically the query, key, value, and output projection matrices in each attention block) and injects adapter modules alongside or within them. For LoRA, two small matrices A and B are initialized such that their product B×A starts at zero — in practice, B is initialized to zero and A to small random values — so the adapter initially has no effect on the model's output. As training progresses, gradients update A and B to capture task-specific knowledge. After training, the adapter weights can either be kept separate (for hot-swapping at inference) or merged into the base model weights for a single-file deployment.
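The zero-initialization and merging steps can be sketched in NumPy (illustrative dimensions; the "training" step is simulated by assigning B random values):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 256, 4, 8

W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d)) * 0.02   # random init, as in LoRA
B = np.zeros((d, r))                 # zero init, so B @ A starts at zero

x = rng.normal(size=d)

# Before any training, the adapter path contributes nothing:
assert np.allclose(W @ x + (alpha / r) * (B @ (A @ x)), W @ x)

# After (simulated) training, B is no longer zero...
B = rng.normal(size=(d, r)) * 0.02

# ...and the adapter can be merged into W for single-file deployment.
# Merged and unmerged paths produce identical outputs:
W_merged = W + (alpha / r) * (B @ A)
assert np.allclose(W_merged @ x, W @ x + (alpha / r) * (B @ (A @ x)))
```

The final assertion is why merging is lossless: adding B×A into W gives exactly the same function as running the two paths separately.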
Example Use Case
A customer support platform maintains one Mistral 7B base model and three LoRA adapters: one fine-tuned on billing inquiries, one on technical troubleshooting, and one on account management. When a support ticket arrives, the routing system classifies its topic and loads the corresponding adapter, delivering specialized responses without tripling the infrastructure cost of hosting three separate fine-tuned models.
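A routing layer like the one described might look like the following sketch. The topic names and adapter paths are hypothetical; in practice, a serving framework would load the chosen adapter onto the shared base model rather than just return a path:

```python
# Hypothetical mapping from ticket topic to per-task LoRA adapter files,
# all sharing one base model.
ADAPTERS = {
    "billing": "adapters/billing-lora",
    "troubleshooting": "adapters/troubleshooting-lora",
    "account": "adapters/account-lora",
}

def route_ticket(topic: str) -> str:
    """Return the adapter to load for a classified ticket topic."""
    if topic not in ADAPTERS:
        raise ValueError(f"no adapter for topic: {topic}")
    return ADAPTERS[topic]

print(route_ticket("billing"))  # → adapters/billing-lora
```

Because each entry is only an adapter file, adding a fourth specialization means training and shipping tens of megabytes, not another full model.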
Key Takeaways
- Adapters are small trainable modules that enable fine-tuning without modifying base model weights.
- LoRA is the most popular adapter architecture, reducing trainable parameters by 100–1000x.
- Multiple adapters can share a single base model, enabling cost-efficient multi-task deployment.
- Adapters can be merged into the base model or kept separate for hot-swapping at inference time.
- Parameter-efficient fine-tuning via adapters makes large model customization accessible on modest hardware.
How Ertas Helps
Ertas Studio uses adapter-based fine-tuning (LoRA and QLoRA) as its default training strategy. The visual canvas makes it easy to configure adapter rank, target modules, and alpha scaling without writing code. After training, Ertas gives users the option to export the adapter weights separately or merge them into the base model before exporting to GGUF — supporting both the hot-swappable and single-file deployment workflows from a unified interface.
Related Resources
Base Model
Fine-Tuning
LoRA
Model Distillation
Model Merging
QLoRA
Getting Started with Ertas: Fine-Tune and Deploy Custom AI Models
Introducing Ertas Studio: A Visual Canvas for Fine-Tuning AI Models
Model Distillation with LoRA: Training Smaller Models from Frontier Outputs
Hugging Face
llama.cpp
Ollama
Ertas for SaaS Product Teams
Ertas for Customer Support
Ertas for ML Engineers & Fine-Tuning Practitioners
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.