
    LoRA Adapters for AI Agency Owners (No ML Degree Required)

    LoRA is the technique that makes per-client AI customization economically viable for agencies. Here's how it works, explained without the machine learning jargon.

    Ertas Team

    LoRA is one of those terms you keep encountering in AI agency circles without a clean, practical explanation of what it actually is and why it matters. The academic papers are dense. The blog posts written by ML engineers assume you care about the mathematics.

    This is neither. This is the explanation for agency owners who need to understand LoRA well enough to use it profitably with clients.

    The Problem LoRA Solves

    Historically, "fine-tuning" an AI model meant retraining the entire model on new data. For a 7B-parameter model, that means updating 7 billion numbers. It requires massive compute, days of training time, and a complete copy of the model for each client. That was never practically viable for agency-scale operations.

    LoRA (Low-Rank Adaptation) is a technique that fine-tunes a model by adding a small number of new parameters rather than modifying all the existing ones. The result is nearly equivalent to full fine-tuning for most tasks — but requires 10-100x less compute and produces a tiny output file instead of a full model copy.

    The business translation: you can fine-tune a custom AI model for each client, on consumer hardware, in 1-4 hours, and the per-client customization weighs less than a PowerPoint file.

    How LoRA Works (Conceptually)

    Here is the analogy that makes it click:

    Imagine an AI model is an expert professional — let's say a highly trained writer. They have spent years developing their craft (training) and have deep general knowledge. You want to hire this writer to produce content specifically for your client — a legal tech company with a very specific voice and terminology.

    You have two options:

    Option A (Full fine-tuning): Clone the writer, have the clone spend months learning everything about legal tech and your client's voice from scratch. Now you have two full writers. Repeat for each client and you have a staff of identical writers, each trained separately. Expensive and inefficient.

    Option B (LoRA): Give the original writer a specialisation module — a set of notes, examples, and stylistic guidelines for this specific client. The writer reads the module before writing for this client, and their output reflects the specialisation without requiring them to be retrained from scratch. The module is small (a folder of notes, not years of training). You can have 50 modules for 50 clients, all sitting on top of the same expert base.

    LoRA is Option B. The "specialisation module" is the LoRA adapter.

    What an Adapter Actually Is

    Technically, a LoRA adapter is a set of small weight matrices that are added to specific layers of the base model. These matrices are trained on your client's data. During inference, the base model's weights remain unchanged — the adapter modifies the model's behavior by adding its learned adjustments.
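    If you want to see the arithmetic, here is a minimal sketch in plain Python with numpy. The dimensions are illustrative, and this is not any particular library's API:

    import numpy as np

    d = 4096      # hidden size of one attention layer (illustrative)
    r = 16        # LoRA rank
    alpha = 32    # LoRA alpha (scaling factor)

    W = np.random.randn(d, d)          # frozen base weight: d*d = ~16.8M numbers
    A = np.random.randn(r, d) * 0.01   # trainable adapter matrix: r*d numbers
    B = np.zeros((d, r))               # trainable adapter matrix: d*r numbers
                                       # (zero-init, so training starts from base behavior)

    x = np.random.randn(d)             # an input activation

    # Forward pass with the adapter: base output plus the scaled low-rank update
    y = W @ x + (alpha / r) * (B @ (A @ x))

    print(f"full matrix: {d*d:,} params, adapter pair: {2*d*r:,} params")
    # full matrix: 16,777,216 params, adapter pair: 131,072 params

    Training updates only A and B. In this example the adapter pair is under 1% of the full matrix's parameter count, which is where the small file sizes below come from.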

    The resulting adapter file is typically 10-200MB, depending on the task and configuration. For comparison:

    • A 7B base model (Q4 GGUF): ~4GB
    • A LoRA adapter for that model: ~50-200MB (about 1-5% of the model size)

    The adapter contains everything needed to reproduce the client-specific behavior. You can move it between machines, version-control it, or merge it into the base model for deployment.

    The Agency Use Case

    For an AI agency running multiple clients, the multi-adapter architecture looks like this:

    One base model + N client adapters

    Instead of storing a full 4GB model for each client, you store:

    • One 4GB base model (Mistral 7B Q4, for example)
    • Per-client adapters at 50-200MB each

    For 20 clients, this is the difference between 80GB of storage (full model copies) and 6GB (base + adapters). More importantly, the compute to train an adapter is a fraction of the compute to train a full model.

    Inference with adapters:

    When a request comes in for Client A, your inference server loads Client A's adapter on top of the base model. When a request comes in for Client B, it loads Client B's adapter. The switching is fast — adapters load in milliseconds. For most agency workloads, this architecture serves multiple clients from a single inference server without bottlenecks.
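    As a concrete sketch of that routing, here is the pattern using Hugging Face's peft library. The base model name and adapter paths are illustrative, not Ertas specifics:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    BASE = "mistralai/Mistral-7B-v0.1"   # illustrative base model

    tokenizer = AutoTokenizer.from_pretrained(BASE)
    base_model = AutoModelForCausalLM.from_pretrained(BASE)

    # Attach the first client's adapter, then register the second on the same base.
    model = PeftModel.from_pretrained(base_model, "adapters/client_a", adapter_name="client_a")
    model.load_adapter("adapters/client_b", adapter_name="client_b")

    def handle_request(client: str, prompt: str) -> str:
        model.set_adapter(client)   # in-place switch; the base weights are never reloaded
        inputs = tokenizer(prompt, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=200)
        return tokenizer.decode(output[0], skip_special_tokens=True)

    The key property: the 4GB base sits in memory once, and each set_adapter call is a cheap pointer swap rather than a model load.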

    When LoRA Works Well (and When It Doesn't)

    LoRA works very well for:

    • Style and tone training: Teaching the model to respond like a specific brand, person, or character
    • Domain terminology: Training the model to use your client's specific vocabulary, product names, and conventions
    • Task specialisation: Teaching the model to perform a specific classification, extraction, or generation task consistently
    • Instruction following: Training the model to follow specific output format requirements that prompting alone cannot enforce reliably

    LoRA works less well for:

    • Adding factual knowledge the base model was never exposed to. LoRA modifies behavior, not knowledge. If you need the model to reliably recall facts about your client's product catalog (which changes frequently), LoRA is not the right tool — RAG (Retrieval Augmented Generation) is.
    • Fundamentally changing the model's capabilities. LoRA cannot make a 7B model reason like a 70B model. It can only make the 7B model better at the specific task you train it for.

    The most powerful production setups combine LoRA fine-tuning (for behavior and style) with RAG (for current facts). Fine-tuning and RAG solve different problems and complement each other.

    Practical LoRA Settings for Agency Work

    When you run a LoRA fine-tuning job, you set several parameters. Here are the defaults that work well for most agency tasks:

    Parameter          Recommended Value    What It Means
    LoRA rank (r)      16-32                Higher = more capacity, more compute
    LoRA alpha         32-64 (2x rank)      Scales the adapter's influence
    Target modules     q_proj, v_proj       Which model layers get adapted
    Training epochs    3-5                  How many times the model sees your data
    Learning rate      1e-4 to 3e-4         Speed of adaptation
    Batch size         4-8                  Samples processed together

    For most agency tasks — support ticket classification, brand-voice generation, document summarisation — rank=16 with 3-5 epochs on 500-2,000 examples produces a strong adapter. You do not need to tune these extensively; the defaults work for the majority of cases.
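    For reference, here is what those defaults look like as an actual training configuration, sketched with Hugging Face's peft and transformers libraries. The output path and dropout value are illustrative choices, not part of the table:

    from peft import LoraConfig, get_peft_model
    from transformers import TrainingArguments

    # The table's defaults, expressed as a peft config.
    lora_config = LoraConfig(
        r=16,                                 # LoRA rank
        lora_alpha=32,                        # 2x rank
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        lora_dropout=0.05,                    # common default, not from the table
        task_type="CAUSAL_LM",
    )

    # Matching trainer settings (output path is hypothetical).
    training_args = TrainingArguments(
        output_dir="out/client_a_adapter",
        num_train_epochs=3,
        learning_rate=2e-4,                   # inside the 1e-4 to 3e-4 range
        per_device_train_batch_size=4,
    )

    # model = get_peft_model(base_model, lora_config)  # wrap a loaded base model
    # ...then train with transformers' Trainer (or TRL's SFTTrainer) as usual.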

    Deploying a LoRA Adapter

    After training, you have two deployment options:

    Option 1: Merge and export to GGUF. Combine the adapter weights with the base model, quantize to Q4_K_M, and produce a single GGUF file. This is the simplest deployment — load it in Ollama like any other model. The tradeoff is that you have a separate full model file per client.
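    If you run the merge step yourself, it looks roughly like this with the peft library. Model names and paths are illustrative, and the GGUF conversion happens afterwards with llama.cpp's tooling:

    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    BASE = "mistralai/Mistral-7B-v0.1"   # illustrative; use the base you trained against

    base_model = AutoModelForCausalLM.from_pretrained(BASE)
    model = PeftModel.from_pretrained(base_model, "adapters/client_a")  # hypothetical path

    # Fold the adapter weights into the base and save a plain HF checkpoint.
    merged = model.merge_and_unload()
    merged.save_pretrained("merged/client_a")
    AutoTokenizer.from_pretrained(BASE).save_pretrained("merged/client_a")

    # From here, llama.cpp's convert_hf_to_gguf.py turns the checkpoint into a GGUF,
    # and its llama-quantize tool produces the Q4_K_M file that Ollama can load.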

    Option 2: Run the adapter separately from the base model. Keep the base model and adapter separate. The inference server loads the base model once and applies the appropriate adapter per client. This is more memory-efficient for multi-client setups but requires an inference server that supports adapter hot-swapping (vLLM with LoRA support, for example, or certain configurations of Ollama).
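    With vLLM, per-request adapter selection looks roughly like this. The paths and IDs are hypothetical, and argument names have varied slightly across vLLM versions:

    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    llm = LLM(model="mistralai/Mistral-7B-v0.1", enable_lora=True)
    sampling = SamplingParams(temperature=0.7, max_tokens=200)

    ADAPTERS = {
        "client_a": (1, "adapters/client_a"),   # (numeric id, path), hypothetical
        "client_b": (2, "adapters/client_b"),
    }

    def handle_request(client: str, prompt: str) -> str:
        lora_id, path = ADAPTERS[client]
        # The adapter applies to this request only; the base model stays resident.
        outputs = llm.generate([prompt], sampling,
                               lora_request=LoRARequest(client, lora_id, path))
        return outputs[0].outputs[0].text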

    For most agency deployments, Option 1 (merge + GGUF) is simpler and more reliable. You trade storage space for operational simplicity. Ertas exports the merged GGUF automatically after fine-tuning.

    The Business Framing for Clients

    When you explain LoRA to clients, you do not need to use the term. The pitch is:

    "We train a custom version of the AI model on your data — your support tickets, your product documentation, your style guide. The result is a model that understands your business specifically, not just AI in general. We run it on [your infrastructure / our private server], so your data never leaves our control. And because we own the model, we can keep improving it as your business evolves."

    This is accurate, understandable, and valuable. The client does not need to know the technique is called LoRA. They need to understand that you are creating something that belongs to them, trained on their information, and maintained by you.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
