
    Multi-Client Fine-Tuning: One Base Model, Custom LoRA Adapters Per Law Firm

    How to use LoRA adapters to serve multiple law firm clients from a single base model — covering architecture, training, hot-swapping, cost efficiency, and data isolation guarantees.

    Ertas Team

    The economics of running an AI agency break down if you need a separate GPU for every client. A Llama 3.1 8B model takes 16 GB of VRAM. Five clients, five full models, five GPUs — that is $10,000-15,000 in hardware before you earn a dollar.

    LoRA (Low-Rank Adaptation) changes this equation completely. One base model stays in GPU memory. Per-client adapters — typically 50-200 MB each — are swapped in and out at inference time. One GPU serves all your clients.

    This article covers the architecture, how to train client-specific adapters, how hot-swapping works, the cost implications, and the data isolation guarantees that law firms require.

    LoRA Architecture for Multi-Client Serving

    How LoRA Works

    Standard fine-tuning modifies all the model's weights — billions of parameters. LoRA takes a different approach: it freezes the base model and trains small "adapter" matrices that modify the model's behaviour at specific layers.

    The math: instead of updating a weight matrix W (size d × k), LoRA trains two small matrices B (size d × r) and A (size r × k), where r (the "rank") is much smaller than d or k. The effective weight becomes W + BA.
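
    To make this concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer (illustrative only; in practice a library such as Hugging Face PEFT implements this for you):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base weight W plus a trainable low-rank update BA."""
        def __init__(self, d, k, r=16, alpha=32):
            super().__init__()
            self.W = nn.Parameter(torch.randn(d, k), requires_grad=False)  # frozen base weight
            self.B = nn.Parameter(torch.zeros(d, r))          # d × r, starts at zero...
            self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # r × k, ...so BA = 0 initially
            self.scale = alpha / r

        def forward(self, x):
            # Effective weight is W + scale · BA; only A and B receive gradients.
            return x @ (self.W + self.scale * (self.B @ self.A)).T

    For d = k = 4096 and r = 16, A and B together add about 131,000 parameters per layer against roughly 16.8 million in W itself, which is why a full adapter comes out at only 50-200 MB.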

    For a rank-16 LoRA on a 7B model:

    • Base model size: ~14 GB (in FP16)
    • LoRA adapter size: ~50-100 MB
    • Combined inference: Same speed as the base model (adapter matrices are merged or applied efficiently)

    Multi-Client Architecture

    GPU Memory:
    ┌────────────────────────────────┐
    │   Base Model (Llama 3.1 8B)   │ ← Loaded once, stays in memory
    │          ~16 GB                │
    ├────────────────────────────────┤
    │   Active LoRA Adapter          │ ← Swapped per request
    │   (Client-specific, ~100 MB)   │
    └────────────────────────────────┘
    
    Adapter Storage (SSD):
    ├── firm-a-contract-review.safetensors    (85 MB)
    ├── firm-b-due-diligence.safetensors      (92 MB)
    ├── firm-c-case-summary.safetensors       (78 MB)
    ├── firm-d-regulatory.safetensors         (110 MB)
    └── firm-e-intake-triage.safetensors      (65 MB)
    

    One RTX 5090 (32 GB VRAM) can hold the base model plus several adapters simultaneously, or swap adapters from SSD in milliseconds.

    Training Client-Specific Adapters

    Each law firm client gets their own adapter trained on their specific data.

    Data Preparation Per Client

    For each firm, collect:

    1. Historical work product: Contract reviews, case summaries, research memos, client correspondence
    2. Style guidelines: How the firm formats deliverables, terminology preferences, risk rating scales
    3. Domain focus: Practice area specialisation (M&A, litigation, IP, regulatory)

    Format as instruction-response pairs:

    {"instruction": "Review this merger agreement clause for antitrust risk: [clause text]", "response": "[Firm A's analysis style and risk assessment]"}
    

    Training Configuration

    For client-specific legal adapters:

    | Parameter         | Value                          | Notes                                        |
    |-------------------|--------------------------------|----------------------------------------------|
    | Base model        | Llama 3.1 8B                   | Shared across all clients                    |
    | LoRA rank         | 16-32                          | 16 for simple tasks, 32 for complex analysis |
    | LoRA alpha        | 32-64                          | Typically 2× the rank                        |
    | Target modules    | q_proj, v_proj, k_proj, o_proj | Attention layers only for efficiency         |
    | Learning rate     | 2e-4                           | Standard for LoRA                            |
    | Epochs            | 3                              | Sufficient for 2,000+ examples               |
    | Training examples | 1,500-3,000                    | Per client                                   |

    Training time per adapter: 30-90 minutes on a single GPU, using Ertas Studio or manual LoRA training.
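
    As a reference point for the manual route, the table above maps onto Hugging Face PEFT roughly as follows (a sketch, not Ertas Studio's internals; the dropout value is an assumed common default):

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    # Values mirror the table above; one config (and one output dir) per client.
    lora_config = LoraConfig(
        r=16,                      # 16 for simple tasks, 32 for complex analysis
        lora_alpha=32,             # typically 2x the rank
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.05,         # assumption: a common default, not from the table
        task_type="CAUSAL_LM",
    )

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()  # a fraction of a percent of the 8B weights

    # ...train with your preferred trainer at lr=2e-4 for ~3 epochs, then
    # save only the adapter weights (tens of MB, not the full model):
    model.save_pretrained("adapters/firm-a-contract-review")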

    Quality Validation

    Before deploying a client's adapter:

    1. Hold out 10-15% of training data for evaluation
    2. Run the evaluation set through both the base model and the fine-tuned model (see the harness sketch after this list)
    3. Compare outputs against the original lawyer-written responses
    4. Check for:
      • Accuracy of risk identification
      • Consistency of risk ratings
      • Adherence to firm-specific terminology
      • Proper citation of clause numbers and cross-references
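
    A minimal sketch of that comparison, assuming generate_base, generate_tuned, and score are wrappers around your own inference stack and scoring method:

    def evaluate_adapter(eval_set, generate_base, generate_tuned, score):
        """Score base vs. fine-tuned outputs against lawyer-written references."""
        results = []
        for example in eval_set:
            results.append({
                "base": score(generate_base(example["instruction"]), example["response"]),
                "tuned": score(generate_tuned(example["instruction"]), example["response"]),
            })
        return results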

    Adapter Hot-Swapping at Inference

    Hot-swapping is what makes multi-client serving practical. The base model stays loaded; only the lightweight adapter changes between requests.

    With Ollama

    Ollama supports multiple model variants. Create a Modelfile for each client adapter:

    FROM llama3.1:8b
    ADAPTER /path/to/firm-a-adapter.gguf
    SYSTEM "You are a legal document analyst for [Firm A]..."
    

    Register each as a separate model name:

    ollama create firm-a-contract-review -f Modelfile.firm-a
    ollama create firm-b-due-diligence -f Modelfile.firm-b
    

    At inference time, specify the model name in the API request. Ollama handles adapter loading transparently.
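
    For example, using Python's requests against Ollama's local chat endpoint (default port 11434; the model name is whichever adapter variant you registered above):

    import requests

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "firm-a-contract-review",  # selects Firm A's adapter
            "messages": [{"role": "user", "content": "Review this clause..."}],
            "stream": False,
        },
    )
    print(resp.json()["message"]["content"])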

    With vLLM

    vLLM supports LoRA adapter serving natively with the --enable-lora flag:

    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-8B \
      --enable-lora \
      --lora-modules firm-a=/adapters/firm-a firm-b=/adapters/firm-b
    

    Request a specific adapter via the model parameter in the API call:

    {
      "model": "firm-a",
      "messages": [{"role": "user", "content": "Review this clause..."}]
    }
    

    vLLM's LoRA implementation is particularly efficient — it can keep multiple adapters resident in GPU memory and switch between them with near-zero latency.
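
    Since the server is OpenAI-compatible, any OpenAI client can route requests to an adapter by name; for example, assuming the default port 8000 and no API key configured:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    resp = client.chat.completions.create(
        model="firm-a",  # the adapter name registered via --lora-modules
        messages=[{"role": "user", "content": "Review this clause..."}],
    )
    print(resp.choices[0].message.content)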

    Swap Latency

    | Method | Cold Swap (adapter not in memory) | Hot Swap (adapter cached) |
    |--------|-----------------------------------|---------------------------|
    | Ollama | 500-2,000 ms                      | Under 100 ms              |
    | vLLM   | 200-500 ms                        | Under 10 ms               |

    For most legal workflows, even cold swap latency is imperceptible — the user is waiting for a multi-page document analysis that takes 5-30 seconds regardless.

    Cost Efficiency

    The cost advantage of the LoRA approach becomes dramatic as your client count grows:

    Full Model Per Client (Naive Approach)

    | Clients | GPUs Needed   | Hardware Cost |
    |---------|---------------|---------------|
    | 1       | 1 × RTX 5090  | $2,000        |
    | 5       | 5 × RTX 5090  | $10,000       |
    | 10      | 10 × RTX 5090 | $20,000       |
    | 20      | 20 × RTX 5090 | $40,000       |

    LoRA Adapter Approach

    | Clients | GPUs Needed    | Hardware Cost |
    |---------|----------------|---------------|
    | 1-10    | 1 × RTX 5090   | $2,000        |
    | 10-25   | 1-2 × RTX 5090 | $2,000-4,000  |
    | 25-50   | 2-3 × RTX 5090 | $4,000-6,000  |

    A single GPU serves 10+ clients because:

    • The base model (16 GB) is loaded once
    • Each adapter adds only 50-100 MB to VRAM (or is swapped from SSD)
    • Legal workloads are bursty — not all clients generate inference requests simultaneously

    By the tables above, at 10 clients the LoRA approach is roughly 10x cheaper; at 20 clients, 10-20x cheaper.

    Per-Client Training Costs

    Fine-tuning each adapter through Ertas Studio:

    • Training compute: minimal (30-90 minutes on agency GPU)
    • Data preparation: 2-4 hours of agency time
    • Validation and deployment: 1-2 hours

    Total agency cost per new client adapter: roughly half a day of work. This is priced into the implementation fee — typically $5,000-15,000 for a legal AI deployment.

    Data Isolation Guarantees

    Law firms require absolute data isolation between clients. The LoRA architecture provides this at multiple levels:

    Training Data Isolation

    Each client's training data is used exclusively for their adapter. In Ertas Studio, each client is a separate project with separate data storage. No cross-client data mixing occurs during training.

    Adapter Isolation

    Each adapter file is independent of every other client's: the weights in Firm A's adapter are derived solely from Firm A's training data and contain no information from Firm B's. The adapter is a mathematical transformation learned exclusively from that one client's examples.

    Inference Isolation

    At inference time:

    • Each request specifies which adapter to use
    • The adapter is loaded exclusively for that request
    • Request inputs and outputs are logged separately per client
    • No shared state between client requests

    Audit Evidence

    For compliance documentation, you can demonstrate:

    1. Training data provenance: which data trained which adapter
    2. Adapter lineage: base model version + training configuration + adapter checksum
    3. Inference logging: which adapter served which request, with timestamps
    4. No cross-contamination: adapter weights are mathematically independent

    This level of isolation satisfies even the most stringent legal compliance requirements, including conflict wall requirements for firms with competing clients.
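
    As a sketch of what an adapter lineage record (point 2) might look like, with illustrative field names and paths:

    import datetime
    import hashlib
    import json

    def adapter_manifest(adapter_path, base_model, train_config):
        """Record base model version, training config, and adapter checksum."""
        with open(adapter_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        return {
            "adapter_file": adapter_path,
            "adapter_sha256": digest,
            "base_model": base_model,
            "training_config": train_config,
            "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }

    print(json.dumps(adapter_manifest(
        "adapters/firm-a-contract-review/adapter_model.safetensors",
        "meta-llama/Llama-3.1-8B",
        {"rank": 16, "alpha": 32, "epochs": 3},
    ), indent=2))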

    Getting Started

    1. Set up your base model on a GPU (Ollama or vLLM)
    2. Prepare training data for your first client
    3. Train a LoRA adapter using Ertas Studio or manual fine-tuning
    4. Deploy and test with the client
    5. Repeat for each new client — same base model, new adapter

    The marginal effort per client decreases as you standardise your data preparation and deployment pipeline.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
