
Multi-Client Fine-Tuning: One Base Model, Custom LoRA Adapters Per Law Firm
How to use LoRA adapters to serve multiple law firm clients from a single base model — covering architecture, training, hot-swapping, cost efficiency, and data isolation guarantees.
The economics of running an AI agency break down if you need a separate GPU for every client. A Llama 3.1 8B model takes about 16 GB of VRAM in FP16. Five clients, five full models, five GPUs — that is $10,000-15,000 in hardware before you earn a dollar.
LoRA (Low-Rank Adaptation) changes this equation completely. One base model stays in GPU memory. Per-client adapters — typically 50-200 MB each — are swapped in and out at inference time. One GPU serves all your clients.
This article covers the architecture, how to train client-specific adapters, how hot-swapping works, the cost implications, and the data isolation guarantees that law firms require.
LoRA Architecture for Multi-Client Serving
How LoRA Works
Standard fine-tuning modifies all the model's weights — billions of parameters. LoRA takes a different approach: it freezes the base model and trains small "adapter" matrices that modify the model's behaviour at specific layers.
The math: instead of updating a weight matrix W (size d × k) directly, LoRA trains two small matrices B (size d × r) and A (size r × k), where the rank r is much smaller than d or k. The effective weight becomes W + BA.
For a rank-16 LoRA on a 7B model:
- Base model size: ~14 GB (in FP16)
- LoRA adapter size: ~50-100 MB
- Combined inference: Same speed as the base model (adapter matrices are merged or applied efficiently)
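The arithmetic behind those numbers, as a quick sketch. The projection shapes below are assumptions modelled on Llama-style attention (4096 hidden size, grouped-query attention with a 1024-dim KV projection, 32 layers), so treat the exact figures as illustrative:

```python
# Back-of-envelope LoRA adapter sizing. Shapes are assumptions modelled on
# Llama-style attention; adjust for your actual base model.
RANK = 16
NUM_LAYERS = 32
BYTES_PER_PARAM = 2  # FP16; use 4 for FP32 checkpoints

# (d, k) shapes of the attention projections targeted per layer
projections = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

# Adapting a d x k matrix W adds B (d x r) plus A (r x k): r * (d + k) params
params_per_layer = sum(RANK * (d + k) for d, k in projections.values())
total_params = params_per_layer * NUM_LAYERS

print(f"{total_params / 1e6:.1f}M adapter params")        # ~13.6M
print(f"~{total_params * BYTES_PER_PARAM / 1e6:.0f} MB")  # ~27 MB at FP16
```

Doubling the rank to 32, or saving in FP32, lands the file in the 50-100 MB range quoted above.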
Multi-Client Architecture
```
GPU Memory:
┌────────────────────────────────┐
│ Base Model (Llama 3.1 8B)      │ ← Loaded once, stays in memory
│ ~16 GB (FP16)                  │
├────────────────────────────────┤
│ Active LoRA Adapter            │ ← Swapped per request
│ (Client-specific, ~100 MB)     │
└────────────────────────────────┘

Adapter Storage (SSD):
├── firm-a-contract-review.safetensors (85 MB)
├── firm-b-due-diligence.safetensors (92 MB)
├── firm-c-case-summary.safetensors (78 MB)
├── firm-d-regulatory.safetensors (110 MB)
└── firm-e-intake-triage.safetensors (65 MB)
```
One RTX 5090 (32 GB VRAM) can hold the base model plus several adapters simultaneously, or swap adapters in from SSD, typically in under a second (see the swap latency figures below).
Training Client-Specific Adapters
Each law firm client gets their own adapter trained on their specific data.
Data Preparation Per Client
For each firm, collect:
- Historical work product: Contract reviews, case summaries, research memos, client correspondence
- Style guidelines: How the firm formats deliverables, terminology preferences, risk rating scales
- Domain focus: Practice area specialisation (M&A, litigation, IP, regulatory)
Format as instruction-response pairs:
{"instruction": "Review this merger agreement clause for antitrust risk: [clause text]", "response": "[Firm A's analysis style and risk assessment]"}
Training Configuration
For client-specific legal adapters:
| Parameter | Value | Notes |
|---|---|---|
| Base model | Llama 3.1 8B | Shared across all clients |
| LoRA rank | 16-32 | 16 for simple tasks, 32 for complex analysis |
| LoRA alpha | 32-64 | Typically 2× the rank |
| Target modules | q_proj, v_proj, k_proj, o_proj | Attention layers only for efficiency |
| Learning rate | 2e-4 | Standard for LoRA |
| Epochs | 3 | Sufficient for 2,000+ examples |
| Training examples | 1,500-3,000 | Per client |
Training time per adapter: 30-90 minutes on a single GPU, using Ertas Studio or manual LoRA training.
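If you go the manual route, the table maps directly onto Hugging Face PEFT's LoraConfig. A minimal sketch, with dataset loading and the training loop omitted; the dropout value is an assumption, not from the table above:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# The shared base model: every client adapter is trained on top of this
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Mirrors the training configuration table above
config = LoraConfig(
    r=16,                    # LoRA rank; 32 for complex analysis tasks
    lora_alpha=32,           # typically 2x the rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,       # assumption; not specified in the table
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # sanity check: well under 1% trainable

# ... train at lr=2e-4 for ~3 epochs with your preferred trainer, then:
model.save_pretrained("adapters/firm-a-contract-review")
```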
Quality Validation
Before deploying a client's adapter:
- Hold out 10-15% of training data for evaluation
- Run the evaluation set through both the base model and the fine-tuned model
- Compare outputs against the original lawyer-written responses
- Check for:
  - Accuracy of risk identification
  - Consistency of risk ratings
  - Adherence to firm-specific terminology
  - Proper citation of clause numbers and cross-references
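A sketch of the comparison harness. Here generate_fn stands in for whichever inference call you use (Ollama, vLLM, or transformers), and scoring the rubric items above is left as firm-specific logic:

```python
import json

def evaluate(generate_fn, holdout_path):
    """Pair model outputs with the lawyer-written references for review."""
    results = []
    with open(holdout_path) as f:
        for line in f:
            example = json.loads(line)
            results.append({
                "instruction": example["instruction"],
                "reference": example["response"],  # original lawyer answer
                "output": generate_fn(example["instruction"]),
            })
    return results

# Same held-out set through both models, then review side by side:
# base_results  = evaluate(base_generate, "firm-a-holdout.jsonl")
# tuned_results = evaluate(tuned_generate, "firm-a-holdout.jsonl")
```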
Adapter Hot-Swapping at Inference
Hot-swapping is what makes multi-client serving practical. The base model stays loaded; only the lightweight adapter changes between requests.
With Ollama
Ollama supports multiple model variants. Create a Modelfile for each client adapter:
```
FROM llama3.1:8b
ADAPTER /path/to/firm-a-adapter.gguf
SYSTEM "You are a legal document analyst for [Firm A]..."
```
Register each as a separate model name:
```bash
ollama create firm-a-contract-review -f Modelfile.firm-a
ollama create firm-b-due-diligence -f Modelfile.firm-b
```
At inference time, specify the model name in the API request. Ollama handles adapter loading transparently.
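For example, against Ollama's /api/chat endpoint (shown with Python and requests; any HTTP client works):

```python
import requests

# The model name selects which client adapter Ollama loads
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "firm-a-contract-review",
        "messages": [{"role": "user", "content": "Review this clause: ..."}],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```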
With vLLM
vLLM supports LoRA adapter serving natively with the --enable-lora flag:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B \
  --enable-lora \
  --lora-modules firm-a=/adapters/firm-a firm-b=/adapters/firm-b
```
Request a specific adapter via the model parameter in the API call:
```json
{
  "model": "firm-a",
  "messages": [{"role": "user", "content": "Review this clause..."}]
}
```
vLLM's LoRA implementation is particularly efficient — it can keep multiple adapters resident in GPU memory and switch between them with near-zero latency.
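In practice you put a thin routing layer in front, mapping each firm's ID to its registered adapter name. A sketch against vLLM's OpenAI-compatible endpoint (the client-to-adapter map is illustrative):

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key is unused for a local server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Each firm's requests go to its own adapter, never anyone else's
ADAPTER_FOR_CLIENT = {"firm-a": "firm-a", "firm-b": "firm-b"}

def run_inference(client_id: str, prompt: str) -> str:
    completion = client.chat.completions.create(
        model=ADAPTER_FOR_CLIENT[client_id],  # selects the LoRA adapter
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```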
Swap Latency
| Method | Cold Swap (adapter not in memory) | Hot Swap (adapter cached) |
|---|---|---|
| Ollama | 500-2000 ms | Under 100 ms |
| vLLM | 200-500 ms | Under 10 ms |
For most legal workflows, even cold swap latency is imperceptible — the user is waiting for a multi-page document analysis that takes 5-30 seconds regardless.
Cost Efficiency
The cost advantage of the LoRA approach becomes dramatic as your client count grows:
Full Model Per Client (Naive Approach)
| Clients | GPUs Needed | Hardware Cost |
|---|---|---|
| 1 | 1 × RTX 5090 | $2,000 |
| 5 | 5 × RTX 5090 | $10,000 |
| 10 | 10 × RTX 5090 | $20,000 |
| 20 | 20 × RTX 5090 | $40,000 |
LoRA Adapter Approach
| Clients | GPUs Needed | Hardware Cost |
|---|---|---|
| 1-10 | 1 × RTX 5090 | $2,000 |
| 10-25 | 1-2 × RTX 5090 | $2,000-4,000 |
| 25-50 | 2-3 × RTX 5090 | $4,000-6,000 |
A single GPU serves 10+ clients because:
- The base model (~16 GB) is loaded once
- Each adapter adds only 50-100 MB to VRAM (or is swapped from SSD)
- Legal workloads are bursty — not all clients generate inference requests simultaneously
At 10 clients, the LoRA approach is roughly 10x cheaper on hardware; at 20 clients, between 10x and 20x, as the sketch below shows.
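The arithmetic, as a sketch; the ten-clients-per-GPU assumption is the conservative end of the tables above:

```python
import math

GPU_COST = 2_000  # RTX 5090, per the tables above

def naive_cost(clients: int) -> int:
    return clients * GPU_COST  # one GPU per client

def lora_cost(clients: int) -> int:
    return math.ceil(clients / 10) * GPU_COST  # ~10 clients per GPU

for n in (10, 20):
    print(f"{n} clients: {naive_cost(n) / lora_cost(n):.0f}x cheaper")
# 10 clients: 10x cheaper
# 20 clients: 10x cheaper (20x if one GPU stretches to 20 bursty clients)
```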
Per-Client Training Costs
Fine-tuning each adapter through Ertas Studio:
- Training compute: minimal (30-90 minutes on agency GPU)
- Data preparation: 2-4 hours of agency time
- Validation and deployment: 1-2 hours
Total agency cost per new client adapter: roughly half a day of work. This is priced into the implementation fee — typically $5,000-15,000 for a legal AI deployment.
Data Isolation Guarantees
Law firms require absolute data isolation between clients. The LoRA architecture provides this at multiple levels:
Training Data Isolation
Each client's training data is used exclusively for their adapter. In Ertas Studio, each client is a separate project with separate data storage. No cross-client data mixing occurs during training.
Adapter Isolation
Each adapter file is fully independent — the weights in Firm A's adapter are a mathematical transformation learned solely from Firm A's data, and contain no information derived from Firm B's training data.
Inference Isolation
At inference time:
- Each request specifies which adapter to use
- Only the specified adapter is applied to that request, even when other adapters are resident in memory
- Request inputs and outputs are logged separately per client
- No shared state between client requests
Audit Evidence
For compliance documentation, you can demonstrate:
- Training data provenance: which data trained which adapter
- Adapter lineage: base model version + training configuration + adapter checksum
- Inference logging: which adapter served which request, with timestamps
- No cross-contamination: adapter weights are mathematically independent
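Recording that lineage can be as simple as hashing each adapter file and storing the hash with its training metadata. A sketch (paths and metadata fields are illustrative):

```python
import hashlib
import json
from datetime import datetime, timezone

def record_lineage(adapter_path: str, base_model: str, config: dict) -> dict:
    """Checksum an adapter file and capture its provenance for audit logs."""
    with open(adapter_path, "rb") as f:
        sha256 = hashlib.sha256(f.read()).hexdigest()
    return {
        "adapter": adapter_path,
        "sha256": sha256,
        "base_model": base_model,   # e.g. "meta-llama/Llama-3.1-8B"
        "training_config": config,  # rank, alpha, epochs, dataset ID
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

entry = record_lineage(
    "adapters/firm-a-contract-review.safetensors",
    "meta-llama/Llama-3.1-8B",
    {"rank": 16, "alpha": 32, "epochs": 3},
)
print(json.dumps(entry, indent=2))
```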
This level of isolation supports stringent legal compliance requirements, including conflict wall requirements for firms with competing clients.
Getting Started
- Set up your base model on a GPU (Ollama or vLLM)
- Prepare training data for your first client
- Train a LoRA adapter using Ertas Studio or manual fine-tuning
- Deploy and test with the client
- Repeat for each new client — same base model, new adapter
The marginal effort per client decreases as you standardise your data preparation and deployment pipeline.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Multi-Tenant AI Deployment for Agencies — Broader architecture for serving multiple clients
- Model Distillation and LoRA Guide — Technical deep dive into LoRA training and distillation