    Running 10+ Fine-Tuned Models for Different Clients: Operations Guide

    An operations guide for AI agencies managing 10+ fine-tuned models across multiple clients — covering model organization, resource allocation, monitoring, updates, and scaling without chaos.

    Ertas Team

    At three clients, you can keep everything in your head. At five, a spreadsheet works. At ten, something breaks — a model gets deployed to the wrong client, an update overwrites a production adapter, or you realize you have no idea which GPU is running what.

    This is the multi-model reality for AI agencies. The work that got you here — bespoke fine-tuning, hands-on deployment, personal attention — doesn't scale unless you build systems around it. This guide is the operations playbook for running 10+ fine-tuned models across multiple clients without losing your mind or your margins.

    The Multi-Model Reality

    Most agencies hit the wall somewhere between 5 and 10 active client models. The symptoms are predictable:

    • You can't remember which version of which adapter is deployed where
    • Two team members retrain the same model on the same day with different data
    • A client reports degraded performance and you spend 2 hours figuring out what changed
    • Your GPU costs are climbing faster than your revenue because nothing is shared efficiently

    The root cause is always the same: ad-hoc management that worked at small scale doesn't survive contact with real volume. You need systems.

    Model Organization System

    The foundation is a naming convention that encodes everything you need to know at a glance. We recommend this format:

    {client}-{task}-{base}-v{major}.{minor}.{patch}
    

    For example:

    • acme-support-llama3-v2.1.0 — Acme Corp's support ticket model, based on Llama 3, second major version
    • baker-legal-mistral-v1.3.2 — Baker Law's legal review model, based on Mistral, on its third minor release with two patches applied

    This naming convention carries through everywhere: your file system, your deployment configs, your monitoring dashboards, and your client communications.
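    A convention this strict can be validated mechanically. Here is a minimal, hypothetical Python helper (not part of any Ertas tooling) that parses a model name into its parts so deployment scripts can reject malformed names:

```python
import re

# Matches {client}-{task}-{base}-v{major}.{minor}.{patch}
NAME_RE = re.compile(
    r"^(?P<client>[a-z0-9]+)-(?P<task>[a-z0-9]+)-(?P<base>[a-z0-9]+)"
    r"-v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)

def parse_model_name(name: str) -> dict:
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"model name does not follow convention: {name!r}")
    parts = m.groupdict()
    parts["version"] = (int(parts.pop("major")),
                        int(parts.pop("minor")),
                        int(parts.pop("patch")))
    return parts
```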

    The LoRA Adapter Library

    If you're running one full base model per client, you're doing it wrong. LoRA adapters are the entire reason multi-client AI agencies are viable.

    Structure your adapter library like this:

    models/
    ├── base/
    │   ├── llama3-8b/
    │   └── mistral-7b/
    ├── adapters/
    │   ├── acme/
    │   │   ├── support-v2.1.0/
    │   │   └── support-v2.0.0/  (previous version, kept for rollback)
    │   ├── baker/
    │   │   ├── legal-v1.3.2/
    │   │   └── legal-v1.3.1/
    │   └── ...
    └── configs/
        ├── acme-support.yaml
        └── baker-legal.yaml
    

    Each adapter directory contains the LoRA weights, the training config that produced them, a hash of the training data, and eval results. Everything needed to reproduce or roll back.
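    A minimal sketch of writing that per-adapter record, assuming a simple metadata.json layout (the field names here are illustrative, not a fixed format):

```python
import hashlib
import json
from pathlib import Path

def write_adapter_metadata(adapter_dir: Path, training_config: dict,
                           data_file: Path, eval_results: dict) -> Path:
    # Hash the training data so any adapter can be traced back to the
    # exact dataset that produced it.
    data_hash = hashlib.sha256(data_file.read_bytes()).hexdigest()
    record = {
        "training_config": training_config,   # hyperparameters used
        "training_data_sha256": data_hash,    # reproducibility check
        "eval_results": eval_results,         # post-training baseline
    }
    out = adapter_dir / "metadata.json"
    out.write_text(json.dumps(record, indent=2))
    return out
```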

    Base Model Sharing

    This is where operational efficiency lives. A single Llama 3 8B base model loaded into VRAM can serve multiple LoRA adapters simultaneously. The key insight: you don't need separate model instances for separate clients. You need separate adapters on shared infrastructure.

    In practice, this means grouping clients by base model. If 7 of your 12 clients use Llama 3 8B variants, those 7 adapters can share a single base model in memory.
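    A small planning helper makes the grouping concrete. This illustrative sketch buckets deployed model names (following the naming convention) by base model, so each bucket maps to one base model loaded in VRAM:

```python
from collections import defaultdict

def group_by_base(model_names: list[str]) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    for name in model_names:
        # {client}-{task}-{base}-v{version} -> base is the third field
        base = name.split("-")[2]
        groups[base].append(name)
    return dict(groups)
```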

    Resource Planning

    Hardware planning for multi-model serving requires specific numbers, not vibes. Here's what we've seen work:

    Single RTX 4090 (24GB VRAM):

    • 1 base model (7-8B parameters) + 3-5 LoRA adapters simultaneously
    • Handles ~50-80 concurrent requests across all adapters
    • Good for: agencies with up to 5 clients on the same base model

    Dual RTX 4090 setup:

    • 2 base models + 6-10 adapters total
    • Handles 100-160 concurrent requests
    • Good for: agencies with 8-12 clients across 2 base model families

    A100 80GB:

    • 1 large base model (70B) or 2-3 smaller base models + 10-15 adapters
    • Handles 200+ concurrent requests
    • Good for: agencies with 12-20 clients who need larger models

    The math matters. If you're paying $2/hour for an A100 and serving 15 clients at $3K/month each, your compute cost is ~$1,440/month against $45K in revenue. That's a 96.8% gross margin on infrastructure alone.
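    The arithmetic, spelled out (assuming the GPU runs around the clock for a 30-day month):

```python
gpu_cost = 2 * 24 * 30            # $2/hour, 24h/day, 30 days = $1,440/month
revenue = 15 * 3_000              # 15 clients at $3K/month = $45,000/month
margin = 1 - gpu_cost / revenue
print(f"{margin:.1%}")            # 96.8%
```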

    Memory Budget Per Adapter

    A LoRA adapter for a 7B model typically adds 10-50MB to VRAM, depending on rank. At rank 16 (which covers most use cases), you're looking at ~20MB per adapter. That means VRAM isn't your bottleneck — throughput and latency are.

    Plan for peak concurrent usage per client. If Client A sends 5 requests/minute during business hours and Client B sends 20, your serving infrastructure needs to handle 25 requests/minute on that base model during overlap hours.
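    The ~20MB figure checks out on the back of an envelope. This sketch assumes a Llama-style 7B architecture (32 layers, hidden size 4096) with rank-16 LoRA applied to the q and v projections; real adapters vary with which modules are adapted:

```python
rank, hidden, layers, targets = 16, 4096, 32, 2   # targets: q_proj and v_proj
# Each adapted matrix contributes an A (d_in x r) and B (r x d_out) pair.
params = layers * targets * rank * (hidden + hidden)
mb = params * 2 / 1e6                             # fp16 = 2 bytes per param
print(f"{params:,} params ≈ {mb:.0f} MB")         # 8,388,608 params ≈ 17 MB
```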

    Monitoring Essentials

    You cannot manage what you don't measure. For multi-client operations, you need four categories of monitoring:

    1. Per-Model Latency

    Track P50, P95, and P99 latency for every client's model separately. A latency spike on one adapter affects all adapters sharing that base model. Set alerts at 2x baseline P95.

    Target latencies for most agency use cases:

    • Simple classification/extraction: P95 < 500ms
    • Short generation (1-2 paragraphs): P95 < 2s
    • Long generation (full documents): P95 < 10s
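    A minimal per-adapter percentile tracker and the 2x-baseline alert rule can be sketched like this, assuming latency samples are collected in milliseconds per (client, model version) pair:

```python
def percentile(samples: list[float], p: float) -> float:
    # Nearest-rank percentile over the sorted samples.
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

def latency_summary(samples: list[float]) -> dict[str, float]:
    return {name: percentile(samples, p)
            for name, p in (("p50", 50), ("p95", 95), ("p99", 99))}

def should_alert(current_p95: float, baseline_p95: float) -> bool:
    # Alert when current P95 exceeds 2x the baseline P95.
    return current_p95 > 2 * baseline_p95
```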

    2. Accuracy Drift

    Models degrade over time as the world changes and client needs evolve. Set up automated eval runs — weekly at minimum — against each client's golden test set. Track accuracy, hallucination rate, and format compliance.

    When accuracy drops more than 3 percentage points from the post-training baseline, it's time to retrain. Don't wait for the client to notice.
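    The retrain trigger is simple enough to automate at the end of each eval run. A sketch, with accuracies expressed in percent:

```python
def needs_retrain(baseline_acc: float, current_acc: float,
                  threshold_pp: float = 3.0) -> bool:
    # Retrain when accuracy falls more than `threshold_pp` percentage
    # points below the post-training baseline.
    return baseline_acc - current_acc > threshold_pp
```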

    3. Usage Tracking

    Log every inference request with: timestamp, client ID, model version, input token count, output token count, latency. This data serves three purposes:

    • Capacity planning (when to add hardware)
    • Client billing (usage-based or for overage charges)
    • Training data collection (production inputs are your next training set)
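    A sketch of the request log, appending one JSON line per inference with the fields listed above (the layout is illustrative; any append-only structured log works):

```python
import json
import time

def log_request(path: str, client_id: str, model_version: str,
                input_tokens: int, output_tokens: int,
                latency_ms: float) -> None:
    record = {
        "ts": time.time(),
        "client_id": client_id,
        "model_version": model_version,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
    }
    # JSON Lines: greppable, and trivial to aggregate for billing
    # and capacity planning.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```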

    4. Cost Allocation Per Client

    Know exactly what each client costs you. The formula:

    Client cost = (GPU hours × share of compute) + (storage for adapter + data) + (staff hours for maintenance)
    

    If a client's cost exceeds 40% of their monthly fee, something needs to change — either your pricing or your efficiency.
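    The formula and the 40% rule, as a sketch with hypothetical monthly inputs per client:

```python
def client_cost(gpu_hours: float, gpu_rate: float, compute_share: float,
                storage_cost: float, staff_hours: float,
                staff_rate: float) -> float:
    # (GPU hours x share of compute) + storage + staff time
    return (gpu_hours * gpu_rate * compute_share
            + storage_cost
            + staff_hours * staff_rate)

def pricing_ok(cost: float, monthly_fee: float) -> bool:
    # Flag any client whose cost exceeds 40% of their monthly fee.
    return cost <= 0.4 * monthly_fee
```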

    Update Workflow

    Retraining is where most agencies create chaos. Here's the workflow that prevents it:

    Retrain Schedule

    Set a cadence per client tier:

    • Standard clients: quarterly retraining
    • Premium clients: monthly retraining
    • Enterprise clients: continuous improvement with monthly deploys

    Never retrain ad-hoc. Schedule it, resource it, and communicate it.

    A/B Deployment for Updates

    Never swap a production model in-place. Instead:

    1. Deploy the new adapter version alongside the current one
    2. Route 10% of traffic to the new version (canary)
    3. Monitor for 24-48 hours
    4. If metrics hold or improve, ramp to 50%, then 100%
    5. Keep the old version available for 7 days post-cutover

    This takes discipline, but it prevents the 3am "the model is broken" calls.
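    The canary split in step 2 can be as simple as deterministic hashing on the request ID, so a given request always routes to the same version while the fraction is ramped. A minimal sketch (names are illustrative):

```python
import hashlib

def pick_version(request_id: str, stable: str, canary: str,
                 canary_fraction: float) -> str:
    # Hash the request ID into a stable bucket in [0, 100).
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < canary_fraction * 100 else stable
```

Ramping from 10% to 50% to 100% is then just changing `canary_fraction` in config, with no redeploy.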

    Rollback Procedures

    Rollback should take less than 60 seconds. Since you're swapping LoRA adapters, not full models, this is achievable:

    1. Point the adapter reference back to the previous version
    2. The base model stays loaded — no restart needed
    3. Confirm with a quick smoke test against 5-10 known inputs
    4. Notify the client that you've reverted and are investigating

    If rollback takes longer than 5 minutes, your deployment system needs work.
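    One way to hit the 60-second target is to make the adapter reference a symlink, so rollback is re-pointing a link rather than reloading anything. This is a sketch of one possible layout, not a prescribed mechanism:

```python
from pathlib import Path

def rollback(live_link: Path, previous_version_dir: Path) -> None:
    # Build the new link beside the old one, then swap it in.
    tmp = live_link.with_suffix(".tmp")
    tmp.symlink_to(previous_version_dir, target_is_directory=True)
    tmp.replace(live_link)   # atomic rename on POSIX filesystems
```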

    Common Scaling Mistakes

    We've watched agencies make these mistakes repeatedly. Save yourself the pain:

    One base model per client. Loading separate instances of the same 7B model for each client wastes 90%+ of your VRAM. Use shared base models with per-client LoRA adapters.

    No versioning. "I'll just overwrite the adapter file" is a statement that precedes disaster. Version everything. Keep at least 3 previous versions per client.

    Manual deployment. If deploying a model update requires SSH-ing into a server and running commands by hand, you will make mistakes under pressure. Automate your deployment pipeline — even a simple script is better than manual steps.

    Ignoring resource contention. When Client A's batch job runs at 2pm and Client B's real-time API traffic peaks at the same time, both get slow. Understand your traffic patterns and plan for overlap.

    No cost tracking. Agencies that don't track per-client costs inevitably have clients that cost more to serve than they pay. This erodes your business without you realizing it.

    Ertas Studio's Multi-Model Dashboard

    Ertas Studio was built specifically for the multi-client agency workflow. The dashboard gives you a single view of all deployed models across all clients:

    • Model registry with full version history, training lineage, and eval scores
    • Resource monitor showing per-adapter compute usage and cost allocation
    • Automated eval pipeline that runs your test suites on schedule and alerts on drift
    • One-click deployment with canary routing and instant rollback
    • Client-scoped views so you can share monitoring data with clients without exposing other tenants

    The goal is to make managing 20 models feel like managing 2. The system handles the coordination; you handle the client relationships and model quality.

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    What This Looks Like in Practice

    An agency running 12 clients on Ertas typically operates with:

    • 2-3 base models serving all 12 clients via LoRA adapters
    • Automated weekly evals catching drift before clients notice
    • Monthly retraining cycles for premium clients, quarterly for standard
    • A deployment pipeline that takes a retrained adapter from eval to production in under an hour
    • Per-client cost tracking showing 70-85% gross margins

    That's the difference between an agency that scrambles and one that scales. The models are the product, but operations is the business.


    Building a multi-client AI practice? Read more about multi-tenant deployment architecture, how agencies use per-client LoRA adapters for law firms, and strategies for reducing costs as you scale.
