    Running 10+ Fine-Tuned Models for Different Clients: Operations Guide

    An operations guide for AI agencies managing 10+ fine-tuned models across multiple clients — covering model organization, resource allocation, monitoring, updates, and scaling without chaos.

    Ertas Team

    At three clients, you can keep everything in your head. At five, a spreadsheet works. At ten, something breaks — a model gets deployed to the wrong client, an update overwrites a production adapter, or you realize you have no idea which GPU is running what.

    This is the multi-model reality for AI agencies. The work that got you here — bespoke fine-tuning, hands-on deployment, personal attention — doesn't scale unless you build systems around it. This guide is the operations playbook for running 10+ fine-tuned models across multiple clients without losing your mind or your margins.

    The Multi-Model Reality

    Most agencies hit the wall somewhere between 5 and 10 active client models. The symptoms are predictable:

    • You can't remember which version of which adapter is deployed where
    • Two team members retrain the same model on the same day with different data
    • A client reports degraded performance and you spend 2 hours figuring out what changed
    • Your GPU costs are climbing faster than your revenue because nothing is shared efficiently

    The root cause is always the same: ad-hoc management that worked at small scale doesn't survive contact with real volume. You need systems.

    Model Organization System

    The foundation is a naming convention that encodes everything you need to know at a glance. We recommend this format:

    {client}-{task}-{base}-v{major}.{minor}.{patch}
    

    For example:

    • acme-support-llama3-v2.1.0 — Acme Corp's support ticket model, based on Llama 3, second major version
    • baker-legal-mistral-v1.3.2 — Baker Law's legal review model, based on Mistral, on its third minor release with two patches applied

    This naming convention carries through everywhere: your file system, your deployment configs, your monitoring dashboards, and your client communications.
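    A convention this strict can be validated mechanically. Here is a minimal, hypothetical Python helper (not part of any Ertas tooling) that parses a model name into its parts so deployment scripts can reject malformed names:

```python
import re

# Matches {client}-{task}-{base}-v{major}.{minor}.{patch}
NAME_RE = re.compile(
    r"^(?P<client>[a-z0-9]+)-(?P<task>[a-z0-9]+)-(?P<base>[a-z0-9]+)"
    r"-v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)

def parse_model_name(name: str) -> dict:
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"model name does not follow convention: {name!r}")
    parts = m.groupdict()
    parts["version"] = (int(parts.pop("major")),
                        int(parts.pop("minor")),
                        int(parts.pop("patch")))
    return parts
```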

    The LoRA Adapter Library

    If you're running one full base model per client, you're doing it wrong. LoRA adapters are the entire reason multi-client AI agencies are viable.

    Structure your adapter library like this:

    models/
    ├── base/
    │   ├── llama3-8b/
    │   └── mistral-7b/
    ├── adapters/
    │   ├── acme/
    │   │   ├── support-v2.1.0/
    │   │   └── support-v2.0.0/  (previous version, kept for rollback)
    │   ├── baker/
    │   │   ├── legal-v1.3.2/
    │   │   └── legal-v1.3.1/
    │   └── ...
    └── configs/
        ├── acme-support.yaml
        └── baker-legal.yaml
    

    Each adapter directory contains the LoRA weights, the training config that produced them, a hash of the training data, and eval results. Everything needed to reproduce or roll back.
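    A minimal sketch of writing that per-adapter record, assuming a simple metadata.json layout (the field names here are illustrative, not a fixed format):

```python
import hashlib
import json
from pathlib import Path

def write_adapter_metadata(adapter_dir: Path, training_config: dict,
                           data_file: Path, eval_results: dict) -> Path:
    # Hash the training data so any adapter can be traced back to the
    # exact dataset that produced it.
    data_hash = hashlib.sha256(data_file.read_bytes()).hexdigest()
    record = {
        "training_config": training_config,   # hyperparameters used
        "training_data_sha256": data_hash,    # reproducibility check
        "eval_results": eval_results,         # post-training baseline
    }
    out = adapter_dir / "metadata.json"
    out.write_text(json.dumps(record, indent=2))
    return out
```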

    Base Model Sharing

    This is where operational efficiency lives. A single Llama 3 8B base model loaded into VRAM can serve multiple LoRA adapters simultaneously. The key insight: you don't need separate model instances for separate clients. You need separate adapters on shared infrastructure.

    In practice, this means grouping clients by base model. If 7 of your 12 clients use Llama 3 8B variants, those 7 adapters can share a single base model in memory.
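    A small planning helper makes the grouping concrete. This illustrative sketch buckets deployed model names (following the naming convention) by base model, so each bucket maps to one base model loaded in VRAM:

```python
from collections import defaultdict

def group_by_base(model_names: list[str]) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    for name in model_names:
        # {client}-{task}-{base}-v{version} -> base is the third field
        base = name.split("-")[2]
        groups[base].append(name)
    return dict(groups)
```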

    Resource Planning

    Hardware planning for multi-model serving requires specific numbers, not vibes. Here's what we've seen work:

    Single RTX 4090 (24GB VRAM):

    • 1 base model (7-8B parameters) + 3-5 LoRA adapters simultaneously
    • Handles ~50-80 concurrent requests across all adapters
    • Good for: agencies with up to 5 clients on the same base model

    Dual RTX 4090 setup:

    • 2 base models + 6-10 adapters total
    • Handles 100-160 concurrent requests
    • Good for: agencies with 8-12 clients across 2 base model families

    A100 80GB:

    • 1 large base model (70B) or 2-3 smaller base models + 10-15 adapters
    • Handles 200+ concurrent requests
    • Good for: agencies with 12-20 clients who need larger models

    The math matters. If you're paying $2/hour for an A100 and serving 15 clients at $3K/month each, your compute cost is ~$1,440/month against $45K in revenue. That's a 96.8% gross margin on infrastructure alone.
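    The arithmetic, spelled out (assuming the GPU runs around the clock for a 30-day month):

```python
gpu_cost = 2 * 24 * 30            # $2/hour, 24h/day, 30 days = $1,440/month
revenue = 15 * 3_000              # 15 clients at $3K/month = $45,000/month
margin = 1 - gpu_cost / revenue
print(f"{margin:.1%}")            # 96.8%
```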

    Memory Budget Per Adapter

    A LoRA adapter for a 7B model typically adds 10-50MB to VRAM, depending on rank. At rank 16 (which covers most use cases), you're looking at ~20MB per adapter. That means VRAM isn't your bottleneck — throughput and latency are.

    Plan for peak concurrent usage per client. If Client A sends 5 requests/minute during business hours and Client B sends 20, your serving infrastructure needs to handle 25 requests/minute on that base model during overlap hours.
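    The ~20MB figure checks out on the back of an envelope. This sketch assumes a Llama-style 7B architecture (32 layers, hidden size 4096) with rank-16 LoRA applied to the q and v projections; real adapters vary with which modules are adapted:

```python
rank, hidden, layers, targets = 16, 4096, 32, 2   # targets: q_proj and v_proj
# Each adapted matrix contributes an A (d_in x r) and B (r x d_out) pair.
params = layers * targets * rank * (hidden + hidden)
mb = params * 2 / 1e6                             # fp16 = 2 bytes per param
print(f"{params:,} params ≈ {mb:.0f} MB")         # 8,388,608 params ≈ 17 MB
```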

    Monitoring Essentials

    You cannot manage what you don't measure. For multi-client operations, you need four categories of monitoring:

    1. Per-Model Latency

    Track P50, P95, and P99 latency for every client's model separately. A latency spike on one adapter affects all adapters sharing that base model. Set alerts at 2x baseline P95.

    Target latencies for most agency use cases:

    • Simple classification/extraction: P95 < 500ms
    • Short generation (1-2 paragraphs): P95 < 2s
    • Long generation (full documents): P95 < 10s
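    A minimal per-adapter percentile tracker and the 2x-baseline alert rule can be sketched like this, assuming latency samples are collected in milliseconds per (client, model version) pair:

```python
def percentile(samples: list[float], p: float) -> float:
    # Nearest-rank percentile over the sorted samples.
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

def latency_summary(samples: list[float]) -> dict[str, float]:
    return {name: percentile(samples, p)
            for name, p in (("p50", 50), ("p95", 95), ("p99", 99))}

def should_alert(current_p95: float, baseline_p95: float) -> bool:
    # Alert when current P95 exceeds 2x the baseline P95.
    return current_p95 > 2 * baseline_p95
```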

    2. Accuracy Drift

    Models degrade over time as the world changes and client needs evolve. Set up automated eval runs — weekly at minimum — against each client's golden test set. Track accuracy, hallucination rate, and format compliance.

    When accuracy drops more than 3 percentage points from the post-training baseline, it's time to retrain. Don't wait for the client to notice.
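    The retrain trigger is simple enough to automate at the end of each eval run. A sketch, with accuracies expressed in percent:

```python
def needs_retrain(baseline_acc: float, current_acc: float,
                  threshold_pp: float = 3.0) -> bool:
    # Retrain when accuracy falls more than `threshold_pp` percentage
    # points below the post-training baseline.
    return baseline_acc - current_acc > threshold_pp
```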

    3. Usage Tracking

    Log every inference request with: timestamp, client ID, model version, input token count, output token count, latency. This data serves three purposes:

    • Capacity planning (when to add hardware)
    • Client billing (usage-based or for overage charges)
    • Training data collection (production inputs are your next training set)
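    A sketch of the request log, appending one JSON line per inference with the fields listed above (the layout is illustrative; any append-only structured log works):

```python
import json
import time

def log_request(path: str, client_id: str, model_version: str,
                input_tokens: int, output_tokens: int,
                latency_ms: float) -> None:
    record = {
        "ts": time.time(),
        "client_id": client_id,
        "model_version": model_version,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
    }
    # JSON Lines: greppable, and trivial to aggregate for billing
    # and capacity planning.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```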

    4. Cost Allocation Per Client

    Know exactly what each client costs you. The formula:

    Client cost = (GPU hours × share of compute) + (storage for adapter + data) + (staff hours for maintenance)
    

    If a client's cost exceeds 40% of their monthly fee, something needs to change — either your pricing or your efficiency.
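    The formula and the 40% rule, as a sketch with hypothetical monthly inputs per client:

```python
def client_cost(gpu_hours: float, gpu_rate: float, compute_share: float,
                storage_cost: float, staff_hours: float,
                staff_rate: float) -> float:
    # (GPU hours x share of compute) + storage + staff time
    return (gpu_hours * gpu_rate * compute_share
            + storage_cost
            + staff_hours * staff_rate)

def pricing_ok(cost: float, monthly_fee: float) -> bool:
    # Flag any client whose cost exceeds 40% of their monthly fee.
    return cost <= 0.4 * monthly_fee
```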

    Update Workflow

    Retraining is where most agencies create chaos. Here's the workflow that prevents it:

    Retrain Schedule

    Set a cadence per client tier:

    • Standard clients: quarterly retraining
    • Premium clients: monthly retraining
    • Enterprise clients: continuous improvement with monthly deploys

    Never retrain ad-hoc. Schedule it, resource it, and communicate it.

    A/B Deployment for Updates

    Never swap a production model in-place. Instead:

    1. Deploy the new adapter version alongside the current one
    2. Route 10% of traffic to the new version (canary)
    3. Monitor for 24-48 hours
    4. If metrics hold or improve, ramp to 50%, then 100%
    5. Keep the old version available for 7 days post-cutover

    This takes discipline, but it prevents the 3am "the model is broken" calls.
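    The canary split in step 2 can be as simple as deterministic hashing on the request ID, so a given request always routes to the same version while the fraction is ramped. A minimal sketch (names are illustrative):

```python
import hashlib

def pick_version(request_id: str, stable: str, canary: str,
                 canary_fraction: float) -> str:
    # Hash the request ID into a stable bucket in [0, 100).
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < canary_fraction * 100 else stable
```

Ramping from 10% to 50% to 100% is then just changing `canary_fraction` in config, with no redeploy.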

    Rollback Procedures

    Rollback should take less than 60 seconds. Since you're swapping LoRA adapters, not full models, this is achievable:

    1. Point the adapter reference back to the previous version
    2. The base model stays loaded — no restart needed
    3. Confirm with a quick smoke test against 5-10 known inputs
    4. Notify the client that you've reverted and are investigating

    If rollback takes longer than 5 minutes, your deployment system needs work.
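    One way to hit the 60-second target is to make the adapter reference a symlink, so rollback is re-pointing a link rather than reloading anything. This is a sketch of one possible layout, not a prescribed mechanism:

```python
from pathlib import Path

def rollback(live_link: Path, previous_version_dir: Path) -> None:
    # Build the new link beside the old one, then swap it in.
    tmp = live_link.with_suffix(".tmp")
    tmp.symlink_to(previous_version_dir, target_is_directory=True)
    tmp.replace(live_link)   # atomic rename on POSIX filesystems
```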

    Common Scaling Mistakes

    We've watched agencies make these mistakes repeatedly. Save yourself the pain:

    One base model per client. Loading separate instances of the same 7B model for each client wastes 90%+ of your VRAM. Use shared base models with per-client LoRA adapters.

    No versioning. "I'll just overwrite the adapter file" is a statement that precedes disaster. Version everything. Keep at least 3 previous versions per client.

    Manual deployment. If deploying a model update requires SSH-ing into a server and running commands by hand, you will make mistakes under pressure. Automate your deployment pipeline — even a simple script is better than manual steps.

    Ignoring resource contention. When Client A's batch job runs at 2pm and Client B's real-time API traffic peaks at the same time, both get slow. Understand your traffic patterns and plan for overlap.

    No cost tracking. Agencies that don't track per-client costs inevitably have clients that cost more to serve than they pay. This erodes your business without you realizing it.

    Ertas Studio's Multi-Model Dashboard

    Ertas Studio was built specifically for the multi-client agency workflow. The dashboard gives you a single view of all deployed models across all clients:

    • Model registry with full version history, training lineage, and eval scores
    • Resource monitor showing per-adapter compute usage and cost allocation
    • Automated eval pipeline that runs your test suites on schedule and alerts on drift
    • One-click deployment with canary routing and instant rollback
    • Client-scoped views so you can share monitoring data with clients without exposing other tenants

    The goal is to make managing 20 models feel like managing 2. The system handles the coordination; you handle the client relationships and model quality.

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    What This Looks Like in Practice

    An agency running 12 clients on Ertas typically operates with:

    • 2-3 base models serving all 12 clients via LoRA adapters
    • Automated weekly evals catching drift before clients notice
    • Monthly retraining cycles for premium clients, quarterly for standard
    • A deployment pipeline that takes a retrained adapter from eval to production in under an hour
    • Per-client cost tracking showing 70-85% gross margins

    That's the difference between an agency that scrambles and one that scales. The models are the product, but operations is the business.


    Building a multi-client AI practice? Read more about multi-tenant deployment architecture, how agencies use per-client LoRA adapters for law firms, and strategies for reducing costs as you scale.
