
    Multi-Tenant AI Deployment: One Base Model, Dozens of Client Adapters

    How AI agencies can serve dozens of clients from a single base model using LoRA adapter hot-swapping — the architecture behind scalable, cost-effective multi-tenant AI.

Ertas Team

    If you run an AI agency, you already know the tension: every client wants a model that feels custom-trained for their domain, their tone, their edge cases. But spinning up a dedicated model instance for each client is a fast track to GPU bankruptcy. The math simply does not work at scale.

    The good news is that you do not have to choose between personalization and profitability. Multi-tenant AI deployment with LoRA adapter hot-swapping lets you serve dozens of clients from a single base model — each getting genuinely customized behavior without the cost of separate infrastructure.

    The Multi-Tenant Challenge

    Agencies typically start with a single fine-tuned model shared across all clients. That works until Client A needs formal medical language, Client B needs casual e-commerce copy, and Client C needs structured legal summaries. Suddenly your one-size-fits-all model is pleasing nobody.

    The naive solution is one model per client. Load a 7B parameter model for each, and you are looking at roughly 14GB of VRAM per instance. Twenty clients means 280GB of GPU memory — multiple A100s just to keep the lights on. Hosting costs balloon, and your margins evaporate.

    What you need is an architecture that delivers per-client customization at shared-infrastructure cost.

    The Architecture: Base Model + Per-Client Adapters

    The solution is straightforward in concept: keep one copy of the base model loaded in GPU memory and swap lightweight LoRA adapters per request.

    A LoRA adapter modifies the behavior of a model by injecting small trainable weight matrices into specific layers. The key insight is that these adapters are tiny — typically 50-150MB for a 7B model, compared to the 14GB base. The base model handles the heavy lifting of general language understanding. The adapter steers output toward a specific client's style, domain, and requirements.

    In practice, your inference server holds the base model resident in GPU memory at all times. When a request arrives tagged with a client ID, the server loads the corresponding adapter, runs inference, and returns the result. The base weights never move.
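Here is a minimal sketch of that pattern using Hugging Face's transformers and peft libraries. The base model name and the ./adapters/&lt;client_id&gt; paths are illustrative placeholders, not anything specific to a particular stack:

```python
# Minimal sketch: one resident base model, per-client LoRA adapters registered by name.
# BASE_MODEL and the ./adapters/<client_id> paths are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)

# Wrap the base once with the first client's adapter, then register others by name.
model = PeftModel.from_pretrained(base, "./adapters/client_a", adapter_name="client_a")
model.load_adapter("./adapters/client_b", adapter_name="client_b")

def generate_for(client_id: str, prompt: str) -> str:
    model.set_adapter(client_id)  # activate this client's adapter; base weights stay put
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```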

    How Adapter Hot-Swapping Works

    The mechanics of adapter swapping are surprisingly efficient. A LoRA adapter modifies only a small subset of the model's weight matrices — usually the attention layers. Loading an adapter means adding these small delta matrices on top of the base weights. Unloading means removing them.

    On modern hardware, this swap takes single-digit milliseconds. The base model stays resident in VRAM throughout. There is no model loading, no checkpoint deserialization, no warmup period. The adapter simply slots in and out.

    This is fundamentally different from loading a full model, which can take 30-60 seconds depending on size and storage speed.
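To make the "delta matrices" concrete, here is the LoRA update written out in plain tensor form. It is a toy illustration with made-up dimensions, not production code:

```python
# Toy illustration: a LoRA adapter augments a frozen weight W with a low-rank update,
# W_effective = W + (alpha / r) * B @ A. Dimensions below are made up for readability.
import torch

d, r, alpha = 4096, 16, 32        # hidden size, LoRA rank, scaling factor
W = torch.randn(d, d)             # frozen base weight, shared by every client
A = torch.randn(r, d) * 0.01      # client-specific adapter half (r x d)
B = torch.zeros(d, r)             # client-specific adapter half (d x r)

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    # Swapping adapters means swapping which A and B are applied; W never changes.
    return x @ W.T + (alpha / r) * ((x @ A.T) @ B.T)

x = torch.randn(1, d)
print(lora_forward(x).shape)      # torch.Size([1, 4096])
```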

    The Storage Math

    Here is where multi-tenant deployment gets compelling at the spreadsheet level:

    Traditional approach (one model per client): 20 clients x 14GB per model = 280GB total VRAM needed

    Adapter approach: 1 x 14GB base model + 20 x 100MB adapters = 16GB total VRAM (adapters loaded on demand)

That is a roughly 17x reduction in memory requirements. You can serve 20 clients from a single GPU, a workload that would have required a multi-node cluster under the traditional approach. At 50 clients, the savings are even more dramatic.

    Adapter storage on disk is equally modest. A hundred adapters at 100MB each is 10GB of SSD space — trivial by any measure.
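The arithmetic is simple enough to sanity-check in a few lines:

```python
# Back-of-the-envelope check of the numbers above.
clients = 20
base_gb = 14.0        # ~7B parameters in fp16
adapter_gb = 0.1      # ~100MB per LoRA adapter

one_model_per_client = clients * base_gb               # 280.0 GB of VRAM
shared_base = base_gb + clients * adapter_gb           # 16.0 GB of VRAM
print(f"{one_model_per_client / shared_base:.1f}x")    # 17.5x
```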

    Request Routing and Inference Flow

    The request flow for multi-tenant inference looks like this:

    1. Client request arrives with an API key or client identifier
    2. Router resolves client ID to the corresponding adapter file
    3. Adapter cache check — if the adapter is already loaded, skip to step 5
    4. Load adapter into GPU memory alongside the base model
    5. Run inference with the combined base + adapter weights
    6. Return response to the client
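In code, the whole flow fits in one small handler. This sketch assumes a hypothetical CLIENT_REGISTRY mapping API keys to client IDs, plus the model and generate_for helper from the earlier example:

```python
# Sketch of steps 1-6, reusing the `model` / `generate_for` objects defined earlier.
# CLIENT_REGISTRY is a hypothetical lookup from API key to client ID.
CLIENT_REGISTRY = {"key-abc": "client_a", "key-def": "client_b"}
loaded_adapters = {"client_a", "client_b"}      # adapters already resident in VRAM

def handle_request(api_key: str, prompt: str) -> str:
    client_id = CLIENT_REGISTRY[api_key]                       # steps 1-2: resolve client
    if client_id not in loaded_adapters:                       # step 3: cache check
        model.load_adapter(f"./adapters/{client_id}", adapter_name=client_id)  # step 4
        loaded_adapters.add(client_id)
    return generate_for(client_id, prompt)                     # steps 5-6: infer and return
```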

    For agencies with a manageable number of active clients (say, under 20 concurrently), you can keep all adapters loaded simultaneously. A 7B base model plus 20 adapters fits comfortably in 24GB of VRAM — a single consumer-grade GPU.

    For larger client rosters, an LRU (least recently used) cache strategy works well. Keep the most active clients' adapters loaded, and swap less active ones on demand. The millisecond swap time means even cache misses are invisible to end users.
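A minimal LRU cache over adapters might look like the following. It leans on peft's delete_adapter to free an evicted client's weights and reuses the model object from the earlier sketch; treat it as a starting point rather than a drop-in component:

```python
# Sketch of LRU adapter caching on top of the PeftModel from earlier examples.
from collections import OrderedDict

MAX_RESIDENT_ADAPTERS = 20          # tune to your VRAM headroom

class AdapterCache:
    def __init__(self) -> None:
        self._lru = OrderedDict()   # client_id -> None, ordered oldest to newest

    def activate(self, client_id: str) -> None:
        if client_id in self._lru:
            self._lru.move_to_end(client_id)                 # cache hit: mark most recent
        else:
            if len(self._lru) >= MAX_RESIDENT_ADAPTERS:      # cache full: evict coldest
                victim, _ = self._lru.popitem(last=False)
                model.delete_adapter(victim)
            model.load_adapter(f"./adapters/{client_id}", adapter_name=client_id)
            self._lru[client_id] = None
        model.set_adapter(client_id)                         # route inference to this client
```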

    Performance Considerations

    While the architecture is elegant, there are practical details worth planning for:

    Adapter loading latency. Cold-loading an adapter from SSD takes 10-50ms. From NVMe, it is faster. For latency-sensitive applications, pre-warm adapters for clients with predictable usage patterns.

    Batch inference. If multiple requests for the same client arrive simultaneously, batch them. If requests for different clients arrive, you have two options: process them sequentially (swapping adapters between requests) or maintain multiple adapter slots and process in parallel. The right choice depends on your throughput requirements.
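One simple way to capture most of the benefit without parallel adapter slots is to group the queue by client before processing, so each swap serves a whole batch. A sketch, again reusing the model and tokenizer from earlier (generation is still sequential within each group; true batched generation is a further step):

```python
# Sketch: drain a request queue grouped by client so adapter swaps happen once per
# client batch rather than once per request. Reuses `model` / `tokenizer` from earlier.
from itertools import groupby

def process_queue(requests):
    """requests: iterable of (client_id, prompt) tuples."""
    ordered = sorted(requests, key=lambda item: item[0])
    for client_id, batch in groupby(ordered, key=lambda item: item[0]):
        model.set_adapter(client_id)                    # single swap for the whole batch
        for _, prompt in batch:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            output = model.generate(**inputs, max_new_tokens=200)
            yield client_id, tokenizer.decode(output[0], skip_special_tokens=True)
```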

    Adapter versioning. Clients iterate. Their adapter from three months ago may be outdated. You need a system for versioning adapters, rolling back, and A/B testing new versions against production traffic.

    Infrastructure Sizing

    A rough guide for infrastructure planning:

    • 1-20 concurrent clients: Single GPU server (24-48GB VRAM). All adapters stay loaded. Simple and cost-effective.
    • 20-100 concurrent clients: Single high-end GPU (80GB VRAM) or a pair of 48GB GPUs. LRU adapter caching handles the rotation.
    • 100+ concurrent clients: GPU cluster with load balancing. Shard clients across nodes, each running the same base model with a subset of adapters.

    Most agencies fall squarely in the first tier. A single server with an RTX 4090 or A6000 can handle 20+ clients with comfortable headroom.

    How Ertas Fits Into This Architecture

    Ertas is designed to make multi-tenant AI deployment practical for agencies that do not employ a dedicated ML ops team.

Per-client adapter management. Train, version, and deploy LoRA adapters for each client through a unified interface. Each client's training data and adapter history are isolated and auditable.

    Vault for data isolation. Client data never co-mingles. Ertas Vault enforces strict tenant isolation at the data layer — critical for agencies handling sensitive client information across industries.

    GGUF export. When a client needs their model running on-premise or on edge devices, export their adapter merged with the base model as a single GGUF file. One click, and they have a standalone model ready for Ollama or llama.cpp.
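If you were doing that merge by hand with open tooling, it would look roughly like this. The sketch uses peft's merge_and_unload and leaves the GGUF conversion to llama.cpp's converter script, whose name varies between versions. It illustrates the workflow, not Ertas's implementation:

```python
# Hedged sketch of the manual equivalent: merge one client's LoRA adapter into the
# base weights, save a standard Hugging Face checkpoint, then convert it to GGUF.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "mistralai/Mistral-7B-v0.1"      # illustrative placeholder

base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "./adapters/client_a").merge_and_unload()

merged.save_pretrained("./export/client_a-merged")
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained("./export/client_a-merged")

# Then, from a llama.cpp checkout (script name may differ across versions):
#   python convert_hf_to_gguf.py ./export/client_a-merged --outfile client_a.gguf
```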

    The result is an agency that can onboard a new client, fine-tune their adapter, and deploy it into the multi-tenant stack — all without touching infrastructure code.


    Start Building Your Multi-Tenant Stack

    Multi-tenant AI deployment is not a future architecture pattern. It is how the most efficient AI agencies operate today. The combination of shared base models and per-client LoRA adapters delivers genuine customization at a fraction of the cost.

    If you are ready to move beyond one-model-per-client and build a scalable AI agency, Ertas provides the training, deployment, and data management infrastructure to get there.


    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
