What is Multi-Tenant Inference?

    Serving multiple clients, or tenants, from a single model deployment using per-tenant LoRA adapters: the base model is shared to cut infrastructure costs while each tenant still receives customized AI behavior.

    Definition

    Multi-tenant inference is an infrastructure pattern where a single base model serves inference requests for multiple distinct clients (tenants), each receiving customized behavior through their own LoRA adapter layered on top of the shared base weights. Instead of deploying a separate model instance per client — which scales linearly in GPU memory and cost — the base model is loaded once, and lightweight adapters (typically 10-100 MB each) are swapped in per request based on the tenant identifier.
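
    A quick back-of-the-envelope check on that adapter size figure (a sketch, assuming a 7B-class model with hidden size 4096, 32 layers, and rank-16 adapters on two 4096 x 4096 attention projections per layer; real adapter configurations vary):

    ```python
    # Each adapted d x k weight matrix gains two low-rank factors, A (r x k) and
    # B (d x r), i.e. r * (d + k) extra parameters per matrix.
    hidden, layers, rank, bytes_per_param = 4096, 32, 16, 2  # fp16 storage

    params_per_matrix = rank * (hidden + hidden)   # one 4096 x 4096 projection
    params_per_layer = 2 * params_per_matrix       # e.g. two adapted projections per layer
    total_params = layers * params_per_layer

    print(f"{total_params / 1e6:.1f}M params, ~{total_params * bytes_per_param / 1e6:.0f} MB")
    # -> 8.4M params, ~17 MB, squarely inside the 10-100 MB range quoted above
    ```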

    This architecture mirrors how multi-tenant SaaS applications share a single database engine while isolating data per customer. The inference server maintains a pool of loaded adapters in GPU or CPU memory, routes each incoming request to the correct adapter based on a tenant ID header or API key, and applies that adapter's low-rank weights on top of the base model at inference time rather than permanently merging them, which is what allows adapters to be swapped per request. Modern serving frameworks like vLLM and LoRAX support this natively, enabling adapter hot-swapping with minimal latency overhead — typically adding less than 5ms per request compared to single-tenant inference.
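
    As a concrete illustration, here is a minimal sketch of per-request adapter selection using vLLM's LoRA support (the tenant names and adapter paths are hypothetical, and argument names may differ slightly between vLLM versions):

    ```python
    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    # Load the shared base model once, with LoRA serving enabled.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True, max_loras=8)

    # Hypothetical mapping from tenant ID to (adapter name, integer ID, adapter path).
    TENANT_ADAPTERS = {
        "acme": ("acme-support", 1, "/adapters/acme-support-v3"),
        "globex": ("globex-faq", 2, "/adapters/globex-faq-v1"),
    }

    def generate_for_tenant(tenant_id: str, prompt: str) -> str:
        name, int_id, path = TENANT_ADAPTERS[tenant_id]
        outputs = llm.generate(
            [prompt],
            SamplingParams(max_tokens=256),
            lora_request=LoRARequest(name, int_id, path),  # adapter applied for this request only
        )
        return outputs[0].outputs[0].text
    ```

    The same pattern is available over HTTP when vLLM runs as an OpenAI-compatible server, where a registered adapter can be selected via the model name in the request.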

    Why It Matters

    For AI agencies and SaaS platforms that serve multiple clients, multi-tenant inference is the difference between a sustainable business model and one that drowns in infrastructure costs. Running a dedicated GPU instance per client at $1-3/hour means that 20 clients require 20 GPUs — roughly $15,000-45,000/month in compute alone. Multi-tenant inference collapses this to 1-3 GPUs serving all 20 clients, cutting infrastructure costs by 80-95% while maintaining per-client customization.
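
    The arithmetic behind those figures is straightforward (a sketch, assuming $2/hour per GPU, roughly 730 hours in a month, and a shared pool of two GPUs for the multi-tenant deployment):

    ```python
    clients, gpu_hourly, hours_per_month = 20, 2.00, 730

    dedicated = clients * gpu_hourly * hours_per_month   # one GPU per client
    multi_tenant = 2 * gpu_hourly * hours_per_month      # shared pool of 2 GPUs

    savings = 1 - multi_tenant / dedicated
    print(f"dedicated: ${dedicated:,.0f}/mo, shared: ${multi_tenant:,.0f}/mo, savings: {savings:.0%}")
    # -> dedicated: $29,200/mo, shared: $2,920/mo, savings: 90%
    ```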

    Beyond cost, multi-tenant inference solves the operational complexity of managing dozens of independent model deployments. A single deployment means one health check endpoint, one scaling policy, one upgrade path, and one monitoring dashboard. Data isolation is maintained at the adapter and request level rather than the infrastructure level, which is both simpler and more secure — each tenant's fine-tuned knowledge lives in their adapter file, never mixed with another tenant's training data. This pattern is essential for any organization building AI-powered products that need to serve multiple customers with distinct fine-tuned behaviors.

    How It Works

    The multi-tenant inference stack has three core components: an adapter registry, a request router, and an inference engine with adapter caching. The adapter registry stores all tenant adapters indexed by tenant ID — in production this is typically a cloud storage bucket or a local directory synced from a model management platform. The request router examines each incoming API request, extracts the tenant identifier (from an API key, header, or URL path), and maps it to the correct adapter.
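
    A minimal routing sketch (framework-agnostic; the API keys, tenant IDs, and registry paths below are illustrative, and in production the mappings would live in a database and a cloud storage bucket):

    ```python
    # Hypothetical lookup tables: API key -> tenant, tenant -> adapter location.
    API_KEY_TO_TENANT = {"sk-acme-123": "acme", "sk-globex-456": "globex"}
    ADAPTER_REGISTRY = {"acme": "s3://adapters/acme/v3", "globex": "s3://adapters/globex/v1"}

    def resolve_adapter(headers: dict) -> str:
        """Map an incoming request's credentials to the tenant's adapter URI."""
        api_key = headers.get("Authorization", "").removeprefix("Bearer ").strip()
        tenant_id = API_KEY_TO_TENANT.get(api_key)
        if tenant_id is None:
            raise PermissionError("unknown API key")
        return ADAPTER_REGISTRY[tenant_id]

    # Example: resolve_adapter({"Authorization": "Bearer sk-acme-123"}) -> "s3://adapters/acme/v3"
    ```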

    The inference engine maintains a cache of recently used adapters in GPU memory. When a request arrives for a tenant whose adapter is already cached, inference proceeds immediately with near-zero overhead. For a cold adapter (not in cache), the engine loads it from the registry into GPU memory — a process that takes 50-200ms for a typical LoRA adapter. Sophisticated implementations use LRU (least recently used) eviction to manage the adapter cache, predictive pre-loading for tenants with known traffic patterns, and adapter batching to group requests from the same tenant together. With a well-tuned cache and 20 active tenants, cache hit rates above 95% are typical, meaning the vast majority of requests see no adapter loading latency at all.
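
    The cache itself can be as simple as an LRU map keyed by tenant ID. The sketch below stands in for the real GPU-resident cache; load_fn is a placeholder for whatever actually copies adapter weights onto the device:

    ```python
    from collections import OrderedDict

    class AdapterCache:
        """Illustrative LRU cache for LoRA adapters held in GPU memory."""

        def __init__(self, capacity: int, load_fn):
            self.capacity = capacity    # max adapters resident in GPU memory at once
            self.load_fn = load_fn      # loads an adapter from the registry (cold path)
            self._cache = OrderedDict()

        def get(self, tenant_id: str):
            if tenant_id in self._cache:
                self._cache.move_to_end(tenant_id)    # warm hit: near-zero overhead
                return self._cache[tenant_id]
            adapter = self.load_fn(tenant_id)          # cold load: ~50-200 ms
            self._cache[tenant_id] = adapter
            if len(self._cache) > self.capacity:
                self._cache.popitem(last=False)        # evict least recently used adapter
            return adapter
    ```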

    Example Use Case

    An AI automation agency serves 20 small-business clients, each with a custom chatbot trained on their company knowledge base. Rather than running 20 separate model instances, the agency deploys a single Llama 3 8B base model on one A100 GPU. Each client has a LoRA adapter (50 MB average) fine-tuned on their FAQ data, product catalog, and brand voice guidelines. The inference server receives requests tagged with a client API key, maps the key to the correct adapter, and serves responses personalized to that client's brand. The agency pays $2.50/hour for one GPU instead of $50/hour for 20 — a 95% reduction in compute costs. During peak hours when all 20 clients are active simultaneously, the adapter cache handles transitions smoothly, and average response latency remains under 200ms. Adding a new client requires only fine-tuning a new LoRA adapter and registering it in the system — no new infrastructure provisioning needed.

    Key Takeaways

    • Multi-tenant inference shares a single base model across multiple clients, with per-tenant LoRA adapters providing customized behavior.
    • Infrastructure costs drop 80-95% compared to dedicated per-client model deployments, making AI agencies and SaaS platforms economically viable.
    • Adapter hot-swapping with GPU-level caching keeps latency overhead under 5ms for cached tenants and under 200ms for cold loads.
    • Data isolation is maintained at the adapter level — each tenant's fine-tuned knowledge is contained in their adapter file and never mixed with other tenants.
    • Modern serving frameworks like vLLM and LoRAX support multi-tenant LoRA serving natively, making this pattern production-ready today.

    How Ertas Helps

    Ertas is built around the multi-tenant inference paradigm. The platform's adapter management system in Vault lets agencies organize per-client LoRA adapters with version control, access permissions, and audit trails. Ertas Cloud provisions shared inference endpoints where multiple client adapters are registered against a single base model deployment, with automatic adapter caching and request routing by tenant API key. Combined with Vault's data isolation — ensuring each client's training data and adapter weights are strictly separated — Ertas provides the complete infrastructure layer for agencies and SaaS teams to run multi-tenant AI deployments without building the orchestration themselves.
