
Multi-Tenant Fine-Tuning: Per-Customer AI Models in Your SaaS
Your SaaS customers want AI that understands their data, not generic responses. Here's how to architect per-tenant fine-tuned models using LoRA adapters — with real storage math, cost breakdowns, and a serving architecture that scales to hundreds of tenants.
Your SaaS customers want AI that understands their data. Not your training data. Not some blended average across all your tenants. Their terminology, their workflows, their edge cases.
A legal-tech platform serving 80 law firms has 80 different vocabularies. A support platform serving 50 e-commerce brands has 50 different product catalogs, tone guides, and escalation policies. A healthcare SaaS serving 30 clinics has 30 different documentation styles and specialty-specific abbreviations.
Generic AI gives generic answers. And generic answers are why your customers keep asking for "better AI" in every feedback survey.
The fix isn't better prompts. It's per-tenant models — and it's more practical than you think.
The Three Architecture Patterns
There are exactly three ways to add tenant-aware fine-tuning to a SaaS product. Each makes different tradeoffs on cost, isolation, and quality.
Pattern 1: Shared Fine-Tune
You combine training data from all tenants into one dataset and fine-tune a single model. Every tenant hits the same model.
How it works:
- Aggregate training examples from all tenants
- Fine-tune one model on the combined dataset
- All API requests route to the same model endpoint
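For illustration, here's what the aggregation step might look like; a minimal sketch, assuming each tenant's examples live in a per-tenant JSONL file (the paths and layout are assumptions, not a fixed convention):

```python
import json
from pathlib import Path

TENANT_DATA_DIR = Path("/data/tenants")            # assumed: /data/tenants/{tenant_id}/train.jsonl
COMBINED_FILE = Path("/data/combined/train.jsonl")

def build_shared_dataset() -> int:
    """Pool every tenant's examples into one training file (Pattern 1)."""
    count = 0
    COMBINED_FILE.parent.mkdir(parents=True, exist_ok=True)
    with COMBINED_FILE.open("w") as out:
        for tenant_file in sorted(TENANT_DATA_DIR.glob("*/train.jsonl")):
            for line in tenant_file.open():
                out.write(json.dumps(json.loads(line)) + "\n")  # re-serialize to validate JSON
                count += 1
    return count
```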
The upside: Simple. One model to manage, one deployment to monitor, one fine-tune to run. Storage cost is minimal — you're running one model.
The downside: The model learns an average across all tenants. If Tenant A uses "client" and Tenant B uses "customer" to mean the same thing, the model learns a muddy middle. Worse, Tenant A's data influences Tenant B's responses. For regulated industries, that's a compliance problem.
When it works: When your tenants are homogeneous — same industry, similar data, similar vocabulary. If you're building a SaaS for dental offices and they all document procedures similarly, a shared fine-tune might be enough.
Pattern 2: Per-Tenant LoRA Adapters on a Shared Base
You fine-tune a base model once (or use it as-is), then create a small LoRA adapter for each tenant. At inference time, you load the base model plus the tenant's adapter.
How it works:
- Deploy one base model (e.g., Llama 3.1 8B or Qwen 2.5 7B)
- Fine-tune a LoRA adapter per tenant using only that tenant's data
- At request time, load the base model + the correct adapter based on tenant ID
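Here's what one tenant's training setup could look like with Hugging Face PEFT; a minimal sketch, where the base model name, rank, and paths are assumptions rather than fixed choices:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # assumed base model

def train_tenant_adapter(tenant_id: str) -> None:
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    config = LoraConfig(
        r=64,                      # rank drives adapter size (~50-200MB on a 7-8B model)
        lora_alpha=128,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers only
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    # ... training loop over /data/tenants/{tenant_id}/train.jsonl only ...
    model.save_pretrained(f"/models/adapters/{tenant_id}")  # writes adapter weights only
```

The save step writes only the adapter weights, not the base model, which is exactly why per-tenant storage stays in the 50-200MB range.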
The upside: Each tenant gets a model that genuinely understands their data. Storage is tiny — a LoRA adapter is 50-200MB depending on rank and target modules, compared to 4-14GB for a full model. Training is fast and cheap. Data isolation is absolute: Tenant A's data never touches Tenant B's adapter.
The downside: You need adapter hot-swapping infrastructure. There's a small latency cost when switching between adapters (typically 50-200ms for a cold swap).
When it works: This is the right answer for most SaaS products. It's the architecture we recommend for 90% of multi-tenant AI use cases.
Pattern 3: Per-Tenant Full Fine-Tunes
You fine-tune a complete model for each tenant. Each tenant gets their own fully independent model.
How it works:
- Fine-tune a separate model for each tenant
- Deploy and serve each model independently
- No shared infrastructure between tenants at the model layer
The upside: Maximum isolation. Maximum customization. Each model can be a different size, different base, different quantization. You can give Enterprise Tenant A a 70B model and Startup Tenant B a 7B model.
The downside: Storage and compute costs scale linearly with tenant count. Managing 100 separate model deployments is an operational nightmare. Fine-tuning costs are 10-50x higher per tenant.
When it works: Enterprise customers with massive datasets (100K+ examples), strict data isolation requirements, and budgets to match. Think banks, defense contractors, large hospital networks.
The Storage Math
This is where the decision usually makes itself. Here's what each pattern costs in raw storage at different tenant counts:
| Tenants | Shared Fine-Tune | Per-Tenant LoRA | Per-Tenant Full (7B Q4) | Per-Tenant Full (13B Q5) |
|---|---|---|---|---|
| 1 | 4-5 GB | 4-5.2 GB | 4-5 GB | 9-10 GB |
| 10 | 4-5 GB | 4.5-7 GB | 40-50 GB | 90-100 GB |
| 50 | 4-5 GB | 6.5-15 GB | 200-250 GB | 450-500 GB |
| 100 | 4-5 GB | 9-25 GB | 400-500 GB | 900-1,000 GB |
| 500 | 4-5 GB | 29-105 GB | 2-2.5 TB | 4.5-5 TB |
The math: a LoRA adapter at rank 64 targeting attention layers runs 50-200MB. A full 7B model in Q4 quantization is about 4-5GB. At 100 tenants, per-tenant LoRA costs you 9-25GB total (one 4-5GB base model plus 100 adapters at 50-200MB each). Per-tenant full fine-tunes cost you 400-500GB minimum.
That's a 20-50x difference in storage alone. And storage is the cheap part.
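To sanity-check the table, here's the same arithmetic as a small script; the sizes are the article's estimates, not measurements:

```python
BASE_GB = (4.0, 5.0)        # full 7B model, Q4
ADAPTER_GB = (0.05, 0.2)    # LoRA adapter, depends on rank and target modules

def lora_pattern(tenants: int) -> tuple[float, float]:
    """One shared base model plus one adapter per tenant."""
    return (BASE_GB[0] + tenants * ADAPTER_GB[0],
            BASE_GB[1] + tenants * ADAPTER_GB[1])

def full_pattern(tenants: int) -> tuple[float, float]:
    """One full Q4 model per tenant."""
    return (tenants * BASE_GB[0], tenants * BASE_GB[1])

print(lora_pattern(100))   # (9.0, 25.0) GB, matching the table row
print(full_pattern(100))   # (400.0, 500.0) GB
```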
Serving Architecture: How Adapter Hot-Swapping Works
The per-tenant LoRA pattern only works if you can swap adapters fast. Here's the architecture.
The Request Flow
Incoming Request
→ Extract tenant_id from auth token / header
→ Check adapter cache (in-memory LRU)
→ If cached: route to base model + cached adapter
→ If not cached: load adapter from disk/S3 (50-200ms)
→ Run inference
→ Return response
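In code, that flow reduces to a small routing layer with an LRU cache. A minimal sketch, assuming the serving layer exposes one named model per tenant (load_adapter is a stub here; the Ollama section below shows one concrete version):

```python
from collections import OrderedDict

CACHE_SIZE = 20   # 10-20 hot adapters covers most traffic patterns

_adapter_cache: OrderedDict[str, str] = OrderedDict()   # tenant_id -> model name

def load_adapter(tenant_id: str) -> str:
    """Stub for the cold-load path; see the Ollama section below for a concrete version."""
    return f"tenant-{tenant_id}"

def resolve_model(tenant_id: str) -> str:
    if tenant_id in _adapter_cache:
        _adapter_cache.move_to_end(tenant_id)        # cache hit: refresh LRU position, <1ms
        return _adapter_cache[tenant_id]
    model_name = load_adapter(tenant_id)             # cache miss: 50-500ms cold load
    _adapter_cache[tenant_id] = model_name
    if len(_adapter_cache) > CACHE_SIZE:
        _adapter_cache.popitem(last=False)           # evict the least recently used adapter
    return model_name
```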
Running It with Ollama
Ollama supports loading LoRA adapters on top of base models. The setup:
- One base model in memory. Load your base model (e.g., `llama3.1:8b-instruct-q5_K_M`) once. It stays resident. This costs ~6GB of VRAM.
- Adapter files on disk. Store each tenant's `.gguf` adapter file in a directory structure: `/models/adapters/{tenant_id}/adapter.gguf`
- Request routing. Your API gateway extracts `tenant_id`, selects the correct adapter path, and creates a Modelfile that references the base model plus the adapter.
- Adapter caching. Keep the most recently used adapters in an LRU cache. For most SaaS products, 80% of traffic comes from 20% of tenants. A cache holding 10-20 adapters handles the majority of requests with zero swap latency.
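One concrete way to implement the cold-load step, assuming the Ollama CLI is installed and the directory layout above: write a Modelfile that layers the tenant's adapter on the resident base model, then register it with `ollama create`. A sketch, not a hardened implementation:

```python
import subprocess
import tempfile
from pathlib import Path

BASE_MODEL = "llama3.1:8b-instruct-q5_K_M"   # resident base, ~6GB VRAM

def load_adapter(tenant_id: str) -> str:
    """Register base model + tenant adapter as a named Ollama model."""
    adapter_path = Path(f"/models/adapters/{tenant_id}/adapter.gguf")
    modelfile = f"FROM {BASE_MODEL}\nADAPTER {adapter_path}\n"
    model_name = f"tenant-{tenant_id}"
    with tempfile.NamedTemporaryFile("w", suffix=".Modelfile", delete=False) as f:
        f.write(modelfile)
        modelfile_path = f.name
    subprocess.run(["ollama", "create", model_name, "-f", modelfile_path], check=True)
    return model_name   # now routable via the standard Ollama API
```

From there, requests for that tenant target `model_name` through the normal Ollama endpoints; nothing else in your inference code changes per tenant.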
Latency Budget
| Operation | Time |
|---|---|
| Tenant ID extraction | under 1ms |
| Cache hit (adapter already loaded) | under 1ms |
| Cache miss (load adapter from local disk) | 50-200ms |
| Cache miss (load adapter from S3/object storage) | 200-500ms |
| Inference (7B model, 100 token response) | 500-2,000ms |
For a cached adapter, the per-tenant overhead is negligible. For a cold swap, you add 50-500ms once — subsequent requests for that tenant hit the cache.
Scaling Strategy
- Under 50 tenants: Single GPU server. All adapters fit in memory or cache with fast swaps.
- 50-200 tenants: Two GPU servers with consistent hashing by tenant_id (a minimal sketch follows this list). Each server handles a subset of tenants, improving cache hit rates.
- 200+ tenants: Kubernetes cluster with GPU nodes. Adapter pre-warming based on tenant activity patterns. Most SaaS products will never reach this tier.
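For the two-server tier, here's a hedged consistent-hashing sketch; the server hostnames and virtual-node count are assumptions:

```python
import bisect
import hashlib

SERVERS = ["gpu-a.internal", "gpu-b.internal"]   # assumed hostnames
VNODES = 100   # virtual nodes per server smooth out the distribution

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

_ring = sorted((_hash(f"{s}#{i}"), s) for s in SERVERS for i in range(VNODES))
_points = [h for h, _ in _ring]

def server_for(tenant_id: str) -> str:
    """Pin a tenant to one server so its adapter stays warm in that server's cache."""
    idx = bisect.bisect(_points, _hash(tenant_id)) % len(_ring)
    return _ring[idx][1]
```

The payoff of a ring over simple modulo hashing: adding a third server remaps only about a third of tenants, so most caches stay warm.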
Data Isolation and Compliance
This is where per-tenant LoRA adapters win decisively over shared fine-tuning.
Training Data Separation
With per-tenant adapters, data isolation is structural, not policy-based:
- Tenant A's training data is used exclusively to create Tenant A's adapter
- Tenant B's training data never enters the same training run
- Deleting a tenant means deleting one adapter file — not retraining a shared model
- Auditing is straightforward: each adapter has a clear provenance trail
With a shared fine-tune, you can't un-learn one tenant's data without retraining the entire model. That's a GDPR Article 17 problem — the right to erasure means you need the ability to remove a tenant's influence on the model.
GDPR and SOC 2 Implications
| Requirement | Shared Fine-Tune | Per-Tenant LoRA | Per-Tenant Full |
|---|---|---|---|
| Data isolation | Policy-based | Structural | Structural |
| Right to erasure | Requires full retrain | Delete adapter file | Delete model file |
| Audit trail | Complex (mixed data) | Clean (per-tenant) | Clean (per-tenant) |
| Data residency | One location | Per-tenant possible | Per-tenant possible |
| Breach scope | All tenants affected | Single tenant | Single tenant |
For any SaaS selling to enterprises, healthcare, legal, or financial services, the compliance story alone justifies per-tenant adapters. When a prospective customer's security team asks "is our data used to train models that serve other customers?" — the answer needs to be no.
Tenant Data Lifecycle
A clean data lifecycle for per-tenant fine-tuning:
- Ingest: Tenant uploads training data through your UI or API
- Validate: Automated quality checks (format, completeness, deduplication)
- Store: Training data in tenant-isolated storage (separate S3 prefixes or buckets)
- Train: Fine-tune adapter using only this tenant's data
- Deploy: Store adapter in tenant-specific path
- Serve: Load adapter on demand per request
- Delete: Remove training data + adapter when tenant churns or requests deletion
No step touches another tenant's data. No shared state between tenants at the model layer.
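The deletion step is worth seeing concretely, since it's what makes the compliance story clean. A sketch assuming tenant-prefixed S3 storage and the adapter path layout above; the bucket name and prefixes are assumptions to adapt to your own storage:

```python
import shutil
from pathlib import Path

import boto3

s3 = boto3.client("s3")
BUCKET = "saas-training-data"   # assumed bucket with per-tenant prefixes

def delete_tenant(tenant_id: str) -> None:
    # 1. Remove all training data under the tenant's isolated prefix
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"tenants/{tenant_id}/"):
        for obj in page.get("Contents", []):
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
    # 2. Remove the adapter: the only model artifact derived from their data
    shutil.rmtree(Path(f"/models/adapters/{tenant_id}"), ignore_errors=True)
```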
Cost Model: What It Actually Costs
Fine-Tuning Cost Per Tenant
Using a platform like Ertas on modest hardware (single consumer GPU or M-series Mac):
| Item | Cost |
|---|---|
| LoRA fine-tune (1,000 examples, 3 epochs, 7B model) | $2-5 in compute |
| LoRA fine-tune (5,000 examples, 3 epochs, 7B model) | $5-12 in compute |
| Full fine-tune (1,000 examples, 3 epochs, 7B model) | $30-80 in compute |
| Full fine-tune (5,000 examples, 3 epochs, 7B model) | $80-200 in compute |
LoRA fine-tuning a 7B model on 1,000 examples takes 15-45 minutes on an RTX 4090 or M3 Max. The compute cost is $2-5 per tenant. Even at 100 tenants, your total fine-tuning bill is $200-500 — a one-time cost that you can pass through as an onboarding fee or absorb into your subscription pricing.
Compare that to full fine-tunes at $30-80 each: $3,000-8,000 for 100 tenants. And you'll redo these periodically as tenants accumulate more data.
Serving Cost
This is where per-tenant LoRA shines hardest:
- One base model in VRAM: ~6GB for a 7B Q5 model
- Adapter overhead: ~50-200MB per loaded adapter, but you only load active ones
- Total VRAM for 100 tenants with 10 cached adapters: ~8GB
You're serving 100 tenants from a single GPU that would otherwise serve one. The per-tenant serving cost is effectively 1/100th of a dedicated model deployment.
Monthly serving cost comparison (100 tenants):
| Approach | Hardware | Monthly Cost |
|---|---|---|
| Per-tenant LoRA (self-hosted) | 1x RTX 4090 server | $150-300/mo |
| Per-tenant full models (self-hosted) | 10-20x GPU servers | $1,500-6,000/mo |
| Per-tenant OpenAI fine-tunes | API costs | $2,000-10,000/mo |
| Shared OpenAI API (no fine-tune) | API costs | $1,000-5,000/mo |
At $150-300/month to serve personalized models to 100 tenants, the per-tenant cost is $1.50-3.00/month. That's a rounding error in your SaaS pricing.
Pricing It Into Your SaaS
Three models that work:
- Included in enterprise tier. Fine-tuned AI is a feature of your $500+/month plan. Costs you $2-5 to set up per tenant, $1.50-3.00/month to serve. Massive margin.
- Add-on feature. $50-100/month "AI Customization" add-on. Customers self-serve training data upload, you automate the fine-tuning pipeline.
- Onboarding fee + included. $500 one-time setup fee covers fine-tuning costs and data preparation. Ongoing serving is included in subscription.
Any of these produces 90%+ margins on the AI feature itself.
Implementation Timeline
Adding per-tenant fine-tuning to an existing SaaS product is a 2-4 week project for a backend engineer. Here's the breakdown.
Week 1: Training Pipeline
- Set up Ertas or equivalent fine-tuning infrastructure
- Build tenant data export (pull training examples from your database per tenant)
- Create a training data format converter (your schema to instruction/response pairs; a sketch follows this list)
- Test fine-tuning pipeline end-to-end with one tenant
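The format converter is usually the most schema-specific piece of Week 1. A hedged sketch, where the table, columns, and query are entirely hypothetical placeholders for your own product data:

```python
import json
import sqlite3

def export_tenant_examples(db_path: str, tenant_id: str, out_path: str) -> int:
    """Dump one tenant's approved Q&A pairs as instruction/response JSONL."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT question, approved_answer FROM support_tickets "
        "WHERE tenant_id = ? AND approved_answer IS NOT NULL",
        (tenant_id,),
    )
    count = 0
    with open(out_path, "w") as out:
        for question, answer in rows:
            out.write(json.dumps({
                "instruction": question.strip(),
                "response": answer.strip(),
            }) + "\n")
            count += 1
    conn.close()
    return count
```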
Week 2: Serving Infrastructure
- Deploy base model with Ollama or vLLM
- Build adapter loading and caching layer
- Implement tenant-aware request routing
- Add adapter hot-swap logic with LRU cache
Week 3: Product Integration
- Build tenant-facing data upload or training trigger UI
- Add fine-tuning job status tracking
- Integrate tenant-specific model into your existing AI features
- Implement fallback to base model when adapter isn't ready
Week 4: Operations and Polish
- Add monitoring: per-tenant latency, cache hit rates, adapter load times
- Build automated retraining triggers (new data threshold, scheduled)
- Set up adapter versioning and rollback
- Load testing with simulated multi-tenant traffic
You don't need an ML team. You need one backend engineer who can follow documentation and integrate an API. The fine-tuning complexity is handled by the platform. Your job is the plumbing: getting data in, routing requests, managing adapters.
Common Mistakes
Over-engineering the first version. Start with 5 tenants. Validate that per-tenant models measurably improve your product metrics. Then scale the infrastructure.
Ignoring data quality. A LoRA adapter trained on 200 high-quality examples outperforms one trained on 2,000 noisy examples. Build data validation before you build the training pipeline.
Skipping the fallback. When a new tenant signs up, they don't have a fine-tuned adapter yet. Your system needs to gracefully fall back to the base model (or a shared fine-tune) until their adapter is ready.
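A minimal sketch of that fallback rule, assuming the training pipeline writes a marker file when a tenant's adapter is ready (the marker is a hypothetical convention; a status column in your database works just as well):

```python
from pathlib import Path

BASE_MODEL = "llama3.1:8b-instruct-q5_K_M"

def model_for(tenant_id: str) -> str:
    # READY is a hypothetical marker the training pipeline writes on success
    if Path(f"/models/adapters/{tenant_id}/READY").exists():
        return f"tenant-{tenant_id}"   # the tenant's registered adapter model
    return BASE_MODEL                  # new tenant: degrade gracefully to the base model
```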
Not measuring the delta. Run A/B tests: base model vs. tenant-specific adapter. If the adapter doesn't measurably improve accuracy, relevance, or user satisfaction for a given tenant, don't ship it. Some tenants may not have enough unique data to benefit.
Training too frequently. Most tenants don't need daily retraining. Weekly or monthly retraining triggered by a data threshold (e.g., 100 new examples since last train) is sufficient and keeps compute costs predictable.
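A sketch of that threshold trigger, meant to run from a nightly cron or worker; both helpers are hypothetical stand-ins for your own database query and job queue:

```python
RETRAIN_THRESHOLD = 100   # new examples since the last train

def count_new_examples(tenant_id: str) -> int:
    """Hypothetical: count examples added since the tenant's last training run."""
    raise NotImplementedError

def enqueue_finetune_job(tenant_id: str) -> None:
    """Hypothetical: push a fine-tune job for this tenant onto your queue."""
    raise NotImplementedError

def maybe_retrain(tenant_id: str) -> bool:
    """Retrain only when enough new data has accumulated."""
    if count_new_examples(tenant_id) < RETRAIN_THRESHOLD:
        return False
    enqueue_finetune_job(tenant_id)
    return True
```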
Where This Is Heading
The SaaS products that win the next three years will treat AI personalization like they treat data storage today — as a per-tenant resource that's provisioned automatically and scales with usage.
Right now, per-tenant fine-tuning feels like a competitive advantage. By 2028, it will be table stakes. Your customers will expect that your AI understands their terminology, their workflows, their edge cases — because your competitor's AI will.
The good news: the infrastructure is ready now. LoRA adapters made per-tenant fine-tuning economically viable. Adapter hot-swapping made it operationally feasible. Platforms like Ertas made it accessible to teams without ML expertise.
The question isn't whether to build per-tenant AI. It's whether you build it before or after your competitors do.
Further Reading
- Multi-Tenant AI Deployment for Agencies — How agencies manage per-client model deployments at scale
- LoRA Adapters Explained for Agencies — Deep dive into how LoRA adapters work and why they're the right unit of customization
- Adding AI Features to Your SaaS Without an ML Team — The broader playbook for shipping AI features with your existing engineering team