
    Per-User LoRA Adapters: Personalized AI at Scale Without Per-Token Costs

    LoRA adapters are 50-200MB each. You can hot-swap them per user request, delivering personalized AI experiences from a single base model — without multiplying your inference costs.

    Ertas Team

    Every user wants AI that understands their context. Their terminology, their preferences, their domain quirks, their communication style. A support agent that knows their product inside out. A writing assistant that matches their voice. An analyst that understands their data schema.

    The current approaches to personalization all have problems:

    Per-user system prompts: Stuff the user's context into a long system prompt. Works for light personalization, but you're paying for those context tokens on every single request. A 2,000-token system prompt costs $0.005 per request on GPT-4o — multiply by 1,000 requests per user per month across 500 users and you're spending $2,500/month just on system prompt tokens. Plus, system prompts have limits. You can't encode deep behavioral patterns in 2,000 tokens.

    Per-user RAG: Maintain a separate knowledge base per user, retrieve relevant context at inference time. Better for factual personalization, but you're still paying per-token for the retrieved context. Plus the operational overhead: separate vector stores per user, embedding pipelines, retrieval quality monitoring. For 1,000 users, that's 1,000 vector databases to maintain.

    Per-user fine-tuning (full): Fine-tune a complete model per user. Gives you deep personalization but is wildly impractical. A 7B model is 4GB quantized. 1,000 users means 4TB of model storage and 1,000 separate model instances to manage.

    LoRA adapters solve this elegantly.

    The LoRA Personalization Architecture

    One base model in memory. Per-user LoRA adapters stored on disk. Load the right adapter when the user makes a request. Serve the response. Keep the adapter cached while the user is active, evict it when they go idle. Ready for the next user.

    The numbers that make this work:

    • Base model (Llama 3.1 8B, Q4): 4GB in GPU memory
    • Per-user LoRA adapter: 50-200MB on disk
    • Adapter load time: 50-200ms (from SSD)
    • 1,000 users × 100MB average adapter: 100GB total storage
    • Cost of 100GB SSD storage: ~$8/month on cloud

    Compare that to 1,000 full model copies at 4TB total, or 1,000 separate RAG pipelines. The economics are not close.

    How Hot-Swapping Works

    The inference flow for a per-user adapter architecture:

    1. User request arrives with user ID
    2. Check if user's adapter is already loaded in memory
    3. If not: load adapter from disk (50-200ms)
    4. Apply the adapter's low-rank weights alongside the base model (computed per request, not merged into the base weights)
    5. Run inference
    6. Return response
    7. Keep adapter in memory cache (LRU eviction for inactive users)
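
    In code, the flow is a cache check followed by generation. A minimal sketch, where `load_adapter` and `engine` are hypothetical stand-ins for your adapter loader and inference engine, not a specific library's API:

    # Hedged sketch of the flow above; `load_adapter` and `engine`
    # are hypothetical stand-ins, not a real library's API.
    loaded_adapters: dict[str, object] = {}  # step 7's in-memory cache

    def handle_request(user_id: str, prompt: str) -> str:
        if user_id not in loaded_adapters:                                    # step 2
            loaded_adapters[user_id] = load_adapter(f"/adapters/{user_id}/")  # step 3: 50-200ms
        adapter = loaded_adapters[user_id]
        return engine.generate(prompt, adapter=adapter)                       # steps 4-6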

    Memory Management

    You don't need to keep all adapters in memory simultaneously. An LRU (least recently used) cache handles this naturally:

    • Active users (made a request in the last 5 minutes): adapter stays in GPU memory
    • Recent users (last hour): adapter in CPU memory, fast reload
    • Inactive users: adapter on disk, 200ms reload

    A 24GB GPU can hold the base model (4GB) plus 15-30 adapters simultaneously (depending on adapter size and quantization). For most applications, 80-90% of requests hit an already-loaded adapter because user activity follows a power-law distribution: a small number of users generate most of the traffic.
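
    A minimal single-tier version of that cache (GPU-resident adapters only; `load_adapter` is the same hypothetical helper as in the earlier sketch):

    from collections import OrderedDict

    class LRUAdapterCache:
        """Keeps the N most recently used adapters resident; older ones reload from disk."""

        def __init__(self, capacity: int = 20):
            self.capacity = capacity
            self._adapters: OrderedDict[str, object] = OrderedDict()

        def get(self, user_id: str):
            if user_id in self._adapters:
                self._adapters.move_to_end(user_id)       # refresh recency on a hit
                return self._adapters[user_id]
            adapter = load_adapter(f"/adapters/{user_id}/")  # miss: ~200ms reload
            self._adapters[user_id] = adapter
            if len(self._adapters) > self.capacity:
                self._adapters.popitem(last=False)        # evict least recently used
            return adapter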

    Implementation with vLLM

    vLLM supports multi-LoRA serving with per-request adapters natively:

    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest
    
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        enable_lora=True,
        max_lora_rank=64,
        max_loras=20,  # Max adapters in GPU memory simultaneously
    )
    
    # Serve request with user-specific adapter
    output = llm.generate(
        prompts=["Help me draft a proposal for..."],
        sampling_params=SamplingParams(temperature=0.7, max_tokens=1024),
        lora_request=LoRARequest(
            lora_name="user_12345",
            lora_int_id=12345,
            lora_local_path="/adapters/user_12345/",
        ),
    )
    

    vLLM handles the adapter loading, caching, and eviction automatically. You just pass the adapter path with each request.
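
    In a multi-user service, the per-request mapping is a pure function of the user ID. A small sketch (the directory layout under /adapters/ follows this article's convention and is not a vLLM requirement):

    from vllm.lora.request import LoRARequest

    def lora_request_for(user_id: int) -> LoRARequest:
        # vLLM identifies each adapter by a unique positive integer;
        # reusing the numeric user ID is a convenient choice here.
        return LoRARequest(
            lora_name=f"user_{user_id}",
            lora_int_id=user_id,
            lora_local_path=f"/adapters/user_{user_id}/current/",
        )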

    What Per-User Adapters Actually Learn

    A per-user adapter trained on 200-500 examples of a user's interactions learns:

    Communication Style

    How the user prefers responses formatted. Short and direct vs. detailed and thorough. Bullet points vs. paragraphs. Technical terminology vs. plain language. After fine-tuning, the model produces output that matches the user's preferred style without needing a style guide in the system prompt.

    Domain Knowledge

    The user's specific terminology, product names, internal jargon, project names, team members, and acronyms. Instead of explaining what "Q3 OKRs" or "Project Lighthouse" means in every system prompt, the adapter encodes this context directly.

    Task Patterns

    The types of tasks the user performs repeatedly. If a user always asks the AI to convert meeting notes into action items in a specific format, the adapter learns that pattern. The model needs fewer instructions per request because the behavior is baked in.

    Preference Biases

    When there are multiple valid responses, the adapter learns which the user prefers. Conservative vs. aggressive recommendations. Formal vs. casual tone. Detailed caveats vs. confident assertions.

    Use Cases

    Per-Client Agency Models

    An AI agency serving 50 clients can maintain 50 adapters on a single base model. Each client gets an AI that knows their brand voice, their products, their customer base, and their preferred output formats. The agency runs one server with one base model and swaps adapters per client request.

    Storage: 50 clients × 150MB = 7.5GB. That's less than two full model copies. See our guide on LoRA adapters for agencies for the detailed agency architecture.

    Per-Tenant SaaS

    A SaaS platform offering AI features can personalize the AI for each customer account. A project management tool where the AI understands each team's workflows. A CRM where the AI knows each company's sales process. A documentation platform where the AI matches each company's writing style.

    The per-tenant adapter approach scales linearly: onboard a new customer, collect 200-500 examples from their usage, fine-tune an adapter, deploy. The customer gets personalized AI within their first month of usage.

    Per-Department Enterprise

    Large enterprises with different departments that use AI differently. Legal needs formal, precise language with citation requirements. Marketing needs creative, brand-consistent copy. Engineering needs technical, concise documentation. HR needs empathetic, policy-aware communication.

    Four departments, four adapters, one base model. Each department gets an AI that feels like it was built specifically for them.

    Personal AI Assistants

    Consumer applications where each user gets an AI that learns their preferences over time. Start with a generic adapter. As the user interacts, collect examples. Fine-tune periodically (weekly or monthly). The AI becomes progressively more personalized without ever sending the user's data to a cloud API.

    Training Per-User Adapters from Interaction History

    The training data for per-user adapters comes from the user's own interactions. Here's the collection and training pipeline:

    Step 1: Log Interactions

    Capture every user interaction: the input, the model's output, and any feedback signal (explicit thumbs up/down, implicit signals like "the user accepted this draft without edits" or "the user rewrote 80% of this draft").
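
    A minimal sketch of such a log, one JSONL record per interaction (the field names are illustrative, not a fixed schema):

    import json
    import time

    def log_interaction(user_id: str, prompt: str, response: str,
                        feedback: str | None = None,
                        edit_ratio: float | None = None) -> None:
        # Append one record per interaction; edit_ratio is the fraction
        # of the draft the user rewrote (an implicit quality signal).
        record = {
            "user_id": user_id,
            "timestamp": time.time(),
            "input": prompt,
            "output": response,
            "explicit_feedback": feedback,   # e.g. "up", "down", or None
            "edit_ratio": edit_ratio,
        }
        with open(f"/logs/{user_id}.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")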

    Step 2: Filter for Positive Examples

    Select interactions where the outcome was good:

    • User accepted the output without major edits
    • User gave explicit positive feedback
    • The task was completed successfully (measured by downstream metrics)

    For negative examples with corrections, create training pairs where the input is the original request and the output is the user's corrected version — this teaches the model what the user actually wanted.
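
    Continuing the logging sketch above, the filter might look like this (the 20% edit threshold and the `corrected_output` field are illustrative assumptions):

    def to_training_examples(records: list[dict]) -> list[dict]:
        examples = []
        for r in records:
            accepted = r.get("explicit_feedback") == "up" or (
                r.get("edit_ratio") is not None and r["edit_ratio"] < 0.2)
            if accepted:
                # Positive example: keep the model's output as-is.
                examples.append({"input": r["input"], "output": r["output"]})
            elif r.get("corrected_output"):
                # Correction pair: train toward what the user actually wanted.
                examples.append({"input": r["input"], "output": r["corrected_output"]})
        return examples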

    Step 3: Format and Balance

    Convert to standard chat format. Balance the dataset across task types so the adapter doesn't overfit to the most common task at the expense of less frequent ones.
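
    A sketch of both steps (the messages layout is the common chat fine-tuning format; the `task_type` label is assumed to come from an upstream classifier):

    import random
    from collections import defaultdict

    def to_chat_format(example: dict) -> dict:
        return {"messages": [
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["output"]},
        ]}

    def balance_by_task(examples: list[dict], cap_per_task: int = 50) -> list[dict]:
        # Cap each task type so the most common task doesn't dominate training.
        by_task = defaultdict(list)
        for ex in examples:
            by_task[ex.get("task_type", "other")].append(ex)
        balanced = []
        for group in by_task.values():
            random.shuffle(group)
            balanced.extend(group[:cap_per_task])
        return balanced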

    Minimum viable dataset: 200 examples for noticeable personalization. 500 examples for strong personalization. 1,000+ examples for deep behavioral alignment.

    Step 4: Fine-Tune

    Upload to Ertas, select your base model, configure LoRA rank (16-64 depending on personalization depth), and train. Training time for 500 examples on an 8B model: 30-60 minutes. Output: a LoRA adapter file ready for deployment.
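
    For reference, the equivalent configuration when training yourself with the Hugging Face peft library looks roughly like this (the rank and target modules are typical choices, not prescriptions):

    from peft import LoraConfig

    lora_config = LoraConfig(
        r=32,                      # rank: 16-64 depending on personalization depth
        lora_alpha=64,             # scaling factor, commonly 2x the rank
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
        task_type="CAUSAL_LM",
    )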

    Step 5: Iterate

    Retrain the adapter monthly (or after every 200 new interactions) with the expanded dataset. Each iteration deepens the personalization. After 3-4 cycles, users typically report the AI "just gets them."
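
    The trigger can be a simple count-or-calendar check (the helper functions here are hypothetical):

    def should_retrain(user_id: str, examples_at_last_training: int) -> bool:
        # Retrain after 200 new interactions or 30 days, whichever comes first.
        new_examples = count_logged_interactions(user_id) - examples_at_last_training
        return new_examples >= 200 or days_since_last_training(user_id) >= 30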

    Scaling Math

    Let's work through the numbers for a SaaS platform with 10,000 users:

    Storage

    • 10,000 adapters × 100MB average = 1TB
    • Cloud storage cost (S3/GCS): ~$23/month
    • Fast SSD storage for active adapters: ~$100/month for 1TB NVMe

    Compute

    • Base model: 1× A10G GPU ($0.60-$1.00/hr) handles 50-100 requests/second
    • For 10,000 users with typical usage patterns (80% of traffic from 20% of users), you need adapters for ~2,000 active users at any given time
    • vLLM with 20-30 adapters cached in GPU memory handles this with sub-200ms adapter swap overhead

    Training

    • Initial adapter training: 30-60 minutes per user
    • 10,000 users: batch-train over 2-4 weeks using scheduled GPU time
    • Monthly retraining: prioritize active users (top 2,000), retrain others quarterly
    • Training cost: ~$0.50-$1.00 per adapter per training run

    Total Monthly Cost

    Component                           Cost
    Inference GPU (A10G, 24/7)          $440-$730
    Storage (1TB SSD + backup)          $123
    Monthly retraining (2,000 users)    $1,000-$2,000
    Total                               $1,563-$2,853

    That's $0.16-$0.29 per user per month for fully personalized AI. Compare to the system-prompt approach on GPT-4o at $2.50+ per user per month (assuming 500 requests/user/month with 2K-token system prompts).

    Adapter Versioning and Rollback

    Per-user adapters need the same version control discipline as any production artifact:

    • Version each adapter with a timestamp and training data hash
    • Keep the previous version so you can rollback if a retraining produces worse results
    • A/B test new adapters by serving the new version to 10% of the user's requests and comparing quality metrics
    • Automated quality checks after each retraining: run the new adapter on a held-out set of the user's interactions and verify output quality hasn't degraded

    A simple file structure works:

    /adapters/
      user_12345/
        v1_2026-01-15/
          adapter_model.safetensors
          adapter_config.json
          metadata.json  # training data stats, eval metrics
        v2_2026-02-15/
          ...
        current -> v2_2026-02-15/  # symlink to active version
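
    Promotion and rollback then reduce to repointing the current symlink. A sketch using the standard atomic-rename pattern on POSIX filesystems:

    import os

    def promote_adapter(user_dir: str, version: str) -> None:
        # Point `current` at a new version (or a previous one, for rollback).
        tmp = os.path.join(user_dir, "current.tmp")
        os.symlink(version, tmp)                  # e.g. version = "v2_2026-02-15"
        os.replace(tmp, os.path.join(user_dir, "current"))  # atomic swap

    Rolling back is the same call with the previous version string.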
    

    Privacy Considerations

    Per-user adapters encode user behavior patterns directly into model weights. This has privacy implications:

    • Adapters contain learned patterns from user data. Treat them with the same security as the raw data.
    • Adapter files should be encrypted at rest and access-controlled per user.
    • Users should be able to delete their adapter (right to erasure). This is simpler than purging data from RAG systems — just delete the adapter file.
    • Adapters don't memorize verbatim data the way RAG does. A fine-tuned model learns patterns, not exact strings. But it can still leak sensitive information through generation, so access controls matter.

    The advantage over cloud-based personalization: the adapter and all user data stay on your infrastructure. No user interaction data is sent to OpenAI or Anthropic for processing. This matters for regulated industries and privacy-conscious users.

    When Per-User Adapters Don't Make Sense

    Some situations where the complexity isn't worth it:

    • Users with fewer than 100 interactions: Not enough data to train a meaningful adapter. Use system prompts or RAG until you have sufficient data.
    • Highly uniform use cases: If all users use the AI the same way, a single fine-tuned model serves everyone. Per-user adapters add complexity without benefit.
    • Rapidly changing requirements: If what users need from the AI changes weekly, the adapter can't keep up with retraining cycles.
    • Very small user bases (fewer than 50 users): The infrastructure overhead of adapter management outweighs the benefits. Individual system prompts with RAG are simpler.

    For SaaS products with hundreds or thousands of users who each have distinct workflows, communication styles, and domain knowledge, per-user LoRA adapters are the most cost-effective path to deep personalization.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
