
    Multi-Client Fine-Tuning: One Base Model, Custom LoRA Adapters Per Law Firm

    How to use LoRA adapters to serve multiple law firm clients from a single base model — covering architecture, training, hot-swapping, cost efficiency, and data isolation guarantees.

    Ertas Team

    The economics of running an AI agency break down if you need a separate GPU for every client. A Llama 3.1 8B model takes 16 GB of VRAM. Five clients, five full models, five GPUs — that is $10,000-15,000 in hardware before you earn a dollar.

    LoRA (Low-Rank Adaptation) changes this equation completely. One base model stays in GPU memory. Per-client adapters — typically 50-200 MB each — are swapped in and out at inference time. One GPU serves all your clients.

    This article covers the architecture, how to train client-specific adapters, how hot-swapping works, the cost implications, and the data isolation guarantees that law firms require.

    LoRA Architecture for Multi-Client Serving

    How LoRA Works

    Standard fine-tuning modifies all the model's weights — billions of parameters. LoRA takes a different approach: it freezes the base model and trains small "adapter" matrices that modify the model's behaviour at specific layers.

    The math: instead of updating a weight matrix W (size d × k), LoRA trains two small matrices B (size d × r) and A (size r × k), where r (the "rank") is much smaller than d or k. The effective weight becomes W + BA.
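
    To make this concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer (illustrative only; in practice a library such as Hugging Face PEFT implements this for you):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base weight W plus a trainable low-rank update BA."""
        def __init__(self, d, k, r=16, alpha=32):
            super().__init__()
            self.W = nn.Parameter(torch.randn(d, k), requires_grad=False)  # frozen base weight
            self.B = nn.Parameter(torch.zeros(d, r))          # d × r, starts at zero...
            self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # r × k, ...so BA = 0 initially
            self.scale = alpha / r

        def forward(self, x):
            # Effective weight is W + scale · BA; only A and B receive gradients.
            return x @ (self.W + self.scale * (self.B @ self.A)).T

    For d = k = 4096 and r = 16, A and B together add about 131,000 parameters per layer against roughly 16.8 million in W itself, which is why a full adapter comes out at only 50-200 MB.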

    For a rank-16 LoRA on a 7B model:

    • Base model size: ~14 GB (in FP16)
    • LoRA adapter size: ~50-100 MB
    • Combined inference: Same speed as the base model (adapter matrices are merged or applied efficiently)

    Multi-Client Architecture

    GPU Memory:
    ┌────────────────────────────────┐
    │   Base Model (Llama 3.1 8B)   │ ← Loaded once, stays in memory
    │          ~16 GB                │
    ├────────────────────────────────┤
    │   Active LoRA Adapter          │ ← Swapped per request
    │   (Client-specific, ~100 MB)   │
    └────────────────────────────────┘
    
    Adapter Storage (SSD):
    ├── firm-a-contract-review.safetensors    (85 MB)
    ├── firm-b-due-diligence.safetensors      (92 MB)
    ├── firm-c-case-summary.safetensors       (78 MB)
    ├── firm-d-regulatory.safetensors         (110 MB)
    └── firm-e-intake-triage.safetensors      (65 MB)
    

    One RTX 5090 (32 GB VRAM) can hold the base model plus several adapters simultaneously, or swap adapters from SSD in milliseconds.

    Training Client-Specific Adapters

    Each law firm client gets their own adapter trained on their specific data.

    Data Preparation Per Client

    For each firm, collect:

    1. Historical work product: Contract reviews, case summaries, research memos, client correspondence
    2. Style guidelines: How the firm formats deliverables, terminology preferences, risk rating scales
    3. Domain focus: Practice area specialisation (M&A, litigation, IP, regulatory)

    Format as instruction-response pairs:

    {"instruction": "Review this merger agreement clause for antitrust risk: [clause text]", "response": "[Firm A's analysis style and risk assessment]"}
    

    Training Configuration

    For client-specific legal adapters:

    | Parameter         | Value                          | Notes                                        |
    |-------------------|--------------------------------|----------------------------------------------|
    | Base model        | Llama 3.1 8B                   | Shared across all clients                    |
    | LoRA rank         | 16-32                          | 16 for simple tasks, 32 for complex analysis |
    | LoRA alpha        | 32-64                          | Typically 2× the rank                        |
    | Target modules    | q_proj, v_proj, k_proj, o_proj | Attention layers only for efficiency         |
    | Learning rate     | 2e-4                           | Standard for LoRA                            |
    | Epochs            | 3                              | Sufficient for 2,000+ examples               |
    | Training examples | 1,500-3,000                    | Per client                                   |

    Training time per adapter: 30-90 minutes on a single GPU, using Ertas Studio or manual LoRA training.
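
    As a reference point for the manual route, the table above maps onto Hugging Face PEFT roughly as follows (a sketch, not Ertas Studio's internals; the dropout value is an assumed common default):

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    # Values mirror the table above; one config (and one output dir) per client.
    lora_config = LoraConfig(
        r=16,                      # 16 for simple tasks, 32 for complex analysis
        lora_alpha=32,             # typically 2x the rank
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.05,         # assumption: a common default, not from the table
        task_type="CAUSAL_LM",
    )

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()  # a fraction of a percent of the 8B weights

    # ...train with your preferred trainer at lr=2e-4 for ~3 epochs, then
    # save only the adapter weights (tens of MB, not the full model):
    model.save_pretrained("adapters/firm-a-contract-review")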

    Quality Validation

    Before deploying a client's adapter:

    1. Hold out 10-15% of training data for evaluation
    2. Run the evaluation set through both the base model and the fine-tuned model (see the harness sketch after this list)
    3. Compare outputs against the original lawyer-written responses
    4. Check for:
      • Accuracy of risk identification
      • Consistency of risk ratings
      • Adherence to firm-specific terminology
      • Proper citation of clause numbers and cross-references
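
    A minimal sketch of that comparison, assuming generate_base, generate_tuned, and score are wrappers around your own inference stack and scoring method:

    def evaluate_adapter(eval_set, generate_base, generate_tuned, score):
        """Score base vs. fine-tuned outputs against lawyer-written references."""
        results = []
        for example in eval_set:
            results.append({
                "base": score(generate_base(example["instruction"]), example["response"]),
                "tuned": score(generate_tuned(example["instruction"]), example["response"]),
            })
        return results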

    Adapter Hot-Swapping at Inference

    Hot-swapping is what makes multi-client serving practical. The base model stays loaded; only the lightweight adapter changes between requests.

    With Ollama

    Ollama supports multiple model variants. Create a Modelfile for each client adapter:

    FROM llama3.1:8b
    ADAPTER /path/to/firm-a-adapter.gguf
    SYSTEM "You are a legal document analyst for [Firm A]..."
    

    Register each as a separate model name:

    ollama create firm-a-contract-review -f Modelfile.firm-a
    ollama create firm-b-due-diligence -f Modelfile.firm-b
    

    At inference time, specify the model name in the API request. Ollama handles adapter loading transparently.
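
    For example, using Python's requests against Ollama's local chat endpoint (default port 11434; the model name is whichever adapter variant you registered above):

    import requests

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "firm-a-contract-review",  # selects Firm A's adapter
            "messages": [{"role": "user", "content": "Review this clause..."}],
            "stream": False,
        },
    )
    print(resp.json()["message"]["content"])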

    With vLLM

    vLLM supports LoRA adapter serving natively with the --enable-lora flag:

    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-8B \
      --enable-lora \
      --lora-modules firm-a=/adapters/firm-a firm-b=/adapters/firm-b
    

    Request a specific adapter via the model parameter in the API call:

    {
      "model": "firm-a",
      "messages": [{"role": "user", "content": "Review this clause..."}]
    }
    

    vLLM's LoRA implementation is particularly efficient — it can keep multiple adapters resident in GPU memory and switch between them with near-zero latency.
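
    Since the server is OpenAI-compatible, any OpenAI client can route requests to an adapter by name; for example, assuming the default port 8000 and no API key configured:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    resp = client.chat.completions.create(
        model="firm-a",  # the adapter name registered via --lora-modules
        messages=[{"role": "user", "content": "Review this clause..."}],
    )
    print(resp.choices[0].message.content)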

    Swap Latency

    | Method | Cold Swap (adapter not in memory) | Hot Swap (adapter cached) |
    |--------|-----------------------------------|---------------------------|
    | Ollama | 500-2,000 ms                      | Under 100 ms              |
    | vLLM   | 200-500 ms                        | Under 10 ms               |

    For most legal workflows, even cold swap latency is imperceptible — the user is waiting for a multi-page document analysis that takes 5-30 seconds regardless.

    Cost Efficiency

    The cost advantage of the LoRA approach becomes dramatic as your client count grows:

    Full Model Per Client (Naive Approach)

    | Clients | GPUs Needed   | Hardware Cost |
    |---------|---------------|---------------|
    | 1       | 1 × RTX 5090  | $2,000        |
    | 5       | 5 × RTX 5090  | $10,000       |
    | 10      | 10 × RTX 5090 | $20,000       |
    | 20      | 20 × RTX 5090 | $40,000       |

    LoRA Adapter Approach

    | Clients | GPUs Needed    | Hardware Cost |
    |---------|----------------|---------------|
    | 1-10    | 1 × RTX 5090   | $2,000        |
    | 10-25   | 1-2 × RTX 5090 | $2,000-4,000  |
    | 25-50   | 2-3 × RTX 5090 | $4,000-6,000  |

    A single GPU serves 10+ clients because:

    • The base model (16 GB) is loaded once
    • Each adapter adds only 50-100 MB to VRAM (or is swapped from SSD)
    • Legal workloads are bursty — not all clients generate inference requests simultaneously

    By the tables above, at 10 clients the LoRA approach is roughly 10x cheaper; at 20 clients, 10-20x cheaper.

    Per-Client Training Costs

    Fine-tuning each adapter through Ertas Studio:

    • Training compute: minimal (30-90 minutes on agency GPU)
    • Data preparation: 2-4 hours of agency time
    • Validation and deployment: 1-2 hours

    Total agency cost per new client adapter: roughly half a day of work. This is priced into the implementation fee — typically $5,000-15,000 for a legal AI deployment.

    Data Isolation Guarantees

    Law firms require absolute data isolation between clients. The LoRA architecture provides this at multiple levels:

    Training Data Isolation

    Each client's training data is used exclusively for their adapter. In Ertas Studio, each client is a separate project with separate data storage. No cross-client data mixing occurs during training.

    Adapter Isolation

    Each adapter file is independent of every other client's: the weights in Firm A's adapter are derived solely from Firm A's training data and contain no information from Firm B's. The adapter is a mathematical transformation learned exclusively from that one client's examples.

    Inference Isolation

    At inference time:

    • Each request specifies which adapter to use
    • The adapter is loaded exclusively for that request
    • Request inputs and outputs are logged separately per client
    • No shared state between client requests

    Audit Evidence

    For compliance documentation, you can demonstrate:

    1. Training data provenance: which data trained which adapter
    2. Adapter lineage: base model version + training configuration + adapter checksum
    3. Inference logging: which adapter served which request, with timestamps
    4. No cross-contamination: adapter weights are mathematically independent

    This level of isolation satisfies even the most stringent legal compliance requirements, including conflict wall requirements for firms with competing clients.
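
    As a sketch of what an adapter lineage record (point 2) might look like, with illustrative field names and paths:

    import datetime
    import hashlib
    import json

    def adapter_manifest(adapter_path, base_model, train_config):
        """Record base model version, training config, and adapter checksum."""
        with open(adapter_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        return {
            "adapter_file": adapter_path,
            "adapter_sha256": digest,
            "base_model": base_model,
            "training_config": train_config,
            "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }

    print(json.dumps(adapter_manifest(
        "adapters/firm-a-contract-review/adapter_model.safetensors",
        "meta-llama/Llama-3.1-8B",
        {"rank": 16, "alpha": 32, "epochs": 3},
    ), indent=2))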

    Getting Started

    1. Set up your base model on a GPU (Ollama or vLLM)
    2. Prepare training data for your first client
    3. Train a LoRA adapter using Ertas Studio or manual fine-tuning
    4. Deploy and test with the client
    5. Repeat for each new client — same base model, new adapter

    The marginal effort per client decreases as you standardise your data preparation and deployment pipeline.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
