    AI Agency Tech Stack for Legal Clients: n8n + Fine-Tuned Models + On-Prem Deployment

    The complete architecture for AI agencies serving law firms — from n8n orchestration to fine-tuned model inference to client-facing delivery. Component selection, deployment topology, and scaling considerations.

    Ertas Team

    Building AI solutions for law firms requires a specific technology stack that satisfies legal compliance requirements while remaining manageable for a small agency team. This article documents the full architecture — every component, why it was chosen, and how the pieces connect.

    The Full Architecture

    ┌─────────────────────────────────────────────────────────┐
    │                     CLIENT NETWORK                       │
    │                                                          │
    │  ┌──────────┐    ┌──────────┐    ┌───────────────────┐  │
    │  │   DMS    │───→│   n8n    │───→│  LLM Inference    │  │
    │  │(iManage) │    │(self-    │    │  (Ollama/vLLM)    │  │
    │  └──────────┘    │ hosted)  │    │  + LoRA Adapters  │  │
    │                  └────┬─────┘    └───────────────────┘  │
    │                       │                                  │
    │                  ┌────▼─────┐    ┌───────────────────┐  │
    │                  │ Vector   │    │  Client Portal    │  │
    │                  │   DB     │    │  (results UI)     │  │
    │                  │(Chroma/  │    └───────────────────┘  │
    │                  │ Qdrant)  │                            │
    │                  └──────────┘                            │
    └─────────────────────────────────────────────────────────┘
    

    Every component runs within the law firm's network. No data leaves the perimeter.

    Component Selection

    n8n: Workflow Orchestration

    Why n8n:

    • Self-hostable (Docker, bare metal) — no SaaS dependency
    • Visual workflow builder that non-technical staff can understand during demos
    • OpenAI-compatible node connects directly to local LLM endpoints
    • Webhook triggers for real-time document processing
    • Built-in error handling, retry logic, and execution logging
    • Active open-source community with legal-relevant workflow templates

    Why not Make.com or Zapier:

    • Both are cloud-only SaaS — data must leave the firm's network
    • Cannot self-host for air-gapped deployments
    • Vendor dependency creates risk for long-term engagements

    n8n deployment: Docker container with PostgreSQL backend. Resource-light — 2 CPU cores and 4 GB of RAM handle most agency workloads.
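As a concrete sketch of the webhook entry point: an upstream system (or the client portal) can hand a document straight to an n8n workflow with one HTTP request. The host and webhook path below are hypothetical placeholders; n8n generates the path when you add a Webhook trigger node.

```python
# Minimal sketch: push a document into a webhook-triggered n8n workflow.
# Host and path are hypothetical -- n8n generates the webhook path when
# you add a Webhook trigger node to the workflow.
import requests

N8N_WEBHOOK_URL = "https://n8n.firm-internal.example/webhook/contract-review"

with open("nda_draft.pdf", "rb") as f:
    resp = requests.post(
        N8N_WEBHOOK_URL,
        files={"document": ("nda_draft.pdf", f, "application/pdf")},
        timeout=60,
    )

resp.raise_for_status()
print(resp.json())  # whatever the workflow's "Respond to Webhook" node returns
```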

    LLM Inference: Ollama or vLLM

    Ollama for simpler deployments:

    • Single binary installation, minimal configuration
    • Built-in model management (download, version, switch models)
    • OpenAI-compatible API endpoint out of the box
    • Lower throughput but simpler operations

    vLLM for production deployments:

    • Higher throughput with continuous batching
    • Better GPU utilisation under concurrent load
    • OpenAI-compatible API
    • More operational complexity (Python environment, model loading)

    Decision framework: Start with Ollama for pilot deployments and single-client setups. Move to vLLM when you need to serve multiple concurrent users or multiple client adapters on the same GPU.
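One practical consequence of the shared API: the code that calls the model does not change when you migrate from Ollama to vLLM. A minimal sketch, assuming default local ports and an illustrative model name:

```python
# Minimal sketch: the same code talks to Ollama or vLLM because both expose
# an OpenAI-compatible endpoint. Base URL and model name are deployment-specific.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama default; vLLM typically serves on :8000/v1
    api_key="unused",                      # local servers ignore the key, but the SDK requires one
)

response = client.chat.completions.create(
    model="llama3.1:8b",  # the model name as registered with the local server
    messages=[
        {"role": "system", "content": "You are a contract review assistant for a law firm."},
        {"role": "user", "content": "Flag risky clauses in the following section: ..."},
    ],
    temperature=0.1,  # keep analysis conservative and repeatable
)
print(response.choices[0].message.content)
```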

    Fine-Tuned Models + LoRA Adapters

    The base model + adapter architecture is the foundation of multi-client agency operations:

    • One base model (Llama 3.1 8B) loaded in GPU memory
    • Per-client LoRA adapters (50-200 MB each) that customise the base model for each firm's specific tasks and style
    • Dynamic adapter loading — swap adapters at inference time based on which client's request is being processed

    This architecture means a single GPU serves all your legal clients. Each client gets a model that behaves as if it were trained exclusively on their data, but the infrastructure cost is shared. See our multi-client LoRA guide for the technical details.
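As a sketch of what dynamic adapter loading looks like in practice, assuming vLLM as the serving layer (adapter names and paths here are hypothetical placeholders): vLLM can register several LoRA adapters at startup and applies whichever one a request names in its model field.

```python
# Minimal sketch of per-request adapter selection, assuming a vLLM server
# started with LoRA support along these lines (names/paths are placeholders):
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#     --enable-lora \
#     --lora-modules firm-a=/adapters/firm_a firm-b=/adapters/firm_b
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def review_clause(clause: str, adapter_name: str) -> str:
    """Route the request to the requesting firm's LoRA adapter via the model field."""
    response = client.chat.completions.create(
        model=adapter_name,  # "firm-a" or "firm-b"; vLLM applies the matching adapter
        messages=[{"role": "user", "content": f"Analyse this clause for risk:\n\n{clause}"}],
        temperature=0.1,
    )
    return response.choices[0].message.content

print(review_clause("Supplier's liability under this Agreement is unlimited.", "firm-a"))
```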

    Fine-tuning happens through Ertas Studio — upload client data, configure training, export the adapter. No ML expertise required.

    Vector Database: Chroma or Qdrant

    For legal AI, pure fine-tuning is often complemented by Retrieval-Augmented Generation (RAG) for tasks that require referencing specific documents:

    Chroma for lightweight deployments:

    • Embedded mode runs in-process (no separate server); see the sketch after this list
    • Simple Python API
    • Good for collections under 1M documents
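A minimal sketch of embedded-mode usage, with an illustrative collection of precedent clauses:

```python
# Minimal sketch: Chroma in embedded mode runs inside the Python process and
# persists to a local directory -- no separate database server to operate.
import chromadb

db = chromadb.PersistentClient(path="/data/chroma")  # illustrative path
clauses = db.get_or_create_collection("precedent_clauses")

clauses.add(
    ids=["clause-001", "clause-002"],
    documents=[
        "Liability is capped at the fees paid in the preceding twelve months.",
        "Either party may terminate for convenience on thirty days' written notice.",
    ],
    metadatas=[{"type": "liability"}, {"type": "termination"}],
)

# Chroma embeds the query text with its default embedding function and
# returns the nearest stored clauses.
results = clauses.query(query_texts=["uncapped liability clause"], n_results=2)
print(results["documents"][0])
```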

    Qdrant for production deployments:

    • Dedicated server with REST and gRPC APIs
    • Better performance at scale (millions of documents)
    • Built-in filtering (useful for multi-client data isolation; see the sketch below)
    • Docker deployment
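A minimal sketch of that filtering, with illustrative collection and payload field names; the query vector must come from the same embedding model used at ingestion:

```python
# Minimal sketch: constrain every Qdrant search to one client's documents by
# filtering on a client_id payload field. Names here are illustrative.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

qdrant = QdrantClient(url="http://localhost:6333")  # Qdrant's default REST port

def search_for_client(query_vector: list[float], client_id: str, top_k: int = 5):
    """Search the shared collection, restricted to the requesting firm's data."""
    return qdrant.search(
        collection_name="precedent_clauses",
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="client_id", match=MatchValue(value=client_id))]
        ),
        limit=top_k,
    )

# The vector must match the embedding model's dimensionality;
# a zero vector stands in for a real embedding here.
hits = search_for_client(query_vector=[0.0] * 384, client_id="firm-a")
```

Payload filtering is a query-layer control. Firms that demand harder isolation get their own collection, or their own Qdrant instance, matching the per-client separation described under the deployment options below.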

    When to use RAG alongside fine-tuning:

    • Contract review against a clause library → RAG retrieves similar clauses, fine-tuned model analyses
    • Legal research → RAG retrieves relevant case law, fine-tuned model summarises and synthesises
    • Due diligence → RAG searches the data room, fine-tuned model extracts and classifies

    Client Portal

    Law firms expect a polished interface, not raw API outputs. Options:

    Custom web app: A simple React or Next.js application that:

    • Accepts document uploads
    • Shows processing status
    • Displays analysis results in a formatted report
    • Provides an export function (PDF, DOCX)
    • Authenticates against the firm's identity provider (SAML/OIDC)

    n8n + form interface: For simpler deployments, n8n's webhook + form trigger can serve as a basic intake interface. Less polished but faster to deploy.

    Integration with existing tools: Many firms prefer results delivered into their existing document management system (iManage, NetDocuments) or matter management platform rather than a separate portal.

    Deployment Topology

    Single-Client Deployment

    For a small firm (10-50 lawyers):

    Component                    | Hardware                       | Notes
    n8n + PostgreSQL + Vector DB | Client's existing server or VM | 4 CPU, 8 GB RAM
    LLM inference + model files  | Dedicated GPU workstation      | RTX 5090, 32 GB VRAM
    Client portal                | Same server as n8n             | Served via Nginx

    Total additional hardware cost to the client: $2,500-4,000 for the GPU workstation (and only if the firm does not already have one).

    Multi-Client Agency Deployment

    For an agency managing 5-15 law firm clients:

    Option A: Centralised (Agency-Hosted)

    • Agency operates a server room or colocation rack
    • Each client's data is logically isolated (separate databases, separate LoRA adapters)
    • Requires robust access controls and audit logging
    • Lower hardware cost per client
    • Note: Some firms will not accept this model — their data must be on their own hardware

    Option B: Distributed (Client-Hosted)

    • Each client has their own hardware stack
    • Agency manages remotely via VPN or secure remote access
    • Higher hardware cost (duplicated across clients) but maximum data isolation
    • Preferred by most law firms due to data sovereignty requirements

    Option C: Hybrid

    • Client-hosted inference (GPU + model on client hardware)
    • Agency-hosted n8n (orchestration only, no client data persisted)
    • Fine-tuning happens on agency infrastructure, adapter files delivered to client

    Most agencies start with Option B and migrate clients willing to centralise to Option A as trust builds.

    Data Flow: A Complete Example

    Here is the step-by-step data flow for a contract review workflow:

    1. Lawyer uploads contract to the client portal (or drops it into a monitored DMS folder)
    2. n8n webhook fires, triggering the contract review workflow
    3. n8n extracts text from the document (PDF parsing node)
    4. n8n chunks the document into sections (Function node)
    5. For each section, n8n queries the vector DB for similar clauses from the firm's precedent library
    6. n8n sends each section + retrieved context to the local LLM with the firm-specific LoRA adapter loaded
    7. LLM returns risk analysis for each section
    8. n8n aggregates results into a structured review report
    9. Report is delivered to the client portal, emailed, or written back to the DMS
    10. All execution data is logged in n8n's execution history and the audit log

    Total processing time for a 30-page contract: 2-5 minutes.
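Steps 4 to 7 carry the core logic. A minimal sketch of that loop in plain Python (inside n8n the same work is spread across Function and HTTP Request nodes; the endpoint, collection, and adapter names follow the hypothetical examples earlier in this article):

```python
# Minimal sketch of steps 4-7: chunk the contract, retrieve precedent for each
# section, and send section + context to the firm's LoRA adapter for analysis.
import chromadb
from openai import OpenAI

clauses = chromadb.PersistentClient(path="/data/chroma").get_or_create_collection("precedent_clauses")
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def chunk(text: str, max_chars: int = 4000) -> list[str]:
    """Naive fixed-size chunking; real workflows split on headings or clause boundaries."""
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

def review_contract(contract_text: str, adapter_name: str) -> list[dict]:
    report = []
    for section in chunk(contract_text):                          # step 4
        hits = clauses.query(query_texts=[section], n_results=3)  # step 5
        context = "\n---\n".join(hits["documents"][0])
        analysis = llm.chat.completions.create(                   # step 6
            model=adapter_name,  # the firm-specific LoRA adapter
            messages=[{
                "role": "user",
                "content": f"Precedent clauses:\n{context}\n\nReview this section for risk:\n{section}",
            }],
            temperature=0.1,
        )
        report.append({                                           # step 7
            "section_preview": section[:80],
            "analysis": analysis.choices[0].message.content,
        })
    return report  # step 8: aggregated into the final report by n8n
```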

    Scaling Considerations

    Adding More Clients

    Each new client requires:

    • A new LoRA adapter (trained via Ertas Studio)
    • A new vector DB collection (if using RAG)
    • New n8n workflows (cloned from templates, customised per client)
    • Client-specific configuration in the portal

    The base model and inference infrastructure are shared. Marginal cost per new client: fine-tuning time + adapter storage (trivial).

    Handling Increased Volume

    When a single GPU becomes saturated:

    • Add a second GPU to the same server (most workstations support 2 GPUs)
    • Use vLLM's tensor parallelism to split models across GPUs (sketched below)
    • Or deploy a second inference server and load-balance with Nginx
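For the tensor parallelism route, a minimal sketch using vLLM's offline Python engine to show the relevant setting; the OpenAI-compatible server exposes the same control as a --tensor-parallel-size flag:

```python
# Minimal sketch: shard one model across two GPUs with vLLM's tensor parallelism.
# Shown with the offline LLM engine; `vllm serve` takes --tensor-parallel-size 2.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,  # split weights and compute across both GPUs
    enable_lora=True,        # keep per-client adapter loading available
)

outputs = llm.generate(
    ["Summarise the indemnification obligations in plain language: ..."],
    SamplingParams(temperature=0.1, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```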

    Adding New Capabilities

    New use cases (e.g., adding legal research to a firm that started with contract review) require:

    • A new fine-tuned adapter for the new task
    • New n8n workflows
    • New vector DB collection (if the task uses RAG)

    The infrastructure scales horizontally — same stack, new adapters.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
