
AI Agency Tech Stack for Legal Clients: n8n + Fine-Tuned Models + On-Prem Deployment
The complete architecture for AI agencies serving law firms — from n8n orchestration to fine-tuned model inference to client-facing delivery. Component selection, deployment topology, and scaling considerations.
Building AI solutions for law firms requires a specific technology stack that satisfies legal compliance requirements while remaining manageable for a small agency team. This article documents the full architecture — every component, why it was chosen, and how the pieces connect.
The Full Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                       CLIENT NETWORK                        │
│                                                             │
│  ┌──────────┐      ┌──────────┐      ┌───────────────────┐  │
│  │   DMS    │ ───→ │   n8n    │ ───→ │   LLM Inference   │  │
│  │(iManage) │      │  (self-  │      │   (Ollama/vLLM)   │  │
│  └──────────┘      │  hosted) │      │  + LoRA Adapters  │  │
│                    └────┬─────┘      └───────────────────┘  │
│                         │                                   │
│                    ┌────▼─────┐      ┌───────────────────┐  │
│                    │  Vector  │      │   Client Portal   │  │
│                    │    DB    │      │   (results UI)    │  │
│                    │ (Chroma/ │      └───────────────────┘  │
│                    │  Qdrant) │                             │
│                    └──────────┘                             │
└─────────────────────────────────────────────────────────────┘
```
Every component runs within the law firm's network. No data leaves the perimeter.
Component Selection
n8n: Workflow Orchestration
Why n8n:
- Self-hostable (Docker, bare metal) — no SaaS dependency
- Visual workflow builder that non-technical staff can understand during demos
- OpenAI-compatible node connects directly to local LLM endpoints
- Webhook triggers for real-time document processing
- Built-in error handling, retry logic, and execution logging
- Active open-source community with legal-relevant workflow templates
Why not Make.com or Zapier:
- Both are cloud-only SaaS — data must leave the firm's network
- Cannot self-host for air-gapped deployments
- Vendor dependency creates risk for long-term engagements
n8n deployment: Docker container with a PostgreSQL backend. Resource-light — 2 CPU cores and 4 GB of RAM handle most agency workloads.
LLM Inference: Ollama or vLLM
Ollama for simpler deployments:
- Single binary installation, minimal configuration
- Built-in model management (download, version, switch models)
- OpenAI-compatible API endpoint out of the box (see the sketch below)
- Lower throughput but simpler operations
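Here is what "OpenAI-compatible out of the box" looks like in practice: the standard openai Python client pointed at a local Ollama instance. A minimal sketch; the model name and prompts are placeholders.

```python
# Minimal sketch: calling a local Ollama instance through the standard
# OpenAI Python client. Nothing changes beyond the base_url.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # required by the client library, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",  # any model pulled via `ollama pull`
    messages=[
        {"role": "system", "content": "You are a contract review assistant."},
        {"role": "user", "content": "Summarise the risks in this clause: ..."},
    ],
)
print(response.choices[0].message.content)
```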
vLLM for production deployments:
- Higher throughput with continuous batching
- Better GPU utilisation under concurrent load
- OpenAI-compatible API
- More operational complexity (Python environment, model loading)
Decision framework: Start with Ollama for pilot deployments and single-client setups. Move to vLLM when you need to serve multiple concurrent users or multiple client adapters on the same GPU.
Fine-Tuned Models + LoRA Adapters
The base model + adapter architecture is the foundation of multi-client agency operations:
- One base model (Llama 3.1 8B) loaded in GPU memory
- Per-client LoRA adapters (50-200 MB each) that customise the base model for each firm's specific tasks and style
- Dynamic adapter loading — swap adapters at inference time based on which client's request is being processed
This architecture means a single GPU serves all your legal clients. Each client gets a model that behaves as if it were trained exclusively on their data, but the infrastructure cost is shared. See our multi-client LoRA guide for the technical details.
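A sketch of what dynamic adapter loading looks like from the calling side, assuming a vLLM server launched with LoRA support. The adapter names, paths, and port below are illustrative.

```python
# Sketch of per-client adapter routing against a vLLM OpenAI-compatible
# server. Assumes vLLM was launched with LoRA enabled, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#     --enable-lora --lora-modules firm-a=/adapters/firm_a firm-b=/adapters/firm_b
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def review_for_client(client_adapter: str, clause: str) -> str:
    """Route a request to the LoRA adapter registered for this client."""
    response = client.chat.completions.create(
        model=client_adapter,  # vLLM selects the LoRA adapter by its registered name
        messages=[{"role": "user", "content": f"Assess this clause for risk:\n{clause}"}],
    )
    return response.choices[0].message.content

# Same GPU, same base model -- different client behaviour per adapter.
print(review_for_client("firm-a", "The Supplier's liability is unlimited..."))
```

From n8n, the OpenAI-compatible node can do the same thing by setting the model field to the client's adapter name.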
Fine-tuning happens through Ertas Studio — upload client data, configure training, export the adapter. No ML expertise required.
Vector Database: Chroma or Qdrant
For legal AI, pure fine-tuning is often complemented by Retrieval-Augmented Generation (RAG) for tasks that require referencing specific documents:
Chroma for lightweight deployments:
- Embedded mode runs in-process (no separate server)
- Simple Python API (sketched below)
- Good for collections under 1M documents
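A minimal sketch of Chroma's embedded mode: no server process, state persisted to a local directory. The collection name and documents are illustrative.

```python
# Sketch of Chroma in embedded mode. Chroma embeds documents with its
# default model unless you supply your own embeddings.
import chromadb

client = chromadb.PersistentClient(path="./firm_precedents")
collection = client.get_or_create_collection(name="clause_library")

# Index precedent clauses once.
collection.add(
    ids=["clause-001", "clause-002"],
    documents=[
        "Limitation of liability capped at fees paid in the prior 12 months.",
        "Either party may terminate for convenience on 30 days' notice.",
    ],
)

# Retrieve the closest precedent clauses for a section under review.
results = collection.query(query_texts=["unlimited liability clause"], n_results=2)
print(results["documents"][0])
```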
Qdrant for production deployments:
- Dedicated server with REST and gRPC APIs
- Better performance at scale (millions of documents)
- Built-in filtering (useful for multi-client data isolation; sketched below)
- Docker deployment
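The filtering point is what makes multi-client isolation practical: every stored point carries a client identifier in its payload, and every query filters on it. A sketch assuming qdrant-client and an existing collection; the collection name, payload field, and query vector are illustrative.

```python
# Sketch of Qdrant payload filtering for per-client data isolation.
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(host="localhost", port=6333)

hits = qdrant.search(
    collection_name="precedent_clauses",
    query_vector=[0.1] * 768,  # embedding of the clause under review
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="client_id", match=models.MatchValue(value="firm-a")
            )
        ]
    ),
    limit=5,
)
for hit in hits:
    print(hit.score, hit.payload.get("text"))
```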
When to use RAG alongside fine-tuning (the pattern is sketched after this list):
- Contract review against a clause library → RAG retrieves similar clauses, fine-tuned model analyses
- Legal research → RAG retrieves relevant case law, fine-tuned model summarises and synthesises
- Due diligence → RAG searches the data room, fine-tuned model extracts and classifies
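The pattern behind all three rows is the same: retrieve firm-specific context, then hand it to the client's fine-tuned model for analysis. A compact sketch, reusing the illustrative names from the Chroma and vLLM examples above.

```python
# RAG + fine-tuned model in miniature: retrieve firm precedent for
# context, then let the client's adapter do the analysis.
import chromadb
from openai import OpenAI

vectors = chromadb.PersistentClient(path="./firm_precedents")
clauses = vectors.get_or_create_collection(name="clause_library")
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def review_section(section: str, adapter: str = "firm-a") -> str:
    # 1. RAG step: pull the firm's closest precedent clauses.
    precedent = clauses.query(query_texts=[section], n_results=3)["documents"][0]
    # 2. Fine-tuned model step: analyse the section against that context.
    prompt = (
        "Firm precedent clauses:\n" + "\n".join(precedent)
        + f"\n\nReview this section against our precedent:\n{section}"
    )
    reply = llm.chat.completions.create(
        model=adapter, messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content
```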
Client Portal
Law firms expect a polished interface, not raw API outputs. Options:
Custom web app: A simple React or Next.js application that:
- Accepts document uploads
- Shows processing status
- Displays analysis results in a formatted report
- Provides an export function (PDF, DOCX)
- Authenticates against the firm's identity provider (SAML/OIDC)
n8n + form interface: For simpler deployments, n8n's webhook + form trigger can serve as a basic intake interface. Less polished but faster to deploy.
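Either way, intake boils down to an HTTP call against the workflow's webhook trigger. A sketch with an illustrative webhook URL and form fields:

```python
# Sketch of a minimal intake call against an n8n webhook trigger. The
# webhook path is whatever you configure on the workflow's Webhook node;
# the URL, file, and metadata below are illustrative.
import requests

N8N_WEBHOOK = "https://n8n.firm-internal.example/webhook/contract-review"

with open("msa_draft_v3.pdf", "rb") as f:
    resp = requests.post(
        N8N_WEBHOOK,
        files={"document": ("msa_draft_v3.pdf", f, "application/pdf")},
        data={"matter_id": "2026-0142", "requested_by": "a.smith"},
        timeout=60,
    )
resp.raise_for_status()
print(resp.json())  # n8n can respond with a job ID or the finished report
```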
Integration with existing tools: Many firms prefer results delivered into their existing document management system (iManage, NetDocuments) or matter management platform rather than a separate portal.
Deployment Topology
Single-Client Deployment
For a small firm (10-50 lawyers):
| Component | Hardware | Notes |
|---|---|---|
| n8n + PostgreSQL + Vector DB | Client's existing server or VM | 4 CPU, 8 GB RAM |
| LLM inference + model files | Dedicated GPU workstation | RTX 5090, 32 GB VRAM |
| Client portal | Same server as n8n | Served via Nginx |
Total additional hardware cost to the client: $2,500-4,000 for the GPU workstation, and only if the firm does not already have one.
Multi-Client Agency Deployment
For an agency managing 5-15 law firm clients:
Option A: Centralised (Agency-Hosted)
- Agency operates a server room or colocation rack
- Each client's data is logically isolated (separate databases, separate LoRA adapters)
- Requires robust access controls and audit logging
- Lower hardware cost per client
- Note: Some firms will not accept this model — their data must be on their own hardware
Option B: Distributed (Client-Hosted)
- Each client has their own hardware stack
- Agency manages remotely via VPN or secure remote access
- Higher hardware cost (duplicated across clients) but maximum data isolation
- Preferred by most law firms due to data sovereignty requirements
Option C: Hybrid
- Client-hosted inference (GPU + model on client hardware)
- Agency-hosted n8n (orchestration only, no client data persisted)
- Fine-tuning happens on agency infrastructure, adapter files delivered to client
Most agencies start with Option B and migrate clients willing to centralise to Option A as trust builds.
Data Flow: A Complete Example
Here is the step-by-step data flow for a contract review workflow:
1. Lawyer uploads a contract to the client portal (or drops it into a monitored DMS folder)
2. n8n webhook fires, triggering the contract review workflow
3. n8n extracts text from the document (PDF parsing node)
4. n8n chunks the document into sections (Function node)
5. For each section, n8n queries the vector DB for similar clauses from the firm's precedent library
6. n8n sends each section plus the retrieved context to the local LLM with the firm-specific LoRA adapter loaded
7. The LLM returns a risk analysis for each section
8. n8n aggregates the results into a structured review report
9. The report is delivered to the client portal, emailed, or written back to the DMS
10. All execution data is logged in n8n's execution history and the audit log
Total processing time for a 30-page contract: 2-5 minutes.
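Step 4 deserves a closer look, since chunking quality drives review quality. n8n's Function node runs JavaScript; the sketch below expresses the same kind of logic in Python for consistency with the other examples, and the regex and sample text are illustrative.

```python
# Illustrative version of the chunking logic behind step 4: split a
# contract into numbered-section chunks so each can be reviewed with its
# own retrieved context. Real contracts need more robust parsing.
import re

def chunk_by_section(text: str) -> list[str]:
    """Split contract text at headings like '1.', '2.3', '14.1.2'."""
    parts = re.split(r"\n(?=\d+(?:\.\d+)*\.?\s)", text)
    return [p.strip() for p in parts if p.strip()]

contract = "1. Definitions\n...\n2. Term and Termination\n...\n3. Liability\n..."
for i, section in enumerate(chunk_by_section(contract), start=1):
    print(f"--- chunk {i} ---\n{section}")
```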
Scaling Considerations
Adding More Clients
Each new client requires:
- A new LoRA adapter (trained via Ertas Studio)
- A new vector DB collection (if using RAG)
- New n8n workflows (cloned from templates, customised per client)
- Client-specific configuration in the portal
The base model and inference infrastructure are shared. Marginal cost per new client: fine-tuning time + adapter storage (trivial).
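One way to keep that marginal cost low is to treat each client as a configuration entry rather than new infrastructure. A sketch of such a registry; all names and identifiers are illustrative.

```python
# Sketch of a per-client configuration registry: each new client adds an
# entry, not new hardware.
from dataclasses import dataclass

@dataclass(frozen=True)
class ClientConfig:
    lora_adapter: str       # adapter name registered with the inference server
    vector_collection: str  # per-client collection for RAG isolation
    n8n_workflow: str       # workflow cloned from the agency template

CLIENTS = {
    "firm-a": ClientConfig("firm-a", "firm_a_clauses", "contract-review-firm-a"),
    "firm-b": ClientConfig("firm-b", "firm_b_clauses", "contract-review-firm-b"),
}

def config_for(client_id: str) -> ClientConfig:
    # Routing layer: shared base model, per-client everything else.
    return CLIENTS[client_id]
```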
Handling Increased Volume
When a single GPU becomes saturated:
- Add a second GPU to the same server (most workstations support 2 GPUs)
- Use vLLM's tensor parallelism to split the model across GPUs (sketched below)
- Or deploy a second inference server and load-balance with Nginx
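For the second option, vLLM handles the sharding itself. In the server deployment described above this is the --tensor-parallel-size flag on vllm serve; the sketch below shows the equivalent offline Python API, with a placeholder model and prompt.

```python
# Sketch of vLLM's offline Python API with tensor parallelism across
# two GPUs. Server equivalent: `vllm serve ... --tensor-parallel-size 2`.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,  # shard model weights across 2 GPUs
)
outputs = llm.generate(
    ["Summarise the indemnification obligations in plain English: ..."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```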
Adding New Capabilities
New use cases (e.g., adding legal research to a firm that started with contract review) require:
- A new fine-tuned adapter for the new task
- New n8n workflows
- New vector DB collection (if the task uses RAG)
The infrastructure scales horizontally — same stack, new adapters.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- n8n + Local LLMs: Building HIPAA-Compliant Automation — Deep dive into n8n + local LLM integration
- Multi-Tenant AI Deployment for Agencies — Managing multiple clients on shared infrastructure