    AI Agency Tech Stack for Legal Clients: n8n + Fine-Tuned Models + On-Prem Deployment

    The complete architecture for AI agencies serving law firms — from n8n orchestration to fine-tuned model inference to client-facing delivery. Component selection, deployment topology, and scaling considerations.

    Ertas Team

    Building AI solutions for law firms requires a specific technology stack that satisfies legal compliance requirements while remaining manageable for a small agency team. This article documents the full architecture — every component, why it was chosen, and how the pieces connect.

    The Full Architecture

    ┌─────────────────────────────────────────────────────────┐
    │                     CLIENT NETWORK                       │
    │                                                          │
    │  ┌──────────┐    ┌──────────┐    ┌───────────────────┐  │
    │  │   DMS    │───→│   n8n    │───→│  LLM Inference    │  │
    │  │(iManage) │    │(self-    │    │  (Ollama/vLLM)    │  │
    │  └──────────┘    │ hosted)  │    │  + LoRA Adapters  │  │
    │                  └────┬─────┘    └───────────────────┘  │
    │                       │                                  │
    │                  ┌────▼─────┐    ┌───────────────────┐  │
    │                  │ Vector   │    │  Client Portal    │  │
    │                  │   DB     │    │  (results UI)     │  │
    │                  │(Chroma/  │    └───────────────────┘  │
    │                  │ Qdrant)  │                            │
    │                  └──────────┘                            │
    └─────────────────────────────────────────────────────────┘
    

    Every component runs within the law firm's network. No data leaves the perimeter.

    Component Selection

    n8n: Workflow Orchestration

    Why n8n:

    • Self-hostable (Docker, bare metal) — no SaaS dependency
    • Visual workflow builder that non-technical staff can understand during demos
    • OpenAI-compatible node connects directly to local LLM endpoints
    • Webhook triggers for real-time document processing
    • Built-in error handling, retry logic, and execution logging
    • Active open-source community with legal-relevant workflow templates

    Why not Make.com or Zapier:

    • Both are cloud-only SaaS — data must leave the firm's network
    • Cannot self-host for air-gapped deployments
    • Vendor dependency creates risk for long-term engagements

    n8n deployment: Docker container with PostgreSQL backend. Resource-light — 2 CPU cores and 4 GB of RAM handle most agency workloads.
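As a concrete sketch of the webhook entry point: an upstream system (or the client portal) can hand a document straight to an n8n workflow with one HTTP request. The host and webhook path below are hypothetical placeholders; n8n generates the path when you add a Webhook trigger node.

```python
# Minimal sketch: push a document into a webhook-triggered n8n workflow.
# Host and path are hypothetical -- n8n generates the webhook path when
# you add a Webhook trigger node to the workflow.
import requests

N8N_WEBHOOK_URL = "https://n8n.firm-internal.example/webhook/contract-review"

with open("nda_draft.pdf", "rb") as f:
    resp = requests.post(
        N8N_WEBHOOK_URL,
        files={"document": ("nda_draft.pdf", f, "application/pdf")},
        timeout=60,
    )

resp.raise_for_status()
print(resp.json())  # whatever the workflow's "Respond to Webhook" node returns
```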

    LLM Inference: Ollama or vLLM

    Ollama for simpler deployments:

    • Single binary installation, minimal configuration
    • Built-in model management (download, version, switch models)
    • OpenAI-compatible API endpoint out of the box
    • Lower throughput but simpler operations

    vLLM for production deployments:

    • Higher throughput with continuous batching
    • Better GPU utilisation under concurrent load
    • OpenAI-compatible API
    • More operational complexity (Python environment, model loading)

    Decision framework: Start with Ollama for pilot deployments and single-client setups. Move to vLLM when you need to serve multiple concurrent users or multiple client adapters on the same GPU.
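One practical consequence of the shared API: the code that calls the model does not change when you migrate from Ollama to vLLM. A minimal sketch, assuming default local ports and an illustrative model name:

```python
# Minimal sketch: the same code talks to Ollama or vLLM because both expose
# an OpenAI-compatible endpoint. Base URL and model name are deployment-specific.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama default; vLLM typically serves on :8000/v1
    api_key="unused",                      # local servers ignore the key, but the SDK requires one
)

response = client.chat.completions.create(
    model="llama3.1:8b",  # the model name as registered with the local server
    messages=[
        {"role": "system", "content": "You are a contract review assistant for a law firm."},
        {"role": "user", "content": "Flag risky clauses in the following section: ..."},
    ],
    temperature=0.1,  # keep analysis conservative and repeatable
)
print(response.choices[0].message.content)
```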

    Fine-Tuned Models + LoRA Adapters

    The base model + adapter architecture is the foundation of multi-client agency operations:

    • One base model (Llama 3.1 8B) loaded in GPU memory
    • Per-client LoRA adapters (50-200 MB each) that customise the base model for each firm's specific tasks and style
    • Dynamic adapter loading — swap adapters at inference time based on which client's request is being processed

    This architecture means a single GPU serves all your legal clients. Each client gets a model that behaves as if it were trained exclusively on their data, but the infrastructure cost is shared. See our multi-client LoRA guide for the technical details.
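As a sketch of what dynamic adapter loading looks like in practice, assuming vLLM as the serving layer (adapter names and paths here are hypothetical placeholders): vLLM can register several LoRA adapters at startup and applies whichever one a request names in its model field.

```python
# Minimal sketch of per-request adapter selection, assuming a vLLM server
# started with LoRA support along these lines (names/paths are placeholders):
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#     --enable-lora \
#     --lora-modules firm-a=/adapters/firm_a firm-b=/adapters/firm_b
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def review_clause(clause: str, adapter_name: str) -> str:
    """Route the request to the requesting firm's LoRA adapter via the model field."""
    response = client.chat.completions.create(
        model=adapter_name,  # "firm-a" or "firm-b"; vLLM applies the matching adapter
        messages=[{"role": "user", "content": f"Analyse this clause for risk:\n\n{clause}"}],
        temperature=0.1,
    )
    return response.choices[0].message.content

print(review_clause("Supplier's liability under this Agreement is unlimited.", "firm-a"))
```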

    Fine-tuning happens through Ertas Studio — upload client data, configure training, export the adapter. No ML expertise required.

    Vector Database: Chroma or Qdrant

    For legal AI, pure fine-tuning is often complemented by Retrieval-Augmented Generation (RAG) for tasks that require referencing specific documents:

    Chroma for lightweight deployments:

    • Embedded mode runs in-process (no separate server); see the sketch after this list
    • Simple Python API
    • Good for collections under 1M documents
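A minimal sketch of embedded-mode usage, with an illustrative collection of precedent clauses:

```python
# Minimal sketch: Chroma in embedded mode runs inside the Python process and
# persists to a local directory -- no separate database server to operate.
import chromadb

db = chromadb.PersistentClient(path="/data/chroma")  # illustrative path
clauses = db.get_or_create_collection("precedent_clauses")

clauses.add(
    ids=["clause-001", "clause-002"],
    documents=[
        "Liability is capped at the fees paid in the preceding twelve months.",
        "Either party may terminate for convenience on thirty days' written notice.",
    ],
    metadatas=[{"type": "liability"}, {"type": "termination"}],
)

# Chroma embeds the query text with its default embedding function and
# returns the nearest stored clauses.
results = clauses.query(query_texts=["uncapped liability clause"], n_results=2)
print(results["documents"][0])
```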

    Qdrant for production deployments:

    • Dedicated server with REST and gRPC APIs
    • Better performance at scale (millions of documents)
    • Built-in filtering (useful for multi-client data isolation; see the sketch below)
    • Docker deployment
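A minimal sketch of that filtering, with illustrative collection and payload field names; the query vector must come from the same embedding model used at ingestion:

```python
# Minimal sketch: constrain every Qdrant search to one client's documents by
# filtering on a client_id payload field. Names here are illustrative.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

qdrant = QdrantClient(url="http://localhost:6333")  # Qdrant's default REST port

def search_for_client(query_vector: list[float], client_id: str, top_k: int = 5):
    """Search the shared collection, restricted to the requesting firm's data."""
    return qdrant.search(
        collection_name="precedent_clauses",
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="client_id", match=MatchValue(value=client_id))]
        ),
        limit=top_k,
    )

# The vector must match the embedding model's dimensionality;
# a zero vector stands in for a real embedding here.
hits = search_for_client(query_vector=[0.0] * 384, client_id="firm-a")
```

Payload filtering is a query-layer control. Firms that demand harder isolation get their own collection, or their own Qdrant instance, matching the per-client separation described under the deployment options below.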

    When to use RAG alongside fine-tuning:

    • Contract review against a clause library → RAG retrieves similar clauses, fine-tuned model analyses
    • Legal research → RAG retrieves relevant case law, fine-tuned model summarises and synthesises
    • Due diligence → RAG searches the data room, fine-tuned model extracts and classifies

    Client Portal

    Law firms expect a polished interface, not raw API outputs. Options:

    Custom web app: A simple React or Next.js application that:

    • Accepts document uploads
    • Shows processing status
    • Displays analysis results in a formatted report
    • Provides an export function (PDF, DOCX)
    • Authenticates against the firm's identity provider (SAML/OIDC)

    n8n + form interface: For simpler deployments, n8n's webhook + form trigger can serve as a basic intake interface. Less polished but faster to deploy.

    Integration with existing tools: Many firms prefer results delivered into their existing document management system (iManage, NetDocuments) or matter management platform rather than a separate portal.

    Deployment Topology

    Single-Client Deployment

    For a small firm (10-50 lawyers):

    Component                    | Hardware                       | Notes
    n8n + PostgreSQL + Vector DB | Client's existing server or VM | 4 CPU, 8 GB RAM
    LLM inference + model files  | Dedicated GPU workstation      | RTX 5090, 32 GB VRAM
    Client portal                | Same server as n8n             | Served via Nginx

    Total additional hardware cost to the client: $2,500-4,000 for the GPU workstation (and only if the firm does not already have one).

    Multi-Client Agency Deployment

    For an agency managing 5-15 law firm clients:

    Option A: Centralised (Agency-Hosted)

    • Agency operates a server room or colocation rack
    • Each client's data is logically isolated (separate databases, separate LoRA adapters)
    • Requires robust access controls and audit logging
    • Lower hardware cost per client
    • Note: Some firms will not accept this model — their data must be on their own hardware

    Option B: Distributed (Client-Hosted)

    • Each client has their own hardware stack
    • Agency manages remotely via VPN or secure remote access
    • Higher hardware cost (duplicated across clients) but maximum data isolation
    • Preferred by most law firms due to data sovereignty requirements

    Option C: Hybrid

    • Client-hosted inference (GPU + model on client hardware)
    • Agency-hosted n8n (orchestration only, no client data persisted)
    • Fine-tuning happens on agency infrastructure, adapter files delivered to client

    Most agencies start with Option B and migrate clients willing to centralise to Option A as trust builds.

    Data Flow: A Complete Example

    Here is the step-by-step data flow for a contract review workflow:

    1. Lawyer uploads contract to the client portal (or drops it into a monitored DMS folder)
    2. n8n webhook fires, triggering the contract review workflow
    3. n8n extracts text from the document (PDF parsing node)
    4. n8n chunks the document into sections (Function node)
    5. For each section, n8n queries the vector DB for similar clauses from the firm's precedent library
    6. n8n sends each section + retrieved context to the local LLM with the firm-specific LoRA adapter loaded
    7. LLM returns risk analysis for each section
    8. n8n aggregates results into a structured review report
    9. Report is delivered to the client portal, emailed, or written back to the DMS
    10. All execution data is logged in n8n's execution history and the audit log

    Total processing time for a 30-page contract: 2-5 minutes.
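Steps 4 to 7 carry the core logic. A minimal sketch of that loop in plain Python (inside n8n the same work is spread across Function and HTTP Request nodes; the endpoint, collection, and adapter names follow the hypothetical examples earlier in this article):

```python
# Minimal sketch of steps 4-7: chunk the contract, retrieve precedent for each
# section, and send section + context to the firm's LoRA adapter for analysis.
import chromadb
from openai import OpenAI

clauses = chromadb.PersistentClient(path="/data/chroma").get_or_create_collection("precedent_clauses")
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def chunk(text: str, max_chars: int = 4000) -> list[str]:
    """Naive fixed-size chunking; real workflows split on headings or clause boundaries."""
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

def review_contract(contract_text: str, adapter_name: str) -> list[dict]:
    report = []
    for section in chunk(contract_text):                          # step 4
        hits = clauses.query(query_texts=[section], n_results=3)  # step 5
        context = "\n---\n".join(hits["documents"][0])
        analysis = llm.chat.completions.create(                   # step 6
            model=adapter_name,  # the firm-specific LoRA adapter
            messages=[{
                "role": "user",
                "content": f"Precedent clauses:\n{context}\n\nReview this section for risk:\n{section}",
            }],
            temperature=0.1,
        )
        report.append({                                           # step 7
            "section_preview": section[:80],
            "analysis": analysis.choices[0].message.content,
        })
    return report  # step 8: aggregated into the final report by n8n
```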

    Scaling Considerations

    Adding More Clients

    Each new client requires:

    • A new LoRA adapter (trained via Ertas Studio)
    • A new vector DB collection (if using RAG)
    • New n8n workflows (cloned from templates, customised per client)
    • Client-specific configuration in the portal

    The base model and inference infrastructure are shared. Marginal cost per new client: fine-tuning time + adapter storage (trivial).

    Handling Increased Volume

    When a single GPU becomes saturated:

    • Add a second GPU to the same server (most workstations support 2 GPUs)
    • Use vLLM's tensor parallelism to split models across GPUs (sketched below)
    • Or deploy a second inference server and load-balance with Nginx
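For the tensor parallelism route, a minimal sketch using vLLM's offline Python engine to show the relevant setting; the OpenAI-compatible server exposes the same control as a --tensor-parallel-size flag:

```python
# Minimal sketch: shard one model across two GPUs with vLLM's tensor parallelism.
# Shown with the offline LLM engine; `vllm serve` takes --tensor-parallel-size 2.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,  # split weights and compute across both GPUs
    enable_lora=True,        # keep per-client adapter loading available
)

outputs = llm.generate(
    ["Summarise the indemnification obligations in plain language: ..."],
    SamplingParams(temperature=0.1, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```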

    Adding New Capabilities

    New use cases (e.g., adding legal research to a firm that started with contract review) require:

    • A new fine-tuned adapter for the new task
    • New n8n workflows
    • New vector DB collection (if the task uses RAG)

    The infrastructure scales horizontally — same stack, new adapters.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
