
Fine-Tuning vs. Prompt Engineering for Legal Document Review
When does prompt engineering hit its ceiling for legal AI tasks? A practical comparison of prompt engineering and fine-tuning for contract review, with a decision framework for agencies.
Every AI agency building legal AI tools starts with prompt engineering. It is fast, requires no training data, and works surprisingly well for generic tasks. But as clients demand higher accuracy on their specific document types, prompt engineering hits a ceiling that no amount of clever prompting can break through.
This article compares the two approaches head-to-head on contract review — one of the most common legal AI use cases — and provides a framework for deciding when to make the jump to fine-tuning.
Where Prompt Engineering Works
Prompt engineering is the right starting point. For general legal tasks with well-defined outputs, a carefully crafted prompt with a frontier model (GPT-4o, Claude Sonnet) delivers good results:
Good prompt engineering use cases:
- Summarising publicly available case law
- Generating first drafts of standard legal documents from templates
- Answering general legal questions (not case-specific)
- Classifying documents into broad categories (contract, motion, brief, correspondence)
For these tasks, the model's pre-training knowledge covers the domain well. The prompt provides structure and constraints. Results are acceptable for a first pass that a lawyer reviews.
Where Prompt Engineering Hits Its Ceiling
Legal document review — the detailed analysis of contracts, leases, regulatory filings, and similar documents for specific issues — is where prompting breaks down.
The Contract Review Test
Consider a practical test: reviewing a commercial lease agreement for a specific client, checking for 25 common risk factors (indemnification clauses, assignment restrictions, termination triggers, insurance requirements, etc.).
With prompt engineering (GPT-4o):
```text
System: You are a legal document analyst specialising in commercial leases.

Review the following lease agreement and identify all instances of the
following risk factors: [list of 25 risk factors with descriptions]

For each, provide the relevant clause, your assessment, and a risk rating.
```
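Operationally, that prompt might be issued like this. A minimal sketch, assuming the official `openai` Python client and an `OPENAI_API_KEY` in the environment; the risk-factor list is abbreviated to three entries here.

```python
# Sketch of the prompted-review call. The risk-factor list is
# abbreviated (the real checklist has 25 entries with descriptions).
RISK_FACTORS = [
    "indemnification clauses",
    "assignment restrictions",
    "termination triggers",
]

def build_review_messages(lease_text: str) -> list:
    factors = "\n".join(f"- {f}" for f in RISK_FACTORS)
    system = (
        "You are a legal document analyst specialising in commercial leases.\n"
        "Review the following lease agreement and identify all instances "
        f"of the following risk factors:\n{factors}\n"
        "For each, provide the relevant clause, your assessment, "
        "and a risk rating."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": lease_text},
    ]

def review_lease(lease_text: str) -> str:
    from openai import OpenAI  # imported lazily; the prompt builder has no dependency
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # lowers, but does not eliminate, run-to-run variance
        messages=build_review_messages(lease_text),
    )
    return resp.choices[0].message.content
```

Note that even with `temperature=0`, outputs are not guaranteed identical across runs, which matters for the consistency results below.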
Results on a benchmark set of 50 leases:
| Metric | Score |
|---|---|
| Risk factors correctly identified | 72% |
| False positives (flagged non-issues) | 18% |
| Missed critical clauses | 15% |
| Consistent risk ratings | 61% |
A 72% identification rate is impressive for a general-purpose model. But for a law firm, it means missing more than 1 in 4 relevant clauses. That is not a tool; it is a liability.
Why Prompting Cannot Close the Gap
Jurisdiction-specific language. Legal language varies significantly by jurisdiction. A "quiet enjoyment" clause in New South Wales reads differently from one in New York. Prompt engineering cannot encode these differences without making the prompt so long it degrades performance.
Client-specific risk tolerance. One client considers a 30-day termination notice acceptable. Another requires 90 days minimum. These client-specific thresholds cannot be reliably encoded in prompts.
Document structure variation. Leases from different counterparties use different structures, numbering systems, and cross-referencing conventions. A general-purpose model struggles to track references across a 60-page document with inconsistent formatting.
Consistency. The same lease reviewed twice with the same prompt produces different results. For legal work, inconsistency is unacceptable — the firm needs the same clause flagged the same way every time.
What Fine-Tuning Changes
Fine-tuning teaches the model the specific patterns, terminology, and judgment criteria that prompting cannot convey. The same contract review task with a fine-tuned model:
Training data: 2,000 annotated lease reviews from the firm's historical work — clauses tagged with risk factors, assessments, and ratings by experienced lawyers.
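Those annotated reviews might be serialised as chat-format JSONL records, one per clause. A sketch only: the field names and system prompt below are illustrative, not the firm's actual schema.

```python
import json

def to_training_record(clause: str, risk_factor: str,
                       assessment: str, rating: str) -> str:
    """Serialise one lawyer-annotated clause as a chat-format JSONL line."""
    # The assistant turn is the target the model learns to reproduce.
    label = json.dumps({
        "risk_factor": risk_factor,
        "assessment": assessment,
        "rating": rating,
    })
    return json.dumps({
        "messages": [
            {"role": "system",
             "content": "Identify lease risk factors and rate them."},
            {"role": "user", "content": clause},
            {"role": "assistant", "content": label},
        ]
    })

line = to_training_record(
    "Tenant shall indemnify Landlord against all claims arising...",
    "indemnification",
    "Uncapped, one-sided indemnity in favour of the landlord.",
    "high",
)
```

One line per annotated clause, 2,000 reviews in, and the training set mirrors exactly the judgments the firm wants reproduced.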
Fine-tuned model (Llama 3.1 8B + LoRA):
| Metric | Prompt Engineering (GPT-4o) | Fine-Tuned (8B) |
|---|---|---|
| Risk factors correctly identified | 72% | 94% |
| False positives | 18% | 6% |
| Missed critical clauses | 15% | 3% |
| Consistent risk ratings | 61% | 92% |
| Average review time | 45 sec | 12 sec |
| Cost per review | $0.15-0.40 | ~$0 (local) |
The fine-tuned 8B model outperforms the prompted GPT-4o on every metric. It is faster because it is smaller and runs locally. It is cheaper because there are no API charges. And it is more accurate because it has learned the specific patterns this firm cares about.
Why Fine-Tuning Works for Legal Tasks
Pattern imprinting. Fine-tuning embeds the firm's analysis patterns directly into the model weights. The model does not need to be told what a problematic indemnification clause looks like — it has seen hundreds of examples.
Consistency by construction. A fine-tuned model produces more consistent outputs because the training data teaches it a specific analytical framework. The same clause triggers the same assessment.
Speed from compression. A fine-tuned 8B model replaces a far larger prompted frontier model. The task-relevant knowledge has been compressed into a smaller, faster model that excels on the specific task.
Cost at scale. Local inference on a fine-tuned model costs essentially nothing per document. For a firm reviewing thousands of contracts per year, this transforms the economics of AI-assisted review.
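The arithmetic is simple. The per-review cost range comes from the benchmark table above; the 5,000-reviews-per-year volume is an assumed figure for illustration.

```python
# Rough annual-cost comparison. The $0.15-0.40 per-review API cost comes
# from the benchmark table; 5,000 reviews/year is an assumed volume.
reviews_per_year = 5_000
api_cents_low, api_cents_high = 15, 40  # per review, in cents

annual_api_low = reviews_per_year * api_cents_low / 100    # $750
annual_api_high = reviews_per_year * api_cents_high / 100  # $2,000
annual_local = 0  # marginal inference cost on firm-owned hardware
```

At that volume the API bill lands somewhere between $750 and $2,000 per year for this one workflow alone, before accounting for retries and longer documents.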
The Decision Framework
Use this framework to decide whether fine-tuning is worth the investment for a specific legal use case:
Stay with Prompt Engineering If:
- The task is general-purpose (not client or jurisdiction-specific)
- Volume is low (fewer than 100 documents per month)
- Accuracy requirements are moderate (first-pass screening, not final review)
- You do not have historical examples to train on
- The client is in exploration mode and not ready to commit to a specific workflow
Move to Fine-Tuning If:
- The task is repetitive and domain-specific (same document type, same analysis)
- Volume justifies the investment (100+ documents per month)
- Accuracy requirements are high (the output influences legal decisions)
- You have 1,000+ historical examples with quality annotations
- Consistency matters (the same clause must always be flagged the same way)
- Cost matters at scale (API charges are becoming a meaningful expense)
- Data privacy requires local inference
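The checklist above can be condensed into a rough go/no-go heuristic. The thresholds (100 documents per month, 1,000 annotated examples) come straight from the framework; the two-signal cutoff and equal weighting are illustrative, not prescriptive.

```python
def should_fine_tune(docs_per_month: int,
                     annotated_examples: int,
                     high_accuracy_required: bool,
                     consistency_required: bool,
                     local_inference_required: bool) -> bool:
    """Rough go/no-go heuristic for the fine-tuning decision."""
    if annotated_examples < 1_000:
        return False  # without training data, stay with prompting
    signals = [
        docs_per_month >= 100,      # volume justifies the investment
        high_accuracy_required,     # output influences legal decisions
        consistency_required,       # same clause, same flag, every time
        local_inference_required,   # data cannot leave the firm
    ]
    return sum(signals) >= 2  # at least two strong drivers
```

Treat the output as a conversation starter with the client, not a verdict; a single hard requirement like data privacy can outweigh everything else.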
The Hybrid Approach
Many agencies start with prompt engineering to validate the use case, then transition to fine-tuning once the client is committed:
- Month 1-2: Deploy prompt-engineered solution, collect client feedback
- Month 3: Use the accumulated interactions as training data for fine-tuning
- Month 4: Deploy fine-tuned model, compare against prompted baseline
- Ongoing: Retrain periodically as the firm's review standards evolve
This approach de-risks the fine-tuning investment by validating demand before committing resources.
Practical Implementation
For agencies ready to fine-tune legal AI models:
- Data preparation: Export the firm's historical document reviews. Standardise the annotation format. Clean and deduplicate.
- Base model selection: Llama 3.1 8B for standard tasks, 70B for complex multi-step analysis. Smaller models fine-tune faster and run cheaper.
- Fine-tuning: Use Ertas Studio for no-code fine-tuning, or LoRA training if you prefer hands-on control.
- Evaluation: Test on a held-out set of documents the model has never seen. Compare against the prompted baseline on the same documents.
- Deployment: Export to GGUF, deploy via Ollama on the firm's hardware.
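For the hands-on LoRA route, a starting-point configuration might look like the following. This is a sketch assuming the Hugging Face `peft` library; the rank and target modules are common defaults, not tuned values.

```python
from peft import LoraConfig

# Starting-point LoRA hyperparameters; tune rank and target modules
# against your held-out evaluation set.
lora_cfg = LoraConfig(
    r=16,              # adapter rank: capacity vs. adapter-size trade-off
    lora_alpha=32,     # scaling factor, commonly set to 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
```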
The entire process from data preparation to deployed model typically takes 1-2 weeks for an experienced agency.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Fine-Tuning vs. RAG: When to Use Each — Understanding the complementary roles of fine-tuning and retrieval-augmented generation
- How to Fine-Tune an LLM — Step-by-step technical guide to LoRA fine-tuning