
Fine-Tuning vs. Prompt Engineering for Legal Document Review
When does prompt engineering hit its ceiling for legal AI tasks? A practical comparison of prompt engineering and fine-tuning for contract review, with a decision framework for agencies.
Every AI agency building legal AI tools starts with prompt engineering. It is fast, requires no training data, and works surprisingly well for generic tasks. But as clients demand higher accuracy on their specific document types, prompt engineering hits a ceiling that no amount of clever prompting can break through.
This article compares the two approaches head-to-head on contract review — one of the most common legal AI use cases — and provides a framework for deciding when to make the jump to fine-tuning.
Where Prompt Engineering Works
Prompt engineering is the right starting point. For general legal tasks with well-defined outputs, a carefully crafted prompt with a frontier model (GPT-4o, Claude Sonnet) delivers good results:
Good prompt engineering use cases:
- Summarising publicly available case law
- Generating first drafts of standard legal documents from templates
- Answering general legal questions (not case-specific)
- Classifying documents into broad categories (contract, motion, brief, correspondence)
For these tasks, the model's pre-training knowledge covers the domain well. The prompt provides structure and constraints. Results are acceptable for a first pass that a lawyer reviews.
Where Prompt Engineering Hits Its Ceiling
Legal document review — the detailed analysis of contracts, leases, regulatory filings, and similar documents for specific issues — is where prompting breaks down.
The Contract Review Test
Consider a practical test: reviewing a commercial lease agreement for a specific client, checking for 25 common risk factors (indemnification clauses, assignment restrictions, termination triggers, insurance requirements, etc.).
With prompt engineering (GPT-4o):
```text
System: You are a legal document analyst specialising in commercial leases.

Review the following lease agreement and identify all instances of the
following risk factors: [list of 25 risk factors with descriptions]

For each, provide the relevant clause, your assessment, and a risk rating.
```
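Operationally, that prompt might be issued like this. A minimal sketch, assuming the official `openai` Python client and an `OPENAI_API_KEY` in the environment; the risk-factor list is abbreviated to three entries here.

```python
# Sketch of the prompted-review call. The risk-factor list is
# abbreviated (the real checklist has 25 entries with descriptions).
RISK_FACTORS = [
    "indemnification clauses",
    "assignment restrictions",
    "termination triggers",
]

def build_review_messages(lease_text: str) -> list:
    factors = "\n".join(f"- {f}" for f in RISK_FACTORS)
    system = (
        "You are a legal document analyst specialising in commercial leases.\n"
        "Review the following lease agreement and identify all instances "
        f"of the following risk factors:\n{factors}\n"
        "For each, provide the relevant clause, your assessment, "
        "and a risk rating."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": lease_text},
    ]

def review_lease(lease_text: str) -> str:
    from openai import OpenAI  # imported lazily; the prompt builder has no dependency
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # lowers, but does not eliminate, run-to-run variance
        messages=build_review_messages(lease_text),
    )
    return resp.choices[0].message.content
```

Note that even with `temperature=0`, outputs are not guaranteed identical across runs, which matters for the consistency results below.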
Results on a benchmark set of 50 leases:
| Metric | Score |
|---|---|
| Risk factors correctly identified | 72% |
| False positives (flagged non-issues) | 18% |
| Missed critical clauses | 15% |
| Consistent risk ratings | 61% |
A 72% identification rate is impressive for a general-purpose model. But for a law firm, it means missing more than 1 in 4 relevant clauses. That is not a tool; it is a liability.
Why Prompting Cannot Close the Gap
Jurisdiction-specific language. Legal language varies significantly by jurisdiction. A "quiet enjoyment" clause in New South Wales reads differently from one in New York. Prompt engineering cannot encode these differences without making the prompt so long it degrades performance.
Client-specific risk tolerance. One client considers a 30-day termination notice acceptable. Another requires 90 days minimum. These client-specific thresholds cannot be reliably encoded in prompts.
Document structure variation. Leases from different counterparties use different structures, numbering systems, and cross-referencing conventions. A general-purpose model struggles to track references across a 60-page document with inconsistent formatting.
Consistency. The same lease reviewed twice with the same prompt produces different results. For legal work, inconsistency is unacceptable — the firm needs the same clause flagged the same way every time.
What Fine-Tuning Changes
Fine-tuning teaches the model the specific patterns, terminology, and judgment criteria that prompting cannot convey. The same contract review task with a fine-tuned model:
Training data: 2,000 annotated lease reviews from the firm's historical work — clauses tagged with risk factors, assessments, and ratings by experienced lawyers.
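Those annotated reviews might be serialised as chat-format JSONL records, one per clause. A sketch only: the field names and system prompt below are illustrative, not the firm's actual schema.

```python
import json

def to_training_record(clause: str, risk_factor: str,
                       assessment: str, rating: str) -> str:
    """Serialise one lawyer-annotated clause as a chat-format JSONL line."""
    # The assistant turn is the target the model learns to reproduce.
    label = json.dumps({
        "risk_factor": risk_factor,
        "assessment": assessment,
        "rating": rating,
    })
    return json.dumps({
        "messages": [
            {"role": "system",
             "content": "Identify lease risk factors and rate them."},
            {"role": "user", "content": clause},
            {"role": "assistant", "content": label},
        ]
    })

line = to_training_record(
    "Tenant shall indemnify Landlord against all claims arising...",
    "indemnification",
    "Uncapped, one-sided indemnity in favour of the landlord.",
    "high",
)
```

One line per annotated clause, 2,000 reviews in, and the training set mirrors exactly the judgments the firm wants reproduced.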
Fine-tuned model (Llama 3.1 8B + LoRA):
| Metric | Prompt Engineering (GPT-4o) | Fine-Tuned (8B) |
|---|---|---|
| Risk factors correctly identified | 72% | 94% |
| False positives | 18% | 6% |
| Missed critical clauses | 15% | 3% |
| Consistent risk ratings | 61% | 92% |
| Average review time | 45 sec | 12 sec |
| Cost per review | $0.15-0.40 | ~$0 (local) |
The fine-tuned 8B model outperforms the prompted GPT-4o on every metric. It is faster because it is smaller and runs locally. It is cheaper because there are no API charges. And it is more accurate because it has learned the specific patterns this firm cares about.
Why Fine-Tuning Works for Legal Tasks
Pattern imprinting. Fine-tuning embeds the firm's analysis patterns directly into the model weights. The model does not need to be told what a problematic indemnification clause looks like — it has seen hundreds of examples.
Consistency by construction. A fine-tuned model produces more consistent outputs because the training data teaches it a specific analytical framework. The same clause triggers the same assessment.
Speed from compression. A fine-tuned 8B model replaces a far larger prompted frontier model. The task-relevant knowledge has been compressed into a smaller, faster model that excels on the specific task.
Cost at scale. Local inference on a fine-tuned model costs essentially nothing per document. For a firm reviewing thousands of contracts per year, this transforms the economics of AI-assisted review.
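The arithmetic is simple. The per-review cost range comes from the benchmark table above; the 5,000-reviews-per-year volume is an assumed figure for illustration.

```python
# Rough annual-cost comparison. The $0.15-0.40 per-review API cost comes
# from the benchmark table; 5,000 reviews/year is an assumed volume.
reviews_per_year = 5_000
api_cents_low, api_cents_high = 15, 40  # per review, in cents

annual_api_low = reviews_per_year * api_cents_low / 100    # $750
annual_api_high = reviews_per_year * api_cents_high / 100  # $2,000
annual_local = 0  # marginal inference cost on firm-owned hardware
```

At that volume the API bill lands somewhere between $750 and $2,000 per year for this one workflow alone, before accounting for retries and longer documents.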
The Decision Framework
Use this framework to decide whether fine-tuning is worth the investment for a specific legal use case:
Stay with Prompt Engineering If:
- The task is general-purpose (not client or jurisdiction-specific)
- Volume is low (fewer than 100 documents per month)
- Accuracy requirements are moderate (first-pass screening, not final review)
- You do not have historical examples to train on
- The client is in exploration mode and not ready to commit to a specific workflow
Move to Fine-Tuning If:
- The task is repetitive and domain-specific (same document type, same analysis)
- Volume justifies the investment (100+ documents per month)
- Accuracy requirements are high (the output influences legal decisions)
- You have 1,000+ historical examples with quality annotations
- Consistency matters (the same clause must always be flagged the same way)
- Cost matters at scale (API charges are becoming a meaningful expense)
- Data privacy requires local inference
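The checklist above can be condensed into a rough go/no-go heuristic. The thresholds (100 documents per month, 1,000 annotated examples) come straight from the framework; the two-signal cutoff and equal weighting are illustrative, not prescriptive.

```python
def should_fine_tune(docs_per_month: int,
                     annotated_examples: int,
                     high_accuracy_required: bool,
                     consistency_required: bool,
                     local_inference_required: bool) -> bool:
    """Rough go/no-go heuristic for the fine-tuning decision."""
    if annotated_examples < 1_000:
        return False  # without training data, stay with prompting
    signals = [
        docs_per_month >= 100,      # volume justifies the investment
        high_accuracy_required,     # output influences legal decisions
        consistency_required,       # same clause, same flag, every time
        local_inference_required,   # data cannot leave the firm
    ]
    return sum(signals) >= 2  # at least two strong drivers
```

Treat the output as a conversation starter with the client, not a verdict; a single hard requirement like data privacy can outweigh everything else.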
The Hybrid Approach
Many agencies start with prompt engineering to validate the use case, then transition to fine-tuning once the client is committed:
- Month 1-2: Deploy prompt-engineered solution, collect client feedback
- Month 3: Use the accumulated interactions as training data for fine-tuning
- Month 4: Deploy fine-tuned model, compare against prompted baseline
- Ongoing: Retrain periodically as the firm's review standards evolve
This approach de-risks the fine-tuning investment by validating demand before committing resources.
Practical Implementation
For agencies ready to fine-tune legal AI models:
- Data preparation: Export the firm's historical document reviews. Standardise the annotation format. Clean and deduplicate.
- Base model selection: Llama 3.1 8B for standard tasks, 70B for complex multi-step analysis. Smaller models fine-tune faster and run cheaper.
- Fine-tuning: Use Ertas Studio for no-code fine-tuning, or LoRA training if you prefer hands-on control.
- Evaluation: Test on a held-out set of documents the model has never seen. Compare against the prompted baseline on the same documents.
- Deployment: Export to GGUF, deploy via Ollama on the firm's hardware.
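For the hands-on LoRA route, a starting-point configuration might look like the following. This is a sketch assuming the Hugging Face `peft` library; the rank and target modules are common defaults, not tuned values.

```python
from peft import LoraConfig

# Starting-point LoRA hyperparameters; tune rank and target modules
# against your held-out evaluation set.
lora_cfg = LoraConfig(
    r=16,              # adapter rank: capacity vs. adapter-size trade-off
    lora_alpha=32,     # scaling factor, commonly set to 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
```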
The entire process from data preparation to deployed model typically takes 1-2 weeks for an experienced agency.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Fine-Tuning vs. RAG: When to Use Each — Understanding the complementary roles of fine-tuning and retrieval-augmented generation
- How to Fine-Tune an LLM — Step-by-step technical guide to LoRA fine-tuning