    Fine-Tuning vs. Prompt Engineering for Legal Document Review
    Tags: fine-tuning, prompt-engineering, legal, document-review


    When does prompt engineering hit its ceiling for legal AI tasks? A practical comparison of prompt engineering and fine-tuning for contract review, with a decision framework for agencies.

    Ertas Team

    Every AI agency building legal AI tools starts with prompt engineering. It is fast, requires no training data, and works surprisingly well for generic tasks. But as clients demand higher accuracy on their specific document types, prompt engineering hits a ceiling that no amount of clever prompting can break through.

    This article compares the two approaches head-to-head on contract review — one of the most common legal AI use cases — and provides a framework for deciding when to make the jump to fine-tuning.

    Where Prompt Engineering Works

    Prompt engineering is the right starting point. For general legal tasks with well-defined outputs, a carefully crafted prompt with a frontier model (GPT-4o, Claude Sonnet) delivers good results:

    Good prompt engineering use cases:

    • Summarising publicly available case law
    • Generating first drafts of standard legal documents from templates
    • Answering general legal questions (not case-specific)
    • Classifying documents into broad categories (contract, motion, brief, correspondence)

    For these tasks, the model's pre-training knowledge covers the domain well. The prompt provides structure and constraints. Results are acceptable for a first pass that a lawyer reviews.

    Where Prompt Engineering Hits Its Ceiling

    Legal document review — the detailed analysis of contracts, leases, regulatory filings, and similar documents for specific issues — is where prompting breaks down.

    The Contract Review Test

    Consider a practical test: reviewing a commercial lease agreement for a specific client, checking for 25 common risk factors (indemnification clauses, assignment restrictions, termination triggers, insurance requirements, etc.).

    With prompt engineering (GPT-4o):

    System: You are a legal document analyst specialising in commercial leases.
    Review the following lease agreement and identify all instances of the
    following risk factors: [list of 25 risk factors with descriptions]
    For each, provide the relevant clause, your assessment, and a risk rating.
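
    A minimal sketch of running this prompt with the OpenAI Python client (the risk-factor list and lease text are placeholders for the firm's own inputs):

    # Minimal sketch of the prompted baseline via the OpenAI Python client.
    # RISK_FACTORS and lease_text are placeholders for the firm's own inputs.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    RISK_FACTORS = [
        "indemnification clauses",
        "assignment restrictions",
        "termination triggers",
        # ...the remaining 22 factors, each with a short description
    ]

    def review_lease(lease_text: str) -> str:
        system = (
            "You are a legal document analyst specialising in commercial leases. "
            "Review the following lease agreement and identify all instances of "
            "the following risk factors: " + "; ".join(RISK_FACTORS) + ". For each, "
            "provide the relevant clause, your assessment, and a risk rating."
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,  # reduces run-to-run variance but does not eliminate it
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": lease_text},
            ],
        )
        return response.choices[0].message.content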
    

    Results on a benchmark set of 50 leases:

    Metric                                  Score
    Risk factors correctly identified       72%
    False positives (flagged non-issues)    18%
    Missed critical clauses                 15%
    Consistent risk ratings                 61%

    72% identification is impressive for a general-purpose model. But for a law firm, it means missing roughly 1 in 4 relevant clauses. That is not a tool — it is a liability.

    Why Prompting Cannot Close the Gap

    Jurisdiction-specific language. Legal language varies significantly by jurisdiction. A "quiet enjoyment" clause in New South Wales reads differently from one in New York. Prompt engineering cannot encode these differences without making the prompt so long it degrades performance.

    Client-specific risk tolerance. One client considers a 30-day termination notice acceptable. Another requires 90 days minimum. These client-specific thresholds cannot be reliably encoded in prompts.

    Document structure variation. Leases from different counterparties use different structures, numbering systems, and cross-referencing conventions. A general-purpose model struggles to track references across a 60-page document with inconsistent formatting.

    Consistency. The same lease reviewed twice with the same prompt produces different results. For legal work, inconsistency is unacceptable — the firm needs the same clause flagged the same way every time.

    What Fine-Tuning Changes

    Fine-tuning teaches the model the specific patterns, terminology, and judgment criteria that prompting cannot convey. The same contract review task with a fine-tuned model:

    Training data: 2,000 annotated lease reviews from the firm's historical work — clauses tagged with risk factors, assessments, and ratings by experienced lawyers.
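
    The exact record shape depends on the training stack; a hypothetical annotated example in chat-style JSONL (field names and the clause are illustrative, not a required schema):

    # Hypothetical shape of one training record in chat-style JSONL.
    # Field names and the example clause are illustrative.
    import json

    record = {
        "messages": [
            {"role": "system",
             "content": "You are a legal document analyst specialising in commercial leases."},
            {"role": "user",
             "content": "Clause 14.2: The Tenant shall indemnify the Landlord "
                        "against all claims arising from the Tenant's use of the Premises..."},
            {"role": "assistant",
             "content": "Risk factor: indemnification. Assessment: uncapped, one-sided "
                        "indemnity with no carve-out for Landlord negligence. Rating: HIGH."},
        ]
    }

    with open("lease_reviews.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")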

    Fine-tuned model (Llama 3.1 8B + LoRA):

    Metric                              Prompt Engineering (GPT-4o)   Fine-Tuned (8B)
    Risk factors correctly identified   72%                           94%
    False positives                     18%                           6%
    Missed critical clauses             15%                           3%
    Consistent risk ratings             61%                           92%
    Average review time                 45 sec                        12 sec
    Cost per review                     $0.15-0.40                    ~$0 (local)

    The fine-tuned 8B model outperforms the prompted GPT-4o on every metric. It is faster because it is smaller and running locally. It is cheaper because there are no API charges. And it is more accurate because it has learned the specific patterns this firm cares about.

    Pattern imprinting. Fine-tuning embeds the firm's analysis patterns directly into the model weights. The model does not need to be told what a problematic indemnification clause looks like — it has seen hundreds of examples.

    Consistency by construction. A fine-tuned model produces more consistent outputs because the training data teaches it a specific analytical framework. The same clause triggers the same assessment.

    Speed from compression. A fine-tuned 8B model replaces a prompted frontier model many times its size. The task-relevant knowledge has been compressed into a smaller, faster architecture that excels on the specific task.

    Cost at scale. Local inference on a fine-tuned model costs essentially nothing per document. For a firm reviewing thousands of contracts per year, this transforms the economics of AI-assisted review.
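
    As a rough back-of-envelope using the per-review figures from the comparison table (the volume is illustrative; hardware and setup costs for local inference are excluded):

    # Back-of-envelope annual API cost at the table's per-review prices.
    # Volume is illustrative; local hardware and setup costs are excluded.
    reviews_per_year = 5_000
    api_low, api_high = 0.15, 0.40  # $ per review, prompted GPT-4o

    print(f"API: ${reviews_per_year * api_low:,.0f}-${reviews_per_year * api_high:,.0f} per year")  # $750-$2,000
    print("Fine-tuned local model: ~$0 marginal cost per review")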

    The Decision Framework

    Use this framework to decide whether fine-tuning is worth the investment for a specific legal use case. A sketch of the same logic in code follows the two lists.

    Stay with Prompt Engineering If:

    • The task is general-purpose (not client or jurisdiction-specific)
    • Volume is low (fewer than 100 documents per month)
    • Accuracy requirements are moderate (first-pass screening, not final review)
    • You do not have historical examples to train on
    • The client is in exploration mode and not ready to commit to a specific workflow

    Move to Fine-Tuning If:

    • The task is repetitive and domain-specific (same document type, same analysis)
    • Volume justifies the investment (100+ documents per month)
    • Accuracy requirements are high (the output influences legal decisions)
    • You have 1,000+ historical examples with quality annotations
    • Consistency matters (the same clause must always be flagged the same way)
    • Cost matters at scale (API charges are becoming a meaningful expense)
    • Data privacy requires local inference
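
    A minimal sketch of the framework as a checklist function (the thresholds mirror the lists above; the 4-of-6 cutoff is a judgment call, not a rule from the framework):

    # Sketch of the decision framework as a checklist function.
    # Thresholds mirror the lists above; the 4-of-6 cutoff is a judgment call.
    from dataclasses import dataclass

    @dataclass
    class UseCase:
        docs_per_month: int
        domain_specific: bool        # same document type, same analysis
        high_stakes: bool            # output influences legal decisions
        annotated_examples: int      # historical examples with annotations
        needs_consistency: bool
        needs_local_inference: bool  # data privacy requirement

    def recommend(case: UseCase) -> str:
        # Without enough training data, fine-tuning is off the table regardless.
        if case.annotated_examples < 1_000:
            return "prompt engineering (collect annotated examples first)"
        signals = [
            case.docs_per_month >= 100,
            case.domain_specific,
            case.high_stakes,
            True,                    # 1,000+ examples confirmed above
            case.needs_consistency,
            case.needs_local_inference,
        ]
        return "fine-tuning" if sum(signals) >= 4 else "prompt engineering"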

    The Hybrid Approach

    Many agencies start with prompt engineering to validate the use case, then transition to fine-tuning once the client is committed:

    1. Month 1-2: Deploy prompt-engineered solution, collect client feedback
    2. Month 3: Use the accumulated interactions as training data for fine-tuning (see the logging sketch after this list)
    3. Month 4: Deploy fine-tuned model, compare against prompted baseline
    4. Ongoing: Retrain periodically as the firm's review standards evolve
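
    In practice, step 2 only works if the prompted deployment logs every review from day one. A hypothetical capture function (the schema is illustrative):

    # Capture each prompted review, plus the lawyer's correction, so that
    # month 1-2 usage becomes fine-tuning data. Schema is illustrative.
    import datetime
    import json

    def log_interaction(lease_text: str, model_output: str,
                        lawyer_correction: str | None = None,
                        path: str = "interactions.jsonl") -> None:
        record = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "input": lease_text,
            "model_output": model_output,
            # The lawyer's corrected output is the training target; fall back
            # to the model's own answer only when it was accepted unchanged.
            "target": lawyer_correction or model_output,
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")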

    This approach de-risks the fine-tuning investment by validating demand before committing resources.

    Practical Implementation

    For agencies ready to fine-tune legal AI models:

    1. Data preparation: Export the firm's historical document reviews. Standardise the annotation format. Clean and deduplicate.
    2. Base model selection: Llama 3.1 8B for standard tasks, or a larger variant (e.g. 70B) for complex multi-step analysis. Smaller models fine-tune faster and run cheaper.
    3. Fine-tuning: Use Ertas Studio for no-code fine-tuning, or LoRA training if you prefer hands-on control.
    4. Evaluation: Test on a held-out set of documents the model has never seen. Compare against the prompted baseline on the same documents.
    5. Deployment: Export to GGUF, deploy via Ollama on the firm's hardware.
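
    For the hands-on LoRA route (step 3), a minimal training sketch with Hugging Face datasets, peft, and trl (hyperparameters are common starting points, not tuned values; the dataset path is the JSONL from step 1):

    # Minimal LoRA fine-tuning sketch with Hugging Face peft + trl.
    # Hyperparameters are common starting points, not tuned values.
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    train_data = load_dataset("json", data_files="lease_reviews.jsonl", split="train")

    peft_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )

    trainer = SFTTrainer(
        model="meta-llama/Llama-3.1-8B-Instruct",
        train_dataset=train_data,
        peft_config=peft_config,
        args=SFTConfig(
            output_dir="lease-review-lora",
            num_train_epochs=3,
            per_device_train_batch_size=2,
            learning_rate=2e-4,
        ),
    )
    trainer.train()
    trainer.save_model("lease-review-lora")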

    The entire process from data preparation to deployed model typically takes 1-2 weeks for an experienced agency.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
