
Contract Clause Extraction: A Data Preparation Guide for Legal AI
Fine-tuning a model for contract review starts with extracting and annotating clause-level data from contract archives. This guide covers the full preparation pipeline — on-premise and privilege-preserved.
Building a useful contract AI starts with a data problem. The model needs training examples — contracts where specific clauses have been identified, typed, and assessed. Before any training can happen, you need a pipeline that extracts individual clauses from raw contract documents and prepares them for annotation.
This guide covers that pipeline: the extraction steps, the annotation approach, who should do the labeling, the quality requirements, the output format, and the expected dataset size for a viable contract review model.
What Contract AI Models Do
Contract AI operates at the clause level, not the document level. The useful tasks are clause classification (what type is this provision?), obligation extraction (what does this party have to do?), unfavorable clause detection (does this provision deviate from standard terms in ways that create risk?), and comparison (how does this clause differ from our standard position?).
All of these require the model to understand individual clauses in context. A model trained on full contract documents, without clause-level structure, learns to associate contract text with document-level labels — which is useful for some classification tasks but inadequate for clause-level work.
The training data structure that enables these tasks is clause-level: each training example is a single clause (or multi-clause provision), with labels indicating the clause type, the risk classification, and relevant metadata.
What Training Data Is Required
For a clause classification model, each training example is a text span representing one clause or section of a contract, plus a label indicating the clause type. For a risk classification model, each example also has a risk label (standard, non-standard, escalate) indicating whether the clause requires negotiation.
For a multilabel model that classifies clause type and risk together:
{
  "text": "In no event shall either party's liability under this agreement exceed the total fees paid by Customer in the twelve-month period immediately preceding the claim.",
  "clause_type": "limitation_of_liability",
  "risk_level": "standard",
  "governing_law": "New York",
  "agreement_type": "enterprise_software",
  "mutual": true
}
The metadata fields — governing law, agreement type, whether the clause is mutual — are important for the model to learn context-dependent standards. A limitation of liability clause that is standard in a software license agreement may be non-standard in a professional services agreement. Without this context, the model cannot make that distinction.
Extraction Pipeline
Clause extraction has four steps: document ingestion, section segmentation, clause boundary detection, and metadata normalization.
Document ingestion. Contracts arrive as PDFs (court-filed versions, scanned originals, printed and scanned) and Word documents (draft versions, redlines, clean versions). PDF ingestion requires different handling depending on whether the PDF is digitally created (text layer present) or scanned (requires OCR). Word documents should be processed from the .docx format rather than PDF exports, because the .docx preserves heading structure and style information that aids segmentation.
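Routing at ingestion can be as simple as checking whether a PDF yields a usable text layer. A minimal sketch, assuming pypdf and python-docx are available; the function names and the 200-character-per-page threshold are illustrative, not fixed requirements:

# Route each document to the right extraction path. Thresholds and
# library choices (pypdf, python-docx) are illustrative assumptions.
from pypdf import PdfReader
from docx import Document

def pdf_has_text_layer(path, min_chars_per_page=200):
    # Digitally created PDFs yield substantial extractable text per page;
    # scanned PDFs yield little or none and should go to the OCR queue.
    reader = PdfReader(path)
    sample = [reader.pages[i] for i in range(min(5, len(reader.pages)))]
    chars = sum(len(page.extract_text() or "") for page in sample)
    return chars / max(len(sample), 1) >= min_chars_per_page

def read_docx_with_styles(path):
    # Keep paragraph style names so heading structure survives into segmentation.
    doc = Document(path)
    return [{"text": p.text, "style": p.style.name}
            for p in doc.paragraphs if p.text.strip()]

def route_document(path):
    lower = path.lower()
    if lower.endswith(".docx"):
        return "docx_parser"
    if lower.endswith(".pdf"):
        return "text_extraction" if pdf_has_text_layer(path) else "ocr_queue"
    return "manual_review"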
For scanned PDFs — which are common in transaction archives for older deals — OCR must run first. OCR quality on contract documents is generally high because contracts use standard body fonts with high contrast. The main OCR failures are: signature pages with mixed handwriting and print, documents with stamps or annotations overlaid on text, and documents that were faxed and then scanned (double degradation).
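Where OCR runs inside the pipeline rather than in a dedicated tool, a minimal pass might look like the sketch below, assuming pdf2image (Poppler) and pytesseract (Tesseract) are installed; the DPI value is an assumption, and the failure cases above still need manual review downstream:

# Render each page of a scanned PDF to an image and OCR it.
from pdf2image import convert_from_path
import pytesseract

def ocr_scanned_pdf(path, dpi=300):
    pages = convert_from_path(path, dpi=dpi)
    page_text = [pytesseract.image_to_string(page) for page in pages]
    return "\n\f\n".join(page_text)  # keep page breaks as form feeds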
Section segmentation. After ingestion, the contract text is split into sections. A section corresponds to a numbered heading at the agreement's top level of structure (1. Definitions, 2. Services, 3. Fees, etc.). Sections are identified by heading format — numbered headings in a larger or bolder font than body text.
The challenge is that contract heading formats are not standardized. Some agreements use Roman numerals (I., II., III.), some use decimal numbering (1.1, 1.2), some use alphabetical lettering (A., B.), and some use only text with no number prefix. A segmentation approach that relies on a single numbering format will miss sections in differently formatted contracts.
A robust segmentation approach combines format detection (inferring the numbering style used in a specific document from its first 20 or so candidate headings) with a model-based fallback for documents with unusual formatting.
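As a sketch of the format-detection half, the patterns below cover the common numbering styles; the pattern set, the resolution order, and the 20-heading sample size are assumptions rather than a complete solution:

# Tally which heading pattern the first candidate headings match and pick
# the dominant one. Patterns overlap (e.g. "I." vs "C."), so check order is
# itself a judgment call; no clear winner means fall back to a model-based
# segmenter.
import re

HEADING_PATTERNS = {
    "decimal": re.compile(r"^\s*\d+(\.\d+)*[.)]?\s+\S"),      # 1.  1.1  3)
    "roman":   re.compile(r"^\s*[IVXLC]+[.)]\s+\S"),          # I.  II.  III.
    "alpha":   re.compile(r"^\s*[A-Z][.)]\s+\S"),             # A.  B.
    "article": re.compile(r"^\s*ARTICLE\s+[\dIVXLC]+", re.I), # ARTICLE IV
}

def detect_numbering_style(lines, sample_size=20):
    counts = {name: 0 for name in HEADING_PATTERNS}
    hits = 0
    for line in lines:
        for name, pattern in HEADING_PATTERNS.items():
            if pattern.match(line):
                counts[name] += 1
                hits += 1
                break
        if hits >= sample_size:
            break
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None  # None -> model-based fallback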
Clause boundary detection. Within a section, individual clauses must be separated. A "section" in a contract might contain a single sentence or thirty. The boundaries between clauses within a section correspond to logical breaks in the subject matter — a transition from one topic (what the licensor grants) to another (what the licensee may not do).
Clause boundary detection at this level requires semantic understanding, not just formatting cues. Within a section, paragraph breaks are unreliable indicators of clause boundaries — some provisions span multiple paragraphs without being distinct clauses, and some distinct clauses share a single paragraph.
For training data preparation, a pragmatic approach: segment at the section level for simple sections (fewer than 200 words), and at the subsection level (using decimal numbering like 5.1, 5.2) for complex sections. This is not perfect clause segmentation, but it produces segments that are small enough to be meaningful training units without requiring the full complexity of semantic clause boundary detection.
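A sketch of that rule, with the 200-word threshold and the decimal-subsection pattern as stated assumptions:

# Short sections stay whole; long sections are split at subsection numbers.
import re

SUBSECTION = re.compile(r"(?m)^\s*(\d+\.\d+)\s+")

def segment_section(section_text, max_words=200):
    if len(section_text.split()) <= max_words:
        return [section_text.strip()]
    parts = SUBSECTION.split(section_text)
    if len(parts) <= 1:
        # Long section with no subsection numbering: leave whole, flag for review.
        return [section_text.strip()]
    segments = []
    if parts[0].strip():
        segments.append(parts[0].strip())                      # preamble before 5.1, if any
    for i in range(1, len(parts), 2):
        segments.append(f"{parts[i]} {parts[i + 1].strip()}")  # "5.1 <text>"
    return [s for s in segments if s]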
Metadata normalization. After segmentation, each clause segment needs metadata extracted from the document: governing law (typically in a "Governing Law" section), agreement type (often in the title or recitals), date of execution (often in the signature block), and party types (vendor/customer, employer/employee, etc.).
These fields are not always present or consistently formatted. A metadata extraction step runs on each document before clause-level annotation begins, and documents with missing critical metadata are flagged for manual completion.
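As an illustration, a first-pass extractor for governing law might look like the following; the regex is a simplification of real drafting variation, and anything it misses is flagged rather than guessed:

# Pull the governing law from typical drafting ("governed by and construed
# in accordance with the laws of the State of New York"). The pattern and
# field names are illustrative.
import re

GOVERNING_LAW = re.compile(
    r"governed\s+by\s+(?:and\s+construed\s+in\s+accordance\s+with\s+)?"
    r"the\s+laws?\s+of\s+(?:the\s+)?(?:State\s+of\s+)?([A-Z][A-Za-z ]+?)[,.;]",
    re.IGNORECASE,
)

def extract_metadata(full_text):
    meta = {"governing_law": None, "needs_manual_completion": False}
    match = GOVERNING_LAW.search(full_text)
    if match:
        meta["governing_law"] = match.group(1).strip().title()
    else:
        meta["needs_manual_completion"] = True
    return meta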
Who Does the Annotation
The annotation task for contract review is clause-type classification and risk assessment. This is a legal judgment task, not a mechanical labeling task.
For clause-type classification, a paralegal with contract review experience can reliably classify most clause types: limitation of liability, indemnification, confidentiality, change of control, assignment, dispute resolution, termination, IP ownership, warranty, and the other standard provisions. Edge cases — provisions that combine elements of multiple clause types, unusual bespoke provisions, jurisdiction-specific drafting patterns — require associate-level input.
For risk assessment, the classification is more judgment-intensive. "Is this limitation of liability clause standard or non-standard?" requires knowing what the firm's standard position is, what the client's risk tolerance is, and how the clause compares to typical market terms. This requires associate or senior associate input, particularly during guideline development.
The practical annotation workflow: associates design the annotation schema and write the guidelines. Paralegals apply labels to the bulk of the documents. Associates review a sample (10–15%) and handle escalated edge cases. Partners review the guidelines before the project begins and after the first 50 contracts, to calibrate the risk classification definitions.
Quality Requirements
Inter-annotator agreement. For clause-type classification, Cohen's kappa above 0.75 is a reasonable target. Below 0.70 indicates guideline ambiguity that will produce noisy training data. For risk classification, agreement will be naturally lower (0.60–0.70 is typical) because risk assessment involves judgment — but systematic disagreements (some annotators consistently rate certain clause types as higher risk than others) indicate calibration issues that should be resolved through guideline revision, not averaged away in the data.
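Agreement is straightforward to check per batch; a minimal example using scikit-learn's cohen_kappa_score, with placeholder labels:

# Two annotators' clause-type labels for the same four clauses, in order.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["limitation_of_liability", "indemnification", "confidentiality", "termination"]
annotator_b = ["limitation_of_liability", "indemnification", "warranty", "termination"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Clause-type kappa: {kappa:.2f}")  # below ~0.70 usually signals guideline ambiguity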
Annotation guideline specificity. The annotation guidelines must include: the full list of clause types, a one-sentence definition for each, two to three positive examples of each type, one to two negative examples (common confusables), and decision rules for clauses that could fit multiple types. Without this specificity, inter-annotator agreement will be low.
Edge case handling. The guidelines must specify how to handle: clauses that combine two types (label with the primary type, flag as multi-type), clauses that are too short to classify reliably (minimum length threshold, flag for review), and clauses with highly unusual drafting (annotate with the closest type, flag as unusual).
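One way these rules can carry into the data is as explicit flags on each record, so flagged clauses are easy to filter or route for review; the field names and the minimum-length threshold below are assumptions:

# Apply the edge-case decision rules as flags on an annotated record.
def apply_edge_case_flags(record, min_words=8):
    flags = []
    if len(record["text"].split()) < min_words:
        flags.append("too_short")               # below minimum length threshold
    if record.get("secondary_clause_type"):
        flags.append("multi_type")              # labeled with the primary type
    if record.get("unusual_drafting"):
        flags.append("unusual")                 # closest type assigned, needs review
    record["flags"] = flags
    return record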
Output Format
Annotated clause data is exported in JSONL for fine-tuning:
{"text": "...", "clause_type": "limitation_of_liability", "risk_level": "standard", "governing_law": "Delaware", "agreement_type": "enterprise_software", "mutual": true, "word_count": 52}
{"text": "...", "clause_type": "indemnification", "risk_level": "escalate", "governing_law": "California", "agreement_type": "professional_services", "mutual": false, "word_count": 241}
The training set should be stratified: roughly equal representation of each clause type (or weighted by how often each type appears in real review work), and a deliberate oversample of risk-escalate examples, which are rare in practice but critical for the model to learn.
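A sketch of that stratification, with the per-type cap as an assumed parameter; escalate examples are always retained before the remainder is sampled:

# Cap each clause type at a target count while keeping every
# risk-escalate example.
import random
from collections import defaultdict

def stratify(records, per_type_cap=1000, seed=13):
    random.seed(seed)
    by_type = defaultdict(list)
    for r in records:
        by_type[r["clause_type"]].append(r)
    sampled = []
    for clause_type, group in by_type.items():
        escalate = [r for r in group if r["risk_level"] == "escalate"]
        other = [r for r in group if r["risk_level"] != "escalate"]
        random.shuffle(other)
        sampled.extend(escalate + other[:max(per_type_cap - len(escalate), 0)])
    return sampled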
Expected Dataset Size for a Viable Model
For a clause classification model covering the 15 most common clause types:
- Minimum viable: 200 annotated contracts, approximately 4,000–8,000 clause examples across all types
- Useful: 350 annotated contracts, 8,000–15,000 clause examples with good type distribution
- Strong: 500+ annotated contracts, 15,000–25,000 clause examples with metadata variation across agreement types and governing laws
For a risk classification model, the dataset requirements are similar but the rare class problem is more acute. Risk-escalate examples may represent only 5–10% of all clauses in a real archive. Deliberately oversampling high-risk clauses during annotation (annotators are assigned to batches of documents known to contain unusual provisions) helps address this imbalance.
A 350-contract annotation project, at 2–3 hours per contract including quality review, represents approximately 700–1,050 hours of legal professional time. At typical paralegal rates, this is a meaningful but not prohibitive investment — and it produces a training dataset that cannot be replicated by any vendor selling generic legal AI.