Legal Contract Review Dataset Template

    A dataset template for training AI models to identify clauses, flag risks, and classify legal contract provisions.

    Classification

    Overview

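    Legal contract review datasets train AI models to analyze contracts: identifying specific clause types, flagging potentially risky provisions, extracting key terms, and classifying contract sections by legal function. These datasets let organizations automate first-pass contract review, reducing the time lawyers spend on routine document review while ensuring that high-risk provisions still receive appropriate human attention.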

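    The dataset structure must reflect the hierarchical nature of legal documents. Contracts contain sections, clauses, and sub-clauses, each serving a specific legal purpose. Training data should include examples of common clause types: limitation of liability, indemnification, termination, confidentiality, governing law, assignment, force majeure, and intellectual property. Each clause should be labeled with its type, its risk level (standard, favorable, unfavorable, or missing), and any specific concerns, such as a one-sided indemnification obligation or an overly broad non-compete provision.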

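    Contract review AI is particularly sensitive to training data quality because errors can have serious legal and financial consequences. Datasets must be reviewed by qualified legal professionals who understand the nuances of contract language. Context matters a great deal: the same clause wording may be standard in one type of agreement and highly unusual in another. Training data should cover multiple contract types (non-disclosure agreements, service agreements, employment contracts, license agreements, partnership agreements) so that the model generalizes appropriately across document categories.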

    Dataset Schema

    typescript
    interface ContractClauseExample {
      instruction: string; // The task given to the model, e.g. classify or assess a clause
      input: string;       // The contract clause or section text
      output: string;      // Classification, risk assessment, or extraction result
      metadata: {
        contract_type: "NDA" | "MSA" | "SaaS" | "Employment" | "License" | "Partnership";
        clause_type: string; // e.g. "indemnification", "limitation-of-liability"
        risk_level: "standard" | "favorable" | "unfavorable" | "critical";
        jurisdiction: string; // e.g. "US-General"
      };
    }
    Schema for legal contract review training examples, including clause metadata
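A TypeScript interface is erased at compile time, so records loaded from JSON are not checked against it automatically. The following is a minimal sketch of a runtime check for the schema above. The helper names `validateExample` and `isNonEmptyString` are hypothetical and not part of any library.

```typescript
// Allowed values, copied from the ContractClauseExample schema.
const CONTRACT_TYPES: readonly string[] = ["NDA", "MSA", "SaaS", "Employment", "License", "Partnership"];
const RISK_LEVELS: readonly string[] = ["standard", "favorable", "unfavorable", "critical"];

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.trim() !== "";
}

// Hypothetical helper: returns a list of problems; an empty list means the record conforms.
function validateExample(record: unknown): string[] {
  if (typeof record !== "object" || record === null) {
    return ["record is not an object"];
  }
  const fields = record as { [key: string]: unknown };
  const problems: string[] = [];
  ["instruction", "input", "output"].forEach((name) => {
    if (!isNonEmptyString(fields[name])) {
      problems.push(name + " must be a non-empty string");
    }
  });
  const metadata = fields["metadata"];
  if (typeof metadata !== "object" || metadata === null) {
    problems.push("metadata must be an object");
    return problems;
  }
  const meta = metadata as { [key: string]: unknown };
  const contractType = meta["contract_type"];
  if (typeof contractType !== "string" || CONTRACT_TYPES.indexOf(contractType) === -1) {
    problems.push("metadata.contract_type is not an allowed value");
  }
  const riskLevel = meta["risk_level"];
  if (typeof riskLevel !== "string" || RISK_LEVELS.indexOf(riskLevel) === -1) {
    problems.push("metadata.risk_level is not an allowed value");
  }
  ["clause_type", "jurisdiction"].forEach((name) => {
    if (!isNonEmptyString(meta[name])) {
      problems.push("metadata." + name + " must be a non-empty string");
    }
  });
  return problems;
}
```

Running this over every record before training surfaces label drift early. For instance, a record labeled with a risk level of "missing" is rejected, because the schema as written allows only the four values listed in `RISK_LEVELS`.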

    Sample Data

    json
    [
      {
        "instruction": "Classify the following contract clause and assess its risk level for the receiving party.",
        "input": "The Receiving Party shall indemnify, defend, and hold harmless the Disclosing Party from and against any and all claims, damages, losses, costs, and expenses (including reasonable attorneys' fees) arising from or related to any breach of this Agreement by the Receiving Party, including but not limited to any unauthorized disclosure or use of Confidential Information.",
        "output": "Clause Type: Indemnification\nRisk Level: Standard\n\nAnalysis: This is a standard one-way indemnification clause in an NDA context. The Receiving Party indemnifies the Disclosing Party for breaches of confidentiality obligations. The scope is reasonable — it covers damages arising from breach of the agreement, specifically unauthorized disclosure. The inclusion of attorneys' fees is standard. No unusual or overly broad provisions detected.\n\nRecommendation: Acceptable as-is for standard NDA.",
        "metadata": {
          "contract_type": "NDA",
          "clause_type": "indemnification",
          "risk_level": "standard",
          "jurisdiction": "US-General"
        }
      },
      {
        "instruction": "Identify any concerning provisions in this limitation of liability clause.",
        "input": "IN NO EVENT SHALL PROVIDER BE LIABLE FOR ANY INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL, OR PUNITIVE DAMAGES, REGARDLESS OF THE CAUSE OF ACTION OR THE THEORY OF LIABILITY. PROVIDER'S TOTAL AGGREGATE LIABILITY UNDER THIS AGREEMENT SHALL NOT EXCEED THE FEES PAID BY CUSTOMER IN THE TWELVE (12) MONTH PERIOD IMMEDIATELY PRECEDING THE EVENT GIVING RISE TO THE CLAIM. THIS LIMITATION SHALL APPLY NOTWITHSTANDING THE FAILURE OF ESSENTIAL PURPOSE OF ANY LIMITED REMEDY.",
        "output": "Clause Type: Limitation of Liability\nRisk Level: Unfavorable (for Customer)\n\nConcerns Identified:\n1. **No carve-outs**: The limitation applies universally with no exceptions for gross negligence, willful misconduct, IP infringement, or confidentiality breaches. Industry standard is to carve out at least IP indemnification and confidentiality obligations.\n2. **Low liability cap**: 12-month trailing fees may be insufficient for enterprise agreements. Consider negotiating 24-month or a fixed dollar floor.\n3. **Failure of essential purpose language**: The final sentence attempts to preserve the limitation even when a remedy fails its essential purpose, which may be unenforceable in some jurisdictions but signals aggressive positioning.\n\nRecommendation: Negotiate carve-outs for IP, confidentiality, and data breach. Consider a higher liability cap or a super-cap for critical obligations.",
        "metadata": {
          "contract_type": "SaaS",
          "clause_type": "limitation-of-liability",
          "risk_level": "unfavorable",
          "jurisdiction": "US-General"
        }
      }
    ]
    Legal contract review examples demonstrating clause classification and risk assessment

    Data Collection Guide

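    Source training data from your organization's historical contract review memos, redline comments, and clause libraries. Legal teams typically maintain databases of approved clause language, negotiation playbooks, and risk assessment templates, all of which are strong sources of training examples. Convert clause library entries into instruction-response pairs in which the instruction asks the model to classify or assess a clause and the output provides analysis at the level of a junior associate.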
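The conversion from clause library entries to instruction-response pairs can be sketched as below. The `ClauseLibraryEntry` shape and the `toTrainingExample` helper are assumptions made for illustration; real clause libraries will store these fields differently, and the instruction wording is only one possible phrasing.

```typescript
type RiskLevel = "standard" | "favorable" | "unfavorable" | "critical";
type ContractType = "NDA" | "MSA" | "SaaS" | "Employment" | "License" | "Partnership";

// Hypothetical shape of one clause-library entry; adapt the fields to your own library.
interface ClauseLibraryEntry {
  clauseName: string;      // display name, e.g. "Limitation of Liability"
  clauseText: string;
  riskLevel: RiskLevel;
  analysis: string;        // the reviewing lawyer's written assessment
  recommendation: string;
  contractType: ContractType;
  jurisdiction: string;
}

interface ContractClauseExample {
  instruction: string;
  input: string;
  output: string;
  metadata: {
    contract_type: ContractType;
    clause_type: string;
    risk_level: RiskLevel;
    jurisdiction: string;
  };
}

function capitalize(word: string): string {
  return word.charAt(0).toUpperCase() + word.slice(1);
}

// Builds one training example whose output follows the
// "Clause Type / Risk Level / Analysis / Recommendation" layout.
function toTrainingExample(entry: ClauseLibraryEntry): ContractClauseExample {
  return {
    instruction: "Classify the following contract clause and assess its risk level.",
    input: entry.clauseText,
    output: [
      "Clause Type: " + entry.clauseName,
      "Risk Level: " + capitalize(entry.riskLevel),
      "",
      "Analysis: " + entry.analysis,
      "",
      "Recommendation: " + entry.recommendation,
    ].join("\n"),
    metadata: {
      contract_type: entry.contractType,
      // Slug form of the display name, e.g. "limitation-of-liability".
      clause_type: entry.clauseName.trim().toLowerCase().replace(/\s+/g, "-"),
      risk_level: entry.riskLevel,
      jurisdiction: entry.jurisdiction,
    },
  };
}
```

Generating the output text from structured fields, rather than pasting free-form memos, keeps every example in the same layout.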

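    PII redaction is critical for legal training data. Contract text contains party names, addresses, financial terms, and other confidential business information. Use the PII redaction feature of Ertas Data Suite to replace company names with generic placeholders (Party A, Party B), mask specific amounts while preserving their relative magnitude, and remove addresses and contact information. Preserve clause structure and legal terminology while stripping identifying information.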
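To make the two text transformations concrete, here is a rough sketch of placeholder substitution and magnitude-preserving amount masking using plain regular expressions. This is not the Ertas Data Suite API, whose interface is not described here. The function names are hypothetical, the party names must be supplied by the caller, and only dollar amounts are handled.

```typescript
// Escape a literal string so it can be embedded safely in a RegExp.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace each known party name with "Party A", "Party B", ... in the order given.
// Assumes at most 26 names and that no name is a substring of a later one.
function redactParties(text: string, partyNames: string[]): string {
  let result = text;
  partyNames.forEach((name, index) => {
    const placeholder = "Party " + String.fromCharCode(65 + index);
    result = result.replace(new RegExp(escapeRegExp(name), "g"), placeholder);
  });
  return result;
}

// Replace dollar amounts with a marker that keeps only the number of digits
// before the decimal point, so relative magnitude survives redaction.
function maskAmounts(text: string): string {
  const amountPattern = /\$\s?(\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?/g;
  return text.replace(amountPattern, (_match: string, whole: string) => {
    const digitCount = whole.replace(/,/g, "").length;
    return "$[" + digitCount + "-digit amount]";
  });
}
```

Applied in sequence, "Acme Corp shall pay Globex LLC $1,250,000.00" becomes "Party A shall pay Party B $[7-digit amount]". Addresses and contact details need separate handling and are out of scope for this sketch.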

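    Engage practicing lawyers to review and validate training examples, especially the risk assessments. The distinction between a "standard" and an "unfavorable" clause often depends on context, jurisdiction, and deal dynamics, and judging it requires legal expertise. Plan for each example to be reviewed by at least two lawyers to establish a consistent quality standard, with disagreements resolved through review by a senior lawyer.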
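The two-reviewer rule can be tracked with a small reconciliation step. The sketch below assumes each review is recorded as a hypothetical `ReviewLabel` and sorts examples into three groups: agreed, needing a second reviewer, and needing escalation to a senior lawyer.

```typescript
type RiskLevel = "standard" | "favorable" | "unfavorable" | "critical";

// Hypothetical record of one lawyer's risk label for one example.
interface ReviewLabel {
  exampleId: string;
  reviewer: string;
  riskLevel: RiskLevel;
}

interface ReviewOutcome {
  agreed: { [exampleId: string]: RiskLevel };
  needsSeniorReview: string[];    // reviewers disagree
  needsSecondReviewer: string[];  // fewer than two distinct reviewers
}

function reconcileReviews(labels: ReviewLabel[]): ReviewOutcome {
  // Group labels by example.
  const byExample: { [exampleId: string]: ReviewLabel[] } = {};
  labels.forEach((label) => {
    const bucket = byExample[label.exampleId] || [];
    bucket.push(label);
    byExample[label.exampleId] = bucket;
  });

  const outcome: ReviewOutcome = { agreed: {}, needsSeniorReview: [], needsSecondReviewer: [] };
  Object.keys(byExample).forEach((exampleId) => {
    const reviews = byExample[exampleId] || [];
    const reviewers: { [name: string]: true } = {};
    const levels: { [level: string]: true } = {};
    reviews.forEach((review) => {
      reviewers[review.reviewer] = true;
      levels[review.riskLevel] = true;
    });
    const first = reviews[0];
    if (first === undefined || Object.keys(reviewers).length < 2) {
      outcome.needsSecondReviewer.push(exampleId);
    } else if (Object.keys(levels).length === 1) {
      outcome.agreed[exampleId] = first.riskLevel;
    } else {
      outcome.needsSeniorReview.push(exampleId);
    }
  });
  return outcome;
}
```

Only examples in `agreed` would go into the training set directly; the other two lists are work queues for the legal team.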

    Quality Criteria

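    Legal accuracy is the most important quality criterion. Every risk assessment, clause classification, and recommendation in the training data must be legally defensible. Have qualified lawyers verify that clause types are identified correctly, that risk levels accurately reflect the party's exposure, and that recommendations are consistent with sound legal practice. Incorrect legal analysis in the training data will lead the model to give dangerously wrong advice.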

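    Ensure coverage across multiple contract types, clause types, and jurisdictions. The dataset should include examples from at least 5-6 contract types and cover all common clause categories (10-15 of them). Include examples of both well-drafted and poorly drafted clauses, because the model needs to recognize poor drafting as a risk factor. Include missing-clause examples (identifying that a contract lacks a standard provision), since this is one of the most valuable capabilities of contract review AI.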
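The coverage targets above can be checked mechanically. The sketch below counts distinct contract types and clause types in the metadata and compares them with the lower bounds from the text (5 contract types, 10 clause categories). The `coverageReport` helper and its report shape are hypothetical.

```typescript
// Only the metadata fields needed for the coverage check.
interface ExampleMetadata {
  contract_type: string;
  clause_type: string;
}

interface CoverageReport {
  contractTypeCounts: { [type: string]: number };
  clauseTypeCounts: { [type: string]: number };
  enoughContractTypes: boolean;
  enoughClauseTypes: boolean;
}

// Count how often each value occurs.
function countBy(values: string[]): { [value: string]: number } {
  const counts: { [value: string]: number } = {};
  values.forEach((value) => {
    counts[value] = (counts[value] || 0) + 1;
  });
  return counts;
}

function coverageReport(
  rows: ExampleMetadata[],
  minContractTypes: number = 5,
  minClauseTypes: number = 10
): CoverageReport {
  const contractTypeCounts = countBy(rows.map((row) => row.contract_type));
  const clauseTypeCounts = countBy(rows.map((row) => row.clause_type));
  return {
    contractTypeCounts: contractTypeCounts,
    clauseTypeCounts: clauseTypeCounts,
    enoughContractTypes: Object.keys(contractTypeCounts).length >= minContractTypes,
    enoughClauseTypes: Object.keys(clauseTypeCounts).length >= minClauseTypes,
  };
}
```

The per-type counts also show imbalance: a dataset can meet both thresholds while still having only a handful of examples for a rare clause type.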

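    Verify that the output format is consistent across all examples. Every analysis should follow the same structure (clause type, risk level, analysis, recommendation) to train the model to produce predictable, parseable output. Test whether model outputs can be integrated into downstream legal workflow tools that expect structured risk assessments.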
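One way to test format consistency is to parse every `output` string with a single pattern and flag the ones that fail. The sketch below assumes the layout used in the sample data: a "Clause Type:" line, a "Risk Level:" line, a blank line, a body section, a blank line, and a final "Recommendation:" section. The names `parseReview` and `ParsedReview` are hypothetical.

```typescript
interface ParsedReview {
  clauseType: string;
  riskLevel: string;       // first word of the "Risk Level" line, lowercased
  body: string;            // the analysis or concerns section
  recommendation: string;
}

// Expected layout of an output string, anchored to the whole string.
const REVIEW_PATTERN = /^Clause Type: (.+)\nRisk Level: (.+)\n\n([\s\S]+?)\n\nRecommendation: ([\s\S]+)$/;

// Returns the parsed sections, or null when the output does not follow the layout.
function parseReview(output: string): ParsedReview | null {
  const match = REVIEW_PATTERN.exec(output);
  if (match === null) {
    return null;
  }
  const riskLine = (match[2] || "").trim();
  return {
    clauseType: (match[1] || "").trim(),
    riskLevel: (riskLine.split(/\s+/)[0] || "").toLowerCase(),
    body: (match[3] || "").trim(),
    recommendation: (match[4] || "").trim(),
  };
}
```

The same parser can run on model outputs after fine-tuning, which gives a simple parse-rate metric for how reliably the model follows the format.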

    Using This Template with Ertas

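    Import contract text into Ertas Data Suite for PII redaction, removing party names, financial terms, and other confidential information while preserving legal language and clause structure. The data lineage tracking system records every redaction operation, giving legal compliance teams the audit trail they need. Export the cleaned dataset in Alpaca or JSONL format for fine-tuning.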
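For readers unfamiliar with the two export shapes, the sketch below shows what they look like when produced by hand. It is not the Ertas Data Suite exporter. It assumes the common Alpaca convention of a JSON array of `instruction` / `input` / `output` records, and JSONL as one complete JSON record per line; the function names are hypothetical.

```typescript
interface ContractClauseExample {
  instruction: string;
  input: string;
  output: string;
  metadata: {
    contract_type: string;
    clause_type: string;
    risk_level: string;
    jurisdiction: string;
  };
}

interface AlpacaRecord {
  instruction: string;
  input: string;
  output: string;
}

// Keep only the three fields used by Alpaca-style fine-tuning data.
function toAlpacaRecord(example: ContractClauseExample): AlpacaRecord {
  return { instruction: example.instruction, input: example.input, output: example.output };
}

// Alpaca style: a single JSON array of records.
function toAlpacaJson(examples: ContractClauseExample[]): string {
  return JSON.stringify(examples.map(toAlpacaRecord), null, 2);
}

// JSONL: one full record (metadata included) per line.
// JSON.stringify escapes newlines inside strings, so each record stays on one line.
function toJsonl(examples: ContractClauseExample[]): string {
  return examples.map((example) => JSON.stringify(example)).join("\n");
}
```

Whether metadata is kept in the export is a design choice: it is useful for filtering and evaluation splits, but most fine-tuning loaders read only the three Alpaca fields.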

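    An on-premises deployment architecture is especially important for legal data, which is often subject to attorney-client privilege and strict confidentiality obligations. Process contract text in Ertas Data Suite's air-gapped environment so that privileged information never leaves your organization's infrastructure.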

    Recommended Model

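    Legal contract review benefits from larger models that can handle the complex reasoning that risk assessment requires. Start with a model in the 13B-14B parameter range for more nuanced analysis. Note that Llama 3.1 is available only in 8B, 70B, and 405B sizes, so a 13B-14B model has to come from another family. For simpler clause classification tasks (identifying the clause type only, with no risk assessment), a 7B-8B model provides adequate performance and faster inference.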

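    Consider domain-adaptive pre-training on a large corpus of legal text before supervised fine-tuning. Models pre-trained on general text may struggle with legal terminology, citation formats, and the complex sentence structures common in contracts.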
