Summarization Training Dataset Template

    A dataset template for training AI models to produce accurate, concise summaries of documents, articles, meeting transcripts, and reports.

    Overview

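    Summarization datasets train AI models to condense long documents into shorter, coherent summaries that capture the most important information. Applications include enterprise document summarization (condensing lengthy reports, research papers, and legal filings), meeting summarization (extracting key decisions and action items), news summarization (producing concise article summaries for briefings), and customer interaction summarization (summarizing support conversations for handoffs and record-keeping).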

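    Two main approaches define summarization training data. Extractive summarization selects and combines the most important sentences from the source document, so the training data labels which sentences to include. Abstractive summarization generates new text that paraphrases and synthesizes the source content, so the training data pairs source documents with human-written summaries. Modern LLMs excel at abstractive summarization, which makes it the more common and flexible approach in contemporary training datasets.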

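    The key challenge for a summarization dataset is defining what makes a good summary. A summary should be faithful (containing only information from the source document), complete (covering all key points), concise (shorter than the source by the target compression ratio), coherent (reading smoothly and well organized), and useful (meeting the reader's information needs). These qualities must be present in the training examples and measured through the quality validation process.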

    Dataset Schema

    typescript
    interface SummarizationExample {
      instruction: string;
      input: string;            // Source document text
      output: string;           // Target summary
      metadata: {
        source_type: "article" | "report" | "transcript" | "email_thread" | "legal_filing";
        source_word_count: number;
        summary_word_count: number;
        compression_ratio: number;   // source_words / summary_words
        summary_style: "executive" | "technical" | "bullet_points" | "abstract";
        key_topics: string[];
      };
    }
    Schema for summarization training data, including compression ratio and style metadata
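
    Because `compression_ratio` is derived from the two word counts, the metadata can drift out of sync with the text. The following is a minimal sketch of a consistency check, not part of the template itself. The names `ExampleForCheck`, `countWords`, and `checkMetadata` are hypothetical, and whitespace-based word counting and the 0.1 tolerance are assumptions, since the template does not say how words are counted.

```typescript
// Minimal shape needed for the check; mirrors the relevant schema fields.
interface ExampleForCheck {
  input: string;
  output: string;
  metadata: {
    source_word_count: number;
    summary_word_count: number;
    compression_ratio: number;
  };
}

// Whitespace-based word count (an assumption, not defined by the template).
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Returns a list of human-readable problems; an empty list means the
// metadata is internally consistent within the given tolerance.
function checkMetadata(ex: ExampleForCheck, tolerance = 0.1): string[] {
  const problems: string[] = [];
  const { source_word_count, summary_word_count, compression_ratio } = ex.metadata;

  if (summary_word_count <= 0) {
    problems.push("summary_word_count must be positive");
    return problems;
  }

  // compression_ratio is defined as source_words / summary_words.
  const expected = source_word_count / summary_word_count;
  if (Math.abs(expected - compression_ratio) > tolerance) {
    problems.push(
      `compression_ratio ${compression_ratio} does not match ${expected.toFixed(2)}`
    );
  }

  // A summary that is not shorter than its source is not a summary.
  if (countWords(ex.output) >= countWords(ex.input)) {
    problems.push("summary is not shorter than the source");
  }
  return problems;
}
```

    Running a check like this over every record before training is one way to catch stale metadata after summaries are edited.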

    Sample Data

    json
    [
      {
        "instruction": "Summarize the following meeting transcript, highlighting key decisions and action items.",
        "input": "QUARTERLY PRODUCT REVIEW - MARCH 2026\n\nAttendees: VP Engineering, Product Director, Design Lead, QA Manager\n\nProduct Director opened the meeting reviewing Q1 metrics. User engagement increased 23% quarter-over-quarter, driven primarily by the new dashboard redesign launched in February. Monthly active users reached 145,000, exceeding the Q1 target of 130,000. However, the mobile app retention rate dropped from 42% to 38%, which the team identified as a priority concern.\n\nVP Engineering presented the technical roadmap. The team completed 47 of 52 planned story points. The five incomplete items were deferred to Q2 due to an unexpected security patch that consumed two engineering weeks. The security vulnerability (CVE-2026-1847) was identified through the bug bounty program and patched within 48 hours of discovery.\n\nDesign Lead shared user research findings from 30 customer interviews. Three key themes emerged: users want better data export capabilities, the onboarding flow is confusing for non-technical users, and enterprise customers need SSO integration. The team agreed to prioritize SSO for Q2 based on three pending enterprise deals totaling $850K ARR that list SSO as a requirement.\n\nQA Manager reported that automated test coverage reached 78%, up from 71% in Q4. Two critical bugs were found in production during Q1, both related to timezone handling in scheduled reports. A dedicated testing sprint for timezone-related features was proposed for early Q2.\n\nDecisions made: (1) SSO integration prioritized for Q2 sprint 1-2, (2) Mobile retention improvement project approved with dedicated PM allocation, (3) Onboarding redesign to begin in Q2 sprint 3.\n\nAction items assigned: Engineering to scope SSO implementation by March 22, Design to create onboarding improvement mockups by April 1, QA to plan timezone testing sprint, Product to draft mobile retention improvement brief.",
        "output": "Q1 Product Review Summary:\n\nKey Metrics: User engagement up 23% QoQ. MAU reached 145K (vs. 130K target). Mobile retention declined from 42% to 38%, flagged as priority concern.\n\nEngineering: Completed 47/52 story points. Shortfall due to emergency security patch (CVE-2026-1847), resolved within 48 hours. Test coverage improved to 78%.\n\nDecisions:\n- SSO integration prioritized for Q2 (driven by $850K in pending enterprise deals)\n- Mobile retention improvement project approved with dedicated PM\n- Onboarding redesign scheduled for Q2 sprint 3\n\nAction Items:\n- Engineering: Scope SSO implementation (due March 22)\n- Design: Onboarding improvement mockups (due April 1)\n- QA: Plan timezone testing sprint\n- Product: Draft mobile retention improvement brief",
        "metadata": {
          "source_type": "transcript",
          "source_word_count": 312,
          "summary_word_count": 124,
          "compression_ratio": 2.5,
          "summary_style": "executive",
          "key_topics": ["product-metrics", "engineering-roadmap", "user-research", "qa"]
        }
      }
    ]
    Meeting transcript summarization example, using an executive-style summary to capture decisions and action items

    Data Collection Guide

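    Collect source documents and pair them with human-written summaries. The most effective approach is to have domain experts write summaries of the documents they encounter in their daily workflow: meeting organizers summarize their own meetings, analysts summarize their own reports, and support leads summarize escalated cases. These naturally produced summaries reflect real information needs and priorities.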

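    Define a summary style guide for each document type in the training data. Meeting summaries should emphasize decisions and action items. Technical report summaries should capture key findings and methodology. Legal document summaries should highlight material clauses and risks. Inconsistent summary styles in the training data lead the model to produce unpredictable output.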

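    Control the compression ratio across the dataset. Very short summaries (10:1 compression) require extreme selectivity and may omit important details. Very long summaries (2:1 compression) may not save enough time. Most enterprise applications target a compression ratio of 3:1 to 5:1. Include the target summary length or compression ratio in the training instruction so the model learns to adjust output length as needed.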
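
    The ratio guidance above can be sketched as two small helpers: one that tests whether a pair falls inside the 3:1 to 5:1 band, and one that appends a target length to an instruction. The names `RatioBounds`, `isWithinTargetRatio`, and `withTargetLength` are hypothetical, and the wording of the appended sentence is only an illustration.

```typescript
interface RatioBounds {
  min: number;
  max: number;
}

// The 3:1 to 5:1 band suggested for most enterprise applications.
const DEFAULT_BOUNDS: RatioBounds = { min: 3, max: 5 };

// True when source_words / summary_words falls inside the bounds.
function isWithinTargetRatio(
  sourceWords: number,
  summaryWords: number,
  bounds: RatioBounds = DEFAULT_BOUNDS
): boolean {
  if (summaryWords <= 0) return false;
  const ratio = sourceWords / summaryWords;
  return ratio >= bounds.min && ratio <= bounds.max;
}

// Appends an explicit length target so the model can learn to follow it.
function withTargetLength(
  instruction: string,
  sourceWords: number,
  targetRatio: number
): string {
  const targetWords = Math.max(1, Math.round(sourceWords / targetRatio));
  return `${instruction} Keep the summary to about ${targetWords} words.`;
}
```

    For example, `withTargetLength("Summarize the report.", 400, 4)` yields an instruction asking for about 100 words. Pairs outside the band need not be discarded. They can be kept deliberately when a style calls for a shorter or longer summary.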

    Quality Criteria

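    Evaluate summaries along four dimensions: faithfulness (no information absent from the source document), completeness (all key points covered), conciseness (no unnecessary repetition or filler), and coherence (reads smoothly and is well organized). Have reviewers score each dimension on a 1-5 scale, and exclude any example that scores below 3 on any dimension.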
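
    The exclusion rule above is simple enough to express directly. This is a minimal sketch with hypothetical names (`Dimension`, `ReviewScores`, `passesReview`). Treating a non-integer or out-of-range score as a failure is an assumption added here.

```typescript
type Dimension = "faithfulness" | "completeness" | "conciseness" | "coherence";
type ReviewScores = Record<Dimension, number>;

const ALL_DIMENSIONS: Dimension[] = [
  "faithfulness",
  "completeness",
  "conciseness",
  "coherence",
];

// An example passes only if every dimension is a valid 1-5 score
// and none falls below the minimum (3 by default).
function passesReview(scores: ReviewScores, minScore = 3): boolean {
  return ALL_DIMENSIONS.every((dimension) => {
    const score = scores[dimension];
    const isValid = Number.isInteger(score) && score >= 1 && score <= 5;
    return isValid && score >= minScore;
  });
}
```

    Note that the rule is a per-dimension floor, not an average: an example scoring 5, 5, 5, 2 is excluded even though its mean is above 4.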

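    Faithfulness is the most critical criterion. Every factual claim in the summary must be verifiable from the source document. Summaries that introduce information not present in the source (hallucinations) are harmful and must be identified and removed. Have reviewers flag any statement in the summary that cannot be traced to a specific passage in the source document.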
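
    Human review remains the actual check, but a cheap automated pre-screen can point reviewers at likely problems. The sketch below is a heuristic and an assumption, not a method described by this template: it flags numbers that appear in the summary but not in the source. The function names are hypothetical. It will produce false positives for reformatted figures (for instance "145K" in a summary against "145,000" in the source), so flagged items are candidates for a reviewer to confirm, not automatic rejections.

```typescript
// Pulls numeric tokens out of text, e.g. "23" from "23%" or "850" from "$850K".
function extractNumbers(text: string): string[] {
  const matches: string[] = text.match(/\d[\d,]*(?:\.\d+)?/g) ?? [];
  // Drop thousands separators so "145,000" and "145000" compare equal.
  return matches.map((m) => m.replace(/,/g, ""));
}

// Returns the distinct numbers found in the summary that never
// appear in the source document.
function unsupportedNumbers(source: string, summary: string): string[] {
  const inSource = new Set(extractNumbers(source));
  const flagged = extractNumbers(summary).filter((n) => !inSource.has(n));
  return Array.from(new Set(flagged));
}
```

    A summary with an empty result is not proven faithful. The check only covers numeric claims and says nothing about names, causes, or conclusions.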

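    Test coverage completeness by having reviewers independently list the five most important points in the source document, then check whether the summary covers at least four of the five. Omitting key information is a serious quality problem, especially for summaries used in decision-making, where a missing detail could change the conclusion.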
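
    Once a reviewer has listed the key points and marked which ones the summary covers, the four-of-five rule is a simple count. This sketch assumes the coverage judgment itself is made by a person. `KeyPointCheck`, `passesCoverage`, and `missedPoints` are hypothetical names.

```typescript
// One reviewer-listed key point and whether the summary covers it.
interface KeyPointCheck {
  point: string;
  coveredInSummary: boolean;
}

// Passes when at least minCovered of the listed points are covered
// (4 of 5 under the rule described above).
function passesCoverage(checks: KeyPointCheck[], minCovered = 4): boolean {
  const covered = checks.filter((c) => c.coveredInSummary).length;
  return covered >= minCovered;
}

// Lists the points the summary left out, for reviewer feedback.
function missedPoints(checks: KeyPointCheck[]): string[] {
  return checks.filter((c) => !c.coveredInSummary).map((c) => c.point);
}
```

    Recording the missed points, not just the pass or fail result, makes it easier to see whether a summary writer consistently drops the same kind of information.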

    Using This Template with Ertas

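    Import the source documents and their summaries into Ertas Data Suite. Apply PII redaction to both the source text and the summary text: meeting transcripts and business reports often contain employee names, customer names, financial data, and project code names. The redaction engine stays consistent across both fields (if a name is masked in the source text, the same masking is applied in the summary).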
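
    To illustrate what consistency across the two fields means, here is a minimal sketch that applies one shared placeholder mapping to both texts. It is not the Ertas redaction engine and makes no claim about how that engine works. `redactPair` is a hypothetical name, the list of sensitive terms is assumed to be supplied by some upstream detection step, and the placeholder format is invented for the example.

```typescript
interface RedactedPair {
  source: string;
  summary: string;
}

// Replaces each sensitive term with the same placeholder in both fields.
function redactPair(source: string, summary: string, terms: string[]): RedactedPair {
  // Mask longer terms first so "Acme Corp" is handled before "Acme".
  const ordered = terms
    .filter((t) => t.length > 0)
    .slice()
    .sort((a, b) => b.length - a.length);

  const apply = (text: string): string =>
    ordered.reduce(
      // split/join replaces every literal occurrence without regex escaping.
      (acc, term, index) => acc.split(term).join(`[REDACTED_${index + 1}]`),
      text
    );

  return { source: apply(source), summary: apply(summary) };
}
```

    Because the placeholder for a term depends only on the shared term list, a name masked as `[REDACTED_2]` in the source is masked as `[REDACTED_2]` in the summary, so the pairing between the two texts stays intact.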

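    Export the paired source-summary data in Alpaca format for fine-tuning in Ertas Studio. A model exported as GGUF supports local document summarization, which is essential for organizations handling confidential documents that cannot be sent to external summarization APIs.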
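
    Alpaca-format records carry three fields: `instruction`, `input`, and `output`. The sketch below strips the `metadata` block from each example and emits one JSON object per line. Whether Ertas Studio expects JSON Lines or a single JSON array is not stated here, so the JSONL output is an assumption, and `toAlpacaJsonl` is a hypothetical helper name.

```typescript
interface AlpacaRecord {
  instruction: string;
  input: string;
  output: string;
}

// Keeps only the three Alpaca fields and serializes one record per line.
function toAlpacaJsonl(
  examples: Array<AlpacaRecord & { metadata?: unknown }>
): string {
  return examples
    .map(({ instruction, input, output }) =>
      JSON.stringify({ instruction, input, output })
    )
    .join("\n");
}
```

    Dropping the metadata at export time keeps fields such as `compression_ratio` available for filtering and analysis without leaking them into the training text.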

    Recommended Model

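    Summarization requires a model large enough to handle long input contexts. A 7B-8B model with 8K context handles most business documents effectively. For very long documents (over 10,000 words), consider a model with an extended context window, or implement a chunked summarization pipeline in which the model first summarizes each section independently and then merges the results.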
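
    The chunked pipeline described above can be sketched as a split, a per-chunk summary, and a final merge pass. This is an outline under stated assumptions: `Summarizer` stands in for whatever function calls the model, the 3,000-word chunk size is an illustrative guess at what fits an 8K context with room for the output, and splitting on word count ignores section boundaries that a real pipeline would likely respect.

```typescript
// Placeholder for a call to the summarization model (hypothetical).
type Summarizer = (text: string) => Promise<string>;

// Splits text into consecutive chunks of at most maxWords words.
function chunkByWords(text: string, maxWords: number): string[] {
  if (maxWords < 1) throw new Error("maxWords must be at least 1");
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

// Summarizes each chunk independently, then summarizes the
// concatenated partial summaries to produce the final result.
async function summarizeLongDocument(
  text: string,
  summarize: Summarizer,
  maxWords = 3000
): Promise<string> {
  const chunks = chunkByWords(text, maxWords);
  if (chunks.length <= 1) return summarize(text);

  const partials: string[] = [];
  for (const chunk of chunks) {
    partials.push(await summarize(chunk));
  }
  return summarize(partials.join("\n\n"));
}
```

    The merge step is itself a summarization call, so for extremely long inputs the joined partial summaries could again exceed the context window. Applying the same function recursively to the merged text is one way to handle that case.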

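    For production deployment, a GGUF at Q5_K_M provides good summarization quality. Summarization is less sensitive to quantization artifacts than creative generation tasks, so Q4_K_M is also a viable option for faster inference when processing large volumes of documents.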
