Summarization Dataset Template

    A dataset template for training AI models for text summarization and document condensation


    Overview

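    Summarization datasets train AI models to condense long documents into shorter, coherent summaries that capture the most important information. Applications include enterprise document summarization (condensing lengthy reports, research papers, and legal documents), meeting transcript summarization (extracting key decisions and action items), news summarization (producing concise article summaries for digest newsletters), and customer interaction summarization (summarizing support conversations for handoffs and record-keeping).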

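    Two main approaches define summarization training data. Extractive summarization selects and combines the most important sentences from the source document; the training data identifies which sentences to include. Abstractive summarization generates new text that paraphrases and synthesizes the source content; the training data pairs source documents with human-written summaries. Modern LLMs excel at abstractive summarization, making it the more common and flexible approach in contemporary training datasets.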

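    The main challenge in building a summarization dataset is defining what makes a good summary. A summary should be faithful (containing only information present in the source), complete (covering all important points), concise (shorter than the source and within the defined compression ratio), coherent (fluent and well organized), and useful (serving the reader's information needs). These qualities must be reflected in the training examples and measured through a quality validation process.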

    Dataset Schema

    typescript
    interface SummarizationExample {
      instruction: string;
      input: string;            // Source document text
      output: string;           // Target summary
      metadata: {
        source_type: "article" | "report" | "transcript" | "email_thread" | "legal_filing";
        source_word_count: number;
        summary_word_count: number;
        compression_ratio: number;   // source_words / summary_words
        summary_style: "executive" | "technical" | "bullet_points" | "abstract";
        key_topics: string[];
      };
    }
    Schema definition for summarization training data, including compression ratio and style metadata
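
    Because the schema stores both word counts and a derived compression ratio, the three fields can drift apart. The following is a minimal sketch of a consistency check, not part of the template itself. The function name `checkRatioMetadata` and the `tolerance` default are assumptions chosen for illustration.

```typescript
// Minimal subset of the metadata fields from the schema above.
interface RatioMetadata {
  source_word_count: number;
  summary_word_count: number;
  compression_ratio: number; // source_words / summary_words
}

// Returns a list of problems; an empty list means the fields agree.
// `tolerance` is an assumed allowance for rounding in the stored ratio.
function checkRatioMetadata(meta: RatioMetadata, tolerance: number = 0.05): string[] {
  const problems: string[] = [];
  if (meta.source_word_count <= 0 || meta.summary_word_count <= 0) {
    problems.push("word counts must be positive");
    return problems;
  }
  if (meta.summary_word_count >= meta.source_word_count) {
    problems.push("summary is not shorter than the source");
  }
  const expected = meta.source_word_count / meta.summary_word_count;
  if (Math.abs(expected - meta.compression_ratio) > tolerance) {
    problems.push(
      "compression_ratio " + meta.compression_ratio +
        " does not match " + expected.toFixed(2)
    );
  }
  return problems;
}
```

    Running a check like this at import time catches records whose ratio was typed by hand rather than computed.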

    Sample Data

    json
    [
      {
        "instruction": "Summarize the following meeting transcript, highlighting key decisions and action items.",
        "input": "QUARTERLY PRODUCT REVIEW - MARCH 2026\n\nAttendees: VP Engineering, Product Director, Design Lead, QA Manager\n\nProduct Director opened the meeting reviewing Q1 metrics. User engagement increased 23% quarter-over-quarter, driven primarily by the new dashboard redesign launched in February. Monthly active users reached 145,000, exceeding the Q1 target of 130,000. However, the mobile app retention rate dropped from 42% to 38%, which the team identified as a priority concern.\n\nVP Engineering presented the technical roadmap. The team completed 47 of 52 planned story points. The five incomplete items were deferred to Q2 due to an unexpected security patch that consumed two engineering weeks. The security vulnerability (CVE-2026-1847) was identified through the bug bounty program and patched within 48 hours of discovery.\n\nDesign Lead shared user research findings from 30 customer interviews. Three key themes emerged: users want better data export capabilities, the onboarding flow is confusing for non-technical users, and enterprise customers need SSO integration. The team agreed to prioritize SSO for Q2 based on three pending enterprise deals totaling $850K ARR that list SSO as a requirement.\n\nQA Manager reported that automated test coverage reached 78%, up from 71% in Q4. Two critical bugs were found in production during Q1, both related to timezone handling in scheduled reports. A dedicated testing sprint for timezone-related features was proposed for early Q2.\n\nDecisions made: (1) SSO integration prioritized for Q2 sprint 1-2, (2) Mobile retention improvement project approved with dedicated PM allocation, (3) Onboarding redesign to begin in Q2 sprint 3.\n\nAction items assigned: Engineering to scope SSO implementation by March 22, Design to create onboarding improvement mockups by April 1, QA to plan timezone testing sprint, Product to draft mobile retention improvement brief.",
        "output": "Q1 Product Review Summary:\n\nKey Metrics: User engagement up 23% QoQ. MAU reached 145K (vs. 130K target). Mobile retention declined from 42% to 38%, flagged as priority concern.\n\nEngineering: Completed 47/52 story points. Shortfall due to emergency security patch (CVE-2026-1847), resolved within 48 hours. Test coverage improved to 78%.\n\nDecisions:\n- SSO integration prioritized for Q2 (driven by $850K in pending enterprise deals)\n- Mobile retention improvement project approved with dedicated PM\n- Onboarding redesign scheduled for Q2 sprint 3\n\nAction Items:\n- Engineering: Scope SSO implementation (due March 22)\n- Design: Onboarding improvement mockups (due April 1)\n- QA: Plan timezone testing sprint\n- Product: Draft mobile retention improvement brief",
        "metadata": {
          "source_type": "transcript",
          "source_word_count": 312,
          "summary_word_count": 124,
          "compression_ratio": 2.5,
          "summary_style": "executive",
          "key_topics": ["product-metrics", "engineering-roadmap", "user-research", "qa"]
        }
      }
    ]
    Example meeting transcript summary capturing decisions and action items in an executive style

    Data Collection Guide

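    Collect source documents and pair them with human-written summaries. The most effective approach is to have domain experts summarize the documents they encounter in their day-to-day workflow: meeting organizers summarize their own meetings, analysts summarize their own reports, and support leads summarize escalated cases. These naturally produced summaries reflect real information needs and priorities.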

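    Define a summary style guide for each document type in the training data. Meeting summaries should emphasize decisions and action items. Technical report summaries should capture key findings and methodology. Legal document summaries should highlight important clauses and risks. Inconsistent summary styles in the training data produce a model whose output is unpredictable.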
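
    One way to keep styles consistent is to derive every instruction from a single lookup table. The sketch below is illustrative only. It assumes that meeting summaries map to the `transcript` source type, technical reports to `report`, and legal documents to `legal_filing`. The names `STYLE_GUIDE` and `instructionFor`, and the exact instruction wording, are hypothetical.

```typescript
// Source types covered by the three style rules stated in the text.
type GuidedSourceType = "transcript" | "report" | "legal_filing";

// One style rule per document type, so every example of a given type
// carries the same guidance.
const STYLE_GUIDE: Record<GuidedSourceType, string> = {
  transcript: "Emphasize decisions and action items.",
  report: "Capture key findings and methodology.",
  legal_filing: "Highlight important clauses and risks.",
};

// Builds the instruction field for a training example.
function instructionFor(sourceType: GuidedSourceType): string {
  return "Summarize the following document. " + STYLE_GUIDE[sourceType];
}
```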

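    Control the compression ratio across the dataset. Very short summaries (10:1 compression) demand extreme selectivity and may omit important details. Very long summaries (2:1 compression) may not save enough time to be useful. Target a compression ratio of 3:1 to 5:1 for most enterprise applications. Include the target summary length or compression ratio in the training instruction so the model learns to adjust its output length on request.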
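
    The ratio check and the length-aware instruction can be sketched as follows. This assumes whitespace-delimited word counting, which does not suit languages written without spaces. The helper names and the instruction wording are hypothetical, and the 3:1 to 5:1 band is taken from the text above.

```typescript
// Target band from the text: 3:1 to 5:1.
const MIN_RATIO = 3;
const MAX_RATIO = 5;

// Whitespace-based word count (a simplification).
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed.length === 0 ? 0 : trimmed.split(/\s+/).length;
}

// source_words / summary_words, matching the schema's definition.
function compressionRatio(source: string, summary: string): number {
  const summaryWords = countWords(summary);
  if (summaryWords === 0) {
    throw new Error("summary is empty");
  }
  return countWords(source) / summaryWords;
}

// True when the pair falls inside the target band.
function isWithinTargetRatio(source: string, summary: string): boolean {
  const ratio = compressionRatio(source, summary);
  return ratio >= MIN_RATIO && ratio <= MAX_RATIO;
}

// Hypothetical instruction wording that states the target length explicitly.
function buildLengthInstruction(source: string, targetRatio: number): string {
  const targetWords = Math.max(1, Math.round(countWords(source) / targetRatio));
  return "Summarize the following document in about " + targetWords + " words.";
}
```

    Examples outside the band do not have to be discarded. They can be kept as long as the instruction states the length that the summary actually has.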

    Quality Criteria

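    Evaluate summaries on four dimensions: faithfulness (no information absent from the source), completeness (all important points covered), conciseness (no unnecessary repetition or filler), and coherence (well organized and fluently written). Have reviewers score each dimension from 1 to 5, and exclude any example that scores below 3 on any dimension.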
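
    The exclusion rule above can be sketched as a simple filter. The type and function names are assumptions. Scores outside the 1 to 5 range are treated as failures, which is a design choice of this sketch rather than something the text specifies.

```typescript
// Reviewer scores on the four dimensions, each expected to be 1 to 5.
interface ReviewScores {
  faithfulness: number;
  completeness: number;
  conciseness: number;
  coherence: number;
}

// Minimum acceptable score on every dimension, per the text.
const MIN_SCORE = 3;

// An example passes only if every dimension is at least MIN_SCORE.
// Out-of-range values (for example 0 or 6) are treated as failures.
function passesReview(scores: ReviewScores): boolean {
  const values = [
    scores.faithfulness,
    scores.completeness,
    scores.conciseness,
    scores.coherence,
  ];
  return values.every((v) => v >= MIN_SCORE && v <= 5);
}

// Keeps only the examples whose scores pass on all four dimensions.
function filterReviewed<T>(items: { example: T; scores: ReviewScores }[]): T[] {
  return items
    .filter((item) => passesReview(item.scores))
    .map((item) => item.example);
}
```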

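    Faithfulness is the most critical criterion. Every factual statement in a summary must be verifiable against the source document. Summaries that introduce information not present in the source (hallucinations) are actively harmful and must be identified and removed. Have reviewers flag any statement in the summary that cannot be traced to a specific passage in the source document.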
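
    Human review remains the actual check, but a cheap heuristic can point reviewers at likely problems first. The sketch below is an assumption of this edit, not a method from the template. It lists numeric tokens that appear in the summary but not verbatim in the source. It will also flag legitimately reformatted figures (for example 145K written for 145,000), so its output is a list of candidates to inspect, not a verdict.

```typescript
// Extracts numeric tokens such as "23", "145,000", or "2.5" from text.
function extractNumbers(text: string): string[] {
  const matches = text.match(/\d+(?:[.,]\d+)*/g);
  return matches === null ? [] : matches;
}

// Returns numbers that occur in the summary but not verbatim in the source.
// Heuristic only: a reformatted figure is reported even when it is faithful.
function unsupportedNumbers(source: string, summary: string): string[] {
  const sourceNumbers = extractNumbers(source);
  return extractNumbers(summary).filter((n) => sourceNumbers.indexOf(n) === -1);
}
```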

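    Test coverage completeness by having reviewers independently list the five most important points in the source document, then checking that the summary covers at least four of them. Missing key information is a serious quality problem, especially for summaries used in decision-making, where an omitted detail can change the conclusion.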
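
    The four-of-five rule can be recorded and enforced with a small structure. This is a sketch under assumptions: the reviewer supplies the `covered` judgment for each point, and the names `KeyPointJudgment` and `passesCoverage` are hypothetical.

```typescript
// One key point listed by a reviewer, with the reviewer's judgment of
// whether the summary covers it.
interface KeyPointJudgment {
  point: string;
  covered: boolean;
}

const KEY_POINT_COUNT = 5;  // points listed per source document
const REQUIRED_COVERED = 4; // minimum that the summary must cover

// True when the summary covers at least four of the five listed points.
function passesCoverage(judgments: KeyPointJudgment[]): boolean {
  if (judgments.length !== KEY_POINT_COUNT) {
    throw new Error("expected exactly " + KEY_POINT_COUNT + " key points");
  }
  const coveredCount = judgments.filter((j) => j.covered).length;
  return coveredCount >= REQUIRED_COVERED;
}
```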

    Using This Template with Ertas

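    Import source documents and their summaries into Ertas Data Suite. Apply PII redaction to both the source and summary text: meeting transcripts and business reports often contain employee names, client names, financial figures, and project code names. The redaction engine processes both fields while maintaining consistency (if a name is redacted in the source, the same redaction is applied in the summary).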
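
    To illustrate the consistency property only, the sketch below applies one shared placeholder mapping to both fields. It is not the Ertas redaction engine and makes no claim about how that engine works. The list of sensitive terms is assumed to come from some earlier detection step, and the `[ENTITY_n]` placeholder format is hypothetical.

```typescript
interface MaskedPair {
  source: string;
  summary: string;
  mapping: Record<string, string>; // original term -> placeholder
}

// Replaces each sensitive term with the same placeholder in both texts.
function maskConsistently(
  source: string,
  summary: string,
  sensitiveTerms: string[]
): MaskedPair {
  // Longest terms first, so "Acme Corp" is masked before "Acme".
  const terms = sensitiveTerms
    .filter((t) => t.length > 0)
    .sort((a, b) => b.length - a.length);
  const mapping: Record<string, string> = {};
  let maskedSource = source;
  let maskedSummary = summary;
  terms.forEach((term, index) => {
    const placeholder = "[ENTITY_" + (index + 1) + "]";
    mapping[term] = placeholder;
    // split/join replaces every occurrence without regex escaping.
    maskedSource = maskedSource.split(term).join(placeholder);
    maskedSummary = maskedSummary.split(term).join(placeholder);
  });
  return { source: maskedSource, summary: maskedSummary, mapping: mapping };
}
```

    Sharing one mapping across both fields is what keeps a masked source and its summary aligned, which matters because the model learns from the pair rather than from either text alone.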

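    Export the paired source-summary data in Alpaca format for fine-tuning in Ertas Studio. A model exported as GGUF can summarize documents locally, which is essential for organizations handling confidential documents that cannot be sent to an external summarization API.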
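
    Alpaca-format records carry three fields: `instruction`, `input`, and `output`. A minimal conversion sketch that drops the `metadata` block follows. Whether Ertas Studio expects a JSON array or one record per line is not stated in this template, so the JSON array output here is an assumption, as are the function names.

```typescript
// The three fields of an Alpaca-format record.
interface AlpacaRecord {
  instruction: string;
  input: string;
  output: string;
}

// Keeps only the Alpaca fields and drops metadata or any other extras.
function toAlpaca(
  examples: Array<AlpacaRecord & { metadata?: unknown }>
): AlpacaRecord[] {
  return examples.map((ex) => ({
    instruction: ex.instruction,
    input: ex.input,
    output: ex.output,
  }));
}

// Serializes the records as a pretty-printed JSON array (assumed file shape).
function toAlpacaJson(
  examples: Array<AlpacaRecord & { metadata?: unknown }>
): string {
  return JSON.stringify(toAlpaca(examples), null, 2);
}
```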

    Recommended Model

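    Summarization requires a model that can handle long input contexts. A 7B-8B model with 8K context support handles most business documents effectively. For very long documents (over 10,000 words), consider a model with an extended context window, or implement a segmented summarization pipeline in which the model summarizes each segment independently and the partial summaries are then merged.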
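
    The segmented pipeline can be sketched as below. `Summarizer` stands in for a model call and is hypothetical. It is synchronous here to keep the sketch short, whereas a real inference call would usually be asynchronous. Segments are built from whole paragraphs with a word budget, which is one possible splitting strategy among several.

```typescript
// Hypothetical stand-in for a model call.
type Summarizer = (text: string) => string;

// Groups paragraphs into segments of at most maxWords words each.
// A single paragraph longer than maxWords is kept whole as its own segment.
function splitIntoSegments(text: string, maxWords: number): string[] {
  if (maxWords <= 0) {
    throw new Error("maxWords must be positive");
  }
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  const segments: string[] = [];
  let current: string[] = [];
  let currentWords = 0;
  for (const paragraph of paragraphs) {
    const words = paragraph.split(/\s+/).length;
    if (current.length > 0 && currentWords + words > maxWords) {
      segments.push(current.join("\n\n"));
      current = [];
      currentWords = 0;
    }
    current.push(paragraph);
    currentWords += words;
  }
  if (current.length > 0) {
    segments.push(current.join("\n\n"));
  }
  return segments;
}

// Summarizes each segment independently, then merges the partial summaries
// with one more summarization pass.
function summarizeInSegments(
  text: string,
  maxWords: number,
  summarize: Summarizer
): string {
  const partials = splitIntoSegments(text, maxWords).map((s) => summarize(s));
  if (partials.length <= 1) {
    return partials.join("");
  }
  return summarize(partials.join("\n\n"));
}
```

    The merge pass sees only the partial summaries, so details dropped at the segment level cannot be recovered later. For that reason the segment budget is best kept as large as the context window allows.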

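    For production deployment, a GGUF at Q5_K_M provides good summarization quality. Summarization is less sensitive to quantization artifacts than creative generation tasks, so Q4_K_M is also a viable option for high-volume document processing and offers faster inference.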
