Summarization Training Dataset Template
Template for building datasets that train AI models to produce accurate, concise summaries of documents, articles, meetings, and reports.
Overview
Summarization datasets train AI models to condense long documents into shorter, coherent summaries that capture the most important information. Applications span enterprise document summarization (compressing lengthy reports, research papers, and legal filings), meeting transcript summarization (extracting key decisions and action items), news summarization (generating concise article summaries for digests), and customer interaction summarization (summarizing support conversations for agent handoffs and records).
Two primary approaches define summarization training data. Extractive summarization selects and combines the most important sentences from the source document — the training data identifies which sentences to include. Abstractive summarization generates new text that paraphrases and synthesizes the source content — the training data pairs source documents with human-written summaries. Modern LLMs excel at abstractive summarization, making it the more common and flexible approach for contemporary training datasets.
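As a rough illustration, the two approaches imply different training-data shapes. The type names below are hypothetical and exist only to contrast the patterns; the schema this template actually uses (see Dataset Schema) follows the abstractive pattern.
// Hypothetical shapes contrasting the two approaches.
// Extractive: the label is which source sentences to keep.
interface ExtractiveExample {
  sourceSentences: string[];
  selectedIndices: number[]; // indices of sentences forming the summary
}

// Abstractive: the label is freely written text paired with the source.
interface AbstractiveExample {
  sourceDocument: string;
  referenceSummary: string; // human-written paraphrase and synthesis
}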
The key challenge in summarization datasets is defining what makes a good summary. Summaries should be faithful (containing only information present in the source), complete (covering all important points), concise (shorter than the source by a defined compression ratio), coherent (reading as fluent, well-organized text), and useful (serving the reader's information needs). These qualities must be reflected in the training examples and measured through quality validation processes.
Dataset Schema
interface SummarizationExample {
  instruction: string;
  input: string; // Source document text
  output: string; // Target summary
  metadata: {
    source_type: "article" | "report" | "transcript" | "email_thread" | "legal_filing";
    source_word_count: number;
    summary_word_count: number;
    compression_ratio: number; // source_words / summary_words
    summary_style: "executive" | "technical" | "bullet_points" | "abstract";
    key_topics: string[];
  };
}
Sample Data
[
  {
    "instruction": "Summarize the following meeting transcript, highlighting key decisions and action items.",
    "input": "QUARTERLY PRODUCT REVIEW - MARCH 2026\n\nAttendees: VP Engineering, Product Director, Design Lead, QA Manager\n\nProduct Director opened the meeting reviewing Q1 metrics. User engagement increased 23% quarter-over-quarter, driven primarily by the new dashboard redesign launched in February. Monthly active users reached 145,000, exceeding the Q1 target of 130,000. However, the mobile app retention rate dropped from 42% to 38%, which the team identified as a priority concern.\n\nVP Engineering presented the technical roadmap. The team completed 47 of 52 planned story points. The five incomplete items were deferred to Q2 due to an unexpected security patch that consumed two engineering weeks. The security vulnerability (CVE-2026-1847) was identified through the bug bounty program and patched within 48 hours of discovery.\n\nDesign Lead shared user research findings from 30 customer interviews. Three key themes emerged: users want better data export capabilities, the onboarding flow is confusing for non-technical users, and enterprise customers need SSO integration. The team agreed to prioritize SSO for Q2 based on three pending enterprise deals totaling $850K ARR that list SSO as a requirement.\n\nQA Manager reported that automated test coverage reached 78%, up from 71% in Q4. Two critical bugs were found in production during Q1, both related to timezone handling in scheduled reports. A dedicated testing sprint for timezone-related features was proposed for early Q2.\n\nDecisions made: (1) SSO integration prioritized for Q2 sprint 1-2, (2) Mobile retention improvement project approved with dedicated PM allocation, (3) Onboarding redesign to begin in Q2 sprint 3.\n\nAction items assigned: Engineering to scope SSO implementation by March 22, Design to create onboarding improvement mockups by April 1, QA to plan timezone testing sprint, Product to draft mobile retention improvement brief.",
    "output": "Q1 Product Review Summary:\n\nKey Metrics: User engagement up 23% QoQ. MAU reached 145K (vs. 130K target). Mobile retention declined from 42% to 38%, flagged as priority concern.\n\nEngineering: Completed 47/52 story points. Shortfall due to emergency security patch (CVE-2026-1847), resolved within 48 hours. Test coverage improved to 78%.\n\nDecisions:\n- SSO integration prioritized for Q2 (driven by $850K in pending enterprise deals)\n- Mobile retention improvement project approved with dedicated PM\n- Onboarding redesign scheduled for Q2 sprint 3\n\nAction Items:\n- Engineering: Scope SSO implementation (due March 22)\n- Design: Onboarding improvement mockups (due April 1)\n- QA: Plan timezone testing sprint\n- Product: Draft mobile retention improvement brief",
    "metadata": {
      "source_type": "transcript",
      "source_word_count": 312,
      "summary_word_count": 124,
      "compression_ratio": 2.5,
      "summary_style": "executive",
      "key_topics": ["product-metrics", "engineering-roadmap", "user-research", "qa"]
    }
  }
]
Data Collection Guide
Collect source documents and pair them with human-written summaries. The most effective approach is to have domain experts write summaries of documents they encounter in their normal workflow — meeting organizers summarizing their own meetings, analysts summarizing their own reports, and support leads summarizing escalation cases. These naturally produced summaries reflect real information needs and priorities.
Define summary style guidelines for each document type in your training data. Meeting summaries should emphasize decisions and action items. Technical report summaries should capture key findings and methodologies. Legal document summaries should highlight material terms and risks. Inconsistent summary styles in the training data will produce a model that generates unpredictable outputs.
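One lightweight way to enforce this consistency is to key instruction templates by document type, as in the sketch below. The template wording is illustrative, not prescriptive:
// Illustrative instruction templates keyed by source_type, so every
// example of a given document type carries a consistent summary brief.
const instructionTemplates: Record<string, string> = {
  transcript: "Summarize the following meeting transcript, highlighting key decisions and action items.",
  report: "Summarize the following report, capturing key findings and methodology.",
  legal_filing: "Summarize the following filing, highlighting material terms and risks.",
};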
Control compression ratio across your dataset. Very short summaries (10:1 compression) require extreme selectivity and may miss important details. Very long summaries (2:1 compression) may not provide enough time savings to be useful. Most enterprise applications target 3:1 to 5:1 compression ratios. Include the target summary length or compression ratio in your training instructions so the model learns to adjust its output length.
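A minimal length gate along these lines can run before an example enters the dataset. The 3:1 to 5:1 band below mirrors the guidance above and should be adjusted per document type:
// Sketch of a compression-ratio check using simple whitespace word counts.
function countWords(text: string): number {
  return text.trim().split(/\s+/).length;
}

function withinCompressionBand(source: string, summary: string, min = 3, max = 5): boolean {
  const ratio = countWords(source) / countWords(summary);
  return ratio >= min && ratio <= max;
}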
Quality Criteria
Evaluate summaries on four dimensions: faithfulness (no claims beyond what the source supports), completeness (all important points covered), conciseness (no unnecessary repetition or filler), and coherence (reads as well-organized, fluent text). Have reviewers score each dimension on a 1-5 scale, and exclude examples where any dimension scores below 3.
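Once scores are collected, the exclusion rule is mechanical; a minimal sketch:
// Drop any example where a reviewer scored any dimension below 3.
interface ReviewScores {
  faithfulness: number;
  completeness: number;
  conciseness: number;
  coherence: number;
}

function passesReview(scores: ReviewScores): boolean {
  return Object.values(scores).every((score) => score >= 3);
}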
Faithfulness is the most critical criterion. Every factual claim in the summary must be verifiable from the source document. Summaries that introduce information not present in the source (hallucinations) are actively harmful and must be identified and removed. Have reviewers highlight any statement in the summary that cannot be traced to a specific passage in the source document.
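Human tracing remains the ground truth here, but a crude lexical pre-filter can triage which sentences reviewers inspect first. The 0.6 overlap threshold below is an arbitrary assumption; low overlap does not prove hallucination, it only flags candidates:
// Crude heuristic: flag summary sentences whose words rarely appear in
// the source as candidates for closer human faithfulness review.
function contentWords(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

function flagLowOverlapSentences(source: string, summary: string): string[] {
  const sourceVocab = contentWords(source);
  return summary.split(/(?<=[.!?])\s+/).filter((sentence) => {
    const words = [...contentWords(sentence)];
    if (words.length === 0) return false;
    const supported = words.filter((w) => sourceVocab.has(w)).length;
    return supported / words.length < 0.6; // low overlap => flag for review
  });
}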
Test for coverage completeness by having reviewers independently list the top 5 most important points in the source document, then checking that the summary addresses at least 4 of 5. Missing critical information is a significant quality issue, particularly for summaries used in decision-making contexts where omitted details could change the conclusion.
Using This Template with Ertas
Import source documents and their summaries into Ertas Data Suite. Apply PII redaction to both source and summary texts — meeting transcripts and business reports frequently contain employee names, client names, financial figures, and project code names. The redaction engine processes both fields while maintaining consistency (if a name is masked in the source, the same mask is applied in the summary).
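The consistency property is the important part. The sketch below illustrates it with a naive string-replacement approach; it is not the Ertas redaction engine, and the placeholder format is an assumption:
// Illustration of consistent masking: the same entity always maps to the
// same placeholder in both the source and the summary.
function maskPair(source: string, summary: string, entities: string[]): { source: string; summary: string } {
  const masks = new Map<string, string>();
  entities.forEach((entity, i) => masks.set(entity, `[ENTITY_${i + 1}]`));
  const apply = (text: string): string => {
    let out = text;
    for (const [entity, mask] of masks) {
      out = out.split(entity).join(mask); // replace every occurrence
    }
    return out;
  };
  return { source: apply(source), summary: apply(summary) };
}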
Export paired source-summary data in Alpaca format for fine-tuning in Ertas Studio. The GGUF-exported model enables local document summarization, which is essential for organizations handling confidential documents that cannot be sent to external summarization APIs.
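Alpaca-format records are plain instruction/input/output objects, so converting from the schema above simply drops the metadata; a minimal sketch:
// Mapping a SummarizationExample (schema above) to an Alpaca record.
interface AlpacaRecord {
  instruction: string;
  input: string;
  output: string;
}

function toAlpaca(example: SummarizationExample): AlpacaRecord {
  return {
    instruction: example.instruction,
    input: example.input,
    output: example.output,
  };
}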
Recommended Model
Summarization requires models large enough to handle long input contexts. A 7B-8B model with 8K context support handles most business documents effectively. For very long documents (over 10,000 words), consider models with extended context windows or implement a chunk-and-summarize pipeline where the model summarizes sections independently and then combines them.
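A chunk-and-summarize pipeline can be sketched as below. The summarize parameter stands in for whatever model call you use (the signature is hypothetical), and the chunk size should be tuned to your model's context window:
// Sketch: split by word count, summarize sections independently, then
// combine the partial summaries with a final pass.
async function summarizeLongDocument(
  text: string,
  summarize: (chunk: string) => Promise<string>,
  maxWordsPerChunk = 3000,
): Promise<string> {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWordsPerChunk) {
    chunks.push(words.slice(i, i + maxWordsPerChunk).join(" "));
  }
  const partials = await Promise.all(chunks.map((chunk) => summarize(chunk)));
  return chunks.length === 1 ? partials[0] : summarize(partials.join("\n\n"));
}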
For production deployments, GGUF at Q5_K_M provides good summarization quality. Summarization is less sensitive to quantization artifacts than creative generation tasks, so Q4_K_M is also a viable option for faster inference when processing large document volumes.