Question Answering Dataset Template

    A template for building extractive and generative question answering datasets for domain-specific knowledge retrieval.

    NLP

    Overview

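    Question answering (QA) datasets train AI models to answer questions from a given context, from internal knowledge, or from a combination of the two. QA is a core capability in enterprise AI applications, including knowledge base search, document retrieval, FAQ automation, internal helpdesk bots, and retrieval-augmented generation (RAG) systems. The dataset structure depends on the QA type: extractive QA (selecting an answer span from a given passage), abstractive QA (generating a synthesized answer from the context), and closed-book QA (answering only from the model's parametric knowledge).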

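    For enterprise applications, the most common and practical approach is context-grounded QA, where the model receives relevant document passages and answers from that context. This maps directly onto RAG architectures: a retriever finds relevant documents and the model generates an answer from the retrieved content. Training data for this use case pairs each question with a context passage and an expected answer, teaching the model to locate and synthesize information in the provided text.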

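    Domain specificity is critical for QA datasets. General-purpose QA models often struggle with specialized terminology, domain-specific reasoning patterns, and the implicit knowledge needed to answer questions correctly in fields such as law, medicine, finance, or engineering. Fine-tuning on domain-specific QA pairs can substantially improve answer accuracy and relevance, particularly for questions that depend on jargon, abbreviations, and domain conventions.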

    Dataset Schema

    typescript
    interface QAExample {
      question: string;
      context: string;        // Source passage for context-grounded QA
      answer: string;         // The expected answer
      answer_start?: number;  // Character offset for extractive QA
      is_answerable: boolean; // Whether the context contains the answer
      metadata: {
        source_document: string;
        domain: string;
        difficulty: "easy" | "medium" | "hard";
        question_type: "factoid" | "reasoning" | "comparison" | "procedural";
      };
    }
    Schema for QA examples, supporting both extractive and generative QA
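
For extractive examples, `answer_start` can be derived from the text rather than typed by hand. The sketch below is one way to do that, assuming the offset is a 0-based index into the JavaScript string (the schema does not state the unit) and that the answer appears verbatim in the context. The names `SpanFields`, `findAnswerStart`, `withAnswerStart`, and `offsetMatches` are illustrative and not part of the template.

```typescript
// Minimal subset of the QAExample schema needed for this sketch.
interface SpanFields {
  context: string;
  answer: string;
  answer_start?: number;
}

// Returns the 0-based offset of the first verbatim occurrence of `answer`
// in `context`, or undefined when the answer is not a literal span
// (for example, a generated paraphrase).
function findAnswerStart(context: string, answer: string): number | undefined {
  if (answer.length === 0) return undefined;
  const index = context.indexOf(answer);
  return index >= 0 ? index : undefined;
}

// Returns the example with answer_start filled in when a verbatim span
// exists; otherwise returns the example unchanged.
function withAnswerStart(ex: SpanFields): SpanFields {
  const start = findAnswerStart(ex.context, ex.answer);
  return start === undefined
    ? ex
    : { context: ex.context, answer: ex.answer, answer_start: start };
}

// True when a recorded offset actually points at the answer text.
function offsetMatches(ex: SpanFields): boolean {
  if (ex.answer_start === undefined || ex.answer_start < 0) return false;
  const end = ex.answer_start + ex.answer.length;
  return ex.context.slice(ex.answer_start, end) === ex.answer;
}

const passage = "Ertas Studio supports GGUF model files up to 50GB in size.";
const labeled = withAnswerStart({ context: passage, answer: "50GB" });
console.log(labeled.answer_start, offsetMatches(labeled)); // 45 true
```

Generative answers that paraphrase the context will not pass a verbatim check like `offsetMatches`, so a check of this kind is probably only meaningful for the extractive subset of a dataset.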

    Sample Data

    json
    [
      {
        "question": "What is the maximum file size supported for GGUF model uploads?",
        "context": "Ertas Studio supports GGUF model files up to 50GB in size. For models exceeding this limit, the sharded upload feature allows splitting the model into multiple files, each up to 50GB. The platform automatically reassembles sharded models during deployment. Upload speeds depend on your network connection, with typical enterprise connections achieving 500MB-1GB per minute. All uploads are encrypted in transit using TLS 1.3.",
        "answer": "The maximum file size for GGUF model uploads is 50GB. For larger models, the sharded upload feature can split files into multiple 50GB chunks that are automatically reassembled during deployment.",
        "answer_start": 30,
        "is_answerable": true,
        "metadata": {
          "source_document": "ertas-studio-docs/uploads.md",
          "domain": "product-documentation",
          "difficulty": "easy",
          "question_type": "factoid"
        }
      },
      {
        "question": "How does the PII redaction engine handle names that are also common words?",
        "context": "The PII redaction engine uses a combination of named entity recognition (NER), pattern matching, and contextual analysis to identify personal information. For names that are also common words (e.g., 'Grace', 'Rose', 'Mark'), the engine evaluates surrounding context to determine whether the word is being used as a proper name or a common noun. Contextual signals include capitalization, position in the sentence, surrounding words (Mr., Ms., Dr.), and semantic role. The engine defaults to redaction when the confidence score exceeds 0.7, with a configurable threshold that organizations can adjust based on their risk tolerance.",
        "answer": "The PII redaction engine uses contextual analysis alongside NER and pattern matching to distinguish names from common words. It evaluates capitalization, sentence position, surrounding words (like Mr., Ms., Dr.), and semantic role. When the confidence score exceeds 0.7 (a configurable threshold), it defaults to redaction.",
        "answer_start": 0,
        "is_answerable": true,
        "metadata": {
          "source_document": "ertas-data-suite-docs/pii-redaction.md",
          "domain": "product-documentation",
          "difficulty": "medium",
          "question_type": "reasoning"
        }
      },
      {
        "question": "Can I export models in TensorRT format from Ertas Studio?",
        "context": "Ertas Studio supports model export in three formats: GGUF for local CPU/GPU inference, SafeTensors for GPU-based deployment with frameworks like vLLM and TGI, and ONNX for cross-platform deployment with ONNX Runtime. Export configurations include quantization level selection, metadata embedding, and tokenizer bundling.",
        "answer": "No, TensorRT format is not currently supported. Ertas Studio supports export in three formats: GGUF, SafeTensors, and ONNX.",
        "answer_start": -1,
        "is_answerable": true,
        "metadata": {
          "source_document": "ertas-studio-docs/export.md",
          "domain": "product-documentation",
          "difficulty": "easy",
          "question_type": "factoid"
        }
      }
    ]
    Product documentation QA examples covering factoid, reasoning, and negative-answer scenarios

    Data Collection Guide

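    Start with your existing documentation, knowledge base articles, FAQs, and internal wikis. For each document, write 3 to 5 questions that the document can answer, ranging in difficulty from simple fact extraction to multi-step reasoning. Include questions that require combining information from different parts of a passage, as well as questions whose answer is not in the context (to train the model to say "I don't know" rather than fabricate an answer).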

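    Involve domain experts in question writing. Ask subject matter experts to write the questions they expect users to ask, rather than mechanically generating questions from the text. Expert-written questions capture the natural phrasing and information needs of real users and yield more effective training data than algorithmically generated questions.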

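    Include unanswerable questions in the dataset (15-20% of examples). These are questions whose context does not contain enough information to answer them; in the schema they are the examples with `is_answerable` set to `false`. The expected output should state explicitly that the context does not contain the relevant information. This is critical for RAG applications, because the retriever may return passages that are only tangentially related and do not actually answer the user's question. The model must learn to recognize this case and decline to answer rather than invent one.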
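
The 15-20% guideline is easy to check mechanically before training. The following is a minimal sketch that assumes each example carries the `is_answerable` flag from the schema; `AnswerableFlag`, `unanswerableShare`, and `isWithinTarget` are hypothetical helper names, and the sample split is made up for illustration.

```typescript
// Only the field this check needs from the QAExample schema.
interface AnswerableFlag {
  is_answerable: boolean;
}

// Fraction of examples whose context does not contain the answer.
function unanswerableShare(examples: AnswerableFlag[]): number {
  if (examples.length === 0) return 0;
  const unanswerable = examples.filter((ex) => !ex.is_answerable).length;
  return unanswerable / examples.length;
}

// Checks a share against the 15-20% range suggested in the guide.
function isWithinTarget(share: number, min: number = 0.15, max: number = 0.2): boolean {
  return share >= min && share <= max;
}

// Hypothetical split: 2 unanswerable and 10 answerable examples.
const flags: AnswerableFlag[] = [];
for (let i = 0; i < 12; i++) flags.push({ is_answerable: i >= 2 });
const share = unanswerableShare(flags);
console.log(share.toFixed(3), isWithinTarget(share));
```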

    Quality Criteria

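    Verify answer correctness by having a second reviewer answer each question independently using only the provided context, then compare the result with the annotated answer. Disagreements should be adjudicated by a domain expert. For factoid questions, there should be exactly one correct answer derivable from the context. For reasoning questions, document the acceptable answer variants.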
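
One way to compare a reviewer's answer with the annotated answer is a SQuAD-style token-overlap F1 score, sending low-scoring pairs to a domain expert. The guide itself does not prescribe a metric, so treat this as a sketch under assumptions: the tokenizer is a simple ASCII one suited to English text, and the 0.8 threshold in `needsAdjudication` is an arbitrary illustrative value.

```typescript
// Lowercase, replace anything that is not an ASCII letter, digit, or
// whitespace with a space, then split on whitespace.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

// Token-level F1 between two answers (multiset overlap of tokens).
function tokenF1(annotated: string, reviewer: string): number {
  const aTokens = tokenize(annotated);
  const rTokens = tokenize(reviewer);
  if (aTokens.length === 0 || rTokens.length === 0) {
    return aTokens.length === rTokens.length ? 1 : 0;
  }
  // Prototype-free object so tokens like "constructor" count correctly.
  const counts: Record<string, number> = Object.create(null);
  for (const t of aTokens) counts[t] = (counts[t] || 0) + 1;
  let overlap = 0;
  for (const t of rTokens) {
    if ((counts[t] || 0) > 0) {
      overlap += 1;
      counts[t] -= 1;
    }
  }
  if (overlap === 0) return 0;
  const precision = overlap / rTokens.length;
  const recall = overlap / aTokens.length;
  return (2 * precision * recall) / (precision + recall);
}

// Flags a pair for expert review when agreement falls below the threshold.
// ASSUMPTION: 0.8 is an illustrative cutoff, not a value from the guide.
function needsAdjudication(annotated: string, reviewer: string, threshold: number = 0.8): boolean {
  return tokenF1(annotated, reviewer) < threshold;
}

console.log(tokenF1("GGUF, SafeTensors, and ONNX", "GGUF and ONNX").toFixed(2));
```

Token overlap is a coarse signal: it works reasonably for short factoid answers, while longer reasoning answers usually still need a human to judge whether two phrasings are equivalent.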

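    Ensure diversity across question types (what, how, why, when, which, comparison, procedural), difficulty levels, and source documents. A dataset dominated by simple factoid questions will not produce a model that can handle the complex reasoning questions users ask in practice. Aim for a distribution of roughly 40% easy, 40% medium, and 20% hard.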
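
The difficulty mix can be measured from `metadata.difficulty` and compared with the 40/40/20 target. This is a rough sketch; the 5-point tolerance and the names `TARGET_MIX`, `difficultyMix`, and `offTarget` are assumptions made for illustration.

```typescript
type Difficulty = "easy" | "medium" | "hard";

// Only the field this check needs from the QAExample schema.
interface HasDifficulty {
  metadata: { difficulty: Difficulty };
}

const LEVELS: Difficulty[] = ["easy", "medium", "hard"];

// Target distribution stated in the quality criteria.
const TARGET_MIX: Record<Difficulty, number> = { easy: 0.4, medium: 0.4, hard: 0.2 };

// Share of examples at each difficulty level (all zeros for an empty set).
function difficultyMix(examples: HasDifficulty[]): Record<Difficulty, number> {
  const counts: Record<Difficulty, number> = { easy: 0, medium: 0, hard: 0 };
  for (const ex of examples) counts[ex.metadata.difficulty] += 1;
  const total = examples.length || 1;
  return {
    easy: counts.easy / total,
    medium: counts.medium / total,
    hard: counts.hard / total,
  };
}

// Levels whose share differs from the target by more than `tolerance`.
// ASSUMPTION: 0.05 is an illustrative tolerance, not a value from the guide.
function offTarget(mix: Record<Difficulty, number>, tolerance: number = 0.05): Difficulty[] {
  return LEVELS.filter((level) => Math.abs(mix[level] - TARGET_MIX[level]) > tolerance);
}

const sample: HasDifficulty[] = [
  { metadata: { difficulty: "easy" } },
  { metadata: { difficulty: "easy" } },
  { metadata: { difficulty: "medium" } },
  { metadata: { difficulty: "medium" } },
  { metadata: { difficulty: "hard" } },
];
console.log(difficultyMix(sample), offTarget(difficultyMix(sample)));
```

The same counting pattern could be applied to `metadata.question_type` and `metadata.source_document` to check the other two diversity dimensions.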

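    Verify that context passages are self-contained: the answer should be derivable from the provided context alone, without external knowledge. If a question can only be answered by combining the context with domain expertise that is not in the text, either expand the context to include the necessary information or reclassify the question as a closed-book QA example.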

    Using This Template with Ertas

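    Import your document corpus into Ertas Data Suite. If the documents contain customer-specific information, internal URLs, or employee names, apply PII redaction before using the text as training context. The data provenance system tracks which source documents contributed to each QA pair, maintaining a record that traces back to the original knowledge base.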

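    Export the QA pairs in Alpaca format (instruction: question, input: context, output: answer) or as JSONL, then fine-tune in Ertas Studio. After training, the GGUF-exported model can power a local RAG system that handles queries locally without sending user questions to an external API.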
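
The Alpaca mapping described above (instruction: question, input: context, output: answer) can be sketched as a small conversion step that writes one JSON object per line. This is an illustration of the field mapping only, not the Ertas export implementation; `AlpacaRecord`, `toAlpaca`, and `toJsonl` are hypothetical names.

```typescript
interface QAExample {
  question: string;
  context: string;
  answer: string;
  answer_start?: number;
  is_answerable: boolean;
  metadata: {
    source_document: string;
    domain: string;
    difficulty: "easy" | "medium" | "hard";
    question_type: "factoid" | "reasoning" | "comparison" | "procedural";
  };
}

// Alpaca-style record with the three fields named in the guide.
interface AlpacaRecord {
  instruction: string;
  input: string;
  output: string;
}

// Maps question -> instruction, context -> input, answer -> output.
function toAlpaca(ex: QAExample): AlpacaRecord {
  return { instruction: ex.question, input: ex.context, output: ex.answer };
}

// Serializes examples as JSONL: one JSON object per line.
function toJsonl(examples: QAExample[]): string {
  return examples.map((ex) => JSON.stringify(toAlpaca(ex))).join("\n");
}

const example: QAExample = {
  question: "Can I export models in TensorRT format from Ertas Studio?",
  context: "Ertas Studio supports model export in three formats: GGUF, SafeTensors, and ONNX.",
  answer: "No, TensorRT format is not currently supported.",
  is_answerable: true,
  metadata: {
    source_document: "ertas-studio-docs/export.md",
    domain: "product-documentation",
    difficulty: "easy",
    question_type: "factoid",
  },
};
console.log(toJsonl([example]));
```

Note that this mapping drops `metadata` and `is_answerable`; if provenance matters downstream, it may be worth keeping the original records alongside the exported file.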

    Recommended Model

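    For QA systems, models with 7B-8B parameters offer a suitable balance between comprehension and inference speed. Models of this size handle context windows of 4,000-8,000 tokens effectively, which covers most document passages. For QA systems that must process very long documents, consider a model with extended context support.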
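
Before training, it can help to flag examples whose question and context are unlikely to fit in the context window. The sketch below uses a coarse estimate of about 4 characters per token, which is only a rule of thumb for English text; actual counts depend on the model's tokenizer, so a real pipeline would use that tokenizer instead. `estimateTokens`, `overBudget`, and the 4,000-token default are illustrative assumptions.

```typescript
// Only the fields this check needs from the QAExample schema.
interface PromptFields {
  question: string;
  context: string;
}

// Coarse token estimate.
// ASSUMPTION: roughly 4 characters per token; replace with the target
// model's tokenizer for accurate counts.
function estimateTokens(text: string, charsPerToken: number = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

// Returns the examples whose estimated question + context size exceeds
// the token budget.
function overBudget<T extends PromptFields>(examples: T[], maxTokens: number = 4000): T[] {
  return examples.filter(
    (ex) => estimateTokens(ex.question) + estimateTokens(ex.context) > maxTokens
  );
}

const shortExample: PromptFields = {
  question: "What is the maximum file size supported for GGUF model uploads?",
  context: "Ertas Studio supports GGUF model files up to 50GB in size.",
};
console.log(overBudget([shortExample]).length); // 0
```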

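    For high-throughput extractive QA (selecting exact answer spans), encoder models such as DeBERTa fine-tuned on SQuAD-style data offer faster inference. For generative QA with explanations, a 7B model that is fine-tuned and then exported to GGUF offers the best balance of quality and speed.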
