Question-Answering Dataset Template

    Template for building extractive and generative question-answering datasets for domain-specific knowledge retrieval.

    NLP

    Overview

    Question-answering (QA) datasets train AI models to answer questions based on provided context, internal knowledge, or a combination of both. QA is a core capability for enterprise AI applications including knowledge base search, document retrieval, FAQ automation, internal helpdesk bots, and retrieval-augmented generation (RAG) systems. The dataset structure varies by QA type: extractive QA (selecting an answer span from a given passage), abstractive QA (generating a synthesized answer from context), and closed-book QA (answering from the model's parametric knowledge alone).

    For enterprise applications, the most common and practical approach is context-grounded QA, where the model receives a relevant document passage and answers questions based on that context. This aligns perfectly with RAG architectures where a retriever finds relevant documents and the model generates answers from the retrieved content. Training data for this use case pairs questions with context passages and expected answers, teaching the model to find and synthesize information from provided text.

    Domain specificity is crucial for QA datasets. General-purpose QA models often struggle with specialized terminology, domain-specific reasoning patterns, and the implicit knowledge required to correctly answer questions in fields like law, medicine, finance, or engineering. Fine-tuning on domain-specific QA pairs significantly improves answer accuracy and relevance, particularly for questions that require understanding of jargon, acronyms, and domain conventions.

    Dataset Schema

    typescript
    interface QAExample {
      question: string;
      context: string;        // Source passage for context-grounded QA
      answer: string;         // The expected answer
      answer_start?: number;  // Character offset for extractive QA; -1 or omitted otherwise
      is_answerable: boolean; // Whether the answer can be derived from the context
      metadata: {
        source_document: string;
        domain: string;
        difficulty: "easy" | "medium" | "hard";
        question_type: "factoid" | "reasoning" | "comparison" | "procedural";
      };
    }
    Schema for question-answering examples with extractive and generative support
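
    The interface above is only a compile-time contract; a small runtime check can catch malformed records before they reach training or export. The sketch below is one way to do that, assuming the QAExample interface defined above; the field rules are illustrative rather than part of any Ertas API.

    typescript
    // Minimal runtime check for a freshly parsed record, assuming the QAExample
    // shape above. Returns a list of problems; an empty list means the record
    // looks well-formed. `any` is used because the input comes from raw JSON.
    function validateQAExample(record: any): string[] {
      const problems: string[] = [];
      if (typeof record.question !== "string" || record.question.trim() === "") {
        problems.push("question must be a non-empty string");
      }
      if (typeof record.context !== "string" || record.context.trim() === "") {
        problems.push("context must be a non-empty string");
      }
      if (typeof record.answer !== "string") {
        problems.push("answer must be a string");
      }
      if (typeof record.is_answerable !== "boolean") {
        problems.push("is_answerable must be a boolean");
      }
      if (!record.metadata || !["easy", "medium", "hard"].includes(record.metadata.difficulty)) {
        problems.push("metadata.difficulty must be easy, medium, or hard");
      }
      return problems;
    }
    Illustrative runtime validation of QA records before training or export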

    Sample Data

    json
    [
      {
        "question": "What is the maximum file size supported for GGUF model uploads?",
        "context": "Ertas Studio supports GGUF model files up to 50GB in size. For models exceeding this limit, the sharded upload feature allows splitting the model into multiple files, each up to 50GB. The platform automatically reassembles sharded models during deployment. Upload speeds depend on your network connection, with typical enterprise connections achieving 500MB-1GB per minute. All uploads are encrypted in transit using TLS 1.3.",
        "answer": "The maximum file size for GGUF model uploads is 50GB. For larger models, the sharded upload feature can split files into multiple 50GB chunks that are automatically reassembled during deployment.",
        "answer_start": 30,
        "is_answerable": true,
        "metadata": {
          "source_document": "ertas-studio-docs/uploads.md",
          "domain": "product-documentation",
          "difficulty": "easy",
          "question_type": "factoid"
        }
      },
      {
        "question": "How does the PII redaction engine handle names that are also common words?",
        "context": "The PII redaction engine uses a combination of named entity recognition (NER), pattern matching, and contextual analysis to identify personal information. For names that are also common words (e.g., 'Grace', 'Rose', 'Mark'), the engine evaluates surrounding context to determine whether the word is being used as a proper name or a common noun. Contextual signals include capitalization, position in the sentence, surrounding words (Mr., Ms., Dr.), and semantic role. The engine defaults to redaction when the confidence score exceeds 0.7, with a configurable threshold that organizations can adjust based on their risk tolerance.",
        "answer": "The PII redaction engine uses contextual analysis alongside NER and pattern matching to distinguish names from common words. It evaluates capitalization, sentence position, surrounding words (like Mr., Ms., Dr.), and semantic role. When the confidence score exceeds 0.7 (a configurable threshold), it defaults to redaction.",
        "answer_start": 0,
        "is_answerable": true,
        "metadata": {
          "source_document": "ertas-data-suite-docs/pii-redaction.md",
          "domain": "product-documentation",
          "difficulty": "medium",
          "question_type": "reasoning"
        }
      },
      {
        "question": "Can I export models in TensorRT format from Ertas Studio?",
        "context": "Ertas Studio supports model export in three formats: GGUF for local CPU/GPU inference, SafeTensors for GPU-based deployment with frameworks like vLLM and TGI, and ONNX for cross-platform deployment with ONNX Runtime. Export configurations include quantization level selection, metadata embedding, and tokenizer bundling.",
        "answer": "No, TensorRT format is not currently supported. Ertas Studio supports export in three formats: GGUF, SafeTensors, and ONNX.",
        "answer_start": -1,
        "is_answerable": true,
        "metadata": {
          "source_document": "ertas-studio-docs/export.md",
          "domain": "product-documentation",
          "difficulty": "easy",
          "question_type": "factoid"
        }
      }
    ]
    QA examples covering factoid, reasoning, and negative-answer scenarios from product documentation

    Data Collection Guide

    Start with your existing documentation, knowledge base articles, FAQs, and internal wikis. For each document, generate 3-5 questions that the document answers, ranging in difficulty from simple factoid extraction to multi-step reasoning. Include questions where the answer requires synthesizing information from different parts of the passage, and questions where the context does not contain the answer (teaching the model to say "I don't know" rather than hallucinate).

    Involve domain experts in question generation. Ask subject matter experts to write the questions they would expect users to ask, rather than generating questions mechanically from text. Expert-written questions capture the natural language patterns and information needs of real users, producing a more effective training dataset than algorithmically generated questions.

    Include unanswerable questions in your dataset (15-20 percent of examples). These are questions where the provided context does not contain sufficient information to answer. The expected output should clearly state that the information is not available in the context. This is critical for RAG applications where the retriever may return marginally relevant passages that do not actually answer the user's question — the model must learn to recognize this and decline rather than fabricate an answer.
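
    One common way to create these negative examples is to pair an existing question with a context passage from a different source document and label it unanswerable. The sketch below shows that approach against the QAExample schema above, plus a check of the unanswerable share; the refusal wording and the sampling strategy are assumptions to adapt to your own style guide.

    typescript
    // Build an unanswerable example by pairing a question with a context taken
    // from a different source document. Assumes the QAExample interface above.
    function makeUnanswerable(source: QAExample, unrelated: QAExample): QAExample {
      return {
        question: source.question,
        context: unrelated.context, // passage that does not answer the question
        answer: "The provided context does not contain this information.", // refusal wording is illustrative
        is_answerable: false,
        metadata: {
          ...source.metadata,
          source_document: unrelated.metadata.source_document, // provenance follows the context
        },
      };
    }

    // Share of unanswerable examples; this guide targets roughly 0.15-0.20.
    function unanswerableShare(dataset: QAExample[]): number {
      if (dataset.length === 0) return 0;
      return dataset.filter((ex) => !ex.is_answerable).length / dataset.length;
    }
    Illustrative construction of unanswerable examples and a ratio check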

    Quality Criteria

    Verify answer correctness by having a second reviewer independently answer each question using only the provided context, then compare with the labeled answer. Disagreements should be adjudicated by a domain expert. For factoid questions, there should be exactly one correct answer derivable from the context. For reasoning questions, acceptable answer variations should be documented.
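
    To make the reviewer comparison measurable, SQuAD-style exact match and token-overlap F1 between the labeled answer and the second reviewer's answer are a reasonable starting point; the normalization rules below (lowercasing, stripping punctuation) are common conventions rather than a fixed standard. Pairs with low F1 are the ones worth routing to a domain expert for adjudication.

    typescript
    // Normalize an answer before comparison: lowercase, strip punctuation,
    // collapse whitespace. Mirrors common SQuAD-style evaluation conventions.
    function normalize(text: string): string {
      return text.toLowerCase().replace(/[^\w\s]/g, " ").replace(/\s+/g, " ").trim();
    }

    // Exact match after normalization.
    function exactMatch(labeled: string, reviewed: string): boolean {
      return normalize(labeled) === normalize(reviewed);
    }

    // Token-level F1: harmonic mean of precision and recall over shared tokens.
    function tokenF1(labeled: string, reviewed: string): number {
      const a = normalize(labeled).split(" ").filter(Boolean);
      const b = normalize(reviewed).split(" ").filter(Boolean);
      if (a.length === 0 || b.length === 0) return a.length === b.length ? 1 : 0;
      const counts = new Map<string, number>();
      for (const token of a) counts.set(token, (counts.get(token) ?? 0) + 1);
      let overlap = 0;
      for (const token of b) {
        const remaining = counts.get(token) ?? 0;
        if (remaining > 0) {
          overlap += 1;
          counts.set(token, remaining - 1);
        }
      }
      if (overlap === 0) return 0;
      const precision = overlap / b.length;
      const recall = overlap / a.length;
      return (2 * precision * recall) / (precision + recall);
    }
    SQuAD-style agreement metrics for comparing labeled and reviewer answers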

    Ensure question diversity across question types (what, how, why, when, which, comparison, procedural), difficulty levels, and source documents. A dataset dominated by simple factoid questions will not train a model capable of handling the complex reasoning questions that users ask in practice. Aim for a distribution of roughly 40 percent easy, 40 percent medium, and 20 percent hard questions.
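
    A quick report over the metadata fields makes drift from those targets easy to spot. The tally below assumes the QAExample schema defined earlier; the 40/40/20 split above is the reference point.

    typescript
    // Tally examples by difficulty and question type, returning fractions so the
    // result can be compared directly against the target distribution
    // (roughly 0.40 easy, 0.40 medium, 0.20 hard).
    function distributionReport(dataset: QAExample[]): Record<string, number> {
      if (dataset.length === 0) return {};
      const counts: Record<string, number> = {};
      for (const ex of dataset) {
        for (const key of [
          `difficulty:${ex.metadata.difficulty}`,
          `type:${ex.metadata.question_type}`,
        ]) {
          counts[key] = (counts[key] ?? 0) + 1;
        }
      }
      const fractions: Record<string, number> = {};
      for (const [key, count] of Object.entries(counts)) {
        fractions[key] = count / dataset.length;
      }
      return fractions;
    }
    Distribution report over difficulty and question type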

    Validate that context passages are self-contained — the answer should be derivable from the provided context alone without requiring external knowledge. If a question can only be answered by combining the context with domain expertise not present in the text, either expand the context to include the necessary information or reclassify the question as a closed-book QA example.
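
    Full self-containment still needs human review, but the extractive part can be checked mechanically: when an answer is a verbatim span of the context, answer_start should point at that span. The sketch below applies only a bounds check to abstractive answers (like the samples above) and treats a missing or negative answer_start as non-extractive.

    typescript
    // Mechanical checks on extractive alignment. For answers that appear verbatim
    // in the context, answer_start should point at that span; for abstractive
    // answers only a bounds check applies. A missing or negative answer_start
    // marks a non-extractive example, as in the samples above.
    function checkSpan(ex: QAExample): string[] {
      const problems: string[] = [];
      if (ex.answer_start === undefined || ex.answer_start < 0) return problems;
      if (ex.answer_start >= ex.context.length) {
        problems.push("answer_start falls outside the context");
      }
      if (ex.context.includes(ex.answer) && !ex.context.startsWith(ex.answer, ex.answer_start)) {
        problems.push("answer occurs in the context but not at answer_start");
      }
      return problems;
    }
    Span-alignment checks for extractive examples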

    Using This Template with Ertas

    Import your documentation corpus into Ertas Data Suite. If documentation contains customer-specific information, internal URLs, or employee names, apply PII redaction before using the text as training contexts. The data lineage system tracks which source documents contributed to each QA pair, maintaining provenance back to the original knowledge base.

    Export QA pairs in Alpaca format (instruction: question, input: context, output: answer) or JSONL for fine-tuning in Ertas Studio. After training, the GGUF-exported model can power on-premise RAG systems that process queries locally without sending user questions to external APIs.
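
    As a concrete reference for the Alpaca mapping above (instruction: question, input: context, output: answer), the sketch below writes one JSON object per line. The output file name is illustrative, and the exact keys Ertas Studio expects should be confirmed against its import documentation.

    typescript
    import { writeFileSync } from "node:fs";

    // Map QA examples to Alpaca-style records and write them as JSONL,
    // one JSON object per line.
    function exportAlpacaJsonl(dataset: QAExample[], path: string): void {
      const lines = dataset.map((ex) =>
        JSON.stringify({
          instruction: ex.question,
          input: ex.context,
          output: ex.answer,
        })
      );
      writeFileSync(path, lines.join("\n") + "\n", "utf8");
    }

    // Example usage with an illustrative file name:
    // exportAlpacaJsonl(dataset, "qa-train.jsonl");
    Alpaca-format JSONL export for fine-tuning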

    Recommended Model

    For QA systems, a 7B-8B parameter model provides the right balance of comprehension capability and inference speed. Models at this scale handle context windows of 4,000-8,000 tokens effectively, covering most document passages. For QA systems that must process very long documents, consider a model trained with extended context support.
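
    Before committing to a model, it can help to screen context lengths against the target window. Exact counts depend on the tokenizer, so the four-characters-per-token ratio below is only a rough heuristic for English text.

    typescript
    // Flag contexts that may not fit the target window. charsPerToken is a rough
    // heuristic (about 4 for English prose); use the actual tokenizer for
    // precise counts.
    function oversizedContexts(
      dataset: QAExample[],
      maxTokens = 4000,
      charsPerToken = 4
    ): QAExample[] {
      return dataset.filter((ex) => ex.context.length / charsPerToken > maxTokens);
    }
    Rough screen for context passages that may exceed the model's context window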

    For high-throughput extractive QA (selecting exact answer spans), encoder-based models like DeBERTa fine-tuned on SQuAD-style data provide faster inference. For generative QA with explanations, a fine-tuned 7B model exported to GGUF gives the best quality-speed balance.
