Named Entity Recognition Dataset Template

    Template for building NER datasets to train models that identify and classify named entities in domain-specific text.

    NLP

    Overview

    Named Entity Recognition (NER) datasets train AI models to identify and classify spans of text that refer to real-world entities — people, organizations, locations, dates, monetary values, products, medical terms, legal references, and other domain-specific entity types. NER is a foundational NLP capability that powers information extraction, knowledge graph construction, document indexing, PII detection, and numerous downstream applications.

    While pre-trained NER models handle common entity types (person, organization, location) reasonably well, most enterprise applications require recognition of domain-specific entity types that general models miss. Financial NER must identify ticker symbols, regulatory bodies, financial instruments, and filing types. Legal NER must recognize case citations, statute references, court names, and legal terms of art. Healthcare NER must identify drug names, dosages, anatomical terms, and clinical procedures. These specialized entity types require domain-specific training data.

    NER datasets use the BIO (or IOB2) tagging scheme at the token level: B-ENTITY marks the beginning of an entity span, I-ENTITY marks continuation tokens within an entity, and O marks tokens that are not part of any entity. More expressive schemes like BIOES (adding Single for single-token entities and End for the last token of multi-token entities) can improve model performance by providing richer boundary information. The choice of tagging scheme should be consistent across your entire dataset.
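
    As a concrete illustration, the sketch below converts a well-formed BIO sequence to BIOES; the function name and the assumption that every I- tag follows a matching B- or I- tag of the same type are ours, not part of any particular library.

    typescript
    // Sketch: convert a valid BIO tag sequence to BIOES.
    // Assumes tags are "O", "B-TYPE", or "I-TYPE" and the sequence is well-formed.
    function bioToBioes(tags: string[]): string[] {
      return tags.map((tag, i) => {
        if (tag === "O") return tag;
        const next = tags[i + 1] ?? "O";
        const type = tag.slice(2);
        const continues = next === `I-${type}`;
        if (tag.startsWith("B-")) return continues ? tag : `S-${type}`; // single-token entity
        return continues ? tag : `E-${type}`;                           // last token of a multi-token entity
      });
    }
    
    // ["B-ORG", "I-ORG", "O", "B-MONETARY"] -> ["B-ORG", "E-ORG", "O", "S-MONETARY"]
    Illustrative BIO-to-BIOES conversion, assuming a well-formed input tag sequence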

    Dataset Schema

    typescript
    // Token-level NER format
    interface NERExample {
      tokens: string[];
      ner_tags: string[];     // BIO-tagged labels aligned with tokens
      metadata?: {
        source: string;
        domain: string;
        sentence_id: string;
      };
    }
    
    // Span-level NER format (alternative)
    interface SpanNERExample {
      text: string;
      entities: {
        start: number;       // Character offset start
        end: number;         // Character offset end (exclusive)
        label: string;       // Entity type
        text: string;        // Entity surface form
      }[];
    }
    
    // Entity type definitions
    interface EntitySchema {
      types: {
        name: string;        // e.g., "MEDICATION"
        description: string;
        examples: string[];
      }[];
    }
    NER dataset schemas: token-level BIO format, span-level format, and entity type definitions

    Sample Data

    json
    [
      {
        "tokens": ["Dr.", "Sarah", "Chen", "prescribed", "metformin", "500mg", "twice", "daily", "for", "type", "2", "diabetes", "at", "Memorial", "General", "Hospital", "."],
        "ner_tags": ["O", "B-PROVIDER", "I-PROVIDER", "O", "B-MEDICATION", "B-DOSAGE", "B-FREQUENCY", "I-FREQUENCY", "O", "B-CONDITION", "I-CONDITION", "I-CONDITION", "O", "B-FACILITY", "I-FACILITY", "I-FACILITY", "O"],
        "metadata": {"source": "clinical_notes", "domain": "healthcare", "sentence_id": "clinical_001"}
      },
      {
        "tokens": ["Apple", "Inc.", "reported", "Q4", "revenue", "of", "$89.5", "billion", ",", "exceeding", "Wall", "Street", "estimates", "by", "3.2%", "."],
        "ner_tags": ["B-ORG", "I-ORG", "O", "B-FISCAL_PERIOD", "O", "O", "B-MONETARY", "I-MONETARY", "O", "O", "B-ORG", "I-ORG", "O", "O", "B-PERCENTAGE", "O"],
        "metadata": {"source": "financial_news", "domain": "finance", "sentence_id": "finance_001"}
      },
      {
        "text": "The court cited Brown v. Board of Education, 347 U.S. 483 (1954) in its ruling on the equal protection claim filed in the Southern District of New York.",
        "entities": [
          {"start": 16, "end": 65, "label": "CASE_CITATION", "text": "Brown v. Board of Education, 347 U.S. 483 (1954)"},
          {"start": 100, "end": 122, "label": "LEGAL_CONCEPT", "text": "equal protection claim"},
          {"start": 137, "end": 164, "label": "COURT", "text": "Southern District of New York"}
        ]
      }
    ]
    NER examples from healthcare (clinical entities), finance (fiscal entities), and legal (case citations) domains

    Data Collection Guide

    Define your entity type schema before annotation begins. For each entity type, document: the type name, a clear definition, 5-10 examples of varying complexity, boundary rules (should "Dr." be included in a PROVIDER entity? should currency symbols be included in MONETARY values?), and nesting rules (can entities overlap or be nested?). Ambiguous boundary definitions are the primary source of annotation inconsistency.
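
    For example, a hypothetical MEDICATION entry conforming to the EntitySchema interface above might encode its boundary and nesting rules directly in the description; the specific rules shown are illustrative, not prescriptive.

    typescript
    // Hypothetical entry conforming to EntitySchema; the rules in the description are illustrative.
    const medicationType: EntitySchema["types"][number] = {
      name: "MEDICATION",
      description:
        "Generic or brand-name drugs. Exclude dosage, route, and frequency (annotated separately " +
        "as DOSAGE and FREQUENCY); MEDICATION spans may not nest inside other entities.",
      examples: ["metformin", "Lipitor", "amoxicillin", "insulin glargine", "ibuprofen"],
    };
    Example entity type definition with boundary and nesting rules documented in the description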

    Select annotation tooling that supports your chosen format and provides efficient workflows for entity marking. Tools like Prodigy, Label Studio, BRAT, and Doccano support span-level annotation with conversion to BIO format. For high-volume annotation, consider active learning workflows where the model identifies uncertain predictions for human review, focusing annotator effort on the most informative examples.
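
    The conversion from span annotations to BIO tags is mechanical once tokenization is fixed. Below is a minimal sketch assuming whitespace tokenization, end-exclusive character offsets, and entity boundaries that align with token boundaries; real pipelines should tokenize with the target model's tokenizer.

    typescript
    // Sketch: convert character-offset spans (SpanNERExample) to token-level BIO tags.
    // Assumes whitespace tokens, non-overlapping entities, and end-exclusive offsets.
    function spansToBio(example: SpanNERExample): NERExample {
      const tokens: string[] = [];
      const starts: number[] = [];
      const wordPattern = /\S+/g;
      let match: RegExpExecArray | null;
      while ((match = wordPattern.exec(example.text)) !== null) {
        tokens.push(match[0]);
        starts.push(match.index);
      }
      const ner_tags = tokens.map((token, i) => {
        const tokenStart = starts[i];
        const tokenEnd = tokenStart + token.length;
        const entity = example.entities.find(e => tokenStart >= e.start && tokenEnd <= e.end);
        if (!entity) return "O";
        return tokenStart === entity.start ? `B-${entity.label}` : `I-${entity.label}`;
      });
      return { tokens, ner_tags };
    }
    Illustrative span-to-BIO conversion under whitespace tokenization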

    Pre-annotate text with an existing NER model and have annotators correct the predictions rather than annotating from scratch; correction is significantly faster than fully manual annotation, typically reducing annotation time by 40-60 percent. Ensure annotators correct both false positives (incorrectly identified entities) and false negatives (missed entities) to avoid biasing the corrected dataset toward the pre-annotation model's error patterns.

    Quality Criteria

    Measure inter-annotator agreement at the entity level using F1 score between annotator pairs. Two annotators should independently annotate the same 200-300 sentences, and entity-level F1 should exceed 0.85 for the dataset to be considered reliable. For complex entity types (legal citations, medical procedures), lower agreement thresholds of 0.75-0.80 may be acceptable, though agreement that low signals a need for more detailed annotation guidelines.
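
    One way to compute pairwise entity-level F1 is to treat one annotator's spans as predictions and the other's as references, counting an entity as matched only when start, end, and label are all identical. The sketch below follows that convention; the type and function names are ours.

    typescript
    // Sketch: entity-level F1 between two annotators over the same text.
    // A span matches only if start, end, and label are all identical.
    type AnnotatedSpan = { start: number; end: number; label: string };
    
    function entityLevelF1(annotatorA: AnnotatedSpan[], annotatorB: AnnotatedSpan[]): number {
      const key = (s: AnnotatedSpan) => `${s.start}:${s.end}:${s.label}`;
      const bKeys = new Set(annotatorB.map(key));
      const matches = annotatorA.filter(s => bKeys.has(key(s))).length;
      const precision = annotatorA.length ? matches / annotatorA.length : 0;
      const recall = annotatorB.length ? matches / annotatorB.length : 0;
      return precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
    }
    Pairwise entity-level F1 with exact span and label matching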

    Validate entity boundary consistency. Check that annotators are consistent about whether to include titles (Dr., Mr.), suffixes (Inc., LLC), and delimiters in entity spans. Boundary inconsistency degrades model performance significantly because the model receives conflicting signals about where entities begin and end. Run automated consistency checks comparing entity spans across similar contexts.
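
    One simple automated check, sketched below, flags surface strings of the same entity type where one annotation is a prefix or suffix extension of another (for example "Sarah Chen" versus "Dr. Sarah Chen" as PROVIDER), which usually signals inconsistent boundary decisions. The heuristic and function name are illustrative.

    typescript
    // Sketch: flag annotated surface forms of the same label where one extends the other,
    // e.g. "Sarah Chen" vs "Dr. Sarah Chen" both labeled PROVIDER.
    function boundaryConflicts(examples: SpanNERExample[]): [string, string][] {
      const surfacesByLabel = new Map<string, Set<string>>();
      for (const ex of examples) {
        for (const entity of ex.entities) {
          const surfaces = surfacesByLabel.get(entity.label) ?? new Set<string>();
          surfaces.add(entity.text);
          surfacesByLabel.set(entity.label, surfaces);
        }
      }
      const conflicts: [string, string][] = [];
      for (const surfaces of surfacesByLabel.values()) {
        const list = [...surfaces];
        for (const shorter of list) {
          for (const longer of list) {
            if (shorter !== longer && (longer.endsWith(` ${shorter}`) || longer.startsWith(`${shorter} `))) {
              conflicts.push([shorter, longer]);
            }
          }
        }
      }
      return conflicts;
    }
    Heuristic boundary-consistency check over annotated surface forms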

    Ensure adequate representation of each entity type. Rare entity types need a minimum of 200-300 annotated instances for the model to learn reliable recognition patterns. If certain entity types appear rarely in natural text, seek out documents where they appear frequently or create synthetic examples that include them in realistic contexts. Track entity type frequency and flag any types with fewer than 100 examples for targeted augmentation.
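
    Frequency tracking is straightforward from the token-level format, since each B- tag opens exactly one entity instance. A minimal sketch follows; the threshold default and function name are ours.

    typescript
    // Sketch: count entity instances per type (each B- tag starts one instance)
    // and flag types that fall below a review threshold.
    function flagRareEntityTypes(examples: NERExample[], threshold = 100): string[] {
      const counts = new Map<string, number>();
      for (const ex of examples) {
        for (const tag of ex.ner_tags) {
          if (tag.startsWith("B-")) {
            const type = tag.slice(2);
            counts.set(type, (counts.get(type) ?? 0) + 1);
          }
        }
      }
      return [...counts].filter(([, count]) => count < threshold).map(([type]) => type);
    }
    Entity type frequency check that flags under-represented types for targeted augmentation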

    Using This Template with Ertas

    Import your raw text corpus into Ertas Data Suite for PII assessment — ironically, NER training data for PII detection must itself be handled carefully, as the source text contains real personal information. Use the data lineage system to track which documents have been annotated, by whom, and what stage of quality review they have completed. Export prepared text for annotation in your chosen tool.

    After annotation, re-import the labeled dataset for format conversion and final quality validation. Export in CoNLL format for token classification model training or in JSONL format for LLM-based NER fine-tuning. The on-premise processing ensures that sensitive source text never leaves your controlled environment during the annotation pipeline.
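
    For reference, the CoNLL-style token classification layout is simply one token and its tag per line, with a blank line between sentences; the sketch below serializes the token-level format that way, omitting any extra metadata columns.

    typescript
    // Sketch: serialize token-level examples to a simple two-column CoNLL-style format
    // ("token<TAB>tag" per line, sentences separated by a blank line).
    function toConll(examples: NERExample[]): string {
      return examples
        .map(ex => ex.tokens.map((token, i) => `${token}\t${ex.ner_tags[i]}`).join("\n"))
        .join("\n\n");
    }
    Minimal two-column CoNLL-style export of token-level examples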

    Recommended Model

    For production NER with high throughput, fine-tune an encoder model (BERT, DeBERTa, or RoBERTa) for token classification. These models process text at thousands of tokens per second on CPU and provide state-of-the-art NER performance when fine-tuned on domain-specific data. DeBERTa-v3-base is a strong default choice for English NER.

    For NER tasks requiring flexibility to add new entity types without retraining, fine-tune a 7B generative model with instruction-based NER prompts. The model can be instructed to identify specific entity types through the prompt, allowing entity schema changes without model retraining. Export to GGUF for local inference.
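
    A hypothetical shape for such a prompt is shown below, with the entity types supplied at inference time so the schema can change without retraining; the exact wording and output format are assumptions, not a fixed API.

    typescript
    // Hypothetical instruction-style NER prompt; entity types are passed in at inference time.
    function buildNerPrompt(text: string, entityTypes: string[]): string {
      return [
        `Extract all entities of the following types: ${entityTypes.join(", ")}.`,
        `Return a JSON array of {"text": ..., "label": ...} objects. Return [] if no entities are found.`,
        ``,
        `Text: ${text}`,
      ].join("\n");
    }
    Illustrative instruction prompt for generative NER with a runtime-configurable entity schema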
