Text Classification Dataset Template
Template for building general-purpose text classification datasets covering topic categorization, intent detection, and content moderation.
Overview
Text classification is the foundational NLP task of assigning predefined categories to text documents. Applications include email routing (spam, primary, promotions, social), support ticket categorization (billing, technical, account, feature request), content moderation (safe, flagged, review needed), document classification (contracts, invoices, reports, correspondence), and intent detection in conversational AI. A well-constructed text classification dataset enables organizations to automate these categorization tasks with high accuracy.
The flexibility of text classification makes the dataset design critical. The taxonomy of categories must be well-defined, mutually exclusive (for single-label classification) or clearly specified (for multi-label), and comprehensive enough to cover all expected inputs. Ambiguous category boundaries are the primary cause of poor classification performance — if human annotators cannot consistently agree on the correct label, the model will not be able to either.
Text classification datasets can be structured for different modeling approaches. For traditional ML or encoder-based models (BERT, DeBERTa), the dataset consists of text-label pairs. For LLM-based classification, the dataset uses an instruction format where the model is prompted to classify the text and explain its reasoning. The LLM approach has the advantage of producing explanations alongside predictions, but at the cost of higher inference latency. Choose the approach that matches your deployment requirements.
Dataset Schema
// Single-label classification
interface TextClassificationExample {
  text: string;
  label: string;
  confidence?: number;
  metadata?: {
    source: string;
    annotator_agreement: number;
    word_count: number;
  };
}

// Multi-label classification
interface MultiLabelExample {
  text: string;
  labels: string[];
  metadata?: {
    source: string;
    primary_label: string;
  };
}

// LLM instruction format for classification with explanation
interface LLMClassificationExample {
  instruction: string;
  input: string;
  output: string; // Contains both label and reasoning
}

Sample Data
[
  {
    "text": "I can't log into my account even after resetting my password three times. The reset email comes through but the new password doesn't work. I've tried different browsers and clearing cookies.",
    "label": "technical_issue",
    "confidence": 0.95,
    "metadata": {"source": "support_tickets", "annotator_agreement": 1.0, "word_count": 36}
  },
  {
    "text": "Can you explain the difference between the Pro and Enterprise plans? We're a team of 50 and need to understand what additional features we'd get with the upgrade.",
    "label": "sales_inquiry",
    "confidence": 0.92,
    "metadata": {"source": "support_tickets", "annotator_agreement": 0.9, "word_count": 30}
  },
  {
    "text": "I was charged $49.99 on March 3rd but I cancelled my subscription on February 28th. Please process a refund for this charge.",
    "label": "billing_issue",
    "confidence": 0.97,
    "metadata": {"source": "support_tickets", "annotator_agreement": 1.0, "word_count": 25}
  },
  {
    "text": "It would be really helpful if the dashboard had a dark mode option. Several people on our team work late hours and the bright interface causes eye strain.",
    "label": "feature_request",
    "confidence": 0.91,
    "metadata": {"source": "support_tickets", "annotator_agreement": 0.85, "word_count": 29}
  },
  {
    "text": "Just wanted to say thanks to your support team — Sarah resolved my issue in under 5 minutes. Great service!",
    "label": "positive_feedback",
    "confidence": 0.98,
    "metadata": {"source": "support_tickets", "annotator_agreement": 1.0, "word_count": 21}
  }
]

Data Collection Guide
Start by defining your category taxonomy through analysis of existing data. Export a sample of 500-1,000 documents, have domain experts manually group them into natural categories, and iterate on the taxonomy until categories are clear and comprehensive. Common mistakes include creating too many fine-grained categories (leading to annotation confusion and sparse training data) or too few broad categories (limiting the practical utility of classification). Most applications work well with 5-20 categories.
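One way to keep the taxonomy honest is to capture it as a typed structure so that label names, annotator-facing definitions, and tie-breaking priority rules live in one place. The sketch below uses the categories from the sample data; the definitions and priority ordering are illustrative assumptions, not prescriptions.

// Hypothetical taxonomy definition for a support-ticket classifier.
// Definitions and priorities are illustrative; adapt them to your domain.
interface Category {
  label: string;      // Machine-readable label used in the dataset
  definition: string; // One-sentence definition shown to annotators
  priority: number;   // Lower number wins when a text fits multiple categories
}

const taxonomy: Category[] = [
  { label: "billing_issue", definition: "Charges, refunds, invoices, or payment problems.", priority: 1 },
  { label: "technical_issue", definition: "Product malfunctions, errors, or access failures.", priority: 2 },
  { label: "sales_inquiry", definition: "Questions about plans, pricing, or upgrades.", priority: 3 },
  { label: "feature_request", definition: "Suggestions for new or changed functionality.", priority: 4 },
  { label: "positive_feedback", definition: "Praise or thanks with no action required.", priority: 5 },
];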
Write detailed annotation guidelines with clear definitions, multiple examples, and explicit guidance for edge cases. For each category, provide 3-5 clear examples and 2-3 borderline examples with explanations of why they belong in that category. Document which category should be assigned when a text could reasonably fit multiple categories (decision priority rules). Test your guidelines by having 3-5 annotators label the same 100 documents and measuring agreement before scaling to the full dataset.
For LLM-based classification, convert your labeled examples into instruction format. The instruction describes the classification task and lists the available categories. The input contains the text to classify. The output contains the predicted label and a 2-3 sentence explanation of the reasoning. Having domain experts write the reasoning for the initial examples establishes the explanation quality standard for the dataset.
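A minimal conversion sketch, assuming the TextClassificationExample and LLMClassificationExample interfaces from the schema above and a reasoning string written by a domain expert:

// Convert a labeled example plus expert-written reasoning into the
// instruction format. The prompt wording here is one possible phrasing.
function toInstructionFormat(
  example: TextClassificationExample,
  categories: string[],
  reasoning: string
): LLMClassificationExample {
  return {
    instruction:
      "Classify the following support message into exactly one category: " +
      categories.join(", ") +
      ". Respond with the category label followed by a brief explanation.",
    input: example.text,
    output: `${example.label}\n\nReasoning: ${reasoning}`,
  };
}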
Quality Criteria
Inter-annotator agreement (measured by Cohen's kappa for two annotators or Fleiss' kappa for three or more) should exceed 0.80 for the dataset to be considered high quality. Categories with lower agreement should be revisited — either the guidelines need clarification or the categories need to be merged or split. Track per-category agreement to identify specific taxonomy problems.
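For two annotators, Cohen's kappa can be computed directly from the two label sequences: observed agreement corrected for the agreement expected by chance from each annotator's label marginals. A minimal sketch:

// Cohen's kappa for two annotators labeling the same documents.
function cohensKappa(a: string[], b: string[]): number {
  if (a.length !== b.length || a.length === 0) throw new Error("mismatched inputs");
  const n = a.length;
  const labels = Array.from(new Set([...a, ...b]));

  // Observed agreement: fraction of documents where both annotators agree.
  const po = a.filter((label, i) => label === b[i]).length / n;

  // Expected agreement: sum over labels of the product of each
  // annotator's marginal probability of assigning that label.
  let pe = 0;
  for (const label of labels) {
    const pa = a.filter((x) => x === label).length / n;
    const pb = b.filter((x) => x === label).length / n;
    pe += pa * pb;
  }
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}

Running per-category agreement (kappa restricted to documents where either annotator chose that category) is the quickest way to locate the taxonomy problems mentioned above.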
Class balance significantly affects model performance. While perfect balance is rarely achievable or desirable (it may not reflect real-world distributions), extreme imbalances (one category with 5,000 examples and another with 50) will cause the model to perform poorly on minority classes. Collect additional examples for underrepresented categories, apply oversampling techniques, or use class-weighted loss during training. Aim for no category having fewer than 100-200 examples.
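One common remedy is inverse-frequency class weights for a weighted loss; a sketch:

// Inverse-frequency class weights: weight_c = n_total / (n_classes * n_c).
// Rare classes get weights above 1, dominant classes below 1.
function classWeights(labels: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const label of labels) counts.set(label, (counts.get(label) ?? 0) + 1);
  const weights = new Map<string, number>();
  for (const [label, count] of counts) {
    weights.set(label, labels.length / (counts.size * count));
  }
  return weights;
}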
Remove or relabel ambiguous examples where even the annotation guidelines do not clearly indicate the correct category. These examples add noise to the training data without providing useful signal. Better to have 3,000 unambiguous examples than 5,000 examples with 30 percent noise.
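If agreement scores are stored in the metadata field from the schema, the filter is a one-liner; the 0.8 cutoff below is an assumption to tune against your own guidelines:

// Keep only examples whose annotator agreement meets the threshold.
// The 0.8 default is illustrative, not a universal standard.
function dropAmbiguous(
  examples: TextClassificationExample[],
  minAgreement = 0.8
): TextClassificationExample[] {
  return examples.filter(
    (e) => (e.metadata?.annotator_agreement ?? 0) >= minAgreement
  );
}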
Using This Template with Ertas
Import your raw text documents into Ertas Data Suite for PII redaction — support tickets and customer communications frequently contain names, account numbers, and contact details. After redaction, export for annotation, then re-import the labeled dataset for final quality checks and format conversion.
Export in JSONL or CSV format depending on your model architecture. For encoder-based classification, CSV with text and label columns works well. For LLM-based classification, export in Alpaca format with instruction-input-output structure. Ertas Studio handles both training approaches with automatic format validation.
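A minimal JSONL writer for the instruction-format records defined above, assuming a Node.js environment (fs module):

import { writeFileSync } from "fs";

// Serialize instruction-format examples as JSONL: one JSON object per line.
function exportJsonl(examples: LLMClassificationExample[], path: string): void {
  const lines = examples.map((e) => JSON.stringify(e)).join("\n");
  writeFileSync(path, lines + "\n", "utf-8");
}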
Recommended Model
For production text classification with high throughput requirements, fine-tune a BERT-class encoder model (DeBERTa-v3-base or RoBERTa-base). These models classify text in 5-10 milliseconds on CPU, making them suitable for real-time applications processing thousands of documents per minute.
For classification with explanations or when your taxonomy may evolve frequently, fine-tune a 7B generative model. The instruction-following capability allows you to modify the classification taxonomy through prompt changes without retraining, providing flexibility at the cost of higher inference latency. GGUF export at Q4_K_M keeps inference fast enough for most batch processing workflows.