Medical Notes Classification Dataset Template

Template for building datasets that train AI models to classify clinical notes by diagnosis category, urgency level, and department routing.

Overview

Medical notes classification datasets train AI models to categorize clinical documentation — including physician notes, discharge summaries, radiology reports, and nursing assessments — by medical specialty, diagnosis category, urgency level, and appropriate department routing. These datasets enable healthcare organizations to automate the triage and organization of clinical documentation, improving workflow efficiency while ensuring that critical findings receive timely attention.

The unique challenge of medical notes classification is the domain-specific language of clinical documentation. Physicians use abbreviations (SOB for shortness of breath, PRN for as needed, BID for twice daily), medical terminology, and structured documentation patterns (SOAP notes, H&P format) that general-purpose language models may not handle well without fine-tuning. Training data must capture these clinical language patterns while covering the full spectrum of medical specialties and documentation types.

Data privacy is a paramount concern for medical training datasets. Clinical notes contain protected health information (PHI) under HIPAA, so the dataset must be thoroughly de-identified before it is used for model training. The HIPAA Safe Harbor method requires removing 18 specific identifier types, while Expert Determination requires a qualified expert in statistical methods to determine that the re-identification risk is very small. The de-identification process must be documented and auditable, which is why on-premise data processing with comprehensive audit trails is a strong fit for compliance.

Dataset Schema

typescript

interface MedicalNoteExample {
  text: string;          // De-identified clinical note text
  labels: {
    specialty: string;   // e.g., "cardiology", "pulmonology", "orthopedics"
    urgency: "routine" | "urgent" | "emergent";
    note_type: "progress_note" | "discharge_summary" | "consult" | "procedure" | "radiology";
    icd10_category?: string;  // Primary ICD-10 code or category, e.g., "J44.1" or "M17"
  };
  metadata: {
    word_count: number;
    has_medications: boolean;
    has_lab_values: boolean;
    de_identification_method: "safe_harbor" | "expert_determination";
  };
}

Schema for medical notes classification with specialty, urgency, and note type labels
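
The interface above is only checked at compile time, so records loaded from JSON at runtime are not covered by it. A minimal sketch of a runtime check follows, assuming the allowed values listed in the schema; `validateExample` and `isOneOf` are hypothetical helper names, not part of any Ertas API.

```typescript
// Allowed values, copied from the MedicalNoteExample schema.
const URGENCY_LEVELS = ["routine", "urgent", "emergent"];
const NOTE_TYPES = ["progress_note", "discharge_summary", "consult", "procedure", "radiology"];
const DEID_METHODS = ["safe_harbor", "expert_determination"];

// Hypothetical helper: true when `value` is one of the allowed strings.
function isOneOf(value: unknown, allowed: string[]): boolean {
  return typeof value === "string" && allowed.indexOf(value) !== -1;
}

// Hypothetical validator: returns a list of problems; an empty list means the record passed.
function validateExample(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["record is not an object"];
  }
  const record = value as Record<string, any>;
  const labels = record.labels || {};
  const metadata = record.metadata || {};
  const problems: string[] = [];

  if (typeof record.text !== "string" || record.text.trim() === "") {
    problems.push("text must be a non-empty string");
  }
  if (typeof labels.specialty !== "string" || labels.specialty === "") {
    problems.push("labels.specialty must be a non-empty string");
  }
  if (!isOneOf(labels.urgency, URGENCY_LEVELS)) {
    problems.push("labels.urgency is not an allowed value");
  }
  if (!isOneOf(labels.note_type, NOTE_TYPES)) {
    problems.push("labels.note_type is not an allowed value");
  }
  if (labels.icd10_category !== undefined && typeof labels.icd10_category !== "string") {
    problems.push("labels.icd10_category must be a string when present");
  }
  const wc = metadata.word_count;
  if (typeof wc !== "number" || !isFinite(wc) || wc < 0 || Math.floor(wc) !== wc) {
    problems.push("metadata.word_count must be a non-negative integer");
  }
  if (typeof metadata.has_medications !== "boolean") {
    problems.push("metadata.has_medications must be a boolean");
  }
  if (typeof metadata.has_lab_values !== "boolean") {
    problems.push("metadata.has_lab_values must be a boolean");
  }
  if (!isOneOf(metadata.de_identification_method, DEID_METHODS)) {
    problems.push("metadata.de_identification_method is not an allowed value");
  }
  return problems;
}
```

Returning a list of problems instead of throwing makes it easier to report every issue in a record at once during dataset review.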

Sample Data

json

[
  {
    "text": "DISCHARGE SUMMARY\n\nPatient: [REDACTED], Age: 67, Sex: M\nAdmitting Diagnosis: Acute exacerbation of COPD\nDischarge Diagnosis: Acute exacerbation of COPD with community-acquired pneumonia\n\nHPI: Patient presented to ED with 3-day history of worsening dyspnea, productive cough with yellow-green sputum, and low-grade fever (100.4F). History of COPD Gold Stage III, former smoker (45 pack-years, quit 2019). On home O2 2L NC.\n\nHospital Course: Admitted to general medicine. Started on IV levofloxacin 750mg daily and methylprednisolone 125mg IV q8h. Chest X-ray showed RLL infiltrate consistent with pneumonia. Blood cultures negative. Transitioned to oral prednisone taper and oral levofloxacin on day 3. O2 requirements normalized to baseline by day 4.\n\nDischarge Medications: Prednisone 40mg taper over 10 days, Levofloxacin 750mg PO daily x 4 remaining days, Continue home medications including tiotropium and albuterol PRN.\n\nFollow-up: PCP in 1 week, Pulmonology in 2 weeks.",
    "labels": {
      "specialty": "pulmonology",
      "urgency": "urgent",
      "note_type": "discharge_summary",
      "icd10_category": "J44.1"
    },
    "metadata": {
      "word_count": 168,
      "has_medications": true,
      "has_lab_values": false,
      "de_identification_method": "safe_harbor"
    }
  },
  {
    "text": "PROGRESS NOTE\n\nSubjective: Patient reports improvement in left knee pain since starting physical therapy 3 weeks ago. Pain now 3/10 at rest, 5/10 with activity, down from 7/10 at initial visit. Able to walk 20 minutes without significant discomfort. Denies swelling, locking, or giving way.\n\nObjective: Left knee ROM: flexion 125 degrees (was 110), extension full. No effusion. Stable to varus/valgus stress. Negative McMurray. Quad strength 4+/5 (was 4/5).\n\nAssessment: Left knee osteoarthritis, improving with conservative management.\n\nPlan: Continue PT 2x/week for 4 more weeks. May advance to low-impact exercise (swimming, cycling). Follow up in 6 weeks. If plateau in progress, consider intra-articular injection.",
    "labels": {
      "specialty": "orthopedics",
      "urgency": "routine",
      "note_type": "progress_note",
      "icd10_category": "M17"
    },
    "metadata": {
      "word_count": 132,
      "has_medications": false,
      "has_lab_values": false,
      "de_identification_method": "safe_harbor"
    }
  }
]

De-identified clinical note examples for pulmonology discharge summary and orthopedics progress note

Data Collection Guide

Source clinical notes from your organization's electronic health record (EHR) system with appropriate IRB approval and HIPAA compliance. Work with your compliance team to establish a data use agreement that permits the use of de-identified clinical notes for AI model training. Extract notes across all relevant specialties, note types, and urgency levels to build a representative dataset.

De-identification is the most critical step. Use automated NLP-based de-identification tools to detect and remove all 18 HIPAA Safe Harbor identifiers: names, geographic subdivisions smaller than a state, dates more specific than the year (plus ages over 89), phone numbers, fax numbers, email addresses, Social Security numbers, medical record numbers, health plan beneficiary numbers, account numbers, certificate/license numbers, vehicle identifiers, device identifiers, URLs, IP addresses, biometric identifiers, full-face photographs, and any other unique identifying number, characteristic, or code. After automated de-identification, manually review a sample (10-20 percent) to verify that the automated system caught all identifiers.
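
One way to draw the manual review sample so that it can be reproduced for an audit is to shuffle note indices with a seeded generator and record the seed. The sketch below assumes random sampling is acceptable for your review process; `seededRandom` and `pickReviewSample` are hypothetical names, and the generator is a simple linear congruential one, not suitable for cryptographic use.

```typescript
// Linear congruential generator (Numerical Recipes constants), so the sample
// can be reproduced from a recorded seed. Not suitable for cryptographic use.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return function (): number {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Hypothetical helper: picks ceil(total * fraction) distinct note indices for manual review.
function pickReviewSample(total: number, fraction: number, seed: number): number[] {
  const random = seededRandom(seed);
  const indices: number[] = [];
  for (let i = 0; i < total; i++) {
    indices.push(i);
  }
  // Fisher-Yates shuffle driven by the seeded generator.
  for (let i = total - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = indices[i];
    indices[i] = indices[j];
    indices[j] = tmp;
  }
  const size = Math.min(total, Math.ceil(total * fraction));
  // Return the chosen indices in ascending order for easier lookup.
  return indices.slice(0, size).sort(function (a, b) { return a - b; });
}

// Usage: select 15 percent of 2,000 notes; the seed value here is arbitrary.
const reviewIndices = pickReviewSample(2000, 0.15, 20240101);
```

Calling the function again with the same total, fraction, and seed returns the same indices, which lets a reviewer confirm later which notes were checked.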

Ertas Data Suite's on-premise PII redaction engine is designed for this workflow. Process all clinical notes through the redaction pipeline before any further data handling, and use the audit logging to document the de-identification process for HIPAA compliance evidence. The air-gapped architecture ensures that PHI never leaves your healthcare organization's controlled environment during the entire dataset preparation process.

Quality Criteria

Verify complete de-identification through both automated scanning and manual review. Any note containing residual PHI must be flagged and re-processed before inclusion in the training dataset. Document the de-identification verification process as part of your HIPAA compliance records.
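
As a rough illustration of the automated scanning step, the sketch below flags a few common identifier formats with regular expressions. It is a secondary sanity check under stated assumptions (US-style phone, SSN, and date formats; an "MRN" prefix on record numbers), not a replacement for a dedicated de-identification tool, and it cannot detect names or free-text identifiers. `PHI_PATTERNS` and `findResidualPhi` are hypothetical names.

```typescript
interface PhiHit {
  kind: string;   // pattern name, e.g. "phone"
  match: string;  // the matched text
}

// Illustrative patterns only; they cover a handful of identifier formats.
const PHI_PATTERNS: { kind: string; pattern: RegExp }[] = [
  { kind: "ssn", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { kind: "phone", pattern: /(?:\(\d{3}\)\s?|\b\d{3}[-. ])\d{3}[-. ]\d{4}\b/g },
  { kind: "email", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { kind: "full_date", pattern: /\b\d{1,2}\/\d{1,2}\/\d{2,4}\b/g },
  { kind: "mrn", pattern: /\bMRN[:#]?\s*\d+/gi },
  { kind: "url", pattern: /https?:\/\/\S+/g },
  { kind: "ip_address", pattern: /\b(?:\d{1,3}\.){3}\d{1,3}\b/g }
];

// Returns every pattern match found in the note; an empty list means nothing was flagged.
function findResidualPhi(text: string): PhiHit[] {
  const hits: PhiHit[] = [];
  for (const entry of PHI_PATTERNS) {
    const matches: string[] = text.match(entry.pattern) || [];
    for (const m of matches) {
      hits.push({ kind: entry.kind, match: m });
    }
  }
  return hits;
}
```

A note with any hits would be routed back for re-processing. Clinical shorthand such as pain scores ("3/10") does not match the date pattern because that pattern requires two slashes, but some false positives should still be expected.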

Clinical accuracy of labels is essential. Have board-certified physicians or experienced clinical informaticists review specialty classifications, urgency ratings, and ICD-10 category assignments. Measure inter-annotator agreement on notes labeled independently by at least two reviewers; it should exceed 85 percent for specialty classification and 80 percent for urgency rating. Resolve disagreements through a senior clinician review process.
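
The agreement thresholds above can be checked with a short calculation. The sketch below assumes the percentages refer to raw agreement between two annotators on the same notes, and also computes Cohen's kappa, a chance-corrected measure that is often reported alongside it. The function names are hypothetical.

```typescript
// Fraction of items on which the two annotators chose the same label.
function percentAgreement(a: string[], b: string[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("label lists must be non-empty and the same length");
  }
  let same = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) {
      same++;
    }
  }
  return same / a.length;
}

// Cohen's kappa: (observed - expected) / (1 - expected), where `expected` is the
// agreement two annotators would reach by chance given their label frequencies.
function cohensKappa(a: string[], b: string[]): number {
  const observed = percentAgreement(a, b);
  const n = a.length;
  const countsA: Record<string, number> = Object.create(null);
  const countsB: Record<string, number> = Object.create(null);
  for (let i = 0; i < n; i++) {
    countsA[a[i]] = (countsA[a[i]] || 0) + 1;
    countsB[b[i]] = (countsB[b[i]] || 0) + 1;
  }
  let expected = 0;
  for (const label of Object.keys(countsA)) {
    expected += (countsA[label] / n) * ((countsB[label] || 0) / n);
  }
  // Both annotators used one identical label throughout; treat as perfect agreement.
  if (expected === 1) {
    return 1;
  }
  return (observed - expected) / (1 - expected);
}

// Usage with four urgency ratings from two annotators.
const annotatorA = ["routine", "routine", "urgent", "urgent"];
const annotatorB = ["routine", "routine", "urgent", "routine"];
const rawAgreement = percentAgreement(annotatorA, annotatorB); // 0.75
const kappa = cohensKappa(annotatorA, annotatorB);             // 0.5
```

In this small example the annotators agree on three of four notes (0.75), but half of that agreement is expected by chance, so kappa is 0.5. Raw agreement can look high on imbalanced labels such as urgency, which is why the chance-corrected figure is worth tracking as well.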

Ensure balanced representation across medical specialties. Clinical documentation from high-volume departments (internal medicine, emergency medicine) will naturally dominate the dataset. Actively oversample from lower-volume specialties (rheumatology, endocrinology, neurology) to prevent the model from developing a bias toward common specialties. Aim for a minimum of 200-300 examples per specialty for adequate classification performance.
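
A simple count per specialty shows where more examples are needed. The sketch below uses the lower bound of 200 from the guidance above as the target; `countBySpecialty` and `shortfallBySpecialty` are hypothetical names, and the list of expected specialties is something you would supply for your own organization.

```typescript
interface LabeledNote {
  labels: { specialty: string };
}

// Counts examples per specialty label.
function countBySpecialty(notes: LabeledNote[]): Record<string, number> {
  const counts: Record<string, number> = Object.create(null);
  for (const note of notes) {
    const key = note.labels.specialty;
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// For each expected specialty below `minimum`, reports how many more examples are needed.
// Specialties with no examples at all are included because they come from `expected`.
function shortfallBySpecialty(
  counts: Record<string, number>,
  expected: string[],
  minimum: number
): Record<string, number> {
  const shortfall: Record<string, number> = Object.create(null);
  for (const specialty of expected) {
    const have = counts[specialty] || 0;
    if (have < minimum) {
      shortfall[specialty] = minimum - have;
    }
  }
  return shortfall;
}

// Usage with a tiny illustrative dataset and a target of 200 per specialty.
const demoNotes: LabeledNote[] = [
  { labels: { specialty: "cardiology" } },
  { labels: { specialty: "cardiology" } },
  { labels: { specialty: "rheumatology" } }
];
const demoShortfall = shortfallBySpecialty(
  countBySpecialty(demoNotes),
  ["cardiology", "rheumatology", "neurology"],
  200
);
// demoShortfall: cardiology 198, rheumatology 199, neurology 200
```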

Using This Template with Ertas

Import clinical notes from your EHR export into Ertas Data Suite's on-premise environment. Apply the PII redaction engine to automatically detect and mask all HIPAA identifiers. Review the redaction results using the data lineage tracking, which documents every redaction applied with the identifier type, location, and masking method. Export the de-identified dataset in JSONL format for model training.
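
For reference, JSONL stores one JSON object per line. The sketch below shows that serialization in isolation, independent of any Ertas export feature; `toJsonl` and `fromJsonl` are hypothetical helpers, and writing the resulting string to disk is left to whatever file API your environment provides.

```typescript
// Serializes records as JSONL: one JSON object per line, newline-terminated.
// JSON.stringify escapes line breaks inside note text, so each record stays on one line.
function toJsonl(records: object[]): string {
  if (records.length === 0) {
    return "";
  }
  return records.map(function (r) { return JSON.stringify(r); }).join("\n") + "\n";
}

// Parses JSONL back into records, skipping blank lines.
function fromJsonl(jsonl: string): unknown[] {
  return jsonl
    .split("\n")
    .filter(function (line) { return line.trim() !== ""; })
    .map(function (line) { return JSON.parse(line); });
}

// Usage: a multi-line clinical note still occupies a single JSONL line.
const jsonlDemo = toJsonl([
  { text: "PROGRESS NOTE\n\nSubjective: improving.", labels: { urgency: "routine" } },
  { text: "DISCHARGE SUMMARY\n\nHPI: dyspnea.", labels: { urgency: "urgent" } }
]);
```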

The entire workflow occurs within your healthcare organization's infrastructure. No clinical data is transmitted externally. After fine-tuning in Ertas Studio, export the model in GGUF format for local inference within your clinical systems, maintaining HIPAA compliance throughout the model lifecycle.

Recommended Model

Medical notes classification benefits from models with biomedical domain knowledge. Consider starting with a biomedical-pretrained base model if one is available, or fine-tune a general 7B-8B model on a combination of biomedical text and your classification dataset. For joint classification across specialty, urgency, and note type (one label per field), encoder-based models (BERT-family) fine-tuned for classification may outperform decoder-based LLMs while being significantly more efficient at inference.

For applications requiring both classification and explanation (identifying why a note is classified as urgent), a generative 7B-8B model provides the flexibility to output structured classifications alongside natural language rationale.
