    healthcare · medical-coding · clinical-documentation · fine-tuning · hipaa · use-case

    Fine-Tuned Models for Medical Coding and Clinical Documentation

    How to fine-tune local AI models for ICD-10/CPT code suggestion and clinical documentation improvement — covering training data structures, accuracy targets, EHR integration, and ROI math for healthcare organizations.

    Ertas Team

    Medical coding is a $20B+ industry in the US alone. There are approximately 350,000 medical coders handling the translation of clinical encounters into ICD-10, CPT, and HCPCS codes — the language that drives reimbursement. The average coder processes 50-80 charts per day. The average error rate is 10-20%, and each coding error costs a hospital $1,200-$2,400 in denied or underpaid claims.

    AI-assisted coding is not theoretical. Organizations using fine-tuned models for code suggestion report 40-60% throughput increases while maintaining or improving accuracy. The key: these are not general-purpose chatbots. They are purpose-built models trained on the specific mapping between clinical language and coding systems.

    This guide covers two related use cases — ICD-10/CPT code suggestion and clinical documentation improvement (CDI) — and provides the training data structures, accuracy targets, and deployment architecture to build both.

    Use Case 1: ICD-10/CPT Code Suggestion

    The Problem

    The ICD-10-CM system contains 72,184 diagnosis codes. ICD-10-PCS has 78,220 procedure codes. CPT has over 10,000 codes. A medical coder must select the correct combination from these code sets based on clinical documentation that was written by a physician who was not thinking about coding.

    Physicians document to communicate clinical thinking. Coders need specificity: laterality (left vs right), acuity (acute vs chronic), etiology (cause), and anatomical detail. The mismatch between how physicians write and what coding requires is where errors and revenue loss occur.

    Training Data Structure

    Each training example maps a clinical note excerpt to the correct code(s) with confidence scores:

    {
      "instruction": "Assign ICD-10-CM diagnosis codes to the following clinical documentation. Include code, description, and confidence level.",
      "input": "Assessment: Type 2 diabetes mellitus with diabetic chronic kidney disease, stage 3. Patient also presents with peripheral neuropathy in bilateral lower extremities attributed to diabetes. HbA1c 7.8%.",
      "output": "E11.22 - Type 2 diabetes mellitus with diabetic chronic kidney disease (Confidence: HIGH)\nN18.3 - Chronic kidney disease, stage 3 (Confidence: HIGH)\nE11.42 - Type 2 diabetes mellitus with diabetic polyneuropathy (Confidence: HIGH)\nE11.65 - Type 2 diabetes mellitus with hyperglycemia (Confidence: MEDIUM - HbA1c 7.8% indicates suboptimal control but clinical determination needed)"
    }
    

    For CPT procedure coding:

    {
      "instruction": "Assign CPT procedure codes to the following operative note excerpt.",
      "input": "Procedure: Laparoscopic cholecystectomy with intraoperative cholangiogram. Three 5mm ports and one 12mm port placed. Gallbladder dissected from liver bed using electrocautery. Cholangiogram performed via cystic duct catheter showing no common bile duct stones. Gallbladder removed through 12mm port.",
      "output": "47563 - Laparoscopic cholecystectomy with cholangiography (Confidence: HIGH)\nNote: Do not separately code 74300 (cholangiography) as 47563 includes intraoperative cholangiogram. Separate coding would result in denial."
    }
    

    Training Data Volume and Sources

    | Data Requirement | Minimum | Recommended | Notes |
    |---|---|---|---|
    | Total training examples | 500 | 1,000-1,500 | Per specialty focus area |
    | Specialty coverage | 3-5 specialties | 8-12 specialties | Start with highest-volume specialties |
    | Code coverage | Top 200 codes per specialty | Top 500 codes per specialty | Long-tail codes need specific examples |
    | Edge cases | 50 per specialty | 100-200 per specialty | Modifier usage, bundling rules, exclusions |
    | Validation set | 100 examples | 200-300 examples | Held out from training, reviewed by certified coders |

    Data source: The ideal training data comes from historical coding records where a certified coder has already assigned and verified codes against the clinical note. Most hospitals have 2-5 years of this data in their EHR billing system.
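    Converting those historical records into the instruction format above is a small transformation script. A minimal sketch, assuming a de-identified export where each row pairs a note excerpt with its coder-verified codes — the field names are illustrative, not an actual Epic/Cerner export schema:

```python
import json

# Hypothetical rows from a historical billing export: each pairs a
# de-identified note excerpt with coder-verified codes. Field names
# are illustrative, not an actual EHR export schema.
rows = [
    {
        "note_text": "Assessment: Acute exacerbation of COPD. Started on steroids.",
        "assigned_codes": [("J44.1", "COPD with (acute) exacerbation")],
        "specialty": "pulmonology",
    },
]

def to_training_example(row):
    """Convert one verified coding record into an instruction-tuning example."""
    output = "\n".join(
        f"{code} - {desc} (Confidence: HIGH)"
        for code, desc in row["assigned_codes"]
    )
    return {
        "instruction": (
            "Assign ICD-10-CM diagnosis codes to the following clinical "
            "documentation. Include code, description, and confidence level."
        ),
        "input": row["note_text"],
        "output": output,
    }

with open("coding_train.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(to_training_example(row)) + "\n")
```

    Historical, coder-verified codes can all be labeled HIGH; MEDIUM-confidence examples like the HbA1c case above need separate curation.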

    Accuracy Targets

    Medical coding accuracy must be measured at multiple levels:

    | Metric | Target | Measurement |
    |---|---|---|
    | Code-level accuracy (exact match) | 85-90% | Predicted code matches gold-standard code exactly |
    | Code family accuracy (3-character match) | 92-95% | Predicted ICD-10 matches at category level (e.g., E11 for Type 2 DM) |
    | Specificity capture rate | 80-85% | Model suggests the most specific code, not a less specific parent |
    | False suggestion rate | Under 10% | Percentage of suggested codes that are clearly incorrect |
    | Critical miss rate | Under 3% | Failure to suggest a code for a documented diagnosis/procedure |
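    Scoring a chart against the first two metrics is simple set arithmetic over the predicted and gold-standard code lists. A minimal sketch — the helper name and recall-style denominators are our own, not a standard library:

```python
def evaluate_chart(predicted, gold):
    """Score one chart's suggestions against certified-coder gold codes.

    Exact match and 3-character family match, computed recall-style
    against the gold list; `missed_codes` feeds the critical-miss rate.
    """
    gold_set = set(gold)
    gold_families = {code[:3] for code in gold}
    exact = sum(1 for code in predicted if code in gold_set)
    family = sum(1 for code in predicted if code[:3] in gold_families)
    return {
        "exact_match_rate": exact / len(gold) if gold else 0.0,
        "family_match_rate": family / len(gold) if gold else 0.0,
        "missed_codes": sorted(gold_set - set(predicted)),
    }

# Using the diabetes example from earlier: the model found 3 of 4 codes
scores = evaluate_chart(
    predicted=["E11.22", "N18.3", "E11.42"],
    gold=["E11.22", "N18.3", "E11.42", "E11.65"],
)
# scores["missed_codes"] == ["E11.65"]
```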

    Human review is mandatory. No healthcare organization should deploy AI-autonomous coding. The model suggests codes; a certified coder reviews, accepts, modifies, or rejects each suggestion. The value is speed — reviewing AI suggestions is 2-3x faster than coding from scratch.

    Use Case 2: Clinical Documentation Improvement (CDI)

    The Problem

    Physicians write notes to document patient care. Coders need specific documentation to assign the most accurate (and highest-specificity) codes. The gap between what physicians write and what coders need costs hospitals an estimated $10-15 per chart in undercoding — multiplied across hundreds of thousands of encounters per year.

    CDI specialists review clinical notes and query physicians for missing documentation. This is labor-intensive: a CDI specialist reviews 15-25 charts per day and generates queries for roughly 30-40% of them.

    Training Data Structure

    CDI training data maps draft notes to improved notes with specific documentation gaps identified:

    {
      "instruction": "Review the following clinical note for documentation completeness. Identify gaps that would affect coding specificity and suggest improvements.",
      "input": "Discharge Dx: Pneumonia. Patient treated with IV antibiotics for 5 days, switched to oral on day 4. CXR showed right lower lobe infiltrate. Blood cultures negative. Sputum culture grew Streptococcus pneumoniae.",
      "output": "DOCUMENTATION GAPS IDENTIFIED:\n\n1. ORGANISM SPECIFICITY: Sputum culture positive for S. pneumoniae but discharge diagnosis states only 'Pneumonia.' Suggest: 'Pneumococcal pneumonia' (maps to J13 instead of J18.9, increasing specificity and reimbursement accuracy).\n\n2. LATERALITY: CXR findings mention right lower lobe but diagnosis does not specify laterality. While ICD-10 J13 does not require laterality, documenting 'right lower lobe pneumococcal pneumonia' supports medical necessity for imaging.\n\n3. ACUITY: No documentation of whether this is community-acquired vs healthcare-associated. This distinction affects DRG assignment.\n\n4. SEVERITY: No mention of sepsis evaluation. If SIRS criteria were met, documenting 'pneumococcal pneumonia with sepsis' (A40.3 + J13) significantly affects reimbursement.\n\nSUGGESTED REVISED DIAGNOSIS: Right lower lobe pneumococcal pneumonia, community-acquired. [Add sepsis status if applicable.]"
    }
    

    Common Documentation Gaps AI Catches

    Fine-tuned CDI models consistently identify these high-impact gaps:

    | Gap Type | Frequency | Revenue Impact |
    |---|---|---|
    | Missing organism specificity (e.g., "UTI" vs "E. coli UTI") | 25-35% of infectious disease notes | $800-2,000 per case (DRG shift) |
    | Missing acuity (acute vs chronic) | 20-30% of notes | $500-1,500 per case |
    | Unaddressed diagnoses (labs suggest condition not documented) | 15-25% of notes | $1,000-5,000 per case |
    | Incomplete HPI (missing onset, duration, severity) | 30-40% of notes | $200-800 per case |
    | Missing laterality | 15-20% of musculoskeletal/surgical notes | $100-500 per case |
    | Missing causal relationships ("due to," "secondary to") | 20-30% of complex cases | $1,500-4,000 per case |

    De-Identification Pipeline for Both Use Cases

    All training data must be de-identified before fine-tuning. The pipeline is the same for both coding and CDI:

    EHR Export → Automated NER De-ID → Rule-Based Scrubbing → Manual Sample Review → Training Dataset
    

    Step-by-Step Process

    1. Export historical records from EHR (Epic Clarity/Caboodle, Cerner HealtheDataLab). Include clinical notes + assigned codes (for coding) or original + revised notes (for CDI).

    2. Automated NER de-identification. Use a medical NER model (scispaCy's en_core_sci_lg, Amazon Comprehend Medical, or Microsoft Text Analytics for Health) to detect and replace PHI entities. Replace with realistic synthetic data to preserve note structure:

      • Names → synthetic names from census data
      • Dates → shift by random offset (consistent per patient)
      • Locations → replace with same-size city in different state
      • MRNs → sequential synthetic identifiers
    3. Rule-based scrubbing. Regex patterns catch what NER misses: phone number formats, SSN patterns, email addresses, URLs.

    4. Manual sample review. Review 200+ randomly selected records. If PHI is found in more than 2% of samples, iterate on the rules and re-review.

    5. NER verification check. Run a second NER pass on the "cleaned" data. Any entity that the second pass flags as a potential PHI leak gets manual review.

    Target: under 0.5% residual PHI rate after full pipeline. This is achievable with the two-pass approach.
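    The rule-based scrubbing and date-shift steps can be sketched in a few lines. The patterns below are illustrative starting points, not a complete PHI rule set — real deployments tune them per site:

```python
import hashlib
import re
from datetime import datetime, timedelta

# Regex backstop for PHI formats that NER models commonly miss.
# Patterns are illustrative starting points, not a complete rule set.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\(?\d{3}\)?[-. ]?\d{3}[-. ]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "MRN": re.compile(r"\bMRN[:#]?\s*\d{6,10}\b"),
}

def scrub(text):
    """Replace residual PHI patterns with bracketed placeholders."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def shift_date(date, patient_id, max_days=365):
    """Shift dates by a per-patient offset so clinical intervals stay consistent."""
    digest = hashlib.sha256(patient_id.encode()).digest()
    offset = int.from_bytes(digest[:4], "big") % max_days
    return date - timedelta(days=offset)

print(scrub("Call 555-123-4567 re: MRN: 00123456, f/u labs."))
# Call [PHONE] re: [MRN], f/u labs.
```

    Deriving the date offset from a hash of the patient identifier means every note for the same patient shifts by the same amount, preserving the intervals between encounters.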

    ROI Math: Medical Coding

    The financial case for AI-assisted coding is straightforward:

    Current state (per coder):

    • Average medical coder salary: $55,000/year ($26.44/hour)
    • Average throughput: 60 charts/day
    • Average accuracy: 85% (15% error rate)
    • Cost per coding error (denied/underpaid claim): $1,800 average

    AI-assisted state (per coder):

    • Same salary: $55,000/year
    • Increased throughput: 85-95 charts/day (40-58% increase)
    • Improved accuracy: 92-95% (with AI pre-suggestions and human review)
    • Reduced error cost: 5-8% error rate

    Value per coder per year:

    | Metric | Before AI | After AI | Delta |
    |---|---|---|---|
    | Charts per day | 60 | 90 | +30 |
    | Charts per year (250 days) | 15,000 | 22,500 | +7,500 |
    | Revenue per chart (coding value) | $8.50 | $8.50 | |
    | Coding error rate | 15% | 6% | -9% |
    | Error cost per year | $40,500 | $16,200 | -$24,300 |
    | Throughput value (additional charts) | | $63,750 | +$63,750 |
    | Total value per coder | | | $88,050 |

    For a team of 10 coders, that is $880,500 in annual value against a deployment cost of $10,000-15,000 (one-time hardware) plus ongoing maintenance. ROI is measured in weeks, not years.
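    The arithmetic behind the table, as a checkable script. Note the error-cost lines are taken as given from the table (they implicitly assume only a fraction of miscoded charts turn into costly denials) rather than derived from the raw error rate:

```python
# Per-coder value calculation from the ROI table. Error-cost figures are
# taken as given from the table rows, not recomputed from the 15% error
# rate times $1,800 per error.
WORKDAYS = 250
REVENUE_PER_CHART = 8.50

charts_before = 60 * WORKDAYS                     # 15,000 charts/year
charts_after = 90 * WORKDAYS                      # 22,500 charts/year
throughput_value = (charts_after - charts_before) * REVENUE_PER_CHART

error_savings = 40_500 - 16_200                   # from the table rows

value_per_coder = throughput_value + error_savings
team_value = 10 * value_per_coder

print(f"per coder: ${value_per_coder:,.0f}, team of 10: ${team_value:,.0f}")
# per coder: $88,050, team of 10: $880,500
```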

    Deployment: EHR Integration Architecture

    Medical coding and CDI models must integrate with existing EHR systems. No hospital will adopt a standalone tool that requires coders to copy-paste between applications.

    Epic Integration

    Epic supports AI integration through two mechanisms:

    • Epic App Orchard / FHIR R4 APIs: Read clinical notes via DocumentReference resources; write code suggestions via CommunicationRequest or Task resources
    • Epic Cognitive Computing Platform: Direct integration point for AI models (requires Epic partnership or certification)

    Cerner (Oracle Health) Integration

    • FHIR R4 APIs: Similar pattern to Epic — read clinical documents, write suggestions as annotations
    • Millennium Open APIs: Legacy integration for sites not yet on FHIR

    Architecture Pattern

    ┌──────────────────────────────────────────────────┐
    │            Hospital Internal Network             │
    │                                                  │
    │  ┌────────┐     ┌──────────────┐    ┌────────┐   │
    │  │  EHR   │────→│  FHIR Server │───→│  API   │   │
    │  │(Epic/  │     │  (HAPI FHIR) │    │Gateway │   │
    │  │Cerner) │←────│              │←───│(Kong)  │   │
    │  └────────┘     └──────────────┘    └───┬────┘   │
    │                                         │        │
    │                  ┌──────────────────────▼──────┐ │
    │                  │  Inference Server           │ │
    │                  │  (Ollama / llama.cpp)       │ │
    │                  │  + Coding LoRA              │ │
    │                  │  + CDI LoRA                 │ │
    │                  └─────────────────────────────┘ │
    └──────────────────────────────────────────────────┘
    

    Key details:

    • FHIR intermediary (HAPI FHIR server) decouples the EHR from the AI model. The EHR sends documents via standard FHIR APIs; the FHIR server queues them for inference.
    • Separate LoRA adapters for coding and CDI loaded on the same base model. Adapter hot-swapping takes under 100ms — no need for separate servers.
    • mTLS between all services. Certificate-based authentication, not just API keys.
    • All inference behind the hospital firewall. No data leaves the network.
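    The round trip through this architecture can be sketched as follows. The hostnames and inference endpoint are placeholders, and the HTTP layer is abstracted behind two injected callables so the mTLS-configured client stays out of the business logic; the DocumentReference and Task shapes follow FHIR R4 conventions:

```python
import base64

FHIR_BASE = "https://fhir.internal.example.org/fhir"            # placeholder host
INFERENCE_URL = "http://inference.internal.example.org/v1/code-suggest"

def build_suggestion_task(doc_id, codes):
    """Package model suggestions as a FHIR R4 Task referencing the source note."""
    return {
        "resourceType": "Task",
        "status": "requested",
        "intent": "proposal",
        "focus": {"reference": f"DocumentReference/{doc_id}"},
        "description": "AI code suggestions pending coder review",
        "note": [{"text": code} for code in codes],
    }

def suggest_codes_for_document(doc_id, get_json, post_json):
    """One round trip: read note, run inference, queue a Task for the coder.

    `get_json`/`post_json` wrap the mTLS-configured HTTP client.
    """
    doc = get_json(f"{FHIR_BASE}/DocumentReference/{doc_id}")
    raw = doc["content"][0]["attachment"]["data"]    # base64 per the FHIR spec
    note_text = base64.b64decode(raw).decode("utf-8")
    suggestions = post_json(INFERENCE_URL, {"text": note_text})
    return post_json(f"{FHIR_BASE}/Task",
                     build_suggestion_task(doc_id, suggestions["codes"]))
```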

    Quality Assurance

    For Medical Coding

    Human-in-the-loop is non-negotiable. The workflow:

    1. Model processes clinical note and generates code suggestions with confidence scores
    2. Suggestions appear in coder's queue, sorted by confidence (HIGH first)
    3. Coder accepts (one click), modifies (edit code), or rejects (flag for manual review)
    4. All accept/modify/reject actions logged for model improvement
    5. Weekly accuracy reports: model accuracy by specialty, coder override rate, revenue impact
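    Step 5's override-rate metric falls directly out of the action log from step 4. A minimal sketch with an illustrative log format:

```python
from collections import Counter

# Illustrative action-log shape for step 4: each review event records
# the specialty and what the coder did with the suggestion.
actions = [
    {"specialty": "cardiology", "action": "accept"},
    {"specialty": "cardiology", "action": "modify"},
    {"specialty": "cardiology", "action": "accept"},
    {"specialty": "pulmonology", "action": "reject"},
]

def override_rate(log):
    """Share of suggestions the coder modified or rejected."""
    counts = Counter(entry["action"] for entry in log)
    total = sum(counts.values())
    return (counts["modify"] + counts["reject"]) / total if total else 0.0

print(f"override rate: {override_rate(actions):.0%}")  # override rate: 50%
```

    A rising override rate in a specialty is the earliest signal that the model needs retraining for that specialty's documentation patterns.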

    For CDI

    Documentation improvement suggestions go through a different quality gate:

    1. Model identifies documentation gaps in clinical notes
    2. CDI specialist reviews suggestions and drafts physician queries for valid gaps
    3. Queries sent to physicians through standard CDI workflow (Epic InBasket, Cerner Message Center)
    4. Physician response rate and documentation improvement rate tracked
    5. Monthly calibration: compare AI-identified gaps against CDI specialist identification on the same charts
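    The monthly calibration in step 5 reduces to comparing two sets of gap labels per chart. A sketch, with hypothetical gap labels:

```python
def calibration_report(ai_gaps, specialist_gaps):
    """Compare AI-identified gaps against specialist findings per chart.

    Both arguments map chart_id -> set of gap labels; the labels used
    here (e.g. "acuity", "laterality") are illustrative.
    """
    agree = missed_by_ai = ai_only = 0
    for chart_id, truth in specialist_gaps.items():
        ai = ai_gaps.get(chart_id, set())
        agree += len(ai & truth)
        missed_by_ai += len(truth - ai)
        ai_only += len(ai - truth)
    found = agree + missed_by_ai
    return {
        # share of specialist-identified gaps the model also caught
        "recall_vs_specialist": agree / found if found else 0.0,
        # AI flags the specialist did not raise: false alarms, or specialist misses
        "ai_only_flags": ai_only,
    }

report = calibration_report(
    ai_gaps={"c1": {"acuity", "organism"}},
    specialist_gaps={"c1": {"acuity", "laterality"}},
)
# model caught 1 of 2 specialist-identified gaps on chart c1
```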

    Auto-Audit System

    Run automated audits on model outputs monthly:

    • Code validity check: Are all suggested codes valid ICD-10-CM/PCS or CPT codes? (Invalid codes indicate model degradation)
    • Bundling rule check: Does the model ever suggest unbundled codes that should be bundled? (CCI edits compliance)
    • Modifier consistency: Are modifier suggestions consistent with documentation?
    • Trend analysis: Is accuracy drifting over time? (New documentation patterns, code updates)
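    The code validity check can start as a cheap format screen before the lookup against official code tables. A sketch — the regexes cover the general ICD-10-CM and CPT shapes only, and a well-formed code can still be nonexistent or deleted, so the table lookup is still required:

```python
import re

# Fast format screen only: ICD-10-CM shape (letter, two digits, optional
# dot plus 1-4 alphanumerics) and CPT shape (five digits, or four digits
# plus one letter for Category II/III-style codes). Passing the screen
# does not prove the code exists in the current official code tables.
ICD10_CM = re.compile(r"^[A-Z]\d{2}(?:\.[0-9A-Z]{1,4})?$")
CPT = re.compile(r"^\d{5}$|^\d{4}[A-Z]$")

def audit_code_validity(suggested_codes):
    """Return suggestions that fail the format screen (possible degradation)."""
    return [code for code in suggested_codes
            if not (ICD10_CM.match(code) or CPT.match(code))]

bad = audit_code_validity(["E11.22", "47563", "E1122", "ZZ9.99"])
# bad == ["E1122", "ZZ9.99"]
```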

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
