
Fine-Tuned Models for Medical Coding and Clinical Documentation
How to fine-tune local AI models for ICD-10/CPT code suggestion and clinical documentation improvement — covering training data structures, accuracy targets, EHR integration, and ROI math for healthcare organizations.
Medical coding is a $20B+ industry in the US alone. There are approximately 350,000 medical coders handling the translation of clinical encounters into ICD-10, CPT, and HCPCS codes — the language that drives reimbursement. The average coder processes 50-80 charts per day. The average error rate is 10-20%, and each coding error costs a hospital $1,200-$2,400 in denied or underpaid claims.
AI-assisted coding is not theoretical. Organizations using fine-tuned models for code suggestion report 40-60% throughput increases while maintaining or improving accuracy. The key: these are not general-purpose chatbots. They are purpose-built models trained on the specific mapping between clinical language and coding systems.
This guide covers two related use cases — ICD-10/CPT code suggestion and clinical documentation improvement (CDI) — and provides the training data structures, accuracy targets, and deployment architecture to build both.
Use Case 1: ICD-10/CPT Code Suggestion
The Problem
The ICD-10-CM system contains 72,184 diagnosis codes. ICD-10-PCS has 78,220 procedure codes. CPT has over 10,000 codes. A medical coder must select the correct combination from these code sets based on clinical documentation that was written by a physician who was not thinking about coding.
Physicians document to communicate clinical thinking. Coders need specificity: laterality (left vs right), acuity (acute vs chronic), etiology (cause), and anatomical detail. The mismatch between how physicians write and what coding requires is where errors and revenue loss occur.
Training Data Structure
Each training example maps a clinical note excerpt to the correct code(s) with confidence scores:
```json
{
  "instruction": "Assign ICD-10-CM diagnosis codes to the following clinical documentation. Include code, description, and confidence level.",
  "input": "Assessment: Type 2 diabetes mellitus with diabetic chronic kidney disease, stage 3. Patient also presents with peripheral neuropathy in bilateral lower extremities attributed to diabetes. HbA1c 7.8%.",
  "output": "E11.22 - Type 2 diabetes mellitus with diabetic chronic kidney disease (Confidence: HIGH)\nN18.3 - Chronic kidney disease, stage 3 (Confidence: HIGH)\nE11.42 - Type 2 diabetes mellitus with diabetic polyneuropathy (Confidence: HIGH)\nE11.65 - Type 2 diabetes mellitus with hyperglycemia (Confidence: MEDIUM - HbA1c 7.8% indicates suboptimal control but clinical determination needed)"
}
```
For CPT procedure coding:
```json
{
  "instruction": "Assign CPT procedure codes to the following operative note excerpt.",
  "input": "Procedure: Laparoscopic cholecystectomy with intraoperative cholangiogram. Three 5mm ports and one 12mm port placed. Gallbladder dissected from liver bed using electrocautery. Cholangiogram performed via cystic duct catheter showing no common bile duct stones. Gallbladder removed through 12mm port.",
  "output": "47563 - Laparoscopic cholecystectomy with cholangiography (Confidence: HIGH)\nNote: Do not separately code 74300 (cholangiography) as 47563 includes intraoperative cholangiogram. Separate coding would result in denial."
}
```
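If your billing system can export verified note/code pairs, emitting examples in this shape is a short script. A minimal sketch, assuming each record is a dict holding the note excerpt and the coder-verified codes (the field names are illustrative, not a specific EHR export schema):

```python
import json

def make_coding_example(record: dict) -> str:
    """Turn one verified historical coding record into a JSONL training line.

    Assumed record shape (hypothetical, not an EHR schema):
    {"note": "...", "codes": [{"code": "E11.22", "description": "...",
                               "confidence": "HIGH"}]}
    """
    output_lines = [
        f"{c['code']} - {c['description']} (Confidence: {c['confidence']})"
        for c in record["codes"]
    ]
    return json.dumps({
        "instruction": ("Assign ICD-10-CM diagnosis codes to the following "
                        "clinical documentation. Include code, description, "
                        "and confidence level."),
        "input": record["note"],
        "output": "\n".join(output_lines),
    })

# with open("coding_train.jsonl", "w") as f:
#     for record in verified_records:  # de-identified first (see pipeline below)
#         f.write(make_coding_example(record) + "\n")
```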
Training Data Volume and Sources
| Data Requirement | Minimum | Recommended | Notes |
|---|---|---|---|
| Total training examples | 500 | 1,000-1,500 | Per specialty focus area |
| Specialty coverage | 3-5 specialties | 8-12 specialties | Start with highest-volume specialties |
| Code coverage | Top 200 codes per specialty | Top 500 codes per specialty | Long-tail codes need specific examples |
| Edge cases | 50 per specialty | 100-200 per specialty | Modifier usage, bundling rules, exclusions |
| Validation set | 100 examples | 200-300 examples | Held out from training, reviewed by certified coders |
Data source: The ideal training data comes from historical coding records where a certified coder has already assigned and verified codes against the clinical note. Most hospitals have 2-5 years of this data in their EHR billing system.
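For the held-out validation set, stratifying by specialty keeps low-volume specialties represented. A sketch, assuming each JSONL example was tagged with a specialty field during export (an addition to the format shown above):

```python
import json
import random
from collections import defaultdict

def split_by_specialty(path: str, val_per_specialty: int = 30, seed: int = 13):
    """Hold out a fixed number of validation examples per specialty.

    Assumes each JSONL line carries a "specialty" tag added during export.
    The held-out examples then go to certified coders for review.
    """
    by_specialty = defaultdict(list)
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            by_specialty[example.get("specialty", "unknown")].append(example)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for examples in by_specialty.values():
        rng.shuffle(examples)
        val.extend(examples[:val_per_specialty])
        train.extend(examples[val_per_specialty:])
    return train, val
```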
Accuracy Targets
Medical coding accuracy must be measured at multiple levels:
| Metric | Target | Measurement |
|---|---|---|
| Code-level accuracy (exact match) | 85-90% | Predicted code matches gold-standard code exactly |
| Code family accuracy (3-character match) | 92-95% | Predicted ICD-10 matches at category level (e.g., E11 for Type 2 DM) |
| Specificity capture rate | 80-85% | Model suggests the most specific code, not a less specific parent |
| False suggestion rate | under 10% | Percentage of suggested codes that are clearly incorrect |
| Critical miss rate | under 3% | Failure to suggest a code for a documented diagnosis/procedure |
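A minimal per-chart scoring sketch for the first two rows of the table, plus the critical-miss and false-suggestion lists, treating the model's suggestions and the coder-verified gold codes as sets:

```python
def score_chart(predicted: set[str], gold: set[str]) -> dict:
    """Score one chart's suggested ICD-10-CM codes against the gold set."""
    pred_families = {code[:3] for code in predicted}  # e.g. "E11.22" -> "E11"
    gold_families = {code[:3] for code in gold}
    return {
        # Code-level accuracy: gold codes matched exactly.
        "exact_match": len(predicted & gold) / len(gold) if gold else 1.0,
        # Code family accuracy: gold categories covered at the 3-character level.
        "family_match": (len(pred_families & gold_families) / len(gold_families)
                         if gold_families else 1.0),
        # Critical misses: documented codes with no suggestion even at family level.
        "critical_misses": sorted(g for g in gold if g[:3] not in pred_families),
        # Candidate false suggestions: families that never appear in gold.
        "false_suggestions": sorted(p for p in predicted
                                    if p[:3] not in gold_families),
    }

# score_chart({"E11.22", "N18.3"}, {"E11.22", "N18.3", "E11.42"})
# -> exact_match ~0.67, family_match 1.0, critical_misses []
```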
Human review is mandatory. No healthcare organization should deploy fully autonomous AI coding. The model suggests codes; a certified coder reviews, accepts, modifies, or rejects each suggestion. The value is speed: reviewing AI suggestions is 2-3x faster than coding from scratch.
Use Case 2: Clinical Documentation Improvement (CDI)
The Problem
Physicians write notes to document patient care. Coders need specific documentation to assign the most accurate (and highest-specificity) codes. The gap between what physicians write and what coders need costs hospitals an estimated $10-15 per chart in undercoding — multiplied across hundreds of thousands of encounters per year.
CDI specialists review clinical notes and query physicians for missing documentation. This is labor-intensive: a CDI specialist reviews 15-25 charts per day and generates queries for roughly 30-40% of them.
Training Data Structure
CDI training data maps draft notes to improved notes with specific documentation gaps identified:
```json
{
  "instruction": "Review the following clinical note for documentation completeness. Identify gaps that would affect coding specificity and suggest improvements.",
  "input": "Discharge Dx: Pneumonia. Patient treated with IV antibiotics for 5 days, switched to oral on day 4. CXR showed right lower lobe infiltrate. Blood cultures negative. Sputum culture grew Streptococcus pneumoniae.",
  "output": "DOCUMENTATION GAPS IDENTIFIED:\n\n1. ORGANISM SPECIFICITY: Sputum culture positive for S. pneumoniae but discharge diagnosis states only 'Pneumonia.' Suggest: 'Pneumococcal pneumonia' (maps to J13 instead of J18.9, increasing specificity and reimbursement accuracy).\n\n2. LATERALITY: CXR findings mention right lower lobe but diagnosis does not specify laterality. While ICD-10 J13 does not require laterality, documenting 'right lower lobe pneumococcal pneumonia' supports medical necessity for imaging.\n\n3. ACUITY: No documentation of whether this is community-acquired vs healthcare-associated. This distinction affects DRG assignment.\n\n4. SEVERITY: No mention of sepsis evaluation. If SIRS criteria were met, documenting 'pneumococcal pneumonia with sepsis' (A40.3 + J13) significantly affects reimbursement.\n\nSUGGESTED REVISED DIAGNOSIS: Right lower lobe pneumococcal pneumonia, community-acquired. [Add sepsis status if applicable.]"
}
```
Common Documentation Gaps AI Catches
Fine-tuned CDI models consistently identify these high-impact gaps:
| Gap Type | Frequency | Revenue Impact |
|---|---|---|
| Missing organism specificity (e.g., "UTI" vs "E. coli UTI") | 25-35% of infectious disease notes | $800-2,000 per case (DRG shift) |
| Missing acuity (acute vs chronic) | 20-30% of notes | $500-1,500 per case |
| Unaddressed diagnoses (labs suggest condition not documented) | 15-25% of notes | $1,000-5,000 per case |
| Incomplete HPI (missing onset, duration, severity) | 30-40% of notes | $200-800 per case |
| Missing laterality | 15-20% of musculoskeletal/surgical notes | $100-500 per case |
| Missing causal relationships ("due to," "secondary to") | 20-30% of complex cases | $1,500-4,000 per case |
De-Identification Pipeline for Both Use Cases
All training data must be de-identified before fine-tuning. The pipeline is the same for both coding and CDI:
EHR Export → Automated NER De-ID → Rule-Based Scrubbing → Manual Sample Review → Training Dataset
Step-by-Step Process
- Export historical records from the EHR (Epic Clarity/Caboodle, Cerner HealtheDataLab). Include clinical notes + assigned codes (for coding) or original + revised notes (for CDI).
- Automated NER de-identification. Use a medical NER model (spaCy with en_core_sci_lg, Amazon Comprehend Medical, or Microsoft Text Analytics for Health) to detect and replace PHI entities. Replace with realistic synthetic data to preserve note structure (sketched below):
  - Names → synthetic names from census data
  - Dates → shift by a random offset (consistent per patient)
  - Locations → replace with a same-size city in a different state
  - MRNs → sequential synthetic identifiers
- Rule-based scrubbing. Regex patterns catch what NER misses: phone number formats, SSN patterns, email addresses, URLs.
- Manual sample review. Review 200+ randomly selected records. If PHI is found in more than 2% of samples, iterate on the rules and re-review.
- NER verification check. Run a second NER pass on the "cleaned" data. Any entity that the second pass flags as a potential PHI leak gets manual review.
Target: under 0.5% residual PHI rate after full pipeline. This is achievable with the two-pass approach.
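A compressed sketch of steps 2 and 3. Two simplifications to flag: it loads en_core_web_lg because that pipeline's NER emits PERSON/GPE/DATE labels (scispaCy's en_core_sci_lg uses different labels and would need mapping), and it substitutes placeholder tokens rather than the realistic surrogates the pipeline above calls for:

```python
import re
import spacy

# Assumption: a pipeline whose NER emits PERSON/GPE/DATE labels.
# (scispaCy's en_core_sci_lg labels differ and would need mapping.)
nlp = spacy.load("en_core_web_lg")

PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "MRN": re.compile(r"\bMRN\s*[:#]?\s*\d+\b", re.IGNORECASE),
}

def deidentify(text: str) -> str:
    """Pass 1: NER replacement. Pass 2: regex scrubbing for what NER misses.

    Placeholder tokens shown for brevity; production should substitute
    realistic surrogates (synthetic names, consistently shifted dates).
    """
    doc = nlp(text)
    # Replace right-to-left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "GPE", "DATE"}:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```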
ROI Math: Medical Coding
The financial case for AI-assisted coding is straightforward:
Current state (per coder):
- Average medical coder salary: $55,000/year ($26.44/hour)
- Average throughput: 60 charts/day
- Average accuracy: 85% (15% error rate)
- Cost per coding error (denied/underpaid claim): $1,800 average
AI-assisted state (per coder):
- Same salary: $55,000/year
- Increased throughput: 85-95 charts/day (40-58% increase)
- Improved accuracy: 92-95% (with AI pre-suggestions and human review)
- Reduced error cost: 5-8% error rate
Value per coder per year:
| Metric | Before AI | After AI | Delta |
|---|---|---|---|
| Charts per day | 60 | 90 | +30 |
| Charts per year (250 days) | 15,000 | 22,500 | +7,500 |
| Revenue per chart (coding value) | $8.50 | $8.50 | — |
| Coding error rate | 15% | 6% | -9% |
| Error cost per year | $40,500 | $16,200 | -$24,300 |
| Throughput value (additional charts) | — | $63,750 | +$63,750 |
| Total value per coder | — | — | $88,050 |
For a team of 10 coders, that is $880,500 in annual value against a deployment cost of $10,000-15,000 (one-time hardware) plus ongoing maintenance. ROI is measured in weeks, not years.
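For sensitivity analysis, the table's arithmetic is easy to parameterize. Note that the error-cost savings here is taken directly from the table's delta row rather than derived from the $1,800 per-error figure:

```python
def annual_value_per_coder(charts_before: int = 60,
                           charts_after: int = 90,
                           working_days: int = 250,
                           revenue_per_chart: float = 8.50,
                           error_cost_savings: float = 24_300.0) -> float:
    """Per-coder annual value, mirroring the table above."""
    extra_charts = (charts_after - charts_before) * working_days  # 7,500
    throughput_value = extra_charts * revenue_per_chart           # $63,750
    return throughput_value + error_cost_savings                  # $88,050

# Team of 10: 10 * annual_value_per_coder() -> $880,500
```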
Deployment: EHR Integration Architecture
Medical coding and CDI models must integrate with existing EHR systems. No hospital will adopt a standalone tool that requires coders to copy-paste between applications.
Epic Integration
Epic supports AI integration through two mechanisms:
- Epic App Orchard / FHIR R4 APIs: Read clinical notes via DocumentReference resources; write code suggestions via CommunicationRequest or Task resources
- Epic Cognitive Computing Platform: Direct integration point for AI models (requires Epic partnership or certification)
Cerner (Oracle Health) Integration
- FHIR R4 APIs: Similar pattern to Epic — read clinical documents, write suggestions as annotations
- Millennium Open APIs: Legacy integration for sites not yet on FHIR
Architecture Pattern
```
┌────────────────────────────────────────────────┐
│           Hospital Internal Network            │
│                                                │
│  ┌────────┐    ┌──────────────┐    ┌────────┐  │
│  │  EHR   │───→│ FHIR Server  │───→│  API   │  │
│  │(Epic/  │    │ (HAPI FHIR)  │    │Gateway │  │
│  │Cerner) │←───│              │←───│(Kong)  │  │
│  └────────┘    └──────────────┘    └───┬────┘  │
│                                        │       │
│                       ┌────────────────▼─────┐ │
│                       │   Inference Server   │ │
│                       │ (Ollama / llama.cpp) │ │
│                       │  + Coding LoRA       │ │
│                       │  + CDI LoRA          │ │
│                       └──────────────────────┘ │
└────────────────────────────────────────────────┘
```
Key details:
- FHIR intermediary (HAPI FHIR server) decouples the EHR from the AI model. The EHR sends documents via standard FHIR APIs; the FHIR server queues them for inference.
- Separate LoRA adapters for coding and CDI loaded on the same base model. Adapter hot-swapping takes under 100ms — no need for separate servers.
- mTLS between all services. Certificate-based authentication, not just API keys.
- All inference behind the hospital firewall. No data leaves the network.
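A minimal sketch of the read-infer loop through that stack. The hostnames and the adapter name are hypothetical, the mTLS setup is omitted, and writing suggestions back as a FHIR Task is left out for brevity; the Ollama call is the standard non-streaming /api/generate form:

```python
import base64
import requests

FHIR_BASE = "https://fhir.hospital.internal/fhir"  # hypothetical HAPI FHIR server
OLLAMA_URL = "http://inference.hospital.internal:11434/api/generate"  # hypothetical host

def suggest_codes(document_id: str) -> str:
    """Read a note via FHIR, run the coding adapter, return raw suggestions."""
    # DocumentReference carries the note as base64 inline data (or a URL).
    doc = requests.get(f"{FHIR_BASE}/DocumentReference/{document_id}",
                       timeout=30).json()
    note_text = base64.b64decode(
        doc["content"][0]["attachment"]["data"]).decode("utf-8")

    # Non-streaming generation against the locally hosted model + coding LoRA.
    resp = requests.post(OLLAMA_URL, json={
        "model": "coding-lora",  # hypothetical adapter/model name
        "prompt": ("Assign ICD-10-CM diagnosis codes to the following "
                   f"clinical documentation.\n\n{note_text}"),
        "stream": False,
    }, timeout=120)
    return resp.json()["response"]
```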
Quality Assurance
For Medical Coding
Human-in-the-loop is non-negotiable. The workflow:
- Model processes clinical note and generates code suggestions with confidence scores
- Suggestions appear in coder's queue, sorted by confidence (HIGH first)
- Coder accepts (one click), modifies (edit code), or rejects (flag for manual review)
- All accept/modify/reject actions logged for model improvement (a minimal log schema is sketched after this list)
- Weekly accuracy reports: model accuracy by specialty, coder override rate, revenue impact
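The action log in step 4 can start as an append-only JSONL file; the schema here is illustrative, not a prescribed format:

```python
import json
import time

def log_coder_action(chart_id: str, suggested_code: str, action: str,
                     final_code: str | None = None,
                     path: str = "coder_actions.jsonl") -> None:
    """Append one accept/modify/reject event for the weekly accuracy reports."""
    assert action in {"accept", "modify", "reject"}
    event = {
        "ts": time.time(),
        "chart_id": chart_id,
        "suggested_code": suggested_code,
        "action": action,
        # On accept, the final code is the suggestion; on modify, the edit.
        "final_code": final_code if action == "modify"
                      else (suggested_code if action == "accept" else None),
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
```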
For CDI
Documentation improvement suggestions go through a different quality gate:
- Model identifies documentation gaps in clinical notes
- CDI specialist reviews suggestions and drafts physician queries for valid gaps
- Queries sent to physicians through standard CDI workflow (Epic InBasket, Cerner Message Center)
- Physician response rate and documentation improvement rate tracked
- Monthly calibration: compare AI-identified gaps against CDI specialist identification on the same charts
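Once gaps are normalized to shared labels (e.g., "missing_laterality"), that monthly calibration is a straightforward set comparison; a sketch:

```python
def calibrate(ai_gaps: set[str], specialist_gaps: set[str]) -> dict:
    """Compare AI-identified gaps vs. a CDI specialist on the same charts.

    Assumes gaps are normalized to shared labels such as "missing_laterality".
    """
    agreed = ai_gaps & specialist_gaps
    return {
        "precision": len(agreed) / len(ai_gaps) if ai_gaps else 1.0,
        "recall": len(agreed) / len(specialist_gaps) if specialist_gaps else 1.0,
        "ai_only": sorted(ai_gaps - specialist_gaps),          # review for false alarms
        "specialist_only": sorted(specialist_gaps - ai_gaps),  # model blind spots
    }
```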
Auto-Audit System
Run automated audits on model outputs monthly:
- Code validity check: Are all suggested codes valid ICD-10-CM/PCS or CPT codes? (Invalid codes indicate model degradation; a sketch follows this list)
- Bundling rule check: Does the model ever suggest unbundled codes that should be bundled? (CCI edits compliance)
- Modifier consistency: Are modifier suggestions consistent with documentation?
- Trend analysis: Is accuracy drifting over time? (New documentation patterns, code updates)
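A sketch of the validity check, assuming the current CMS ICD-10-CM code file has been loaded into a set; the shape regex is a coarse format filter, not a substitute for the lookup:

```python
import re

# Coarse ICD-10-CM shape: letter, digit, alphanumeric, optional .XXXX tail.
ICD10CM_SHAPE = re.compile(r"^[A-Z]\d[0-9A-Z](\.[0-9A-Z]{1,4})?$")

def audit_code_validity(suggested: list[str], valid_codes: set[str]) -> dict:
    """Flag malformed or unknown codes in a month of model suggestions.

    `valid_codes` is assumed to be loaded from the current CMS ICD-10-CM
    code file; a rising flag rate signals degradation or a missed annual
    code-set update.
    """
    malformed = [c for c in suggested if not ICD10CM_SHAPE.match(c)]
    unknown = [c for c in suggested
               if ICD10CM_SHAPE.match(c) and c not in valid_codes]
    flagged = len(malformed) + len(unknown)
    return {
        "total": len(suggested),
        "malformed": malformed,
        "unknown": unknown,
        "flag_rate": flagged / len(suggested) if suggested else 0.0,
    }
```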
Further Reading
- Fine-Tuning Healthcare AI: From Clinical Notes to Compliant Deployment — End-to-end guide covering clinical NLP training and deployment
- Fine-Tuning AI for Healthcare: HIPAA-Compliant Pipeline — The hub guide for building HIPAA-compliant fine-tuning pipelines
- How to Evaluate a Fine-Tuned Model — Framework for measuring model quality before deployment