
Fine-Tuned Models for Medical Coding and Clinical Documentation
How to fine-tune local AI models for ICD-10/CPT code suggestion and clinical documentation improvement — covering training data structures, accuracy targets, EHR integration, and ROI math for healthcare organizations.
Medical coding is a $20B+ industry in the US alone. There are approximately 350,000 medical coders handling the translation of clinical encounters into ICD-10, CPT, and HCPCS codes — the language that drives reimbursement. The average coder processes 50-80 charts per day. The average error rate is 10-20%, and each coding error costs a hospital $1,200-$2,400 in denied or underpaid claims.
AI-assisted coding is not theoretical. Organizations using fine-tuned models for code suggestion report 40-60% throughput increases while maintaining or improving accuracy. The key: these are not general-purpose chatbots. They are purpose-built models trained on the specific mapping between clinical language and coding systems.
This guide covers two related use cases — ICD-10/CPT code suggestion and clinical documentation improvement (CDI) — and provides the training data structures, accuracy targets, and deployment architecture to build both.
Use Case 1: ICD-10/CPT Code Suggestion
The Problem
The ICD-10-CM system contains 72,184 diagnosis codes. ICD-10-PCS has 78,220 procedure codes. CPT has over 10,000 codes. A medical coder must select the correct combination from these code sets based on clinical documentation that was written by a physician who was not thinking about coding.
Physicians document to communicate clinical thinking. Coders need specificity: laterality (left vs right), acuity (acute vs chronic), etiology (cause), and anatomical detail. The mismatch between how physicians write and what coding requires is where errors and revenue loss occur.
Training Data Structure
Each training example maps a clinical note excerpt to the correct code(s) with confidence scores:
```json
{
  "instruction": "Assign ICD-10-CM diagnosis codes to the following clinical documentation. Include code, description, and confidence level.",
  "input": "Assessment: Type 2 diabetes mellitus with diabetic chronic kidney disease, stage 3. Patient also presents with peripheral neuropathy in bilateral lower extremities attributed to diabetes. HbA1c 7.8%.",
  "output": "E11.22 - Type 2 diabetes mellitus with diabetic chronic kidney disease (Confidence: HIGH)\nN18.3 - Chronic kidney disease, stage 3 (Confidence: HIGH)\nE11.42 - Type 2 diabetes mellitus with diabetic polyneuropathy (Confidence: HIGH)\nE11.65 - Type 2 diabetes mellitus with hyperglycemia (Confidence: MEDIUM - HbA1c 7.8% indicates suboptimal control but clinical determination needed)"
}
```
For CPT procedure coding:
```json
{
  "instruction": "Assign CPT procedure codes to the following operative note excerpt.",
  "input": "Procedure: Laparoscopic cholecystectomy with intraoperative cholangiogram. Three 5mm ports and one 12mm port placed. Gallbladder dissected from liver bed using electrocautery. Cholangiogram performed via cystic duct catheter showing no common bile duct stones. Gallbladder removed through 12mm port.",
  "output": "47563 - Laparoscopic cholecystectomy with cholangiography (Confidence: HIGH)\nNote: Do not separately code 74300 (cholangiography) as 47563 includes intraoperative cholangiogram. Separate coding would result in denial."
}
```
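If your billing system can export verified note/code pairs, emitting examples in this shape is a short script. A minimal sketch, assuming each record is a dict holding the note excerpt and the coder-verified codes (the field names are illustrative, not a specific EHR export schema):

```python
import json

def make_coding_example(record: dict) -> str:
    """Turn one verified historical coding record into a JSONL training line.

    Assumed record shape (hypothetical, not an EHR schema):
    {"note": "...", "codes": [{"code": "E11.22", "description": "...",
                               "confidence": "HIGH"}]}
    """
    output_lines = [
        f"{c['code']} - {c['description']} (Confidence: {c['confidence']})"
        for c in record["codes"]
    ]
    return json.dumps({
        "instruction": ("Assign ICD-10-CM diagnosis codes to the following "
                        "clinical documentation. Include code, description, "
                        "and confidence level."),
        "input": record["note"],
        "output": "\n".join(output_lines),
    })

# with open("coding_train.jsonl", "w") as f:
#     for record in verified_records:  # de-identified first (see pipeline below)
#         f.write(make_coding_example(record) + "\n")
```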
Training Data Volume and Sources
| Data Requirement | Minimum | Recommended | Notes |
|---|---|---|---|
| Total training examples | 500 | 1,000-1,500 | Per specialty focus area |
| Specialty coverage | 3-5 specialties | 8-12 specialties | Start with highest-volume specialties |
| Code coverage | Top 200 codes per specialty | Top 500 codes per specialty | Long-tail codes need specific examples |
| Edge cases | 50 per specialty | 100-200 per specialty | Modifier usage, bundling rules, exclusions |
| Validation set | 100 examples | 200-300 examples | Held out from training, reviewed by certified coders |
Data source: The ideal training data comes from historical coding records where a certified coder has already assigned and verified codes against the clinical note. Most hospitals have 2-5 years of this data in their EHR billing system.
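For the held-out validation set, stratifying by specialty keeps low-volume specialties represented. A sketch, assuming each JSONL example was tagged with a specialty field during export (an addition to the format shown above):

```python
import json
import random
from collections import defaultdict

def split_by_specialty(path: str, val_per_specialty: int = 30, seed: int = 13):
    """Hold out a fixed number of validation examples per specialty.

    Assumes each JSONL line carries a "specialty" tag added during export.
    The held-out examples then go to certified coders for review.
    """
    by_specialty = defaultdict(list)
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            by_specialty[example.get("specialty", "unknown")].append(example)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for examples in by_specialty.values():
        rng.shuffle(examples)
        val.extend(examples[:val_per_specialty])
        train.extend(examples[val_per_specialty:])
    return train, val
```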
Accuracy Targets
Medical coding accuracy must be measured at multiple levels:
| Metric | Target | Measurement |
|---|---|---|
| Code-level accuracy (exact match) | 85-90% | Predicted code matches gold-standard code exactly |
| Code family accuracy (3-character match) | 92-95% | Predicted ICD-10 matches at category level (e.g., E11 for Type 2 DM) |
| Specificity capture rate | 80-85% | Model suggests the most specific code, not a less specific parent |
| False suggestion rate | under 10% | Percentage of suggested codes that are clearly incorrect |
| Critical miss rate | under 3% | Failure to suggest a code for a documented diagnosis/procedure |
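A minimal per-chart scoring sketch for the first two rows of the table, plus the critical-miss and false-suggestion lists, treating the model's suggestions and the coder-verified gold codes as sets:

```python
def score_chart(predicted: set[str], gold: set[str]) -> dict:
    """Score one chart's suggested ICD-10-CM codes against the gold set."""
    pred_families = {code[:3] for code in predicted}  # e.g. "E11.22" -> "E11"
    gold_families = {code[:3] for code in gold}
    return {
        # Code-level accuracy: gold codes matched exactly.
        "exact_match": len(predicted & gold) / len(gold) if gold else 1.0,
        # Code family accuracy: gold categories covered at the 3-character level.
        "family_match": (len(pred_families & gold_families) / len(gold_families)
                         if gold_families else 1.0),
        # Critical misses: documented codes with no suggestion even at family level.
        "critical_misses": sorted(g for g in gold if g[:3] not in pred_families),
        # Candidate false suggestions: families that never appear in gold.
        "false_suggestions": sorted(p for p in predicted
                                    if p[:3] not in gold_families),
    }

# score_chart({"E11.22", "N18.3"}, {"E11.22", "N18.3", "E11.42"})
# -> exact_match ~0.67, family_match 1.0, critical_misses []
```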
Human review is mandatory. No healthcare organization should deploy fully autonomous AI coding. The model suggests codes; a certified coder reviews, accepts, modifies, or rejects each suggestion. The value is speed: reviewing AI suggestions is 2-3x faster than coding from scratch.
Use Case 2: Clinical Documentation Improvement (CDI)
The Problem
Physicians write notes to document patient care. Coders need specific documentation to assign the most accurate (and highest-specificity) codes. The gap between what physicians write and what coders need costs hospitals an estimated $10-15 per chart in undercoding — multiplied across hundreds of thousands of encounters per year.
CDI specialists review clinical notes and query physicians for missing documentation. This is labor-intensive: a CDI specialist reviews 15-25 charts per day and generates queries for roughly 30-40% of them.
Training Data Structure
CDI training data maps draft notes to improved notes with specific documentation gaps identified:
```json
{
  "instruction": "Review the following clinical note for documentation completeness. Identify gaps that would affect coding specificity and suggest improvements.",
  "input": "Discharge Dx: Pneumonia. Patient treated with IV antibiotics for 5 days, switched to oral on day 4. CXR showed right lower lobe infiltrate. Blood cultures negative. Sputum culture grew Streptococcus pneumoniae.",
  "output": "DOCUMENTATION GAPS IDENTIFIED:\n\n1. ORGANISM SPECIFICITY: Sputum culture positive for S. pneumoniae but discharge diagnosis states only 'Pneumonia.' Suggest: 'Pneumococcal pneumonia' (maps to J13 instead of J18.9, increasing specificity and reimbursement accuracy).\n\n2. LATERALITY: CXR findings mention right lower lobe but diagnosis does not specify laterality. While ICD-10 J13 does not require laterality, documenting 'right lower lobe pneumococcal pneumonia' supports medical necessity for imaging.\n\n3. ACUITY: No documentation of whether this is community-acquired vs healthcare-associated. This distinction affects DRG assignment.\n\n4. SEVERITY: No mention of sepsis evaluation. If SIRS criteria were met, documenting 'pneumococcal pneumonia with sepsis' (A40.3 + J13) significantly affects reimbursement.\n\nSUGGESTED REVISED DIAGNOSIS: Right lower lobe pneumococcal pneumonia, community-acquired. [Add sepsis status if applicable.]"
}
```
Common Documentation Gaps AI Catches
Fine-tuned CDI models consistently identify these high-impact gaps:
| Gap Type | Frequency | Revenue Impact |
|---|---|---|
| Missing organism specificity (e.g., "UTI" vs "E. coli UTI") | 25-35% of infectious disease notes | $800-2,000 per case (DRG shift) |
| Missing acuity (acute vs chronic) | 20-30% of notes | $500-1,500 per case |
| Unaddressed diagnoses (labs suggest condition not documented) | 15-25% of notes | $1,000-5,000 per case |
| Incomplete HPI (missing onset, duration, severity) | 30-40% of notes | $200-800 per case |
| Missing laterality | 15-20% of musculoskeletal/surgical notes | $100-500 per case |
| Missing causal relationships ("due to," "secondary to") | 20-30% of complex cases | $1,500-4,000 per case |
De-Identification Pipeline for Both Use Cases
All training data must be de-identified before fine-tuning. The pipeline is the same for both coding and CDI:
EHR Export → Automated NER De-ID → Rule-Based Scrubbing → Manual Sample Review → Training Dataset
Step-by-Step Process
- Export historical records from the EHR (Epic Clarity/Caboodle, Cerner HealtheDataLab). Include clinical notes + assigned codes (for coding) or original + revised notes (for CDI).
- Automated NER de-identification. Use a medical NER model (spaCy with en_core_sci_lg, Amazon Comprehend Medical, or Microsoft Text Analytics for Health) to detect and replace PHI entities. Replace with realistic synthetic data to preserve note structure (sketched below):
  - Names → synthetic names from census data
  - Dates → shift by a random offset (consistent per patient)
  - Locations → replace with a same-size city in a different state
  - MRNs → sequential synthetic identifiers
- Rule-based scrubbing. Regex patterns catch what NER misses: phone number formats, SSN patterns, email addresses, URLs.
- Manual sample review. Review 200+ randomly selected records. If PHI is found in more than 2% of samples, iterate on the rules and re-review.
- NER verification check. Run a second NER pass on the "cleaned" data. Any entity that the second pass flags as a potential PHI leak gets manual review.
Target: under 0.5% residual PHI rate after full pipeline. This is achievable with the two-pass approach.
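A compressed sketch of steps 2 and 3. Two simplifications to flag: it loads en_core_web_lg because that pipeline's NER emits PERSON/GPE/DATE labels (scispaCy's en_core_sci_lg uses different labels and would need mapping), and it substitutes placeholder tokens rather than the realistic surrogates the pipeline above calls for:

```python
import re
import spacy

# Assumption: a pipeline whose NER emits PERSON/GPE/DATE labels.
# (scispaCy's en_core_sci_lg labels differ and would need mapping.)
nlp = spacy.load("en_core_web_lg")

PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "MRN": re.compile(r"\bMRN\s*[:#]?\s*\d+\b", re.IGNORECASE),
}

def deidentify(text: str) -> str:
    """Pass 1: NER replacement. Pass 2: regex scrubbing for what NER misses.

    Placeholder tokens shown for brevity; production should substitute
    realistic surrogates (synthetic names, consistently shifted dates).
    """
    doc = nlp(text)
    # Replace right-to-left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "GPE", "DATE"}:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```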
ROI Math: Medical Coding
The financial case for AI-assisted coding is straightforward:
Current state (per coder):
- Average medical coder salary: $55,000/year ($26.44/hour)
- Average throughput: 60 charts/day
- Average accuracy: 85% (15% error rate)
- Cost per coding error (denied/underpaid claim): $1,800 average
AI-assisted state (per coder):
- Same salary: $55,000/year
- Increased throughput: 85-95 charts/day (40-58% increase)
- Improved accuracy: 92-95% (with AI pre-suggestions and human review)
- Reduced error cost: 5-8% error rate
Value per coder per year:
| Metric | Before AI | After AI | Delta |
|---|---|---|---|
| Charts per day | 60 | 90 | +30 |
| Charts per year (250 days) | 15,000 | 22,500 | +7,500 |
| Revenue per chart (coding value) | $8.50 | $8.50 | — |
| Coding error rate | 15% | 6% | -9% |
| Error cost per year | $40,500 | $16,200 | -$24,300 |
| Throughput value (additional charts) | — | $63,750 | +$63,750 |
| Total value per coder | — | — | $88,050 |
For a team of 10 coders, that is $880,500 in annual value against a deployment cost of $10,000-15,000 (one-time hardware) plus ongoing maintenance. ROI is measured in weeks, not years.
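For sensitivity analysis, the table's arithmetic is easy to parameterize. Note that the error-cost savings here is taken directly from the table's delta row rather than derived from the $1,800 per-error figure:

```python
def annual_value_per_coder(charts_before: int = 60,
                           charts_after: int = 90,
                           working_days: int = 250,
                           revenue_per_chart: float = 8.50,
                           error_cost_savings: float = 24_300.0) -> float:
    """Per-coder annual value, mirroring the table above."""
    extra_charts = (charts_after - charts_before) * working_days  # 7,500
    throughput_value = extra_charts * revenue_per_chart           # $63,750
    return throughput_value + error_cost_savings                  # $88,050

# Team of 10: 10 * annual_value_per_coder() -> $880,500
```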
Deployment: EHR Integration Architecture
Medical coding and CDI models must integrate with existing EHR systems. No hospital will adopt a standalone tool that requires coders to copy-paste between applications.
Epic Integration
Epic supports AI integration through two mechanisms:
- Epic App Orchard / FHIR R4 APIs: Read clinical notes via DocumentReference resources; write code suggestions via CommunicationRequest or Task resources
- Epic Cognitive Computing Platform: Direct integration point for AI models (requires Epic partnership or certification)
Cerner (Oracle Health) Integration
- FHIR R4 APIs: Similar pattern to Epic — read clinical documents, write suggestions as annotations
- Millennium Open APIs: Legacy integration for sites not yet on FHIR
Architecture Pattern
```
┌────────────────────────────────────────────────┐
│           Hospital Internal Network            │
│                                                │
│  ┌────────┐    ┌──────────────┐    ┌────────┐  │
│  │  EHR   │───→│ FHIR Server  │───→│  API   │  │
│  │(Epic/  │    │ (HAPI FHIR)  │    │Gateway │  │
│  │Cerner) │←───│              │←───│(Kong)  │  │
│  └────────┘    └──────────────┘    └───┬────┘  │
│                                        │       │
│                       ┌────────────────▼─────┐ │
│                       │   Inference Server   │ │
│                       │ (Ollama / llama.cpp) │ │
│                       │  + Coding LoRA       │ │
│                       │  + CDI LoRA          │ │
│                       └──────────────────────┘ │
└────────────────────────────────────────────────┘
```
Key details:
- FHIR intermediary (HAPI FHIR server) decouples the EHR from the AI model. The EHR sends documents via standard FHIR APIs; the FHIR server queues them for inference.
- Separate LoRA adapters for coding and CDI loaded on the same base model. Adapter hot-swapping takes under 100ms — no need for separate servers.
- mTLS between all services. Certificate-based authentication, not just API keys.
- All inference behind the hospital firewall. No data leaves the network.
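A minimal sketch of the read-infer loop through that stack. The hostnames and the adapter name are hypothetical, the mTLS setup is omitted, and writing suggestions back as a FHIR Task is left out for brevity; the Ollama call is the standard non-streaming /api/generate form:

```python
import base64
import requests

FHIR_BASE = "https://fhir.hospital.internal/fhir"  # hypothetical HAPI FHIR server
OLLAMA_URL = "http://inference.hospital.internal:11434/api/generate"  # hypothetical host

def suggest_codes(document_id: str) -> str:
    """Read a note via FHIR, run the coding adapter, return raw suggestions."""
    # DocumentReference carries the note as base64 inline data (or a URL).
    doc = requests.get(f"{FHIR_BASE}/DocumentReference/{document_id}",
                       timeout=30).json()
    note_text = base64.b64decode(
        doc["content"][0]["attachment"]["data"]).decode("utf-8")

    # Non-streaming generation against the locally hosted model + coding LoRA.
    resp = requests.post(OLLAMA_URL, json={
        "model": "coding-lora",  # hypothetical adapter/model name
        "prompt": ("Assign ICD-10-CM diagnosis codes to the following "
                   f"clinical documentation.\n\n{note_text}"),
        "stream": False,
    }, timeout=120)
    return resp.json()["response"]
```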
Quality Assurance
For Medical Coding
Human-in-the-loop is non-negotiable. The workflow:
- Model processes clinical note and generates code suggestions with confidence scores
- Suggestions appear in coder's queue, sorted by confidence (HIGH first)
- Coder accepts (one click), modifies (edit code), or rejects (flag for manual review)
- All accept/modify/reject actions logged for model improvement (a minimal log schema is sketched after this list)
- Weekly accuracy reports: model accuracy by specialty, coder override rate, revenue impact
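The action log in step 4 can start as an append-only JSONL file; the schema here is illustrative, not a prescribed format:

```python
import json
import time

def log_coder_action(chart_id: str, suggested_code: str, action: str,
                     final_code: str | None = None,
                     path: str = "coder_actions.jsonl") -> None:
    """Append one accept/modify/reject event for the weekly accuracy reports."""
    assert action in {"accept", "modify", "reject"}
    event = {
        "ts": time.time(),
        "chart_id": chart_id,
        "suggested_code": suggested_code,
        "action": action,
        # On accept, the final code is the suggestion; on modify, the edit.
        "final_code": final_code if action == "modify"
                      else (suggested_code if action == "accept" else None),
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
```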
For CDI
Documentation improvement suggestions go through a different quality gate:
- Model identifies documentation gaps in clinical notes
- CDI specialist reviews suggestions and drafts physician queries for valid gaps
- Queries sent to physicians through standard CDI workflow (Epic InBasket, Cerner Message Center)
- Physician response rate and documentation improvement rate tracked
- Monthly calibration: compare AI-identified gaps against CDI specialist identification on the same charts
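Once gaps are normalized to shared labels (e.g., "missing_laterality"), that monthly calibration is a straightforward set comparison; a sketch:

```python
def calibrate(ai_gaps: set[str], specialist_gaps: set[str]) -> dict:
    """Compare AI-identified gaps vs. a CDI specialist on the same charts.

    Assumes gaps are normalized to shared labels such as "missing_laterality".
    """
    agreed = ai_gaps & specialist_gaps
    return {
        "precision": len(agreed) / len(ai_gaps) if ai_gaps else 1.0,
        "recall": len(agreed) / len(specialist_gaps) if specialist_gaps else 1.0,
        "ai_only": sorted(ai_gaps - specialist_gaps),          # review for false alarms
        "specialist_only": sorted(specialist_gaps - ai_gaps),  # model blind spots
    }
```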
Auto-Audit System
Run automated audits on model outputs monthly:
- Code validity check: Are all suggested codes valid ICD-10-CM/PCS or CPT codes? (Invalid codes indicate model degradation; a sketch follows this list)
- Bundling rule check: Does the model ever suggest unbundled codes that should be bundled? (CCI edits compliance)
- Modifier consistency: Are modifier suggestions consistent with documentation?
- Trend analysis: Is accuracy drifting over time? (New documentation patterns, code updates)
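A sketch of the validity check, assuming the current CMS ICD-10-CM code file has been loaded into a set; the shape regex is a coarse format filter, not a substitute for the lookup:

```python
import re

# Coarse ICD-10-CM shape: letter, digit, alphanumeric, optional .XXXX tail.
ICD10CM_SHAPE = re.compile(r"^[A-Z]\d[0-9A-Z](\.[0-9A-Z]{1,4})?$")

def audit_code_validity(suggested: list[str], valid_codes: set[str]) -> dict:
    """Flag malformed or unknown codes in a month of model suggestions.

    `valid_codes` is assumed to be loaded from the current CMS ICD-10-CM
    code file; a rising flag rate signals degradation or a missed annual
    code-set update.
    """
    malformed = [c for c in suggested if not ICD10CM_SHAPE.match(c)]
    unknown = [c for c in suggested
               if ICD10CM_SHAPE.match(c) and c not in valid_codes]
    flagged = len(malformed) + len(unknown)
    return {
        "total": len(suggested),
        "malformed": malformed,
        "unknown": unknown,
        "flag_rate": flagged / len(suggested) if suggested else 0.0,
    }
```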
Further Reading
- Fine-Tuning Healthcare AI: From Clinical Notes to Compliant Deployment — End-to-end guide covering clinical NLP training and deployment
- Fine-Tuning AI for Healthcare: HIPAA-Compliant Pipeline — The hub guide for building HIPAA-compliant fine-tuning pipelines
- How to Evaluate a Fine-Tuned Model — Framework for measuring model quality before deployment