
    Human-in-the-Loop in Clinical Decision Support: How Healthcare AI Should (and Shouldn't) Work

    FDA SaMD guidance, HIPAA, and clinical ethics all point the same direction: AI in healthcare must keep clinicians in the decision loop. Here's what that looks like in practice.

Ertas Team

In 2021, a hospital system in the US deployed an AI tool to predict patient deterioration and flag high-risk patients for early intervention. The system was accurate in aggregate. What it was not designed to do was communicate which specific physiological markers drove each flag, or help staff prioritize when fifty patients were flagged simultaneously. Nursing staff, overwhelmed with alerts and lacking context for each one, developed workarounds. Even high-confidence flags were acknowledged and deprioritized. Patients deteriorated. The AI was not defective. The human-in-the-loop process was.

    This is the failure mode that matters most in healthcare AI. Not a rogue model. Not a catastrophic hallucination. A technically functional system embedded in a clinical workflow that made meaningful human oversight structurally impossible.

    FDA's SaMD Framework and What It Requires

    The FDA classifies Software as a Medical Device (SaMD) into three risk tiers based on the significance of the information it provides and the state of the healthcare situation it affects.

    Class I SaMD: Low risk. AI that provides information for non-serious conditions, where incorrect information is unlikely to cause patient harm. Example: a wellness app that tracks sleep patterns. Minimal regulatory requirements.

Class II SaMD: Moderate risk. AI that informs clinical management of non-serious or serious conditions, or drives clinical management of non-serious conditions. Requires 510(k) clearance. Must demonstrate substantial equivalence to a predicate device. Human-in-the-loop (HITL) review is expected; the software should provide information that a clinician reviews and acts upon.

    Class III SaMD: High risk. AI that diagnoses, treats, or drives clinical management of serious or immediately life-threatening conditions. Requires Premarket Approval (PMA). HITL is not optional. The FDA's position is explicit: AI recommendations that bypass qualified clinician review are not approvable for Class III indications.

    The FDA's 2019 and updated 2023 guidance on Predetermined Change Control Plans (PCCPs) added a second dimension: model updates. A PCCP defines in advance what types of changes a manufacturer can make to an AI/ML-based SaMD without requiring a new submission. Every PCCP must include a description of how the manufacturer will verify that changes perform as intended — and that verification must involve qualified human review of clinical performance data before the updated model deploys to production. You cannot silently update a clinical AI model the way you update a web app.
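
To make that verification step concrete, here is a minimal sketch of what a pre-release gate for a model update can look like. The function, metric names, and sign-off record are assumptions for illustration, not the FDA's required format; the point is that the gate checks clinical performance against the committed baseline and refuses to pass without documented human review.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ClinicalSignoff:
    """Documented review by a qualified clinician before an updated model ships."""
    reviewer_id: str      # a named, qualified reviewer, not a service account
    reviewed_at: datetime
    summary: str          # what was reviewed and why the change is acceptable

def approve_model_update(candidate: dict, baseline: dict,
                         signoff: ClinicalSignoff | None) -> bool:
    """Gate a model update on clinical performance and documented human review.

    `candidate` and `baseline` are assumed to hold metrics from an evaluation
    on a held-out clinical dataset (for example sensitivity and specificity).
    """
    # 1. The update must not degrade the performance the change plan commits to.
    if candidate["sensitivity"] < baseline["sensitivity"]:
        return False
    if candidate["specificity"] < baseline["specificity"]:
        return False
    # 2. A qualified human must have reviewed the performance data; an automated
    #    metrics check alone does not count as verification.
    return signoff is not None and bool(signoff.summary.strip())
```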

    HIPAA's Accountability Problem

    HIPAA does not address AI directly. It doesn't need to. The accountability structure it establishes makes the requirement clear.

    Covered entities — hospitals, clinics, health plans, healthcare clearinghouses — are legally responsible for the actions of their workforce and business associates in handling protected health information and making treatment decisions. The treating clinician is accountable for clinical decisions made in the course of care.

    An AI system cannot be a covered entity. It cannot be a business associate in the clinical decision sense. It has no license to revoke and no malpractice insurance to exhaust. When an AI system makes a clinical recommendation and a clinician acts on it without independent professional judgment, the liability exposure doesn't transfer to the AI vendor. It stays with the clinician and the institution.

This means that any clinical AI deployment that routes around physician review — that allows AI output to directly drive treatment without documented clinical validation — creates a HIPAA accountability gap. The institution cannot say the AI decided. It deployed the AI. It owns the decision.

    What HITL Looks Like in Clinical Practice

    Good HITL in healthcare is not a single pattern. It varies by clinical context and risk tier.

    Imaging AI (radiology, pathology, dermatology): The AI analyzes the image and produces a structured output — a flagged region, a differential, a confidence score. The radiologist or pathologist receives this output as additional information, not a final read. They perform their own independent analysis, then compare with the AI's finding. Their signed report is the clinical record. The AI's output is a tool they used, not the determination.

    Medication decision support: A pharmacy AI flags a potential drug interaction or dosing anomaly. The system presents the flag with specificity: the interacting agents, the mechanism of concern, the severity tier, and published references. The pharmacist reviews and either confirms the order is appropriate for this patient's clinical context, modifies the order, or escalates to the prescribing physician. The pharmacist's name is on the verification.

    Prior authorization AI: Insurance and health system AI tools pre-populate prior auth requests based on clinical documentation. A clinical staff member reviews the pre-populated request, confirms it accurately reflects the patient's record, and submits under their professional attestation.

    Sepsis prediction: The AI flags patients above a risk threshold. A nurse or clinical coordinator reviews the flagged patients, applies clinical judgment about which represent actionable risk given current context, and determines who to escalate to the rapid response team. The flag is not the action. The clinician's assessment is.

    The Alert Fatigue Problem

    Alert fatigue is where well-designed HITL goes to die.

    A clinical AI that flags 50 patients per shift for a nurse managing 12 beds is not providing decision support. It's providing noise. When clinicians are overwhelmed with alerts — most of which, on examination, are low-signal or irrelevant to their specific patient context — they adapt. They acknowledge alerts without reading them. They develop blanket policies: "if it's just an AI flag, put it in the chart and move on." The human-in-the-loop process is technically in place. It is functionally inert.

    The research on this is clear. A 2023 study in JAMIA found that clinicians overrode more than 90% of AI-generated medication alerts in one EHR deployment. Not because the AI was always wrong — it was right about 40% of the time. But the signal-to-noise ratio was so low that discerning which 40% required effort that the workflow didn't support.

    Alert fatigue doesn't mean clinicians stopped caring. It means the system was designed for coverage, not for clinical usability.

    The consequence is that high-signal alerts get missed in the noise of low-signal ones. The AI was there. The human was there. The loop was broken anyway.

    Designing HITL That Clinicians Actually Use

    The solution to alert fatigue is not fewer alerts for their own sake. It's alerts with sufficient signal quality that reviewers can make confident decisions quickly.

    Principle 1: Threshold calibration beats blanket alerting. If your sepsis model alerts on every patient above a 15% predicted risk, you will generate alerts on patients who are being appropriately managed and who the bedside nurse already knows are not deteriorating. Tune the threshold to where the alert changes clinical behavior — not where the model becomes technically correct.
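
As a rough sketch of what that tuning can look like, assuming you have historical risk scores, observed outcomes, and a realistic alert budget per shift (the names and numbers below are illustrative, not clinical guidance):

```python
import numpy as np

def calibrate_alert_threshold(risk_scores: np.ndarray, outcomes: np.ndarray,
                              shifts: int, max_alerts_per_shift: float) -> float:
    """Pick the lowest risk threshold whose alert volume the workflow can absorb.

    risk_scores: predicted deterioration risk per encounter (0 to 1)
    outcomes:    1 if the patient actually deteriorated, else 0
    shifts:      number of shifts covered by this historical sample
    """
    for threshold in np.linspace(0.05, 0.95, 91):
        flagged = risk_scores >= threshold
        alerts_per_shift = flagged.sum() / shifts
        if alerts_per_shift <= max_alerts_per_shift:
            # Report what this operating point costs so clinicians can judge it.
            sensitivity = outcomes[flagged].sum() / max(outcomes.sum(), 1)
            precision = outcomes[flagged].mean() if flagged.any() else 0.0
            print(f"threshold={threshold:.2f}  alerts/shift={alerts_per_shift:.1f}  "
                  f"sensitivity={sensitivity:.2f}  precision={precision:.2f}")
            return float(threshold)
    return 0.95  # even the strictest threshold over-alerts; revisit the model

```

The operating point is a workflow decision to review with the clinicians who will receive the alerts, not a number read off a ROC curve in isolation.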

    Principle 2: Show the why, not just the what. "Patient flagged for sepsis risk" is not HITL. "Patient flagged for sepsis risk: temperature 38.9°C, lactate 2.1 mmol/L trending up over 4 hours, MAP declining — three of four SIRS criteria met" is HITL. The reviewer needs enough information to validate the AI's reasoning independently, not to accept it on faith.
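
One way to make that concrete is to carry the contributing signals in the alert itself rather than just the score. A minimal sketch, with illustrative field names rather than any standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class SepsisAlert:
    patient_id: str
    risk_score: float                                    # model output, 0 to 1
    contributing_signals: list[str] = field(default_factory=list)

def render_for_review(alert: SepsisAlert) -> str:
    """Format the alert so the reviewer can check the reasoning, not just the score."""
    lines = [f"Sepsis risk {alert.risk_score:.0%} for {alert.patient_id}:"]
    lines += [f"  - {signal}" for signal in alert.contributing_signals]
    return "\n".join(lines)

example = SepsisAlert(
    patient_id="MRN-0000",
    risk_score=0.82,
    contributing_signals=[
        "Temperature 38.9 C",
        "Lactate 2.1 mmol/L, trending up over 4 hours",
        "MAP declining over the last 2 hours",
    ],
)
print(render_for_review(example))
```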

    Principle 3: Friction proportional to risk. A medication interaction alert for a minor, well-known interaction should take one click to acknowledge. A high-confidence sepsis flag for a patient the clinician hasn't seen in two hours should require a documented clinical assessment. The effort to dismiss should match the cost of being wrong.
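
A minimal sketch of tiered dismissal friction, with hypothetical tiers and requirements:

```python
from enum import Enum

class AlertTier(Enum):
    LOW = "low"    # e.g. a minor, well-documented interaction
    HIGH = "high"  # e.g. a high-confidence deterioration flag

# What the interface requires before an alert can be dismissed, by tier.
DISMISSAL_REQUIREMENTS = {
    AlertTier.LOW: {"acknowledge_click"},
    AlertTier.HIGH: {"acknowledge_click",
                     "documented_clinical_assessment",
                     "reviewer_identity_confirmed"},
}

def can_dismiss(tier: AlertTier, provided: set[str]) -> bool:
    """An alert is dismissible only once every required element has been supplied."""
    return DISMISSAL_REQUIREMENTS[tier] <= provided
```

Here `can_dismiss(AlertTier.HIGH, {"acknowledge_click"})` returns False: the one-click dismissal that is fine for the low tier is not enough for the high one.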

    Principle 4: Measure reviewer behavior, not just alert volume. If 95% of alerts are dismissed within five seconds, you don't have a review process. Track time-to-decision, override rate by alert type, and downstream outcomes for overridden vs. confirmed alerts. This data tells you whether your HITL is working.
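
A small sketch of that measurement, assuming an alert event log with the columns named below (illustrative, not a standard EHR export) and using pandas:

```python
import pandas as pd

def reviewer_behavior_report(events: pd.DataFrame) -> pd.DataFrame:
    """Summarise how reviewers actually handle alerts, per alert type.

    Expects one row per alert with (illustrative) columns:
      alert_type, shown_at, decided_at, decision ('override' or 'confirm')
    """
    events = events.copy()
    events["seconds_to_decision"] = (
        events["decided_at"] - events["shown_at"]
    ).dt.total_seconds()
    return events.groupby("alert_type").agg(
        alerts=("decision", "size"),
        median_seconds_to_decision=("seconds_to_decision", "median"),
        override_rate=("decision", lambda d: (d == "override").mean()),
        dismissed_under_5s=("seconds_to_decision", lambda s: (s < 5).mean()),
    )
```

A rising dismissed-under-five-seconds share for an alert type is the early warning that review of that alert has become a formality.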

    How Ertas Data Suite Supports Healthcare AI Development

    Before a clinical AI system goes anywhere near patients, the training data has to be prepared. In healthcare, that means working with PHI — and PHI cannot leave the building to sit in a cloud vendor's training infrastructure.

    Ertas Data Suite runs entirely on-premise as a native desktop application. PHI redaction, annotation, and export all happen within the institution's security boundary. Every annotation is logged with operator identity and timestamp. The audit trail is built into the tool, not assembled from system logs after the fact.

    For healthcare organizations building or fine-tuning AI for clinical applications, the data preparation pipeline needs to meet the same HITL and governance standards as the deployed model. A clinical AI trained on data prepared in a privacy-compliant, auditable pipeline starts with a defensible foundation.

    For more on the broader HITL framework, see What Is Human-in-the-Loop AI?. For coverage of fine-tuning models for healthcare deployment specifically, see our article on fine-tuning healthcare AI for clinical deployment.

    Book a discovery call with Ertas →

    Clinical AI that keeps humans in the loop isn't AI that's waiting to be replaced. It's AI that earns trust by being auditable, explainable, and clinician-controlled. The FDA, HIPAA, and clinical ethics all point the same direction. The question for your institution is whether your current AI deployments are designed to meet that standard — or designed to look like they do.

