
HITL Workflow Design Worksheet: Turn Any AI Use Case into a Human-in-the-Loop System
A practical worksheet for designing human oversight into AI workflows. Covers risk assessment, intervention points, review interface requirements, escalation thresholds, and audit requirements.
Human-in-the-loop (HITL) isn't a philosophy — it's an engineering decision. For every AI use case that requires human oversight, someone needs to specify exactly: where does the human intervene, who are they, what do they see, how long do they have, and what happens when they don't respond? Without those specifics, "we have human oversight" is a statement of intent, not a control.
Use this worksheet for each AI use case you're deploying that requires human oversight. One completed worksheet per system. Keep it with your model inventory entry.
Part 1: Use Case Profile
Fill this in before doing any technical design. If you can't answer these questions clearly, the use case isn't ready for implementation.
Use case name: _______________________________________________
System owner (name and title): _______________________________________________
Date completed: _______________________________________________
Brief description — What does the AI do, in one or two sentences?
What input does it receive?
What output does it produce? (Be specific: a score, a classification label, generated text, a recommended action, etc.)
What downstream action does the output trigger? (What happens next if the output is accepted?)
Part 2: Risk Assessment
Rate this use case on two dimensions, then use the result matrix to determine your required oversight mode.
Consequence Severity: How bad is a wrong output?
| Level | Definition | Examples |
|---|---|---|
| High | Significant harm to an individual; difficult or impossible to reverse | Loan rejection, medical treatment recommendation, legal filing, account termination |
| Medium | Material impact, reversible with effort and cost | Incorrect pricing, wrong content shown to user, delayed service |
| Low | Minor inconvenience, easily corrected with no lasting impact | Autocomplete suggestion, tag recommendation, internal summary |
Your rating: [ ] High [ ] Medium [ ] Low
Decision Frequency: How many decisions per day?
| Level | Volume |
|---|---|
| High | More than 10,000 decisions per day |
| Medium | 100 to 10,000 decisions per day |
| Low | Fewer than 100 decisions per day |
Your rating: [ ] High [ ] Medium [ ] Low
Result Matrix
| | Low Frequency | Medium Frequency | High Frequency |
|---|---|---|---|
| High Consequence | HITL required | HITL required | HITL required |
| Medium Consequence | HITL recommended | HOTL with sampling | HOTL with sampling |
| Low Consequence | HOOTL with logging | HOOTL with logging | HOOTL with logging |
Your oversight mode:
[ ] HITL (human-in-the-loop) — Human approves every output before the downstream action is taken
[ ] HOTL (human-on-the-loop) — Human monitors a defined sample of outputs and can override. Sampling rate: _______%
[ ] HOOTL (human-out-of-the-loop) — Fully automated; humans monitor aggregate metrics
Note: If your organization's governance policy mandates HITL for this risk tier regardless of volume, that policy takes precedence over this matrix.
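If you encode this worksheet in software, the matrix above reduces to a small lookup. Here is a minimal sketch in Python; the enum and function names are illustrative rather than from any library, and a stricter governance policy would still take precedence over the return value:

```python
from enum import Enum

class Level(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class OversightMode(Enum):
    HITL = "human-in-the-loop"        # human approves every output
    HOTL = "human-on-the-loop"        # human monitors a sample, can override
    HOOTL = "human-out-of-the-loop"   # fully automated, aggregate monitoring

def required_oversight(severity: Level, frequency: Level) -> OversightMode:
    """Map the Part 2 result matrix to an oversight mode."""
    if severity == Level.HIGH:
        return OversightMode.HITL     # HITL required at any volume
    if severity == Level.MEDIUM:
        # HITL recommended at low volume; HOTL with sampling otherwise
        return OversightMode.HITL if frequency == Level.LOW else OversightMode.HOTL
    return OversightMode.HOOTL        # low consequence: log and monitor

# Example: a medium-consequence system making ~5,000 decisions per day
print(required_oversight(Level.MEDIUM, Level.MEDIUM))  # OversightMode.HOTL
```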
Part 3: Intervention Point Design
Complete this section only for HITL and HOTL systems.
Where does the human review occur?
Mark all that apply:
[ ] Before the model runs (human reviews input data and decides whether to proceed)
[ ] After the model produces output but before downstream action (most common for HITL)
[ ] Before a specific downstream action within a multi-step workflow (specify which action below)
[ ] All of the above
Specific action gating human approval (if applicable):
Reviewer specification
Who is the designated reviewer? (role/title, not individual name — specify individual assignment in your team's RACI)
Is there a secondary reviewer for escalations? (role/title)
What is the maximum acceptable review time (SLA)?
What happens if a reviewer doesn't respond within the SLA?
[ ] Escalate to secondary reviewer
[ ] Escalate to [role]: _______________
[ ] Auto-reject (do not take downstream action)
[ ] Auto-approve (only for Tier 3 systems with documented justification)
[ ] Pause entire workflow and alert system owner
Can the reviewer request additional information before deciding?
[ ] No — reviewer must decide based on what is displayed
[ ] Yes — reviewer can request: _______________________________________________
SLA for information request response: _______________
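One way to make the SLA and breach behavior concrete is to treat each pending review as a task with a deadline and a pre-configured breach action. The sketch below is a simplified illustration: the ReviewTask structure and its field names are assumptions, and the queueing and notification machinery is omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum

class BreachAction(Enum):
    ESCALATE_SECONDARY = "escalate_to_secondary_reviewer"
    AUTO_REJECT = "auto_reject"          # do not take the downstream action
    PAUSE_AND_ALERT = "pause_and_alert_system_owner"

@dataclass
class ReviewTask:
    case_id: str
    assigned_role: str                   # role/title, not an individual
    created_at: datetime
    sla: timedelta                       # e.g. timedelta(hours=4)
    breach_action: BreachAction

def check_sla(task: ReviewTask, now: datetime | None = None) -> BreachAction | None:
    """Return the configured breach action if the review SLA has expired."""
    now = now or datetime.now(timezone.utc)
    if now - task.created_at > task.sla:
        return task.breach_action
    return None  # still within SLA; keep waiting for the reviewer
```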
Part 4: Review Interface Requirements
The review interface is where automation bias risk is highest. Reviewers shown only the AI output with a single approve/reject button will rubber-stamp most outputs. The interface must show enough context for the reviewer to exercise genuine judgment.
Required elements — check all that must be present in your review interface:
[ ] AI output (the specific recommendation, classification, or generated text)
[ ] Confidence score or probability if the model produces one
[ ] Alternative outputs considered by the model (top-N alternatives, if applicable)
[ ] The input data that produced this output
[ ] The model version that produced this output
[ ] Similar past cases and their outcomes (reference cases)
[ ] Relevant regulatory or policy context for this decision type
[ ] Time remaining in the review SLA (visible countdown)
[ ] Flags or anomaly indicators if the input is outside the model's training distribution
[ ] Previous decisions by this reviewer on similar cases (to support calibration)
Additional context required for this specific use case:
What must the reviewer record when rejecting?
[ ] Rejection reason (required, free text)
[ ] Rejection category (required, select from list): _______________
[ ] Recommended correction: _______________
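If the review interface is custom-built, defining the reviewer-facing payload as an explicit structure makes missing context a build-time error rather than a production discovery. A minimal sketch follows; the field names are assumptions that mirror the checklist above.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewContext:
    """Everything the reviewer sees for one case (Part 4 checklist)."""
    ai_output: str                        # recommendation, label, or generated text
    model_version: str                    # model artifact version, not app version
    input_data: dict                      # the input that produced this output
    confidence: float | None = None       # omit if the model produces none
    alternatives: list[str] = field(default_factory=list)     # top-N alternatives
    reference_cases: list[str] = field(default_factory=list)  # similar past cases
    anomaly_flags: list[str] = field(default_factory=list)    # out-of-distribution signals
    sla_deadline_utc: str | None = None   # drives the visible countdown

@dataclass
class RejectionRecord:
    """What the reviewer must record when rejecting (Part 4)."""
    reason: str                           # required free text
    category: str                         # required, from a fixed list
    recommended_correction: str | None = None
```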
Part 5: Escalation Thresholds
Define the thresholds that determine routing for each decision. Be specific — vague thresholds produce inconsistent behavior.
| Routing | Condition |
|---|---|
| Auto-approve | Confidence > _____% AND output type = _____ AND no flags raised |
| Standard review | Confidence between _____% and _____% OR any of these flags: _____ |
| Senior review | Confidence < _____% OR output includes any of: _____ |
| Automatic reject | Output contains any of: _____ (e.g., prohibited content, out-of-scope request) |
Note on setting thresholds: Don't set auto-approve thresholds before you have empirical data on the model's actual confidence calibration. Run the model in shadow mode (outputs logged but no downstream actions taken) for at least 30 days and review the distribution of confidence scores alongside human reviewer decisions before finalizing thresholds.
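Once thresholds are set, the routing rules above collapse into a short, auditable function. The sketch below uses placeholder numbers borrowed from the completed example later in this worksheet; the flag names are assumptions.

```python
def route_decision(confidence: float, flags: set[str]) -> str:
    """Route one model output per the Part 5 escalation table."""
    PROHIBITED = {"prohibited_content", "out_of_scope"}          # automatic-reject conditions
    SENIOR_FLAGS = {"limitation_of_liability", "ip_ownership"}   # example senior-review triggers

    if flags & PROHIBITED:
        return "auto_reject"
    if confidence < 0.70 or flags & SENIOR_FLAGS:
        return "senior_review"
    if confidence > 0.92 and not flags:
        return "auto_approve"
    return "standard_review"
```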
Part 6: Automation Bias Prevention
Automation bias — the tendency of human reviewers to defer to AI outputs without genuine independent evaluation — undermines HITL controls. These controls reduce it.
Retrospective spot-check
[ ] _____% of auto-approved decisions are reviewed retrospectively each week by a senior reviewer
Reviewer calibration
[ ] Reviewers are tested on a set of known cases (where the correct answer is established) at least monthly. Pass rate required: _____%. Reviewers failing calibration tests are retrained before returning to review duties.
Override rate monitoring
[ ] Override rates are calculated weekly. Target override rate range: _____% to _____%
[ ] Override rate deviating more than _____% from the baseline range triggers: _______________________________________________
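The weekly override-rate check can run directly over the audit log. This sketch assumes each reviewed case contributes one decision string ("approve", "reject", or "escalate") as specified in Part 7; the baseline range values are placeholders.

```python
def override_rate(decisions: list[str]) -> float:
    """Share of reviewed cases that were rejected or escalated."""
    if not decisions:
        return 0.0
    overrides = sum(1 for d in decisions if d in ("reject", "escalate"))
    return overrides / len(decisions)

def check_override_rate(decisions: list[str], low: float = 0.15, high: float = 0.25) -> str | None:
    """Return an alert message if the weekly rate leaves the baseline range."""
    rate = override_rate(decisions)
    if rate < low:
        return f"Override rate {rate:.1%} below baseline; reviewers may be rubber-stamping"
    if rate > high:
        return f"Override rate {rate:.1%} above baseline; model may be over-flagging"
    return None
```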
Reviewer rotation
[ ] No single reviewer handles more than _____% of cases in any given week
[ ] Minimum number of active reviewers: _____
Feedback loop
[ ] Reviewer decisions are used to periodically evaluate model performance. When reviewers consistently override the model on a specific input type, this triggers model evaluation within _____ days.
Part 7: Audit Trail Requirements
For every human decision in this workflow, the following must be logged automatically. This is not optional for Tier 1 or Tier 2 systems: audit trails of this kind are expected under SR 11-7, the EU AI Act's record-keeping obligations, and HIPAA (where applicable).
Required log fields — confirm each will be captured:
[ ] Timestamp in UTC (not local time)
[ ] Reviewer identity (user ID, not name — name can change; user ID persists)
[ ] AI output that was reviewed (exact text, score, or classification)
[ ] Model version that produced the output (not application version — model artifact version)
[ ] Reviewer decision: Approve / Reject / Escalate
[ ] Reviewer reasoning (free text — required for Reject and Escalate)
[ ] Time taken to review (seconds from display to decision)
[ ] Downstream action taken as a result of the review decision
[ ] Case or entity identifier (so you can reconstruct which individual was affected)
Log retention period: _______________________________________________
Log storage location: _______________________________________________
Tamper-evidence mechanism (hash chain, write-once storage, etc.): _______________________________________________
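A minimal shape for one audit record, with a simple hash chain as one possible tamper-evidence mechanism. The field names follow the checklist above; the hashing scheme is an illustration, not a prescription.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class ReviewAuditRecord:
    timestamp_utc: str        # ISO 8601, UTC
    reviewer_id: str          # stable user ID, not display name
    case_id: str              # entity affected by the decision
    model_version: str        # model artifact version
    ai_output: str            # exact output that was reviewed
    decision: str             # "approve" | "reject" | "escalate"
    reasoning: str            # required for reject and escalate
    review_seconds: float     # display-to-decision duration
    downstream_action: str    # what actually happened next
    prev_hash: str            # hash of the previous record (chain)

def record_hash(record: ReviewAuditRecord) -> str:
    """Hash the record contents plus the previous hash so edits are detectable."""
    payload = json.dumps(asdict(record), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()
```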
Part 8: Effectiveness Metrics
Define target values before you launch. Measure against them monthly. If you don't establish a baseline, you can't detect drift.
| Metric | Target | How It's Measured |
|---|---|---|
| Review completion rate | _____% of cases requiring review actually receive it | Cases reviewed / cases requiring review |
| Override rate | _____% (establish baseline after 30 days) | Cases rejected or escalated / cases reviewed |
| Time-to-decision | _____ minutes average | Mean of review duration log field |
| False negative rate | _____% | Approved cases the senior retrospective reviewer judges incorrect / cases retrospectively reviewed |
| Audit log completeness | 100% | Log records / inference records for reviewed cases |
| Reviewer calibration pass rate | _____% | Calibration test results, monthly |
Reporting cadence: These metrics are reported to _____________________ on a [weekly / monthly] basis.
Escalation triggers — if any metric deviates beyond acceptable range, the following action is taken:
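Most of the table above can be computed directly from the Part 7 audit log. A sketch, assuming records shaped like the audit-record example in Part 7:

```python
from statistics import mean

def part8_metrics(log_records: list[dict],
                  cases_requiring_review: int,
                  inference_records_for_reviewed_cases: int) -> dict:
    """Compute Part 8 metrics from Part 7 audit log records."""
    reviewed = len(log_records)
    overridden = sum(1 for r in log_records if r["decision"] in ("reject", "escalate"))
    return {
        "review_completion_rate": reviewed / cases_requiring_review if cases_requiring_review else 0.0,
        "override_rate": overridden / reviewed if reviewed else 0.0,
        "time_to_decision_minutes": (mean(r["review_seconds"] for r in log_records) / 60) if log_records else 0.0,
        "audit_log_completeness": reviewed / inference_records_for_reviewed_cases
                                  if inference_records_for_reviewed_cases else 0.0,
    }
```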
Completed Example: Contract Clause Review AI
To illustrate how this worksheet works in practice, here's a filled example for a legal technology use case.
Use case name: Contract Clause Risk Screener
Brief description: The AI reviews uploaded contract clauses and flags those with non-standard terms, missing protections, or unusual liability allocation, and produces a risk rating (Low / Medium / High) with a summary explanation.
Input: Individual contract clause text (extracted from uploaded PDFs)
Output: Risk rating (Low / Medium / High) and 2-3 sentence explanation of flagged issues
Downstream action: High-rated clauses are automatically queued for attorney review; Low-rated clauses are auto-accepted; Medium-rated clauses are queued for paralegal review
Consequence severity: High (legal and financial consequences for missed risky clauses)
Decision frequency: Medium (200-500 clauses per day)
Oversight mode: HITL for High-rated clauses; HOTL (10% sample) for Medium-rated clauses
Reviewer: Senior Associate (for High); Paralegal (for Medium sample review)
SLA: High-rated: 4 hours; Medium sample review: 24 hours
SLA breach action: Escalate to Partner on duty
Auto-approve threshold: Confidence > 92% AND rating = Low AND no IP or indemnification clauses flagged
Senior review threshold: Any clause where model confidence < 70% OR clause involves limitation of liability, IP ownership, or governing law
Override rate target: 15-25% for High-rated clauses (if override rate drops below 10%, reviewers may be rubber-stamping; if above 35%, model may be over-flagging)
Connecting to Your Audit Infrastructure
Part 7 of this worksheet specifies the exact log fields required for each human review decision. Those fields need to be captured by your review interface and stored in a system that can produce them on demand for regulators.
If you're building a custom review interface, log completeness should be a launch-blocking requirement — not a post-launch improvement. Missing audit records from before a logging fix are gone; they cannot be reconstructed.
Ertas Data Suite's built-in operator logging and human review tracking captures the Part 7 fields automatically for data pipeline review steps, with tamper-evident storage and structured export for regulatory submission.