
HITL Workflow Design Worksheet: Turn Any AI Use Case into a Human-in-the-Loop System
A practical worksheet for designing human oversight into AI workflows. Covers risk assessment, intervention points, review interface requirements, escalation thresholds, and audit requirements.
Human-in-the-loop (HITL) isn't a philosophy — it's an engineering decision. For every AI use case that requires human oversight, someone needs to specify exactly: where does the human intervene, who are they, what do they see, how long do they have, and what happens when they don't respond? Without those specifics, "we have human oversight" is a statement of intent, not a control.
Use this worksheet for each AI use case you're deploying that requires human oversight. One completed worksheet per system. Keep it with your model inventory entry.
Part 1: Use Case Profile
Fill this in before doing any technical design. If you can't answer these questions clearly, the use case isn't ready for implementation.
Use case name: _______________________________________________
System owner (name and title): _______________________________________________
Date completed: _______________________________________________
Brief description — What does the AI do, in one or two sentences?
What input does it receive?
What output does it produce? (Be specific: a score, a classification label, generated text, a recommended action, etc.)
What downstream action does the output trigger? (What happens next if the output is accepted?)
Part 2: Risk Assessment
Rate this use case on two dimensions, then use the result matrix to determine your required oversight mode.
Consequence Severity: How bad is a wrong output?
| Level | Definition | Examples |
|---|---|---|
| High | Significant harm to an individual; difficult or impossible to reverse | Loan rejection, medical treatment recommendation, legal filing, account termination |
| Medium | Material impact, reversible with effort and cost | Incorrect pricing, wrong content shown to user, delayed service |
| Low | Minor inconvenience, easily corrected with no lasting impact | Autocomplete suggestion, tag recommendation, internal summary |
Your rating: [ ] High [ ] Medium [ ] Low
Decision Frequency: How many decisions per day?
| Level | Volume |
|---|---|
| High | More than 10,000 decisions per day |
| Medium | 100 to 10,000 decisions per day |
| Low | Fewer than 100 decisions per day |
Your rating: [ ] High [ ] Medium [ ] Low
Result Matrix
| | Low Frequency | Medium Frequency | High Frequency |
|---|---|---|---|
| High Consequence | HITL required | HITL required | HITL required |
| Medium Consequence | HITL recommended | HOTL with sampling | HOTL with sampling |
| Low Consequence | HOOTL with logging | HOOTL with logging | HOOTL with logging |
Your oversight mode:
[ ] HITL (human-in-the-loop) — Human approves every output before the downstream action is taken
[ ] HOTL (human-on-the-loop) — Human monitors a defined sample of outputs and can override. Sampling rate: _______%
[ ] HOOTL (human-out-of-the-loop) — Fully automated; humans monitor aggregate metrics
Note: If your organization's governance policy mandates HITL for this risk tier regardless of volume, that policy takes precedence over this matrix.
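If you encode this worksheet in software, the matrix above reduces to a small lookup. Here is a minimal sketch in Python; the enum and function names are illustrative rather than from any library, and a stricter governance policy would still take precedence over the return value:

```python
from enum import Enum

class Level(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class OversightMode(Enum):
    HITL = "human-in-the-loop"        # human approves every output
    HOTL = "human-on-the-loop"        # human monitors a sample, can override
    HOOTL = "human-out-of-the-loop"   # fully automated, aggregate monitoring

def required_oversight(severity: Level, frequency: Level) -> OversightMode:
    """Map the Part 2 result matrix to an oversight mode."""
    if severity == Level.HIGH:
        return OversightMode.HITL     # HITL required at any volume
    if severity == Level.MEDIUM:
        # HITL recommended at low volume; HOTL with sampling otherwise
        return OversightMode.HITL if frequency == Level.LOW else OversightMode.HOTL
    return OversightMode.HOOTL        # low consequence: log and monitor

# Example: a medium-consequence system making ~5,000 decisions per day
print(required_oversight(Level.MEDIUM, Level.MEDIUM))  # OversightMode.HOTL
```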
Part 3: Intervention Point Design
Complete this section only for HITL and HOTL systems.
Where does the human review occur?
Mark all that apply:
[ ] Before the model runs (human reviews input data and decides whether to proceed)
[ ] After the model produces output but before downstream action (most common for HITL)
[ ] Before a specific downstream action within a multi-step workflow (specify which action below)
[ ] All of the above
Specific action gating human approval (if applicable):
Reviewer specification
Who is the designated reviewer? (role/title, not individual name — specify individual assignment in your team's RACI)
Is there a secondary reviewer for escalations? (role/title)
What is the maximum acceptable review time (SLA)?
What happens if a reviewer doesn't respond within the SLA?
[ ] Escalate to secondary reviewer
[ ] Escalate to [role]: _______________
[ ] Auto-reject (do not take downstream action)
[ ] Auto-approve (only for Tier 3 systems with documented justification)
[ ] Pause entire workflow and alert system owner
Can the reviewer request additional information before deciding?
[ ] No — reviewer must decide based on what is displayed
[ ] Yes — reviewer can request: _______________________________________________
SLA for information request response: _______________
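One way to make the SLA and breach behavior concrete is to treat each pending review as a task with a deadline and a pre-configured breach action. The sketch below is a simplified illustration: the ReviewTask structure and its field names are assumptions, and the queueing and notification machinery is omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum

class BreachAction(Enum):
    ESCALATE_SECONDARY = "escalate_to_secondary_reviewer"
    AUTO_REJECT = "auto_reject"          # do not take the downstream action
    PAUSE_AND_ALERT = "pause_and_alert_system_owner"

@dataclass
class ReviewTask:
    case_id: str
    assigned_role: str                   # role/title, not an individual
    created_at: datetime
    sla: timedelta                       # e.g. timedelta(hours=4)
    breach_action: BreachAction

def check_sla(task: ReviewTask, now: datetime | None = None) -> BreachAction | None:
    """Return the configured breach action if the review SLA has expired."""
    now = now or datetime.now(timezone.utc)
    if now - task.created_at > task.sla:
        return task.breach_action
    return None  # still within SLA; keep waiting for the reviewer
```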
Part 4: Review Interface Requirements
The review interface is where automation bias risk is highest. Reviewers shown only the AI output with a single approve/reject button will rubber-stamp most outputs. The interface must show enough context for the reviewer to exercise genuine judgment.
Required elements — check all that must be present in your review interface:
[ ] AI output (the specific recommendation, classification, or generated text)
[ ] Confidence score or probability if the model produces one
[ ] Alternative outputs considered by the model (top-N alternatives, if applicable)
[ ] The input data that produced this output
[ ] The model version that produced this output
[ ] Similar past cases and their outcomes (reference cases)
[ ] Relevant regulatory or policy context for this decision type
[ ] Time remaining in the review SLA (visible countdown)
[ ] Flags or anomaly indicators if the input is outside the model's training distribution
[ ] Previous decisions by this reviewer on similar cases (to support calibration)
Additional context required for this specific use case:
What must the reviewer record when rejecting?
[ ] Rejection reason (required, free text)
[ ] Rejection category (required, select from list): _______________
[ ] Recommended correction: _______________
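If the review interface is custom-built, defining the reviewer-facing payload as an explicit structure makes missing context a build-time error rather than a production discovery. A minimal sketch follows; the field names are assumptions that mirror the checklist above.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewContext:
    """Everything the reviewer sees for one case (Part 4 checklist)."""
    ai_output: str                        # recommendation, label, or generated text
    model_version: str                    # model artifact version, not app version
    input_data: dict                      # the input that produced this output
    confidence: float | None = None       # omit if the model produces none
    alternatives: list[str] = field(default_factory=list)     # top-N alternatives
    reference_cases: list[str] = field(default_factory=list)  # similar past cases
    anomaly_flags: list[str] = field(default_factory=list)    # out-of-distribution signals
    sla_deadline_utc: str | None = None   # drives the visible countdown

@dataclass
class RejectionRecord:
    """What the reviewer must record when rejecting (Part 4)."""
    reason: str                           # required free text
    category: str                         # required, from a fixed list
    recommended_correction: str | None = None
```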
Part 5: Escalation Thresholds
Define the thresholds that determine routing for each decision. Be specific — vague thresholds produce inconsistent behavior.
| Routing | Condition |
|---|---|
| Auto-approve | Confidence > _____% AND output type = _____ AND no flags raised |
| Standard review | Confidence between _____% and _____% OR any of these flags: _____ |
| Senior review | Confidence < _____% OR output includes any of: _____ |
| Automatic reject | Output contains any of: _____ (e.g., prohibited content, out-of-scope request) |
Note on setting thresholds: Don't set auto-approve thresholds before you have empirical data on the model's actual confidence calibration. Run the model in shadow mode (outputs logged but no downstream actions taken) for at least 30 days and review the distribution of confidence scores alongside human reviewer decisions before finalizing thresholds.
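Once thresholds are set, the routing rules above collapse into a short, auditable function. The sketch below uses placeholder numbers borrowed from the completed example later in this worksheet; the flag names are assumptions.

```python
def route_decision(confidence: float, flags: set[str]) -> str:
    """Route one model output per the Part 5 escalation table."""
    PROHIBITED = {"prohibited_content", "out_of_scope"}          # automatic-reject conditions
    SENIOR_FLAGS = {"limitation_of_liability", "ip_ownership"}   # example senior-review triggers

    if flags & PROHIBITED:
        return "auto_reject"
    if confidence < 0.70 or flags & SENIOR_FLAGS:
        return "senior_review"
    if confidence > 0.92 and not flags:
        return "auto_approve"
    return "standard_review"
```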
Part 6: Automation Bias Prevention
Automation bias — the tendency of human reviewers to defer to AI outputs without genuine independent evaluation — undermines HITL controls. These controls reduce it.
Retrospective spot-check
[ ] _____% of auto-approved decisions are reviewed retrospectively each week by a senior reviewer
Reviewer calibration
[ ] Reviewers are tested on a set of known cases (where the correct answer is established) at least monthly. Pass rate required: _____%. Reviewers failing calibration tests are retrained before returning to review duties.
Override rate monitoring
[ ] Override rates are calculated weekly. Target override rate range: _____% to _____%
[ ] Override rate deviating more than _____% from the baseline range triggers: _______________________________________________
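The weekly override-rate check can run directly over the audit log. This sketch assumes each reviewed case contributes one decision string ("approve", "reject", or "escalate") as specified in Part 7; the baseline range values are placeholders.

```python
def override_rate(decisions: list[str]) -> float:
    """Share of reviewed cases that were rejected or escalated."""
    if not decisions:
        return 0.0
    overrides = sum(1 for d in decisions if d in ("reject", "escalate"))
    return overrides / len(decisions)

def check_override_rate(decisions: list[str], low: float = 0.15, high: float = 0.25) -> str | None:
    """Return an alert message if the weekly rate leaves the baseline range."""
    rate = override_rate(decisions)
    if rate < low:
        return f"Override rate {rate:.1%} below baseline; reviewers may be rubber-stamping"
    if rate > high:
        return f"Override rate {rate:.1%} above baseline; model may be over-flagging"
    return None
```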
Reviewer rotation
[ ] No single reviewer handles more than _____% of cases in any given week
[ ] Minimum number of active reviewers: _____
Feedback loop
[ ] Reviewer decisions are used to periodically evaluate model performance. When reviewers consistently override the model on a specific input type, this triggers model evaluation within _____ days.
Part 7: Audit Trail Requirements
For every human decision in this workflow, the following must be logged automatically. This is not optional for Tier 1 or Tier 2 systems: audit trails of this kind are expected under SR 11-7, the EU AI Act's record-keeping obligations, and HIPAA (where applicable).
Required log fields — confirm each will be captured:
[ ] Timestamp in UTC (not local time)
[ ] Reviewer identity (user ID, not name — name can change; user ID persists)
[ ] AI output that was reviewed (exact text, score, or classification)
[ ] Model version that produced the output (not application version — model artifact version)
[ ] Reviewer decision: Approve / Reject / Escalate
[ ] Reviewer reasoning (free text — required for Reject and Escalate)
[ ] Time taken to review (seconds from display to decision)
[ ] Downstream action taken as a result of the review decision
[ ] Case or entity identifier (so you can reconstruct which individual was affected)
Log retention period: _______________________________________________
Log storage location: _______________________________________________
Tamper-evidence mechanism (hash chain, write-once storage, etc.): _______________________________________________
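A minimal shape for one audit record, with a simple hash chain as one possible tamper-evidence mechanism. The field names follow the checklist above; the hashing scheme is an illustration, not a prescription.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class ReviewAuditRecord:
    timestamp_utc: str        # ISO 8601, UTC
    reviewer_id: str          # stable user ID, not display name
    case_id: str              # entity affected by the decision
    model_version: str        # model artifact version
    ai_output: str            # exact output that was reviewed
    decision: str             # "approve" | "reject" | "escalate"
    reasoning: str            # required for reject and escalate
    review_seconds: float     # display-to-decision duration
    downstream_action: str    # what actually happened next
    prev_hash: str            # hash of the previous record (chain)

def record_hash(record: ReviewAuditRecord) -> str:
    """Hash the record contents plus the previous hash so edits are detectable."""
    payload = json.dumps(asdict(record), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()
```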
Part 8: Effectiveness Metrics
Define target values before you launch. Measure against them monthly. If you don't establish a baseline, you can't detect drift.
| Metric | Target | How It's Measured |
|---|---|---|
| Review completion rate | _____% of cases requiring review actually receive it | Cases reviewed / cases requiring review |
| Override rate | _____% (establish baseline after 30 days) | Cases rejected or escalated / cases reviewed |
| Time-to-decision | _____ minutes average | Mean of review duration log field |
| False negative rate | _____% | Approved cases the senior retrospective reviewer judges incorrect / cases retrospectively reviewed |
| Audit log completeness | 100% | Log records / inference records for reviewed cases |
| Reviewer calibration pass rate | _____% | Calibration test results, monthly |
Reporting cadence: These metrics are reported to _____________________ on a [weekly / monthly] basis.
Escalation triggers — if any metric deviates beyond acceptable range, the following action is taken:
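Most of the table above can be computed directly from the Part 7 audit log. A sketch, assuming records shaped like the audit-record example in Part 7:

```python
from statistics import mean

def part8_metrics(log_records: list[dict],
                  cases_requiring_review: int,
                  inference_records_for_reviewed_cases: int) -> dict:
    """Compute Part 8 metrics from Part 7 audit log records."""
    reviewed = len(log_records)
    overridden = sum(1 for r in log_records if r["decision"] in ("reject", "escalate"))
    return {
        "review_completion_rate": reviewed / cases_requiring_review if cases_requiring_review else 0.0,
        "override_rate": overridden / reviewed if reviewed else 0.0,
        "time_to_decision_minutes": (mean(r["review_seconds"] for r in log_records) / 60) if log_records else 0.0,
        "audit_log_completeness": reviewed / inference_records_for_reviewed_cases
                                  if inference_records_for_reviewed_cases else 0.0,
    }
```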
Completed Example: Contract Clause Review AI
To illustrate how this worksheet works in practice, here's a filled example for a legal technology use case.
Use case name: Contract Clause Risk Screener
Brief description: The AI reviews uploaded contract clauses and flags those with non-standard terms, missing protections, or unusual liability allocation, and produces a risk rating (Low / Medium / High) with a summary explanation.
Input: Individual contract clause text (extracted from uploaded PDFs)
Output: Risk rating (Low / Medium / High) and 2-3 sentence explanation of flagged issues
Downstream action: High-rated clauses are automatically queued for attorney review; Low-rated clauses are auto-accepted; Medium-rated clauses are queued for paralegal review
Consequence severity: High (legal and financial consequences for missed risky clauses)
Decision frequency: Medium (200-500 clauses per day)
Oversight mode: HITL for High-rated clauses; HOTL (10% sample) for Medium-rated clauses
Reviewer: Senior Associate (for High); Paralegal (for Medium sample review)
SLA: High-rated: 4 hours; Medium sample review: 24 hours
SLA breach action: Escalate to Partner on duty
Auto-approve threshold: Confidence > 92% AND rating = Low AND no IP or indemnification clauses flagged
Senior review threshold: Any clause where model confidence < 70% OR clause involves limitation of liability, IP ownership, or governing law
Override rate target: 15-25% for High-rated clauses (if override rate drops below 10%, reviewers may be rubber-stamping; if above 35%, model may be over-flagging)
Connecting to Your Audit Infrastructure
Part 7 of this worksheet specifies the exact log fields required for each human review decision. Those fields need to be captured by your review interface and stored in a system that can produce them on demand for regulators.
If you're building a custom review interface, log completeness should be a launch-blocking requirement — not a post-launch improvement. Missing audit records from before a logging fix are gone; they cannot be reconstructed.
Ertas Data Suite's built-in operator logging and human review tracking captures the Part 7 fields automatically for data pipeline review steps, with tamper-evident storage and structured export for regulatory submission.