HITL Workflow Design Worksheet: Turn Any AI Use Case into a Human-in-the-Loop System


    A practical worksheet for designing human oversight into AI workflows. Covers risk assessment, intervention points, review interface requirements, escalation thresholds, and audit requirements.

Ertas Team

    Human-in-the-loop (HITL) isn't a philosophy — it's an engineering decision. For every AI use case that requires human oversight, someone needs to specify exactly: where does the human intervene, who are they, what do they see, how long do they have, and what happens when they don't respond? Without those specifics, "we have human oversight" is a statement of intent, not a control.

    Use this worksheet for each AI use case you're deploying that requires human oversight. One completed worksheet per system. Keep it with your model inventory entry.


    Part 1: Use Case Profile

    Fill this in before doing any technical design. If you can't answer these questions clearly, the use case isn't ready for implementation.

    Use case name: _______________________________________________

    System owner (name and title): _______________________________________________

    Date completed: _______________________________________________

    Brief description — What does the AI do, in one or two sentences?


    What input does it receive?


    What output does it produce? (Be specific: a score, a classification label, generated text, a recommended action, etc.)


    What downstream action does the output trigger? (What happens next if the output is accepted?)



    Part 2: Risk Assessment

    Rate this use case on two dimensions, then use the result matrix to determine your required oversight mode.

    Consequence Severity: How bad is a wrong output?

| Level | Definition | Examples |
|---|---|---|
| High | Significant harm to an individual; difficult or impossible to reverse | Loan rejection, medical treatment recommendation, legal filing, account termination |
| Medium | Material impact, reversible with effort and cost | Incorrect pricing, wrong content shown to user, delayed service |
| Low | Minor inconvenience, easily corrected with no lasting impact | Autocomplete suggestion, tag recommendation, internal summary |

    Your rating: [ ] High [ ] Medium [ ] Low

    Decision Frequency: How many decisions per day?

| Level | Volume |
|---|---|
| High | More than 10,000 decisions per day |
| Medium | 100 to 10,000 decisions per day |
| Low | Fewer than 100 decisions per day |

    Your rating: [ ] High [ ] Medium [ ] Low

    Result Matrix

| | Low Frequency | Medium Frequency | High Frequency |
|---|---|---|---|
| High Consequence | HITL required | HITL required | HITL required |
| Medium Consequence | HITL recommended | HOTL with sampling | HOTL with sampling |
| Low Consequence | HOOTL with logging | HOOTL with logging | HOOTL with logging |

    Your oversight mode:

    [ ] HITL (human-in-the-loop) — Human approves every output before downstream action is taken

    [ ] HOTL (human-on-the-loop) — Human monitors with a defined sampling rate and can override. Sampling rate: _______%

    [ ] HOOTL (human-out-of-the-loop) — Fully automated; humans monitor aggregate metrics only

    Note: If your organization's governance policy mandates HITL for this risk tier regardless of volume, that policy takes precedence over this matrix.
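The matrix above can be encoded as a lookup table so the oversight mode is derived consistently rather than decided case by case. A minimal sketch — the `Level` enum and mode strings are illustrative, not part of any real API:

```python
from enum import Enum

class Level(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

# (consequence severity, decision frequency) -> required oversight mode
OVERSIGHT_MATRIX = {
    (Level.HIGH, Level.LOW): "HITL",
    (Level.HIGH, Level.MEDIUM): "HITL",
    (Level.HIGH, Level.HIGH): "HITL",
    (Level.MEDIUM, Level.LOW): "HITL (recommended)",
    (Level.MEDIUM, Level.MEDIUM): "HOTL with sampling",
    (Level.MEDIUM, Level.HIGH): "HOTL with sampling",
    (Level.LOW, Level.LOW): "HOOTL with logging",
    (Level.LOW, Level.MEDIUM): "HOOTL with logging",
    (Level.LOW, Level.HIGH): "HOOTL with logging",
}

def required_oversight(severity: Level, frequency: Level) -> str:
    """Look up the oversight mode mandated by the result matrix."""
    return OVERSIGHT_MATRIX[(severity, frequency)]

print(required_oversight(Level.HIGH, Level.MEDIUM))  # HITL
```

Encoding the matrix also gives you one place to apply the policy-precedence rule: if governance mandates HITL for a risk tier, override the lookup there, not in each calling system.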


    Part 3: Intervention Point Design

    Complete this section only for HITL and HOTL systems.

    Where does the human review occur?

    Mark all that apply:

    [ ] Before the model runs (human reviews input data and decides whether to proceed)

    [ ] After the model produces output but before downstream action (most common for HITL)

    [ ] Before a specific downstream action within a multi-step workflow (specify which action below)

    [ ] All of the above

    Specific action gating human approval (if applicable):


    Reviewer specification

    Who is the designated reviewer? (role/title, not individual name — specify individual assignment in your team's RACI)


    Is there a secondary reviewer for escalations? (role/title)


    What is the maximum acceptable review time (SLA)?


    What happens if a reviewer doesn't respond within the SLA?

    [ ] Escalate to secondary reviewer

    [ ] Escalate to [role]: _______________

    [ ] Auto-reject (do not take downstream action)

    [ ] Auto-approve (only for Tier 3 systems with documented justification)

    [ ] Pause entire workflow and alert system owner

    Can the reviewer request additional information before deciding?

    [ ] No — reviewer must decide based on what is displayed

    [ ] Yes — reviewer can request: _______________________________________________

    SLA for information request response: _______________


    Part 4: Review Interface Requirements

    The review interface is where automation bias risk is highest. Reviewers shown only the AI output with a single approve/reject button will rubber-stamp most outputs. The interface must show enough context for the reviewer to exercise genuine judgment.

    Required elements — check all that must be present in your review interface:

    [ ] AI output (the specific recommendation, classification, or generated text)

    [ ] Confidence score or probability if the model produces one

    [ ] Alternative outputs considered by the model (top-N alternatives, if applicable)

    [ ] The input data that produced this output

    [ ] The model version that produced this output

    [ ] Similar past cases and their outcomes (reference cases)

    [ ] Relevant regulatory or policy context for this decision type

    [ ] Time remaining in the review SLA (visible countdown)

    [ ] Flags or anomaly indicators if the input is outside the model's training distribution

    [ ] Previous decisions by this reviewer on similar cases (to support calibration)

    Additional context required for this specific use case:


    What must the reviewer record when rejecting?

    [ ] Rejection reason (required, free text)

    [ ] Rejection category (required, select from list): _______________

    [ ] Recommended correction: _______________


    Part 5: Escalation Thresholds

    Define the thresholds that determine routing for each decision. Be specific — vague thresholds produce inconsistent behavior.

| Routing | Condition |
|---|---|
| Auto-approve | Confidence > _____% AND output type = _____ AND no flags raised |
| Standard review | Confidence between _____% and _____% OR any of these flags: _____ |
| Senior review | Confidence < _____% OR output includes any of: _____ |
| Automatic reject | Output contains any of: _____ (e.g., prohibited content, out-of-scope request) |

    Note on setting thresholds: Don't set auto-approve thresholds before you have empirical data on the model's actual confidence calibration. Run the model in shadow mode (outputs logged but no downstream actions taken) for at least 30 days and review the distribution of confidence scores alongside human reviewer decisions before finalizing thresholds.
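The routing table reduces to a small, ordered decision function — order matters, because automatic rejection must be checked before any approval path. A sketch with placeholder thresholds (0.95 and 0.70) and invented flag names; replace both with values derived from your shadow-mode data:

```python
# Flags that force automatic rejection regardless of confidence.
PROHIBITED = {"prohibited_content", "out_of_scope"}

def route(confidence: float, flags: set[str]) -> str:
    """Map a model output to a routing decision per the threshold table."""
    if flags & PROHIBITED:
        return "auto_reject"           # checked first, always wins
    if confidence >= 0.95 and not flags:
        return "auto_approve"
    if confidence < 0.70 or "high_risk_clause" in flags:
        return "senior_review"
    return "standard_review"           # the middle band
```

Keeping the function this small makes it easy to unit-test the boundaries — exactly the values regulators and auditors will ask about.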


    Part 6: Automation Bias Prevention

    Automation bias — the tendency of human reviewers to defer to AI outputs without genuine independent evaluation — undermines HITL controls. These controls reduce it.

    Retrospective spot-check

    [ ] _____% of auto-approved decisions are reviewed retrospectively each week by a senior reviewer

    Reviewer calibration

    [ ] Reviewers are tested on a set of known cases (where the correct answer is established) at least monthly. Pass rate required: _____%. Reviewers failing calibration tests are retrained before returning to review duties.

    Override rate monitoring

    [ ] Override rates are calculated weekly. Target override rate range: _____% to _____%

    [ ] Override rate deviating more than _____% from the baseline range triggers: _______________________________________________

    Reviewer rotation

    [ ] No single reviewer handles more than _____% of cases in any given week

    [ ] Minimum number of active reviewers: _____

    Feedback loop

    [ ] Reviewer decisions are used to periodically evaluate model performance. When reviewers consistently override the model on a specific input type, this triggers model evaluation within _____ days.


    Part 7: Audit Trail Requirements

    For every human decision in this workflow, the following must be logged automatically. This is not optional for Tier 1 or Tier 2 systems — audit trails of this kind are expected under SR 11-7's model risk management guidance, the EU AI Act's record-keeping obligations (Article 12), and HIPAA (where applicable).

    Required log fields — confirm each will be captured:

    [ ] Timestamp in UTC (not local time)

    [ ] Reviewer identity (user ID, not name — name can change; user ID persists)

    [ ] AI output that was reviewed (exact text, score, or classification)

    [ ] Model version that produced the output (not application version — model artifact version)

    [ ] Reviewer decision: Approve / Reject / Escalate

    [ ] Reviewer reasoning (free text — required for Reject and Escalate)

    [ ] Time taken to review (seconds from display to decision)

    [ ] Downstream action taken as a result of the review decision

    [ ] Case or entity identifier (so you can reconstruct which individual was affected)

    Log retention period: _______________________________________________

    Log storage location: _______________________________________________

    Tamper-evidence mechanism (hash chain, write-once storage, etc.): _______________________________________________
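The required fields and a tamper-evidence mechanism can be combined in one record format. Here is a sketch of a simple hash chain: each record's SHA-256 hash commits to the previous record's hash, so editing any past record invalidates every record after it. Field names mirror the checklist above but are otherwise assumptions:

```python
import hashlib
import json
from datetime import datetime, timezone

def make_record(prev_hash: str, reviewer_id: str, ai_output: str,
                model_version: str, decision: str, reasoning: str,
                review_seconds: float, downstream_action: str,
                case_id: str) -> dict:
    """Build one tamper-evident audit record chained to its predecessor."""
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "reviewer_id": reviewer_id,          # user ID, not display name
        "ai_output": ai_output,
        "model_version": model_version,      # model artifact version
        "decision": decision,                # Approve / Reject / Escalate
        "reasoning": reasoning,              # required for Reject / Escalate
        "review_seconds": review_seconds,
        "downstream_action": downstream_action,
        "case_id": case_id,
        "prev_hash": prev_hash,              # links the chain
    }
    # Canonical serialization (sorted keys) so the hash is reproducible.
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record
```

A hash chain only proves tampering after the fact; pair it with write-once storage (or periodic anchoring of the latest hash somewhere the writers cannot modify) if you need to prevent silent truncation of the log's tail.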


    Part 8: Effectiveness Metrics

    Define target values before you launch. Measure against them monthly. If you don't establish a baseline, you can't detect drift.

| Metric | Target | How It's Measured |
|---|---|---|
| Review completion rate | _____% of cases requiring review actually receive it | Cases reviewed / cases requiring review |
| Override rate | _____% (establish baseline after 30 days) | Cases rejected or escalated / cases reviewed |
| Time-to-decision | _____ minutes average | Mean of review duration log field |
| False negative rate | _____% | AI approved; senior retrospective reviewer disagrees |
| Audit log completeness | 100% | Log records / inference records for reviewed cases |
| Reviewer calibration pass rate | _____% | Calibration test results, monthly |
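The measurement column of this table translates directly into a monthly roll-up. A sketch — the function and argument names are illustrative, and the inputs are assumed to come from the Part 7 audit log:

```python
def effectiveness_metrics(cases_requiring_review: int, cases_reviewed: int,
                          overrides: int, review_durations_s: list[float],
                          log_records: int, inference_records: int) -> dict:
    """Compute the Part 8 metrics from audit-log counts and durations."""
    return {
        "review_completion_rate": cases_reviewed / cases_requiring_review,
        "override_rate": overrides / cases_reviewed,
        "time_to_decision_mean_s": sum(review_durations_s) / len(review_durations_s),
        "audit_log_completeness": log_records / inference_records,  # must be 1.0
    }
```

Computing these from the audit log itself (rather than from the review interface's own counters) gives you a built-in consistency check: if log completeness ever drops below 100%, every other metric is suspect.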

    Reporting cadence: These metrics are reported to _____________________ on a [weekly / monthly] basis.

    Escalation triggers — if any metric deviates beyond acceptable range, the following action is taken:



    Completed Example: Contract Clause Review AI

    To illustrate how this worksheet works in practice, here's a filled example for a legal technology use case.

    Use case name: Contract Clause Risk Screener

    Brief description: The AI reviews uploaded contract clauses and flags those with non-standard terms, missing protections, or unusual liability allocation, and produces a risk rating (Low / Medium / High) with a summary explanation.

    Input: Individual contract clause text (extracted from uploaded PDFs)

    Output: Risk rating (Low / Medium / High) and 2-3 sentence explanation of flagged issues

    Downstream action: High-rated clauses are automatically queued for attorney review; Low-rated clauses are auto-accepted; Medium are queued for paralegal review

    Consequence severity: High (legal and financial consequences for missed risky clauses)

    Decision frequency: Medium (200-500 clauses per day)

    Oversight mode: HITL for High-rated clauses; HOTL (10% sample) for Medium-rated clauses

    Reviewer: Senior Associate (for High); Paralegal (for Medium sample review)

    SLA: High-rated: 4 hours; Medium sample review: 24 hours

    SLA breach action: Escalate to Partner on duty

    Auto-approve threshold: Confidence > 92% AND rating = Low AND no IP or indemnification clauses flagged

    Senior review threshold: Any clause where model confidence < 70% OR clause involves limitation of liability, IP ownership, or governing law

    Override rate target: 15-25% for High-rated clauses (if override rate drops below 10%, reviewers may be rubber-stamping; if above 35%, model may be over-flagging)


    Connecting to Your Audit Infrastructure

    Part 7 of this worksheet specifies the exact log fields required for each human review decision. Those fields need to be captured by your review interface and stored in a system that can produce them on demand for regulators.

    If you're building a custom review interface, log completeness should be a launch-blocking requirement — not a post-launch improvement. Missing audit records from before a logging fix are gone; they cannot be reconstructed.

    Ertas Data Suite's built-in operator logging and human review tracking captures the Part 7 fields automatically for data pipeline review steps, with tamper-evident storage and structured export for regulatory submission.

    Book a discovery call with Ertas →
