
    Human-in-the-Loop for Financial AI: SR 11-7, Model Risk, and What the Fed Actually Requires

    The Federal Reserve's SR 11-7 guidance predates LLMs but applies directly to AI systems. Here's what it actually requires for human oversight in financial model deployment.

Ertas Team

    The Federal Reserve's SR 11-7 guidance was published in April 2011, written for a world of quantitative risk models — VaR calculations, credit scoring algorithms, stress-testing frameworks. The guidance does not mention large language models, generative AI, or even machine learning as a distinct category. It doesn't need to. Its requirements apply to any "quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates."

    An LLM used to assess creditworthiness, summarize borrower documentation, flag suspicious transactions, or generate loan approval rationales is a model under SR 11-7. The guidance applies. Examiners are applying it. The enforcement posture is no longer theoretical.

    What SR 11-7 Actually Says

    SR 11-7 establishes a model risk management framework built on three pillars.

    Pillar 1: Model Development and Implementation

    Models must be designed with a clear purpose, appropriate methodology, and documented assumptions. For LLMs in financial applications, this means:

    • The task the model performs must be precisely defined
    • The training data and fine-tuning methodology must be documented
    • The model's known limitations and failure modes must be explicitly acknowledged
    • Performance metrics appropriate to the financial decision context must be chosen before deployment

    "We use GPT-4 for loan summaries" is not a documented model implementation. A documented implementation specifies which model version, how prompts are structured, what validation was performed on financial document types, what the expected error rate is, and what happens when the model is wrong.

    Pillar 2: Model Validation

    Independent validation is the requirement that most AI deployments are currently failing. SR 11-7 requires that models be validated by people who are independent of the development team — who did not build the model, are not incentivized by its performance, and have sufficient technical expertise to evaluate its methodology.

    Validation must cover:

    • Conceptual soundness: Does the model's approach make sense for this financial application? An LLM fine-tuned on retail banking contracts may not be appropriate for commercial real estate underwriting documentation, even if it produces plausible-looking output.
    • Ongoing monitoring: Model performance must be tracked in production. Accuracy, calibration, and output distribution must be measured against the validation baseline.
    • Outcome analysis: Where feasible, model outputs must be compared against observed outcomes. A credit risk model's predictions must eventually be measured against actual defaults.

    The validation requirement is where the HITL connection becomes explicit. Ongoing monitoring without human review of flagged anomalies is not monitoring — it's metric collection. SR 11-7 expects humans who understand the model to look at what it's doing and assess whether it's doing it correctly.
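To make "measured against the validation baseline" concrete, here is a minimal sketch of one common drift check: a population stability index comparing the production output distribution to the baseline, with an escalation to a human when it drifts. The buckets, numbers, and the 0.25 threshold are illustrative assumptions, not prescribed values.

```python
import math

def population_stability_index(baseline: list[float], production: list[float]) -> float:
    """PSI between two score distributions expressed as bucket proportions.

    Both lists cover the same buckets and each sums to 1. A common rule of
    thumb treats PSI above 0.25 as significant drift; the threshold used
    below is illustrative, not prescriptive.
    """
    psi = 0.0
    for expected, actual in zip(baseline, production):
        expected = max(expected, 1e-6)  # avoid log(0) on empty buckets
        actual = max(actual, 1e-6)
        psi += (actual - expected) * math.log(actual / expected)
    return psi

# Bucketed output distributions: validation baseline vs. last 30 days of production.
baseline_dist = [0.10, 0.25, 0.30, 0.25, 0.10]
production_dist = [0.05, 0.15, 0.20, 0.30, 0.30]

psi = population_stability_index(baseline_dist, production_dist)
if psi > 0.25:
    # Metric collection alone is not monitoring: the drift has to reach a human
    # who understands the model and can judge whether it is still sound.
    print(f"PSI = {psi:.3f}: escalate to the model owner for review")
```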

    Pillar 3: Governance, Policies, and Controls

    The governance pillar requires a model inventory, clear ownership, defined approval processes for new model deployment, and documented escalation procedures when models behave unexpectedly.

    For financial institutions deploying AI, this means:

    • Every AI system that meets the model definition must be in the model inventory
    • Each inventoried model must have a designated owner accountable for its performance
    • There must be a process for model approval that includes sign-off from a risk officer independent of the business line
    • There must be defined triggers that escalate to human review — thresholds where automatic model behavior stops and a human decides

    That last point is the HITL requirement embedded in SR 11-7's governance framework. It's not called HITL in the guidance text. It is HITL in practice.
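What a defined escalation trigger might look like in code, as a rough sketch: a check that stops automated handling and routes the decision to a human when any configured condition fires. The thresholds, trigger names, and routing step are hypothetical, not taken from any institution's policy.

```python
from dataclasses import dataclass

@dataclass
class ModelOutput:
    recommendation: str  # e.g. "approve" / "deny"
    confidence: float    # model-reported confidence in [0, 1]
    exposure: float      # dollar exposure of the decision

# Hypothetical escalation policy. In practice these thresholds come from the
# approved governance document, not from defaults hard-coded by developers.
MIN_CONFIDENCE = 0.85
MAX_AUTO_EXPOSURE = 250_000
ALWAYS_ESCALATED = {"deny"}  # adverse actions always get human review

def requires_human_review(output: ModelOutput) -> bool:
    """Return True when automated handling must stop and a human decides."""
    return (
        output.confidence < MIN_CONFIDENCE
        or output.exposure > MAX_AUTO_EXPOSURE
        or output.recommendation in ALWAYS_ESCALATED
    )

result = ModelOutput(recommendation="deny", confidence=0.91, exposure=120_000)
if requires_human_review(result):
    print("Route to the review queue with the full model rationale attached")
else:
    print("Proceed automatically and log the decision to the audit trail")
```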

    What "Effective Challenge" Actually Means

    SR 11-7's most demanding concept is "effective challenge" — a requirement that model assumptions, methodology, and outputs be subjected to critical analysis by qualified people who are not simply accepting the model's conclusions.

    The guidance defines effective challenge as: the critical analysis by objective, informed parties who can identify model limitations and assumptions and constructively participate in improving model performance.

    Three elements matter here:

Objective: Reviewers who are not invested in the model's success. A business line that wants a credit AI approved and deployed is not objective. An internal model risk function is closer; independent external validation is best.

    Informed: Reviewers who have enough technical and domain expertise to actually evaluate the model. A credit officer who doesn't understand how an LLM generates text cannot effectively challenge an LLM-based credit analysis tool.

    Constructive: The goal is improvement, not merely approval. Effective challenge identifies specific weaknesses and requires they be addressed before deployment or continued operation.

    A review process where a risk officer reads a summary prepared by the team that built the model and signs an approval form is not effective challenge. That is the box-checking version. Examiners know the difference.

    Real Examples From Financial AI Deployments

    Credit decisioning AI: Several regional banks have received MRA (Matters Requiring Attention) findings related to LLM-based credit decisioning tools deployed without independent validation. The common finding: the institution could not produce documentation showing that the model had been tested on a representative sample of their specific loan population, that adverse action notices accurately reflected the AI's reasoning, or that there was a mechanism for human override when the AI's confidence was below a defined threshold.

    AML transaction monitoring: Anti-money laundering AI that flags suspicious transactions for SAR filing must meet SR 11-7 requirements for model validation. Examiners at three large institutions in 2024 and 2025 found that LLM-assisted narrative generation for SAR filings was treated as a workflow tool rather than a model — bypassing the model inventory and validation requirements entirely.

    Fraud detection: A fraud scoring AI at a regional payments processor was found during examination to have no defined retraining schedule and no human review of flagged edge cases. The model had been live for 18 months. Its accuracy on card-present fraud had declined from 94% to 78% due to distribution shift, but no monitoring system had caught it because monitoring was limited to summary statistics that masked the decline.
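A small worked example shows how summary statistics can mask that kind of decline: when card-present transactions are a minority of volume, the aggregate metric barely moves even as the segment degrades. The volumes and accuracies below are invented purely for illustration.

```python
# Invented volumes and accuracies, chosen only to show how an aggregate
# metric can hide a segment-level decline.
segments = {
    #                  (share of volume, accuracy at validation, accuracy now)
    "card_not_present": (0.85, 0.95, 0.94),
    "card_present":     (0.15, 0.94, 0.78),
}

baseline_agg = sum(share * base for share, base, _ in segments.values())
current_agg = sum(share * now for share, _, now in segments.values())

# The aggregate drops only about 3 points while card_present drops 16.
print(f"Aggregate accuracy: {baseline_agg:.3f} -> {current_agg:.3f}")
for name, (share, base, now) in segments.items():
    if base - now > 0.05:  # illustrative per-segment alert threshold
        print(f"{name}: {base:.2f} -> {now:.2f}, escalate for human review")
```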

    The Black Box Problem

    SR 11-7's conceptual soundness requirement creates a specific problem for LLMs: the explainability requirement.

    For human reviewers to provide effective challenge of an LLM's output, they need to understand why the model produced that output — what features of the input drove the decision, what alternatives were considered, what the model's confidence reflects. A black box that produces a credit recommendation without an explanation of its reasoning fails the effective challenge standard. The human reviewer cannot challenge what they cannot see.

    This is not a problem that "the AI has high accuracy" solves. SR 11-7 doesn't accept high aggregate accuracy as a substitute for explainability. The requirement is that qualified humans be able to evaluate individual decisions — which means the model must produce reasoning that humans can evaluate.

    LLMs that are prompted to explain their reasoning, or that are fine-tuned to produce structured rationales alongside recommendations, are better positioned for SR 11-7 compliance than models that produce a score or recommendation without visible reasoning. This is one of the areas where fine-tuned, purpose-built models for financial applications have a genuine regulatory advantage over general-purpose models accessed via API.
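One way to make the reasoning reviewable is to require a structured rationale rather than a bare score or free text. Below is a minimal sketch of such an output schema and a fail-closed parser; the field names and the validation rule are illustrative assumptions, not a standard.

```python
import json
from dataclasses import dataclass

@dataclass
class CreditRecommendation:
    """Structured output that a reviewer can actually challenge."""
    recommendation: str     # "approve" / "deny" / "refer"
    key_factors: list[str]  # input features the model says drove the call
    counterfactual: str     # what would have to change to flip the recommendation
    confidence: float       # model-reported confidence in [0, 1]

def parse_model_output(raw: str) -> CreditRecommendation:
    """Parse and validate the model's JSON output, failing closed when incomplete.

    An output with no stated rationale is not reviewable, so it is rejected
    and routed to manual processing rather than silently accepted.
    """
    data = json.loads(raw)
    rec = CreditRecommendation(
        recommendation=data["recommendation"],
        key_factors=list(data["key_factors"]),
        counterfactual=data["counterfactual"],
        confidence=float(data["confidence"]),
    )
    if not rec.key_factors:
        raise ValueError("No rationale provided: route to manual review")
    return rec
```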

    The 2026 Regulatory Direction

    The OCC issued supplemental guidance in late 2025 specifically addressing LLM risk in bank operations. The guidance is explicit where SR 11-7 was inferential:

    • LLMs used in credit, AML, fraud, and customer communications functions are models under SR 11-7
    • The model inventory requirement applies to third-party AI services, not just internally developed models
    • Human review checkpoints are required for high-stakes AI outputs — credit denials, SAR filings, sanctions screening results
    • Model version pinning is required: institutions cannot use API endpoints that update automatically without a defined re-validation process

    That last point is material for every institution using cloud-based AI APIs. A "gpt-4-turbo" endpoint that silently receives model updates is not a pinned model version. SR 11-7 requires you to know what model you're running and to have validated that version. You cannot validate a moving target.
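The operational difference is small but decisive: call an exact, dated version string that was validated, and refuse to run anything else. The snippet below is a generic sketch of that check; the version strings and the fail-closed behavior are placeholders, not any particular vendor's API.

```python
# Hypothetical configuration: the exact, dated version that model risk validated.
VALIDATED_MODEL_VERSION = "vendor-model-2025-06-01"
# A floating alias such as "vendor-model-latest" can change underneath you and
# silently invalidate the validation performed on the prior version.

def check_model_version(reported_version: str) -> None:
    """Fail closed if the serving endpoint reports anything but the validated version."""
    if reported_version != VALIDATED_MODEL_VERSION:
        raise RuntimeError(
            f"Endpoint is serving {reported_version!r}, but only "
            f"{VALIDATED_MODEL_VERSION!r} has been validated. "
            "Halt automated use and trigger the re-validation process."
        )

# The reported version would come from the API response metadata of whatever
# provider is in use; the string below is a placeholder.
check_model_version("vendor-model-2025-06-01")  # passes; any other string raises
```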

    How Ertas Supports Financial AI Compliance

    For financial institutions building AI systems that must meet SR 11-7 requirements, two things matter at the data preparation stage: audit trail and model ownership.

    Ertas Data Suite provides on-premise data preparation with a full audit trail — every annotation, every operator action, timestamped and logged. Financial training data prepared in Ertas can be documented, reviewed, and included in model validation packages because the preparation process is itself auditable.

    Ertas Fine-Tuning gives financial teams the ability to own the model weights directly. When you run your own fine-tuned model locally, you control the version. You validate it once and run it until you choose to update. The failure mode where a vendor silently changes model behavior — one of the most significant SR 11-7 compliance risks for cloud AI deployments — doesn't apply to a model you own and control.

    For the foundational HITL framework, see What Is Human-in-the-Loop AI?. For coverage of model risk management in the context of fine-tuned LLMs specifically, see our articles on model risk and fine-tuned LLMs and why banks approach general-purpose AI with caution.

    Book a discovery call with Ertas →

    SR 11-7 was written before LLMs existed. It describes their governance requirements precisely. The institutions that are reading it as applying to their AI deployments — and building HITL processes accordingly — are ahead. The ones treating AI as outside the model risk framework are accumulating exam findings they haven't received yet.

See early bird pricing for Ertas Fine-Tuning — purpose-built, locally-run models that you version, validate, and control →

