Back to blog
    Model Risk Management for Fine-Tuned LLMs: SR 11-7 Compliance Guide
    financecompliancemodel-risksr-11-7fine-tuninggovernancebanking

    Model Risk Management for Fine-Tuned LLMs: SR 11-7 Compliance Guide

    A practical guide to applying the Federal Reserve's SR 11-7 model risk management framework to fine-tuned LLMs in banking. Covers documentation requirements, validation frameworks, auditor questions, and why on-premise deployment simplifies compliance.

    EErtas Team·

    SR 11-7 from the OCC and Federal Reserve governs model risk management at every US bank. Written in 2011 for credit scoring models and VaR calculations, it now applies to fine-tuned LLMs too. If your bank uses an AI model to make or influence any business decision, SR 11-7 applies. There is no exception for "it's just AI."

    The good news: SR 11-7 is principle-based, not prescriptive. It tells you what to document and validate, not how. This guide maps each SR 11-7 requirement to the specific artifacts and processes you need for fine-tuned LLMs.

    What SR 11-7 Requires

    SR 11-7 defines model risk as "the potential for adverse consequences from decisions based on incorrect or misused model outputs." It mandates three pillars:

    1. Model development and implementation — documented, with sound methodology
    2. Model validation — independent, ongoing, with clear findings
    3. Model governance — board oversight, model inventory, usage controls

    For traditional quantitative models, these requirements are well-understood. For fine-tuned LLMs, they require reinterpretation. Here is the mapping.

    Mapping SR 11-7 to Fine-Tuned LLMs

    Pillar 1: Development Documentation

    SR 11-7 requires documentation of "the model's purpose, design, and methodology, including the mathematical specification and the assumptions."

    For a fine-tuned LLM, this means:

    SR 11-7 RequirementLLM EquivalentWhat to Document
    Model purposeUse case definitionSpecific business task, input/output format, decision impact
    Mathematical specificationArchitecture + training configBase model (e.g., Llama 3.1 8B), quantization level, LoRA rank, alpha, dropout
    DataTraining data provenanceSource systems, date range, volume, preprocessing steps, PII handling
    AssumptionsPerformance boundariesWhat the model can and cannot do, known failure modes
    LimitationsScope constraintsToken limits, language support, domain boundaries, confidence thresholds
    ImplementationDeployment architectureInfrastructure, serving stack, API contracts, integration points

    Critical detail: Document the base model selection rationale. Why Llama 3.1 8B and not Mistral 7B? Why Q4 quantization and not Q8? Auditors will ask. "It was the default" is not an acceptable answer. Benchmark results on your specific task with your specific data are the acceptable answer.

    Training Data Provenance

    For every fine-tuning dataset, document:

    • Source systems: Which internal databases, document repositories, or applications generated the training examples
    • Date range: When the source data was created (not when you extracted it)
    • Volume: Number of training examples, average length, total token count
    • Preprocessing: Every transformation applied — deduplication, PII redaction, format conversion, quality filtering
    • Labeling: Who created the ground truth labels, what instructions they followed, inter-annotator agreement if applicable
    • Representation: Distribution across categories, departments, time periods. Document any known gaps or biases

    LoRA Configuration Documentation

    ParameterValue (Example)Rationale
    Base modelLlama 3.1 8B InstructBest accuracy/latency trade-off on internal benchmark (see Appendix B)
    QuantizationQ4_K_Munder 1% accuracy loss vs FP16 on eval set; 4x memory reduction
    LoRA rank16Validated via sweep (rank 8/16/32); rank 16 optimal on eval metric
    LoRA alpha32Standard 2x rank ratio; validated in sweep
    LoRA dropout0.05Reduced overfitting on validation set vs 0.0
    Target modulesq_proj, v_proj, k_proj, o_projFull attention targeting; MLP targeting showed no improvement
    Training epochs3Early stopping on validation loss; epoch 3 optimal
    Learning rate2e-4Swept 1e-4 to 5e-4; 2e-4 minimized validation loss
    Batch size8Maximum for GPU memory with gradient accumulation
    Training examples2,847Full qualified dataset after filtering

    This level of detail is what auditors expect. Every hyperparameter should have a documented rationale backed by experimental results.

    Pillar 2: Validation Framework

    SR 11-7 requires "effective challenge" through independent validation. For fine-tuned LLMs, validation means a structured 5-step process.

    Step 1: Benchmark Evaluation

    Run the fine-tuned model against a held-out test set that was never used during training or hyperparameter selection.

    MetricTargetActualStatus
    Task accuracy>92%94.1%Pass
    Precision (positive class)>90%91.7%Pass
    Recall (positive class)>88%89.3%Pass
    F1 score>89%90.5%Pass
    Hallucination rateunder 3%1.8%Pass
    Refusal rate (in-scope queries)under 2%0.9%Pass

    Step 2: Backtesting

    Apply the model to historical cases where the correct outcome is known. Compare model outputs to actual decisions.

    • Select 200-500 historical cases spanning at least 12 months
    • Run inference without any human review
    • Compare to actual outcomes or expert-reviewed ground truth
    • Document agreement rate, disagreement patterns, and failure modes

    Step 3: Adversarial Testing

    Deliberately attempt to break the model:

    • Prompt injection: Attempt to override instructions via crafted inputs
    • Boundary testing: Inputs at the edge of the model's documented scope
    • Gibberish/noise: Verify the model refuses or flags nonsensical inputs
    • Cross-domain leakage: Verify the model doesn't answer questions outside its domain
    • PII extraction: Attempt to extract training data from the model

    Document every test case, expected behavior, actual behavior, and pass/fail status.

    Step 4: Bias Audit

    For any model that influences decisions affecting customers or employees:

    • Test across demographic segments (age, gender, geography, account type)
    • Measure accuracy disparities between segments
    • Document any statistically significant differences and mitigation plans
    • Reference the bank's existing fair lending / fair treatment framework

    Step 5: Independent Review

    The validation must be performed by someone who was not involved in model development. At most banks, this is the Model Risk Management (MRM) team or a qualified third party.

    The independent reviewer should receive:

    • All documentation from Pillar 1
    • Access to the model for testing
    • The benchmark, backtest, adversarial, and bias audit results
    • The authority to block deployment if findings warrant it

    Model Card Template

    Here is a filled-in example for a document analysis model at a commercial bank.

    MODEL CARD: Commercial Loan Document Analyzer v2.1
    ===================================================
    
    Purpose:         Extract key terms from commercial loan documents
                     (covenants, rates, maturities, collateral descriptions)
    
    Base Model:      Llama 3.1 8B Instruct (Q4_K_M)
    Adapter:         LoRA rank 16, trained on 2,847 internal documents
    Training Date:   2026-01-15
    Deployed:        2026-02-01
    
    Input:           PDF text (extracted), max 4,096 tokens
    Output:          Structured JSON with extracted fields + confidence scores
    
    Performance:
      - Field extraction accuracy: 94.1% (test set, n=412)
      - Hallucination rate: 1.8% (fabricated terms not in source)
      - Latency: 1.2s median (T4 GPU)
    
    Limitations:
      - Does not handle scanned/image-based PDFs (OCR required upstream)
      - Accuracy drops below 85% for documents >3,500 tokens
      - Not validated for non-English documents
      - Not a replacement for legal review — outputs require human verification
    
    Bias Assessment: No statistically significant accuracy variation across
                     loan types (term, revolving, construction) or regions
    
    Owner:           Commercial Banking Technology
    Validator:       Model Risk Management (MRM-2026-0142)
    Next Review:     2026-08-01 (6-month cycle)
    

    10 Common Auditor Questions (and Answers)

    Auditors from OCC, Federal Reserve, or internal audit will ask pointed questions. Here are the 10 most common and how to answer them.

    1. "How do you know this model is accurate?" Point to the benchmark evaluation with held-out test set results. Show precision, recall, and F1 metrics. Show the backtest results against historical cases.

    2. "Who validated this model?" Name the independent validator (MRM team or external party). Show the validation report with findings and sign-off.

    3. "What happens when the model is wrong?" Describe the human-in-the-loop workflow. Every model output is reviewed by a qualified human before any business decision. Show the error escalation process.

    4. "How do you monitor ongoing performance?" Show the monitoring dashboard: accuracy metrics tracked weekly, drift detection alerts, volume and latency trends. Show the threshold that triggers re-validation.

    5. "Can you reproduce a specific output from 6 months ago?" Yes. Show the audit log with model version, adapter version, input hash, and output hash. Demonstrate that loading the same model version with the same input produces the same output (deterministic inference with temperature=0).

    6. "What training data was used?" Provide the data provenance documentation. Source systems, date ranges, volumes, preprocessing steps, labeling methodology.

    7. "How do you prevent the model from using customer data inappropriately?" Describe PII handling in training data (redaction or synthetic replacement). Show that the model runs on-premise with no external data transmission. Show access controls and API key management.

    8. "What is your change management process?" Walk through the workflow: propose change, retrain, validate on benchmark, independent review, staged rollout, monitoring. Show the approval chain.

    9. "Is this model in your model inventory?" Yes. Show the model inventory entry with: model name, version, owner, use case, risk tier, validation date, next review date. Every fine-tuned model and every adapter version is a separate inventory entry.

    10. "What is your succession plan if the model owner leaves?" Documentation is the succession plan. All development artifacts, validation results, and operational procedures are stored in the model registry. Any qualified engineer can operate the model using the existing documentation.

    Documentation Checklist

    For each fine-tuned model, maintain these artifacts:

    Development Artifacts

    • Use case specification (business problem, success criteria, stakeholders)
    • Base model selection rationale with benchmark comparison
    • Training data provenance documentation
    • Hyperparameter configuration with sweep results
    • Training logs (loss curves, validation metrics per epoch)
    • Final model card

    Validation Artifacts

    • Benchmark evaluation report (test set metrics)
    • Backtest report (historical case comparison)
    • Adversarial test report (attack scenarios and results)
    • Bias audit report (demographic segment analysis)
    • Independent validation report with sign-off
    • Findings and remediation tracking

    Operational Artifacts

    • Deployment architecture diagram
    • API specification and integration documentation
    • Monitoring dashboard configuration
    • Alert thresholds and escalation procedures
    • Change management workflow
    • Incident response playbook
    • Audit log retention policy

    Governance Artifacts

    • Model inventory entry
    • Risk tier classification
    • Board/committee approval (for Tier 1 models)
    • Review schedule (typically 6-month or annual)
    • Usage policy (who can use the model, for what, with what safeguards)

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    Why On-Premise Simplifies SR 11-7 Compliance

    Every SR 11-7 requirement becomes easier when the model runs on your own infrastructure.

    Compliance DimensionOn-PremiseCloud API
    Data provenanceFull control — training data never leaves your systemsMust document data flows to third-party provider
    ReproducibilityPin model version, adapter version, and inference parametersProvider may update models without notice
    Audit loggingComplete logs in your SIEMDependent on provider's logging and retention
    Access controlIntegrated with existing IAMSeparate API key management, less visibility
    Change managementYou control every updateProvider controls model updates on their timeline
    Model inventoryYou own the model artifactYou have an API key to a model you don't control
    Independent validationValidate anytime against your test setsCan't run custom validation suites against hosted models
    Incident responseFull forensic capabilityLimited to provider's incident reports

    The fundamental issue: with a cloud API, you are documenting someone else's model. With on-premise fine-tuned models, you are documenting your own. The latter is what SR 11-7 was designed for.

    Banks that deploy fine-tuned models on-premise consistently report faster validation cycles, cleaner audit findings, and lower compliance overhead than those relying on cloud APIs with vendor risk assessments layered on top.

    Further Reading

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    Keep reading