
Model Risk Management for Fine-Tuned LLMs: SR 11-7 Compliance Guide
A practical guide to applying the Federal Reserve's SR 11-7 model risk management framework to fine-tuned LLMs in banking. Covers documentation requirements, validation frameworks, auditor questions, and why on-premise deployment simplifies compliance.
SR 11-7 from the OCC and Federal Reserve governs model risk management at every US bank. Written in 2011 for credit scoring models and VaR calculations, it now applies to fine-tuned LLMs too. If your bank uses an AI model to make or influence any business decision, SR 11-7 applies. There is no exception for "it's just AI."
The good news: SR 11-7 is principle-based, not prescriptive. It tells you what to document and validate, not how. This guide maps each SR 11-7 requirement to the specific artifacts and processes you need for fine-tuned LLMs.
What SR 11-7 Requires
SR 11-7 defines model risk as "the potential for adverse consequences from decisions based on incorrect or misused model outputs." It mandates three pillars:
- Model development and implementation — documented, with sound methodology
- Model validation — independent, ongoing, with clear findings
- Model governance — board oversight, model inventory, usage controls
For traditional quantitative models, these requirements are well-understood. For fine-tuned LLMs, they require reinterpretation. Here is the mapping.
Mapping SR 11-7 to Fine-Tuned LLMs
Pillar 1: Development Documentation
SR 11-7 requires documentation of "the model's purpose, design, and methodology, including the mathematical specification and the assumptions."
For a fine-tuned LLM, this means:
| SR 11-7 Requirement | LLM Equivalent | What to Document |
|---|---|---|
| Model purpose | Use case definition | Specific business task, input/output format, decision impact |
| Mathematical specification | Architecture + training config | Base model (e.g., Llama 3.1 8B), quantization level, LoRA rank, alpha, dropout |
| Data | Training data provenance | Source systems, date range, volume, preprocessing steps, PII handling |
| Assumptions | Performance boundaries | What the model can and cannot do, known failure modes |
| Limitations | Scope constraints | Token limits, language support, domain boundaries, confidence thresholds |
| Implementation | Deployment architecture | Infrastructure, serving stack, API contracts, integration points |
Critical detail: Document the base model selection rationale. Why Llama 3.1 8B and not Mistral 7B? Why Q4 quantization and not Q8? Auditors will ask. "It was the default" is not an acceptable answer. Benchmark results on your specific task with your specific data are the acceptable answer.
Training Data Provenance
For every fine-tuning dataset, document:
- Source systems: Which internal databases, document repositories, or applications generated the training examples
- Date range: When the source data was created (not when you extracted it)
- Volume: Number of training examples, average length, total token count
- Preprocessing: Every transformation applied — deduplication, PII redaction, format conversion, quality filtering
- Labeling: Who created the ground truth labels, what instructions they followed, inter-annotator agreement if applicable
- Representation: Distribution across categories, departments, time periods. Document any known gaps or biases
LoRA Configuration Documentation
| Parameter | Value (Example) | Rationale |
|---|---|---|
| Base model | Llama 3.1 8B Instruct | Best accuracy/latency trade-off on internal benchmark (see Appendix B) |
| Quantization | Q4_K_M | under 1% accuracy loss vs FP16 on eval set; 4x memory reduction |
| LoRA rank | 16 | Validated via sweep (rank 8/16/32); rank 16 optimal on eval metric |
| LoRA alpha | 32 | Standard 2x rank ratio; validated in sweep |
| LoRA dropout | 0.05 | Reduced overfitting on validation set vs 0.0 |
| Target modules | q_proj, v_proj, k_proj, o_proj | Full attention targeting; MLP targeting showed no improvement |
| Training epochs | 3 | Early stopping on validation loss; epoch 3 optimal |
| Learning rate | 2e-4 | Swept 1e-4 to 5e-4; 2e-4 minimized validation loss |
| Batch size | 8 | Maximum for GPU memory with gradient accumulation |
| Training examples | 2,847 | Full qualified dataset after filtering |
This level of detail is what auditors expect. Every hyperparameter should have a documented rationale backed by experimental results.
Pillar 2: Validation Framework
SR 11-7 requires "effective challenge" through independent validation. For fine-tuned LLMs, validation means a structured 5-step process.
Step 1: Benchmark Evaluation
Run the fine-tuned model against a held-out test set that was never used during training or hyperparameter selection.
| Metric | Target | Actual | Status |
|---|---|---|---|
| Task accuracy | >92% | 94.1% | Pass |
| Precision (positive class) | >90% | 91.7% | Pass |
| Recall (positive class) | >88% | 89.3% | Pass |
| F1 score | >89% | 90.5% | Pass |
| Hallucination rate | under 3% | 1.8% | Pass |
| Refusal rate (in-scope queries) | under 2% | 0.9% | Pass |
Step 2: Backtesting
Apply the model to historical cases where the correct outcome is known. Compare model outputs to actual decisions.
- Select 200-500 historical cases spanning at least 12 months
- Run inference without any human review
- Compare to actual outcomes or expert-reviewed ground truth
- Document agreement rate, disagreement patterns, and failure modes
Step 3: Adversarial Testing
Deliberately attempt to break the model:
- Prompt injection: Attempt to override instructions via crafted inputs
- Boundary testing: Inputs at the edge of the model's documented scope
- Gibberish/noise: Verify the model refuses or flags nonsensical inputs
- Cross-domain leakage: Verify the model doesn't answer questions outside its domain
- PII extraction: Attempt to extract training data from the model
Document every test case, expected behavior, actual behavior, and pass/fail status.
Step 4: Bias Audit
For any model that influences decisions affecting customers or employees:
- Test across demographic segments (age, gender, geography, account type)
- Measure accuracy disparities between segments
- Document any statistically significant differences and mitigation plans
- Reference the bank's existing fair lending / fair treatment framework
Step 5: Independent Review
The validation must be performed by someone who was not involved in model development. At most banks, this is the Model Risk Management (MRM) team or a qualified third party.
The independent reviewer should receive:
- All documentation from Pillar 1
- Access to the model for testing
- The benchmark, backtest, adversarial, and bias audit results
- The authority to block deployment if findings warrant it
Model Card Template
Here is a filled-in example for a document analysis model at a commercial bank.
MODEL CARD: Commercial Loan Document Analyzer v2.1
===================================================
Purpose: Extract key terms from commercial loan documents
(covenants, rates, maturities, collateral descriptions)
Base Model: Llama 3.1 8B Instruct (Q4_K_M)
Adapter: LoRA rank 16, trained on 2,847 internal documents
Training Date: 2026-01-15
Deployed: 2026-02-01
Input: PDF text (extracted), max 4,096 tokens
Output: Structured JSON with extracted fields + confidence scores
Performance:
- Field extraction accuracy: 94.1% (test set, n=412)
- Hallucination rate: 1.8% (fabricated terms not in source)
- Latency: 1.2s median (T4 GPU)
Limitations:
- Does not handle scanned/image-based PDFs (OCR required upstream)
- Accuracy drops below 85% for documents >3,500 tokens
- Not validated for non-English documents
- Not a replacement for legal review — outputs require human verification
Bias Assessment: No statistically significant accuracy variation across
loan types (term, revolving, construction) or regions
Owner: Commercial Banking Technology
Validator: Model Risk Management (MRM-2026-0142)
Next Review: 2026-08-01 (6-month cycle)
10 Common Auditor Questions (and Answers)
Auditors from OCC, Federal Reserve, or internal audit will ask pointed questions. Here are the 10 most common and how to answer them.
1. "How do you know this model is accurate?" Point to the benchmark evaluation with held-out test set results. Show precision, recall, and F1 metrics. Show the backtest results against historical cases.
2. "Who validated this model?" Name the independent validator (MRM team or external party). Show the validation report with findings and sign-off.
3. "What happens when the model is wrong?" Describe the human-in-the-loop workflow. Every model output is reviewed by a qualified human before any business decision. Show the error escalation process.
4. "How do you monitor ongoing performance?" Show the monitoring dashboard: accuracy metrics tracked weekly, drift detection alerts, volume and latency trends. Show the threshold that triggers re-validation.
5. "Can you reproduce a specific output from 6 months ago?" Yes. Show the audit log with model version, adapter version, input hash, and output hash. Demonstrate that loading the same model version with the same input produces the same output (deterministic inference with temperature=0).
6. "What training data was used?" Provide the data provenance documentation. Source systems, date ranges, volumes, preprocessing steps, labeling methodology.
7. "How do you prevent the model from using customer data inappropriately?" Describe PII handling in training data (redaction or synthetic replacement). Show that the model runs on-premise with no external data transmission. Show access controls and API key management.
8. "What is your change management process?" Walk through the workflow: propose change, retrain, validate on benchmark, independent review, staged rollout, monitoring. Show the approval chain.
9. "Is this model in your model inventory?" Yes. Show the model inventory entry with: model name, version, owner, use case, risk tier, validation date, next review date. Every fine-tuned model and every adapter version is a separate inventory entry.
10. "What is your succession plan if the model owner leaves?" Documentation is the succession plan. All development artifacts, validation results, and operational procedures are stored in the model registry. Any qualified engineer can operate the model using the existing documentation.
Documentation Checklist
For each fine-tuned model, maintain these artifacts:
Development Artifacts
- Use case specification (business problem, success criteria, stakeholders)
- Base model selection rationale with benchmark comparison
- Training data provenance documentation
- Hyperparameter configuration with sweep results
- Training logs (loss curves, validation metrics per epoch)
- Final model card
Validation Artifacts
- Benchmark evaluation report (test set metrics)
- Backtest report (historical case comparison)
- Adversarial test report (attack scenarios and results)
- Bias audit report (demographic segment analysis)
- Independent validation report with sign-off
- Findings and remediation tracking
Operational Artifacts
- Deployment architecture diagram
- API specification and integration documentation
- Monitoring dashboard configuration
- Alert thresholds and escalation procedures
- Change management workflow
- Incident response playbook
- Audit log retention policy
Governance Artifacts
- Model inventory entry
- Risk tier classification
- Board/committee approval (for Tier 1 models)
- Review schedule (typically 6-month or annual)
- Usage policy (who can use the model, for what, with what safeguards)
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Why On-Premise Simplifies SR 11-7 Compliance
Every SR 11-7 requirement becomes easier when the model runs on your own infrastructure.
| Compliance Dimension | On-Premise | Cloud API |
|---|---|---|
| Data provenance | Full control — training data never leaves your systems | Must document data flows to third-party provider |
| Reproducibility | Pin model version, adapter version, and inference parameters | Provider may update models without notice |
| Audit logging | Complete logs in your SIEM | Dependent on provider's logging and retention |
| Access control | Integrated with existing IAM | Separate API key management, less visibility |
| Change management | You control every update | Provider controls model updates on their timeline |
| Model inventory | You own the model artifact | You have an API key to a model you don't control |
| Independent validation | Validate anytime against your test sets | Can't run custom validation suites against hosted models |
| Incident response | Full forensic capability | Limited to provider's incident reports |
The fundamental issue: with a cloud API, you are documenting someone else's model. With on-premise fine-tuned models, you are documenting your own. The latter is what SR 11-7 was designed for.
Banks that deploy fine-tuned models on-premise consistently report faster validation cycles, cleaner audit findings, and lower compliance overhead than those relying on cloud APIs with vendor risk assessments layered on top.
Further Reading
- Fine-Tuning AI for Financial Services — Comprehensive guide to compliance frameworks and production use cases in banking
- How to Evaluate a Fine-Tuned Model — Build evaluation pipelines with benchmark suites and automated testing
- Fine-Tuning Quality Checklist — Pre-deployment quality gates for production fine-tuned models
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Keep reading

On-Premise AI for Banking: Satisfying Regulator Audit Requirements
Architecture and operational guide for deploying on-premise AI in banking environments that satisfy OCC, FINRA, and Federal Reserve audit requirements. Covers infrastructure, audit trails, access controls, change management, disaster recovery, and a 10-dimension compliance comparison.

Fine-Tuning for AML Transaction Monitoring: Reducing False Positives
Banks spend $30B+ annually on AML compliance while rule-based systems generate 95%+ false positive rates. Learn how fine-tuning local models can cut false positives by 40-60% while maintaining 99%+ true positive capture — without sending transaction data to cloud APIs.

Fine-Tuning AI for Financial Services: Compliance, Use Cases, and Deployment
A comprehensive guide to deploying fine-tuned AI models in financial services. Covers SOC 2, PCI-DSS, and FINRA compliance, five production use cases, and why on-premise fine-tuned models are replacing cloud APIs in banking and finance.