Model Risk Management for Fine-Tuned LLMs: SR 11-7 Compliance Guide

SR 11-7 from the OCC and Federal Reserve governs model risk management at every US bank. Written in 2011 for credit scoring models and VaR calculations, it now applies to fine-tuned LLMs too. If your bank uses an AI model to make or influence any business decision, SR 11-7 applies. There is no exception for "it's just AI."

The good news: SR 11-7 is principle-based, not prescriptive. It tells you what to document and validate, not how. This guide maps each SR 11-7 requirement to the specific artifacts and processes you need for fine-tuned LLMs.

What SR 11-7 Requires

SR 11-7 defines model risk as "the potential for adverse consequences from decisions based on incorrect or misused model outputs." It mandates three pillars:

Model development and implementation — documented, with sound methodology
Model validation — independent, ongoing, with clear findings
Model governance — board oversight, model inventory, usage controls

For traditional quantitative models, these requirements are well-understood. For fine-tuned LLMs, they require reinterpretation. Here is the mapping.

Mapping SR 11-7 to Fine-Tuned LLMs

Pillar 1: Development Documentation

SR 11-7 requires documentation of "the model's purpose, design, and methodology, including the mathematical specification and the assumptions."

For a fine-tuned LLM, this means:

SR 11-7 Requirement	LLM Equivalent	What to Document
Model purpose	Use case definition	Specific business task, input/output format, decision impact
Mathematical specification	Architecture + training config	Base model (e.g., Llama 3.1 8B), quantization level, LoRA rank, alpha, dropout
Data	Training data provenance	Source systems, date range, volume, preprocessing steps, PII handling
Assumptions	Performance boundaries	What the model can and cannot do, known failure modes
Limitations	Scope constraints	Token limits, language support, domain boundaries, confidence thresholds
Implementation	Deployment architecture	Infrastructure, serving stack, API contracts, integration points

Critical detail: Document the base model selection rationale. Why Llama 3.1 8B and not Mistral 7B? Why Q4 quantization and not Q8? Auditors will ask. "It was the default" is not an acceptable answer. Benchmark results on your specific task with your specific data are the acceptable answer.

Training Data Provenance

For every fine-tuning dataset, document:

Source systems: Which internal databases, document repositories, or applications generated the training examples
Date range: When the source data was created (not when you extracted it)
Volume: Number of training examples, average length, total token count
Preprocessing: Every transformation applied — deduplication, PII redaction, format conversion, quality filtering
Labeling: Who created the ground truth labels, what instructions they followed, inter-annotator agreement if applicable
Representation: Distribution across categories, departments, time periods. Document any known gaps or biases

LoRA Configuration Documentation

Parameter	Value (Example)	Rationale
Base model	Llama 3.1 8B Instruct	Best accuracy/latency trade-off on internal benchmark (see Appendix B)
Quantization	Q4_K_M	under 1% accuracy loss vs FP16 on eval set; 4x memory reduction
LoRA rank	16	Validated via sweep (rank 8/16/32); rank 16 optimal on eval metric
LoRA alpha	32	Standard 2x rank ratio; validated in sweep
LoRA dropout	0.05	Reduced overfitting on validation set vs 0.0
Target modules	q_proj, v_proj, k_proj, o_proj	Full attention targeting; MLP targeting showed no improvement
Training epochs	3	Early stopping on validation loss; epoch 3 optimal
Learning rate	2e-4	Swept 1e-4 to 5e-4; 2e-4 minimized validation loss
Batch size	8	Maximum for GPU memory with gradient accumulation
Training examples	2,847	Full qualified dataset after filtering

This level of detail is what auditors expect. Every hyperparameter should have a documented rationale backed by experimental results.

Pillar 2: Validation Framework

SR 11-7 requires "effective challenge" through independent validation. For fine-tuned LLMs, validation means a structured 5-step process.

Step 1: Benchmark Evaluation

Run the fine-tuned model against a held-out test set that was never used during training or hyperparameter selection.

Metric	Target	Actual	Status
Task accuracy	>92%	94.1%	Pass
Precision (positive class)	>90%	91.7%	Pass
Recall (positive class)	>88%	89.3%	Pass
F1 score	>89%	90.5%	Pass
Hallucination rate	under 3%	1.8%	Pass
Refusal rate (in-scope queries)	under 2%	0.9%	Pass

Step 2: Backtesting

Apply the model to historical cases where the correct outcome is known. Compare model outputs to actual decisions.

Select 200-500 historical cases spanning at least 12 months
Run inference without any human review
Compare to actual outcomes or expert-reviewed ground truth
Document agreement rate, disagreement patterns, and failure modes

Step 3: Adversarial Testing

Deliberately attempt to break the model:

Prompt injection: Attempt to override instructions via crafted inputs
Boundary testing: Inputs at the edge of the model's documented scope
Gibberish/noise: Verify the model refuses or flags nonsensical inputs
Cross-domain leakage: Verify the model doesn't answer questions outside its domain
PII extraction: Attempt to extract training data from the model

Document every test case, expected behavior, actual behavior, and pass/fail status.

Step 4: Bias Audit

For any model that influences decisions affecting customers or employees:

Test across demographic segments (age, gender, geography, account type)
Measure accuracy disparities between segments
Document any statistically significant differences and mitigation plans
Reference the bank's existing fair lending / fair treatment framework

Step 5: Independent Review

The validation must be performed by someone who was not involved in model development. At most banks, this is the Model Risk Management (MRM) team or a qualified third party.

The independent reviewer should receive:

All documentation from Pillar 1
Access to the model for testing
The benchmark, backtest, adversarial, and bias audit results
The authority to block deployment if findings warrant it

Model Card Template

Here is a filled-in example for a document analysis model at a commercial bank.

MODEL CARD: Commercial Loan Document Analyzer v2.1
===================================================

Purpose:         Extract key terms from commercial loan documents
                 (covenants, rates, maturities, collateral descriptions)

Base Model:      Llama 3.1 8B Instruct (Q4_K_M)
Adapter:         LoRA rank 16, trained on 2,847 internal documents
Training Date:   2026-01-15
Deployed:        2026-02-01

Input:           PDF text (extracted), max 4,096 tokens
Output:          Structured JSON with extracted fields + confidence scores

Performance:
  - Field extraction accuracy: 94.1% (test set, n=412)
  - Hallucination rate: 1.8% (fabricated terms not in source)
  - Latency: 1.2s median (T4 GPU)

Limitations:
  - Does not handle scanned/image-based PDFs (OCR required upstream)
  - Accuracy drops below 85% for documents >3,500 tokens
  - Not validated for non-English documents
  - Not a replacement for legal review — outputs require human verification

Bias Assessment: No statistically significant accuracy variation across
                 loan types (term, revolving, construction) or regions

Owner:           Commercial Banking Technology
Validator:       Model Risk Management (MRM-2026-0142)
Next Review:     2026-08-01 (6-month cycle)

10 Common Auditor Questions (and Answers)

Auditors from OCC, Federal Reserve, or internal audit will ask pointed questions. Here are the 10 most common and how to answer them.

1. "How do you know this model is accurate?" Point to the benchmark evaluation with held-out test set results. Show precision, recall, and F1 metrics. Show the backtest results against historical cases.

2. "Who validated this model?" Name the independent validator (MRM team or external party). Show the validation report with findings and sign-off.

3. "What happens when the model is wrong?" Describe the human-in-the-loop workflow. Every model output is reviewed by a qualified human before any business decision. Show the error escalation process.

4. "How do you monitor ongoing performance?" Show the monitoring dashboard: accuracy metrics tracked weekly, drift detection alerts, volume and latency trends. Show the threshold that triggers re-validation.

5. "Can you reproduce a specific output from 6 months ago?" Yes. Show the audit log with model version, adapter version, input hash, and output hash. Demonstrate that loading the same model version with the same input produces the same output (deterministic inference with temperature=0).

6. "What training data was used?" Provide the data provenance documentation. Source systems, date ranges, volumes, preprocessing steps, labeling methodology.

7. "How do you prevent the model from using customer data inappropriately?" Describe PII handling in training data (redaction or synthetic replacement). Show that the model runs on-premise with no external data transmission. Show access controls and API key management.

8. "What is your change management process?" Walk through the workflow: propose change, retrain, validate on benchmark, independent review, staged rollout, monitoring. Show the approval chain.

9. "Is this model in your model inventory?" Yes. Show the model inventory entry with: model name, version, owner, use case, risk tier, validation date, next review date. Every fine-tuned model and every adapter version is a separate inventory entry.

10. "What is your succession plan if the model owner leaves?" Documentation is the succession plan. All development artifacts, validation results, and operational procedures are stored in the model registry. Any qualified engineer can operate the model using the existing documentation.

Documentation Checklist

For each fine-tuned model, maintain these artifacts:

Development Artifacts

Use case specification (business problem, success criteria, stakeholders)
Base model selection rationale with benchmark comparison
Training data provenance documentation
Hyperparameter configuration with sweep results
Training logs (loss curves, validation metrics per epoch)
Final model card

Validation Artifacts

Benchmark evaluation report (test set metrics)
Backtest report (historical case comparison)
Adversarial test report (attack scenarios and results)
Bias audit report (demographic segment analysis)
Independent validation report with sign-off
Findings and remediation tracking

Operational Artifacts

Deployment architecture diagram
API specification and integration documentation
Monitoring dashboard configuration
Alert thresholds and escalation procedures
Change management workflow
Incident response playbook
Audit log retention policy

Governance Artifacts

Model inventory entry
Risk tier classification
Board/committee approval (for Tier 1 models)
Review schedule (typically 6-month or annual)
Usage policy (who can use the model, for what, with what safeguards)

Ship AI that runs on your users' devices.

Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

View early bird pricing or join the waitlist →

Why On-Premise Simplifies SR 11-7 Compliance

Every SR 11-7 requirement becomes easier when the model runs on your own infrastructure.

Compliance Dimension	On-Premise	Cloud API
Data provenance	Full control — training data never leaves your systems	Must document data flows to third-party provider
Reproducibility	Pin model version, adapter version, and inference parameters	Provider may update models without notice
Audit logging	Complete logs in your SIEM	Dependent on provider's logging and retention
Access control	Integrated with existing IAM	Separate API key management, less visibility
Change management	You control every update	Provider controls model updates on their timeline
Model inventory	You own the model artifact	You have an API key to a model you don't control
Independent validation	Validate anytime against your test sets	Can't run custom validation suites against hosted models
Incident response	Full forensic capability	Limited to provider's incident reports

The fundamental issue: with a cloud API, you are documenting someone else's model. With on-premise fine-tuned models, you are documenting your own. The latter is what SR 11-7 was designed for.

Banks that deploy fine-tuned models on-premise consistently report faster validation cycles, cleaner audit findings, and lower compliance overhead than those relying on cloud APIs with vendor risk assessments layered on top.