
On-Premise AI for Banking: Satisfying Regulator Audit Requirements
Architecture and operational guide for deploying on-premise AI in banking environments that satisfy OCC, FINRA, and Federal Reserve audit requirements. Covers infrastructure, audit trails, access controls, change management, disaster recovery, and a 10-dimension compliance comparison.
Your bank's CTO wants AI. Compliance wants audit trails. The CISO wants data on-premise. The head of operations wants it deployed before Q3.
These are not conflicting requirements. They are design constraints — and they all point to the same architecture: fine-tuned models running on-premise, with comprehensive audit logging, access controls, and change management baked into the infrastructure from day one.
This guide covers the complete architecture, from GPU cluster to audit log retention, that lets you deploy AI and pass the examiner review.
Architecture Overview
The deployment follows a four-stage pipeline. Each stage has a gate that produces auditable artifacts.
[Stage 1: Training] Air-gapped GPU environment
|
Validation Gate ---------> Benchmark results, bias audit, model card
|
[Stage 2: Validation] Independent review by MRM team
|
Approval Gate ------------> Sign-off document, risk tier classification
|
[Stage 3: Production Inference] On-premise inference cluster
|
Continuous Monitoring -----> Drift alerts, accuracy tracking, audit logs
|
[Stage 4: Audit Logging] Immutable log store, 7-year retention
Every inference, every model change, and every access event produces a record that flows into Stage 4. Nothing is optional.
Infrastructure Requirements
Banking AI infrastructure splits into three functional tiers.
Tier Breakdown
| Tier | Purpose | Hardware | Network |
|---|---|---|---|
| Training | Fine-tuning and adapter creation | 1-2x NVIDIA A100 40GB or 2-4x T4 16GB | Air-gapped or isolated VLAN |
| Inference | Production model serving | 2x NVIDIA T4 16GB (HA pair) or 2x 32-core CPU servers | Isolated VLAN, internal access only |
| Storage & Logging | Model registry, audit logs, backups | 2TB NVMe + networked storage (NAS/SAN) | Same VLAN as inference, replicated |
Training Tier
Fine-tuning happens infrequently — typically quarterly or when a new use case is onboarded. The training environment should be:
- Air-gapped or strictly isolated. Training data includes sensitive financial documents. No internet access. No shared infrastructure.
- GPU-equipped. Fine-tuning a 7B-8B parameter model with LoRA requires 16-24GB VRAM. A single A100 40GB handles this comfortably. A pair of T4 16GB GPUs works with gradient accumulation (a configuration sketch follows below).
- Temporary. Training runs take 1-4 hours. The environment can be powered down between runs. If using cloud GPU instances for training (with appropriate data controls), the cost is $5-20 per training run.
Cost: $15,000-25,000 for a dedicated training server, or $5-20 per run on reserved cloud GPU instances (if compliance permits controlled, encrypted data transfer).
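For orientation, here is a minimal fine-tuning configuration sketch using the Hugging Face peft and transformers libraries (an assumption about tooling; any LoRA-capable stack works). The hyperparameters and output path are illustrative, not recommendations. Gradient accumulation is what lets a pair of 16GB T4s stand in for a single A100.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter config: only small rank-decomposition matrices are trained,
# which is why a 7B-8B base model fits within 16-24GB of VRAM.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

# Small per-device batches plus gradient accumulation keep memory within a
# T4's 16GB while preserving the effective batch size.
training_args = TrainingArguments(
    output_dir="/airgap/training-runs/loan-analysis-adapter",  # hypothetical path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=2,
    learning_rate=2e-4,
    fp16=True,  # T4s support fp16, not native bf16
    logging_steps=10,
)
```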
Inference Tier
Production inference runs 24/7. This is the tier that handles real requests from banking applications.
| Specification | GPU Path (Recommended) | CPU-Only Path |
|---|---|---|
| Servers | 2 (active-active HA) | 2 (active-active HA) |
| CPU | 16-core Xeon Silver per server | 32-core Xeon Gold per server |
| RAM | 64GB per server | 128GB per server |
| GPU | 1x NVIDIA T4 16GB per server | None |
| Storage | 500GB NVMe SSD per server | 500GB NVMe SSD per server |
| Throughput | 15-40 tokens/sec per server | 3-8 tokens/sec per server |
| Concurrent requests | 10-20 per server | 2-5 per server |
High availability: Run two inference servers in active-active mode behind an internal load balancer. If one fails, the other handles full load at reduced throughput. RTO: zero (automatic failover). RPO: zero (stateless inference).
Cost per server: $8,000-12,000 (with T4 GPU). Two servers for HA: $16,000-24,000.
Storage & Logging Tier
| Storage Component | Size | Growth | Retention |
|---|---|---|---|
| Model files (base + adapters) | 20-60GB | ~10GB/quarter (new adapters) | All versions, indefinitely |
| Audit logs | 15-50GB/year | Linear with inference volume | 7 years minimum |
| Training artifacts | 5-10GB per training run | Quarterly | All runs, indefinitely |
| Evaluation datasets | 2-5GB | Quarterly updates | All versions, indefinitely |
| Backups (encrypted) | Mirror of above | Matches primary | Same as primary |
Total first-year storage: 100-200GB. A 2TB NVMe array handles 7+ years of growth.
Audit Trail Architecture
This is the section examiners care about most. Every inference must produce a complete, immutable audit record.
Per-Inference Log Record
| Field | Type | Example | Why Examiners Want It |
|---|---|---|---|
| timestamp | ISO 8601 | 2026-02-26T09:14:33.127Z | Temporal correlation with business events |
| request_id | UUID v4 | 8f3a2b1c-... | Unique reference for investigation |
| model_version | String | llama-3.1-8b-q4km-v2.1 | Reproducibility |
| adapter_version | String | loan-analysis-v3.2 | Reproducibility |
| input_hash | SHA-256 | a3f2c7... | Integrity proof without storing raw data |
| output_hash | SHA-256 | b7c1d9... | Integrity proof without storing raw data |
| department | String | commercial-lending | Usage attribution |
| user_id | String | svc-loan-origination | Access attribution |
| confidence | Float | 0.94 | Decision quality evidence |
| token_count_in | Integer | 1,247 | Resource tracking |
| token_count_out | Integer | 342 | Resource tracking |
| latency_ms | Integer | 1,180 | SLA compliance |
| status | Enum | success | Operations monitoring |
| error_code | String | null | Incident investigation |
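As a concrete illustration, here is a minimal sketch of how a gateway in front of the inference server might assemble one of these records. The field names follow the table above; the hashing approach (SHA-256 over the raw prompt and completion, so the log proves integrity without retaining customer data) and the whitespace token counts are implementation assumptions, not a prescribed standard.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_audit_record(prompt: str, completion: str, *, model_version: str,
                       adapter_version: str, department: str, user_id: str,
                       confidence: float, latency_ms: int, status: str = "success",
                       error_code: str | None = None) -> dict:
    """Assemble one per-inference audit record matching the field table above."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": str(uuid.uuid4()),
        "model_version": model_version,
        "adapter_version": adapter_version,
        # Hashes prove what went in and out without storing raw customer data.
        "input_hash": _sha256(prompt),
        "output_hash": _sha256(completion),
        "department": department,
        "user_id": user_id,
        "confidence": confidence,
        "token_count_in": len(prompt.split()),       # placeholder; use the tokenizer's count in practice
        "token_count_out": len(completion.split()),  # placeholder; use the tokenizer's count in practice
        "latency_ms": latency_ms,
        "status": status,
        "error_code": error_code,
    }


# Example: emit one record as a JSON line destined for the append-only log.
record = build_audit_record("Summarize the attached credit memo.", "The borrower shows...",
                            model_version="llama-3.1-8b-q4km-v2.1",
                            adapter_version="loan-analysis-v3.2",
                            department="commercial-lending",
                            user_id="svc-loan-origination",
                            confidence=0.94, latency_ms=1180)
print(json.dumps(record))
```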
Log Immutability
Audit logs must be tamper-evident. Options:
- Write-once storage: WORM (Write Once Read Many) volumes. NetApp SnapLock, Dell PowerStore immutable snapshots, or similar.
- Append-only database: PostgreSQL with row-level security preventing UPDATE/DELETE on audit tables. Combined with regular hash-chain verification.
- Log forwarding: Real-time replication to a separate SIEM (Splunk, Elastic, QRadar) with independent retention policies.
The most practical approach for most banks: PostgreSQL append-only tables with nightly hash-chain verification, replicated to the existing SIEM. This integrates with your current audit infrastructure without introducing new systems.
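A minimal sketch of the nightly hash-chain verification, assuming each audit row carries a chain_hash column equal to SHA-256 of the previous link concatenated with a canonical serialization of the record. The column name and serialization scheme are assumptions; the point is that any retroactive edit breaks every subsequent link.

```python
import hashlib
import json


def chain_hash(prev_hash: str, record: dict) -> str:
    """Hash of the previous link plus a canonical JSON serialization of this record."""
    payload = prev_hash + json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_chain(rows: list[dict], genesis: str = "0" * 64) -> bool:
    """Recompute the chain over rows in insertion order; a tampered row breaks the chain."""
    prev = genesis
    for row in rows:
        expected = chain_hash(prev, row["record"])
        if row["chain_hash"] != expected:
            print(f"Chain broken at request_id={row['record'].get('request_id')}")
            return False
        prev = expected
    return True
```

The nightly job recomputes the chain over the day's inserts and compares the final link against the value replicated to the SIEM, so tampering in either system is visible from the other.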
Retention Requirements
OCC examination guidance expects 5-7 years of records for model-related decisions. For AI audit logs:
- Inference logs: 7 years from the date of the inference
- Model versions: Indefinite (you need the ability to load any historical version for investigation)
- Training artifacts: Indefinite (training data provenance, hyperparameters, training logs)
- Validation reports: Indefinite (tied to model versions)
Storage cost for 7-year retention: At 30GB/year of audit logs, 7 years is 210GB. Compressed and archived, this fits on a single NAS shelf. The cost is trivial — under $500 for the storage hardware.
Access Controls
RBAC Model
| Role | Permissions | Typical Users |
|---|---|---|
| Model Developer | Train models, upload adapters to staging | AI/ML team (2-3 people) |
| Model Validator | Read-only access to models + training artifacts, run validation suites | MRM team |
| Deployment Approver | Promote models from staging to production | Technology risk committee |
| API Consumer | Invoke inference API for authorized use cases | Application service accounts |
| Auditor | Read-only access to all logs, model cards, validation reports | Internal audit, examiners |
| Infrastructure Admin | Server management, patching, backup/restore | DevOps team |
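One way to express the table above as configuration, shown here as a minimal sketch with hypothetical permission names. In practice the mapping lives in the bank's IAM system rather than in application code, but the deny-by-default shape is the same.

```python
# Hypothetical permission names; the roles mirror the RBAC table above.
ROLE_PERMISSIONS = {
    "model_developer": {"train_model", "upload_adapter_staging"},
    "model_validator": {"read_models", "read_training_artifacts", "run_validation"},
    "deployment_approver": {"promote_to_production"},
    "api_consumer": {"invoke_inference"},
    "auditor": {"read_logs", "read_model_cards", "read_validation_reports"},
    "infrastructure_admin": {"manage_servers", "apply_patches", "run_backup_restore"},
}


def is_authorized(role: str, permission: str) -> bool:
    """Deny by default: unknown roles and unknown permissions are rejected."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```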
API Key Management
Each consuming application gets a dedicated API key with scoped permissions:
- Key rotation: Every 90 days, automated. Old keys remain valid for a 7-day grace period.
- Rate limiting: Per-key rate limits based on the approved use case. Loan origination: 500 requests/day. Customer service: 2,000 requests/day (see the sketch after this list).
- Usage monitoring: Real-time dashboards showing per-key volume, latency, and error rates. Alerts on anomalous patterns (sudden volume spike, requests outside business hours).
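A minimal sketch of per-key daily rate limit enforcement. The limits mirror the approved use cases above; the key names are hypothetical, and the in-memory counter is illustrative only. A production gateway would back this with a shared store and feed the same counters into the usage dashboards.

```python
from datetime import date

# Per-key daily limits from the approved use cases (illustrative key names).
DAILY_LIMITS = {
    "key-loan-origination": 500,
    "key-customer-service": 2000,
}

_usage: dict[tuple[str, date], int] = {}


def allow_request(api_key: str) -> bool:
    """Return True if the key is known and under its daily limit; count the request."""
    limit = DAILY_LIMITS.get(api_key)
    if limit is None:
        return False  # unknown keys are rejected outright
    bucket = (api_key, date.today())
    _usage[bucket] = _usage.get(bucket, 0) + 1
    return _usage[bucket] <= limit
```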
Per-Department Usage Monitoring
| Department | Use Case | Daily Volume | Monthly Cost (On-Prem) | Monthly Cost (Cloud API) |
|---|---|---|---|---|
| Commercial Lending | Loan document analysis | 200-400 | $0 (fixed infra) | $1,800-3,600 |
| Retail Banking | Customer inquiry classification | 800-1,500 | $0 (fixed infra) | $7,200-13,500 |
| Compliance | SAR narrative drafting | 50-100 | $0 (fixed infra) | $450-900 |
| Risk Management | Credit memo summarization | 100-200 | $0 (fixed infra) | $900-1,800 |
| Total | | 1,150-2,200 | $0 marginal | $10,350-19,800/mo |
On-premise inference has zero marginal cost per request. The infrastructure cost is fixed regardless of volume. This changes the economics of AI adoption entirely — departments can experiment without budget approval for each new use case.
Change Management Workflow
Every model change follows a documented, auditable workflow.
Six-Step Process
1. PROPOSE --> Change request with business justification
Submitted by: Model Developer
Approved by: Use case owner + Technology risk
2. DEVELOP --> Fine-tune or update adapter
Environment: Air-gapped training tier
Artifacts: Training logs, new model card
3. VALIDATE --> Run benchmark suite + backtest + adversarial tests
Performed by: Model Developer
Artifacts: Evaluation report
4. REVIEW --> Independent validation
Performed by: MRM team or external validator
Artifacts: Validation report with findings
5. APPROVE --> Deployment approval
Approved by: Deployment Approver (risk committee)
Artifacts: Signed approval, risk tier classification
6. DEPLOY --> Staged rollout to production
Performed by: Infrastructure Admin
Stages: Canary (5%) → Partial (25%) → Full (100%)
Monitoring: 48-hour observation at each stage
Rollback Procedure
If monitoring detects quality degradation after deployment:
- Automatic rollback trigger: Accuracy drops below threshold for 15 consecutive minutes
- Manual rollback: Any authorized operator can revert to the previous model version in under 2 minutes
- Incident documentation: Every rollback triggers an incident report documenting what changed, what failed, and root cause analysis
The previous model version stays loaded in memory on the standby inference server. Rollback is a load balancer configuration change — not a model reload.
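A minimal sketch of the automatic rollback trigger, assuming the monitoring pipeline delivers one accuracy sample per minute and that "revert" means repointing the load balancer at the standby server that still holds the previous version. The threshold, window, and callback are illustrative assumptions.

```python
from collections import deque

ACCURACY_THRESHOLD = 0.90   # illustrative; set per model risk tier
WINDOW_MINUTES = 15         # consecutive minutes below threshold before rollback


class RollbackMonitor:
    """Track a sliding window of per-minute accuracy samples and trigger rollback."""

    def __init__(self, switch_traffic_to_previous_version):
        self._samples = deque(maxlen=WINDOW_MINUTES)
        self._rollback = switch_traffic_to_previous_version  # hypothetical load balancer hook

    def record_minute(self, accuracy: float) -> None:
        self._samples.append(accuracy)
        window_full = len(self._samples) == WINDOW_MINUTES
        if window_full and all(a < ACCURACY_THRESHOLD for a in self._samples):
            # 15 consecutive minutes below threshold: revert traffic, then open an incident.
            self._rollback()


# Example wiring; in production the callback flips the load balancer config.
monitor = RollbackMonitor(lambda: print("Reverting traffic to previous model version"))
```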
Disaster Recovery
RTO and RPO Targets
| Scenario | RTO | RPO | Recovery Method |
|---|---|---|---|
| Single GPU failure | 0 (automatic) | 0 | Failover to HA partner server |
| Single server failure | 0 (automatic) | 0 | Load balancer removes failed node |
| Both servers fail | 4 hours | 0 | Restore from backup to replacement hardware |
| Model file corruption | 30 minutes | 0 | Restore from model registry backup |
| Audit database failure | 15 minutes | 5 minutes | Failover to replica, restore from WAL |
| Data center failure | 8-24 hours | 1 hour | Restore at DR site from replicated backups |
CPU Failover
If all GPUs fail, the inference stack falls back to CPU-only operation:
- Throughput drops from 30 tokens/sec to 5 tokens/sec per server
- Maximum concurrent requests drops from 20 to 4
- Priority queue activates: Compliance and risk requests first, other departments queued (sketched after this list)
- Automated notification to all consuming applications: "AI system operating in degraded mode, expect higher latency"
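A minimal sketch of the degraded-mode priority queue, assuming each request carries a department tag. The priority ordering below (compliance and risk first) mirrors the policy above; the exact tiers are an assumption.

```python
import heapq
import itertools

# Lower number = served first during CPU-only degraded mode (assumed ordering).
DEPARTMENT_PRIORITY = {
    "compliance": 0,
    "risk-management": 0,
    "commercial-lending": 1,
    "retail-banking": 2,
}

_counter = itertools.count()  # tie-breaker so equal-priority requests stay FIFO
_queue: list[tuple[int, int, dict]] = []


def enqueue(request: dict) -> None:
    priority = DEPARTMENT_PRIORITY.get(request["department"], 3)
    heapq.heappush(_queue, (priority, next(_counter), request))


def next_request() -> dict | None:
    """Pop the highest-priority pending request, or None if the queue is empty."""
    return heapq.heappop(_queue)[2] if _queue else None
```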
Degraded-Mode Operation
When AI is unavailable entirely:
- All consuming applications must have a non-AI fallback path
- Loan document analysis: Manual review (existing pre-AI workflow)
- Customer classification: Rule-based routing (existing pre-AI workflow)
- SAR drafting: Manual drafting by compliance analysts
This is a regulatory requirement, not just good practice. Examiners will ask: "What happens if this system goes down?" The answer must be: "We revert to our pre-AI process, which is documented and tested quarterly."
Regulator Exam Preparation
When OCC or FINRA examiners arrive, present the following package.
Examiner Briefing Package
- Model inventory: Complete list of all AI models in production, with risk tier, owner, validation date, and next review date
- Architecture diagram: The four-stage pipeline from training to audit logging
- Sample audit trail: Pull 5 inference records from the previous month, showing the complete log chain from request to response
- Validation report: Most recent independent validation for each model, including findings and remediation
- Change log: All model changes in the past 12 months, with approval documentation
- Incident log: Any model-related incidents, including rollbacks, with root cause analysis
- Access control documentation: RBAC configuration, API key inventory, usage reports by department
- Disaster recovery test results: Most recent DR test, including failover time and data integrity verification
Common Examiner Concerns (and Pre-Emptive Answers)
"Is customer data being sent to third parties?" No. All inference runs on-premise. No customer data leaves the bank's network. Show the network architecture diagram with VLAN isolation.
"Can you reproduce a historical model output?" Yes. Show the model version registry, demonstrate loading a historical version, and reproduce an output from the audit log using the recorded input hash.
"How do you detect model drift?" Weekly accuracy monitoring against a benchmark set. Automated alerts when accuracy drops below threshold. Quarterly full re-validation. Show the monitoring dashboard.
"What is the board's involvement?" Model risk governance reports to the board risk committee quarterly. The committee approves the model risk appetite statement and reviews Tier 1 model deployments. Show the last quarterly report.
Compliance Comparison: On-Premise vs Cloud
Here is how on-premise and cloud deployments compare across 10 audit dimensions.
| Dimension | On-Premise | Cloud API | Examiner Preference |
|---|---|---|---|
| 1. Data residency | All data stays in bank's network | Data transits to provider's infrastructure | On-premise |
| 2. Audit logging | Complete, bank-controlled | Dependent on provider's logging capabilities | On-premise |
| 3. Reproducibility | Full — pin model version, replay inputs | Limited — provider may update models | On-premise |
| 4. Access control | Integrated with bank IAM | Separate API key management | On-premise |
| 5. Change management | Bank controls all changes | Provider controls model updates | On-premise |
| 6. Vendor risk | No third-party model provider | Requires vendor risk assessment, ongoing monitoring | On-premise |
| 7. Incident response | Full forensic capability | Limited to provider's incident reports | On-premise |
| 8. Model validation | Validate anytime with internal test suites | Cannot run arbitrary tests against hosted models | On-premise |
| 9. DR testing | Bank controls DR strategy and testing | Dependent on provider's SLA and DR capabilities | On-premise |
| 10. Cost predictability | Fixed infrastructure cost | Variable, usage-based, subject to price increases | On-premise |
This is not a close comparison. On-premise wins on every dimension that matters to regulators. The only dimension where cloud APIs have an advantage is time-to-first-inference — you can start using a cloud API in hours, while on-premise takes weeks to set up.
But setup time is a one-time cost. Audit compliance is ongoing, every quarter, for the life of the model. Invest the weeks upfront.
Total Cost of Ownership
3-Year Comparison (Mid-Size Bank, 4 Use Cases)
| Component | On-Premise | Cloud API (Enterprise BAA Tier) |
|---|---|---|
| Infrastructure (year 0) | $40,000 | $0 |
| Annual operations | $25,000/yr | $8,000/yr |
| Annual API costs | $0 | $120,000-240,000/yr |
| Vendor risk assessment | $0 | $15,000/yr |
| Compliance overhead | $5,000/yr | $20,000/yr |
| 3-Year Total | $130,000 | $489,000-849,000 |
The math is not subtle. On-premise costs roughly 73-85% less over three years for a bank running four AI use cases at typical volumes (1,000-2,000 inferences/day total).
More importantly, the compliance posture is categorically better. You are not documenting someone else's infrastructure — you are documenting your own.
Further Reading
- On-Premise AI for Law Firms: Compliance Checklist — Similar compliance framework applied to legal industry requirements
- GPU Cost Comparison for Self-Hosting AI in 2026 — Detailed hardware benchmarks and pricing for inference and training workloads
- Fine-Tuning AI for Financial Services — Comprehensive guide to compliance, use cases, and deployment in banking and finance