
On-Premise AI for Banking: Satisfying Regulator Audit Requirements
Architecture and operational guide for deploying on-premise AI in banking environments that satisfy OCC, FINRA, and Federal Reserve audit requirements. Covers infrastructure, audit trails, access controls, change management, disaster recovery, and a 10-dimension compliance comparison.
Your bank's CTO wants AI. Compliance wants audit trails. The CISO wants data on-premise. The head of operations wants it deployed before Q3.
These are not conflicting requirements. They are design constraints — and they all point to the same architecture: fine-tuned models running on-premise, with comprehensive audit logging, access controls, and change management baked into the infrastructure from day one.
This guide covers the complete architecture, from GPU cluster to audit log retention, that lets you deploy AI and pass the examiner review.
Architecture Overview
The deployment follows a four-stage pipeline. Each stage has a gate that produces auditable artifacts.
[Stage 1: Training] Air-gapped GPU environment
|
Validation Gate ---------> Benchmark results, bias audit, model card
|
[Stage 2: Validation] Independent review by MRM team
|
Approval Gate ------------> Sign-off document, risk tier classification
|
[Stage 3: Production Inference] On-premise inference cluster
|
Continuous Monitoring -----> Drift alerts, accuracy tracking, audit logs
|
[Stage 4: Audit Logging] Immutable log store, 7-year retention
Every inference, every model change, and every access event produces a record that flows into Stage 4. Nothing is optional.
Infrastructure Requirements
Banking AI infrastructure splits into three functional tiers.
Tier Breakdown
| Tier | Purpose | Hardware | Network |
|---|---|---|---|
| Training | Fine-tuning and adapter creation | 1-2x NVIDIA A100 40GB or 2-4x T4 16GB | Air-gapped or isolated VLAN |
| Inference | Production model serving | 2x NVIDIA T4 16GB (HA pair) or 2x 32-core CPU servers | Isolated VLAN, internal access only |
| Storage & Logging | Model registry, audit logs, backups | 2TB NVMe + networked storage (NAS/SAN) | Same VLAN as inference, replicated |
Training Tier
Fine-tuning happens infrequently — typically quarterly or when a new use case is onboarded. The training environment should be:
- Air-gapped or strictly isolated. Training data includes sensitive financial documents. No internet access. No shared infrastructure.
- GPU-equipped. Fine-tuning a 7B-8B parameter model with LoRA requires 16-24GB VRAM. A single A100 40GB handles this comfortably. A pair of T4 16GB GPUs works with gradient accumulation (a configuration sketch follows below).
- Temporary. Training runs take 1-4 hours. The environment can be powered down between runs. If using cloud GPU instances for training (with appropriate data controls), the cost is $5-20 per training run.
Cost: $15,000-25,000 for a dedicated training server, or $5-20 per run on reserved cloud GPU instances (if compliance permits controlled, encrypted data transfer).
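For orientation, here is a minimal fine-tuning configuration sketch using the Hugging Face peft and transformers libraries (an assumption about tooling; any LoRA-capable stack works). The hyperparameters and output path are illustrative, not recommendations. Gradient accumulation is what lets a pair of 16GB T4s stand in for a single A100.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter config: only small rank-decomposition matrices are trained,
# which is why a 7B-8B base model fits within 16-24GB of VRAM.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

# Small per-device batches plus gradient accumulation keep memory within a
# T4's 16GB while preserving the effective batch size.
training_args = TrainingArguments(
    output_dir="/airgap/training-runs/loan-analysis-adapter",  # hypothetical path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=2,
    learning_rate=2e-4,
    fp16=True,  # T4s support fp16, not native bf16
    logging_steps=10,
)
```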
Inference Tier
Production inference runs 24/7. This is the tier that handles real requests from banking applications.
| Specification | GPU Path (Recommended) | CPU-Only Path |
|---|---|---|
| Servers | 2 (active-active HA) | 2 (active-active HA) |
| CPU | 16-core Xeon Silver per server | 32-core Xeon Gold per server |
| RAM | 64GB per server | 128GB per server |
| GPU | 1x NVIDIA T4 16GB per server | None |
| Storage | 500GB NVMe SSD per server | 500GB NVMe SSD per server |
| Throughput | 15-40 tokens/sec per server | 3-8 tokens/sec per server |
| Concurrent requests | 10-20 per server | 2-5 per server |
High availability: Run two inference servers in active-active mode behind an internal load balancer. If one fails, the other handles full load at reduced throughput. RTO: zero (automatic failover). RPO: zero (stateless inference).
Cost per server: $8,000-12,000 (with T4 GPU). Two servers for HA: $16,000-24,000.
Storage & Logging Tier
| Storage Component | Size | Growth | Retention |
|---|---|---|---|
| Model files (base + adapters) | 20-60GB | ~10GB/quarter (new adapters) | All versions, indefinitely |
| Audit logs | 15-50GB/year | Linear with inference volume | 7 years minimum |
| Training artifacts | 5-10GB per training run | Quarterly | All runs, indefinitely |
| Evaluation datasets | 2-5GB | Quarterly updates | All versions, indefinitely |
| Backups (encrypted) | Mirror of above | Matches primary | Same as primary |
Total first-year storage: 100-200GB. A 2TB NVMe array handles 7+ years of growth.
Audit Trail Architecture
This is the section examiners care about most. Every inference must produce a complete, immutable audit record.
Per-Inference Log Record
| Field | Type | Example | Why Examiners Want It |
|---|---|---|---|
| timestamp | ISO 8601 | 2026-02-26T09:14:33.127Z | Temporal correlation with business events |
| request_id | UUID v4 | 8f3a2b1c-... | Unique reference for investigation |
| model_version | String | llama-3.1-8b-q4km-v2.1 | Reproducibility |
| adapter_version | String | loan-analysis-v3.2 | Reproducibility |
| input_hash | SHA-256 | a3f2c7... | Integrity proof without storing raw data |
| output_hash | SHA-256 | b7c1d9... | Integrity proof without storing raw data |
| department | String | commercial-lending | Usage attribution |
| user_id | String | svc-loan-origination | Access attribution |
| confidence | Float | 0.94 | Decision quality evidence |
| token_count_in | Integer | 1,247 | Resource tracking |
| token_count_out | Integer | 342 | Resource tracking |
| latency_ms | Integer | 1,180 | SLA compliance |
| status | Enum | success | Operations monitoring |
| error_code | String | null | Incident investigation |
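As a concrete illustration, here is a minimal sketch of how a gateway in front of the inference server might assemble one of these records. The field names follow the table above; the hashing approach (SHA-256 over the raw prompt and completion, so the log proves integrity without retaining customer data) and the whitespace token counts are implementation assumptions, not a prescribed standard.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_audit_record(prompt: str, completion: str, *, model_version: str,
                       adapter_version: str, department: str, user_id: str,
                       confidence: float, latency_ms: int, status: str = "success",
                       error_code: str | None = None) -> dict:
    """Assemble one per-inference audit record matching the field table above."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": str(uuid.uuid4()),
        "model_version": model_version,
        "adapter_version": adapter_version,
        # Hashes prove what went in and out without storing raw customer data.
        "input_hash": _sha256(prompt),
        "output_hash": _sha256(completion),
        "department": department,
        "user_id": user_id,
        "confidence": confidence,
        "token_count_in": len(prompt.split()),       # placeholder; use the tokenizer's count in practice
        "token_count_out": len(completion.split()),  # placeholder; use the tokenizer's count in practice
        "latency_ms": latency_ms,
        "status": status,
        "error_code": error_code,
    }


# Example: emit one record as a JSON line destined for the append-only log.
record = build_audit_record("Summarize the attached credit memo.", "The borrower shows...",
                            model_version="llama-3.1-8b-q4km-v2.1",
                            adapter_version="loan-analysis-v3.2",
                            department="commercial-lending",
                            user_id="svc-loan-origination",
                            confidence=0.94, latency_ms=1180)
print(json.dumps(record))
```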
Log Immutability
Audit logs must be tamper-evident. Options:
- Write-once storage: WORM (Write Once Read Many) volumes. NetApp SnapLock, Dell PowerStore immutable snapshots, or similar.
- Append-only database: PostgreSQL with row-level security preventing UPDATE/DELETE on audit tables. Combined with regular hash-chain verification.
- Log forwarding: Real-time replication to a separate SIEM (Splunk, Elastic, QRadar) with independent retention policies.
The most practical approach for most banks: PostgreSQL append-only tables with nightly hash-chain verification, replicated to the existing SIEM. This integrates with your current audit infrastructure without introducing new systems.
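A minimal sketch of the nightly hash-chain verification, assuming each audit row carries a chain_hash column equal to SHA-256 of the previous link concatenated with a canonical serialization of the record. The column name and serialization scheme are assumptions; the point is that any retroactive edit breaks every subsequent link.

```python
import hashlib
import json


def chain_hash(prev_hash: str, record: dict) -> str:
    """Hash of the previous link plus a canonical JSON serialization of this record."""
    payload = prev_hash + json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_chain(rows: list[dict], genesis: str = "0" * 64) -> bool:
    """Recompute the chain over rows in insertion order; a tampered row breaks the chain."""
    prev = genesis
    for row in rows:
        expected = chain_hash(prev, row["record"])
        if row["chain_hash"] != expected:
            print(f"Chain broken at request_id={row['record'].get('request_id')}")
            return False
        prev = expected
    return True
```

The nightly job recomputes the chain over the day's inserts and compares the final link against the value replicated to the SIEM, so tampering in either system is visible from the other.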
Retention Requirements
OCC examination guidance expects 5-7 years of records for model-related decisions. For AI audit logs:
- Inference logs: 7 years from the date of the inference
- Model versions: Indefinite (you need the ability to load any historical version for investigation)
- Training artifacts: Indefinite (training data provenance, hyperparameters, training logs)
- Validation reports: Indefinite (tied to model versions)
Storage cost for 7-year retention: At 30GB/year of audit logs, 7 years is 210GB. Compressed and archived, this fits on a single NAS shelf. The cost is trivial — under $500 for the storage hardware.
Access Controls
RBAC Model
| Role | Permissions | Typical Users |
|---|---|---|
| Model Developer | Train models, upload adapters to staging | AI/ML team (2-3 people) |
| Model Validator | Read-only access to models + training artifacts, run validation suites | MRM team |
| Deployment Approver | Promote models from staging to production | Technology risk committee |
| API Consumer | Invoke inference API for authorized use cases | Application service accounts |
| Auditor | Read-only access to all logs, model cards, validation reports | Internal audit, examiners |
| Infrastructure Admin | Server management, patching, backup/restore | DevOps team |
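One way to express the table above as configuration, shown here as a minimal sketch with hypothetical permission names. In practice the mapping lives in the bank's IAM system rather than in application code, but the deny-by-default shape is the same.

```python
# Hypothetical permission names; the roles mirror the RBAC table above.
ROLE_PERMISSIONS = {
    "model_developer": {"train_model", "upload_adapter_staging"},
    "model_validator": {"read_models", "read_training_artifacts", "run_validation"},
    "deployment_approver": {"promote_to_production"},
    "api_consumer": {"invoke_inference"},
    "auditor": {"read_logs", "read_model_cards", "read_validation_reports"},
    "infrastructure_admin": {"manage_servers", "apply_patches", "run_backup_restore"},
}


def is_authorized(role: str, permission: str) -> bool:
    """Deny by default: unknown roles and unknown permissions are rejected."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```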
API Key Management
Each consuming application gets a dedicated API key with scoped permissions:
- Key rotation: Every 90 days, automated. Old keys remain valid for a 7-day grace period.
- Rate limiting: Per-key rate limits based on the approved use case. Loan origination: 500 requests/day. Customer service: 2,000 requests/day (see the sketch after this list).
- Usage monitoring: Real-time dashboards showing per-key volume, latency, and error rates. Alerts on anomalous patterns (sudden volume spike, requests outside business hours).
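A minimal sketch of per-key daily rate limit enforcement. The limits mirror the approved use cases above; the key names are hypothetical, and the in-memory counter is illustrative only. A production gateway would back this with a shared store and feed the same counters into the usage dashboards.

```python
from datetime import date

# Per-key daily limits from the approved use cases (illustrative key names).
DAILY_LIMITS = {
    "key-loan-origination": 500,
    "key-customer-service": 2000,
}

_usage: dict[tuple[str, date], int] = {}


def allow_request(api_key: str) -> bool:
    """Return True if the key is known and under its daily limit; count the request."""
    limit = DAILY_LIMITS.get(api_key)
    if limit is None:
        return False  # unknown keys are rejected outright
    bucket = (api_key, date.today())
    _usage[bucket] = _usage.get(bucket, 0) + 1
    return _usage[bucket] <= limit
```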
Per-Department Usage Monitoring
| Department | Use Case | Daily Volume | Monthly Cost (On-Prem) | Monthly Cost (Cloud API) |
|---|---|---|---|---|
| Commercial Lending | Loan document analysis | 200-400 | $0 (fixed infra) | $1,800-3,600 |
| Retail Banking | Customer inquiry classification | 800-1,500 | $0 (fixed infra) | $7,200-13,500 |
| Compliance | SAR narrative drafting | 50-100 | $0 (fixed infra) | $450-900 |
| Risk Management | Credit memo summarization | 100-200 | $0 (fixed infra) | $900-1,800 |
| Total | | 1,150-2,200 | $0 marginal | $10,350-19,800/mo |
On-premise inference has zero marginal cost per request. The infrastructure cost is fixed regardless of volume. This changes the economics of AI adoption entirely — departments can experiment without budget approval for each new use case.
Change Management Workflow
Every model change follows a documented, auditable workflow.
Six-Step Process
1. PROPOSE --> Change request with business justification
Submitted by: Model Developer
Approved by: Use case owner + Technology risk
2. DEVELOP --> Fine-tune or update adapter
Environment: Air-gapped training tier
Artifacts: Training logs, new model card
3. VALIDATE --> Run benchmark suite + backtest + adversarial tests
Performed by: Model Developer
Artifacts: Evaluation report
4. REVIEW --> Independent validation
Performed by: MRM team or external validator
Artifacts: Validation report with findings
5. APPROVE --> Deployment approval
Approved by: Deployment Approver (risk committee)
Artifacts: Signed approval, risk tier classification
6. DEPLOY --> Staged rollout to production
Performed by: Infrastructure Admin
Stages: Canary (5%) → Partial (25%) → Full (100%)
Monitoring: 48-hour observation at each stage
Rollback Procedure
If monitoring detects quality degradation after deployment:
- Automatic rollback trigger: Accuracy drops below threshold for 15 consecutive minutes
- Manual rollback: Any authorized operator can revert to the previous model version in under 2 minutes
- Incident documentation: Every rollback triggers an incident report documenting what changed, what failed, and root cause analysis
The previous model version stays loaded in memory on the standby inference server. Rollback is a load balancer configuration change — not a model reload.
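A minimal sketch of the automatic rollback trigger, assuming the monitoring pipeline delivers one accuracy sample per minute and that "revert" means repointing the load balancer at the standby server that still holds the previous version. The threshold, window, and callback are illustrative assumptions.

```python
from collections import deque

ACCURACY_THRESHOLD = 0.90   # illustrative; set per model risk tier
WINDOW_MINUTES = 15         # consecutive minutes below threshold before rollback


class RollbackMonitor:
    """Track a sliding window of per-minute accuracy samples and trigger rollback."""

    def __init__(self, switch_traffic_to_previous_version):
        self._samples = deque(maxlen=WINDOW_MINUTES)
        self._rollback = switch_traffic_to_previous_version  # hypothetical load balancer hook

    def record_minute(self, accuracy: float) -> None:
        self._samples.append(accuracy)
        window_full = len(self._samples) == WINDOW_MINUTES
        if window_full and all(a < ACCURACY_THRESHOLD for a in self._samples):
            # 15 consecutive minutes below threshold: revert traffic, then open an incident.
            self._rollback()


# Example wiring; in production the callback flips the load balancer config.
monitor = RollbackMonitor(lambda: print("Reverting traffic to previous model version"))
```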
Disaster Recovery
RTO and RPO Targets
| Scenario | RTO | RPO | Recovery Method |
|---|---|---|---|
| Single GPU failure | 0 (automatic) | 0 | Failover to HA partner server |
| Single server failure | 0 (automatic) | 0 | Load balancer removes failed node |
| Both servers fail | 4 hours | 0 | Restore from backup to replacement hardware |
| Model file corruption | 30 minutes | 0 | Restore from model registry backup |
| Audit database failure | 15 minutes | 5 minutes | Failover to replica, restore from WAL |
| Data center failure | 8-24 hours | 1 hour | Restore at DR site from replicated backups |
CPU Failover
If all GPUs fail, the inference stack falls back to CPU-only operation:
- Throughput drops from 30 tokens/sec to 5 tokens/sec per server
- Maximum concurrent requests drops from 20 to 4
- Priority queue activates: Compliance and risk requests first, other departments queued (sketched after this list)
- Automated notification to all consuming applications: "AI system operating in degraded mode, expect higher latency"
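A minimal sketch of the degraded-mode priority queue, assuming each request carries a department tag. The priority ordering below (compliance and risk first) mirrors the policy above; the exact tiers are an assumption.

```python
import heapq
import itertools

# Lower number = served first during CPU-only degraded mode (assumed ordering).
DEPARTMENT_PRIORITY = {
    "compliance": 0,
    "risk-management": 0,
    "commercial-lending": 1,
    "retail-banking": 2,
}

_counter = itertools.count()  # tie-breaker so equal-priority requests stay FIFO
_queue: list[tuple[int, int, dict]] = []


def enqueue(request: dict) -> None:
    priority = DEPARTMENT_PRIORITY.get(request["department"], 3)
    heapq.heappush(_queue, (priority, next(_counter), request))


def next_request() -> dict | None:
    """Pop the highest-priority pending request, or None if the queue is empty."""
    return heapq.heappop(_queue)[2] if _queue else None
```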
Degraded-Mode Operation
When AI is unavailable entirely:
- All consuming applications must have a non-AI fallback path
- Loan document analysis: Manual review (existing pre-AI workflow)
- Customer classification: Rule-based routing (existing pre-AI workflow)
- SAR drafting: Manual drafting by compliance analysts
This is a regulatory requirement, not just good practice. Examiners will ask: "What happens if this system goes down?" The answer must be: "We revert to our pre-AI process, which is documented and tested quarterly."
Regulator Exam Preparation
When OCC or FINRA examiners arrive, present the following package.
Examiner Briefing Package
- Model inventory: Complete list of all AI models in production, with risk tier, owner, validation date, and next review date
- Architecture diagram: The four-stage pipeline from training to audit logging
- Sample audit trail: Pull 5 inference records from the previous month, showing the complete log chain from request to response
- Validation report: Most recent independent validation for each model, including findings and remediation
- Change log: All model changes in the past 12 months, with approval documentation
- Incident log: Any model-related incidents, including rollbacks, with root cause analysis
- Access control documentation: RBAC configuration, API key inventory, usage reports by department
- Disaster recovery test results: Most recent DR test, including failover time and data integrity verification
Common Examiner Concerns (and Pre-Emptive Answers)
"Is customer data being sent to third parties?" No. All inference runs on-premise. No customer data leaves the bank's network. Show the network architecture diagram with VLAN isolation.
"Can you reproduce a historical model output?" Yes. Show the model version registry, demonstrate loading a historical version, and reproduce an output from the audit log using the recorded input hash.
"How do you detect model drift?" Weekly accuracy monitoring against a benchmark set. Automated alerts when accuracy drops below threshold. Quarterly full re-validation. Show the monitoring dashboard.
"What is the board's involvement?" Model risk governance reports to the board risk committee quarterly. The committee approves the model risk appetite statement and reviews Tier 1 model deployments. Show the last quarterly report.
Compliance Comparison: On-Premise vs Cloud
Here is how on-premise and cloud deployments compare across 10 audit dimensions.
| Dimension | On-Premise | Cloud API | Examiner Preference |
|---|---|---|---|
| 1. Data residency | All data stays in bank's network | Data transits to provider's infrastructure | On-premise |
| 2. Audit logging | Complete, bank-controlled | Dependent on provider's logging capabilities | On-premise |
| 3. Reproducibility | Full — pin model version, replay inputs | Limited — provider may update models | On-premise |
| 4. Access control | Integrated with bank IAM | Separate API key management | On-premise |
| 5. Change management | Bank controls all changes | Provider controls model updates | On-premise |
| 6. Vendor risk | No third-party model provider | Requires vendor risk assessment, ongoing monitoring | On-premise |
| 7. Incident response | Full forensic capability | Limited to provider's incident reports | On-premise |
| 8. Model validation | Validate anytime with internal test suites | Cannot run arbitrary tests against hosted models | On-premise |
| 9. DR testing | Bank controls DR strategy and testing | Dependent on provider's SLA and DR capabilities | On-premise |
| 10. Cost predictability | Fixed infrastructure cost | Variable, usage-based, subject to price increases | On-premise |
This is not a close comparison. On-premise wins on every dimension that matters to regulators. The only dimension where cloud APIs have an advantage is time-to-first-inference — you can start using a cloud API in hours, while on-premise takes weeks to set up.
But setup time is a one-time cost. Audit compliance is ongoing, every quarter, for the life of the model. Invest the weeks upfront.
Total Cost of Ownership
3-Year Comparison (Mid-Size Bank, 4 Use Cases)
| Component | On-Premise | Cloud API (Enterprise BAA Tier) |
|---|---|---|
| Infrastructure (year 0) | $40,000 | $0 |
| Annual operations | $25,000/yr | $8,000/yr |
| Annual API costs | $0 | $120,000-240,000/yr |
| Vendor risk assessment | $0 | $15,000/yr |
| Compliance overhead | $5,000/yr | $20,000/yr |
| 3-Year Total | $130,000 | $489,000-849,000 |
The math is not subtle. On-premise costs roughly 73-85% less over three years for a bank running four AI use cases at typical volumes (1,000-2,000 inferences/day total).
More importantly, the compliance posture is categorically better. You are not documenting someone else's infrastructure — you are documenting your own.
Further Reading
- On-Premise AI for Law Firms: Compliance Checklist — Similar compliance framework applied to legal industry requirements
- GPU Cost Comparison for Self-Hosting AI in 2026 — Detailed hardware benchmarks and pricing for inference and training workloads
- Fine-Tuning AI for Financial Services — Comprehensive guide to compliance, use cases, and deployment in banking and finance