    On-Premise AI for Banking: Satisfying Regulator Audit Requirements

    Architecture and operational guide for deploying on-premise AI in banking environments that satisfy OCC, FINRA, and Federal Reserve audit requirements. Covers infrastructure, audit trails, access controls, change management, disaster recovery, and a 10-dimension compliance comparison.

    Ertas Team

    Your bank's CTO wants AI. Compliance wants audit trails. The CISO wants data on-premise. The head of operations wants it deployed before Q3.

    These are not conflicting requirements. They are design constraints — and they all point to the same architecture: fine-tuned models running on-premise, with comprehensive audit logging, access controls, and change management baked into the infrastructure from day one.

    This guide covers the complete architecture, from GPU cluster to audit log retention, that lets you deploy AI and pass the examiner review.

    Architecture Overview

    The deployment follows a four-stage pipeline. Each stage has a gate that produces auditable artifacts.

    [Stage 1: Training]              Air-gapped GPU environment
            |
        Validation Gate ---------> Benchmark results, bias audit, model card
            |
    [Stage 2: Validation]           Independent review by MRM team
            |
        Approval Gate ------------> Sign-off document, risk tier classification
            |
    [Stage 3: Production Inference]  On-premise inference cluster
            |
        Continuous Monitoring -----> Drift alerts, accuracy tracking, audit logs
            |
    [Stage 4: Audit Logging]        Immutable log store, 7-year retention
    

    Every inference, every model change, and every access event produces a record that flows into Stage 4. Nothing is optional.

    Infrastructure Requirements

    Banking AI infrastructure splits into three functional tiers.

    Tier Breakdown

    Tier | Purpose | Hardware | Network
    --- | --- | --- | ---
    Training | Fine-tuning and adapter creation | 1-2x NVIDIA A100 40GB or 4x T4 16GB | Air-gapped or isolated VLAN
    Inference | Production model serving | 2x NVIDIA T4 16GB (HA pair) or 2x 32-core CPU servers | Isolated VLAN, internal access only
    Storage & Logging | Model registry, audit logs, backups | 2TB NVMe + networked storage (NAS/SAN) | Same VLAN as inference, replicated

    Training Tier

    Fine-tuning happens infrequently — typically quarterly or when a new use case is onboarded. The training environment should be:

    • Air-gapped or strictly isolated. Training data includes sensitive financial documents. No internet access. No shared infrastructure.
    • GPU-equipped. Fine-tuning a 7B-8B parameter model with LoRA requires 16-24GB VRAM. A single A100 40GB handles this comfortably. A pair of T4 16GB GPUs works with gradient accumulation.
    • Temporary. Training runs take 1-4 hours. The environment can be powered down between runs. If using cloud GPU instances for training (with appropriate data controls), the cost is $5-20 per training run.

    Cost: $15,000-25,000 for a dedicated training server, or $5-20 per run on reserved cloud GPU instances (if compliance permits controlled, encrypted data transfer).
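
    To make the training-tier workload concrete, here is a minimal LoRA fine-tuning sketch using the Hugging Face peft library. The base model, rank, and target modules are illustrative assumptions, not a prescribed configuration:

        # Minimal LoRA setup (sketch; hyperparameters are illustrative assumptions)
        from transformers import AutoModelForCausalLM
        from peft import LoraConfig, get_peft_model

        # In an air-gapped environment, load from a local model mirror
        base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

        config = LoraConfig(
            r=16,                                 # adapter rank; controls trainable parameter count
            lora_alpha=32,                        # scaling factor applied to adapter outputs
            target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
            lora_dropout=0.05,
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(base, config)
        model.print_trainable_parameters()        # typically well under 1% of base weights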

    Inference Tier

    Production inference runs 24/7. This is the tier that handles real requests from banking applications.

    Specification | GPU Path (Recommended) | CPU-Only Path
    --- | --- | ---
    Servers | 2 (active-active HA) | 2 (active-active HA)
    CPU | 16-core Xeon Silver per server | 32-core Xeon Gold per server
    RAM | 64GB per server | 128GB per server
    GPU | 1x NVIDIA T4 16GB per server | None
    Storage | 500GB NVMe SSD per server | 500GB NVMe SSD per server
    Throughput | 15-40 tokens/sec per server | 3-8 tokens/sec per server
    Concurrent requests | 10-20 per server | 2-5 per server

    High availability: Run two inference servers in active-active mode behind an internal load balancer. If one fails, the other handles full load at reduced throughput. RTO: zero (automatic failover). RPO: zero (stateless inference).

    Cost per server: $8,000-12,000 (with T4 GPU). Two servers for HA: $16,000-24,000.

    Storage & Logging Tier

    Storage Component | Size | Growth | Retention
    --- | --- | --- | ---
    Model files (base + adapters) | 20-60GB | ~10GB/quarter (new adapters) | All versions, indefinitely
    Audit logs | 15-50GB/year | Linear with inference volume | 7 years minimum
    Training artifacts | 5-10GB per training run | Quarterly | All runs, indefinitely
    Evaluation datasets | 2-5GB | Quarterly updates | All versions, indefinitely
    Backups (encrypted) | Mirror of above | Matches primary | Same as primary

    Total first-year storage: 100-200GB. A 2TB NVMe array handles 7+ years of growth.

    Audit Trail Architecture

    This is the section examiners care about most. Every inference must produce a complete, immutable audit record.

    Per-Inference Log Record

    Field | Type | Example | Why Examiners Want It
    --- | --- | --- | ---
    timestamp | ISO 8601 | 2026-02-26T09:14:33.127Z | Temporal correlation with business events
    request_id | UUID v4 | 8f3a2b1c-... | Unique reference for investigation
    model_version | String | llama-3.1-8b-q4km-v2.1 | Reproducibility
    adapter_version | String | loan-analysis-v3.2 | Reproducibility
    input_hash | SHA-256 | a3f2c7... | Integrity proof without storing raw data
    output_hash | SHA-256 | b7c1d9... | Integrity proof without storing raw data
    department | String | commercial-lending | Usage attribution
    user_id | String | svc-loan-origination | Access attribution
    confidence | Float | 0.94 | Decision quality evidence
    token_count_in | Integer | 1,247 | Resource tracking
    token_count_out | Integer | 342 | Resource tracking
    latency_ms | Integer | 1,180 | SLA compliance
    status | Enum | success | Operations monitoring
    error_code | String | null | Incident investigation
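
    A minimal sketch of how an inference service might assemble this record. Field names follow the table; the helper and its token counting are illustrative (a real service would use the tokenizer's counts):

        import hashlib
        import uuid
        from datetime import datetime, timezone

        def audit_record(prompt: str, completion: str, *, model_version: str,
                         adapter_version: str, department: str, user_id: str,
                         confidence: float, latency_ms: int) -> dict:
            """Assemble one per-inference audit record (sketch)."""
            return {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "request_id": str(uuid.uuid4()),
                "model_version": model_version,
                "adapter_version": adapter_version,
                # Hashes prove integrity without retaining raw customer data
                "input_hash": hashlib.sha256(prompt.encode()).hexdigest(),
                "output_hash": hashlib.sha256(completion.encode()).hexdigest(),
                "department": department,
                "user_id": user_id,
                "confidence": confidence,
                "token_count_in": len(prompt.split()),       # placeholder; use tokenizer counts
                "token_count_out": len(completion.split()),  # placeholder; use tokenizer counts
                "latency_ms": latency_ms,
                "status": "success",
                "error_code": None,
            }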

    Log Immutability

    Audit logs must be tamper-evident. Options:

    1. Write-once storage: WORM (Write Once Read Many) volumes. NetApp SnapLock, Dell PowerStore immutable snapshots, or similar.
    2. Append-only database: PostgreSQL with row-level security preventing UPDATE/DELETE on audit tables, combined with regular hash-chain verification.
    3. Log forwarding: Real-time replication to a separate SIEM (Splunk, Elastic, QRadar) with independent retention policies.

    The most practical approach for most banks: PostgreSQL append-only tables with nightly hash-chain verification, replicated to the existing SIEM. This integrates with your current audit infrastructure without introducing new systems.
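
    A sketch of the nightly hash-chain check, assuming each audit row stores a chain hash computed as SHA-256 of the previous row's chain hash concatenated with the row's payload (the schema is an assumption, not a standard):

        import hashlib

        def verify_hash_chain(rows) -> bool:
            """Verify an append-only audit table's hash chain (sketch).
            rows yields (payload_json, chain_hash) in insertion order;
            chain_hash must equal SHA-256(previous_chain_hash + payload_json)."""
            prev = ""  # genesis value for the first record
            for payload, chain_hash in rows:
                expected = hashlib.sha256((prev + payload).encode()).hexdigest()
                if expected != chain_hash:
                    return False  # tampering or corruption; open an incident
                prev = chain_hash
            return True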

    Retention Requirements

    OCC examination guidance expects 5-7 years of records for model-related decisions. For AI audit logs:

    • Inference logs: 7 years from the date of the inference
    • Model versions: Indefinite (you need the ability to load any historical version for investigation)
    • Training artifacts: Indefinite (training data provenance, hyperparameters, training logs)
    • Validation reports: Indefinite (tied to model versions)

    Storage cost for 7-year retention: At 30GB/year of audit logs, 7 years is 210GB. Compressed and archived, this fits on a single NAS shelf. The cost is trivial — under $500 for the storage hardware.

    Access Controls

    RBAC Model

    Role | Permissions | Typical Users
    --- | --- | ---
    Model Developer | Train models, upload adapters to staging | AI/ML team (2-3 people)
    Model Validator | Read-only access to models + training artifacts, run validation suites | MRM team
    Deployment Approver | Promote models from staging to production | Technology risk committee
    API Consumer | Invoke inference API for authorized use cases | Application service accounts
    Auditor | Read-only access to all logs, model cards, validation reports | Internal audit, examiners
    Infrastructure Admin | Server management, patching, backup/restore | DevOps team

    API Key Management

    Each consuming application gets a dedicated API key with scoped permissions:

    • Key rotation: Every 90 days, automated. Old keys remain valid for a 7-day grace period.
    • Rate limiting: Per-key rate limits based on the approved use case. Loan origination: 500 requests/day. Customer service: 2,000 requests/day.
    • Usage monitoring: Real-time dashboards showing per-key volume, latency, and error rates. Alerts on anomalous patterns (sudden volume spike, requests outside business hours).
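
    A sketch of the per-key gate described above. The fixed-window counter and in-process dictionary are simplifications; production deployments typically use a sliding window backed by a shared store:

        from datetime import date

        # Illustrative per-key daily limits, matching the approved use cases above
        DAILY_LIMITS = {"svc-loan-origination": 500, "svc-customer-service": 2000}
        _counters: dict[tuple[str, date], int] = {}

        def check_rate_limit(api_key_id: str) -> bool:
            """Allow the request if the key is under its daily limit (sketch)."""
            window = (api_key_id, date.today())
            _counters[window] = _counters.get(window, 0) + 1
            return _counters[window] <= DAILY_LIMITS.get(api_key_id, 0)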

    Per-Department Usage Monitoring

    Department | Use Case | Daily Volume | Monthly Cost (On-Prem) | Monthly Cost (Cloud API)
    --- | --- | --- | --- | ---
    Commercial Lending | Loan document analysis | 200-400 | $0 (fixed infra) | $1,800-3,600
    Retail Banking | Customer inquiry classification | 800-1,500 | $0 (fixed infra) | $7,200-13,500
    Compliance | SAR narrative drafting | 50-100 | $0 (fixed infra) | $450-900
    Risk Management | Credit memo summarization | 100-200 | $0 (fixed infra) | $900-1,800
    Total | | 1,150-2,200 | $0 marginal | $10,350-19,800/mo

    On-premise inference has zero marginal cost per request. The infrastructure cost is fixed regardless of volume. This changes the economics of AI adoption entirely — departments can experiment without budget approval for each new use case.

    Change Management Workflow

    Every model change follows a documented, auditable workflow.

    Six-Step Process

    1. PROPOSE   --> Change request with business justification
                     Submitted by: Model Developer
                     Approved by: Use case owner + Technology risk
    
    2. DEVELOP   --> Fine-tune or update adapter
                     Environment: Air-gapped training tier
                     Artifacts: Training logs, new model card
    
    3. VALIDATE  --> Run benchmark suite + backtest + adversarial tests
                     Performed by: Model Developer
                     Artifacts: Evaluation report
    
    4. REVIEW    --> Independent validation
                     Performed by: MRM team or external validator
                     Artifacts: Validation report with findings
    
    5. APPROVE   --> Deployment approval
                     Approved by: Deployment Approver (risk committee)
                     Artifacts: Signed approval, risk tier classification
    
    6. DEPLOY    --> Staged rollout to production
                     Performed by: Infrastructure Admin
                     Stages: Canary (5%) → Partial (25%) → Full (100%)
                     Monitoring: 48-hour observation at each stage
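
    A sketch of the staged rollout in step 6. The traffic weights and 48-hour observation window come from the process above; set_weight and healthy are hypothetical hooks into your load balancer and monitoring stack:

        import time

        STAGES = [0.05, 0.25, 1.00]      # canary -> partial -> full
        OBSERVATION_SECONDS = 48 * 3600  # 48-hour watch at each stage

        def staged_rollout(set_weight, healthy) -> bool:
            """Promote a new model version through canary stages (sketch)."""
            for fraction in STAGES:
                set_weight(fraction)      # route this share of traffic to the new version
                time.sleep(OBSERVATION_SECONDS)
                if not healthy():
                    set_weight(0.0)       # send all traffic back to the previous version
                    return False          # hand off to the rollback procedure below
            return True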
    

    Rollback Procedure

    If monitoring detects quality degradation after deployment:

    1. Automatic rollback trigger: Accuracy drops below threshold for 15 consecutive minutes
    2. Manual rollback: Any authorized operator can revert to the previous model version in under 2 minutes
    3. Incident documentation: Every rollback triggers an incident report documenting what changed, what failed, and root cause analysis

    The previous model version stays loaded in memory on the standby inference server. Rollback is a load balancer configuration change — not a model reload.
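
    A sketch of the automatic trigger, assuming the monitoring pipeline emits one accuracy sample per minute. The 15-minute window comes from the procedure above; the threshold value is an illustrative assumption set per model risk tier:

        from collections import deque

        class RollbackTrigger:
            """Fire when accuracy stays below threshold for N consecutive minutes (sketch)."""
            def __init__(self, threshold: float = 0.90, window_minutes: int = 15):
                self.threshold = threshold  # illustrative; set per risk tier
                self.recent = deque(maxlen=window_minutes)

            def record_minute(self, accuracy: float) -> bool:
                """Feed one per-minute sample; returns True when rollback should fire."""
                self.recent.append(accuracy)
                return (len(self.recent) == self.recent.maxlen
                        and all(a < self.threshold for a in self.recent))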

    Disaster Recovery

    RTO and RPO Targets

    Scenario | RTO | RPO | Recovery Method
    --- | --- | --- | ---
    Single GPU failure | 0 (automatic) | 0 | Failover to HA partner server
    Single server failure | 0 (automatic) | 0 | Load balancer removes failed node
    Both servers fail | 4 hours | 0 | Restore from backup to replacement hardware
    Model file corruption | 30 minutes | 0 | Restore from model registry backup
    Audit database failure | 15 minutes | 5 minutes | Failover to replica, restore from WAL
    Data center failure | 8-24 hours | 1 hour | Restore at DR site from replicated backups

    CPU Failover

    If all GPUs fail, the inference stack falls back to CPU-only operation:

    • Throughput drops from 30 tokens/sec to 5 tokens/sec per server
    • Maximum concurrent requests drops from 20 to 4
    • Priority queue activates: Compliance and risk requests first, other departments queued
    • Automated notification to all consuming applications: "AI system operating in degraded mode, expect higher latency"
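
    A sketch of the priority queue that activates in degraded mode. The department ordering is illustrative; in practice it would mirror whatever the risk committee approves:

        import heapq
        import itertools

        # Lower number = served first in degraded mode (illustrative ordering)
        PRIORITY = {"compliance": 0, "risk-management": 0,
                    "commercial-lending": 1, "retail-banking": 2}
        _tiebreak = itertools.count()  # preserves FIFO order within a priority level
        _queue: list = []

        def enqueue(department: str, request) -> None:
            heapq.heappush(_queue, (PRIORITY.get(department, 3), next(_tiebreak), request))

        def next_request():
            """Pop the highest-priority pending request, or None when idle."""
            return heapq.heappop(_queue)[2] if _queue else None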

    Degraded-Mode Operation

    When AI is unavailable entirely:

    • All consuming applications must have a non-AI fallback path
    • Loan document analysis: Manual review (existing pre-AI workflow)
    • Customer classification: Rule-based routing (existing pre-AI workflow)
    • SAR drafting: Manual drafting by compliance analysts

    This is a regulatory requirement, not just good practice. Examiners will ask: "What happens if this system goes down?" The answer must be: "We revert to our pre-AI process, which is documented and tested quarterly."

    Regulator Exam Preparation

    When OCC or FINRA examiners arrive, present the following package.

    Examiner Briefing Package

    1. Model inventory: Complete list of all AI models in production, with risk tier, owner, validation date, and next review date
    2. Architecture diagram: The four-stage pipeline from training to audit logging
    3. Sample audit trail: Pull 5 inference records from the previous month, showing the complete log chain from request to response
    4. Validation report: Most recent independent validation for each model, including findings and remediation
    5. Change log: All model changes in the past 12 months, with approval documentation
    6. Incident log: Any model-related incidents, including rollbacks, with root cause analysis
    7. Access control documentation: RBAC configuration, API key inventory, usage reports by department
    8. Disaster recovery test results: Most recent DR test, including failover time and data integrity verification
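
    For item 3, a sketch of pulling sample records with psycopg2. The table and column names are assumptions based on the log schema above, not a fixed standard:

        import psycopg2

        # Five random inference records from the previous calendar month
        SQL = """
            SELECT timestamp, request_id, model_version, adapter_version,
                   input_hash, output_hash, department, status
            FROM audit.inference_log
            WHERE timestamp >= date_trunc('month', now()) - interval '1 month'
              AND timestamp <  date_trunc('month', now())
            ORDER BY random()
            LIMIT 5;
        """
        with psycopg2.connect("dbname=audit") as conn, conn.cursor() as cur:
            cur.execute(SQL)
            for record in cur.fetchall():
                print(record)  # present alongside the matching model card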

    Common Examiner Concerns (and Pre-Emptive Answers)

    "Is customer data being sent to third parties?" No. All inference runs on-premise. No customer data leaves the bank's network. Show the network architecture diagram with VLAN isolation.

    "Can you reproduce a historical model output?" Yes. Show the model version registry, demonstrate loading a historical version, and reproduce an output from the audit log using the recorded input hash.

    "How do you detect model drift?" Weekly accuracy monitoring against a benchmark set. Automated alerts when accuracy drops below threshold. Quarterly full re-validation. Show the monitoring dashboard.

    "What is the board's involvement?" Model risk governance reports to the board risk committee quarterly. The committee approves the model risk appetite statement and reviews Tier 1 model deployments. Show the last quarterly report.


    Compliance Comparison: On-Premise vs Cloud

    Here is how on-premise and cloud deployments compare across 10 audit dimensions.

    Dimension | On-Premise | Cloud API | Examiner Preference
    --- | --- | --- | ---
    1. Data residency | All data stays in bank's network | Data transits to provider's infrastructure | On-premise
    2. Audit logging | Complete, bank-controlled | Dependent on provider's logging capabilities | On-premise
    3. Reproducibility | Full — pin model version, replay inputs | Limited — provider may update models | On-premise
    4. Access control | Integrated with bank IAM | Separate API key management | On-premise
    5. Change management | Bank controls all changes | Provider controls model updates | On-premise
    6. Vendor risk | No third-party model provider | Requires vendor risk assessment, ongoing monitoring | On-premise
    7. Incident response | Full forensic capability | Limited to provider's incident reports | On-premise
    8. Model validation | Validate anytime with internal test suites | Cannot run arbitrary tests against hosted models | On-premise
    9. DR testing | Bank controls DR strategy and testing | Dependent on provider's SLA and DR capabilities | On-premise
    10. Cost predictability | Fixed infrastructure cost | Variable, usage-based, subject to price increases | On-premise

    This is not a close comparison. On-premise wins on every dimension that matters to regulators. The only dimension where cloud APIs have an advantage is time-to-first-inference — you can start using a cloud API in hours, while on-premise takes weeks to set up.

    But setup time is a one-time cost. Audit compliance is ongoing, every quarter, for the life of the model. Invest the weeks upfront.

    Total Cost of Ownership

    3-Year Comparison (Mid-Size Bank, 4 Use Cases)

    Component | On-Premise | Cloud API (Enterprise BAA Tier)
    --- | --- | ---
    Infrastructure (year 0) | $40,000 | $0
    Annual operations | $25,000/yr | $8,000/yr
    Annual API costs | $0 | $120,000-240,000/yr
    Vendor risk assessment | $0 | $15,000/yr
    Compliance overhead | $5,000/yr | $20,000/yr
    3-Year Total | $130,000 | $489,000-849,000

    The math is not subtle. On-premise costs roughly 70-85% less over three years for a bank running four AI use cases at typical volumes (1,000-2,000 inferences/day total).

    More importantly, the compliance posture is categorically better. You are not documenting someone else's infrastructure — you are documenting your own.
