    On-Premise Healthcare AI: Architecture and Infrastructure Guide


    A practical infrastructure guide for deploying AI on-premise in healthcare environments. Covers hardware requirements, network architecture, air-gapped deployment, HIPAA audit logging, model update strategies, and real cost comparisons against cloud APIs.

Ertas Team

    Your hospital's IT team says "we can't use cloud AI." They're right. PHI leaving your network is a compliance event. Every API call to OpenAI or Anthropic with patient data creates audit liability, BAA complexity, and breach risk.

    But here's the part they might not know yet: on-premise AI is now practical and affordable. A single NVIDIA T4 GPU costs less than a mid-range workstation. Open-source models run clinical NLP tasks at production quality. The infrastructure patterns are well-established.

    This guide covers exactly what you need — hardware, network architecture, model serving, storage, monitoring, updates, and disaster recovery — to run AI on-premise in a healthcare environment.

    Hardware Requirements

    The first decision is GPU vs CPU inference. This depends on your volume and latency requirements.

    GPU vs CPU Inference for Healthcare Volumes

| Factor | GPU (NVIDIA T4) | CPU-Only (Xeon/EPYC) |
| --- | --- | --- |
| Hardware cost | $2,000-3,000 per card | $0 additional (use existing servers) |
| Throughput | 15-40 tokens/sec (7B model, Q4) | 3-8 tokens/sec (7B model, Q4) |
| Concurrent users | 10-20 simultaneous requests | 2-5 simultaneous requests |
| Best for | >500 inferences/day, real-time triage | <200 inferences/day, batch processing |
| Power draw | 70W per T4 | Included in server baseline |
| Rack space | 1U per 2-4 GPUs | Existing server infrastructure |

    For most mid-size hospitals (200-500 beds): Start with a single T4 GPU. It handles clinical note summarization, diagnostic coding assistance, and patient triage at volumes that cover 3-5 departments. Total hardware cost: $8,000-12,000 for a complete inference server (CPU + RAM + T4 + storage).

    For smaller clinics (under 100 beds): CPU-only inference is viable. A modern 32-core Xeon server with 64GB RAM runs quantized 7B models at acceptable latency for non-real-time tasks like overnight batch processing of clinical notes or weekly report generation.
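    To sanity-check the GPU vs CPU decision for your own volumes, a rough back-of-the-envelope calculation helps. The sketch below is illustrative only: it assumes ~700 tokens per inference and the conservative end of the throughput ranges from the table above, not measurements from your hardware.

```python
# Rough GPU-vs-CPU sizing check based on the throughput ranges above.
# Assumptions: ~700 tokens per inference (500 in + 200 out) and the
# conservative end of each throughput range. Adjust for your workload.

def hours_of_compute_per_day(inferences_per_day: int,
                             tokens_per_inference: int = 700,
                             tokens_per_sec: float = 15.0) -> float:
    """Return the hours of single-stream inference time needed per day."""
    total_tokens = inferences_per_day * tokens_per_inference
    return total_tokens / tokens_per_sec / 3600

daily_volume = 500
gpu_hours = hours_of_compute_per_day(daily_volume, tokens_per_sec=15.0)  # T4, low end
cpu_hours = hours_of_compute_per_day(daily_volume, tokens_per_sec=3.0)   # CPU, low end

print(f"GPU: {gpu_hours:.1f} h/day of compute, CPU: {cpu_hours:.1f} h/day")
# If the CPU figure approaches a full working day, batch-only CPU
# inference will struggle to keep up and a T4 is the safer choice.
```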

    Minimum Server Specifications

| Component | GPU Path | CPU-Only Path |
| --- | --- | --- |
| CPU | 16+ cores (Xeon Silver or EPYC) | 32+ cores (Xeon Gold or EPYC) |
| RAM | 32GB minimum, 64GB recommended | 64GB minimum, 128GB recommended |
| GPU | NVIDIA T4 16GB (or A2000 12GB) | None |
| Storage | 500GB NVMe SSD | 500GB NVMe SSD |
| Network | 1GbE minimum, 10GbE recommended | 1GbE minimum |
| OS | Ubuntu 22.04 LTS or RHEL 9 | Ubuntu 22.04 LTS or RHEL 9 |

    Network Architecture

    Healthcare AI deployments fall into three network patterns, each with different security profiles.

    Pattern 1: Air-Gapped Deployment

    The strictest option. The inference server has zero internet connectivity.

    [Clinical Systems] <---> [Internal API Gateway] <---> [AI Inference Server]
                                  |
                             [Audit Log DB]
    
    No external network connection. Model updates via secure media.
    

    When to use: Highest-security environments. Facilities handling military health records, psychiatric records, substance abuse treatment records (42 CFR Part 2), or research data under strict IRB protocols.

    Trade-off: Model updates require physical media (encrypted USB) or a dedicated internal artifact registry. No remote monitoring. Higher operational overhead.

    Pattern 2: DMZ Deployment

    The inference server sits in a DMZ with controlled outbound access for updates only. No inbound connections from the internet.

    [Internet] --X-- [Firewall] --- [DMZ: Update Proxy] --- [Firewall] --- [AI Inference Server]
                                                                                  |
    [Clinical Systems] <-----------------------------------------> [Internal API Gateway]
    

    When to use: Most hospital deployments. Allows automated model updates through a controlled proxy while keeping PHI processing fully internal.

    Trade-off: Requires careful firewall rules. The update proxy must be hardened and audited.

    Pattern 3: VLAN Isolation

    The AI infrastructure runs on a dedicated VLAN, segmented from general hospital network traffic but accessible to authorized clinical systems.

    VLAN 100 (Clinical):     [EHR] [PACS] [Clinical Apps]
                                  |
                             [L3 Switch / Firewall Rules]
                                  |
    VLAN 200 (AI Infra):    [API Gateway] [Inference Server] [Audit DB]
    

    When to use: Facilities that need departmental access control. Radiology gets access to the imaging model. Pathology gets access to the report generation model. Emergency department gets access to triage assist. Each VLAN-to-VLAN rule is documented and auditable.

    Model Serving Stack

    The production stack for healthcare AI inference is straightforward.

    Core Components

    1. Inference engine: Ollama or llama.cpp. Ollama provides a REST API out of the box. llama.cpp offers lower-level control and slightly better performance.
    2. API gateway: Nginx or Envoy as a reverse proxy in front of the inference engine. Handles authentication, rate limiting, and TLS termination.
    3. mTLS between services: Every connection between the API gateway, inference engine, and audit database uses mutual TLS. No exceptions. The HIPAA Security Rule requires safeguarding ePHI in transit, and mutual TLS provides both encryption and authentication of each endpoint.

    Request Flow

    [Clinical App] --> [mTLS] --> [API Gateway (Nginx)]
        --> [Auth check: API key + department ID]
        --> [Rate limit check]
        --> [mTLS] --> [Ollama/llama.cpp]
        --> [Response logged to audit DB]
        --> [mTLS] --> [Clinical App]
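
    To make the flow concrete, here is a minimal client-side sketch of a clinical app calling the inference service through the gateway. The hostname, header name, and certificate paths are illustrative assumptions; the request body follows Ollama's /api/generate REST API.

```python
# Minimal client sketch: a clinical app calling the inference service
# through the gateway. Hostname, header name, and cert paths are
# illustrative; the payload shape follows Ollama's /api/generate REST API.
import requests

GATEWAY_URL = "https://ai-gateway.internal/api/generate"   # hypothetical internal host

response = requests.post(
    GATEWAY_URL,
    headers={"X-API-Key": "radiology-dept-key"},            # per-department key
    json={
        "model": "llama3.1:8b",
        "prompt": "Summarize the following clinical note: ...",
        "stream": False,
    },
    # Mutual TLS: the client presents its own cert/key and pins the internal CA.
    cert=("/etc/pki/clinical-app.crt", "/etc/pki/clinical-app.key"),
    verify="/etc/pki/internal-ca.crt",
    timeout=60,
)
response.raise_for_status()
print(response.json()["response"])
```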
    

    API Key Management

    Each department gets its own API key. This enables per-department usage tracking, rate limiting, and access control. Rotate keys quarterly. Store them in HashiCorp Vault or your existing secrets management system.
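    If you are rolling an interim solution before Vault is in place, the pattern to preserve is simple: hand the raw key to the department once, and persist only a hash plus a rotation date. A minimal sketch, with all names illustrative:

```python
# Minimal per-department API key issuance sketch. In production these
# values would live in Vault or your secrets manager; the structure here
# (hash-only storage, quarterly rotation date) is illustrative.
import hashlib
import secrets
from datetime import date, timedelta

def issue_key(department: str) -> tuple[str, dict]:
    """Return (raw_key, record). Only the record is persisted server-side."""
    raw_key = secrets.token_urlsafe(32)            # handed to the department once
    record = {
        "department": department,
        "key_hash": hashlib.sha256(raw_key.encode()).hexdigest(),
        "rotate_by": (date.today() + timedelta(days=90)).isoformat(),
    }
    return raw_key, record

raw, record = issue_key("radiology")
print(record)   # store this; the gateway compares hashes on each request
```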

    Storage Requirements

    Healthcare AI storage breaks into three categories with very different sizing profiles.

| Storage Type | Size | Growth Rate | Retention |
| --- | --- | --- | --- |
| Base model files | 4-14GB per model (quantized) | Static per version | Keep current + 1 previous version |
| LoRA adapter files | 50-200MB per specialty adapter | ~1-2 new adapters/quarter | Keep all versions (audit trail) |
| Audit logs | 10-50GB/year | Scales with usage | 6-7 years (HIPAA minimum 6) |
| Evaluation datasets | 1-5GB | Quarterly updates | Keep all versions |

    Total first-year storage: 30-70GB for models and adapters, plus audit log growth. A 1TB NVMe SSD handles 5+ years of operation with room to spare.
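    As a quick check on that claim, the arithmetic below takes the upper end of each range from the storage table; the evaluation-dataset line assumes every quarterly version is retained, which is the worst case.

```python
# Back-of-the-envelope 5-year storage estimate (GB), worst case per the
# table above. Figures are illustrative, not a capacity plan.
years      = 5
models     = 2 * 14              # current + 1 previous base model, ~14GB each
adapters   = years * 8 * 0.2     # up to 2 adapters/quarter, ~200MB each
audit_logs = years * 50          # 10-50GB/year, taking the high end
eval_sets  = years * 4 * 5       # quarterly updates kept forever, ~5GB each

total_gb = models + adapters + audit_logs + eval_sets
print(f"~{total_gb:.0f} GB after {years} years")   # comfortably under 1TB
```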

    Backup strategy: Encrypted backups to a secondary on-premise location. Never back up to cloud storage unless the cloud provider has a signed BAA and your risk assessment explicitly approves it.

    Monitoring and Logging

    HIPAA requires audit logging for any system that processes ePHI. For AI inference, this means every single request.

    What to Log Per Inference

| Field | Example | Purpose |
| --- | --- | --- |
| Timestamp | 2026-02-26T14:32:01Z | Audit trail |
| Request ID | uuid-v4 | Correlation |
| Model version | llama-3.1-8b-q4_K_M + radiology-v2.3 | Reproducibility |
| Department | radiology | Access control audit |
| User/service ID | ehr-integration-svc | Attribution |
| Input hash (SHA-256) | a3f2... | Integrity verification without storing PHI |
| Output hash (SHA-256) | b7c1... | Integrity verification |
| Token count (in/out) | 342 / 128 | Usage tracking |
| Latency (ms) | 1,240 | Performance monitoring |
| Status | success / error | Operations |

    Critical detail: Log input and output hashes, not raw content. This lets you verify integrity and prove which model version produced which output without storing additional copies of PHI in the audit database.
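    A minimal sketch of building one of these records per inference; the field names mirror the table above, and writing the resulting dict to your audit database (e.g. PostgreSQL) is left out.

```python
# Sketch of one audit record per inference. Only hashes of the prompt and
# completion are stored, never the raw text, so no PHI lands in the audit DB.
import hashlib
import uuid
from datetime import datetime, timezone

def audit_record(prompt: str, completion: str, department: str,
                 service_id: str, model_version: str,
                 tokens_in: int, tokens_out: int, latency_ms: int,
                 status: str = "success") -> dict:
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": str(uuid.uuid4()),
        "model_version": model_version,
        "department": department,
        "service_id": service_id,
        "input_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output_hash": hashlib.sha256(completion.encode("utf-8")).hexdigest(),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": latency_ms,
        "status": status,
    }
```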

    HIPAA Access Logging

    Beyond inference logging, you need standard HIPAA access logs:

    • Who accessed the AI system and when
    • Authentication successes and failures
    • Configuration changes (model updates, adapter swaps, parameter changes)
    • Administrative access to the inference server itself

    Use your existing SIEM (Splunk, Elastic, etc.) to aggregate these logs. The AI infrastructure should feed into the same logging pipeline as the rest of your clinical systems.

    Model Update Strategy

    Getting new model versions onto air-gapped or isolated systems is the biggest operational challenge.

    Option 1: Secure USB Transfer (Air-Gapped)

    1. Download model files on an internet-connected workstation in a secure room
    2. Verify checksums against published hashes
    3. Transfer to encrypted USB drive (FIPS 140-2 compliant)
    4. Transport via authorized personnel with chain-of-custody documentation
    5. Load onto inference server, verify checksums again (see the verification sketch below)
    6. Run validation suite before switching production traffic

    Time per update: 2-4 hours including verification and validation.
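    The checksum verification in steps 2 and 5 is worth scripting so it is done the same way on both sides of the air gap. A minimal sketch; the expected hash would come from the model publisher's release notes.

```python
# Verify a model file's SHA-256 against a published checksum.
# Usage: python verify_model.py /models/llama-3.1-8b-q4.gguf <expected_sha256>
import hashlib
import sys

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    path, expected = sys.argv[1], sys.argv[2].lower()
    actual = sha256_of(path)
    if actual != expected:
        sys.exit(f"MISMATCH: {path}\n  expected {expected}\n  got      {actual}")
    print(f"OK: {path} matches published checksum")
```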

    Option 2: Internal Artifact Registry (DMZ)

    1. Automated pull from external model registry (Hugging Face, Ollama registry) through the DMZ proxy
    2. Model files land in an internal artifact registry (Nexus, Artifactory, or a simple Nginx file server)
    3. Inference server pulls from the internal registry on a scheduled basis
    4. Automated validation suite runs before traffic is switched

    Time per update: 30-60 minutes, mostly automated.

    Staged Rollout

    Regardless of delivery method, follow a staged rollout:

    1. Canary (5% traffic): Route a small percentage of non-critical requests to the new model
    2. Validation (24-48 hours): Compare output quality metrics against the previous version
    3. Full rollout: Switch all traffic to the new version
    4. Rollback window: Keep the previous version loaded and ready for instant rollback for 7 days
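    The canary split itself can be very simple. The sketch below shows only the routing decision; in practice it would live in the API gateway, and the model tags are illustrative. Because the audit log already records the model version per request, comparing canary and stable output metrics is a query against data you are already collecting.

```python
# Canary routing sketch: send ~5% of non-critical requests to the new
# model version. Model tags are illustrative.
import random

CANARY_FRACTION = 0.05
STABLE_MODEL = "llama3.1:8b-radiology-v2.2"
CANARY_MODEL = "llama3.1:8b-radiology-v2.3"

def pick_model(request_is_critical: bool) -> str:
    """Critical requests always use the stable model during the canary phase."""
    if request_is_critical:
        return STABLE_MODEL
    return CANARY_MODEL if random.random() < CANARY_FRACTION else STABLE_MODEL
```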

    Disaster Recovery

    AI system failures in clinical environments need clear fallback procedures.

    Failure Modes and Responses

| Failure | RTO Target | Response |
| --- | --- | --- |
| GPU failure | 4 hours | Failover to CPU inference (degraded throughput) |
| Inference server crash | 15 minutes | Restart service, auto-recovery |
| Model file corruption | 1 hour | Restore from local backup, re-verify checksums |
| Complete server failure | 8 hours | Restore from backup to standby hardware |
| Network partition | Immediate | Clinical apps fall back to non-AI workflows |

    CPU Failback

    Every GPU-accelerated deployment should have a tested CPU fallback path. If the GPU fails:

    1. Ollama/llama.cpp automatically falls back to CPU inference
    2. Throughput drops from ~30 tokens/sec to ~5 tokens/sec
    3. Reduce concurrent request limit from 10 to 2
    4. Prioritize real-time clinical use cases, queue batch jobs

    This degraded mode keeps AI available while hardware is replaced. No clinical workflow should have a hard dependency on AI — it should always be assistive, with human fallback.
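    One way to wire the degraded mode into the gateway is a startup health check that sets the concurrency limit based on GPU availability. This sketch assumes nvidia-smi is on the PATH and mirrors the limits from the failback steps above.

```python
# Degraded-mode check sketch: detect whether the GPU is healthy and set
# the gateway's concurrency limit accordingly.
import shutil
import subprocess

def gpu_available() -> bool:
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        subprocess.run(["nvidia-smi"], check=True, capture_output=True, timeout=10)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False

MAX_CONCURRENT_REQUESTS = 10 if gpu_available() else 2
print(f"Concurrency limit: {MAX_CONCURRENT_REQUESTS}")
```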

    Cost Comparison: On-Premise vs Cloud API

    The math favours on-premise at healthcare volumes.

    3-Year Total Cost of Ownership

| Cost Component | On-Premise (T4 GPU) | Cloud API (GPT-4-class, BAA) |
| --- | --- | --- |
| Hardware (year 0) | $10,000 | $0 |
| Software/licensing | $0 (open-source stack) | $0 |
| API costs (year 1) | $0 | $36,000-72,000 |
| API costs (year 2) | $0 | $36,000-72,000 |
| API costs (year 3) | $0 | $36,000-72,000 |
| Power/cooling (3 years) | $1,800 | $0 |
| DevOps time (3 years) | $15,000 (part-time) | $5,000 (integration only) |
| BAA/compliance costs | $0 (internal) | $5,000-15,000 (vendor assessment) |
| 3-Year Total | $26,800 | $118,000-236,000 |

    Assumptions: 1,000 inferences/day, average 500 input + 200 output tokens. Cloud pricing at $0.01-0.03/1K tokens (BAA-covered tier, typically 2-3x standard pricing). DevOps at $75/hour, 4 hours/week on-premise vs 1 hour/week cloud.

    The breakeven point is typically around 200-300 inferences per day. Below that, cloud APIs with a BAA may be more cost-effective. Above that, on-premise wins and the gap widens every month.
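    A quick way to sanity-check that breakeven is to spread the on-premise 3-year TCO evenly per day and divide by the effective cloud cost per inference implied by the table. The sketch below is illustrative arithmetic, not a pricing model.

```python
# Rough breakeven sketch using the table's own figures: cloud API spend of
# $36k-72k/year at 1,000 inferences/day versus the ~$26,800 3-year
# on-premise TCO spread evenly over 3 years.
ON_PREM_3YR_TCO = 26_800
on_prem_per_day = ON_PREM_3YR_TCO / (3 * 365)             # fixed cost, volume-independent

for annual_api_cost in (36_000, 72_000):
    cloud_per_inference = annual_api_cost / 365 / 1_000    # table assumes 1,000 inferences/day
    breakeven = on_prem_per_day / cloud_per_inference
    print(f"${annual_api_cost:,}/yr cloud -> breakeven ~ {breakeven:.0f} inferences/day")
```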

    Team Requirements

    You do not need a dedicated ML team. You need:

    • 1 DevOps/infrastructure engineer (part-time, ~4 hours/week): Handles server maintenance, model updates, monitoring alerts, and security patching. This person already exists on your IT team.
    • 1 clinical champion per department: A clinician who owns the use case, validates outputs, and provides feedback for fine-tuning. Not a technical role.
    • Vendor support (optional): Ertas or similar platform for fine-tuning, adapter management, and deployment tooling. Eliminates the need for ML expertise.

    The most common mistake is over-staffing. On-premise AI inference is operationally similar to running any other internal service. If your team can manage an internal database server, they can manage an AI inference server.


    Putting It Together

    Here is the complete architecture for a mid-size hospital deployment:

    Internet (updates only)
        |
    [DMZ: Update Proxy]
        |
    [Internal Network - VLAN 200: AI Infrastructure]
        |
    [Artifact Registry] --> [Inference Server: T4 GPU + Ollama]
                                  |               |
                        [API Gateway (Nginx)]  [Audit DB (PostgreSQL)]
                                  |
                        [mTLS + API Key Auth]
                                  |
    [VLAN 100: Clinical Systems]
        |           |           |
      [EHR]     [PACS]    [Clinical Apps]
    

    Day-one deployment: Single T4 server, one department, one use case (clinical note summarization). Total cost under $12,000. Time to production: 2-3 weeks with existing IT staff.

    Scale path: Add LoRA adapters for new departments. Add a second T4 for higher throughput. Add specialty models for radiology, pathology, coding. Each expansion is incremental — no rearchitecture needed.

    The infrastructure is the easy part. The model serving stack is proven. The network patterns are well-understood. What matters is getting the first use case into production and proving value to the clinical staff who will use it every day.
