
On-Premise Healthcare AI: Architecture and Infrastructure Guide
A practical infrastructure guide for deploying AI on-premise in healthcare environments. Covers hardware requirements, network architecture, air-gapped deployment, HIPAA audit logging, model update strategies, and real cost comparisons against cloud APIs.
Your hospital's IT team says "we can't use cloud AI." They're right. PHI leaving your network is a compliance event. Every API call to OpenAI or Anthropic with patient data creates audit liability, BAA complexity, and breach risk.
But here's the part they might not know yet: on-premise AI is now practical and affordable. A single NVIDIA T4 GPU costs less than a mid-range workstation. Open-source models run clinical NLP tasks at production quality. The infrastructure patterns are well-established.
This guide covers exactly what you need — hardware, network architecture, model serving, storage, monitoring, updates, and disaster recovery — to run AI on-premise in a healthcare environment.
Hardware Requirements
The first decision is GPU vs CPU inference. This depends on your volume and latency requirements.
GPU vs CPU Inference for Healthcare Volumes
| Factor | GPU (NVIDIA T4) | CPU-Only (Xeon/EPYC) |
|---|---|---|
| Hardware cost | $2,000-3,000 per card | $0 additional (use existing servers) |
| Throughput | 15-40 tokens/sec (7B model, Q4) | 3-8 tokens/sec (7B model, Q4) |
| Concurrent users | 10-20 simultaneous requests | 2-5 simultaneous requests |
| Best for | >500 inferences/day, real-time triage | <200 inferences/day, batch processing |
| Power draw | 70W per T4 | Included in server baseline |
| Rack space | 1U per 2-4 GPUs | Existing server infrastructure |
For most mid-size hospitals (200-500 beds): Start with a single T4 GPU. It handles clinical note summarization, diagnostic coding assistance, and patient triage at volumes that cover 3-5 departments. Total hardware cost: $8,000-12,000 for a complete inference server (CPU + RAM + T4 + storage).
For smaller clinics (under 100 beds): CPU-only inference is viable. A modern 32-core Xeon server with 64GB RAM runs quantized 7B models at acceptable latency for non-real-time tasks like overnight batch processing of clinical notes or weekly report generation.
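If you want to sanity-check these thresholds against your own workload, the capacity math is simple. Here is a minimal sketch using the throughput figures from the table above; the utilization factor is an assumption you should tune to your actual traffic pattern:

```python
# Rough inference-capacity estimate from the throughput figures above.
# All numbers are illustrative assumptions; substitute your own measurements.

def daily_capacity(tokens_per_sec: float, avg_output_tokens: int,
                   utilization: float = 0.5) -> int:
    """Approximate inferences/day a single server can sustain.

    utilization discounts for bursty clinical traffic (idle nights,
    morning peaks); 0.5 assumes the server is busy half the day.
    """
    seconds_per_inference = avg_output_tokens / tokens_per_sec
    return int(86_400 * utilization / seconds_per_inference)

# T4 GPU at ~30 tokens/sec vs CPU-only at ~5 tokens/sec, 200-token outputs
print(daily_capacity(30, 200))  # ~6,480 inferences/day
print(daily_capacity(5, 200))   # ~1,080 inferences/day
```

Both paths comfortably exceed the daily-volume thresholds above; the real constraints are latency and concurrent users, which is why real-time use cases push you toward the GPU.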
Minimum Server Specifications
| Component | GPU Path | CPU-Only Path |
|---|---|---|
| CPU | 16+ cores (Xeon Silver or EPYC) | 32+ cores (Xeon Gold or EPYC) |
| RAM | 32GB minimum, 64GB recommended | 64GB minimum, 128GB recommended |
| GPU | NVIDIA T4 16GB (or A2000 12GB) | None |
| Storage | 500GB NVMe SSD | 500GB NVMe SSD |
| Network | 1GbE minimum, 10GbE recommended | 1GbE minimum |
| OS | Ubuntu 22.04 LTS or RHEL 9 | Ubuntu 22.04 LTS or RHEL 9 |
Network Architecture
Healthcare AI deployments fall into three network patterns, each with different security profiles.
Pattern 1: Air-Gapped Deployment
The strictest option. The inference server has zero internet connectivity.
[Clinical Systems] <---> [Internal API Gateway] <---> [AI Inference Server]
|
[Audit Log DB]
No external network connection. Model updates via secure media.
When to use: Highest-security environments. Facilities handling military health records, psychiatric records, substance abuse treatment records (42 CFR Part 2), or research data under strict IRB protocols.
Trade-off: Model updates require physical media (encrypted USB) or a dedicated internal artifact registry. No remote monitoring. Higher operational overhead.
Pattern 2: DMZ Deployment
The inference server sits in a DMZ with controlled outbound access for updates only. No inbound connections from the internet.
[Internet] --X-- [Firewall] --- [DMZ: Update Proxy] --- [Firewall] --- [AI Inference Server]
|
[Clinical Systems] <-----------------------------------------> [Internal API Gateway]
When to use: Most hospital deployments. Allows automated model updates through a controlled proxy while keeping PHI processing fully internal.
Trade-off: Requires careful firewall rules. The update proxy must be hardened and audited.
Pattern 3: VLAN Isolation
The AI infrastructure runs on a dedicated VLAN, segmented from general hospital network traffic but accessible to authorized clinical systems.
VLAN 100 (Clinical): [EHR] [PACS] [Clinical Apps]
|
[L3 Switch / Firewall Rules]
|
VLAN 200 (AI Infra): [API Gateway] [Inference Server] [Audit DB]
When to use: Facilities that need departmental access control. Radiology gets access to the imaging model. Pathology gets access to the report generation model. Emergency department gets access to triage assist. Each VLAN-to-VLAN rule is documented and auditable.
Model Serving Stack
The production stack for healthcare AI inference is straightforward.
Core Components
- Inference engine: Ollama or llama.cpp. Ollama provides a REST API out of the box. llama.cpp offers lower-level control and slightly better performance.
- API gateway: Nginx or Envoy as a reverse proxy in front of the inference engine. Handles authentication, rate limiting, and TLS termination.
- mTLS between services: Every connection between the API gateway, inference engine, and audit database uses mutual TLS. No exceptions. The HIPAA Security Rule treats encryption of ePHI in transit as an addressable specification, which in practice means you implement it or document an equivalent alternative; mutual TLS satisfies it while also authenticating both ends of every connection.
Request Flow
[Clinical App] --> [mTLS] --> [API Gateway (Nginx)]
--> [Auth check: API key + department ID]
--> [Rate limit check]
--> [mTLS] --> [Ollama/llama.cpp]
--> [Response logged to audit DB]
--> [mTLS] --> [Clinical App]
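For a concrete picture of the client side of that flow, here is a minimal sketch in Python using the requests library. The gateway hostname, header names, and certificate paths are hypothetical placeholders, not part of any fixed API:

```python
# Hypothetical client-side call through the API gateway over mutual TLS.
# Hostname, cert paths, and header names are placeholders for illustration.
import requests

GATEWAY = "https://ai-gateway.internal:8443/v1/generate"

resp = requests.post(
    GATEWAY,
    json={"prompt": "Summarize the following clinical note: ...",
          "max_tokens": 256},
    headers={"X-API-Key": "dept-radiology-key",   # per-department key
             "X-Department-ID": "radiology"},
    # Client certificate + key prove this app's identity (mTLS)
    cert=("/etc/pki/ehr-client.crt", "/etc/pki/ehr-client.key"),
    # Pin the internal CA rather than the system trust store
    verify="/etc/pki/internal-ca.pem",
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```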
API Key Management
Each department gets its own API key. This enables per-department usage tracking, rate limiting, and access control. Rotate keys quarterly. Store them in HashiCorp Vault or your existing secrets management system.
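On the gateway side, key checks should use constant-time comparison against hashed keys so plaintext secrets never sit on the inference server. A minimal sketch; the departments and keys shown are hypothetical, and in production the hashes would be loaded from Vault at startup:

```python
# Constant-time API-key check against hashed keys fetched from the
# secrets store at startup. Department names and keys are hypothetical.
import hashlib
import hmac

# In production, load these from Vault; never hardcode real keys.
KEY_HASHES = {
    "radiology": hashlib.sha256(b"dept-radiology-key").hexdigest(),
    "pathology": hashlib.sha256(b"dept-pathology-key").hexdigest(),
}

def authorize(department: str, presented_key: str) -> bool:
    """Return True only if the presented key matches the department's hash."""
    expected = KEY_HASHES.get(department)
    if expected is None:
        return False
    presented = hashlib.sha256(presented_key.encode()).hexdigest()
    # hmac.compare_digest avoids timing side channels
    return hmac.compare_digest(expected, presented)

assert authorize("radiology", "dept-radiology-key")
assert not authorize("radiology", "wrong-key")
```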
Storage Requirements
Healthcare AI storage breaks into four categories with very different sizing profiles.
| Storage Type | Size | Growth Rate | Retention |
|---|---|---|---|
| Base model files | 4-14GB per model (quantized) | Static per version | Keep current + 1 previous version |
| LoRA adapter files | 50-200MB per specialty adapter | ~1-2 new adapters/quarter | Keep all versions (audit trail) |
| Audit logs | 10-50GB/year | Scales with usage | 6-7 years (HIPAA minimum 6) |
| Evaluation datasets | 1-5GB | Quarterly updates | Keep all versions |
Total first-year storage: 30-70GB for models and adapters, plus audit log growth. A 1TB NVMe SSD handles 5+ years of operation with room to spare.
Backup strategy: Encrypted backups to a secondary on-premise location. Never back up to cloud storage unless the cloud provider has a signed BAA and your risk assessment explicitly approves it.
Monitoring and Logging
HIPAA requires audit logging for any system that processes ePHI. For AI inference, this means every single request.
What to Log Per Inference
| Field | Example | Purpose |
|---|---|---|
| Timestamp | 2026-02-26T14:32:01Z | Audit trail |
| Request ID | uuid-v4 | Correlation |
| Model version | llama-3.1-8b-q4_K_M + radiology-v2.3 | Reproducibility |
| Department | radiology | Access control audit |
| User/service ID | ehr-integration-svc | Attribution |
| Input hash (SHA-256) | a3f2... | Integrity verification without storing PHI |
| Output hash (SHA-256) | b7c1... | Integrity verification |
| Token count (in/out) | 342 / 128 | Usage tracking |
| Latency (ms) | 1,240 | Performance monitoring |
| Status | success / error | Operations |
Critical detail: Log input and output hashes, not raw content. This lets you verify integrity and prove which model version produced which output without storing additional copies of PHI in the audit database.
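A minimal sketch of assembling one such record in Python; the field names mirror the table above, and in production the record would be inserted into the audit database rather than printed:

```python
# Build one audit record per inference: hash the PHI, never store it.
import hashlib
import json
import uuid
from datetime import datetime, timezone

def audit_record(department, service_id, model_version,
                 prompt, completion, tokens_in, tokens_out,
                 latency_ms, status="success"):
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": str(uuid.uuid4()),
        "model_version": model_version,
        "department": department,
        "service_id": service_id,
        # SHA-256 digests allow integrity checks without retaining PHI
        "input_hash": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_hash": hashlib.sha256(completion.encode()).hexdigest(),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": latency_ms,
        "status": status,
    }

record = audit_record("radiology", "ehr-integration-svc",
                      "llama-3.1-8b-q4_K_M + radiology-v2.3",
                      prompt="<clinical note text>",
                      completion="<model summary>",
                      tokens_in=342, tokens_out=128, latency_ms=1240)
print(json.dumps(record, indent=2))
```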
HIPAA Access Logging
Beyond inference logging, you need standard HIPAA access logs:
- Who accessed the AI system and when
- Authentication successes and failures
- Configuration changes (model updates, adapter swaps, parameter changes)
- Administrative access to the inference server itself
Use your existing SIEM (Splunk, Elastic, etc.) to aggregate these logs. The AI infrastructure should feed into the same logging pipeline as the rest of your clinical systems.
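Feeding the SIEM is typically a one-line syslog hop. A minimal sketch using Python's standard logging module; the SIEM hostname, port, and event format are assumptions to adapt to your pipeline:

```python
# Forward AI audit events to the existing SIEM over syslog.
# The SIEM address and event format are placeholders for illustration.
import logging
from logging.handlers import SysLogHandler

logger = logging.getLogger("ai-audit")
logger.setLevel(logging.INFO)
logger.addHandler(SysLogHandler(address=("siem.internal", 514)))

logger.info("auth_failure department=radiology key_id=rad-2026Q1 src=10.100.4.22")
logger.info("config_change actor=admin action=adapter_swap adapter=radiology-v2.3")
```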
Model Update Strategy
Getting new model versions onto air-gapped or isolated systems is the biggest operational challenge.
Option 1: Secure USB Transfer (Air-Gapped)
- Download model files on an internet-connected workstation in a secure room
- Verify checksums against published hashes
- Transfer to encrypted USB drive (FIPS 140-2 compliant)
- Transport via authorized personnel with chain-of-custody documentation
- Load onto inference server, verify checksums again (see the sketch below)
- Run validation suite before switching production traffic
Time per update: 2-4 hours including verification and validation.
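Both checksum checks (on the download workstation and again on the inference server) can share one verification script. A minimal sketch, with the published hash supplied as a command-line argument:

```python
# Verify a model file's SHA-256 against the published hash before use.
import hashlib
import sys

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in 1 MiB chunks so multi-GB model files fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    path, published = sys.argv[1], sys.argv[2].lower()
    actual = sha256_file(path)
    if actual != published:
        sys.exit(f"CHECKSUM MISMATCH for {path}: got {actual}")
    print(f"OK {path} {actual}")
```

A mismatch at either end halts the update before the model ever reaches production traffic.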
Option 2: Internal Artifact Registry (DMZ)
- Automated pull from external model registry (Hugging Face, Ollama registry) through the DMZ proxy
- Model files land in an internal artifact registry (Nexus, Artifactory, or a simple Nginx file server)
- Inference server pulls from the internal registry on a scheduled basis (see the pull-and-verify sketch below)
- Automated validation suite runs before traffic is switched
Time per update: 30-60 minutes, mostly automated.
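The scheduled pull can reuse the same hashing logic and only swap a new model into place after verification passes. A minimal sketch; the registry URL and file paths are hypothetical:

```python
# Pull a model from the internal artifact registry, verify it, then
# swap it into place atomically. URL and paths are placeholders.
import hashlib
import os
import requests

REGISTRY = "https://artifacts.internal/models"

def pull_and_install(name: str, published_sha256: str,
                     dest_dir: str = "/opt/models") -> None:
    tmp_path = os.path.join(dest_dir, name + ".download")
    final_path = os.path.join(dest_dir, name)

    h = hashlib.sha256()
    with requests.get(f"{REGISTRY}/{name}", stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(tmp_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
                h.update(chunk)

    if h.hexdigest() != published_sha256.lower():
        os.remove(tmp_path)
        raise RuntimeError(f"checksum mismatch for {name}")

    # os.replace is atomic on the same filesystem: the serving path
    # never points at a partially written file
    os.replace(tmp_path, final_path)
```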
Staged Rollout
Regardless of delivery method, follow a staged rollout (a minimal routing sketch follows the list):
- Canary (5% traffic): Route a small percentage of non-critical requests to the new model
- Validation (24-48 hours): Compare output quality metrics against the previous version
- Full rollout: Switch all traffic to the new version
- Rollback window: Keep the previous version loaded and ready for instant rollback for 7 days
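A minimal sketch of the canary split, assuming two Ollama instances on adjacent ports (the second port and the endpoint names are hypothetical):

```python
# Weighted canary routing between the current and candidate model
# endpoints. The 5% weight mirrors the rollout plan above.
import random

ENDPOINTS = {
    "stable": "http://127.0.0.1:11434",   # current production model
    "canary": "http://127.0.0.1:11435",   # new version under validation
}
CANARY_FRACTION = 0.05

def pick_endpoint(critical: bool) -> str:
    """Route ~5% of non-critical requests to the canary."""
    if not critical and random.random() < CANARY_FRACTION:
        return ENDPOINTS["canary"]
    return ENDPOINTS["stable"]

# Tag every response with the endpoint that served it so the 24-48 hour
# validation window can attribute quality metrics per model version.
```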
Disaster Recovery
AI system failures in clinical environments need clear fallback procedures.
Failure Modes and Responses
| Failure | RTO Target | Response |
|---|---|---|
| GPU failure | 4 hours | Failover to CPU inference (degraded throughput) |
| Inference server crash | 15 minutes | Restart service, auto-recovery |
| Model file corruption | 1 hour | Restore from local backup, re-verify checksums |
| Complete server failure | 8 hours | Restore from backup to standby hardware |
| Network partition | Immediate | Clinical apps fall back to non-AI workflows |
CPU Failback
Every GPU-accelerated deployment should have a tested CPU fallback path; a detection sketch follows this list. If the GPU fails:
- Ollama/llama.cpp automatically falls back to CPU inference
- Throughput drops from ~30 tokens/sec to ~5 tokens/sec
- Reduce concurrent request limit from 10 to 2
- Prioritize real-time clinical use cases, queue batch jobs
This degraded mode keeps AI available while hardware is replaced. No clinical workflow should have a hard dependency on AI — it should always be assistive, with human fallback.
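A minimal sketch of the detection half, assuming nvidia-smi is on the PATH and the gateway's rate limiter can be updated from a periodic check:

```python
# Detect GPU loss and drop the concurrency limit to the CPU-failback
# level. nvidia-smi exits non-zero when no GPU responds.
import subprocess

GPU_CONCURRENCY = 10
CPU_CONCURRENCY = 2

def gpu_available() -> bool:
    try:
        subprocess.run(["nvidia-smi"], check=True,
                       capture_output=True, timeout=10)
        return True
    except (subprocess.CalledProcessError, FileNotFoundError,
            subprocess.TimeoutExpired):
        return False

def current_concurrency_limit() -> int:
    # Feed this into the gateway's rate limiter on a periodic check
    return GPU_CONCURRENCY if gpu_available() else CPU_CONCURRENCY
```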
Cost Comparison: On-Premise vs Cloud API
The math favors on-premise at healthcare volumes.
3-Year Total Cost of Ownership
| Cost Component | On-Premise (T4 GPU) | Cloud API (GPT-4-class, BAA) |
|---|---|---|
| Hardware (year 0) | $10,000 | $0 |
| Software/licensing | $0 (open-source stack) | $0 |
| API costs (year 1) | $0 | $36,000-72,000 |
| API costs (year 2) | $0 | $36,000-72,000 |
| API costs (year 3) | $0 | $36,000-72,000 |
| Power/cooling (3 years) | $1,800 | $0 |
| DevOps time (3 years) | $15,000 (part-time) | $5,000 (integration only) |
| BAA/compliance costs | $0 (internal) | $5,000-15,000 (vendor assessment) |
| 3-Year Total | $26,800 | $118,000-236,000 |
Assumptions: 1,000 inferences/day, average 500 input + 200 output tokens, which works out to roughly $0.10-0.20 per inference at BAA-covered rates (typically 2-3x standard pricing). DevOps at $75/hour; the on-premise line prices incremental hands-on time (averaging 1-1.5 hours/week over three years, since the work folds into an existing IT role), while the cloud line covers up-front integration only.
The breakeven point is typically around 200-300 inferences per day. Below that, cloud APIs with a BAA may be more cost-effective. Above that, on-premise wins and the gap widens every month.
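To rerun the breakeven with your own numbers, the arithmetic fits in a few lines. A minimal sketch comparing on-premise TCO against cloud API spend alone; the $0.10 per-inference figure is the low end of the BAA-covered estimate above:

```python
# 3-year breakeven: on-premise TCO vs cloud API spend alone.
# Dollar figures mirror the assumptions above; adjust to your quotes.
ONPREM_TCO = 26_800          # hardware + power + devops over 3 years
COST_PER_INFERENCE = 0.10    # low end of the BAA-covered estimate
DAYS = 3 * 365

breakeven = ONPREM_TCO / (COST_PER_INFERENCE * DAYS)
print(f"breakeven: ~{breakeven:.0f} inferences/day")  # ~245/day

# At the 1,000/day volume assumed above:
print(f"cloud API spend: ${1_000 * COST_PER_INFERENCE * DAYS:,.0f}")  # $109,500
print(f"on-premise TCO:  ${ONPREM_TCO:,}")                            # $26,800
```

Higher BAA-covered rates only pull the breakeven lower, so the 200-300 figure is conservative.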
Team Requirements
You do not need a dedicated ML team. You need:
- 1 DevOps/infrastructure engineer (part-time, ~4 hours/week): Handles server maintenance, model updates, monitoring alerts, and security patching. This person already exists on your IT team.
- 1 clinical champion per department: A clinician who owns the use case, validates outputs, and provides feedback for fine-tuning. Not a technical role.
- Vendor support (optional): Ertas or similar platform for fine-tuning, adapter management, and deployment tooling. Eliminates the need for ML expertise.
The most common mistake is over-staffing. On-premise AI inference is operationally similar to running any other internal service. If your team can manage an internal database server, they can manage an AI inference server.
Putting It Together
Here is the complete architecture for a mid-size hospital deployment:
Internet (updates only)
|
[DMZ: Update Proxy]
|
[Internal Network - VLAN 200: AI Infrastructure]
|
[Artifact Registry] --> [Inference Server: T4 GPU + Ollama]
| |
[API Gateway (Nginx)] [Audit DB (PostgreSQL)]
|
[mTLS + API Key Auth]
|
[VLAN 100: Clinical Systems]
| | |
[EHR] [PACS] [Clinical Apps]
Day-one deployment: Single T4 server, one department, one use case (clinical note summarization). Total cost under $12,000. Time to production: 2-3 weeks with existing IT staff.
Scale path: Add LoRA adapters for new departments. Add a second T4 for higher throughput. Add specialty models for radiology, pathology, coding. Each expansion is incremental — no rearchitecture needed.
The infrastructure is the easy part. The model serving stack is proven. The network patterns are well-understood. What matters is getting the first use case into production and proving value to the clinical staff who will use it every day.
Further Reading
- GPU Cost Comparison for Self-Hosting AI in 2026 — Detailed hardware benchmarks and pricing for inference workloads
- Self-Hosted AI ROI Calculator — Build a business case with real numbers for your specific volume
- HIPAA-Compliant AI: On-Premise vs Cloud API — Deep dive into compliance requirements and architectural trade-offs