
On-Premise Healthcare AI: Architecture and Infrastructure Guide
A practical infrastructure guide for deploying AI on-premise in healthcare environments. Covers hardware requirements, network architecture, air-gapped deployment, HIPAA audit logging, model update strategies, and real cost comparisons against cloud APIs.
Your hospital's IT team says "we can't use cloud AI." They're right. PHI leaving your network is a compliance event. Every API call to OpenAI or Anthropic with patient data creates audit liability, BAA complexity, and breach risk.
But here's the part they might not know yet: on-premise AI is now practical and affordable. A single NVIDIA T4 GPU costs less than a mid-range workstation. Open-source models run clinical NLP tasks at production quality. The infrastructure patterns are well-established.
This guide covers exactly what you need — hardware, network architecture, model serving, storage, monitoring, updates, and disaster recovery — to run AI on-premise in a healthcare environment.
Hardware Requirements
The first decision is GPU vs CPU inference. This depends on your volume and latency requirements.
GPU vs CPU Inference for Healthcare Volumes
| Factor | GPU (NVIDIA T4) | CPU-Only (Xeon/EPYC) |
|---|---|---|
| Hardware cost | $2,000-3,000 per card | $0 additional (use existing servers) |
| Throughput | 15-40 tokens/sec (7B model, Q4) | 3-8 tokens/sec (7B model, Q4) |
| Concurrent users | 10-20 simultaneous requests | 2-5 simultaneous requests |
| Best for | >500 inferences/day, real-time triage | <200 inferences/day, batch processing |
| Power draw | 70W per T4 | Included in server baseline |
| Rack space | 1U per 2-4 GPUs | Existing server infrastructure |
For most mid-size hospitals (200-500 beds): Start with a single T4 GPU. It handles clinical note summarization, diagnostic coding assistance, and patient triage at volumes that cover 3-5 departments. Total hardware cost: $8,000-12,000 for a complete inference server (CPU + RAM + T4 + storage).
For smaller clinics (under 100 beds): CPU-only inference is viable. A modern 32-core Xeon server with 64GB RAM runs quantized 7B models at acceptable latency for non-real-time tasks like overnight batch processing of clinical notes or weekly report generation.
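If you want to sanity-check these thresholds against your own workload, the capacity math is simple. Here is a minimal sketch using the throughput figures from the table above; the utilization factor is an assumption you should tune to your actual traffic pattern:

```python
# Rough inference-capacity estimate from the throughput figures above.
# All numbers are illustrative assumptions; substitute your own measurements.

def daily_capacity(tokens_per_sec: float, avg_output_tokens: int,
                   utilization: float = 0.5) -> int:
    """Approximate inferences/day a single server can sustain.

    utilization discounts for bursty clinical traffic (idle nights,
    morning peaks); 0.5 assumes the server is busy half the day.
    """
    seconds_per_inference = avg_output_tokens / tokens_per_sec
    return int(86_400 * utilization / seconds_per_inference)

# T4 GPU at ~30 tokens/sec vs CPU-only at ~5 tokens/sec, 200-token outputs
print(daily_capacity(30, 200))  # ~6,480 inferences/day
print(daily_capacity(5, 200))   # ~1,080 inferences/day
```

Both paths comfortably exceed the daily-volume thresholds above; the real constraints are latency and concurrent users, which is why real-time use cases push you toward the GPU.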
Minimum Server Specifications
| Component | GPU Path | CPU-Only Path |
|---|---|---|
| CPU | 16+ cores (Xeon Silver or EPYC) | 32+ cores (Xeon Gold or EPYC) |
| RAM | 32GB minimum, 64GB recommended | 64GB minimum, 128GB recommended |
| GPU | NVIDIA T4 16GB (or A2000 12GB) | None |
| Storage | 500GB NVMe SSD | 500GB NVMe SSD |
| Network | 1GbE minimum, 10GbE recommended | 1GbE minimum |
| OS | Ubuntu 22.04 LTS or RHEL 9 | Ubuntu 22.04 LTS or RHEL 9 |
Network Architecture
Healthcare AI deployments fall into three network patterns, each with different security profiles.
Pattern 1: Air-Gapped Deployment
The strictest option. The inference server has zero internet connectivity.
[Clinical Systems] <---> [Internal API Gateway] <---> [AI Inference Server]
|
[Audit Log DB]
No external network connection. Model updates via secure media.
When to use: Highest-security environments. Facilities handling military health records, psychiatric records, substance abuse treatment records (42 CFR Part 2), or research data under strict IRB protocols.
Trade-off: Model updates require physical media (encrypted USB) or a dedicated internal artifact registry. No remote monitoring. Higher operational overhead.
Pattern 2: DMZ Deployment
The inference server sits in a DMZ with controlled outbound access for updates only. No inbound connections from the internet.
[Internet] --X-- [Firewall] --- [DMZ: Update Proxy] --- [Firewall] --- [AI Inference Server]
|
[Clinical Systems] <-----------------------------------------> [Internal API Gateway]
When to use: Most hospital deployments. Allows automated model updates through a controlled proxy while keeping PHI processing fully internal.
Trade-off: Requires careful firewall rules. The update proxy must be hardened and audited.
Pattern 3: VLAN Isolation
The AI infrastructure runs on a dedicated VLAN, segmented from general hospital network traffic but accessible to authorized clinical systems.
VLAN 100 (Clinical): [EHR] [PACS] [Clinical Apps]
|
[L3 Switch / Firewall Rules]
|
VLAN 200 (AI Infra): [API Gateway] [Inference Server] [Audit DB]
When to use: Facilities that need departmental access control. Radiology gets access to the imaging model. Pathology gets access to the report generation model. Emergency department gets access to triage assist. Each VLAN-to-VLAN rule is documented and auditable.
Model Serving Stack
The production stack for healthcare AI inference is straightforward.
Core Components
- Inference engine: Ollama or llama.cpp. Ollama provides a REST API out of the box. llama.cpp offers lower-level control and slightly better performance.
- API gateway: Nginx or Envoy as a reverse proxy in front of the inference engine. Handles authentication, rate limiting, and TLS termination.
- mTLS between services: Every connection between the API gateway, inference engine, and audit database uses mutual TLS. No exceptions. The HIPAA Security Rule treats encryption of ePHI in transit as an addressable specification, which in practice means you implement it or document an equivalent alternative; mutual TLS satisfies it while also authenticating both ends of every connection.
Request Flow
[Clinical App] --> [mTLS] --> [API Gateway (Nginx)]
--> [Auth check: API key + department ID]
--> [Rate limit check]
--> [mTLS] --> [Ollama/llama.cpp]
--> [Response logged to audit DB]
--> [mTLS] --> [Clinical App]
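For a concrete picture of the client side of that flow, here is a minimal sketch in Python using the requests library. The gateway hostname, header names, and certificate paths are hypothetical placeholders, not part of any fixed API:

```python
# Hypothetical client-side call through the API gateway over mutual TLS.
# Hostname, cert paths, and header names are placeholders for illustration.
import requests

GATEWAY = "https://ai-gateway.internal:8443/v1/generate"

resp = requests.post(
    GATEWAY,
    json={"prompt": "Summarize the following clinical note: ...",
          "max_tokens": 256},
    headers={"X-API-Key": "dept-radiology-key",   # per-department key
             "X-Department-ID": "radiology"},
    # Client certificate + key prove this app's identity (mTLS)
    cert=("/etc/pki/ehr-client.crt", "/etc/pki/ehr-client.key"),
    # Pin the internal CA rather than the system trust store
    verify="/etc/pki/internal-ca.pem",
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```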
API Key Management
Each department gets its own API key. This enables per-department usage tracking, rate limiting, and access control. Rotate keys quarterly. Store them in HashiCorp Vault or your existing secrets management system.
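On the gateway side, key checks should use constant-time comparison against hashed keys so plaintext secrets never sit on the inference server. A minimal sketch; the departments and keys shown are hypothetical, and in production the hashes would be loaded from Vault at startup:

```python
# Constant-time API-key check against hashed keys fetched from the
# secrets store at startup. Department names and keys are hypothetical.
import hashlib
import hmac

# In production, load these from Vault; never hardcode real keys.
KEY_HASHES = {
    "radiology": hashlib.sha256(b"dept-radiology-key").hexdigest(),
    "pathology": hashlib.sha256(b"dept-pathology-key").hexdigest(),
}

def authorize(department: str, presented_key: str) -> bool:
    """Return True only if the presented key matches the department's hash."""
    expected = KEY_HASHES.get(department)
    if expected is None:
        return False
    presented = hashlib.sha256(presented_key.encode()).hexdigest()
    # hmac.compare_digest avoids timing side channels
    return hmac.compare_digest(expected, presented)

assert authorize("radiology", "dept-radiology-key")
assert not authorize("radiology", "wrong-key")
```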
Storage Requirements
Healthcare AI storage breaks into four categories with very different sizing profiles.
| Storage Type | Size | Growth Rate | Retention |
|---|---|---|---|
| Base model files | 4-14GB per model (quantized) | Static per version | Keep current + 1 previous version |
| LoRA adapter files | 50-200MB per specialty adapter | ~1-2 new adapters/quarter | Keep all versions (audit trail) |
| Audit logs | 10-50GB/year | Scales with usage | 6-7 years (HIPAA minimum 6) |
| Evaluation datasets | 1-5GB | Quarterly updates | Keep all versions |
Total first-year storage: 30-70GB for models and adapters, plus audit log growth. A 1TB NVMe SSD handles 5+ years of operation with room to spare.
Backup strategy: Encrypted backups to a secondary on-premise location. Never back up to cloud storage unless the cloud provider has a signed BAA and your risk assessment explicitly approves it.
Monitoring and Logging
HIPAA requires audit logging for any system that processes ePHI. For AI inference, this means every single request.
What to Log Per Inference
| Field | Example | Purpose |
|---|---|---|
| Timestamp | 2026-02-26T14:32:01Z | Audit trail |
| Request ID | uuid-v4 | Correlation |
| Model version | llama-3.1-8b-q4_K_M + radiology-v2.3 | Reproducibility |
| Department | radiology | Access control audit |
| User/service ID | ehr-integration-svc | Attribution |
| Input hash (SHA-256) | a3f2... | Integrity verification without storing PHI |
| Output hash (SHA-256) | b7c1... | Integrity verification |
| Token count (in/out) | 342 / 128 | Usage tracking |
| Latency (ms) | 1,240 | Performance monitoring |
| Status | success / error | Operations |
Critical detail: Log input and output hashes, not raw content. This lets you verify integrity and prove which model version produced which output without storing additional copies of PHI in the audit database.
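A minimal sketch of assembling one such record in Python; the field names mirror the table above, and in production the record would be inserted into the audit database rather than printed:

```python
# Build one audit record per inference: hash the PHI, never store it.
import hashlib
import json
import uuid
from datetime import datetime, timezone

def audit_record(department, service_id, model_version,
                 prompt, completion, tokens_in, tokens_out,
                 latency_ms, status="success"):
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": str(uuid.uuid4()),
        "model_version": model_version,
        "department": department,
        "service_id": service_id,
        # SHA-256 digests allow integrity checks without retaining PHI
        "input_hash": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_hash": hashlib.sha256(completion.encode()).hexdigest(),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": latency_ms,
        "status": status,
    }

record = audit_record("radiology", "ehr-integration-svc",
                      "llama-3.1-8b-q4_K_M + radiology-v2.3",
                      prompt="<clinical note text>",
                      completion="<model summary>",
                      tokens_in=342, tokens_out=128, latency_ms=1240)
print(json.dumps(record, indent=2))
```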
HIPAA Access Logging
Beyond inference logging, you need standard HIPAA access logs:
- Who accessed the AI system and when
- Authentication successes and failures
- Configuration changes (model updates, adapter swaps, parameter changes)
- Administrative access to the inference server itself
Use your existing SIEM (Splunk, Elastic, etc.) to aggregate these logs. The AI infrastructure should feed into the same logging pipeline as the rest of your clinical systems.
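Feeding the SIEM is typically a one-line syslog hop. A minimal sketch using Python's standard logging module; the SIEM hostname, port, and event format are assumptions to adapt to your pipeline:

```python
# Forward AI audit events to the existing SIEM over syslog.
# The SIEM address and event format are placeholders for illustration.
import logging
from logging.handlers import SysLogHandler

logger = logging.getLogger("ai-audit")
logger.setLevel(logging.INFO)
logger.addHandler(SysLogHandler(address=("siem.internal", 514)))

logger.info("auth_failure department=radiology key_id=rad-2026Q1 src=10.100.4.22")
logger.info("config_change actor=admin action=adapter_swap adapter=radiology-v2.3")
```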
Model Update Strategy
Getting new model versions onto air-gapped or isolated systems is the biggest operational challenge.
Option 1: Secure USB Transfer (Air-Gapped)
- Download model files on an internet-connected workstation in a secure room
- Verify checksums against published hashes
- Transfer to encrypted USB drive (FIPS 140-2 compliant)
- Transport via authorized personnel with chain-of-custody documentation
- Load onto inference server, verify checksums again (see the sketch below)
- Run validation suite before switching production traffic
Time per update: 2-4 hours including verification and validation.
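Both checksum checks (on the download workstation and again on the inference server) can share one verification script. A minimal sketch, with the published hash supplied as a command-line argument:

```python
# Verify a model file's SHA-256 against the published hash before use.
import hashlib
import sys

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in 1 MiB chunks so multi-GB model files fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    path, published = sys.argv[1], sys.argv[2].lower()
    actual = sha256_file(path)
    if actual != published:
        sys.exit(f"CHECKSUM MISMATCH for {path}: got {actual}")
    print(f"OK {path} {actual}")
```

A mismatch at either end halts the update before the model ever reaches production traffic.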
Option 2: Internal Artifact Registry (DMZ)
- Automated pull from external model registry (Hugging Face, Ollama registry) through the DMZ proxy
- Model files land in an internal artifact registry (Nexus, Artifactory, or a simple Nginx file server)
- Inference server pulls from the internal registry on a scheduled basis (see the pull-and-verify sketch below)
- Automated validation suite runs before traffic is switched
Time per update: 30-60 minutes, mostly automated.
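The scheduled pull can reuse the same hashing logic and only swap a new model into place after verification passes. A minimal sketch; the registry URL and file paths are hypothetical:

```python
# Pull a model from the internal artifact registry, verify it, then
# swap it into place atomically. URL and paths are placeholders.
import hashlib
import os
import requests

REGISTRY = "https://artifacts.internal/models"

def pull_and_install(name: str, published_sha256: str,
                     dest_dir: str = "/opt/models") -> None:
    tmp_path = os.path.join(dest_dir, name + ".download")
    final_path = os.path.join(dest_dir, name)

    h = hashlib.sha256()
    with requests.get(f"{REGISTRY}/{name}", stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(tmp_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
                h.update(chunk)

    if h.hexdigest() != published_sha256.lower():
        os.remove(tmp_path)
        raise RuntimeError(f"checksum mismatch for {name}")

    # os.replace is atomic on the same filesystem: the serving path
    # never points at a partially written file
    os.replace(tmp_path, final_path)
```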
Staged Rollout
Regardless of delivery method, follow a staged rollout (a minimal routing sketch follows the list):
- Canary (5% traffic): Route a small percentage of non-critical requests to the new model
- Validation (24-48 hours): Compare output quality metrics against the previous version
- Full rollout: Switch all traffic to the new version
- Rollback window: Keep the previous version loaded and ready for instant rollback for 7 days
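A minimal sketch of the canary split, assuming two Ollama instances on adjacent ports (the second port and the endpoint names are hypothetical):

```python
# Weighted canary routing between the current and candidate model
# endpoints. The 5% weight mirrors the rollout plan above.
import random

ENDPOINTS = {
    "stable": "http://127.0.0.1:11434",   # current production model
    "canary": "http://127.0.0.1:11435",   # new version under validation
}
CANARY_FRACTION = 0.05

def pick_endpoint(critical: bool) -> str:
    """Route ~5% of non-critical requests to the canary."""
    if not critical and random.random() < CANARY_FRACTION:
        return ENDPOINTS["canary"]
    return ENDPOINTS["stable"]

# Tag every response with the endpoint that served it so the 24-48 hour
# validation window can attribute quality metrics per model version.
```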
Disaster Recovery
AI system failures in clinical environments need clear fallback procedures.
Failure Modes and Responses
| Failure | RTO Target | Response |
|---|---|---|
| GPU failure | 4 hours | Failover to CPU inference (degraded throughput) |
| Inference server crash | 15 minutes | Restart service, auto-recovery |
| Model file corruption | 1 hour | Restore from local backup, re-verify checksums |
| Complete server failure | 8 hours | Restore from backup to standby hardware |
| Network partition | Immediate | Clinical apps fall back to non-AI workflows |
CPU Failback
Every GPU-accelerated deployment should have a tested CPU fallback path; a detection sketch follows this list. If the GPU fails:
- Ollama/llama.cpp automatically falls back to CPU inference
- Throughput drops from ~30 tokens/sec to ~5 tokens/sec
- Reduce concurrent request limit from 10 to 2
- Prioritize real-time clinical use cases, queue batch jobs
This degraded mode keeps AI available while hardware is replaced. No clinical workflow should have a hard dependency on AI — it should always be assistive, with human fallback.
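A minimal sketch of the detection half, assuming nvidia-smi is on the PATH and the gateway's rate limiter can be updated from a periodic check:

```python
# Detect GPU loss and drop the concurrency limit to the CPU-failback
# level. nvidia-smi exits non-zero when no GPU responds.
import subprocess

GPU_CONCURRENCY = 10
CPU_CONCURRENCY = 2

def gpu_available() -> bool:
    try:
        subprocess.run(["nvidia-smi"], check=True,
                       capture_output=True, timeout=10)
        return True
    except (subprocess.CalledProcessError, FileNotFoundError,
            subprocess.TimeoutExpired):
        return False

def current_concurrency_limit() -> int:
    # Feed this into the gateway's rate limiter on a periodic check
    return GPU_CONCURRENCY if gpu_available() else CPU_CONCURRENCY
```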
Cost Comparison: On-Premise vs Cloud API
The math favors on-premise at healthcare volumes.
3-Year Total Cost of Ownership
| Cost Component | On-Premise (T4 GPU) | Cloud API (GPT-4-class, BAA) |
|---|---|---|
| Hardware (year 0) | $10,000 | $0 |
| Software/licensing | $0 (open-source stack) | $0 |
| API costs (year 1) | $0 | $36,000-72,000 |
| API costs (year 2) | $0 | $36,000-72,000 |
| API costs (year 3) | $0 | $36,000-72,000 |
| Power/cooling (3 years) | $1,800 | $0 |
| DevOps time (3 years) | $15,000 (part-time) | $5,000 (integration only) |
| BAA/compliance costs | $0 (internal) | $5,000-15,000 (vendor assessment) |
| 3-Year Total | $26,800 | $118,000-236,000 |
Assumptions: 1,000 inferences/day, average 500 input + 200 output tokens, which works out to roughly $0.10-0.20 per inference at BAA-covered rates (typically 2-3x standard pricing). DevOps at $75/hour; the on-premise line prices incremental hands-on time (averaging 1-1.5 hours/week over three years, since the work folds into an existing IT role), while the cloud line covers up-front integration only.
The breakeven point is typically around 200-300 inferences per day. Below that, cloud APIs with a BAA may be more cost-effective. Above that, on-premise wins and the gap widens every month.
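To rerun the breakeven with your own numbers, the arithmetic fits in a few lines. A minimal sketch comparing on-premise TCO against cloud API spend alone; the $0.10 per-inference figure is the low end of the BAA-covered estimate above:

```python
# 3-year breakeven: on-premise TCO vs cloud API spend alone.
# Dollar figures mirror the assumptions above; adjust to your quotes.
ONPREM_TCO = 26_800          # hardware + power + devops over 3 years
COST_PER_INFERENCE = 0.10    # low end of the BAA-covered estimate
DAYS = 3 * 365

breakeven = ONPREM_TCO / (COST_PER_INFERENCE * DAYS)
print(f"breakeven: ~{breakeven:.0f} inferences/day")  # ~245/day

# At the 1,000/day volume assumed above:
print(f"cloud API spend: ${1_000 * COST_PER_INFERENCE * DAYS:,.0f}")  # $109,500
print(f"on-premise TCO:  ${ONPREM_TCO:,}")                            # $26,800
```

Higher BAA-covered rates only pull the breakeven lower, so the 200-300 figure is conservative.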
Team Requirements
You do not need a dedicated ML team. You need:
- 1 DevOps/infrastructure engineer (part-time, ~4 hours/week): Handles server maintenance, model updates, monitoring alerts, and security patching. This person already exists on your IT team.
- 1 clinical champion per department: A clinician who owns the use case, validates outputs, and provides feedback for fine-tuning. Not a technical role.
- Vendor support (optional): Ertas or similar platform for fine-tuning, adapter management, and deployment tooling. Eliminates the need for ML expertise.
The most common mistake is over-staffing. On-premise AI inference is operationally similar to running any other internal service. If your team can manage an internal database server, they can manage an AI inference server.
Putting It Together
Here is the complete architecture for a mid-size hospital deployment:
Internet (updates only)
|
[DMZ: Update Proxy]
|
[Internal Network - VLAN 200: AI Infrastructure]
|
[Artifact Registry] --> [Inference Server: T4 GPU + Ollama]
| |
[API Gateway (Nginx)] [Audit DB (PostgreSQL)]
|
[mTLS + API Key Auth]
|
[VLAN 100: Clinical Systems]
| | |
[EHR] [PACS] [Clinical Apps]
Day-one deployment: Single T4 server, one department, one use case (clinical note summarization). Total cost under $12,000. Time to production: 2-3 weeks with existing IT staff.
Scale path: Add LoRA adapters for new departments. Add a second T4 for higher throughput. Add specialty models for radiology, pathology, coding. Each expansion is incremental — no rearchitecture needed.
The infrastructure is the easy part. The model serving stack is proven. The network patterns are well-understood. What matters is getting the first use case into production and proving value to the clinical staff who will use it every day.
Further Reading
- GPU Cost Comparison for Self-Hosting AI in 2026 — Detailed hardware benchmarks and pricing for inference workloads
- Self-Hosted AI ROI Calculator — Build a business case with real numbers for your specific volume
- HIPAA-Compliant AI: On-Premise vs Cloud API — Deep dive into compliance requirements and architectural trade-offs