
    How to Build an Air-Gapped AI Pipeline for Regulated Industries

    A decision-stage technical guide to building an AI pipeline with zero internet connectivity. Covers pipeline architecture at each stage — data ingestion, cleaning, labeling, augmentation, and export — with hardware requirements, tool comparisons, and transfer mechanisms for air-gapped environments.

    Ertas Team

    You have decided that your AI pipeline must run air-gapped — physically isolated from the internet with no exceptions. Maybe your data is classified. Maybe your regulator requires it. Maybe your security team conducted a risk assessment and concluded that any external connectivity is unacceptable for this particular workload.

    This article is not about whether you need air-gapped operation. (If you are unsure, see our guide to air-gapped vs on-premise vs self-hosted deployment for the decision framework.) This article covers the architecture decisions you need to make at each pipeline stage when building an AI system that will never touch the internet.

    The pipeline has five stages. Each stage has different hardware requirements, different tool constraints, and different failure modes when connectivity is removed. We will walk through each one.


    Pipeline Architecture Overview

    An air-gapped AI pipeline has the same logical stages as any other ML pipeline. The difference is that every component at every stage must function with zero external connectivity — no API calls, no license servers, no CDN-hosted assets, no telemetry, no dependency downloads.

    ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────────┐   ┌──────────┐
    │  Ingest  │──▶│  Clean   │──▶│  Label   │──▶│  Augment     │──▶│  Export  │
    │          │   │          │   │          │   │  (Synthetic) │   │          │
    │ OCR      │   │ PII/PHI  │   │ NER      │   │ Local LLM    │   │ JSONL    │
    │ PDF      │   │ Redact   │   │ Classify │   │ Inference    │   │ COCO     │
    │ Layout   │   │ Normalize│   │ BBox     │   │              │   │ CSV      │
    └──────────┘   └──────────┘   └──────────┘   └──────────────┘   └──────────┘
           │              │              │              │                   │
           ▼              ▼              ▼              ▼                   ▼
       [Audit Log]   [Audit Log]   [Audit Log]   [Audit Log]         [Audit Log]
    

    Every stage writes to a local audit log. In air-gapped environments, the audit trail is your only evidence of what happened to the data. There is no cloud logging service to fall back on.


    Stage 1: Data Ingestion

    Data ingestion converts raw enterprise files — PDFs, Word documents, scanned images, spreadsheets, emails — into machine-readable text and structured content. In an air-gapped environment, this means all parsing and OCR must be embedded in the application.

    Architecture decisions

    OCR engine selection: You need OCR that runs entirely locally, with no external API calls and no internet-dependent model downloads.

    | OCR engine | Air-gapped compatible | Language support | Accuracy on clean docs | Accuracy on scanned docs | GPU acceleration |
    | --- | --- | --- | --- | --- | --- |
    | Tesseract 5.x | Yes — fully local, open source | 100+ languages via offline language packs | Good | Moderate | No |
    | PaddleOCR | Yes — fully local, open source | 80+ languages, strong CJK support | Very good | Good | Yes (optional) |
    | EasyOCR | Yes — fully local, open source | 80+ languages | Good | Moderate | Yes (optional) |
    | Google Document AI | No — cloud API | N/A | N/A | N/A | N/A |
    | Azure Document Intelligence | No — cloud API | N/A | N/A | N/A | N/A |
    | AWS Textract | No — cloud API | N/A | N/A | N/A | N/A |

    For air-gapped environments, Tesseract and PaddleOCR are the primary options. Tesseract is more widely deployed and has better documentation for offline installation. PaddleOCR typically produces better results on complex layouts (multi-column, tables, mixed text/image) but requires more careful dependency management for offline installation.
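    As a concrete sketch, the snippet below runs Tesseract fully offline through pytesseract, assuming the engine and its language packs were already installed from pre-staged media; the file path and language code are illustrative.

```python
# Minimal local OCR sketch using pytesseract (Tesseract 5.x) and Pillow.
# Assumes Tesseract and its offline language packs are already installed
# on the air-gapped machine; no network access is needed at runtime.
from PIL import Image
import pytesseract

def ocr_page(image_path: str, languages: str = "eng") -> str:
    """Run OCR on a single scanned page and return extracted text."""
    image = Image.open(image_path)
    # `lang` accepts '+'-joined codes (e.g. "eng+deu") matching the
    # traineddata files staged in Tesseract's tessdata directory.
    return pytesseract.image_to_string(image, lang=languages)

if __name__ == "__main__":
    print(ocr_page("scanned_contract_page_001.png")[:500])
```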

    PDF parsing: PDF parsing has two modes — text extraction (for digitally-created PDFs) and OCR extraction (for scanned PDFs). Most enterprise document collections contain both.

    | PDF parser | Air-gapped compatible | Handles scanned PDFs | Table extraction | Layout preservation |
    | --- | --- | --- | --- | --- |
    | PyMuPDF (fitz) | Yes | With embedded OCR | Basic | Good |
    | pdfplumber | Yes | No (text-only) | Good | Good |
    | Docling | Yes (self-hosted) | With embedded OCR | Very good (97.9%) | Very good |
    | Camelot | Yes | No (text-only) | Very good (tables specifically) | Limited |
    | Marker | Yes | With embedded OCR | Good | Very good |
    | Adobe Acrobat API | No — cloud service | N/A | N/A | N/A |

    Recommendation for air-gapped: Docling (IBM Research, open source) for primary parsing, with PyMuPDF as a fallback for simpler documents. Docling's table extraction accuracy (97.9% on benchmarks) is important for enterprise documents where tables contain critical structured data.
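    A minimal sketch of that two-tier approach, assuming Docling, its layout models, and PyMuPDF were all installed from pre-staged media; the file name is illustrative and error handling is simplified.

```python
# Two-tier PDF parsing sketch: Docling first, PyMuPDF (fitz) as fallback.
# Both packages and Docling's layout/table models must be pre-staged;
# nothing is downloaded at runtime.
from docling.document_converter import DocumentConverter
import fitz  # PyMuPDF

def parse_pdf(path: str) -> str:
    try:
        result = DocumentConverter().convert(path)
        # Markdown export preserves tables and reading order.
        return result.document.export_to_markdown()
    except Exception:
        # Fallback for simple, digitally-created PDFs: plain text extraction.
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)

print(parse_pdf("quarterly_filing.pdf")[:500])
```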

    Hardware requirements for ingestion

    | Workload | CPU | RAM | GPU | Storage |
    | --- | --- | --- | --- | --- |
    | Text PDF extraction (no OCR) | 4+ cores | 8 GB | Not required | 2x source document volume |
    | OCR on scanned documents | 8+ cores | 16 GB | Optional (speeds PaddleOCR 3-5x) | 3x source document volume |
    | High-volume ingestion (10K+ docs) | 16+ cores | 32 GB | Recommended | 3-5x source document volume |

    Storage multiplier accounts for both the original documents and the extracted structured output (JSON, text, metadata).


    Stage 2: Cleaning and De-Identification

    Cleaning transforms raw extracted text into normalized, consistent content. De-identification detects and redacts personally identifiable information (PII) and protected health information (PHI). In air-gapped environments, all NLP models for entity detection must run locally.

    Architecture decisions

    PII/PHI detection approach: You have two options — rule-based pattern matching, or NLP model-based named entity recognition (NER). In practice, you need both.

    | Detection method | What it catches | False positive rate | Air-gapped compatible |
    | --- | --- | --- | --- |
    | Regex pattern matching | SSNs, phone numbers, emails, credit cards, dates in standard formats, medical record numbers | Low (patterns are precise) | Yes — no dependencies |
    | spaCy NER (local models) | Names, organizations, locations, dates in non-standard formats | Moderate (requires tuning) | Yes — model weights loaded from local storage |
    | Hugging Face NER (GGUF/ONNX) | Names, organizations, domain-specific entities | Low-to-moderate | Yes — quantized models run locally |
    | AWS Comprehend Medical | PHI in clinical text | Low | No — cloud API |
    | Google Healthcare NLP | PHI in clinical text | Low | No — cloud API |

    Recommended air-gapped approach: Layer both methods. Use regex patterns for structured identifiers (SSNs, phone numbers, emails, medical record numbers, dates). Use a locally loaded NER model (spaCy or quantized transformer) for unstructured identifiers (names, organizations, locations in free text).
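    A minimal sketch of that layering, assuming a spaCy English model was installed from local media; the regex patterns are illustrative and nowhere near an exhaustive identifier set.

```python
# Layered PII detection sketch: regex for structured identifiers,
# spaCy NER for unstructured ones. The spaCy model wheel is assumed
# to have been installed offline; nothing is downloaded at runtime.
import re
import spacy

PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

nlp = spacy.load("en_core_web_lg")  # loaded from local disk

def detect_pii(text: str) -> list[tuple[str, str]]:
    # First pass: precise patterns for structured identifiers.
    findings = [(label, m.group()) for label, rx in PATTERNS.items()
                for m in rx.finditer(text)]
    # Second pass: NER for names, organizations, locations, dates.
    findings += [(ent.label_, ent.text) for ent in nlp(text).ents
                 if ent.label_ in {"PERSON", "ORG", "GPE", "DATE"}]
    return findings
```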

    For HIPAA-regulated data specifically, the de-identification must satisfy the Safe Harbor method (removal of 18 specific identifier categories) or the Expert Determination method. Regex catches most structured identifiers. NER catches the unstructured ones. A human review stage after automated de-identification is standard practice for HIPAA compliance.

    Data normalization: Air-gapped environments often process documents accumulated over decades — different encoding schemes, inconsistent date formats, legacy character sets. Normalization converts these to consistent UTF-8 encoding, standardized date formats, and consistent whitespace handling. This is computationally cheap and has no connectivity requirements.
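    A small normalization sketch using only the Python standard library; the date format handled here is one example, and a real collection will need a longer list of legacy formats.

```python
# Normalization sketch: Unicode normal form, whitespace, line endings,
# and an ISO-8601 date rewrite (US-style MM/DD/YYYY assumed).
import re
import unicodedata
from datetime import datetime

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)   # consistent Unicode form
    text = re.sub(r"\r\n?", "\n", text)         # normalize line endings
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces/tabs

    def iso(m: re.Match) -> str:
        try:
            return datetime.strptime(m.group(), "%m/%d/%Y").strftime("%Y-%m-%d")
        except ValueError:
            return m.group()  # leave unparseable dates untouched

    return re.sub(r"\b\d{2}/\d{2}/\d{4}\b", iso, text)
```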

    Hardware requirements for cleaning

    | Workload | CPU | RAM | GPU | Notes |
    | --- | --- | --- | --- | --- |
    | Regex-only PII detection | 4+ cores | 8 GB | Not required | Fast, handles millions of records |
    | spaCy NER models | 4+ cores | 16 GB | Not required (CPU inference) | Slower than regex, more thorough |
    | Transformer NER (quantized) | 8+ cores | 16 GB | 8+ GB VRAM recommended | Best accuracy, requires GPU for reasonable speed |
    | Combined pipeline | 8+ cores | 32 GB | 16+ GB VRAM | Regex first pass, NER second pass, human review final pass |

    Stage 3: Labeling and Annotation

    Labeling is where domain experts assign categories, entities, bounding boxes, or quality scores to processed data. In air-gapped environments, the labeling interface must serve entirely from localhost — no external CDN assets, no cloud-synced projects, no browser-based tools that load scripts from remote servers.

    Architecture decisions

    Annotation tool selection: Most modern annotation tools are web applications that assume internet connectivity. Even self-hosted versions often load JavaScript libraries from CDNs, analytics scripts, or font files from external servers.

    | Annotation tool | Air-gapped compatible | Modalities | Desktop native | Domain-expert accessible |
    | --- | --- | --- | --- | --- |
    | Prodigy (Explosion AI) | Yes — fully local, perpetual license | NLP, CV, audio | Python-based (runs locally) | Moderate (requires terminal) |
    | Label Studio (self-hosted) | Partial — check for external asset loading | NLP, CV, audio, video | No (Docker/K8s web app) | Yes (browser UI) |
    | CVAT (self-hosted) | Partial — web app with potential external dependencies | CV only | No (Docker web app) | Yes (browser UI) |
    | Labelbox | No — cloud SaaS | NLP, CV | No | Yes |
    | Scale AI | No — cloud SaaS | NLP, CV | No | Yes |

    The Label Studio caveat: Label Studio can be self-hosted, but the self-hosted version must be audited for external calls. Previous versions loaded Google Fonts from an external CDN, included analytics scripts, and made calls to check for updates. In an air-gapped environment, these calls fail silently or cause errors. You need to verify — by inspecting network traffic — that your self-hosted Label Studio instance makes zero external HTTP requests.
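    One way to spot-check this from the host itself is to list established, non-loopback connections while exercising the annotation UI. The sketch below assumes psutil is available; it complements, rather than replaces, packet capture at the network boundary.

```python
# Audit sketch: print any established outbound connection whose remote
# address is not loopback. Run this while using the self-hosted tool
# to surface external calls (CDN assets, analytics, update checks).
import psutil

LOOPBACK = ("127.", "::1")

for conn in psutil.net_connections(kind="inet"):
    if conn.status == psutil.CONN_ESTABLISHED and conn.raddr:
        if not conn.raddr.ip.startswith(LOOPBACK):
            proc = psutil.Process(conn.pid).name() if conn.pid else "?"
            print(f"{proc}: {conn.laddr.ip}:{conn.laddr.port} -> "
                  f"{conn.raddr.ip}:{conn.raddr.port}")
```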

    Recommendation for air-gapped: For NLP annotation, Prodigy is the most reliably air-gapped option — it is a Python library with no web dependencies, serving its UI entirely from localhost. The trade-off is that it requires a Python environment, which limits accessibility for non-technical domain experts.

    For organizations where domain experts (doctors, lawyers, engineers) need direct access to the labeling interface, a native desktop annotation tool that requires no terminal, no Python, and no browser connectivity is the best option. This is the approach Ertas Data Suite takes — a native desktop app where the entire annotation interface runs locally with zero network dependencies.

    Hardware requirements for labeling

    Labeling is the least compute-intensive stage. It is primarily a human activity with software assistance.

    | Workload | CPU | RAM | GPU | Notes |
    | --- | --- | --- | --- | --- |
    | Text annotation (NER, classification) | 2+ cores | 8 GB | Not required | Primarily UI-bound, not compute-bound |
    | Image annotation (bounding boxes, segmentation) | 4+ cores | 16 GB | Optional (speeds rendering) | Large images need more RAM |
    | AI-assisted labeling (model suggestions) | 8+ cores | 16 GB | 8+ GB VRAM | Local model provides label suggestions for human review |

    Stage 4: Synthetic Data Augmentation

    Synthetic data augmentation uses LLMs to generate additional training examples from existing labeled data. In an air-gapped environment, this requires running LLM inference locally — no cloud APIs, no external model endpoints.

    Architecture decisions

    Local LLM runtime selection:

    | Runtime | Air-gapped compatible | Model format | GPU support | Multi-model serving |
    | --- | --- | --- | --- | --- |
    | Ollama | Yes — offline installation available | GGUF | NVIDIA, AMD, Apple Silicon | Yes |
    | llama.cpp | Yes — compile from source, no dependencies | GGUF | NVIDIA, AMD, Apple Silicon, Vulkan | No (single model) |
    | vLLM | Yes — but complex offline dependency installation | SafeTensors, GPTQ | NVIDIA (primarily) | Yes |
    | Microsoft Foundry Local | Yes — designed for disconnected operation | ONNX | NVIDIA, AMD, Intel, Qualcomm, Apple Silicon | Yes |
    | Hugging Face Inference API | No — cloud endpoint | N/A | N/A | N/A |

    Recommended for air-gapped: Ollama for general-purpose augmentation. It supports a wide range of GGUF models, has straightforward offline installation (copy the binary + model files), and serves an OpenAI-compatible API on localhost. For environments where Microsoft's ecosystem is preferred, Foundry Local is the alternative — with the trade-off of a narrower model selection.
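    A minimal augmentation sketch against Ollama's default localhost endpoint; the model tag and prompt are illustrative, and no request leaves the machine.

```python
# Paraphrase augmentation sketch against a local Ollama server
# (default address http://localhost:11434). The model must already
# be present on disk; nothing is pulled at runtime.
import requests

def paraphrase(example: str, model: str = "llama3.1:8b") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": f"Paraphrase the following text, preserving its meaning:\n{example}",
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(paraphrase("The patient reported persistent lower back pain."))
```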

    Model selection for augmentation:

    | Model | Parameters | VRAM required (Q4 quantized) | Augmentation quality | Air-gapped installation complexity |
    | --- | --- | --- | --- | --- |
    | Phi-4-mini | 3.8B | ~4 GB | Good for simple tasks | Low (small download, fast transfer) |
    | Llama 3.1 8B | 8B | ~6 GB | Good for general augmentation | Low |
    | Mistral 7B | 7B | ~6 GB | Good for structured output | Low |
    | Qwen 2.5 14B | 14B | ~10 GB | Very good | Moderate (larger transfer) |
    | Llama 3.1 70B | 70B | ~40 GB | Excellent | High (large download, requires high-VRAM GPU) |

    For most enterprise augmentation tasks — generating paraphrases, creating classification variants, expanding entity examples — an 8B-14B quantized model is the practical sweet spot. Quality is sufficient, hardware requirements are manageable, and the model files (4-10 GB) are feasible to transfer via removable media.

    Hardware requirements for augmentation

    | Workload | CPU | RAM | GPU | Throughput |
    | --- | --- | --- | --- | --- |
    | 7-8B model augmentation | 8+ cores | 32 GB | 16 GB VRAM (RTX 4080 or equivalent) | ~30-50 tokens/sec |
    | 14B model augmentation | 8+ cores | 32 GB | 24 GB VRAM (RTX 4090 or equivalent) | ~20-35 tokens/sec |
    | 70B model augmentation | 16+ cores | 64 GB | 48+ GB VRAM (A6000 or 2x RTX 4090) | ~10-20 tokens/sec |
    | CPU-only augmentation (7B) | 16+ cores | 64 GB | None | ~3-8 tokens/sec (slow but functional) |

    GPU is strongly recommended for augmentation. CPU-only inference on 7B models works but generates data 5-10x slower, which matters when you need to produce thousands of synthetic training examples.


    Stage 5: Export

    Export converts processed, labeled, and augmented data into formats consumable by downstream training and deployment systems. In an air-gapped environment, export targets local storage — never cloud object storage.

    Architecture decisions

    Export format selection depends on downstream use case:

    | Use case | Export format | File structure |
    | --- | --- | --- |
    | LLM fine-tuning | JSONL (instruction, input, output) | One JSON object per line |
    | RAG / retrieval | Chunked text with metadata | JSONL or structured JSON |
    | Computer vision (object detection) | YOLO or COCO format | Images + annotation files |
    | Computer vision (classification) | Directory structure with class folders | image/class_name/file.jpg |
    | Classical ML | CSV with features and labels | Standard tabular format |
    | DPO fine-tuning | JSONL with chosen/rejected pairs | Preference pairs per line |

    Audit trail export: In regulated environments, the training data alone is not sufficient. You must also export:

    • Data lineage (which source document produced which training example)
    • Transformation log (every cleaning, redaction, and modification with timestamps)
    • Operator log (who labeled what, when, and what they changed)
    • Quality metrics (inter-annotator agreement, confidence scores)

    For EU AI Act Article 30 compliance, this audit documentation must accompany the training data and be available for inspection. For HIPAA, the de-identification audit trail must demonstrate that PHI was properly removed before data was used for training.
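    A sketch of what that pairing can look like at export time: each training record is written to a local JSONL file alongside a matching lineage entry. The field names follow a hypothetical schema, not a mandated standard.

```python
# Export sketch: training record plus lineage record, local files only.
import json
from datetime import datetime, timezone

record = {
    "instruction": "Classify the clause type.",
    "input": "Either party may terminate this agreement with 30 days notice.",
    "output": "termination",
}
lineage = {
    "example_id": "ex-000123",
    "source_document": "contracts/msa_2019_acme.pdf",
    "transformations": ["ocr", "pii_redaction", "normalization"],
    "annotator": "analyst-07",
    "exported_at": datetime.now(timezone.utc).isoformat(),
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
with open("train.lineage.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(lineage, ensure_ascii=False) + "\n")
```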

    Hardware requirements for export

    | Workload | CPU | RAM | GPU | Notes |
    | --- | --- | --- | --- | --- |
    | JSONL/CSV export | 2+ cores | 8 GB | Not required | I/O-bound, not compute-bound |
    | Large-scale export (100K+ records) | 4+ cores | 16 GB | Not required | Disk speed matters more than CPU |
    | Export with audit trail generation | 4+ cores | 16 GB | Not required | Audit trail can be larger than the data itself |

    Transfer Mechanisms: Getting Software and Models into Air-Gapped Environments

    The most overlooked aspect of air-gapped AI is initial setup. You cannot install software from the internet. You cannot download model weights. Everything must be transferred through approved physical channels.

    Physical media transfer

    The standard approach for classified and air-gapped environments:

    1. Prepare on a connected machine: Download all software installers, dependencies, model weights, and configuration files onto a clean, formatted drive
    2. Security scan: Run the media through your organization's malware scanning and security review process
    3. Chain of custody: Document who prepared the media, what it contains, and when it was transferred
    4. Install on the air-gapped machine: Copy files from approved media to the target system
    5. Verify integrity: Compare checksums (SHA-256) of installed files against the prepared manifest
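    A small sketch of the integrity check in step 5, assuming the connected machine produced a manifest with one "<sha256>  <relative-path>" entry per line; the paths are illustrative.

```python
# Verify SHA-256 checksums of transferred files against a manifest
# prepared on the connected machine before media transfer.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def verify(manifest: Path, root: Path) -> bool:
    ok = True
    for line in manifest.read_text().splitlines():
        expected, name = line.split(maxsplit=1)
        if sha256(root / name) != expected:
            print(f"MISMATCH: {name}")
            ok = False
    return ok

print("all files verified" if verify(Path("manifest.sha256"), Path("/opt/staging"))
      else "verification failed")
```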

    For model weights specifically: a 7B GGUF model is roughly 4-6 GB. A 70B model is 35-45 GB. USB drives or portable SSDs handle these sizes easily. Larger datasets (hundreds of GB of source documents) may require portable NAS devices or multiple drives.

    One-way data diodes

    For organizations with more sophisticated air-gapped networks, hardware data diodes provide a one-way transfer mechanism. Data flows into the air-gapped network but cannot flow out. This is used in defense and critical infrastructure environments where removable media is also restricted.

    Data diodes allow automated, scheduled transfers of model updates and software patches into the air-gapped environment without creating any outbound data path.

    What must be pre-staged

    Before isolating the machine, transfer all of the following:

    | Category | Specific items | Typical size |
    | --- | --- | --- |
    | Application installers | AI pipeline software, annotation tools, inference runtime | 1-5 GB |
    | Runtime dependencies | Python packages (wheel files), system libraries | 2-10 GB |
    | OCR language packs | Tesseract language data, PaddleOCR models | 0.5-2 GB |
    | NER models | spaCy models, quantized transformer models for PII detection | 1-5 GB |
    | LLM weights | GGUF models for augmentation and AI-assisted labeling | 4-45 GB per model |
    | Configuration files | Pipeline configs, export templates, audit trail schemas | <100 MB |

    Total pre-staging for a complete air-gapped AI pipeline: approximately 10-70 GB, depending on how many LLM models you include.


    Compliance Mapping: Who Actually Requires Air-Gapped?

    Not every regulation requires air-gapped operation. Understanding which regulations require which deployment model prevents over-engineering.

    | Regulation / Context | Air-gapped required? | On-premise sufficient? | Notes |
    | --- | --- | --- | --- |
    | US classified systems (ITAR, classified data) | Yes | No | Physical isolation required by policy |
    | US CMMC Level 3+ (DoD contractors) | Often yes | Depends on data type | Controlled Unclassified Information handling |
    | HIPAA (healthcare) | No (but recommended for PHI training data) | Yes | HIPAA requires safeguards, not specific deployment models |
    | GDPR (EU) | No | Often sufficient | Requires data residency + processing controls; on-premise with audit trail satisfies most requirements |
    | EU AI Act (high-risk systems) | No | Often sufficient | Requires documentation and audit trail; deployment model is not prescribed |
    | India DPDP Act | No | May be required for significant data fiduciaries | Data localization for certain categories |
    | Saudi Arabia PDPL | No | Effectively required for personal data | Processing within the Kingdom |
    | Financial regulations (SOX, PCI-DSS) | No (except for specific high-security environments) | Yes | Strong access controls required; deployment model flexible |
    | Critical infrastructure (NERC CIP) | Often yes for OT networks | Yes for IT networks | OT/IT segmentation is standard |

    The practical guideline: Air-gapped is required for classified/defense data and critical infrastructure OT networks. On-premise is sufficient for most regulated industries (healthcare, finance, legal). Sovereign cloud (domestic provider) is acceptable for data that requires jurisdictional control but not physical isolation.


    Putting It Together: Reference Architecture

    A complete air-gapped AI pipeline for a regulated enterprise:

    Hardware:

    • Workstation or server: 16+ cores, 64 GB RAM, NVIDIA RTX 4090 (24 GB VRAM) or A6000 (48 GB VRAM)
    • Local storage: 2+ TB NVMe SSD for active projects, plus NAS for archival
    • Removable media station: for initial setup and periodic model/software updates

    Software stack:

    • OS: Linux (Ubuntu/RHEL) or Windows, fully updated before isolation
    • Ingestion: Docling + PyMuPDF + Tesseract/PaddleOCR
    • Cleaning: spaCy NER + regex patterns + custom rules
    • Labeling: Native desktop annotation tool (no Docker, no browser dependencies)
    • Augmentation: Ollama + Llama 3.1 8B (GGUF Q4)
    • Export: JSONL + audit trail generator
    • Inference runtime: Ollama, llama.cpp, or Foundry Local

    Estimated hardware cost: $8,000-$15,000 for a workstation build (RTX 4090 class), or $20,000-$40,000 for a server build (A6000 class). Compare to cloud GPU costs of $2-$4/hour for equivalent compute — the on-premise hardware pays for itself in 6-18 months of continuous use.

    This architecture handles the complete pipeline from raw documents to AI-ready training data, entirely within an air-gapped perimeter, with full audit trail at every stage.


    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
