
How to Build an Air-Gapped AI Pipeline for Regulated Industries
A decision-stage technical guide to building an AI pipeline with zero internet connectivity. Covers pipeline architecture at each stage — data ingestion, cleaning, labeling, augmentation, and export — with hardware requirements, tool comparisons, and transfer mechanisms for air-gapped environments.
You have decided that your AI pipeline must run air-gapped — physically isolated from the internet with no exceptions. Maybe your data is classified. Maybe your regulator requires it. Maybe your security team conducted a risk assessment and concluded that any external connectivity is unacceptable for this particular workload.
This article is not about whether you need air-gapped operation. (If you are unsure, see our guide to air-gapped vs on-premise vs self-hosted deployment for the decision framework.) This article covers the architecture decisions you need to make at each pipeline stage when building an AI system that will never touch the internet.
The pipeline has five stages. Each stage has different hardware requirements, different tool constraints, and different failure modes when connectivity is removed. We will walk through each one.
Pipeline Architecture Overview
An air-gapped AI pipeline has the same logical stages as any other ML pipeline. The difference is that every component at every stage must function with zero external connectivity — no API calls, no license servers, no CDN-hosted assets, no telemetry, no dependency downloads.
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────────┐   ┌──────────┐
│  Ingest  │──▶│  Clean   │──▶│  Label   │──▶│   Augment    │──▶│  Export  │
│          │   │          │   │          │   │ (Synthetic)  │   │          │
│ OCR      │   │ PII/PHI  │   │ NER      │   │ Local LLM    │   │ JSONL    │
│ PDF      │   │ Redact   │   │ Classify │   │ Inference    │   │ COCO     │
│ Layout   │   │ Normalize│   │ BBox     │   │              │   │ CSV      │
└──────────┘   └──────────┘   └──────────┘   └──────────────┘   └──────────┘
      │              │              │                │                │
      ▼              ▼              ▼                ▼                ▼
 [Audit Log]    [Audit Log]    [Audit Log]      [Audit Log]      [Audit Log]
Every stage writes to a local audit log. In air-gapped environments, the audit trail is your only evidence of what happened to the data. There is no cloud logging service to fall back on.
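A minimal sketch of such a local, append-only audit log in Python. The field names and file layout here are illustrative, not a prescribed schema; the point is one JSON line per event, written locally, with a content hash so later tampering is detectable:

```python
import json
import hashlib
from datetime import datetime, timezone

def append_audit_event(log_path, stage, event, payload):
    """Append one audit event as a JSON line.
    Field names are illustrative, not a prescribed schema."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stage": stage,    # e.g. "ingest", "clean", "label"
        "event": event,    # e.g. "document_parsed"
        "payload": payload,
        # Hash of the payload so later tampering is detectable
        "payload_sha256": hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest(),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = append_audit_event("audit.jsonl", "ingest", "document_parsed",
                         {"source": "contract_0042.pdf", "pages": 12})
```

Because each line is self-describing JSON, the log can be inspected with nothing more than `grep` and a JSON parser, which matters when the air-gapped machine has no log-aggregation tooling installed.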
Stage 1: Data Ingestion
Data ingestion converts raw enterprise files — PDFs, Word documents, scanned images, spreadsheets, emails — into machine-readable text and structured content. In an air-gapped environment, this means all parsing and OCR must be embedded in the application.
Architecture decisions
OCR engine selection: You need OCR that runs entirely locally, with no external API calls and no internet-dependent model downloads.
| OCR engine | Air-gapped compatible | Language support | Accuracy on clean docs | Accuracy on scanned docs | GPU acceleration |
|---|---|---|---|---|---|
| Tesseract 5.x | Yes — fully local, open source | 100+ languages via offline language packs | Good | Moderate | No |
| PaddleOCR | Yes — fully local, open source | 80+ languages, strong CJK support | Very good | Good | Yes (optional) |
| EasyOCR | Yes — fully local, open source | 80+ languages | Good | Moderate | Yes (optional) |
| Google Document AI | No — cloud API | N/A | N/A | N/A | N/A |
| Azure Document Intelligence | No — cloud API | N/A | N/A | N/A | N/A |
| AWS Textract | No — cloud API | N/A | N/A | N/A | N/A |
For air-gapped environments, Tesseract and PaddleOCR are the primary options. Tesseract is more widely deployed and has better documentation for offline installation. PaddleOCR typically produces better results on complex layouts (multi-column, tables, mixed text/image) but requires more careful dependency management for offline installation.
PDF parsing: Parsing operates in two modes: text extraction for digitally created PDFs, and OCR extraction for scanned ones. Most enterprise document collections contain both.
| PDF parser | Air-gapped compatible | Handles scanned PDFs | Table extraction | Layout preservation |
|---|---|---|---|---|
| PyMuPDF (fitz) | Yes | With embedded OCR | Basic | Good |
| pdfplumber | Yes | No (text-only) | Good | Good |
| Docling | Yes (self-hosted) | With embedded OCR | Very good (97.9%) | Very good |
| Camelot | Yes | No (text-only) | Very good (tables specifically) | Limited |
| Marker | Yes | With embedded OCR | Good | Very good |
| Adobe Acrobat API | No — cloud service | N/A | N/A | N/A |
Recommendation for air-gapped: Docling (IBM Research, open source) for primary parsing, with PyMuPDF as a fallback for simpler documents. Docling's table extraction accuracy (97.9% on benchmarks) is important for enterprise documents where tables contain critical structured data.
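The primary/fallback pattern can be sketched as a small dispatcher. The Docling and PyMuPDF calls inside the defaults are real library APIs, but the wrapper itself is our own illustration, and the parsers are injectable so the routing logic can be tested without either library installed:

```python
def extract_text(pdf_path, primary=None, fallback=None):
    """Try the primary parser; on any failure, fall back to the simpler one.
    Defaults assume Docling and PyMuPDF are installed locally."""
    if primary is None:
        def primary(path):
            # Docling: layout-aware parsing with strong table extraction
            from docling.document_converter import DocumentConverter
            result = DocumentConverter().convert(path)
            return result.document.export_to_markdown()
    if fallback is None:
        def fallback(path):
            # PyMuPDF: fast plain-text extraction for simple documents
            import fitz  # PyMuPDF
            with fitz.open(path) as doc:
                return "\n".join(page.get_text() for page in doc)
    try:
        return primary(pdf_path), "docling"
    except Exception:
        return fallback(pdf_path), "pymupdf"
```

Logging which engine produced each document (the second return value) feeds directly into the audit trail, so reviewers can see when the fallback path was taken.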
Hardware requirements for ingestion
| Workload | CPU | RAM | GPU | Storage |
|---|---|---|---|---|
| Text PDF extraction (no OCR) | 4+ cores | 8 GB | Not required | 2x source document volume |
| OCR on scanned documents | 8+ cores | 16 GB | Optional (speeds PaddleOCR 3-5x) | 3x source document volume |
| High-volume ingestion (10K+ docs) | 16+ cores | 32 GB | Recommended | 3-5x source document volume |
Storage multiplier accounts for both the original documents and the extracted structured output (JSON, text, metadata).
Stage 2: Cleaning and De-Identification
Cleaning transforms raw extracted text into normalized, consistent content. De-identification detects and redacts personally identifiable information (PII) and protected health information (PHI). In air-gapped environments, all NLP models for entity detection must run locally.
Architecture decisions
PII/PHI detection approach: You have two options — rule-based pattern matching, or NLP model-based named entity recognition (NER). In practice, you need both.
| Detection method | What it catches | False positive rate | Air-gapped compatible |
|---|---|---|---|
| Regex pattern matching | SSNs, phone numbers, emails, credit cards, dates in standard formats, medical record numbers | Low (patterns are precise) | Yes — no dependencies |
| spaCy NER (local models) | Names, organizations, locations, dates in non-standard formats | Moderate (requires tuning) | Yes — model weights loaded from local storage |
| Hugging Face NER (ONNX/quantized) | Names, organizations, domain-specific entities | Low-to-moderate | Yes — quantized models run locally |
| AWS Comprehend Medical | PHI in clinical text | Low | No — cloud API |
| Google Healthcare NLP | PHI in clinical text | Low | No — cloud API |
Recommended air-gapped approach: Layer both methods. Use regex patterns for structured identifiers (SSNs, phone numbers, emails, medical record numbers, dates). Use a locally loaded NER model (spaCy or quantized transformer) for unstructured identifiers (names, organizations, locations in free text).
For HIPAA-regulated data specifically, the de-identification must satisfy the Safe Harbor method (removal of 18 specific identifier categories) or the Expert Determination method. Regex catches most structured identifiers. NER catches the unstructured ones. A human review stage after automated de-identification is standard practice for HIPAA compliance.
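The layered approach can be sketched as follows. The regex patterns shown are a deliberately tiny, US-centric illustration (a production pattern set needs far more), and the NER pass is a pluggable callable so a locally loaded spaCy or transformer model can slot in:

```python
import re

# Illustrative US-centric patterns only; a production set needs many more
PII_PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def redact(text, ner=None):
    """First pass: regex for structured identifiers.
    Second pass: optional NER callable returning (start, end, label) spans."""
    spans = []
    for label, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label.upper()))
    if ner is not None:
        spans.extend(ner(text))  # e.g. spaCy PERSON/ORG/GPE spans
    # Replace right-to-left so earlier offsets stay valid
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

print(redact("Call Jane at 555-867-5309, SSN 123-45-6789."))
```

With a local spaCy model loaded from disk, the second pass plugs in as something like `ner=lambda t: [(e.start_char, e.end_char, e.label_) for e in nlp(t).ents]`, and the human review stage then checks the redacted output rather than the raw text.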
Data normalization: Air-gapped environments often process documents accumulated over decades — different encoding schemes, inconsistent date formats, legacy character sets. Normalization converts these to consistent UTF-8 encoding, standardized date formats, and consistent whitespace handling. This is computationally cheap and has no connectivity requirements.
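A minimal normalization sketch using only the standard library. The single date pattern handled here is an assumption for illustration; real legacy corpora need a much larger catalog of formats:

```python
import re
import unicodedata
from datetime import datetime

def normalize(text):
    """Normalize to NFC Unicode, collapse whitespace, and rewrite
    US-style MM/DD/YYYY dates as ISO 8601. Patterns are illustrative."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text).strip()   # also catches NBSP etc.

    def to_iso(match):
        return datetime.strptime(match.group(0), "%m/%d/%Y").date().isoformat()

    return re.sub(r"\b\d{2}/\d{2}/\d{4}\b", to_iso, text)

print(normalize("Signed   03/14/2019\u00a0"))
```

Running normalization before de-identification also helps the PII pass: consistent whitespace and encoding mean fewer pattern misses on legacy documents.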
Hardware requirements for cleaning
| Workload | CPU | RAM | GPU | Notes |
|---|---|---|---|---|
| Regex-only PII detection | 4+ cores | 8 GB | Not required | Fast, handles millions of records |
| spaCy NER models | 4+ cores | 16 GB | Not required (CPU inference) | Slower than regex, more thorough |
| Transformer NER (quantized) | 8+ cores | 16 GB | 8+ GB VRAM recommended | Best accuracy, requires GPU for reasonable speed |
| Combined pipeline | 8+ cores | 32 GB | 16+ GB VRAM | Regex first pass, NER second pass, human review final pass |
Stage 3: Labeling and Annotation
Labeling is where domain experts assign categories, entities, bounding boxes, or quality scores to processed data. In air-gapped environments, the labeling interface must serve entirely from localhost — no external CDN assets, no cloud-synced projects, no browser-based tools that load scripts from remote servers.
Architecture decisions
Annotation tool selection: Most modern annotation tools are web applications that assume internet connectivity. Even self-hosted versions often load JavaScript libraries from CDNs, analytics scripts, or font files from external servers.
| Annotation tool | Air-gapped compatible | Modalities | Desktop native | Domain-expert accessible |
|---|---|---|---|---|
| Prodigy (Explosion AI) | Yes — fully local, perpetual license | NLP, CV, audio | Python-based (runs locally) | Moderate (requires terminal) |
| Label Studio (self-hosted) | Partial — check for external asset loading | NLP, CV, audio, video | No (Docker/K8s web app) | Yes (browser UI) |
| CVAT (self-hosted) | Partial — web app with potential external dependencies | CV only | No (Docker web app) | Yes (browser UI) |
| Labelbox | No — cloud SaaS | NLP, CV | No | Yes |
| Scale AI | No — cloud SaaS | NLP, CV | No | Yes |
The Label Studio caveat: Label Studio can be self-hosted, but the self-hosted version must be audited for external calls. Previous versions loaded Google Fonts from an external CDN, included analytics scripts, and made calls to check for updates. In an air-gapped environment, these calls fail silently or cause errors. You need to verify — by inspecting network traffic — that your self-hosted Label Studio instance makes zero external HTTP requests.
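One cheap first check, before full traffic inspection, is a static scan of the served HTML for `src`/`href` attributes pointing off-host. This sketch uses only the standard library; it complements, and does not replace, watching live traffic with a proxy or packet capture:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ExternalAssetFinder(HTMLParser):
    """Collect src/href URLs that point outside an allowed host list."""
    ATTRS = {"src", "href"}

    def __init__(self, allowed_hosts):
        super().__init__()
        self.allowed = set(allowed_hosts)
        self.external = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.ATTRS and value:
                host = urlparse(value).netloc
                # Relative URLs have no netloc and are served locally
                if host and host not in self.allowed:
                    self.external.append(value)

page = """<link href="https://fonts.googleapis.com/css?family=Roboto">
<script src="/static/app.js"></script>"""
finder = ExternalAssetFinder(allowed_hosts={"localhost", "127.0.0.1"})
finder.feed(page)
print(finder.external)  # the Google Fonts URL is flagged; /static/app.js is not
```

Anything this scan flags is a guaranteed problem in an air-gapped deployment; an empty result still needs confirmation at the network layer, since scripts can construct URLs at runtime.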
Recommendation for air-gapped: For NLP annotation, Prodigy is the most reliably air-gapped option — it is a Python library with no web dependencies, serving its UI entirely from localhost. The trade-off is that it requires a Python environment, which limits accessibility for non-technical domain experts.
For organizations where domain experts (doctors, lawyers, engineers) need direct access to the labeling interface, a native desktop annotation tool that requires no terminal, no Python, and no browser connectivity is the best option. This is the approach Ertas Data Suite takes — a native desktop app where the entire annotation interface runs locally with zero network dependencies.
Hardware requirements for labeling
Labeling is the least compute-intensive stage. It is primarily a human activity with software assistance.
| Workload | CPU | RAM | GPU | Notes |
|---|---|---|---|---|
| Text annotation (NER, classification) | 2+ cores | 8 GB | Not required | Primarily UI-bound, not compute-bound |
| Image annotation (bounding boxes, segmentation) | 4+ cores | 16 GB | Optional (speeds rendering) | Large images need more RAM |
| AI-assisted labeling (model suggestions) | 8+ cores | 16 GB | 8+ GB VRAM | Local model provides label suggestions for human review |
Stage 4: Synthetic Data Augmentation
Synthetic data augmentation uses LLMs to generate additional training examples from existing labeled data. In an air-gapped environment, this requires running LLM inference locally — no cloud APIs, no external model endpoints.
Architecture decisions
Local LLM runtime selection:
| Runtime | Air-gapped compatible | Model format | GPU support | Multi-model serving |
|---|---|---|---|---|
| Ollama | Yes — offline installation available | GGUF | NVIDIA, AMD, Apple Silicon | Yes |
| llama.cpp | Yes — compile from source, no dependencies | GGUF | NVIDIA, AMD, Apple Silicon, Vulkan | No (single model) |
| vLLM | Yes — but complex offline dependency installation | SafeTensors, GPTQ | NVIDIA (primarily) | Yes |
| Microsoft Foundry Local | Yes — designed for disconnected operation | ONNX | NVIDIA, AMD, Intel, Qualcomm, Apple Silicon | Yes |
| Hugging Face Inference API | No — cloud endpoint | N/A | N/A | N/A |
Recommended for air-gapped: Ollama for general-purpose augmentation. It supports a wide range of GGUF models, has straightforward offline installation (copy the binary + model files), and serves an OpenAI-compatible API on localhost. For environments where Microsoft's ecosystem is preferred, Foundry Local is the alternative — with the trade-off of a narrower model selection.
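Because Ollama exposes an OpenAI-compatible endpoint on localhost, augmentation code can use plain standard-library HTTP. The sketch below separates payload construction (testable offline) from the request itself; the prompt wording and the `llama3.1:8b` model tag are assumptions to adjust to your install:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # localhost only

def build_paraphrase_request(example, model="llama3.1:8b", n_variants=3):
    """Build a chat-completion payload asking the local model for
    paraphrases of one training example."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You generate paraphrases of training examples. "
                        "Return one paraphrase per line."},
            {"role": "user",
             "content": f"Give {n_variants} paraphrases of:\n{example}"},
        ],
        "temperature": 0.9,  # higher temperature for more varied synthetic data
    }

def augment(example):
    """Send the request to the local Ollama server and split the reply
    into one synthetic example per line."""
    payload = json.dumps(build_paraphrase_request(example)).encode()
    req = request.Request(OLLAMA_URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # never leaves the machine
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].splitlines()
```

Keeping the endpoint pinned to `localhost` in configuration, rather than accepting an arbitrary URL, is a simple guard against an augmentation job accidentally pointing at an external API.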
Model selection for augmentation:
| Model | Parameters | VRAM required (Q4 quantized) | Augmentation quality | Air-gapped installation complexity |
|---|---|---|---|---|
| Phi-4-mini | 3.8B | ~4 GB | Good for simple tasks | Low (small download, fast transfer) |
| Llama 3.1 8B | 8B | ~6 GB | Good for general augmentation | Low |
| Mistral 7B | 7B | ~6 GB | Good for structured output | Low |
| Qwen 2.5 14B | 14B | ~10 GB | Very good | Moderate (larger transfer) |
| Llama 3.1 70B | 70B | ~40 GB | Excellent | High (large download, requires high-VRAM GPU) |
For most enterprise augmentation tasks — generating paraphrases, creating classification variants, expanding entity examples — an 8B-14B quantized model is the practical sweet spot. Quality is sufficient, hardware requirements are manageable, and the model files (4-10 GB) are feasible to transfer via removable media.
Hardware requirements for augmentation
| Workload | CPU | RAM | GPU | Throughput |
|---|---|---|---|---|
| 7-8B model augmentation | 8+ cores | 32 GB | 16 GB VRAM (RTX 4080 or equivalent) | ~30-50 tokens/sec |
| 14B model augmentation | 8+ cores | 32 GB | 24 GB VRAM (RTX 4090 or equivalent) | ~20-35 tokens/sec |
| 70B model augmentation | 16+ cores | 64 GB | 48+ GB VRAM (A6000 or 2x RTX 4090) | ~10-20 tokens/sec |
| CPU-only augmentation (7B) | 16+ cores | 64 GB | None | ~3-8 tokens/sec (slow but functional) |
GPU is strongly recommended for augmentation. CPU-only inference on 7B models works but generates data 5-10x slower, which matters when you need to produce thousands of synthetic training examples.
Stage 5: Export
Export converts processed, labeled, and augmented data into formats consumable by downstream training and deployment systems. In an air-gapped environment, export targets local storage — never cloud object storage.
Architecture decisions
Export format selection depends on downstream use case:
| Use case | Export format | File structure |
|---|---|---|
| LLM fine-tuning | JSONL (instruction, input, output) | One JSON object per line |
| RAG / retrieval | Chunked text with metadata | JSONL or structured JSON |
| Computer vision (object detection) | YOLO or COCO format | Images + annotation files |
| Computer vision (classification) | Directory structure with class folders | image/class_name/file.jpg |
| Classical ML | CSV with features and labels | Standard tabular format |
| DPO fine-tuning | JSONL with chosen/rejected pairs | Preference pairs per line |
Audit trail export: In regulated environments, the training data alone is not sufficient. You must also export:
- Data lineage (which source document produced which training example)
- Transformation log (every cleaning, redaction, and modification with timestamps)
- Operator log (who labeled what, when, and what they changed)
- Quality metrics (inter-annotator agreement, confidence scores)
For EU AI Act Article 30 compliance, this audit documentation must accompany the training data and be available for inspection. For HIPAA, the de-identification audit trail must demonstrate that PHI was properly removed before data was used for training.
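A sketch of a JSONL exporter that carries lineage inline with each training record. The schema here (`source_doc`, `transform_ids` referencing audit-log entries) is illustrative, not a standard; the point is that every exported example can be traced back to its source document and transformations:

```python
import json

def export_jsonl(examples, path):
    """Write instruction-tuning records with inline lineage fields."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            record = {
                "instruction": ex["instruction"],
                "input": ex.get("input", ""),
                "output": ex["output"],
                # Lineage: which source produced this example, and which
                # audit-log transformations touched it along the way
                "source_doc": ex["source_doc"],
                "transform_ids": ex["transform_ids"],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

export_jsonl([{
    "instruction": "Classify the clause type.",
    "input": "The parties agree to arbitrate all disputes.",
    "output": "arbitration",
    "source_doc": "contracts/msa_0017.pdf",
    "transform_ids": ["clean-0093", "redact-0112", "label-0458"],
}], "train.jsonl")
```

If downstream training tooling rejects the extra fields, a second pass can strip them into a sidecar lineage file; keeping them inline during export makes the audit relationship impossible to lose.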
Hardware requirements for export
| Workload | CPU | RAM | GPU | Notes |
|---|---|---|---|---|
| JSONL/CSV export | 2+ cores | 8 GB | Not required | I/O-bound, not compute-bound |
| Large-scale export (100K+ records) | 4+ cores | 16 GB | Not required | Disk speed matters more than CPU |
| Export with audit trail generation | 4+ cores | 16 GB | Not required | Audit trail can be larger than the data itself |
Transfer Mechanisms: Getting Software and Models into Air-Gapped Environments
The most overlooked aspect of air-gapped AI is initial setup. You cannot install software from the internet. You cannot download model weights. Everything must be transferred through approved physical channels.
Physical media transfer
The standard approach for classified and air-gapped environments:
- Prepare on a connected machine: Download all software installers, dependencies, model weights, and configuration files onto a clean, formatted drive
- Security scan: Run the media through your organization's malware scanning and security review process
- Chain of custody: Document who prepared the media, what it contains, and when it was transferred
- Install on the air-gapped machine: Copy files from approved media to the target system
- Verify integrity: Compare checksums (SHA-256) of installed files against the prepared manifest
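The integrity-verification step above can be sketched as a small manifest checker. It reads the `<sha256>  <path>` line format that `sha256sum` emits, so the manifest can be produced on the connected preparation machine with standard tooling:

```python
import hashlib
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MB chunks so multi-GB model weights fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest_path, root="."):
    """Compare files against '<sha256>  <relative path>' manifest lines.
    Returns the list of missing or mismatched paths."""
    failures = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        target = Path(root) / name.strip()
        if not target.exists() or sha256_file(target) != expected:
            failures.append(name.strip())
    return failures
```

A non-empty return value should block installation and be recorded in the audit log; a silent checksum mismatch on transferred model weights is exactly the failure an air-gapped process exists to catch.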
For model weights specifically: a 7B GGUF model is roughly 4-6 GB. A 70B model is 35-45 GB. USB drives or portable SSDs handle these sizes easily. Larger datasets (hundreds of GB of source documents) may require portable NAS devices or multiple drives.
One-way data diodes
For organizations with more sophisticated air-gapped networks, hardware data diodes provide a one-way transfer mechanism. Data flows into the air-gapped network but cannot flow out. This is used in defense and critical infrastructure environments where removable media is also restricted.
Data diodes allow automated, scheduled transfers of model updates and software patches into the air-gapped environment without creating any outbound data path.
What must be pre-staged
Before isolating the machine, transfer all of the following:
| Category | Specific items | Typical size |
|---|---|---|
| Application installers | AI pipeline software, annotation tools, inference runtime | 1-5 GB |
| Runtime dependencies | Python packages (wheel files), system libraries | 2-10 GB |
| OCR language packs | Tesseract language data, PaddleOCR models | 0.5-2 GB |
| NER models | spaCy models, quantized transformer models for PII detection | 1-5 GB |
| LLM weights | GGUF models for augmentation and AI-assisted labeling | 4-45 GB per model |
| Configuration files | Pipeline configs, export templates, audit trail schemas | <100 MB |
Total pre-staging for a complete air-gapped AI pipeline: approximately 10-70 GB, depending on how many LLM models you include.
Compliance Mapping: Who Actually Requires Air-Gapped?
Not every regulation requires air-gapped operation. Understanding which regulations require which deployment model prevents over-engineering.
| Regulation / Context | Air-gapped required? | On-premise sufficient? | Notes |
|---|---|---|---|
| US classified systems (ITAR, classified data) | Yes | No | Physical isolation required by policy |
| US CMMC Level 3+ (DoD contractors) | Often yes | Depends on data type | Controlled Unclassified Information handling |
| HIPAA (healthcare) | No (but recommended for PHI training data) | Yes | HIPAA requires safeguards, not specific deployment models |
| GDPR (EU) | No | Often sufficient | Requires data residency + processing controls; on-premise with audit trail satisfies most requirements |
| EU AI Act (high-risk systems) | No | Often sufficient | Requires documentation and audit trail; deployment model is not prescribed |
| India DPDP Act | No | May be required for significant data fiduciaries | Data localization for certain categories |
| Saudi Arabia PDPL | No | Effectively required for personal data | Processing within the Kingdom |
| Financial regulations (SOX, PCI-DSS) | No (except for specific high-security environments) | Yes | Strong access controls required; deployment model flexible |
| Critical infrastructure (NERC CIP) | Often yes for OT networks | Yes for IT networks | OT/IT segmentation is standard |
The practical guideline: Air-gapped is required for classified/defense data and critical infrastructure OT networks. On-premise is sufficient for most regulated industries (healthcare, finance, legal). Sovereign cloud (domestic provider) is acceptable for data that requires jurisdictional control but not physical isolation.
Putting It Together: Reference Architecture
A complete air-gapped AI pipeline for a regulated enterprise:
Hardware:
- Workstation or server: 16+ cores, 64 GB RAM, NVIDIA RTX 4090 (24 GB VRAM) or A6000 (48 GB VRAM)
- Local storage: 2+ TB NVMe SSD for active projects, plus NAS for archival
- Removable media station: for initial setup and periodic model/software updates
Software stack:
- OS: Linux (Ubuntu/RHEL) or Windows, fully updated before isolation
- Ingestion: Docling + PyMuPDF + Tesseract/PaddleOCR
- Cleaning: spaCy NER + regex patterns + custom rules
- Labeling: Native desktop annotation tool (no Docker, no browser dependencies)
- Augmentation: Ollama + Llama 3.1 8B (GGUF Q4)
- Export: JSONL + audit trail generator
- Inference runtime: Ollama, llama.cpp, or Foundry Local
Estimated hardware cost: $8,000-$15,000 for a workstation build (RTX 4090 class), or $20,000-$40,000 for a server build (A6000 class). Compare to cloud GPU costs of $2-$4/hour for equivalent compute — the on-premise hardware pays for itself in 6-18 months of continuous use.
This architecture handles the complete pipeline from raw documents to AI-ready training data, entirely within an air-gapped perimeter, with full audit trail at every stage.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
Related Reading
- Air-Gapped Machine Learning: How to Build AI Data Pipelines Without Internet Access — Conceptual overview of air-gapped vs on-premise vs self-hosted deployment, with tool analysis for each pipeline stage.
- Sovereign AI for Enterprise: What It Means and Why It Matters in 2026 — The three layers of AI sovereignty and why they matter for regulated enterprises.
- Sovereign AI vs Cloud AI: Data Residency Requirements by Country and Region — Country-by-country reference guide to data residency requirements for AI systems.