
Air-Gapped Machine Learning: How to Build AI Data Pipelines Without Internet Access
A practical guide to building AI data preparation and training pipelines in air-gapped environments — from document ingestion to dataset export — with no internet connectivity required at any stage.
"Air-gapped" is a term that gets used loosely in enterprise AI discussions. It often means "we do not want data leaving our network" or "we would prefer on-premise." These are legitimate requirements, but they are not the same as genuine air-gapped operation. In true air-gapped environments — classified government systems, critical infrastructure networks, high-security financial systems — there is no internet connection at all. Not restricted. Not monitored. Absent.
Building AI data preparation pipelines for these environments requires a different architecture than typical on-premise deployments. Every component must function without phoning home, checking for license updates, downloading model weights, or accessing external APIs. Most modern software fails this test in ways that are not obvious at installation time.
This guide covers the three deployment models (air-gapped, on-premise, self-hosted), who actually needs genuine air-gapped operation, what a complete ML data pipeline looks like without connectivity, and which tools fail in air-gapped environments.
Three Models: Air-Gapped, On-Premise, Self-Hosted
These terms are used interchangeably in vendor marketing. They are not the same. The table below includes SaaS as a baseline for comparison.
| Model | Infrastructure | Internet at runtime | Data stays in org | Regulatory use |
|---|---|---|---|---|
| SaaS / Cloud | Vendor's cloud | Yes | No | Rarely compliant |
| Self-hosted | Your servers, any location | Optional | Yes (with controls) | Conditionally compliant |
| On-premise | Hardware you own, in your building | Optional | Yes | Often compliant |
| Air-gapped | Hardware you own, physically isolated network | No | Yes | Compliant for the strictest requirements |
Self-hosted means you run the software on your own servers — but those servers may be in a cloud data center, and the software may still make external connections (for license validation, telemetry, model downloads, or dependency updates). Self-hosted is not air-gapped.
On-premise typically means software running on hardware in your facility. It may still make outbound connections for updates or telemetry. "On-premise" in vendor documentation often just means "you install it yourself."
Air-gapped means the host machine has no network connection to the internet and, in strict implementations, no connection to any external network. Software in an air-gapped environment cannot reach external services under any circumstances — not by accident and not by design.
The compliance implications differ:
- Self-hosted on a cloud provider's infrastructure: still subject to that provider's legal obligations and potential government access requests
- On-premise with internet access: can still exfiltrate data (intentionally or via a compromised component); does not satisfy "no data egress" requirements for highest-security environments
- Air-gapped: physically isolated; only attack vector is removable media or physical access; satisfies the most demanding data sovereignty requirements
Who Actually Needs Air-Gapped Operation
Genuine air-gapped requirements appear in specific contexts:
Defense and intelligence: Government contractors and agencies working with classified information operate under strict network segmentation requirements. AI development tools must be certified for operation on classified networks.
Critical infrastructure: Power grid operators, water treatment facilities, and similar operators are increasingly deploying AI for predictive maintenance and anomaly detection. Their operational technology (OT) networks are often isolated from corporate IT networks and have no internet connectivity.
Financial institutions and trading firms: High-frequency trading systems and certain risk models operate on isolated networks to prevent information leakage and ensure latency control. Some financial regulators require data used in certain models to remain in specific network environments.
Legal and regulatory proceedings: Law firms and litigation support teams working with privileged or court-sealed documents may be required to process those documents in environments with no external connectivity.
Healthcare with strict data governance: While HIPAA does not specifically require air-gapped operation, some healthcare organizations operating under state-level or contractual data handling requirements have chosen air-gapped environments as the only way to guarantee data isolation.
Cybersecurity operations: Security operations centers working with threat intelligence and incident data may operate on isolated networks to prevent adversary access to analysis tools.
A cybersecurity firm told us directly: "Most AI tools process inference over the cloud, making the data essentially public." For organizations where the training data is itself sensitive threat intelligence or classified information, that is an unacceptable risk — and air-gapped operation is the only alternative.
The Full Pipeline: What Each Stage Requires Without Connectivity
Stage 1: Document Ingestion
Document parsing in an air-gapped environment means all parsing logic — including OCR — must be bundled with the application and operate without external calls.
What fails: Cloud OCR APIs (Google Document AI, Azure Document Intelligence, AWS Textract). Any library that proxies OCR to an external service. Document parsers that check for model updates at runtime.
What works: Embedded OCR engines (Tesseract, EasyOCR, PaddleOCR) bundled with the application. Layout analysis models (for multi-column PDFs, tables, headers) loaded from local model files. Image preprocessing for scan quality enhancement running locally.
The practical challenge: embedded OCR is slower and sometimes less accurate than cloud API OCR. For a regulated enterprise where data cannot leave the network, that is an acceptable trade-off. Accuracy can be improved by preprocessing scans for quality and using domain-specific OCR configurations.
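To make this concrete, here is a minimal sketch of embedded OCR using pytesseract, assuming the Tesseract binary and its language data were installed from offline packages (the file name is hypothetical):

```python
# Minimal offline OCR sketch: Tesseract via pytesseract, with basic
# preprocessing. Assumes the Tesseract binary and its language data
# were installed from offline packages; nothing here leaves the machine.
from PIL import Image, ImageOps
import pytesseract

def ocr_page(path: str) -> str:
    """Extract text from one scanned page image, entirely locally."""
    img = Image.open(path)
    # Grayscale plus autocontrast often improves recognition on
    # low-quality scans before they reach the engine.
    img = ImageOps.autocontrast(ImageOps.grayscale(img))
    # --psm 6 assumes a single uniform block of text; tune per corpus.
    return pytesseract.image_to_string(img, lang="eng", config="--psm 6")

print(ocr_page("scanned_contract_p1.png"))  # hypothetical file name
```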
Stage 2: Cleaning and De-Identification
PII/PHI detection and redaction requires NLP models that can run locally. Named entity recognition for identifying names, dates, organizations, medical record numbers, and other sensitive entities must use locally loaded model weights.
What fails: Cloud NLP APIs (AWS Comprehend Medical, Google Healthcare Natural Language API, Azure Text Analytics for Health). Any PII detection tool that sends documents to an external endpoint.
What works: spaCy with locally loaded NER models, Hugging Face Transformers models loaded from local storage, and rule-based pattern matching for structured identifiers (phone numbers, SSNs, medical record numbers).
For air-gapped environments, model weights must be transferred via approved removable media during the initial setup phase. After that, the system operates entirely from local storage.
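A minimal sketch of this stage, combining a locally installed spaCy NER model with a rule-based pattern for SSNs; the entity labels, model choice, and example text are illustrative assumptions:

```python
# Minimal offline PII redaction sketch: a locally installed spaCy
# model plus rule-based patterns for structured identifiers.
# Assumes en_core_web_sm arrived as a wheel on approved removable
# media; nothing here calls an external endpoint.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # loaded from local site-packages
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
REDACT_LABELS = {"PERSON", "ORG", "DATE", "GPE"}  # illustrative set

def redact(text: str) -> str:
    doc = nlp(text)
    # Replace NER spans from the end so character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in REDACT_LABELS:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return SSN_RE.sub("[SSN]", text)

print(redact("John Smith (SSN 123-45-6789) visited Mercy Hospital on 2024-03-01."))
```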
Stage 3: Annotation
Annotation — the human labeling of documents for NER, classification, bounding boxes, or Q&A pairs — does not inherently require internet connectivity. The challenge is that most annotation platforms are web-based SaaS tools that require an active connection.
What fails: Label Studio Cloud, Scale AI, Amazon SageMaker Ground Truth, Labelbox, any browser-based annotation tool backed by external servers.
What works: Self-installable annotation tools with no external dependencies; annotation workflows built into local desktop applications; browser-based tools that can serve entirely from localhost with no external asset loading.
The annotation stage is where many air-gapped pipelines break down — teams assume they can "just use Label Studio self-hosted" without checking whether the self-hosted version makes external calls for analytics, CDN assets, or license validation.
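One way to catch this before a tool reaches the isolated network is a static sweep of its bundled web assets for external references. A minimal sketch, with a hypothetical install path; it only finds hard-coded references, so runtime traffic should still be verified with a network monitor:

```python
# Minimal sketch: audit a self-hosted annotation tool's static assets
# for external references (CDN scripts, fonts, analytics beacons)
# before it is approved for the isolated network.
import re
from pathlib import Path

# Matches http(s) URLs that do not point at the local machine.
EXTERNAL = re.compile(r"https?://(?!localhost|127\.0\.0\.1)[\w.-]+", re.I)

def audit_assets(root: str) -> None:
    for path in Path(root).rglob("*"):
        if path.suffix in {".html", ".js", ".css"}:
            for m in EXTERNAL.finditer(path.read_text(errors="ignore")):
                print(f"{path}: references {m.group(0)}")

audit_assets("/opt/annotation-tool/static")  # hypothetical install path
```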
Stage 4: Synthetic Data Augmentation
Generating synthetic training data using an LLM is one of the most internet-dependent operations in a modern AI pipeline. Cloud LLM APIs (OpenAI, Anthropic, Google, Cohere) are simply not available in an air-gapped environment.
What fails: Any augmentation workflow that calls an external LLM API. Distilabel and similar libraries when configured with cloud endpoints. Hugging Face Inference API.
What works: Locally hosted LLMs using Ollama or llama.cpp. GGUF-quantized models (Llama 3, Mistral, Qwen, and others) loaded from local storage. Inference running on local GPU resources.
The practical requirements:
- A machine with sufficient GPU VRAM (16GB minimum for useful 7B models; 48GB for 30B+ models)
- Model weights pre-downloaded and transferred via removable media to the air-gapped machine
- Ollama or llama.cpp installed without package manager internet access (offline installation packages required)
For most document augmentation use cases, a 7B or 13B quantized model running on a workstation GPU is sufficient. Quality is lower than frontier cloud models but adequate for generating training variants of structured documents.
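A minimal sketch of local augmentation through Ollama's REST API on localhost, assuming a model such as llama3 has already been transferred and loaded; the prompt and example record are illustrative:

```python
# Minimal sketch: generating a synthetic training variant with a
# locally hosted model through Ollama's REST API. The only network
# traffic is loopback to the local Ollama daemon.
import json
import urllib.request

def augment(document: str, model: str = "llama3") -> str:
    payload = json.dumps({
        "model": model,
        "prompt": f"Paraphrase this record, preserving all field values:\n{document}",
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # local daemon, not the internet
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(augment("Invoice #4412, vendor: Acme GmbH, total: EUR 1,240.00"))
```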
Stage 5: Export
Export — producing JSONL, YOLO/COCO, CSV, or chunked text from the annotated dataset — is the least connectivity-dependent stage. It is also where the audit trail must be finalized and exported alongside the training data.
What fails: Export pipelines that sync to cloud storage (S3, Azure Blob) as part of the export step. Versioning tools that use cloud-based artifact registries.
What works: Local file export to attached storage or an air-gapped internal network share. Local artifact versioning using git or similar tools without remote push.
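A minimal sketch of a local JSONL export that writes a checksum manifest alongside the data, so the audit artifact travels with the dataset; record structure and paths are illustrative assumptions:

```python
# Minimal sketch: export annotated records to JSONL plus a checksum
# manifest, entirely to local storage. No cloud sync, no remote push.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def export_jsonl(records: list[dict], out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    data_path = out / "train.jsonl"
    with data_path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    # SHA-256 of the exported file, written next to it for the audit trail.
    manifest = {
        "file": data_path.name,
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "records": len(records),
        "exported_at": datetime.now(timezone.utc).isoformat(),
    }
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))

export_jsonl([{"text": "sample clause text", "label": "contract"}], "/data/exports/run-001")
```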
Requirements for a Truly Air-Gapped ML Setup
Setting up an air-gapped AI data preparation environment requires planning before the machine is isolated. After isolation, you cannot download dependencies.
Pre-installation checklist:
- All application installers transferred via approved removable media
- All runtime dependencies (Python packages, system libraries) bundled or pre-installed
- All ML model weights downloaded and transferred (NER models, OCR models, LLMs)
- Annotation interface serving all assets from localhost (no external CDN references)
- License validation configured for offline operation or perpetual license
- Internal documentation and update procedures established
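A minimal pre-flight sketch to run once installation is complete, verifying that bundled packages and model files are present and that the host genuinely cannot reach the internet; package names and model paths are illustrative assumptions:

```python
# Minimal post-install readiness check for the isolated machine.
import importlib
import socket
from pathlib import Path

REQUIRED_PACKAGES = ["spacy", "pytesseract"]             # adjust per pipeline
REQUIRED_MODELS = [Path("/opt/models/en_core_web_sm")]   # adjust per pipeline

def check_offline_ready() -> None:
    for pkg in REQUIRED_PACKAGES:
        importlib.import_module(pkg)                     # raises if missing
    for model in REQUIRED_MODELS:
        assert model.exists(), f"missing model files: {model}"
    # On a correctly isolated host this connection attempt must fail.
    try:
        socket.create_connection(("8.8.8.8", 53), timeout=2)
        raise RuntimeError("host can reach the internet: not air-gapped")
    except OSError:
        pass  # expected: no route out

check_offline_ready()
print("offline readiness checks passed")
```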
Hardware requirements:
- GPU workstation or server for LLM augmentation (16–48GB VRAM depending on model size)
- Sufficient local storage for source documents, processed data, and model weights
- Internal network share for multi-user access (not internet-connected)
Operational procedures:
- Software updates via removable media review process
- Model updates reviewed and approved before transfer to isolated network
- Audit log backup to internal archive storage
Tools That Fail in Air-Gapped Environments
| Tool | Why It Fails |
|---|---|
| Unstructured.io cloud API | Cloud-only document parsing |
| Adobe Acrobat AI features | Cloud LLM processing |
| Label Studio Cloud | SaaS platform |
| Scale AI / Labelbox | Cloud annotation platforms |
| Cleanlab / Dataiku cloud | Cloud processing for quality scoring |
| Distilabel with cloud LLMs | Requires external LLM API |
| Hugging Face Inference API | Cloud inference endpoint |
| GitHub Copilot / cloud coding assistants | Requires internet connection |
How Ertas Data Suite Works in Air-Gapped Environments
Ertas Data Suite was designed for air-gapped operation from the ground up. It installs as a native desktop application — no Docker, no package manager internet access required during installation. All OCR, NER, and processing models are bundled. The annotation interface runs locally. The Augment module uses Ollama with locally hosted models and makes no external calls.
The entire pipeline — Ingest, Clean, Label, Augment, Export — runs without an internet connection at any stage. The audit trail is written to local storage and exported with the dataset. Software activation supports offline licensing for environments where license servers are not accessible.
For organizations with genuine air-gapped requirements, this architecture is not a feature — it is the minimum viable requirement. Any tool that makes an undocumented external call in an air-gapped environment is not just inconvenient; it is a security incident.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
Related Reading
- Data Sovereignty in AI: Why Regulated Industries Can't Use Cloud Data Prep Tools — The legal and regulatory requirements that drive air-gapped and on-premise AI deployments.
- On-Premise AI Data Preparation: The Compliance Guide for Regulated Industries — Full compliance overview for GDPR, HIPAA, EU AI Act, and data sovereignty.
- On-Premise vs Self-Hosted vs Air-Gapped AI: What the Difference Actually Means — Detailed comparison of deployment models and their compliance implications.