
Air-Gapped Machine Learning: How to Build AI Data Pipelines Without Internet Access
A practical guide to building AI data preparation and training pipelines in air-gapped environments — from document ingestion to dataset export — with no internet connectivity required at any stage.
"Air-gapped" is a term that gets used loosely in enterprise AI discussions. It often means "we do not want data leaving our network" or "we would prefer on-premise." These are legitimate requirements, but they are not the same as genuine air-gapped operation. In true air-gapped environments — classified government systems, critical infrastructure networks, high-security financial systems — there is no internet connection at all. Not restricted. Not monitored. Absent.
Building AI data preparation pipelines for these environments requires a different architecture than typical on-premise deployments. Every component must function without phoning home, checking for license updates, downloading model weights, or accessing external APIs. Most modern software fails this test in ways that are not obvious at installation time.
This guide covers the three deployment models (air-gapped, on-premise, self-hosted), who actually needs genuine air-gapped operation, what a complete ML data pipeline looks like without connectivity, and which tools fail in air-gapped environments.
Three Models: Air-Gapped, On-Premise, Self-Hosted
These terms are used interchangeably in vendor marketing. They are not the same. The table below includes SaaS as a baseline for comparison.
| Model | Infrastructure | Internet at runtime | Data stays in org | Regulatory use |
|---|---|---|---|---|
| SaaS / Cloud | Vendor's cloud | Yes | No | Rarely compliant |
| Self-hosted | Your servers, any location | Optional | Yes (with controls) | Conditionally compliant |
| On-premise | Hardware you own, in your building | Optional | Yes | Often compliant |
| Air-gapped | Hardware you own, physically isolated network | No | Yes | Compliant for the strictest requirements |
Self-hosted means you run the software on your own servers — but those servers may be in a cloud data center, and the software may still make external connections (for license validation, telemetry, model downloads, or dependency updates). Self-hosted is not air-gapped.
On-premise typically means software running on hardware in your facility. It may still make outbound connections for updates or telemetry. "On-premise" in vendor documentation often just means "you install it yourself."
Air-gapped means the host machine has no network connection to the internet and, in strict implementations, no connection to any external network. Software in an air-gapped environment cannot reach external services under any circumstances — not by accident and not by design.
The compliance implications differ:
- Self-hosted on a cloud provider's infrastructure: still subject to that provider's legal obligations and potential government access requests
- On-premise with internet access: can still exfiltrate data (intentionally or via a compromised component); does not satisfy "no data egress" requirements for highest-security environments
- Air-gapped: physically isolated; only attack vector is removable media or physical access; satisfies the most demanding data sovereignty requirements
Who Actually Needs Air-Gapped Operation
Genuine air-gapped requirements appear in specific contexts:
Defense and intelligence: Government contractors and agencies working with classified information operate under strict network segmentation requirements. AI development tools must be certified for operation on classified networks.
Critical infrastructure: Power grid operators, water treatment facilities, and similar operators are increasingly deploying AI for predictive maintenance and anomaly detection. Their operational technology (OT) networks are often isolated from corporate IT networks and have no internet connectivity.
Financial institutions and trading firms: High-frequency trading systems and certain risk models operate on isolated networks to prevent information leakage and ensure latency control. Some financial regulators require data used in certain models to remain in specific network environments.
Legal and regulatory proceedings: Law firms and litigation support teams working with privileged or court-sealed documents may be required to process those documents in environments with no external connectivity.
Healthcare with strict data governance: While HIPAA does not specifically require air-gapped operation, some healthcare organizations operating under state-level or contractual data handling requirements have chosen air-gapped environments as the only way to guarantee data isolation.
Cybersecurity operations: Security operations centers working with threat intelligence and incident data may operate on isolated networks to prevent adversary access to analysis tools.
A cybersecurity firm told us directly: "Most AI tools process inference over the cloud, making the data essentially public." For organizations where the training data is itself sensitive threat intelligence or classified information, that is an unacceptable risk — and air-gapped operation is the only alternative.
The Full Pipeline: What Each Stage Requires Without Connectivity
Stage 1: Document Ingestion
Document parsing in an air-gapped environment means all parsing logic — including OCR — must be bundled with the application and operate without external calls.
What fails: Cloud OCR APIs (Google Document AI, Azure Document Intelligence, AWS Textract). Any library that proxies OCR to an external service. Document parsers that check for model updates at runtime.
What works: Embedded OCR engines (Tesseract, EasyOCR, PaddleOCR) bundled with the application. Layout analysis models (for multi-column PDFs, tables, headers) loaded from local model files. Image preprocessing for scan quality enhancement running locally.
The practical challenge: embedded OCR is slower and sometimes less accurate than cloud API OCR. For a regulated enterprise where data cannot leave the network, that is an acceptable trade-off. Accuracy can be improved by preprocessing scans for quality and using domain-specific OCR configurations.
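To make this concrete, here is a minimal sketch of embedded OCR using pytesseract, assuming the Tesseract binary and its language data were installed from offline packages (the file name is hypothetical):

```python
# Minimal offline OCR sketch: Tesseract via pytesseract, with basic
# preprocessing. Assumes the Tesseract binary and its language data
# were installed from offline packages; nothing here leaves the machine.
from PIL import Image, ImageOps
import pytesseract

def ocr_page(path: str) -> str:
    """Extract text from one scanned page image, entirely locally."""
    img = Image.open(path)
    # Grayscale plus autocontrast often improves recognition on
    # low-quality scans before they reach the engine.
    img = ImageOps.autocontrast(ImageOps.grayscale(img))
    # --psm 6 assumes a single uniform block of text; tune per corpus.
    return pytesseract.image_to_string(img, lang="eng", config="--psm 6")

print(ocr_page("scanned_contract_p1.png"))  # hypothetical file name
```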
Stage 2: Cleaning and De-Identification
PII/PHI detection and redaction requires NLP models that can run locally. Named entity recognition for identifying names, dates, organizations, medical record numbers, and other sensitive entities must use locally loaded model weights.
What fails: Cloud NLP APIs (AWS Comprehend Medical, Google Healthcare Natural Language API, Azure Text Analytics for Health). Any PII detection tool that sends documents to an external endpoint.
What works: spaCy with locally loaded NER models, Hugging Face Transformers models loaded from local storage, and rule-based pattern matching for structured identifiers (phone numbers, SSNs, medical record numbers).
For air-gapped environments, model weights must be transferred via approved removable media during the initial setup phase. After that, the system operates entirely from local storage.
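A minimal sketch of this stage, combining a locally installed spaCy NER model with a rule-based pattern for SSNs; the entity labels, model choice, and example text are illustrative assumptions:

```python
# Minimal offline PII redaction sketch: a locally installed spaCy
# model plus rule-based patterns for structured identifiers.
# Assumes en_core_web_sm arrived as a wheel on approved removable
# media; nothing here calls an external endpoint.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # loaded from local site-packages
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
REDACT_LABELS = {"PERSON", "ORG", "DATE", "GPE"}  # illustrative set

def redact(text: str) -> str:
    doc = nlp(text)
    # Replace NER spans from the end so character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in REDACT_LABELS:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return SSN_RE.sub("[SSN]", text)

print(redact("John Smith (SSN 123-45-6789) visited Mercy Hospital on 2024-03-01."))
```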
Stage 3: Annotation
Annotation — the human labeling of documents for NER, classification, bounding boxes, or Q&A pairs — does not inherently require internet connectivity. The challenge is that most annotation platforms are web-based SaaS tools that require an active connection.
What fails: Label Studio Cloud, Scale AI, Amazon SageMaker Ground Truth, Labelbox, any browser-based annotation tool backed by external servers.
What works: Self-installable annotation tools with no external dependencies; annotation workflows built into local desktop applications; browser-based tools that can serve entirely from localhost with no external asset loading.
The annotation stage is where many air-gapped pipelines break down — teams assume they can "just use Label Studio self-hosted" without checking whether the self-hosted version makes external calls for analytics, CDN assets, or license validation.
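One way to catch this before a tool reaches the isolated network is a static sweep of its bundled web assets for external references. A minimal sketch, with a hypothetical install path; it only finds hard-coded references, so runtime traffic should still be verified with a network monitor:

```python
# Minimal sketch: audit a self-hosted annotation tool's static assets
# for external references (CDN scripts, fonts, analytics beacons)
# before it is approved for the isolated network.
import re
from pathlib import Path

# Matches http(s) URLs that do not point at the local machine.
EXTERNAL = re.compile(r"https?://(?!localhost|127\.0\.0\.1)[\w.-]+", re.I)

def audit_assets(root: str) -> None:
    for path in Path(root).rglob("*"):
        if path.suffix in {".html", ".js", ".css"}:
            for m in EXTERNAL.finditer(path.read_text(errors="ignore")):
                print(f"{path}: references {m.group(0)}")

audit_assets("/opt/annotation-tool/static")  # hypothetical install path
```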
Stage 4: Synthetic Data Augmentation
Generating synthetic training data using an LLM is one of the most internet-dependent operations in a modern AI pipeline. Cloud LLM APIs (OpenAI, Anthropic, Google, Cohere) are simply not available in an air-gapped environment.
What fails: Any augmentation workflow that calls an external LLM API. Distilabel and similar libraries when configured with cloud endpoints. Hugging Face Inference API.
What works: Locally hosted LLMs using Ollama or llama.cpp. GGUF-quantized models (Llama 3, Mistral, Qwen, and others) loaded from local storage. Inference running on local GPU resources.
The practical requirements:
- A machine with sufficient GPU VRAM (16GB minimum for useful 7B models; 48GB for 30B+ models)
- Model weights pre-downloaded and transferred via removable media to the air-gapped machine
- Ollama or llama.cpp installed without package manager internet access (offline installation packages required)
For most document augmentation use cases, a 7B or 13B quantized model running on a workstation GPU is sufficient. Quality is lower than frontier cloud models but adequate for generating training variants of structured documents.
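A minimal sketch of local augmentation through Ollama's REST API on localhost, assuming a model such as llama3 has already been transferred and loaded; the prompt and example record are illustrative:

```python
# Minimal sketch: generating a synthetic training variant with a
# locally hosted model through Ollama's REST API. The only network
# traffic is loopback to the local Ollama daemon.
import json
import urllib.request

def augment(document: str, model: str = "llama3") -> str:
    payload = json.dumps({
        "model": model,
        "prompt": f"Paraphrase this record, preserving all field values:\n{document}",
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # local daemon, not the internet
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(augment("Invoice #4412, vendor: Acme GmbH, total: EUR 1,240.00"))
```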
Stage 5: Export
Export — producing JSONL, YOLO/COCO, CSV, or chunked text from the annotated dataset — is the least connectivity-dependent stage. It is also where the audit trail must be finalized and exported alongside the training data.
What fails: Export pipelines that sync to cloud storage (S3, Azure Blob) as part of the export step. Versioning tools that use cloud-based artifact registries.
What works: Local file export to attached storage or an air-gapped internal network share. Local artifact versioning using git or similar tools without remote push.
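A minimal sketch of a local JSONL export that writes a checksum manifest alongside the data, so the audit artifact travels with the dataset; record structure and paths are illustrative assumptions:

```python
# Minimal sketch: export annotated records to JSONL plus a checksum
# manifest, entirely to local storage. No cloud sync, no remote push.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def export_jsonl(records: list[dict], out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    data_path = out / "train.jsonl"
    with data_path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    # SHA-256 of the exported file, written next to it for the audit trail.
    manifest = {
        "file": data_path.name,
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "records": len(records),
        "exported_at": datetime.now(timezone.utc).isoformat(),
    }
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))

export_jsonl([{"text": "sample clause text", "label": "contract"}], "/data/exports/run-001")
```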
Requirements for a Truly Air-Gapped ML Setup
Setting up an air-gapped AI data preparation environment requires planning before the machine is isolated. After isolation, you cannot download dependencies.
Pre-installation checklist:
- All application installers transferred via approved removable media
- All runtime dependencies (Python packages, system libraries) bundled or pre-installed
- All ML model weights downloaded and transferred (NER models, OCR models, LLMs)
- Annotation interface serving all assets from localhost (no external CDN references)
- License validation configured for offline operation or perpetual license
- Internal documentation and update procedures established
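A minimal pre-flight sketch to run once installation is complete, verifying that bundled packages and model files are present and that the host genuinely cannot reach the internet; package names and model paths are illustrative assumptions:

```python
# Minimal post-install readiness check for the isolated machine.
import importlib
import socket
from pathlib import Path

REQUIRED_PACKAGES = ["spacy", "pytesseract"]             # adjust per pipeline
REQUIRED_MODELS = [Path("/opt/models/en_core_web_sm")]   # adjust per pipeline

def check_offline_ready() -> None:
    for pkg in REQUIRED_PACKAGES:
        importlib.import_module(pkg)                     # raises if missing
    for model in REQUIRED_MODELS:
        assert model.exists(), f"missing model files: {model}"
    # On a correctly isolated host this connection attempt must fail.
    try:
        socket.create_connection(("8.8.8.8", 53), timeout=2)
        raise RuntimeError("host can reach the internet: not air-gapped")
    except OSError:
        pass  # expected: no route out

check_offline_ready()
print("offline readiness checks passed")
```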
Hardware requirements:
- GPU workstation or server for LLM augmentation (16–48GB VRAM depending on model size)
- Sufficient local storage for source documents, processed data, and model weights
- Internal network share for multi-user access (not internet-connected)
Operational procedures:
- Software updates via removable media review process
- Model updates reviewed and approved before transfer to isolated network
- Audit log backup to internal archive storage
Tools That Fail in Air-Gapped Environments
| Tool | Why It Fails |
|---|---|
| Unstructured.io cloud API | Cloud-only document parsing |
| Adobe Acrobat AI features | Cloud LLM processing |
| Label Studio Cloud | SaaS platform |
| Scale AI / Labelbox | Cloud annotation platforms |
| Cleanlab / Dataiku cloud | Cloud processing for quality scoring |
| Distilabel with cloud LLMs | Requires external LLM API |
| Hugging Face Inference API | Cloud inference endpoint |
| GitHub Copilot / cloud coding assistants | Requires internet connection |
How Ertas Data Suite Works in Air-Gapped Environments
Ertas Data Suite was designed for air-gapped operation from the ground up. It installs as a native desktop application — no Docker, no package manager internet access required during installation. All OCR, NER, and processing models are bundled. The annotation interface runs locally. The Augment module uses Ollama with locally hosted models and makes no external calls.
The entire pipeline — Ingest, Clean, Label, Augment, Export — runs without an internet connection at any stage. The audit trail is written to local storage and exported with the dataset. Software activation supports offline licensing for environments where license servers are not accessible.
For organizations with genuine air-gapped requirements, this architecture is not a feature — it is the minimum viable requirement. Any tool that makes an undocumented external call in an air-gapped environment is not just inconvenient; it is a security incident.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
Related Reading
- Data Sovereignty in AI: Why Regulated Industries Can't Use Cloud Data Prep Tools — The legal and regulatory requirements that drive air-gapped and on-premise AI deployments.
- On-Premise AI Data Preparation: The Compliance Guide for Regulated Industries — Full compliance overview for GDPR, HIPAA, EU AI Act, and data sovereignty.
- On-Premise vs Self-Hosted vs Air-Gapped AI: What the Difference Actually Means — Detailed comparison of deployment models and their compliance implications.