
Air-Gapped Data Prep for Government and Defense AI Contractors
Technical guide to running AI data preparation pipelines in genuinely air-gapped government and defense environments with no internet connectivity.
Government and defense AI contracts operate under constraints that most commercial AI teams never encounter. The most significant: genuine air-gapped operation. Not "private cloud." Not "VPN-isolated." No internet. No external network connectivity at all. The workstation where you prepare training data may be in a SCIF, on a classified network, or in a facility where the Ethernet cable to the outside world does not exist.
This changes everything about your data preparation pipeline. Most modern AI tools assume internet connectivity at some point — for license validation, model weight downloads, OCR API calls, auto-updates, or telemetry. In an air-gapped environment, any tool that phones home is a tool that does not work.
This guide covers the technical requirements for running AI data preparation in air-gapped government and defense environments, what breaks, and how to architect a pipeline that functions with zero internet dependency.
What "Air-Gapped" Means in Government and Defense
An air-gapped system has no connection to any external network. This is not a configuration option — it is a physical network architecture enforced by the facility.
Classification Levels and Network Implications
| Network | Classification | Internet Access | Description |
|---|---|---|---|
| NIPRNet | Unclassified (CUI) | Yes, filtered | Department of Defense unclassified network |
| SIPRNet | Secret | No | Secret-level classified network |
| JWICS | Top Secret/SCI | No | Joint Worldwide Intelligence Communications System |
| Stand-alone | Varies | No | Physically isolated workstations |
For Secret and above, the working environment is air-gapped by definition. But even at the CUI (Controlled Unclassified Information) level, many government facilities operate air-gapped environments as a security posture choice, particularly for data preparation involving sensitive datasets.
Security Clearance Implications
Personnel working in classified environments must hold appropriate clearances. This affects your staffing model: you cannot assign just any available data engineer to a classified project. Annotators, engineers, and QA staff must all be cleared to the appropriate level.
For service providers, this means your team for government AI work is a subset of your total staff, and you cannot easily scale it.
What Breaks in Air-Gapped Environments
License Validation
Many commercial and open-source tools validate licenses by contacting an external server at startup or periodically during use. In an air-gapped environment, this validation fails, and the tool either refuses to start or operates in degraded mode.
Affected tools: Commercial labeling platforms, some IDE extensions, cloud-linked subscriptions, SaaS tools with local installers.
Workaround: Negotiate offline license keys with vendors before deployment. Some vendors offer hardware-locked licenses or USB dongles. Others simply do not support offline use.
Auto-Updates
Tools that check for updates on startup will either fail quietly (stalling until network timeouts expire) or fail loudly (refusing to start). Either way, in an air-gapped environment, the version you deploy is the version you run until you manually update.
Implication: Version management becomes your responsibility. Pin every dependency, document every version, and test the complete stack before deploying to the air-gapped environment.
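A minimal sketch of version pinning on the connected build system, using standard pip commands (the `pip-compile` line assumes the third-party pip-tools package, if your security process permits it):

```shell
# On the connected build machine: record the exact working set of packages.
python -m pip freeze > requirements.lock

# Optionally regenerate the lock file with checksums via pip-tools, so
# transferred packages can be verified with pip's --require-hashes mode:
#   pip-compile --generate-hashes -o requirements.lock requirements.in

# Any rebuild installs only those pinned versions:
python -m pip install -r requirements.lock
```

Check the lock file into version control alongside the pipeline code, so the deployed stack is reproducible and auditable.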
Cloud OCR and Parsing APIs
Many document parsing tools — including some configurations of Unstructured.io and most commercial OCR platforms — send documents to cloud APIs for processing. In an air-gapped environment, these calls fail.
Affected tools: Unstructured.io (cloud mode), Azure Document Intelligence, Google Document AI, Amazon Textract.
Alternative: Use parsing tools that run entirely locally. Docling, Unstructured.io in local mode (with local model weights pre-loaded), Tesseract OCR (local), or surya-ocr for layout detection.
Model Weight Downloads
NER models, embedding models, and language models used for data augmentation or PII detection typically download weights from Hugging Face, PyPI, or custom repositories on first use. In an air-gapped environment, this download fails.
Workaround: Pre-download all model weights on a connected system, verify their integrity (checksums), transfer them to the air-gapped environment via approved media, and configure tools to load from local paths.
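The checksum step can be sketched as a small verification script run on the air-gapped side after media transfer. The manifest of expected hashes is generated on the connected system; `verify_weights` and the manifest format here are illustrative names, not a standard:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large weight files never load fully into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_weights(weights_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files whose checksum does not match the connected-side manifest."""
    return [name for name, expected in manifest.items()
            if sha256_file(weights_dir / name) != expected]
```

For Hugging Face-based tools, setting `HF_HUB_OFFLINE=1` and `TRANSFORMERS_OFFLINE=1` in the environment forces loading from the local cache and turns any accidental download attempt into an immediate, visible error.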
Package Managers and Dependency Resolution
`pip install`, `npm install`, `cargo build` — all of these reach out to external registries. In an air-gapped environment, they fail.
Workaround: Build and test your complete environment on a connected system, then transfer it as a pre-built package (Docker image, virtual environment archive, or installer bundle). On the air-gapped system, install from the local package.
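For Python, the standard pip workflow for this looks roughly like the following (directory names are illustrative):

```shell
# Connected machine: download every wheel, including transitive dependencies.
python -m pip download -r requirements.lock -d ./wheelhouse

# Air-gapped machine, after transfer via approved media: install with
# registry access explicitly disabled.
python -m pip install --no-index --find-links ./wheelhouse -r requirements.lock
```

Note that `pip download` fetches wheels for the machine it runs on; if the connected and air-gapped systems differ in OS or architecture, pass `--platform` together with `--only-binary=:all:` to target the deployment environment.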
Pre-Deployment Checklist
Before deploying any data preparation pipeline to an air-gapped environment, verify the following:
Software Bundle
- All application binaries included and tested
- All model weights pre-loaded (NER, OCR, embedding, LLM if used)
- All Python/Node/Rust dependencies bundled (no network resolution required)
- License keys configured for offline operation
- Auto-update mechanisms disabled
- Telemetry and analytics disabled
- All configuration files pre-set for local-only operation
Infrastructure
- No Docker registry pulls required at runtime (images pre-loaded or not using Docker)
- No Kubernetes cluster required (unless the facility provides one)
- Database runs locally (SQLite, local PostgreSQL, or embedded)
- No external API calls in any code path (including error reporting, crash analytics)
- File paths configured for the target system (no hardcoded cloud storage paths)
Verification
- Full pipeline tested end-to-end with network cable physically disconnected
- All file imports/exports tested with local file system only
- All model inference tested with pre-loaded weights
- Audit logging verified to write to local storage
- Export functions verified to produce local files (no cloud upload paths)
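Before the physical disconnect test, the "no network calls in any code path" check can be approximated in software. A minimal sketch (the `NetworkGuard` name is illustrative) monkeypatches `socket.socket.connect` so any outbound connection attempt raises immediately, which surfaces hidden telemetry or lazy downloads during a dry run:

```python
import socket

class NetworkGuard:
    """Context manager that makes any outbound socket connection raise,
    approximating a disconnected network cable during pipeline testing."""

    def __enter__(self):
        self._original_connect = socket.socket.connect

        def blocked_connect(sock, address):
            raise RuntimeError(f"Blocked network call to {address}")

        socket.socket.connect = blocked_connect
        return self

    def __exit__(self, exc_type, exc, tb):
        socket.socket.connect = self._original_connect
        return False
```

Run the full pipeline inside `with NetworkGuard():` on the connected staging system; any stack trace it produces points at a dependency that would have failed silently in the air-gapped facility. This does not replace the physical disconnect test, but it catches problems earlier and cheaper.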
The Native Desktop Advantage
In classified and air-gapped environments, infrastructure is constrained. You may not have access to a Kubernetes cluster, a Docker runtime, or even administrator privileges to install system packages. The workstation may be a locked-down Windows machine with a standard government image.
This is where application architecture matters. Tools that require Docker, Kubernetes, or complex server infrastructure are difficult to deploy in these environments. Tools that run as native desktop applications — installed from a single binary with no external dependencies — are dramatically easier.
The difference in practice:
| Requirement | Web App (Docker/K8s) | Native Desktop App |
|---|---|---|
| Installation complexity | High (container runtime, orchestration, networking) | Low (single installer) |
| Admin privileges required | Usually yes | Often no |
| Infrastructure dependencies | Docker daemon, orchestrator, load balancer | None |
| Port/network configuration | Required (even for local) | Not required |
| Deployment on locked-down workstations | Difficult | Straightforward |
| Offline operation | Requires pre-pulled images | Built-in |
For government and defense work, native desktop applications eliminate an entire category of deployment problems.
Data Transfer: Getting Data In and Out
In air-gapped environments, data moves via approved physical media. The specifics depend on the facility's security procedures, but common mechanisms include:
Removable Media
USB drives, external hard drives, or optical media that have been approved by the facility's security office. Data transferred to the air-gapped system must be scanned and approved. Data transferred out must go through a review process.
Cross-Domain Solutions (CDS)
Hardware devices that mediate data transfer between networks of different classification levels. These enforce content inspection, data format restrictions, and security policy. Transfers through a CDS are logged and auditable.
Sneakernet Implications for Your Pipeline
Your pipeline must support import and export via file system paths, not network endpoints. "Upload from URL" features are useless. "Connect to S3 bucket" is irrelevant. The pipeline must read from and write to local directories, with clear file naming and manifest documentation so the data transfer process can be audited.
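A manifest that inventories every exported file with its checksum gives the security review a concrete artifact to work from. A minimal sketch, assuming a flat or nested export directory (the `write_manifest` name and JSON layout are illustrative, not a government standard):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def write_manifest(export_dir: Path) -> Path:
    """Write manifest.json listing every exported file with size and SHA-256,
    so the media transfer review can verify the package is complete and intact."""
    entries = []
    for f in sorted(export_dir.rglob("*")):
        if f.is_file() and f.name != "manifest.json":
            entries.append({
                "path": str(f.relative_to(export_dir)),
                "bytes": f.stat().st_size,
                "sha256": hashlib.sha256(f.read_bytes()).hexdigest(),
            })
    manifest = {
        "generated": datetime.now(timezone.utc).isoformat(),
        "files": entries,
    }
    out = export_dir / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2))
    return out
```

The same manifest can be re-verified on the receiving side, which matters when data crosses a CDS or moves on removable media that passes through multiple hands.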
Export formats must be self-contained. A training dataset export that references external files, requires network resolution, or depends on a running server is unusable in this context.
NIST and FedRAMP Considerations
NIST SP 800-171
For CUI (Controlled Unclassified Information), NIST SP 800-171 specifies 110 security requirements across 14 families. Relevant to data preparation:
- Access Control (AC): Limit system access to authorized users. Enforce least privilege. Log access events.
- Audit and Accountability (AU): Create, protect, and retain audit records. Ensure individual accountability.
- Configuration Management (CM): Establish and enforce security configuration settings. Track changes.
- System and Information Integrity (SI): Monitor systems and take action on detected flaws.
Your data preparation tools must support these requirements: user authentication, audit logging, configuration management, and integrity verification.
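For the AU family, an append-only, tamper-evident log is the core primitive. A minimal sketch of a hash-chained JSON-lines audit log, where each record commits to the hash of the previous one so after-the-fact edits are detectable (the function name and record fields are illustrative):

```python
import hashlib
import json
import time

def append_audit_event(log_path: str, user: str, action: str, detail: dict) -> str:
    """Append a hash-chained audit record; returns the new record's hash."""
    prev_hash = "0" * 64  # genesis value for an empty log
    try:
        with open(log_path, "rb") as f:
            lines = f.read().splitlines()
        if lines:
            prev_hash = json.loads(lines[-1])["hash"]
    except FileNotFoundError:
        pass
    record = {"ts": time.time(), "user": user, "action": action,
              "detail": detail, "prev": prev_hash}
    # Hash the record (minus its own hash field) with stable key ordering.
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    with open(log_path, "a") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record["hash"]
```

A verifier walks the file and recomputes each hash against the stored `prev` value; any inserted, deleted, or altered line breaks the chain from that point forward.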
FedRAMP
If your tools are cloud-based and being used for federal work, they must be FedRAMP authorized. In an air-gapped environment, FedRAMP is less relevant because you are not using cloud services. But if any part of your pipeline runs on a government cloud (GovCloud, milCloud), FedRAMP authorization applies.
CMMC (Cybersecurity Maturity Model Certification)
For defense contractors, CMMC certification may be required. CMMC Level 2 aligns with NIST SP 800-171. Your data preparation processes must be documented and auditable to support CMMC assessment.
Practical Architecture for Air-Gapped Data Prep
Recommended Stack
- Document parsing: Docling (local) or Tesseract + layout detection model (pre-loaded)
- Text cleaning: Python scripts with all dependencies bundled in a virtual environment
- PII/PHI redaction: Local NER model (spaCy or fine-tuned BERT, weights pre-loaded) + regex patterns
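The regex half of that redaction step can be sketched as follows. These patterns are illustrative US-format examples only; a production pipeline pairs them with a local NER model for names, addresses, and context-dependent identifiers the patterns cannot catch:

```python
import re

# Illustrative patterns for common US-format PII. Real deployments need
# broader coverage (international formats, MRNs, case numbers, etc.).
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each PII match with a labeled placeholder, preserving structure."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Everything here runs on the Python standard library, which is exactly the property that matters in this environment: no model download, no API call, no failure mode when the network does not exist.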
- Labeling: Native desktop application with local database and audit logging
- Augmentation: Local LLM (Llama 3.1 8B or similar, weights pre-loaded) or rule-based methods
- Export: Local file output with manifest and lineage documentation
What to Avoid
- Any tool that requires a network call at any point in its operation
- Docker-based deployments (unless the facility explicitly supports Docker)
- Python packages that lazy-load model weights from Hugging Face at runtime
- Tools with embedded analytics or telemetry
- Cloud-first platforms with "offline mode" that has not been thoroughly tested
Ertas Data Suite is built as a native desktop application using Tauri 2.0 (Rust + React). It operates fully offline with no internet dependency at any stage. All five modules (Ingest → Clean → Label → Augment → Export) run locally with pre-bundled dependencies. There is no license phone-home, no telemetry, no cloud API calls. It installs from a single binary, runs without Docker or Kubernetes, and produces exportable audit trails and training datasets as local files — making it deployable in air-gapped government environments without infrastructure modification.
Conclusion
Air-gapped data preparation is not a modified version of cloud data preparation. It is a fundamentally different operational environment with constraints that eliminate most of the modern AI toolchain. The service providers who succeed in government and defense AI work are the ones who plan for these constraints from the beginning — pre-bundling dependencies, testing offline, deploying native applications, and building export workflows that produce self-contained deliverables.
The market opportunity is substantial and growing. Government AI spending is increasing, and the compliance bar is a moat that keeps out providers who have not invested in the infrastructure to meet it.