
Air-Gapped Data Prep for Government and Defense AI Contractors
Technical guide to running AI data preparation pipelines in genuinely air-gapped government and defense environments with no internet connectivity.
Government and defense AI contracts operate under constraints that most commercial AI teams never encounter. The most significant: genuine air-gapped operation. Not "private cloud." Not "VPN-isolated." No internet. No external network connectivity at all. The workstation where you prepare training data may be in a SCIF, on a classified network, or in a facility where the Ethernet cable to the outside world does not exist.
This changes everything about your data preparation pipeline. Most modern AI tools assume internet connectivity at some point — for license validation, model weight downloads, OCR API calls, auto-updates, or telemetry. In an air-gapped environment, any tool that phones home is a tool that does not work.
This guide covers the technical requirements for running AI data preparation in air-gapped government and defense environments, what breaks, and how to architect a pipeline that functions with zero internet dependency.
What "Air-Gapped" Means in Government and Defense
An air-gapped system has no connection to any external network. This is not a configuration option — it is a physical network architecture enforced by the facility.
Classification Levels and Network Implications
| Network | Classification | Internet Access | Description |
|---|---|---|---|
| NIPRNet | Unclassified (CUI) | Yes, filtered | Department of Defense unclassified network |
| SIPRNet | Secret | No | Secret-level classified network |
| JWICS | Top Secret/SCI | No | Joint Worldwide Intelligence Communications System |
| Stand-alone | Varies | No | Physically isolated workstations |
For Secret and above, the working environment is air-gapped by definition. But even at the CUI (Controlled Unclassified Information) level, many government facilities operate air-gapped environments as a security posture choice, particularly for data preparation involving sensitive datasets.
Security Clearance Implications
Personnel working in classified environments must hold appropriate clearances. This affects your staffing model: you cannot assign just any available data engineer to a classified project. Annotators, engineers, and QA staff must all be cleared to the appropriate level.
For service providers, this means your team for government AI work is a subset of your total staff, and you cannot easily scale it.
What Breaks in Air-Gapped Environments
License Validation
Many commercial and open-source tools validate licenses by contacting an external server at startup or periodically during use. In an air-gapped environment, this validation fails, and the tool either refuses to start or operates in degraded mode.
Affected tools: Commercial labeling platforms, some IDE extensions, cloud-linked subscriptions, SaaS tools with local installers.
Workaround: Negotiate offline license keys with vendors before deployment. Some vendors offer hardware-locked licenses or USB dongles. Others simply do not support offline use.
Auto-Updates
Tools that check for updates on startup will either fail quietly (stalling until network timeouts expire) or fail loudly (refusing to start). Either way, in an air-gapped environment, the version you deploy is the version you run until you manually update.
Implication: Version management becomes your responsibility. Pin every dependency, document every version, and test the complete stack before deploying to the air-gapped environment.
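A minimal sketch of version pinning on the connected build system, using standard pip commands (the `pip-compile` line assumes the third-party pip-tools package, if your security process permits it):

```shell
# On the connected build machine: record the exact working set of packages.
python -m pip freeze > requirements.lock

# Optionally regenerate the lock file with checksums via pip-tools, so
# transferred packages can be verified with pip's --require-hashes mode:
#   pip-compile --generate-hashes -o requirements.lock requirements.in

# Any rebuild installs only those pinned versions:
python -m pip install -r requirements.lock
```

Check the lock file into version control alongside the pipeline code, so the deployed stack is reproducible and auditable.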
Cloud OCR and Parsing APIs
Many document parsing tools — including some configurations of Unstructured.io and most commercial OCR platforms — send documents to cloud APIs for processing. In an air-gapped environment, these calls fail.
Affected tools: Unstructured.io (cloud mode), Azure Document Intelligence, Google Document AI, Amazon Textract.
Alternative: Use parsing tools that run entirely locally. Docling, Unstructured.io in local mode (with local model weights pre-loaded), Tesseract OCR (local), or surya-ocr for layout detection.
Model Weight Downloads
NER models, embedding models, and language models used for data augmentation or PII detection typically download weights from Hugging Face, PyPI, or custom repositories on first use. In an air-gapped environment, this download fails.
Workaround: Pre-download all model weights on a connected system, verify their integrity (checksums), transfer them to the air-gapped environment via approved media, and configure tools to load from local paths.
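The checksum step can be sketched as a small verification script run on the air-gapped side after media transfer. The manifest of expected hashes is generated on the connected system; `verify_weights` and the manifest format here are illustrative names, not a standard:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large weight files never load fully into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_weights(weights_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files whose checksum does not match the connected-side manifest."""
    return [name for name, expected in manifest.items()
            if sha256_file(weights_dir / name) != expected]
```

For Hugging Face-based tools, setting `HF_HUB_OFFLINE=1` and `TRANSFORMERS_OFFLINE=1` in the environment forces loading from the local cache and turns any accidental download attempt into an immediate, visible error.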
Package Managers and Dependency Resolution
`pip install`, `npm install`, `cargo build` — all of these reach out to external registries. In an air-gapped environment, they fail.
Workaround: Build and test your complete environment on a connected system, then transfer it as a pre-built package (Docker image, virtual environment archive, or installer bundle). On the air-gapped system, install from the local package.
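For Python, the standard pip workflow for this looks roughly like the following (directory names are illustrative):

```shell
# Connected machine: download every wheel, including transitive dependencies.
python -m pip download -r requirements.lock -d ./wheelhouse

# Air-gapped machine, after transfer via approved media: install with
# registry access explicitly disabled.
python -m pip install --no-index --find-links ./wheelhouse -r requirements.lock
```

Note that `pip download` fetches wheels for the machine it runs on; if the connected and air-gapped systems differ in OS or architecture, pass `--platform` together with `--only-binary=:all:` to target the deployment environment.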
Pre-Deployment Checklist
Before deploying any data preparation pipeline to an air-gapped environment, verify the following:
Software Bundle
- All application binaries included and tested
- All model weights pre-loaded (NER, OCR, embedding, LLM if used)
- All Python/Node/Rust dependencies bundled (no network resolution required)
- License keys configured for offline operation
- Auto-update mechanisms disabled
- Telemetry and analytics disabled
- All configuration files pre-set for local-only operation
Infrastructure
- No Docker registry pulls required at runtime (images pre-loaded or not using Docker)
- No Kubernetes cluster required (unless the facility provides one)
- Database runs locally (SQLite, local PostgreSQL, or embedded)
- No external API calls in any code path (including error reporting, crash analytics)
- File paths configured for the target system (no hardcoded cloud storage paths)
Verification
- Full pipeline tested end-to-end with network cable physically disconnected
- All file imports/exports tested with local file system only
- All model inference tested with pre-loaded weights
- Audit logging verified to write to local storage
- Export functions verified to produce local files (no cloud upload paths)
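Before the physical disconnect test, the "no network calls in any code path" check can be approximated in software. A minimal sketch (the `NetworkGuard` name is illustrative) monkeypatches `socket.socket.connect` so any outbound connection attempt raises immediately, which surfaces hidden telemetry or lazy downloads during a dry run:

```python
import socket

class NetworkGuard:
    """Context manager that makes any outbound socket connection raise,
    approximating a disconnected network cable during pipeline testing."""

    def __enter__(self):
        self._original_connect = socket.socket.connect

        def blocked_connect(sock, address):
            raise RuntimeError(f"Blocked network call to {address}")

        socket.socket.connect = blocked_connect
        return self

    def __exit__(self, exc_type, exc, tb):
        socket.socket.connect = self._original_connect
        return False
```

Run the full pipeline inside `with NetworkGuard():` on the connected staging system; any stack trace it produces points at a dependency that would have failed silently in the air-gapped facility. This does not replace the physical disconnect test, but it catches problems earlier and cheaper.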
The Native Desktop Advantage
In classified and air-gapped environments, infrastructure is constrained. You may not have access to a Kubernetes cluster, a Docker runtime, or even administrator privileges to install system packages. The workstation may be a locked-down Windows machine with a standard government image.
This is where application architecture matters. Tools that require Docker, Kubernetes, or complex server infrastructure are difficult to deploy in these environments. Tools that run as native desktop applications — installed from a single binary with no external dependencies — are dramatically easier.
The difference in practice:
| Requirement | Web App (Docker/K8s) | Native Desktop App |
|---|---|---|
| Installation complexity | High (container runtime, orchestration, networking) | Low (single installer) |
| Admin privileges required | Usually yes | Often no |
| Infrastructure dependencies | Docker daemon, orchestrator, load balancer | None |
| Port/network configuration | Required (even for local) | Not required |
| Deployment on locked-down workstations | Difficult | Straightforward |
| Offline operation | Requires pre-pulled images | Built-in |
For government and defense work, native desktop applications eliminate an entire category of deployment problems.
Data Transfer: Getting Data In and Out
In air-gapped environments, data moves via approved physical media. The specifics depend on the facility's security procedures, but common mechanisms include:
Removable Media
USB drives, external hard drives, or optical media that have been approved by the facility's security office. Data transferred to the air-gapped system must be scanned and approved. Data transferred out must go through a review process.
Cross-Domain Solutions (CDS)
Hardware devices that mediate data transfer between networks of different classification levels. These enforce content inspection, data format restrictions, and security policy. Transfers through a CDS are logged and auditable.
Sneakernet Implications for Your Pipeline
Your pipeline must support import and export via file system paths, not network endpoints. "Upload from URL" features are useless. "Connect to S3 bucket" is irrelevant. The pipeline must read from and write to local directories, with clear file naming and manifest documentation so the data transfer process can be audited.
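A manifest that inventories every exported file with its checksum gives the security review a concrete artifact to work from. A minimal sketch, assuming a flat or nested export directory (the `write_manifest` name and JSON layout are illustrative, not a government standard):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def write_manifest(export_dir: Path) -> Path:
    """Write manifest.json listing every exported file with size and SHA-256,
    so the media transfer review can verify the package is complete and intact."""
    entries = []
    for f in sorted(export_dir.rglob("*")):
        if f.is_file() and f.name != "manifest.json":
            entries.append({
                "path": str(f.relative_to(export_dir)),
                "bytes": f.stat().st_size,
                "sha256": hashlib.sha256(f.read_bytes()).hexdigest(),
            })
    manifest = {
        "generated": datetime.now(timezone.utc).isoformat(),
        "files": entries,
    }
    out = export_dir / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2))
    return out
```

The same manifest can be re-verified on the receiving side, which matters when data crosses a CDS or moves on removable media that passes through multiple hands.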
Export formats must be self-contained. A training dataset export that references external files, requires network resolution, or depends on a running server is unusable in this context.
NIST and FedRAMP Considerations
NIST SP 800-171
For CUI (Controlled Unclassified Information), NIST SP 800-171 specifies 110 security requirements across 14 families. Relevant to data preparation:
- Access Control (AC): Limit system access to authorized users. Enforce least privilege. Log access events.
- Audit and Accountability (AU): Create, protect, and retain audit records. Ensure individual accountability.
- Configuration Management (CM): Establish and enforce security configuration settings. Track changes.
- System and Information Integrity (SI): Monitor systems and take action on detected flaws.
Your data preparation tools must support these requirements: user authentication, audit logging, configuration management, and integrity verification.
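For the AU family, an append-only, tamper-evident log is the core primitive. A minimal sketch of a hash-chained JSON-lines audit log, where each record commits to the hash of the previous one so after-the-fact edits are detectable (the function name and record fields are illustrative):

```python
import hashlib
import json
import time

def append_audit_event(log_path: str, user: str, action: str, detail: dict) -> str:
    """Append a hash-chained audit record; returns the new record's hash."""
    prev_hash = "0" * 64  # genesis value for an empty log
    try:
        with open(log_path, "rb") as f:
            lines = f.read().splitlines()
        if lines:
            prev_hash = json.loads(lines[-1])["hash"]
    except FileNotFoundError:
        pass
    record = {"ts": time.time(), "user": user, "action": action,
              "detail": detail, "prev": prev_hash}
    # Hash the record (minus its own hash field) with stable key ordering.
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    with open(log_path, "a") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record["hash"]
```

A verifier walks the file and recomputes each hash against the stored `prev` value; any inserted, deleted, or altered line breaks the chain from that point forward.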
FedRAMP
If your tools are cloud-based and being used for federal work, they must be FedRAMP authorized. In an air-gapped environment, FedRAMP is less relevant because you are not using cloud services. But if any part of your pipeline runs on a government cloud (GovCloud, milCloud), FedRAMP authorization applies.
CMMC (Cybersecurity Maturity Model Certification)
For defense contractors, CMMC certification may be required. CMMC Level 2 aligns with NIST SP 800-171. Your data preparation processes must be documented and auditable to support CMMC assessment.
Practical Architecture for Air-Gapped Data Prep
Recommended Stack
- Document parsing: Docling (local) or Tesseract + layout detection model (pre-loaded)
- Text cleaning: Python scripts with all dependencies bundled in a virtual environment
- PII/PHI redaction: Local NER model (spaCy or fine-tuned BERT, weights pre-loaded) + regex patterns
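The regex half of that redaction step can be sketched as follows. These patterns are illustrative US-format examples only; a production pipeline pairs them with a local NER model for names, addresses, and context-dependent identifiers the patterns cannot catch:

```python
import re

# Illustrative patterns for common US-format PII. Real deployments need
# broader coverage (international formats, MRNs, case numbers, etc.).
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each PII match with a labeled placeholder, preserving structure."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Everything here runs on the Python standard library, which is exactly the property that matters in this environment: no model download, no API call, no failure mode when the network does not exist.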
- Labeling: Native desktop application with local database and audit logging
- Augmentation: Local LLM (Llama 3.1 8B or similar, weights pre-loaded) or rule-based methods
- Export: Local file output with manifest and lineage documentation
What to Avoid
- Any tool that requires a network call at any point in its operation
- Docker-based deployments (unless the facility explicitly supports Docker)
- Python packages that lazy-load model weights from Hugging Face at runtime
- Tools with embedded analytics or telemetry
- Cloud-first platforms with "offline mode" that has not been thoroughly tested
Ertas Data Suite is built as a native desktop application using Tauri 2.0 (Rust + React). It operates fully offline with no internet dependency at any stage. All five modules (Ingest → Clean → Label → Augment → Export) run locally with pre-bundled dependencies. There is no license phone-home, no telemetry, no cloud API calls. It installs from a single binary, runs without Docker or Kubernetes, and produces exportable audit trails and training datasets as local files — making it deployable in air-gapped government environments without infrastructure modification.
Conclusion
Air-gapped data preparation is not a modified version of cloud data preparation. It is a fundamentally different operational environment with constraints that eliminate most of the modern AI toolchain. The service providers who succeed in government and defense AI work are the ones who plan for these constraints from the beginning — pre-bundling dependencies, testing offline, deploying native applications, and building export workflows that produce self-contained deliverables.
The market opportunity is substantial and growing. Government AI spending is increasing, and the compliance bar is a moat that keeps out providers who have not invested in the infrastructure to meet it.