    AI Data Preparation for Government Agencies: Security Classifications and Air-Gapped Requirements
    government · defense · data-preparation · air-gapped · security-classification · on-premises


    How government and defense agencies can prepare classified and sensitive data for AI model training in air-gapped environments — covering CMMC, FedRAMP, ITAR, and security classification handling.

    Ertas Team

    Government and defense agencies are adopting AI for document analysis, intelligence processing, logistics optimization, and decision support. The training data for these models comes from government document archives — much of it classified, sensitive, or subject to strict handling requirements that make cloud-based data preparation impossible.

    Preparing government data for AI requires tools and processes that operate within the security constraints of classified environments. This guide covers the unique challenges and requirements.

    The Government Data Landscape

    Classified Documents

    • Confidential, Secret, Top Secret: Documents with formal security classifications that dictate handling, storage, and processing requirements
    • Compartmented information (SCI): Intelligence data restricted to specific programs and clearance levels
    • Special Access Programs (SAP): Restricted information requiring additional access beyond clearance level

    Controlled Unclassified Information (CUI)

    • Government data that isn't classified but requires safeguarding: law enforcement sensitive, privacy-protected, export-controlled
    • The CUI Registry defines 100+ categories of sensitive-but-unclassified data, grouped into roughly 20 organizational index groupings

    Publicly Available Government Data

    • Open data portals, FOIA releases, public reports
    • Still requires careful handling — aggregation of public data can reveal classified patterns

    Why Government Data Prep Is Different

    Security Classification Handling

    Every document, every extracted data point, and every training example inherits the security classification of its source. A training dataset derived from Secret documents is itself Secret. The data preparation pipeline must:

    • Track classification levels through every transformation
    • Ensure the processing environment meets the classification level's requirements
    • Prevent inadvertent classification spillage (processing Secret data on an Unclassified system)
    • Maintain derivative classification markings
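    Propagating a derivative classification is, at its core, a max over the sources plus a guard against spillage. The sketch below is illustrative, not Ertas's implementation; the `LEVELS` ordering, `Record`, and `check_environment` names are assumptions for the example:

```python
from dataclasses import dataclass

# Assumed ordering of US classification levels, lowest to highest.
LEVELS = ["UNCLASSIFIED", "CONFIDENTIAL", "SECRET", "TOP SECRET"]

@dataclass(frozen=True)
class Record:
    text: str
    classification: str  # one of LEVELS

def derived_classification(sources: list[Record]) -> str:
    """A derived artifact inherits the highest classification of its sources."""
    return max((r.classification for r in sources), key=LEVELS.index)

def check_environment(record_level: str, system_accreditation: str) -> None:
    """Spillage guard: refuse to process data above the system's accredited level."""
    if LEVELS.index(record_level) > LEVELS.index(system_accreditation):
        raise PermissionError(
            f"{record_level} data cannot be processed on a "
            f"{system_accreditation}-accredited system"
        )
```

    Real pipelines also carry compartment and caveat markings, but the same principle applies: the derived marking is never lower than any source.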

    Air-Gapped Operation

    Classified networks (SIPRNet, JWICS) are physically isolated from the internet. Data preparation tools that require cloud connectivity, license servers, telemetry, or update checks are disqualified. The tool must:

    • Install and operate with zero internet connectivity
    • Include all dependencies in the installation package
    • Function without phoning home for licensing
    • Update through physical media or secure transfer, not automatic updates

    Personnel Security

    Only cleared personnel can access classified data. The data preparation tool must support:

    • User authentication tied to the facility's identity management
    • Role-based access control (different analysts may have different compartment access)
    • Audit logging of every user action (who accessed what, when)
    • Session management (automatic lockout, screen protection)
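    An audit record that security reviewers will trust needs identity, timestamp, action, and tamper evidence. One common approach, sketched here as an assumption rather than a description of any specific tool, is to hash-chain entries so that altering history is detectable:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_entry(user: str, clearance: str, action: str, resource: str,
                prev_hash: str = "0" * 64) -> dict:
    """Build one append-only audit record. Each entry embeds the hash of the
    previous one, so deleting or editing an earlier record breaks the chain."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "clearance": clearance,
        "action": action,
        "resource": resource,
        "prev_hash": prev_hash,
    }
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["hash"] = hashlib.sha256(payload).hexdigest()
    return entry
```

    During a security review, the chain is verified by recomputing each hash and checking it against the next entry's `prev_hash`.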

    Facility Requirements

    Classified data processing must occur in accredited facilities:

    • SCIFs (Sensitive Compartmented Information Facilities) for SCI data
    • Accredited IS (Information Systems) for classified processing
    • Physical security controls (access control, surveillance, RF shielding)

    Compliance Frameworks

    CMMC (Cybersecurity Maturity Model Certification)

    Required for Defense Industrial Base (DIB) contractors. CMMC levels define the cybersecurity practices required for handling Federal Contract Information (FCI) and CUI. Data preparation tools used by DIB contractors must operate within CMMC-compliant environments.

    FedRAMP

    Federal Risk and Authorization Management Program. Cloud services used by federal agencies must be FedRAMP authorized. However, for classified data preparation, cloud services are generally not an option — air-gapped on-premise processing is the standard.

    ITAR (International Traffic in Arms Regulations)

    Technical data related to defense articles is ITAR-controlled. AI training data derived from ITAR-controlled documents inherits those restrictions:

    • Cannot be shared with foreign nationals
    • Cannot be processed on systems accessible to non-US persons
    • Export requires State Department authorization

    NIST 800-171/172

    Security requirements for protecting CUI in non-federal systems. NIST SP 800-171 defines 110 security requirements covering access control, audit and accountability, incident response, and system integrity; SP 800-172 adds enhanced requirements for CUI in critical programs.

    The Data Preparation Pipeline for Government

    Stage 1: Ingestion

    • Document parsing in an air-gapped environment (no cloud OCR services)
    • Local OCR with government-approved engines
    • Classification marking detection and preservation
    • Multi-format handling (PDFs, emails, images, signals intelligence formats)
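    Classification banners conventionally appear at the top and bottom of each page, which makes detection tractable with pattern matching over header and footer lines. A minimal sketch, assuming simplified banner patterns (real pipelines follow the full DoD marking conventions, including portion marks and caveats):

```python
import re

# Assumed, simplified patterns for common US banner markings.
MARKING_RE = re.compile(
    r"\b(TOP SECRET(?://[A-Z/ ]+)?|SECRET|CONFIDENTIAL|UNCLASSIFIED|CUI)\b"
)

def detect_markings(page_text: str) -> list[str]:
    """Scan the header and footer lines of a parsed page for banner markings."""
    lines = page_text.strip().splitlines()
    candidates = lines[:2] + lines[-2:]  # banners sit at top and bottom
    found = []
    for line in candidates:
        m = MARKING_RE.search(line.upper())
        if m:
            found.append(m.group(1).strip())
    return found
```

    Detected markings are preserved as metadata on every extracted record, feeding the classification tracking described above.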

    Stage 2: Cleaning

    • Redaction of classification markings for training data (preventing the model from learning to reproduce classified markings)
    • Cross-domain transfer review (ensuring data doesn't move between classification levels without authorization)
    • Quality scoring using local models (no cloud API calls)
    • Deduplication within classification boundaries
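    Deduplication within classification boundaries means duplicate detection never compares content across security domains, since even a hash comparison across levels leaks the fact that the same document exists in both. A minimal sketch of that constraint (field names are illustrative):

```python
import hashlib
from collections import defaultdict

def dedup_by_boundary(records: list[dict]) -> list[dict]:
    """Deduplicate documents, comparing hashes only within the same
    classification level. Cross-level comparison would itself move
    information between security domains."""
    seen: dict[str, set[str]] = defaultdict(set)
    unique = []
    for rec in records:
        digest = hashlib.sha256(rec["text"].encode()).hexdigest()
        level = rec["classification"]
        if digest not in seen[level]:
            seen[level].add(digest)
            unique.append(rec)
    return unique
```

    Note the consequence: the same text appearing at two different levels is kept twice, by design.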

    Stage 3: Labeling

    • Cleared analysts label data within their authorized access level
    • Multi-level labeling workflows (different analysts label different portions based on clearance)
    • Audit trail for every label decision (who, when, what clearance level)
    • Quality review by senior analysts
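    The multi-level workflow above reduces to an access check: an analyst's labeling queue contains only items at or below their clearance and within compartments they hold. A sketch under assumed field names (not any specific tool's schema):

```python
# Assumed ordering of classification levels, lowest to highest.
LEVELS = ["UNCLASSIFIED", "CONFIDENTIAL", "SECRET", "TOP SECRET"]

def labeling_queue(items: list[dict], analyst_clearance: str,
                   analyst_compartments: set[str]) -> list[dict]:
    """Return only the items an analyst is authorized to label: at or below
    their clearance level, and within compartments they have access to."""
    limit = LEVELS.index(analyst_clearance)
    return [
        item for item in items
        if LEVELS.index(item["classification"]) <= limit
        and set(item.get("compartments", [])) <= analyst_compartments
    ]
```

    Every label decision on an item from this queue is then written to the audit trail with the analyst's identity and clearance level.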

    Stage 4: Export

    • Training datasets with inherited classification markings
    • JSONL/structured formats for NLP models
    • Documentation package for ATO (Authority to Operate) review
    • Audit trail export for security review
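    Inherited markings have to survive into the exported artifact itself, not just its filename. One illustrative approach, stamping both a banner record and each training example with the dataset's derivative classification (the JSONL schema here is an assumption, not a standard):

```python
import io
import json

def export_jsonl(examples: list[dict], dataset_level: str) -> str:
    """Serialize training examples as JSONL, stamping a leading banner record
    and every example line with the dataset's derivative classification."""
    buf = io.StringIO()
    buf.write(json.dumps({"type": "banner", "classification": dataset_level}) + "\n")
    for ex in examples:
        row = {
            "classification": dataset_level,
            "prompt": ex["prompt"],
            "completion": ex["completion"],
        }
        buf.write(json.dumps(row) + "\n")
    return buf.getvalue()
```

    Per-line stamping means that even a partial extract of the file carries its handling requirements with it.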

    Tool Selection Criteria for Government

    When evaluating data preparation tools for government use:

    1. True air-gapped operation: Does it work with zero network connectivity? No license servers, no telemetry, no update checks?
    2. Native desktop application: Docker containers in classified environments add complexity. A native app installs like any other approved software.
    3. Complete audit trail: Every action logged with user identity, timestamp, and action details — required for security reviews.
    4. Local AI capabilities: AI-assisted labeling and quality scoring must use local models (Ollama/llama.cpp), not cloud APIs.
    5. ATO documentation support: Can the tool produce the security documentation needed for Authority to Operate approval?

    Ertas Data Suite meets these criteria as a native desktop application built with Tauri (Rust + React) that operates fully air-gapped. Local LLM inference via Ollama/llama.cpp provides AI-assisted features without data egress. The complete audit trail supports ATO documentation requirements.

    For government agencies, data preparation isn't just a technical challenge — it's a security challenge. The tools must be as secure as the data they process.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
