    AI Data Preparation for Government Agencies: Security Classifications and Air-Gapped Requirements
    government · defense · data-preparation · air-gapped · security-classification · on-premises


    How government and defense agencies can prepare classified and sensitive data for AI model training in air-gapped environments — covering CMMC, FedRAMP, ITAR, and security classification handling.

    Ertas Team

    Government and defense agencies are adopting AI for document analysis, intelligence processing, logistics optimization, and decision support. The training data for these models comes from government document archives — much of it classified, sensitive, or subject to strict handling requirements that make cloud-based data preparation impossible.

    Preparing government data for AI requires tools and processes that operate within the security constraints of classified environments. This guide covers the unique challenges and requirements.

    The Government Data Landscape

    Classified Documents

    • Confidential, Secret, Top Secret: Documents with formal security classifications that dictate handling, storage, and processing requirements
    • Compartmented information (SCI): Intelligence data restricted to specific programs and clearance levels
    • Special Access Programs (SAP): Restricted information requiring additional access beyond clearance level

    Controlled Unclassified Information (CUI)

    • Government data that isn't classified but requires safeguarding: law enforcement sensitive, privacy-protected, export-controlled
    • The CUI Registry defines 100+ categories of sensitive-but-unclassified data, grouped into roughly 20 organizational index groupings

    Publicly Available Government Data

    • Open data portals, FOIA releases, public reports
    • Still requires careful handling — aggregation of public data can reveal classified patterns

    Why Government Data Prep Is Different

    Security Classification Handling

    Every document, every extracted data point, and every training example inherits the security classification of its source. A training dataset derived from Secret documents is itself Secret. The data preparation pipeline must:

    • Track classification levels through every transformation
    • Ensure the processing environment meets the classification level's requirements
    • Prevent inadvertent classification spillage (processing Secret data on an Unclassified system)
    • Maintain derivative classification markings
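    Propagating a derivative classification is, at its core, a max over the sources plus a guard against spillage. The sketch below is illustrative, not Ertas's implementation; the `LEVELS` ordering, `Record`, and `check_environment` names are assumptions for the example:

```python
from dataclasses import dataclass

# Assumed ordering of US classification levels, lowest to highest.
LEVELS = ["UNCLASSIFIED", "CONFIDENTIAL", "SECRET", "TOP SECRET"]

@dataclass(frozen=True)
class Record:
    text: str
    classification: str  # one of LEVELS

def derived_classification(sources: list[Record]) -> str:
    """A derived artifact inherits the highest classification of its sources."""
    return max((r.classification for r in sources), key=LEVELS.index)

def check_environment(record_level: str, system_accreditation: str) -> None:
    """Spillage guard: refuse to process data above the system's accredited level."""
    if LEVELS.index(record_level) > LEVELS.index(system_accreditation):
        raise PermissionError(
            f"{record_level} data cannot be processed on a "
            f"{system_accreditation}-accredited system"
        )
```

    Real pipelines also carry compartment and caveat markings, but the same principle applies: the derived marking is never lower than any source.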

    Air-Gapped Operation

    Classified networks (SIPRNet, JWICS) are physically isolated from the internet. Data preparation tools that require cloud connectivity, license servers, telemetry, or update checks are disqualified. The tool must:

    • Install and operate with zero internet connectivity
    • Include all dependencies in the installation package
    • Function without phoning home for licensing
    • Update through physical media or secure transfer, not automatic updates

    Personnel Security

    Only cleared personnel can access classified data. The data preparation tool must support:

    • User authentication tied to the facility's identity management
    • Role-based access control (different analysts may have different compartment access)
    • Audit logging of every user action (who accessed what, when)
    • Session management (automatic lockout, screen protection)
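    An audit record that security reviewers will trust needs identity, timestamp, action, and tamper evidence. One common approach, sketched here as an assumption rather than a description of any specific tool, is to hash-chain entries so that altering history is detectable:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_entry(user: str, clearance: str, action: str, resource: str,
                prev_hash: str = "0" * 64) -> dict:
    """Build one append-only audit record. Each entry embeds the hash of the
    previous one, so deleting or editing an earlier record breaks the chain."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "clearance": clearance,
        "action": action,
        "resource": resource,
        "prev_hash": prev_hash,
    }
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["hash"] = hashlib.sha256(payload).hexdigest()
    return entry
```

    During a security review, the chain is verified by recomputing each hash and checking it against the next entry's `prev_hash`.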

    Facility Requirements

    Classified data processing must occur in accredited facilities:

    • SCIFs (Sensitive Compartmented Information Facilities) for SCI data
    • Accredited IS (Information Systems) for classified processing
    • Physical security controls (access control, surveillance, RF shielding)

    Compliance Frameworks

    CMMC (Cybersecurity Maturity Model Certification)

    Required for Defense Industrial Base (DIB) contractors. CMMC levels define the cybersecurity practices required for handling Federal Contract Information (FCI) and CUI. Data preparation tools used by DIB contractors must operate within CMMC-compliant environments.

    FedRAMP

    Federal Risk and Authorization Management Program. Cloud services used by federal agencies must be FedRAMP authorized. However, for classified data preparation, cloud services are generally not an option — air-gapped on-premise processing is the standard.

    ITAR (International Traffic in Arms Regulations)

    Technical data related to defense articles is ITAR-controlled. AI training data derived from ITAR-controlled documents inherits those restrictions:

    • Cannot be shared with foreign nationals
    • Cannot be processed on systems accessible to non-US persons
    • Export requires State Department authorization

    NIST 800-171/172

    Security requirements for protecting CUI in non-federal systems. NIST SP 800-171 defines 110 security requirements covering access control, audit and accountability, incident response, and system integrity; SP 800-172 adds enhanced requirements for CUI in critical programs.

    The Data Preparation Pipeline for Government

    Stage 1: Ingestion

    • Document parsing in an air-gapped environment (no cloud OCR services)
    • Local OCR with government-approved engines
    • Classification marking detection and preservation
    • Multi-format handling (PDFs, emails, images, signals intelligence formats)
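    Classification banners conventionally appear at the top and bottom of each page, which makes detection tractable with pattern matching over header and footer lines. A minimal sketch, assuming simplified banner patterns (real pipelines follow the full DoD marking conventions, including portion marks and caveats):

```python
import re

# Assumed, simplified patterns for common US banner markings.
MARKING_RE = re.compile(
    r"\b(TOP SECRET(?://[A-Z/ ]+)?|SECRET|CONFIDENTIAL|UNCLASSIFIED|CUI)\b"
)

def detect_markings(page_text: str) -> list[str]:
    """Scan the header and footer lines of a parsed page for banner markings."""
    lines = page_text.strip().splitlines()
    candidates = lines[:2] + lines[-2:]  # banners sit at top and bottom
    found = []
    for line in candidates:
        m = MARKING_RE.search(line.upper())
        if m:
            found.append(m.group(1).strip())
    return found
```

    Detected markings are preserved as metadata on every extracted record, feeding the classification tracking described above.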

    Stage 2: Cleaning

    • Redaction of classification markings for training data (preventing the model from learning to reproduce classified markings)
    • Cross-domain transfer review (ensuring data doesn't move between classification levels without authorization)
    • Quality scoring using local models (no cloud API calls)
    • Deduplication within classification boundaries
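    Deduplication within classification boundaries means duplicate detection never compares content across security domains, since even a hash comparison across levels leaks the fact that the same document exists in both. A minimal sketch of that constraint (field names are illustrative):

```python
import hashlib
from collections import defaultdict

def dedup_by_boundary(records: list[dict]) -> list[dict]:
    """Deduplicate documents, comparing hashes only within the same
    classification level. Cross-level comparison would itself move
    information between security domains."""
    seen: dict[str, set[str]] = defaultdict(set)
    unique = []
    for rec in records:
        digest = hashlib.sha256(rec["text"].encode()).hexdigest()
        level = rec["classification"]
        if digest not in seen[level]:
            seen[level].add(digest)
            unique.append(rec)
    return unique
```

    Note the consequence: the same text appearing at two different levels is kept twice, by design.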

    Stage 3: Labeling

    • Cleared analysts label data within their authorized access level
    • Multi-level labeling workflows (different analysts label different portions based on clearance)
    • Audit trail for every label decision (who, when, what clearance level)
    • Quality review by senior analysts
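    The multi-level workflow above reduces to an access check: an analyst's labeling queue contains only items at or below their clearance and within compartments they hold. A sketch under assumed field names (not any specific tool's schema):

```python
# Assumed ordering of classification levels, lowest to highest.
LEVELS = ["UNCLASSIFIED", "CONFIDENTIAL", "SECRET", "TOP SECRET"]

def labeling_queue(items: list[dict], analyst_clearance: str,
                   analyst_compartments: set[str]) -> list[dict]:
    """Return only the items an analyst is authorized to label: at or below
    their clearance level, and within compartments they have access to."""
    limit = LEVELS.index(analyst_clearance)
    return [
        item for item in items
        if LEVELS.index(item["classification"]) <= limit
        and set(item.get("compartments", [])) <= analyst_compartments
    ]
```

    Every label decision on an item from this queue is then written to the audit trail with the analyst's identity and clearance level.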

    Stage 4: Export

    • Training datasets with inherited classification markings
    • JSONL/structured formats for NLP models
    • Documentation package for ATO (Authority to Operate) review
    • Audit trail export for security review
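    Inherited markings have to survive into the exported artifact itself, not just its filename. One illustrative approach, stamping both a banner record and each training example with the dataset's derivative classification (the JSONL schema here is an assumption, not a standard):

```python
import io
import json

def export_jsonl(examples: list[dict], dataset_level: str) -> str:
    """Serialize training examples as JSONL, stamping a leading banner record
    and every example line with the dataset's derivative classification."""
    buf = io.StringIO()
    buf.write(json.dumps({"type": "banner", "classification": dataset_level}) + "\n")
    for ex in examples:
        row = {
            "classification": dataset_level,
            "prompt": ex["prompt"],
            "completion": ex["completion"],
        }
        buf.write(json.dumps(row) + "\n")
    return buf.getvalue()
```

    Per-line stamping means that even a partial extract of the file carries its handling requirements with it.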

    Tool Selection Criteria for Government

    When evaluating data preparation tools for government use:

    1. True air-gapped operation: Does it work with zero network connectivity? No license servers, no telemetry, no update checks?
    2. Native desktop application: Docker containers in classified environments add complexity. A native app installs like any other approved software.
    3. Complete audit trail: Every action logged with user identity, timestamp, and action details — required for security reviews.
    4. Local AI capabilities: AI-assisted labeling and quality scoring must use local models (Ollama/llama.cpp), not cloud APIs.
    5. ATO documentation support: Can the tool produce the security documentation needed for Authority to Operate approval?

    Ertas Data Suite meets these criteria as a native desktop application built with Tauri (Rust + React) that operates fully air-gapped. Local LLM inference via Ollama/llama.cpp provides AI-assisted features without data egress. The complete audit trail supports ATO documentation requirements.

    For government agencies, data preparation isn't just a technical challenge — it's a security challenge. The tools must be as secure as the data they process.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
