
Processing Classified Documents for NLP in Air-Gapped Environments
Architecture and operational guide for preparing classified documents as NLP training data in completely air-gapped environments. Covers security requirements, approved workflow patterns, air-gap verification, and pipeline design for sensitive document processing.
Building NLP models from classified documents is a problem that most AI tooling was never designed to solve. Commercial data preparation platforms assume network connectivity — for updates, for cloud storage, for telemetry. Classified environments assume the opposite: the machine processing the data must have zero network connectivity, verified and auditable.
This creates a fundamental tooling gap. Organizations processing classified documents for NLP training data need a pipeline that handles document parsing, text extraction, cleaning, annotation, and export — all running on a single machine with no network stack, no outbound connections, and no hidden dependencies that phone home.
This playbook covers the architecture, security requirements, and workflow patterns for preparing classified documents as NLP training data in air-gapped environments.
Security Classification Levels and Data Handling
Different classification levels impose different physical and operational constraints on the data processing environment.
| Classification | Physical Requirements | Personnel | Data Transfer |
|---|---|---|---|
| CUI (Controlled Unclassified) | Locked room, access-controlled workstation | Cleared personnel, need-to-know basis | Encrypted removable media with logging |
| Secret | SCIF or equivalent, TEMPEST-rated equipment | Active Secret clearance | Cross-domain solution (CDS) or manual review + sneakernet |
| Top Secret / SCI | SCIF, full TEMPEST compliance, RF shielding | Active TS/SCI clearance | CDS with multi-person approval, or physical media under escort |
The data pipeline platform must operate at the highest classification level of any document it processes. If a single Top Secret document enters the pipeline, the entire workstation is treated as a Top Secret system.
Air-Gap Architecture Requirements
An air-gapped environment is not simply a machine with WiFi turned off. True air-gap isolation requires verified physical separation from all networks.
Hardware Configuration
The processing workstation must meet the following baseline:
- No wireless hardware. WiFi cards, Bluetooth modules, and cellular modems must be physically removed (not just disabled in software). BIOS-level disablement is insufficient for higher classification levels.
- No network interface. Ethernet ports should be physically disconnected or the NIC removed entirely. For TEMPEST environments, the system should have no network hardware installed.
- USB port control. Only authorized, inventoried removable media devices may connect. USB ports not needed for authorized peripherals should be physically blocked or epoxied.
- Audio/visual isolation. Microphones and cameras must be physically removed. Speaker output should be disabled at the hardware level for TEMPEST environments.
- BIOS hardening. Boot order locked to internal drive only. BIOS password set. Secure boot enabled where supported.
Software Stack
The processing environment must be self-contained with zero external dependencies at runtime.
| Component | Requirement | Why |
|---|---|---|
| Operating System | Hardened Linux (SELinux enforcing) or approved Windows with STIG applied | Minimize attack surface, enforce mandatory access controls |
| Data pipeline tool | Native application, no container runtime, no package manager calls | Containers may attempt registry pulls; package managers need network |
| ML/NLP libraries | Pre-installed, version-locked, integrity-verified | No pip install, no npm install, no runtime downloads |
| Document parsers | Bundled with application, no external service calls | PDF parsing must not call Adobe APIs or cloud OCR |
| Embedding models | Local model files, no API calls | Embedding generation must run entirely on-device |
Ertas Data Suite meets these requirements as a native desktop application built on Tauri (Rust + React). It bundles all parsers, processing logic, and UI into a single installable binary. No Docker, no container runtime, no network services. At runtime, it opens no listening ports and makes no outbound connections.
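Ertas bundles its own stack, but sites that also maintain a general-purpose Python ML environment on the same workstation need a way to prove it matches the accredited baseline. A minimal sketch, assuming a requirements-style pinned manifest; the filename and format here are illustrative:

```python
# Verify the installed Python NLP stack against a pinned, approved manifest.
# Hypothetical sketch: "approved_manifest.txt" is an illustrative name, not
# part of any specific accreditation baseline.
from importlib import metadata

def load_manifest(path: str) -> dict[str, str]:
    """Parse 'name==version' lines into a dict, skipping comments."""
    pins = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                name, _, version = line.partition("==")
                pins[name.lower()] = version
    return pins

def check_environment(manifest_path: str) -> list[str]:
    """Return a list of deviations from the pinned manifest."""
    pins = load_manifest(manifest_path)
    problems = []
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"MISSING  {name}=={pinned}")
            continue
        if installed != pinned:
            problems.append(f"DRIFT    {name}: pinned {pinned}, installed {installed}")
    return problems

if __name__ == "__main__":
    for issue in check_environment("approved_manifest.txt") or ["OK: environment matches manifest"]:
        print(issue)
```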
Air-Gap Verification Checklist
Before processing classified documents, the air-gap must be verified. This checklist should be completed by the system administrator and reviewed by the security officer.
| Check | Method | Pass Criteria |
|---|---|---|
| No network hardware present | Physical inspection + lspci/lsusb audit | Zero network controllers listed |
| No wireless radios | Physical inspection of motherboard, expansion slots | All wireless modules physically removed |
| USB ports controlled | Physical inspection | Unauthorized ports blocked; authorized ports inventoried |
| No outbound connection capability | Attempt ping, DNS lookup, curl from terminal | All fail with "network unreachable" (not timeout) |
| No listening services | ss -tulnp or netstat equivalent | Zero listening ports |
| Application integrity | SHA-256 hash of installed application matches known-good hash | Hash match confirmed |
| OS hardening applied | STIG compliance scan or equivalent | All applicable controls pass |
| Audit logging active | Verify syslog/auditd running and writing to local storage | Log entries being generated |
This verification must be repeated after any hardware change, software update, or maintenance event. Document each verification with date, operator, and security officer sign-off.
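The network-related rows of the checklist lend themselves to a scripted pre-flight check that can be re-run after every maintenance event. A stdlib-only sketch for a Linux workstation (the platform and the /proc paths are assumptions; it supplements physical inspection, never replaces it):

```python
# Automated pre-flight sketch for the network rows of the air-gap checklist.
# Linux-only, stdlib-only. Exit code 1 on any failure.
import os
import socket
import sys

def network_interfaces() -> list[str]:
    """Non-loopback interfaces known to the kernel (should be empty)."""
    return [i for i in os.listdir("/sys/class/net") if i != "lo"]

def dns_resolves() -> bool:
    """DNS lookup must fail immediately on a true air gap, not hang."""
    try:
        socket.getaddrinfo("example.com", 80)
        return True
    except OSError:  # covers socket.gaierror
        return False

def listening_sockets() -> list[str]:
    """TCP sockets in LISTEN state (hex code 0A) from /proc/net/tcp[6]."""
    listeners = []
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip header row
                for line in f:
                    fields = line.split()
                    if fields[3] == "0A":
                        listeners.append(fields[1])  # local_address:port (hex)
        except FileNotFoundError:
            pass
    return listeners

if __name__ == "__main__":
    failures = []
    ifaces = network_interfaces()
    if ifaces:
        failures.append(f"interfaces present: {ifaces}")
    if dns_resolves():
        failures.append("DNS resolution succeeded")
    listeners = listening_sockets()
    if listeners:
        failures.append(f"listening sockets: {listeners}")
    print("AIR-GAP CHECK " + ("FAILED: " + "; ".join(failures) if failures else "PASSED"))
    sys.exit(1 if failures else 0)
```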
Approved Workflow Patterns
Pattern 1: Document-to-Training-Data Pipeline
This is the primary workflow — converting a corpus of classified documents into structured NLP training data.
```
Authorized Media Import
        |
File Import (PDF, Word, scanned images)
        |
Document Parsing (text extraction, layout analysis)
        |
Cleaning (deduplication, format normalization)
        |
PII/Classification Marking Redaction
        |
Quality Scoring
        |
Annotation (NER, classification labels, Q&A pairs)
        |
Train/Val/Test Split
        |
JSONL Export
        |
Authorized Media Export (after security review)
```
In Ertas, this maps directly to the node graph: File Import, PDF Parser (or Word/Image Parser), Deduplicator, Format Normalizer, PII Redactor, Quality Scorer, Train/Val/Test Splitter, and JSONL Exporter. Each node produces an observable intermediate output. Security reviewers can inspect the data at any stage before it moves to the next.
Key constraint: The exported JSONL file is classified at the same level as the source documents. It must be handled, stored, and transferred according to that classification level's requirements.
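For concreteness, one record of such an export might look like the following. The field names are illustrative, not Ertas's fixed schema; the exact layout depends on the annotation configuration:

```python
# Illustrative JSONL export record for an NER training set.
# Field names here are hypothetical; the real schema depends on
# the annotation types configured in the pipeline.
import json

record = {
    "text": "The committee reviewed the propulsion test results.",
    "entities": [{"start": 27, "end": 37, "label": "TECHNOLOGY"}],
    "classification": "U",        # portion-level marking carried as metadata
    "source_doc": "report_0142.pdf",
    "split": "train",
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```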
Pattern 2: Knowledge Base Construction (RAG)
Building a searchable knowledge base from classified documents for use by authorized AI systems within the same security enclave.
```
Authorized Media Import
        |
File Import → Parser → PII Redactor
        |
RAG Chunker → Embedding (local model) → Vector Store Writer
        |
[Knowledge base stored locally on classified system]
        |
API Endpoint → Query Embedder → Vector Search → Context Assembler → API Response
        |
[Retrieval endpoint accessible only within the air-gapped enclave]
```
The Ertas RAG pipeline runs entirely locally. Embedding generation uses a local model (no API calls). The vector store is a local file. The retrieval API endpoint listens only on localhost — accessible to other applications on the same machine but not to any network.
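To make the "entirely local" claim concrete, here is a minimal sketch of the same idea built from generic parts: a local embedding model, an on-disk vector store, and cosine search. It assumes sentence-transformers with model weights already copied onto the machine; Ertas's own implementation is internal to the application.

```python
# Conceptual sketch of fully local retrieval: local embedding model,
# on-disk vector store (here, just a .npy file), cosine search.
import numpy as np
from sentence_transformers import SentenceTransformer  # loaded from disk, no downloads

# Hypothetical local path; model files arrive on authorized media.
model = SentenceTransformer("/opt/models/all-MiniLM-L6-v2")

def build_store(chunks: list[str], store_path: str) -> None:
    """Embed chunks and persist the matrix; store_path should end in .npy."""
    vectors = model.encode(chunks, normalize_embeddings=True)
    np.save(store_path, vectors)

def search(query: str, chunks: list[str], store_path: str, k: int = 3) -> list[str]:
    """Normalized vectors make cosine similarity a plain dot product."""
    vectors = np.load(store_path)
    q = model.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(vectors @ q)[::-1][:k]
    return [chunks[i] for i in top]
```

A retrieval service wrapping search() would bind to 127.0.0.1 only, matching the enclave-local constraint above.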
Pattern 3: Cross-Domain Downgrade
When NLP training data prepared from classified sources needs to move to a lower-classification environment (for example, using redacted training data on an unclassified model training cluster), the pipeline must include a formal downgrade review.
This is not a technology problem — it is a process problem. The pipeline's role is to produce clean, fully redacted output and provide the audit trail that human reviewers need to authorize the cross-domain transfer.
Ertas supports this by generating a complete processing log: every document ingested, every transformation applied, every redaction performed, with timestamps and checksums. This log is the artifact reviewers examine during the downgrade authorization process.
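Conceptually, each log entry ties one transformation to a timestamp and to checksums of its input and output, so reviewers can walk the chain of custody end to end. A hypothetical sketch; the field names and file names are illustrative, not Ertas's actual log schema:

```python
# Illustrative audit-log entry: one transformation, with input/output
# checksums so reviewers can verify chain of custody.
import hashlib
import json
from datetime import datetime, timezone

def sha256_file(path: str) -> str:
    """Stream the file in blocks so large corpora hash without loading into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "node": "PII Redactor",
    "input_sha256": sha256_file("stage3_normalized.jsonl"),   # hypothetical stage files
    "output_sha256": sha256_file("stage4_redacted.jsonl"),
    "redactions_applied": 412,
}
with open("pipeline_audit.log", "a") as f:
    f.write(json.dumps(entry) + "\n")
```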
Document Types and Parsing Considerations
Classified document corpora typically include:
| Document Type | Parsing Challenge | Ertas Approach |
|---|---|---|
| Typed reports (PDF) | Classification markings in headers/footers, portion markings inline | PDF Parser extracts text; PII Redactor configured for classification marking patterns |
| Scanned documents | OCR accuracy varies with scan quality; handwritten annotations | Image Parser with local OCR; Quality Scorer flags low-confidence extractions |
| Technical manuals | Complex tables, diagrams with callouts, multi-column layouts | PDF Parser with layout analysis; structured extraction preserves table formatting |
| Email archives (PST/MBOX) | Nested threading, attachments, forwarded chains with mixed classification | File Import handles archive formats; Deduplicator resolves forwarded duplicates |
| Presentations | Bullet-point text, embedded charts, speaker notes | PowerPoint Parser extracts text from slides and notes separately |
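The Deduplicator row above is worth a concrete illustration. A minimal sketch of exact-duplicate removal after normalization, which is what collapses forwarded email copies and re-saved PDFs; near-duplicate detection (MinHash and similar) is a separate, fuzzier step:

```python
# Sketch of exact-duplicate removal after format normalization:
# hash the normalized text so trivially re-encoded copies collapse
# to one record.
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before hashing."""
    return " ".join(text.lower().split())

def dedupe(records: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for r in records:
        key = hashlib.sha256(normalize(r["text"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique
```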
Handling Classification Markings
Classified documents contain portion markings — classification indicators on individual paragraphs, such as "(S)" for Secret or "(U)" for Unclassified. The pipeline should:
- Detect and parse portion markings during text extraction
- Tag each text segment with its classification level
- Enable filtering by classification level during export (for example, extracting only "(U)" portions for a lower-classification training set)
The PII Redactor node can be configured to recognize standard portion marking patterns and either preserve them as metadata or redact them depending on the downstream use case.
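A minimal sketch of that detect-tag-filter flow, assuming markings appear at the start of each paragraph. The pattern covers common single-letter markings plus simple caveats such as "(S//NF)"; real marking grammars are richer, so treat this as a starting point rather than a policy engine:

```python
# Sketch of portion-marking detection: tag each paragraph with the
# classification found in its leading marking.
import re

MARKING = re.compile(r"^\((U|C|S|TS)(?://[A-Z/]+)?\)\s*")

def tag_portions(paragraphs: list[str]) -> list[dict]:
    tagged = []
    for p in paragraphs:
        m = MARKING.match(p)
        level = m.group(1) if m else "UNMARKED"  # unmarked text needs human review
        tagged.append({"classification": level, "text": MARKING.sub("", p, count=1)})
    return tagged

def unclassified_only(paragraphs: list[str]) -> list[str]:
    """Keep only '(U)' portions, e.g. for a lower-classification training set."""
    return [t["text"] for t in tag_portions(paragraphs) if t["classification"] == "U"]
```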
Operational Security Considerations
Media handling. All removable media used to transfer data into or out of the air-gapped environment must be inventoried, tracked, and degaussed or destroyed after use. Never reuse media across classification levels.
Screen capture and photography. The workstation should have no screen capture capability. Photography of the screen is prohibited. Ertas does not include any screen recording or screenshot functionality.
Maintenance and updates. Software updates to the air-gapped workstation require the same media transfer protocols as classified data. Obtain the Ertas update package on clean media, verify its hash against a known-good value published through a separate channel, and install without network connectivity.
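A sketch of that verification step, assuming the known-good hash arrives as a standard .sha256 text file; the filenames are illustrative:

```python
# Verify an update package against a known-good hash obtained through a
# separate channel, before installation.
import hashlib
import sys

def sha256_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

expected = open("ertas-update.sha256").read().split()[0].lower()
actual = sha256_file("ertas-update.bin")
if actual != expected:
    sys.exit(f"HASH MISMATCH: expected {expected}, got {actual} -- do not install")
print("Hash verified; proceed with installation.")
```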
Personnel access. Only cleared personnel with need-to-know should have physical access to the processing workstation. Log all access with badge-in/badge-out records.
Pipeline Observability Without Network
Traditional pipeline monitoring assumes a dashboard accessible over the network. In an air-gapped environment, observability is local.
Ertas provides pipeline observability directly in its desktop UI. Every node in the pipeline graph shows its processing status, record counts, error rates, and output previews. The complete execution log is written to a local file that can be reviewed on the same machine or exported on authorized media for compliance review.
No network-based monitoring, no cloud dashboards, no telemetry. Everything stays on the machine.
Getting Started
Processing classified documents for NLP is constrained by security requirements that eliminate most commercial tooling from consideration. The tool must be a native application, fully self-contained, with zero network dependencies and complete local observability.
Ertas Data Suite was built for exactly this operating model. A single installable binary that runs on a hardened workstation, processes documents through a visual pipeline, and produces AI-ready training data — all without opening a single network connection. Every transformation is logged locally, every intermediate output is inspectable, and the entire pipeline is auditable by your security officer.
The classified documents contain the domain knowledge your NLP models need. Ertas provides the pipeline to extract it safely.