
Processing Classified Documents for NLP in Air-Gapped Environments
Architecture and operational guide for preparing classified documents as NLP training data in completely air-gapped environments. Covers security requirements, approved workflow patterns, air-gap verification, and pipeline design for sensitive document processing.
Building NLP models from classified documents is a problem that most AI tooling was never designed to solve. Commercial data preparation platforms assume network connectivity — for updates, for cloud storage, for telemetry. Classified environments assume the opposite: the machine processing the data must have zero network connectivity, verified and auditable.
This creates a fundamental tooling gap. Organizations processing classified documents for NLP training data need a pipeline that handles document parsing, text extraction, cleaning, annotation, and export — all running on a single machine with no network stack, no outbound connections, and no hidden dependencies that phone home.
This playbook covers the architecture, security requirements, and workflow patterns for preparing classified documents as NLP training data in air-gapped environments.
Security Classification Levels and Data Handling
Different classification levels impose different physical and operational constraints on the data processing environment.
| Classification | Physical Requirements | Personnel | Data Transfer |
|---|---|---|---|
| CUI (Controlled Unclassified) | Locked room, access-controlled workstation | Cleared personnel, need-to-know basis | Encrypted removable media with logging |
| Secret | SCIF or equivalent, TEMPEST-rated equipment | Active Secret clearance | Cross-domain solution (CDS) or manual review + sneakernet |
| Top Secret / SCI | SCIF, full TEMPEST compliance, RF shielding | Active TS/SCI clearance | CDS with multi-person approval, or physical media under escort |
The data pipeline platform must operate at the highest classification level of any document it processes. If a single Top Secret document enters the pipeline, the entire workstation is treated as a Top Secret system.
Air-Gap Architecture Requirements
An air-gapped environment is not simply a machine with WiFi turned off. True air-gap isolation requires verified physical separation from all networks.
Hardware Configuration
The processing workstation must meet the following baseline:
- No wireless hardware. WiFi cards, Bluetooth modules, and cellular modems must be physically removed (not just disabled in software). BIOS-level disablement is insufficient for higher classification levels.
- No network interface. Ethernet ports should be physically disconnected or the NIC removed entirely. For TEMPEST environments, the system should have no network hardware installed.
- USB port control. Only authorized, inventoried removable media devices may connect. USB ports not needed for authorized peripherals should be physically blocked or epoxied.
- Audio/visual isolation. Microphones and cameras must be physically removed. Speaker output should be disabled at the hardware level for TEMPEST environments.
- BIOS hardening. Boot order locked to internal drive only. BIOS password set. Secure boot enabled where supported.
Software Stack
The processing environment must be self-contained with zero external dependencies at runtime.
| Component | Requirement | Why |
|---|---|---|
| Operating System | Hardened Linux (SELinux enforcing) or approved Windows with STIG applied | Minimize attack surface, enforce mandatory access controls |
| Data pipeline tool | Native application, no container runtime, no package manager calls | Containers may attempt registry pulls; package managers need network |
| ML/NLP libraries | Pre-installed, version-locked, integrity-verified | No pip install, no npm install, no runtime downloads |
| Document parsers | Bundled with application, no external service calls | PDF parsing must not call Adobe APIs or cloud OCR |
| Embedding models | Local model files, no API calls | Embedding generation must run entirely on-device |
Ertas Data Suite meets these requirements as a native desktop application built on Tauri (Rust + React). It bundles all parsers, processing logic, and UI into a single installable binary. No Docker, no container runtime, no network services. At runtime, it opens no listening ports and makes no outbound connections.
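Ertas bundles its own stack, but sites that also maintain a general-purpose Python ML environment on the same workstation need a way to prove it matches the accredited baseline. A minimal sketch, assuming a requirements-style pinned manifest; the filename and format here are illustrative:

```python
# Verify the installed Python NLP stack against a pinned, approved manifest.
# Hypothetical sketch: "approved_manifest.txt" is an illustrative name, not
# part of any specific accreditation baseline.
from importlib import metadata

def load_manifest(path: str) -> dict[str, str]:
    """Parse 'name==version' lines into a dict, skipping comments."""
    pins = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                name, _, version = line.partition("==")
                pins[name.lower()] = version
    return pins

def check_environment(manifest_path: str) -> list[str]:
    """Return a list of deviations from the pinned manifest."""
    pins = load_manifest(manifest_path)
    problems = []
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"MISSING  {name}=={pinned}")
            continue
        if installed != pinned:
            problems.append(f"DRIFT    {name}: pinned {pinned}, installed {installed}")
    return problems

if __name__ == "__main__":
    for issue in check_environment("approved_manifest.txt") or ["OK: environment matches manifest"]:
        print(issue)
```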
Air-Gap Verification Checklist
Before processing classified documents, the air-gap must be verified. This checklist should be completed by the system administrator and reviewed by the security officer.
| Check | Method | Pass Criteria |
|---|---|---|
| No network hardware present | Physical inspection + lspci/lsusb audit | Zero network controllers listed |
| No wireless radios | Physical inspection of motherboard, expansion slots | All wireless modules physically removed |
| USB ports controlled | Physical inspection | Unauthorized ports blocked; authorized ports inventoried |
| No outbound connection capability | Attempt ping, DNS lookup, curl from terminal | All fail with "network unreachable" (not timeout) |
| No listening services | ss -tulnp or netstat equivalent | Zero listening ports |
| Application integrity | SHA-256 hash of installed application matches known-good hash | Hash match confirmed |
| OS hardening applied | STIG compliance scan or equivalent | All applicable controls pass |
| Audit logging active | Verify syslog/auditd running and writing to local storage | Log entries being generated |
This verification must be repeated after any hardware change, software update, or maintenance event. Document each verification with date, operator, and security officer sign-off.
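The network-related rows of the checklist lend themselves to a scripted pre-flight check that can be re-run after every maintenance event. A stdlib-only sketch for a Linux workstation (the platform and the /proc paths are assumptions; it supplements physical inspection, never replaces it):

```python
# Automated pre-flight sketch for the network rows of the air-gap checklist.
# Linux-only, stdlib-only. Exit code 1 on any failure.
import os
import socket
import sys

def network_interfaces() -> list[str]:
    """Non-loopback interfaces known to the kernel (should be empty)."""
    return [i for i in os.listdir("/sys/class/net") if i != "lo"]

def dns_resolves() -> bool:
    """DNS lookup must fail immediately on a true air gap, not hang."""
    try:
        socket.getaddrinfo("example.com", 80)
        return True
    except OSError:  # covers socket.gaierror
        return False

def listening_sockets() -> list[str]:
    """TCP sockets in LISTEN state (hex code 0A) from /proc/net/tcp[6]."""
    listeners = []
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip header row
                for line in f:
                    fields = line.split()
                    if fields[3] == "0A":
                        listeners.append(fields[1])  # local_address:port (hex)
        except FileNotFoundError:
            pass
    return listeners

if __name__ == "__main__":
    failures = []
    ifaces = network_interfaces()
    if ifaces:
        failures.append(f"interfaces present: {ifaces}")
    if dns_resolves():
        failures.append("DNS resolution succeeded")
    listeners = listening_sockets()
    if listeners:
        failures.append(f"listening sockets: {listeners}")
    print("AIR-GAP CHECK " + ("FAILED: " + "; ".join(failures) if failures else "PASSED"))
    sys.exit(1 if failures else 0)
```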
Approved Workflow Patterns
Pattern 1: Document-to-Training-Data Pipeline
This is the primary workflow — converting a corpus of classified documents into structured NLP training data.
```
Authorized Media Import
        |
File Import (PDF, Word, scanned images)
        |
Document Parsing (text extraction, layout analysis)
        |
Cleaning (deduplication, format normalization)
        |
PII/Classification Marking Redaction
        |
Quality Scoring
        |
Annotation (NER, classification labels, Q&A pairs)
        |
Train/Val/Test Split
        |
JSONL Export
        |
Authorized Media Export (after security review)
```
In Ertas, this maps directly to the node graph: File Import, PDF Parser (or Word/Image Parser), Deduplicator, Format Normalizer, PII Redactor, Quality Scorer, Train/Val/Test Splitter, and JSONL Exporter. Each node produces an observable intermediate output. Security reviewers can inspect the data at any stage before it moves to the next.
Key constraint: The exported JSONL file is classified at the same level as the source documents. It must be handled, stored, and transferred according to that classification level's requirements.
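For concreteness, one record of such an export might look like the following. The field names are illustrative, not Ertas's fixed schema; the exact layout depends on the annotation configuration:

```python
# Illustrative JSONL export record for an NER training set.
# Field names here are hypothetical; the real schema depends on
# the annotation types configured in the pipeline.
import json

record = {
    "text": "The committee reviewed the propulsion test results.",
    "entities": [{"start": 27, "end": 37, "label": "TECHNOLOGY"}],
    "classification": "U",        # portion-level marking carried as metadata
    "source_doc": "report_0142.pdf",
    "split": "train",
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```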
Pattern 2: Knowledge Base Construction (RAG)
Building a searchable knowledge base from classified documents for use by authorized AI systems within the same security enclave.
```
Authorized Media Import
        |
File Import → Parser → PII Redactor
        |
RAG Chunker → Embedding (local model) → Vector Store Writer
        |
[Knowledge base stored locally on classified system]
        |
API Endpoint → Query Embedder → Vector Search → Context Assembler → API Response
        |
[Retrieval endpoint accessible only within the air-gapped enclave]
```
The Ertas RAG pipeline runs entirely locally. Embedding generation uses a local model (no API calls). The vector store is a local file. The retrieval API endpoint listens only on localhost — accessible to other applications on the same machine but not to any network.
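To make the "entirely local" claim concrete, here is a minimal sketch of the same idea built from generic parts: a local embedding model, an on-disk vector store, and cosine search. It assumes sentence-transformers with model weights already copied onto the machine; Ertas's own implementation is internal to the application.

```python
# Conceptual sketch of fully local retrieval: local embedding model,
# on-disk vector store (here, just a .npy file), cosine search.
import numpy as np
from sentence_transformers import SentenceTransformer  # loaded from disk, no downloads

# Hypothetical local path; model files arrive on authorized media.
model = SentenceTransformer("/opt/models/all-MiniLM-L6-v2")

def build_store(chunks: list[str], store_path: str) -> None:
    """Embed chunks and persist the matrix; store_path should end in .npy."""
    vectors = model.encode(chunks, normalize_embeddings=True)
    np.save(store_path, vectors)

def search(query: str, chunks: list[str], store_path: str, k: int = 3) -> list[str]:
    """Normalized vectors make cosine similarity a plain dot product."""
    vectors = np.load(store_path)
    q = model.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(vectors @ q)[::-1][:k]
    return [chunks[i] for i in top]
```

A retrieval service wrapping search() would bind to 127.0.0.1 only, matching the enclave-local constraint above.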
Pattern 3: Cross-Domain Downgrade
When NLP training data prepared from classified sources needs to move to a lower-classification environment (for example, using redacted training data on an unclassified model training cluster), the pipeline must include a formal downgrade review.
This is not a technology problem — it is a process problem. The pipeline's role is to produce clean, fully redacted output and provide the audit trail that human reviewers need to authorize the cross-domain transfer.
Ertas supports this by generating a complete processing log: every document ingested, every transformation applied, every redaction performed, with timestamps and checksums. This log is the artifact reviewers examine during the downgrade authorization process.
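Conceptually, each log entry ties one transformation to a timestamp and to checksums of its input and output, so reviewers can walk the chain of custody end to end. A hypothetical sketch; the field names and file names are illustrative, not Ertas's actual log schema:

```python
# Illustrative audit-log entry: one transformation, with input/output
# checksums so reviewers can verify chain of custody.
import hashlib
import json
from datetime import datetime, timezone

def sha256_file(path: str) -> str:
    """Stream the file in blocks so large corpora hash without loading into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "node": "PII Redactor",
    "input_sha256": sha256_file("stage3_normalized.jsonl"),   # hypothetical stage files
    "output_sha256": sha256_file("stage4_redacted.jsonl"),
    "redactions_applied": 412,
}
with open("pipeline_audit.log", "a") as f:
    f.write(json.dumps(entry) + "\n")
```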
Document Types and Parsing Considerations
Classified document corpora typically include:
| Document Type | Parsing Challenge | Ertas Approach |
|---|---|---|
| Typed reports (PDF) | Classification markings in headers/footers, portion markings inline | PDF Parser extracts text; PII Redactor configured for classification marking patterns |
| Scanned documents | OCR accuracy varies with scan quality; handwritten annotations | Image Parser with local OCR; Quality Scorer flags low-confidence extractions |
| Technical manuals | Complex tables, diagrams with callouts, multi-column layouts | PDF Parser with layout analysis; structured extraction preserves table formatting |
| Email archives (PST/MBOX) | Nested threading, attachments, forwarded chains with mixed classification | File Import handles archive formats; Deduplicator resolves forwarded duplicates |
| Presentations | Bullet-point text, embedded charts, speaker notes | PowerPoint Parser extracts text from slides and notes separately |
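The Deduplicator row above is worth a concrete illustration. A minimal sketch of exact-duplicate removal after normalization, which is what collapses forwarded email copies and re-saved PDFs; near-duplicate detection (MinHash and similar) is a separate, fuzzier step:

```python
# Sketch of exact-duplicate removal after format normalization:
# hash the normalized text so trivially re-encoded copies collapse
# to one record.
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before hashing."""
    return " ".join(text.lower().split())

def dedupe(records: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for r in records:
        key = hashlib.sha256(normalize(r["text"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique
```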
Handling Classification Markings
Classified documents contain portion markings — classification indicators on individual paragraphs, such as "(S)" for Secret or "(U)" for Unclassified. The pipeline should:
- Detect and parse portion markings during text extraction
- Tag each text segment with its classification level
- Enable filtering by classification level during export (for example, extracting only "(U)" portions for a lower-classification training set)
The PII Redactor node can be configured to recognize standard portion marking patterns and either preserve them as metadata or redact them depending on the downstream use case.
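A minimal sketch of that detect-tag-filter flow, assuming markings appear at the start of each paragraph. The pattern covers common single-letter markings plus simple caveats such as "(S//NF)"; real marking grammars are richer, so treat this as a starting point rather than a policy engine:

```python
# Sketch of portion-marking detection: tag each paragraph with the
# classification found in its leading marking.
import re

MARKING = re.compile(r"^\((U|C|S|TS)(?://[A-Z/]+)?\)\s*")

def tag_portions(paragraphs: list[str]) -> list[dict]:
    tagged = []
    for p in paragraphs:
        m = MARKING.match(p)
        level = m.group(1) if m else "UNMARKED"  # unmarked text needs human review
        tagged.append({"classification": level, "text": MARKING.sub("", p, count=1)})
    return tagged

def unclassified_only(paragraphs: list[str]) -> list[str]:
    """Keep only '(U)' portions, e.g. for a lower-classification training set."""
    return [t["text"] for t in tag_portions(paragraphs) if t["classification"] == "U"]
```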
Operational Security Considerations
Media handling. All removable media used to transfer data into or out of the air-gapped environment must be inventoried, tracked, and degaussed or destroyed after use. Never reuse media across classification levels.
Screen capture and photography. The workstation should have no screen capture capability. Photography of the screen is prohibited. Ertas does not include any screen recording or screenshot functionality.
Maintenance and updates. Software updates to the air-gapped workstation require the same media transfer protocols as classified data. Obtain the Ertas update package on clean media, verify its hash against a known-good value published through a separate channel, and install without network connectivity.
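A sketch of that verification step, assuming the known-good hash arrives as a standard .sha256 text file; the filenames are illustrative:

```python
# Verify an update package against a known-good hash obtained through a
# separate channel, before installation.
import hashlib
import sys

def sha256_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

expected = open("ertas-update.sha256").read().split()[0].lower()
actual = sha256_file("ertas-update.bin")
if actual != expected:
    sys.exit(f"HASH MISMATCH: expected {expected}, got {actual} -- do not install")
print("Hash verified; proceed with installation.")
```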
Personnel access. Only cleared personnel with need-to-know should have physical access to the processing workstation. Log all access with badge-in/badge-out records.
Pipeline Observability Without Network
Traditional pipeline monitoring assumes a dashboard accessible over the network. In an air-gapped environment, observability is local.
Ertas provides pipeline observability directly in its desktop UI. Every node in the pipeline graph shows its processing status, record counts, error rates, and output previews. The complete execution log is written to a local file that can be reviewed on the same machine or exported on authorized media for compliance review.
No network-based monitoring, no cloud dashboards, no telemetry. Everything stays on the machine.
Getting Started
Processing classified documents for NLP is constrained by security requirements that eliminate most commercial tooling from consideration. The tool must be a native application, fully self-contained, with zero network dependencies and complete local observability.
Ertas Data Suite was built for exactly this operating model. A single installable binary that runs on a hardened workstation, processes documents through a visual pipeline, and produces AI-ready training data — all without opening a single network connection. Every transformation is logged locally, every intermediate output is inspectable, and the entire pipeline is auditable by your security officer.
The classified documents contain the domain knowledge your NLP models need. Ertas provides the pipeline to extract it safely.