    How Cybersecurity Teams Build AI in Air-Gapped Environments

    Cybersecurity teams deal with the most sensitive organizational data. Here's how to build AI data preparation and training pipelines that never touch the internet — including synthetic data generation with local LLMs.

    Ertas Team

    There is a particular irony in how most AI tools are deployed: they send data to the cloud to process it. For the average enterprise, this is a privacy tradeoff worth making. For cybersecurity teams, it is not a tradeoff at all — it is a disqualifying condition.

    "Most AI tools process inference over the cloud, making the data essentially public." That quote came from a cybersecurity firm we spoke with during our discovery calls. It captures the problem precisely. The data that cybersecurity teams work with — threat intelligence, incident reports, internal network topology, vulnerability details, behavioral analytics, and security event logs — is the most sensitive category of data in most organizations. Sending it to a third-party cloud service to be processed, even under a data processing agreement, defeats the purpose of having it secured in the first place.

    This guide covers how cybersecurity teams are building AI in environments where the data stays where it belongs.

    What Cybersecurity Teams Need AI For

    Before addressing the infrastructure constraints, it helps to be specific about the AI use cases that are driving demand in security operations:

    Alert triage and classification: Security operations centers handle thousands of alerts per day. The vast majority are false positives. A well-trained classification model that triages alerts by true positive probability — trained on the organization's own historical alert data — can dramatically reduce analyst fatigue and mean time to respond.

    Log anomaly detection: Network flow data, authentication logs, endpoint telemetry, and application logs contain the signals of lateral movement, privilege escalation, and data exfiltration. Classical rule-based detection misses novel patterns. ML models trained on baseline behavior can surface statistical anomalies that rules would never catch.

    Threat intelligence extraction: Unstructured threat reports, incident post-mortems, and vendor advisories contain valuable indicators of compromise, attacker techniques, and affected systems. NER models trained to extract these entities into structured formats accelerate threat intel ingestion significantly.

    Vulnerability triage: When a new CVE drops, security teams need to assess which systems are affected, what the exploitation probability is in their environment, and how to prioritize remediation. Models trained on the organization's asset inventory and historical vulnerability data can automate the initial triage layer.

    Incident report generation: Security analysts spend significant time writing incident reports, post-mortems, and executive summaries. Fine-tuned models trained on historical incidents can generate first drafts from structured event data, with analyst review before finalization.

    All of these use cases require training data derived from the organization's own operational data. None of that data can leave the environment.

    The Air-Gapped Constraint in Practice

    "Air-gapped" means no network connectivity at runtime. Not "self-hosted in your own cloud account." Not "Docker on your data center servers with firewall rules." Physically disconnected from external networks, or strictly network-isolated with no outbound internet connectivity.

    This creates specific requirements for every component of the AI data preparation pipeline:

    Document parsing: Must run entirely locally. No cloud OCR APIs (Google Document AI, Azure Document Intelligence, AWS Textract all phone home). Requires embedded OCR — Tesseract, Surya, or similar — running on local hardware.

    AI-assisted features: Any ML-assisted labeling, entity recognition, or quality scoring must use locally-hosted models. This means GGUF model files downloaded to local storage before deployment, running via Ollama or llama.cpp with no internet access at inference time.

    Quality scoring: Embedding-based deduplication and semantic quality scoring require local embedding models. Sentence-transformers run well on CPU for most embedding tasks. The model files must be pre-downloaded.

    Export and transfer: Data moves between systems via secure file transfer (encrypted drives, internal network transfer), never through external services.

    Updates: Software updates cannot be pushed automatically. Updates must be manually applied after review, which creates additional maintenance requirements but also reduces the attack surface.

    The most common failure mode when building air-gapped AI pipelines is discovering mid-project that a component phones home. Many open-source tools send telemetry, check for updates, or load models from external APIs without making this explicit. Any tool used in an air-gapped pipeline must be audited for external network calls before deployment.
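One way to catch phoning-home before it happens in production is to run the candidate tool inside a process-level egress guard. The sketch below is a minimal, assumed approach: it patches Python's `socket.socket.connect` so any outbound connection beyond an allowlist fails loudly. (A real audit would also use OS-level firewall rules and packet capture; this only covers Python-level connections in the current process.)

```python
import socket

# Loopback-only allowlist is an assumption; extend it for approved
# internal services before auditing a tool that needs them.
ALLOWED_HOSTS = {"127.0.0.1", "::1", "localhost"}

_real_connect = socket.socket.connect

def guarded_connect(self, address):
    """Refuse any connection to a host outside the allowlist."""
    host = address[0]
    if host not in ALLOWED_HOSTS:
        raise ConnectionRefusedError(f"blocked outbound connection to {host}")
    return _real_connect(self, address)

# Install the guard, then import and exercise the tool under audit.
socket.socket.connect = guarded_connect
```

Running a tool's test suite under this guard quickly surfaces hidden telemetry, update checks, or model downloads as `ConnectionRefusedError` tracebacks.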

    The Data Types and Their Preparation Requirements

    Security Event Logs

    The highest-volume data type in most security environments. The format is typically structured (CEF, LEEF, syslog, JSON), which makes parsing straightforward. The preparation challenges are:

    • Volume: Security logs are enormous. A medium-sized enterprise generates hundreds of gigabytes of log data per day. Training data needs to be sampled, filtered, and labeled — not processed in full.
    • Label imbalance: True positive alerts are rare (often less than 0.1% of events). Training a classification model requires deliberate sampling strategies to get enough positive examples, combined with synthetic data generation to augment the rare-class training set.
    • Temporal context: Many security events only have meaning in sequence (a series of failed login attempts, then a successful one from a new location). Training data preparation must preserve temporal ordering and context windows around events.
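The sampling point above can be made concrete. A common baseline, sketched here with assumed field names (`label`, `true_positive`) and an assumed 10:1 negative ratio, is to keep every rare positive and downsample the false-positive majority:

```python
import random

def build_training_sample(events, negative_ratio=10, seed=42):
    """Keep every confirmed true positive; downsample the
    false-positive majority to `negative_ratio` negatives
    per positive. Field names are illustrative."""
    rng = random.Random(seed)  # fixed seed for reproducible datasets
    positives = [e for e in events if e["label"] == "true_positive"]
    negatives = [e for e in events if e["label"] != "true_positive"]
    k = min(len(negatives), negative_ratio * len(positives))
    return positives + rng.sample(negatives, k)

# 1000 false positives, 3 true positives -> 3 + 30 sampled events
events = [{"id": i, "label": "false_positive"} for i in range(1000)]
events += [{"id": 1000 + i, "label": "true_positive"} for i in range(3)]
sample = build_training_sample(events)
```

Note that this deliberately discards temporal ordering; for sequence-dependent use cases the sampling unit should be a context window around each event, not the event alone.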

    Threat Intelligence Documents

    Unstructured reports in PDF, Word, or HTML format. Preparation requirements:

    • Document ingestion with entity-aware parsing (IOCs like IP addresses, hashes, domain names, CVE identifiers must be preserved exactly, not corrupted by OCR normalization)
    • NER annotation to label entities with their type (IP address vs. domain vs. file hash vs. threat actor name vs. affected product)
    • Relation extraction annotation for more advanced use cases (X exploits Y; A is associated with B)
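Before full NER, a first pass with exact-match patterns catches the rigidly formatted IOC types. The patterns below are hypothetical simplifications (real pipelines also handle defanged forms like `hxxp://` and `1.2.3[.]4`, and use a full TLD list for domains):

```python
import re

# Simplified IOC patterns -- illustrative, not production-grade.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,7}\b"),
    "domain": re.compile(r"\b(?:[a-z0-9-]+\.)+(?:com|net|org|io|ru|cn)\b"),
}

def extract_iocs(text):
    """Return {ioc_type: sorted unique matches} for each type found."""
    found = {}
    for ioc_type, pattern in IOC_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[ioc_type] = sorted(set(matches))
    return found
```

These regex hits then become pre-annotations that the analyst confirms or corrects, which is much faster than labeling from scratch.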

    Incident Reports and Post-Mortems

    Internal documents containing detailed technical descriptions of past incidents. These are the most sensitive documents in the environment (they describe how attackers have successfully compromised systems) and the most valuable for training (they contain ground truth about attacker behavior in the organization's specific environment).

    Preparation requirements:

    • Careful PII and sensitive-system redaction (hostnames, internal IP addresses, and system names that appear in incident reports may need to be anonymized before use in training data shared beyond the original incident team)
    • Structured extraction of incident attributes (MITRE ATT&CK techniques, affected systems, timeline, remediation steps)
    • Consistent formatting for fine-tuning incident summary models
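For the redaction step, deterministic pseudonymization is usually preferable to plain masking: the same host gets the same placeholder everywhere, so the incident narrative stays internally consistent. A minimal sketch, with assumed patterns for RFC 1918 `10.x` addresses and a hypothetical hostname convention:

```python
import re

# Illustrative patterns -- a real deployment derives these from the
# organization's actual addressing and naming schemes.
INTERNAL_IP = re.compile(r"\b10\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")
HOSTNAME = re.compile(r"\b[a-z]+-(?:db|web|dc)-\d{2}\b")

def redact(text, mapping=None):
    """Replace internal identifiers with stable placeholders;
    return the redacted text and the (re-usable) mapping."""
    mapping = {} if mapping is None else mapping
    def sub(kind):
        def repl(m):
            key = m.group(0)
            mapping.setdefault(key, f"{kind}_{len(mapping) + 1:03d}")
            return mapping[key]
        return repl
    text = INTERNAL_IP.sub(sub("IP"), text)
    text = HOSTNAME.sub(sub("HOST"), text)
    return text, mapping
```

Passing the same `mapping` across all documents from one incident keeps placeholders consistent across the whole report set.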

    Vulnerability Data

    Structured data from vulnerability scanners (Nessus, Qualys, Rapid7) combined with asset inventory data. Preparation requirements:

    • Joining asset data with vulnerability data while removing asset-identifying information (hostnames, IPs) before training
    • Labeling historical vulnerabilities with their actual exploitation outcome (exploited vs. not exploited in the environment)

    Building the Pipeline: Stage by Stage

    Ingest

    All documents go through a local parsing pipeline. For structured log data, this is straightforward format conversion. For unstructured documents (PDFs, Word, HTML threat reports), this requires embedded OCR and layout analysis running fully locally.

    The parser must handle the specific formats common in security environments: PDF threat reports with complex layouts, CSV/JSON log exports, XML vulnerability scan outputs, and Word incident reports.

    Clean

    Deduplication is important for log-derived training data where the same event type appears thousands of times. Semantic deduplication identifies near-identical events that would create training data with very low diversity.

    PII and sensitive identifier redaction: decide upfront which identifiers should be removed (internal IP addresses? hostnames? usernames?) vs. preserved (these may be the features the model needs to learn from). This is a domain-expert judgment call that ML engineers shouldn't be making alone.
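The dedup logic itself is simple once embeddings exist: keep a record only if it is below a cosine-similarity threshold against everything already kept. The sketch below uses a placeholder bag-of-words "embedding" so it is self-contained; in the air-gapped pipeline the `embed` function would call a pre-downloaded sentence-transformers model instead.

```python
import math
from collections import Counter

def embed(text):
    # Placeholder vectorizer -- swap in a locally hosted
    # sentence-transformers model in the real pipeline.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def dedupe(records, threshold=0.9):
    """Greedy near-duplicate filter: O(n^2), fine for sampled
    training sets; larger corpora need an ANN index."""
    kept, vectors = [], []
    for rec in records:
        v = embed(rec)
        if all(cosine(v, kv) < threshold for kv in vectors):
            kept.append(rec)
            vectors.append(v)
    return kept
```

The threshold is a judgment call: too high keeps noisy near-duplicates, too low discards legitimately similar but distinct events.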

    Label

    Security domain expertise is essential for annotation quality. A security analyst who has triaged thousands of alerts labels examples with far better accuracy than an ML engineer who has read the labeling guide. The tooling must be accessible to analysts — no Docker setup, no command-line interface, no Python environment.

    Annotation types for security AI:

    • Alert classification (true positive / false positive / needs investigation)
    • MITRE ATT&CK tactic and technique labeling for events and reports
    • Entity labeling for threat intelligence NER
    • Severity ratings for incidents

    Augment

    Synthetic data generation addresses the rarest and most valuable class: actual confirmed true positive alerts. Using a locally-hosted LLM (Llama, Qwen, Gemma running via Ollama from pre-downloaded GGUF files), the augmentation module generates plausible synthetic examples of attack patterns not well-represented in historical data.

    The LLM runs entirely locally — no API calls, no data egress. Temperature and diversity controls ensure the synthetic examples are diverse enough to improve model generalization.
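Concretely, augmentation requests go to the local Ollama HTTP endpoint (`http://localhost:11434/api/generate` by default). The sketch below only builds the request body; the model name and prompt wording are illustrative:

```python
import json

def build_augmentation_request(example, model="llama3.1", temperature=0.9):
    """Build a request body for a local Ollama /api/generate call.
    Model name and prompt template are assumptions, not fixed choices."""
    prompt = (
        "You are generating synthetic SOC training data.\n"
        "Produce one plausible variant of this confirmed true-positive "
        f"alert, as JSON:\n{json.dumps(example)}"
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                          # one JSON response
        "options": {"temperature": temperature},  # diversity control
    }
```

POSTing this body to the local endpoint (e.g. with `urllib.request`) returns the generated variant; raising `temperature` across batches trades fidelity for diversity.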

    Export

    Finalized training data is exported in the format required by the downstream model training job: JSONL for fine-tuning language models, CSV for classical ML classifiers, or structured JSON for agent tool-calling datasets.
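The JSONL case is one JSON object per line. A minimal writer, shown here with an assumed chat-style record shape for fine-tuning:

```python
import json

def export_jsonl(records, path):
    """Write one JSON object per line -- the common
    fine-tuning interchange format."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Illustrative record shape for an incident-summary fine-tune.
records = [
    {"messages": [
        {"role": "user", "content": "Summarize incident INC-001."},
        {"role": "assistant", "content": "Credential stuffing against VPN..."},
    ]},
]
```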

    Tooling Requirements for Air-Gapped Security Environments

    Any tool used in an air-gapped security AI pipeline must satisfy:

    • No telemetry: No usage data sent home, no error reporting to external services
    • No auto-update: Updates should require explicit manual action
    • Pre-downloadable models: All AI model files (for parsing, NER, quality scoring, augmentation) must be downloadable before deployment and usable without internet at runtime
    • No cloud fallback: No features that silently fall back to cloud APIs when local models are unavailable
    • Auditable dependencies: All third-party libraries should be auditable for unexpected network calls

    Ertas Data Suite is built for this use case: native desktop app, all AI inference through locally-hosted LLMs via Ollama and llama.cpp, no telemetry, no update checks at runtime, and pre-downloadable GGUF model files.


    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.

