
    How to Build an Air-Gapped AI Pipeline for Regulated Industries

    A decision-stage technical guide to building an AI pipeline with zero internet connectivity. Covers pipeline architecture at each stage — data ingestion, cleaning, labeling, augmentation, and export — with hardware requirements, tool comparisons, and transfer mechanisms for air-gapped environments.

    Ertas Team

    You have decided that your AI pipeline must run air-gapped — physically isolated from the internet with no exceptions. Maybe your data is classified. Maybe your regulator requires it. Maybe your security team conducted a risk assessment and concluded that any external connectivity is unacceptable for this particular workload.

    This article is not about whether you need air-gapped operation. (If you are unsure, see our guide to air-gapped vs on-premise vs self-hosted deployment for the decision framework.) This article covers the architecture decisions you need to make at each pipeline stage when building an AI system that will never touch the internet.

    The pipeline has five stages. Each stage has different hardware requirements, different tool constraints, and different failure modes when connectivity is removed. We will walk through each one.


    Pipeline Architecture Overview

    An air-gapped AI pipeline has the same logical stages as any other ML pipeline. The difference is that every component at every stage must function with zero external connectivity — no API calls, no license servers, no CDN-hosted assets, no telemetry, no dependency downloads.

    ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────────┐   ┌──────────┐
    │  Ingest  │──▶│  Clean   │──▶│  Label   │──▶│  Augment     │──▶│  Export  │
    │          │   │          │   │          │   │  (Synthetic) │   │          │
    │ OCR      │   │ PII/PHI  │   │ NER      │   │ Local LLM    │   │ JSONL    │
    │ PDF      │   │ Redact   │   │ Classify │   │ Inference    │   │ COCO     │
    │ Layout   │   │ Normalize│   │ BBox     │   │              │   │ CSV      │
    └──────────┘   └──────────┘   └──────────┘   └──────────────┘   └──────────┘
           │              │              │              │                   │
           ▼              ▼              ▼              ▼                   ▼
       [Audit Log]   [Audit Log]   [Audit Log]   [Audit Log]         [Audit Log]
    

    Every stage writes to a local audit log. In air-gapped environments, the audit trail is your only evidence of what happened to the data. There is no cloud logging service to fall back on.


    Stage 1: Data Ingestion

    Data ingestion converts raw enterprise files — PDFs, Word documents, scanned images, spreadsheets, emails — into machine-readable text and structured content. In an air-gapped environment, this means all parsing and OCR must be embedded in the application.

    Architecture decisions

    OCR engine selection: You need OCR that runs entirely locally, with no external API calls and no internet-dependent model downloads.

    | OCR engine | Air-gapped compatible | Language support | Accuracy on clean docs | Accuracy on scanned docs | GPU acceleration |
    | --- | --- | --- | --- | --- | --- |
    | Tesseract 5.x | Yes — fully local, open source | 100+ languages via offline language packs | Good | Moderate | No |
    | PaddleOCR | Yes — fully local, open source | 80+ languages, strong CJK support | Very good | Good | Yes (optional) |
    | EasyOCR | Yes — fully local, open source | 80+ languages | Good | Moderate | Yes (optional) |
    | Google Document AI | No — cloud API | N/A | N/A | N/A | N/A |
    | Azure Document Intelligence | No — cloud API | N/A | N/A | N/A | N/A |
    | AWS Textract | No — cloud API | N/A | N/A | N/A | N/A |

    For air-gapped environments, Tesseract and PaddleOCR are the primary options. Tesseract is more widely deployed and has better documentation for offline installation. PaddleOCR typically produces better results on complex layouts (multi-column, tables, mixed text/image) but requires more careful dependency management for offline installation.
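    As a concrete sketch, the snippet below runs Tesseract fully offline through pytesseract, assuming the engine and its language packs were already installed from pre-staged media; the file path and language code are illustrative.

```python
# Minimal local OCR sketch using pytesseract (Tesseract 5.x) and Pillow.
# Assumes Tesseract and its offline language packs are already installed
# on the air-gapped machine; no network access is needed at runtime.
from PIL import Image
import pytesseract

def ocr_page(image_path: str, languages: str = "eng") -> str:
    """Run OCR on a single scanned page and return extracted text."""
    image = Image.open(image_path)
    # `lang` accepts '+'-joined codes (e.g. "eng+deu") matching the
    # traineddata files staged in Tesseract's tessdata directory.
    return pytesseract.image_to_string(image, lang=languages)

if __name__ == "__main__":
    print(ocr_page("scanned_contract_page_001.png")[:500])
```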

    PDF parsing: PDF parsing has two modes — text extraction (for digitally-created PDFs) and OCR extraction (for scanned PDFs). Most enterprise document collections contain both.

    | PDF parser | Air-gapped compatible | Handles scanned PDFs | Table extraction | Layout preservation |
    | --- | --- | --- | --- | --- |
    | PyMuPDF (fitz) | Yes | With embedded OCR | Basic | Good |
    | pdfplumber | Yes | No (text-only) | Good | Good |
    | Docling | Yes (self-hosted) | With embedded OCR | Very good (97.9%) | Very good |
    | Camelot | Yes | No (text-only) | Very good (tables specifically) | Limited |
    | Marker | Yes | With embedded OCR | Good | Very good |
    | Adobe Acrobat API | No — cloud service | N/A | N/A | N/A |

    Recommendation for air-gapped: Docling (IBM Research, open source) for primary parsing, with PyMuPDF as a fallback for simpler documents. Docling's table extraction accuracy (97.9% on benchmarks) is important for enterprise documents where tables contain critical structured data.
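    A minimal sketch of that two-tier approach, assuming Docling, its layout models, and PyMuPDF were all installed from pre-staged media; the file name is illustrative and error handling is simplified.

```python
# Two-tier PDF parsing sketch: Docling first, PyMuPDF (fitz) as fallback.
# Both packages and Docling's layout/table models must be pre-staged;
# nothing is downloaded at runtime.
from docling.document_converter import DocumentConverter
import fitz  # PyMuPDF

def parse_pdf(path: str) -> str:
    try:
        result = DocumentConverter().convert(path)
        # Markdown export preserves tables and reading order.
        return result.document.export_to_markdown()
    except Exception:
        # Fallback for simple, digitally-created PDFs: plain text extraction.
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)

print(parse_pdf("quarterly_filing.pdf")[:500])
```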

    Hardware requirements for ingestion

    | Workload | CPU | RAM | GPU | Storage |
    | --- | --- | --- | --- | --- |
    | Text PDF extraction (no OCR) | 4+ cores | 8 GB | Not required | 2x source document volume |
    | OCR on scanned documents | 8+ cores | 16 GB | Optional (speeds PaddleOCR 3-5x) | 3x source document volume |
    | High-volume ingestion (10K+ docs) | 16+ cores | 32 GB | Recommended | 3-5x source document volume |

    Storage multiplier accounts for both the original documents and the extracted structured output (JSON, text, metadata).


    Stage 2: Cleaning and De-Identification

    Cleaning transforms raw extracted text into normalized, consistent content. De-identification detects and redacts personally identifiable information (PII) and protected health information (PHI). In air-gapped environments, all NLP models for entity detection must run locally.

    Architecture decisions

    PII/PHI detection approach: You have two options — rule-based pattern matching, or NLP model-based named entity recognition (NER). In practice, you need both.

    | Detection method | What it catches | False positive rate | Air-gapped compatible |
    | --- | --- | --- | --- |
    | Regex pattern matching | SSNs, phone numbers, emails, credit cards, dates in standard formats, medical record numbers | Low (patterns are precise) | Yes — no dependencies |
    | spaCy NER (local models) | Names, organizations, locations, dates in non-standard formats | Moderate (requires tuning) | Yes — model weights loaded from local storage |
    | Hugging Face NER (GGUF/ONNX) | Names, organizations, domain-specific entities | Low-to-moderate | Yes — quantized models run locally |
    | AWS Comprehend Medical | PHI in clinical text | Low | No — cloud API |
    | Google Healthcare NLP | PHI in clinical text | Low | No — cloud API |

    Recommended air-gapped approach: Layer both methods. Use regex patterns for structured identifiers (SSNs, phone numbers, emails, medical record numbers, dates). Use a locally loaded NER model (spaCy or quantized transformer) for unstructured identifiers (names, organizations, locations in free text).
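    A minimal sketch of that layering, assuming a spaCy English model was installed from local media; the regex patterns are illustrative and nowhere near an exhaustive identifier set.

```python
# Layered PII detection sketch: regex for structured identifiers,
# spaCy NER for unstructured ones. The spaCy model wheel is assumed
# to have been installed offline; nothing is downloaded at runtime.
import re
import spacy

PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

nlp = spacy.load("en_core_web_lg")  # loaded from local disk

def detect_pii(text: str) -> list[tuple[str, str]]:
    # First pass: precise patterns for structured identifiers.
    findings = [(label, m.group()) for label, rx in PATTERNS.items()
                for m in rx.finditer(text)]
    # Second pass: NER for names, organizations, locations, dates.
    findings += [(ent.label_, ent.text) for ent in nlp(text).ents
                 if ent.label_ in {"PERSON", "ORG", "GPE", "DATE"}]
    return findings
```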

    For HIPAA-regulated data specifically, the de-identification must satisfy the Safe Harbor method (removal of 18 specific identifier categories) or the Expert Determination method. Regex catches most structured identifiers. NER catches the unstructured ones. A human review stage after automated de-identification is standard practice for HIPAA compliance.

    Data normalization: Air-gapped environments often process documents accumulated over decades — different encoding schemes, inconsistent date formats, legacy character sets. Normalization converts these to consistent UTF-8 encoding, standardized date formats, and consistent whitespace handling. This is computationally cheap and has no connectivity requirements.
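    A small normalization sketch using only the Python standard library; the date format handled here is one example, and a real collection will need a longer list of legacy formats.

```python
# Normalization sketch: Unicode normal form, whitespace, line endings,
# and an ISO-8601 date rewrite (US-style MM/DD/YYYY assumed).
import re
import unicodedata
from datetime import datetime

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)   # consistent Unicode form
    text = re.sub(r"\r\n?", "\n", text)         # normalize line endings
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces/tabs

    def iso(m: re.Match) -> str:
        try:
            return datetime.strptime(m.group(), "%m/%d/%Y").strftime("%Y-%m-%d")
        except ValueError:
            return m.group()  # leave unparseable dates untouched

    return re.sub(r"\b\d{2}/\d{2}/\d{4}\b", iso, text)
```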

    Hardware requirements for cleaning

    | Workload | CPU | RAM | GPU | Notes |
    | --- | --- | --- | --- | --- |
    | Regex-only PII detection | 4+ cores | 8 GB | Not required | Fast, handles millions of records |
    | spaCy NER models | 4+ cores | 16 GB | Not required (CPU inference) | Slower than regex, more thorough |
    | Transformer NER (quantized) | 8+ cores | 16 GB | 8+ GB VRAM recommended | Best accuracy, requires GPU for reasonable speed |
    | Combined pipeline | 8+ cores | 32 GB | 16+ GB VRAM | Regex first pass, NER second pass, human review final pass |

    Stage 3: Labeling and Annotation

    Labeling is where domain experts assign categories, entities, bounding boxes, or quality scores to processed data. In air-gapped environments, the labeling interface must serve entirely from localhost — no external CDN assets, no cloud-synced projects, no browser-based tools that load scripts from remote servers.

    Architecture decisions

    Annotation tool selection: Most modern annotation tools are web applications that assume internet connectivity. Even self-hosted versions often load JavaScript libraries from CDNs, analytics scripts, or font files from external servers.

    | Annotation tool | Air-gapped compatible | Modalities | Desktop native | Domain-expert accessible |
    | --- | --- | --- | --- | --- |
    | Prodigy (Explosion AI) | Yes — fully local, perpetual license | NLP, CV, audio | Python-based (runs locally) | Moderate (requires terminal) |
    | Label Studio (self-hosted) | Partial — check for external asset loading | NLP, CV, audio, video | No (Docker/K8s web app) | Yes (browser UI) |
    | CVAT (self-hosted) | Partial — web app with potential external dependencies | CV only | No (Docker web app) | Yes (browser UI) |
    | Labelbox | No — cloud SaaS | NLP, CV | No | Yes |
    | Scale AI | No — cloud SaaS | NLP, CV | No | Yes |

    The Label Studio caveat: Label Studio can be self-hosted, but the self-hosted version must be audited for external calls. Previous versions loaded Google Fonts from an external CDN, included analytics scripts, and made calls to check for updates. In an air-gapped environment, these calls fail silently or cause errors. You need to verify — by inspecting network traffic — that your self-hosted Label Studio instance makes zero external HTTP requests.
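    One way to spot-check this from the host itself is to list established, non-loopback connections while exercising the annotation UI. The sketch below assumes psutil is available; it complements, rather than replaces, packet capture at the network boundary.

```python
# Audit sketch: print any established outbound connection whose remote
# address is not loopback. Run this while using the self-hosted tool
# to surface external calls (CDN assets, analytics, update checks).
import psutil

LOOPBACK = ("127.", "::1")

for conn in psutil.net_connections(kind="inet"):
    if conn.status == psutil.CONN_ESTABLISHED and conn.raddr:
        if not conn.raddr.ip.startswith(LOOPBACK):
            proc = psutil.Process(conn.pid).name() if conn.pid else "?"
            print(f"{proc}: {conn.laddr.ip}:{conn.laddr.port} -> "
                  f"{conn.raddr.ip}:{conn.raddr.port}")
```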

    Recommendation for air-gapped: For NLP annotation, Prodigy is the most reliably air-gapped option — it is a Python library with no web dependencies, serving its UI entirely from localhost. The trade-off is that it requires a Python environment, which limits accessibility for non-technical domain experts.

    For organizations where domain experts (doctors, lawyers, engineers) need direct access to the labeling interface, a native desktop annotation tool that requires no terminal, no Python, and no browser connectivity is the best option. This is the approach Ertas Data Suite takes — a native desktop app where the entire annotation interface runs locally with zero network dependencies.

    Hardware requirements for labeling

    Labeling is the least compute-intensive stage. It is primarily a human activity with software assistance.

    | Workload | CPU | RAM | GPU | Notes |
    | --- | --- | --- | --- | --- |
    | Text annotation (NER, classification) | 2+ cores | 8 GB | Not required | Primarily UI-bound, not compute-bound |
    | Image annotation (bounding boxes, segmentation) | 4+ cores | 16 GB | Optional (speeds rendering) | Large images need more RAM |
    | AI-assisted labeling (model suggestions) | 8+ cores | 16 GB | 8+ GB VRAM | Local model provides label suggestions for human review |

    Stage 4: Synthetic Data Augmentation

    Synthetic data augmentation uses LLMs to generate additional training examples from existing labeled data. In an air-gapped environment, this requires running LLM inference locally — no cloud APIs, no external model endpoints.

    Architecture decisions

    Local LLM runtime selection:

    | Runtime | Air-gapped compatible | Model format | GPU support | Multi-model serving |
    | --- | --- | --- | --- | --- |
    | Ollama | Yes — offline installation available | GGUF | NVIDIA, AMD, Apple Silicon | Yes |
    | llama.cpp | Yes — compile from source, no dependencies | GGUF | NVIDIA, AMD, Apple Silicon, Vulkan | No (single model) |
    | vLLM | Yes — but complex offline dependency installation | SafeTensors, GPTQ | NVIDIA (primarily) | Yes |
    | Microsoft Foundry Local | Yes — designed for disconnected operation | ONNX | NVIDIA, AMD, Intel, Qualcomm, Apple Silicon | Yes |
    | Hugging Face Inference API | No — cloud endpoint | N/A | N/A | N/A |

    Recommended for air-gapped: Ollama for general-purpose augmentation. It supports a wide range of GGUF models, has straightforward offline installation (copy the binary + model files), and serves an OpenAI-compatible API on localhost. For environments where Microsoft's ecosystem is preferred, Foundry Local is the alternative — with the trade-off of a narrower model selection.
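    A minimal augmentation sketch against Ollama's default localhost endpoint; the model tag and prompt are illustrative, and no request leaves the machine.

```python
# Paraphrase augmentation sketch against a local Ollama server
# (default address http://localhost:11434). The model must already
# be present on disk; nothing is pulled at runtime.
import requests

def paraphrase(example: str, model: str = "llama3.1:8b") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": f"Paraphrase the following text, preserving its meaning:\n{example}",
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(paraphrase("The patient reported persistent lower back pain."))
```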

    Model selection for augmentation:

    | Model | Parameters | VRAM required (Q4 quantized) | Augmentation quality | Air-gapped installation complexity |
    | --- | --- | --- | --- | --- |
    | Phi-4-mini | 3.8B | ~4 GB | Good for simple tasks | Low (small download, fast transfer) |
    | Llama 3.1 8B | 8B | ~6 GB | Good for general augmentation | Low |
    | Mistral 7B | 7B | ~6 GB | Good for structured output | Low |
    | Qwen 2.5 14B | 14B | ~10 GB | Very good | Moderate (larger transfer) |
    | Llama 3.1 70B | 70B | ~40 GB | Excellent | High (large download, requires high-VRAM GPU) |

    For most enterprise augmentation tasks — generating paraphrases, creating classification variants, expanding entity examples — an 8B-14B quantized model is the practical sweet spot. Quality is sufficient, hardware requirements are manageable, and the model files (4-10 GB) are feasible to transfer via removable media.

    Hardware requirements for augmentation

    | Workload | CPU | RAM | GPU | Throughput |
    | --- | --- | --- | --- | --- |
    | 7-8B model augmentation | 8+ cores | 32 GB | 16 GB VRAM (RTX 4080 or equivalent) | ~30-50 tokens/sec |
    | 14B model augmentation | 8+ cores | 32 GB | 24 GB VRAM (RTX 4090 or equivalent) | ~20-35 tokens/sec |
    | 70B model augmentation | 16+ cores | 64 GB | 48+ GB VRAM (A6000 or 2x RTX 4090) | ~10-20 tokens/sec |
    | CPU-only augmentation (7B) | 16+ cores | 64 GB | None | ~3-8 tokens/sec (slow but functional) |

    GPU is strongly recommended for augmentation. CPU-only inference on 7B models works but generates data 5-10x slower, which matters when you need to produce thousands of synthetic training examples.


    Stage 5: Export

    Export converts processed, labeled, and augmented data into formats consumable by downstream training and deployment systems. In an air-gapped environment, export targets local storage — never cloud object storage.

    Architecture decisions

    Export format selection depends on downstream use case:

    | Use case | Export format | File structure |
    | --- | --- | --- |
    | LLM fine-tuning | JSONL (instruction, input, output) | One JSON object per line |
    | RAG / retrieval | Chunked text with metadata | JSONL or structured JSON |
    | Computer vision (object detection) | YOLO or COCO format | Images + annotation files |
    | Computer vision (classification) | Directory structure with class folders | image/class_name/file.jpg |
    | Classical ML | CSV with features and labels | Standard tabular format |
    | DPO fine-tuning | JSONL with chosen/rejected pairs | Preference pairs per line |

    Audit trail export: In regulated environments, the training data alone is not sufficient. You must also export:

    • Data lineage (which source document produced which training example)
    • Transformation log (every cleaning, redaction, and modification with timestamps)
    • Operator log (who labeled what, when, and what they changed)
    • Quality metrics (inter-annotator agreement, confidence scores)

    For EU AI Act Article 30 compliance, this audit documentation must accompany the training data and be available for inspection. For HIPAA, the de-identification audit trail must demonstrate that PHI was properly removed before data was used for training.
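    A sketch of what that pairing can look like at export time: each training record is written to a local JSONL file alongside a matching lineage entry. The field names follow a hypothetical schema, not a mandated standard.

```python
# Export sketch: training record plus lineage record, local files only.
import json
from datetime import datetime, timezone

record = {
    "instruction": "Classify the clause type.",
    "input": "Either party may terminate this agreement with 30 days notice.",
    "output": "termination",
}
lineage = {
    "example_id": "ex-000123",
    "source_document": "contracts/msa_2019_acme.pdf",
    "transformations": ["ocr", "pii_redaction", "normalization"],
    "annotator": "analyst-07",
    "exported_at": datetime.now(timezone.utc).isoformat(),
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
with open("train.lineage.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(lineage, ensure_ascii=False) + "\n")
```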

    Hardware requirements for export

    | Workload | CPU | RAM | GPU | Notes |
    | --- | --- | --- | --- | --- |
    | JSONL/CSV export | 2+ cores | 8 GB | Not required | I/O-bound, not compute-bound |
    | Large-scale export (100K+ records) | 4+ cores | 16 GB | Not required | Disk speed matters more than CPU |
    | Export with audit trail generation | 4+ cores | 16 GB | Not required | Audit trail can be larger than the data itself |

    Transfer Mechanisms: Getting Software and Models into Air-Gapped Environments

    The most overlooked aspect of air-gapped AI is initial setup. You cannot install software from the internet. You cannot download model weights. Everything must be transferred through approved physical channels.

    Physical media transfer

    The standard approach for classified and air-gapped environments:

    1. Prepare on a connected machine: Download all software installers, dependencies, model weights, and configuration files onto a clean, formatted drive
    2. Security scan: Run the media through your organization's malware scanning and security review process
    3. Chain of custody: Document who prepared the media, what it contains, and when it was transferred
    4. Install on the air-gapped machine: Copy files from approved media to the target system
    5. Verify integrity: Compare checksums (SHA-256) of installed files against the prepared manifest
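    A small sketch of the integrity check in step 5, assuming the connected machine produced a manifest with one "<sha256>  <relative-path>" entry per line; the paths are illustrative.

```python
# Verify SHA-256 checksums of transferred files against a manifest
# prepared on the connected machine before media transfer.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def verify(manifest: Path, root: Path) -> bool:
    ok = True
    for line in manifest.read_text().splitlines():
        expected, name = line.split(maxsplit=1)
        if sha256(root / name) != expected:
            print(f"MISMATCH: {name}")
            ok = False
    return ok

print("all files verified" if verify(Path("manifest.sha256"), Path("/opt/staging"))
      else "verification failed")
```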

    For model weights specifically: a 7B GGUF model is roughly 4-6 GB. A 70B model is 35-45 GB. USB drives or portable SSDs handle these sizes easily. Larger datasets (hundreds of GB of source documents) may require portable NAS devices or multiple drives.

    One-way data diodes

    For organizations with more sophisticated air-gapped networks, hardware data diodes provide a one-way transfer mechanism. Data flows into the air-gapped network but cannot flow out. This is used in defense and critical infrastructure environments where removable media is also restricted.

    Data diodes allow automated, scheduled transfers of model updates and software patches into the air-gapped environment without creating any outbound data path.

    What must be pre-staged

    Before isolating the machine, transfer all of the following:

    | Category | Specific items | Typical size |
    | --- | --- | --- |
    | Application installers | AI pipeline software, annotation tools, inference runtime | 1-5 GB |
    | Runtime dependencies | Python packages (wheel files), system libraries | 2-10 GB |
    | OCR language packs | Tesseract language data, PaddleOCR models | 0.5-2 GB |
    | NER models | spaCy models, quantized transformer models for PII detection | 1-5 GB |
    | LLM weights | GGUF models for augmentation and AI-assisted labeling | 4-45 GB per model |
    | Configuration files | Pipeline configs, export templates, audit trail schemas | <100 MB |

    Total pre-staging for a complete air-gapped AI pipeline: approximately 10-70 GB, depending on how many LLM models you include.


    Compliance Mapping: Who Actually Requires Air-Gapped?

    Not every regulation requires air-gapped operation. Understanding which regulations require which deployment model prevents over-engineering.

    | Regulation / Context | Air-gapped required? | On-premise sufficient? | Notes |
    | --- | --- | --- | --- |
    | US classified systems (ITAR, classified data) | Yes | No | Physical isolation required by policy |
    | US CMMC Level 3+ (DoD contractors) | Often yes | Depends on data type | Controlled Unclassified Information handling |
    | HIPAA (healthcare) | No (but recommended for PHI training data) | Yes | HIPAA requires safeguards, not specific deployment models |
    | GDPR (EU) | No | Often sufficient | Requires data residency + processing controls; on-premise with audit trail satisfies most requirements |
    | EU AI Act (high-risk systems) | No | Often sufficient | Requires documentation and audit trail; deployment model is not prescribed |
    | India DPDP Act | No | May be required for significant data fiduciaries | Data localization for certain categories |
    | Saudi Arabia PDPL | No | Effectively required for personal data | Processing within the Kingdom |
    | Financial regulations (SOX, PCI-DSS) | No (except for specific high-security environments) | Yes | Strong access controls required; deployment model flexible |
    | Critical infrastructure (NERC CIP) | Often yes for OT networks | Yes for IT networks | OT/IT segmentation is standard |

    The practical guideline: Air-gapped is required for classified/defense data and critical infrastructure OT networks. On-premise is sufficient for most regulated industries (healthcare, finance, legal). Sovereign cloud (domestic provider) is acceptable for data that requires jurisdictional control but not physical isolation.


    Putting It Together: Reference Architecture

    A complete air-gapped AI pipeline for a regulated enterprise:

    Hardware:

    • Workstation or server: 16+ cores, 64 GB RAM, NVIDIA RTX 4090 (24 GB VRAM) or A6000 (48 GB VRAM)
    • Local storage: 2+ TB NVMe SSD for active projects, plus NAS for archival
    • Removable media station: for initial setup and periodic model/software updates

    Software stack:

    • OS: Linux (Ubuntu/RHEL) or Windows, fully updated before isolation
    • Ingestion: Docling + PyMuPDF + Tesseract/PaddleOCR
    • Cleaning: spaCy NER + regex patterns + custom rules
    • Labeling: Native desktop annotation tool (no Docker, no browser dependencies)
    • Augmentation: Ollama + Llama 3.1 8B (GGUF Q4)
    • Export: JSONL + audit trail generator
    • Inference runtime: Ollama, llama.cpp, or Foundry Local

    Estimated hardware cost: $8,000-$15,000 for a workstation build (RTX 4090 class), or $20,000-$40,000 for a server build (A6000 class). Compare to cloud GPU costs of $2-$4/hour for equivalent compute — the on-premise hardware pays for itself in 6-18 months of continuous use.

    This architecture handles the complete pipeline from raw documents to AI-ready training data, entirely within an air-gapped perimeter, with full audit trail at every stage.


    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
