
On-Premise Runtime Architecture for Enterprise AI Data Preparation
Architectural guide for running AI data preparation on-premise — deployment models, compute tiers, local LLM inference, and storage strategies for enterprise datasets.
Most enterprise AI projects spend 60–80% of their timeline on data preparation. Ingestion, cleaning, labeling, augmentation, export — these steps happen before a single training run begins. For organizations in regulated industries, this work must happen on infrastructure they control.
The architecture you choose for on-premise data preparation determines your throughput, your operational complexity, and whether domain experts can actually use the tools without an ML engineering escort. This guide covers the deployment models, compute requirements, and infrastructure patterns that work in practice.
Deployment Model: The First Decision
Three deployment approaches dominate on-premise data preparation tooling:
Native Desktop Applications
The application installs like any other program. It runs as a local process with direct access to CPU, GPU, and filesystem. No containers, no orchestration layer, no web server. Installation takes minutes. Updates are application-level — download and install.
Advantages: Zero DevOps overhead. Domain experts can install and run the tool themselves. Air-gapped operation is built in — the application runs entirely offline after installation. Direct hardware access means GPU inference doesn't go through a virtualization layer.
Limitations: Single-machine scope. Scaling across a team requires shared storage or coordination outside the tool.
Docker Containers
The tool ships as one or more container images. Deployment requires Docker (or Podman), and complex tools often need Docker Compose or similar orchestration. GPU access requires NVIDIA Container Toolkit and proper driver configuration.
Advantages: Reproducible environments. Consistent behavior across different host operating systems. Familiar deployment model for DevOps teams.
Limitations: GPU passthrough adds configuration complexity. The Docker daemon itself is an additional attack surface. Air-gapped deployment requires pre-pulling images and all dependencies, a process that often fails because of missing layers or runtime dependencies. Domain experts cannot self-serve; they need engineering support for deployment and troubleshooting.
Self-Hosted Web Applications (with or without Kubernetes)
The tool runs as a web service, typically behind Nginx or Traefik. Kubernetes deployments add service discovery, scaling, and health monitoring.
Advantages: Multi-user access through a browser. Centralized compute resources. Kubernetes enables autoscaling for batch workloads.
Limitations: Highest operational complexity. Requires networking, TLS configuration, authentication, and ongoing cluster management. GPU scheduling in Kubernetes requires the NVIDIA device plugin and careful resource quotas. Air-gapped Kubernetes deployments are notoriously difficult.
Compute Requirements by Pipeline Stage
Data preparation is not a single workload. Each stage has different compute characteristics:
Ingestion
Primarily I/O-bound. The bottleneck is reading documents from disk or network storage — PDF parsing, image decoding, text extraction. SSD storage matters more than CPU speed here. A modern NVMe drive can read at 3–7 GB/s, while a spinning disk tops out at 100–200 MB/s. For large document archives, this difference is measured in hours.
CPU usage during ingestion is moderate. Most formats (PDF, DOCX, HTML) use single-threaded parsing libraries, so more cores help only when processing many files in parallel.
Typical requirement: 4+ CPU cores, 16 GB RAM, NVMe SSD.
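Because each parser call is single-threaded, ingestion throughput comes from running many of them at once. A minimal sketch of that pattern, assuming the pypdf library is installed; the folder names are illustrative:

```python
# Minimal sketch: parallel text extraction across many PDFs.
# Assumes the pypdf library; folder names are illustrative.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from pypdf import PdfReader

def extract_text(path: Path) -> tuple[str, str]:
    """Parse one PDF on a single core and return (filename, text)."""
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return path.name, text

if __name__ == "__main__":
    Path("intermediate").mkdir(exist_ok=True)
    pdfs = sorted(Path("source_data").glob("*.pdf"))
    # Each parser call is single-threaded, so parallelism comes from
    # processing many files at once -- one worker process per CPU core.
    with ProcessPoolExecutor() as pool:
        for name, text in pool.map(extract_text, pdfs):
            (Path("intermediate") / f"{name}.txt").write_text(text)
```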
OCR (Optical Character Recognition)
CPU-intensive or GPU-accelerated depending on the engine. Tesseract runs on CPU and processes roughly 1–3 pages per second. GPU-accelerated OCR engines (PaddleOCR, EasyOCR) can achieve 10–30 pages per second on a mid-range GPU.
For scanned document archives, OCR is often the slowest stage in the pipeline. A 100,000-page archive at 2 pages/second takes roughly 14 hours with CPU-only OCR.
Typical requirement: GPU recommended (8+ GB VRAM), 32 GB RAM for batch processing.
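For a sense of what the CPU-only path looks like, here is a minimal batch OCR sketch. It assumes pytesseract and Pillow are installed, the tesseract binary is on PATH, and the folder names are illustrative:

```python
# Minimal sketch: CPU-only Tesseract OCR over a folder of scanned page images.
# Assumes pytesseract and Pillow are installed and tesseract is on PATH.
import time
from pathlib import Path

import pytesseract
from PIL import Image

Path("ocr_out").mkdir(exist_ok=True)
pages = sorted(Path("scans").glob("*.png"))

start = time.time()
for page in pages:
    text = pytesseract.image_to_string(Image.open(page))
    (Path("ocr_out") / f"{page.stem}.txt").write_text(text)

elapsed = max(time.time() - start, 1e-9)
# At roughly 2 pages/second, a 100,000-page archive needs about 14 hours on CPU.
print(f"{len(pages)} pages in {elapsed:.0f}s ({len(pages) / elapsed:.1f} pages/s)")
```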
Cleaning and Deduplication
CPU and memory-bound. Deduplication across large datasets requires holding similarity hashes in memory. Exact dedup on 1 million documents is straightforward; fuzzy dedup (MinHash, SimHash) at scale requires significant RAM.
PII detection and redaction adds CPU load. Regex-based PII detection is fast; NER-based detection (using a small language model) is slower but more accurate.
Typical requirement: 8+ CPU cores, 32–64 GB RAM depending on dataset size.
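As an illustration of where that RAM goes, a minimal fuzzy-dedup sketch using MinHash signatures with LSH lookup; it assumes the datasketch library, and the shingle size, similarity threshold, and placeholder corpus are illustrative:

```python
# Minimal sketch: fuzzy deduplication with MinHash signatures and LSH lookup.
# Assumes the datasketch library; shingle size and threshold are illustrative.
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)  # roughly 1 KB per document at 128 permutations
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:
        m.update(shingle.encode("utf-8"))
    return m

docs = [  # placeholder corpus: (doc_id, text) pairs
    ("a", "quarterly revenue increased by four percent"),
    ("b", "quarterly revenue increased by four percent."),
    ("c", "entirely unrelated maintenance report"),
]

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # ~80% estimated Jaccard similarity
unique = []
for doc_id, text in docs:
    m = signature(text)
    if not lsh.query(m):       # no near-duplicate indexed yet
        lsh.insert(doc_id, m)
        unique.append(doc_id)

print(unique)  # expected: ['a', 'c'] -- "b" is flagged as a near-duplicate of "a"
```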
Labeling with Local LLM
GPU-intensive when using AI-assisted labeling. A 7B parameter model at Q4 quantization requires approximately 4–5 GB of VRAM and can process 20–50 documents per minute for classification tasks. A 14B model at Q4 needs 8–10 GB VRAM and runs at roughly half the speed.
Manual labeling is CPU-trivial — the bottleneck is human speed, not compute.
Typical requirement: GPU with 8+ GB VRAM (16 GB preferred), 32 GB system RAM.
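A minimal pre-annotation sketch against a local model served by Ollama, using its OpenAI-compatible endpoint on localhost; the model name and label set are assumptions, not recommendations:

```python
# Minimal sketch: AI-assisted pre-annotation through Ollama's OpenAI-compatible
# endpoint on localhost. Model name and label set are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # key is ignored locally
LABELS = ["contract", "invoice", "correspondence", "other"]

def pre_annotate(text: str) -> str:
    """Ask a local 7B instruct model for a single classification label."""
    resp = client.chat.completions.create(
        model="mistral:7b-instruct",  # any pulled instruct model works here
        messages=[
            {"role": "system",
             "content": f"Classify the document as one of: {', '.join(LABELS)}. "
                        "Reply with the label only."},
            {"role": "user", "content": text[:4000]},  # truncate to fit the context window
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(pre_annotate("Invoice #4821: amount due EUR 1,200 within 30 days."))
```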
Augmentation and Synthetic Generation
Similar to labeling: GPU-bound when using local LLMs for synthetic data generation. Longer outputs (a full synthetic document versus a short label) increase GPU time per item. Generating 500-word synthetic documents with a 7B Q4 model yields roughly 5–15 documents per minute, depending on hardware.
Typical requirement: GPU with 8–16 GB VRAM, 32 GB system RAM.
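A sketch of the same idea for generation, this time against Ollama's native API; the model name, prompt, and output length are illustrative:

```python
# Minimal sketch: synthetic document generation via Ollama's native /api/generate.
# Model name, prompt, and output length are illustrative assumptions.
import requests

def generate_synthetic(seed_text: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mistral:7b-instruct",
            "prompt": ("Write a realistic 500-word internal memo in the same style "
                       f"and domain as this excerpt:\n\n{seed_text}"),
            "stream": False,
            # Long outputs dominate GPU time: ~700 generated tokens per item is
            # what pushes throughput down to single-digit documents per minute.
            "options": {"num_predict": 700, "temperature": 0.8},
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```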
Export
I/O-bound. Converting processed data to training formats (JSONL, Parquet, HuggingFace datasets) is limited by write speed. Compression adds CPU load. Export of 100 GB of processed data takes 10–30 minutes on NVMe, longer on HDD.
Typical requirement: NVMe SSD, 16 GB RAM, moderate CPU.
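A minimal export sketch, converting a labeled JSONL file to compressed Parquet with pyarrow; the field names and paths are illustrative:

```python
# Minimal sketch: convert a labeled JSONL file to compressed Parquet.
# Assumes pyarrow; field names and paths are illustrative.
import json
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

records = [json.loads(line)
           for line in Path("labeled.jsonl").read_text().splitlines() if line]
table = pa.Table.from_pylist(records)  # e.g. {"text": ..., "label": ...} per record

Path("export").mkdir(exist_ok=True)
# zstd trades a little CPU for much smaller files; write straight to the destination.
pq.write_table(table, "export/train.parquet", compression="zstd")
```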
Three Hardware Tiers
Not every data preparation project needs the same infrastructure. Here's what works at each scale:
Tier 1: Lightweight (CPU-Only, Under 100 GB Data)
- Hardware: Modern workstation or laptop. 8+ cores, 32 GB RAM, 1 TB NVMe SSD.
- Cost: $1,500–$3,000
- Use case: Small document sets, text-heavy data, manual labeling workflows
- LLM inference: CPU-only via llama.cpp. Slower (2–5 tokens/second for 7B models) but functional for pre-annotation of small batches.
- OCR: CPU-only Tesseract. Adequate for under 10,000 pages.
This tier handles proof-of-concept projects and small-scale production data preparation. Many enterprise engagements start here.
Tier 2: Mid-Range (GPU-Accelerated, 100 GB–1 TB)
- Hardware: Workstation with a dedicated GPU. 16+ cores, 64 GB RAM, 2 TB NVMe, NVIDIA RTX 4070/4080 (12–16 GB VRAM) or equivalent.
- Cost: $5,000–$10,000
- Use case: Production data preparation, AI-assisted labeling, synthetic augmentation
- LLM inference: GPU-accelerated via Ollama or llama.cpp. 30–80 tokens/second for 7B Q4. Fast enough for interactive labeling workflows.
- OCR: GPU-accelerated. 10–30 pages/second.
This is the workhorse tier. A single workstation at this level handles the data preparation needs of most enterprise AI projects — including the ones that clients assume need a GPU cluster.
Tier 3: Heavy (Multi-GPU, 1 TB+)
- Hardware: Server or high-end workstation with 2–4 GPUs. 32+ cores, 128–256 GB RAM, 4+ TB NVMe (or NVMe RAID), NVIDIA RTX 4090 or A6000 (24–48 GB VRAM per GPU).
- Cost: $20,000–$50,000
- Use case: Large-scale enterprise data preparation, concurrent pipeline stages, 14B+ model inference
- LLM inference: Multi-GPU enables larger models (30B+) or parallel inference across multiple datasets.
- OCR: Batch-optimized with GPU acceleration. Processes 100,000+ page archives overnight.
Most organizations discover they don't need this tier for data preparation. The common misconception: "We have 2 TB of documents, so we need a massive GPU cluster." In practice, data preparation processes documents sequentially or in small batches. The 2 TB passes through a mid-range workstation over days, not minutes — and that timeline is usually fine because data preparation is a one-time or periodic task, not a real-time service.
Local LLM Inference Architecture
Local LLM inference is the component that most changes the data preparation workflow. Instead of sending documents to a cloud API, the model runs on the same machine (or a machine on the same network) as the data preparation tool.
The two primary inference backends:
Ollama: Manages model downloads, quantization variants, and GPU allocation. Provides an OpenAI-compatible API on localhost. Easy to set up; good model library. Overhead is minimal — Ollama adds a thin HTTP layer over llama.cpp.
llama.cpp: Direct inference without the HTTP layer. Slightly more complex to configure but offers finer control over memory allocation, batch size, and threading. Preferred in air-gapped environments where Ollama's model registry is unreachable.
Both run models in the GGUF format. For data preparation tasks such as classification, entity extraction, and pre-annotation, instruction-following models in the 7B–14B range provide the best balance of speed and accuracy. Larger models (30B+) rarely improve labeling quality enough to justify the throughput reduction.
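For the llama.cpp path, a minimal sketch using the llama-cpp-python bindings to load a GGUF model directly, with no HTTP layer; the model path, context size, and offload settings are assumptions to adjust for your hardware:

```python
# Minimal sketch: direct GGUF inference with the llama-cpp-python bindings,
# no HTTP layer. Model path, context size, and offload settings are assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,        # context window in tokens
    n_gpu_layers=-1,   # offload every layer to the GPU; use 0 for CPU-only
    n_threads=8,       # CPU threads for any work not offloaded
)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "List every organization named in the text: ..."}],
    max_tokens=128,
    temperature=0,
)
print(out["choices"][0]["message"]["content"])
```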
Storage Architecture
Data preparation involves reading large volumes of source data, writing intermediate results, and producing final outputs. Storage I/O is a bottleneck at every stage.
Source data: Store on the fastest available media. NVMe SSD is preferred. If source data lives on network storage (NFS, SMB), copy it to local SSD before processing. Network I/O latency adds up across millions of file reads.
Intermediate data: OCR results, extracted text, embeddings, and partial processing outputs. These can be 2–5x the size of source data. Ensure enough local SSD capacity for the full intermediate dataset.
Output data: Final labeled, augmented, and exported datasets. Typically smaller than intermediate data but still significant. Export directly to the destination (training server, shared storage) when possible.
Rule of thumb: Provision 3–5x your source data size in local NVMe capacity for a complete pipeline run.
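A quick way to sanity-check that rule during procurement, with the multipliers as stated assumptions rather than measurements:

```python
# Minimal sketch: back-of-the-envelope NVMe provisioning for one pipeline run.
# The factors restate the rule of thumb above; they are assumptions, not measurements.
def provision_gb(source_gb: float,
                 intermediate_factor: float = 3.0,   # OCR text, embeddings, partial outputs
                 output_factor: float = 0.5,         # final exported datasets
                 headroom: float = 1.2) -> float:    # ~20% free space for safety
    return source_gb * (1 + intermediate_factor + output_factor) * headroom

print(f"{provision_gb(500):.0f} GB of local NVMe for a 500 GB source corpus")  # ~2700 GB
```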
Networking — Or the Absence of It
In air-gapped environments, there is no network to configure. The data preparation tool, the local LLM, and the data all reside on the same machine or on a physically isolated network segment. This is the simplest networking architecture possible — and for many regulated environments, it's a requirement.
For connected on-premise deployments, networking considerations are minimal for single-user data preparation. If multiple team members need to access results, a shared NFS/SMB mount or a simple file sync mechanism is sufficient. Data preparation tools do not need complex service meshes, load balancers, or ingress controllers.
Native Desktop Architecture in Practice
Ertas Data Suite takes the native desktop approach to this problem. Built with Tauri 2.0 (Rust backend, React frontend), it installs as a standard desktop application and runs entirely on the local machine. The five pipeline modules — Ingest, Clean, Label, Augment, Export — share a common local data store and access CPU, GPU, and NPU hardware directly through the operating system.
Local LLM inference integrates through Ollama or llama.cpp, running on the same machine. There's no network hop between the labeling interface and the model — the application communicates with the inference backend over localhost.
This architecture eliminates the deployment complexity that makes tools like Label Studio (Docker/Docker Compose) and IBM Data Prep Kit (Python environment + Docker) inaccessible to domain experts. A compliance officer or subject-matter expert can install the application, open a dataset, and start labeling without waiting for an ML engineer to set up infrastructure.
Choosing Your Architecture
The decision matrix is simpler than vendors make it:
| Factor | Native Desktop | Docker | Kubernetes |
|---|---|---|---|
| Users | 1–3 | 3–10 | 10+ |
| Setup time | Minutes | Hours | Days |
| Ops overhead | None | Low–Medium | High |
| Air-gap compatible | Yes | Possible | Difficult |
| Domain expert access | Direct | Needs support | Needs support |
| GPU access | Direct | Passthrough | Device plugin |
For most service providers delivering AI solutions to enterprise clients, the data preparation phase involves 1–3 people working on a defined dataset. The native desktop model handles this. Kubernetes becomes relevant when you're running a shared data preparation platform for multiple concurrent projects — which is a different problem than preparing data for a specific engagement.
Related Guides
This article is the hub for Pillar 4: On-Premise Runtime and Infrastructure for Data Prep. For deeper coverage of specific topics:
- Deployment models: Native Desktop vs Docker vs Kubernetes for On-Premise ML Data Pipelines
- Hardware selection: Hardware Sizing for On-Premise Data Preparation
- Local LLM tuning: Optimizing Local LLM Inference for Data Labeling and Augmentation
- Batch processing: Batch Processing Large Document Archives On-Premise
- Air-gapped Ollama: Running Ollama in Air-Gapped Enterprise Environments
- Throughput benchmarks: On-Premise Data Prep Pipeline Throughput Benchmarks
The runtime architecture you select affects every downstream decision — hardware procurement, team workflow, client delivery timelines, and ongoing operational cost. Get the deployment model right first, then optimize within it.