    Native Desktop vs Docker vs Kubernetes for On-Premise ML Data Pipelines
    Tags: native-desktop, docker, kubernetes, deployment, on-premise, data-pipeline, air-gapped, segment: service-provider


    Technical comparison of native desktop, Docker, and Kubernetes deployment models for on-premise data preparation tools — covering installation, ops overhead, security, and air-gap compatibility.

    Ertas Team

    When a service provider wins an enterprise data preparation engagement, the first technical decision isn't which model to use or how to structure the labeling schema. It's how the tooling gets deployed on the client's infrastructure.

    This decision has cascading effects: on timeline (days vs. weeks of setup), on who can operate the tools (ML engineers only vs. domain experts too), on security posture (attack surface area), and on whether the client's IT team will approve the deployment at all.

    Three deployment models dominate the on-premise data preparation landscape. Each makes different trade-offs.


    Native Desktop Applications

    A native desktop application installs like any other program on the target machine. On macOS, it's a .dmg or .app. On Windows, an .msi or .exe installer. On Linux, a .deb, .rpm, or AppImage.

    Installation Complexity

    Minimal. Download the installer, run it, launch the application. The entire process takes 2–10 minutes. No prerequisite software is required beyond the operating system itself (and GPU drivers, if GPU inference is needed).

    Operational Overhead

    Close to zero. The application manages its own data directory, configuration, and process lifecycle. Updates are application-level: download the new version and install it over the existing one. There's no daemon to monitor, no container registry to maintain, no cluster to keep healthy.

    Security Surface Area

    Small. The application runs as a single user-space process. It doesn't listen on network ports, apart from a localhost-only port when local LLM inference is enabled. There's no web server, no authentication layer, no API exposed to the network. The attack surface is the application binary itself and its data directory.

    Air-Gap Compatibility

    Native. After installation, the application requires no network connectivity. Model weights for local LLM inference can be transferred via USB or approved media. The application has no "phone home" requirement.

    Domain Expert Accessibility

    High. A compliance officer, legal reviewer, or medical coder can install and use a native desktop application without DevOps support. This matters because data preparation quality depends on domain expertise, not ML expertise.

    Hardware Access

    Direct. The application accesses CPU, GPU, and NPU through the operating system's native APIs. No virtualization layer, no passthrough configuration. GPU memory is fully available to the application.

    Update Management

    Standard application updates. In air-gapped environments, updates are delivered via physical media or internal software distribution systems.


    Docker Containers

    Docker-based tools ship as one or more container images, typically orchestrated with Docker Compose. Label Studio, for example, ships as a Docker image with a separate PostgreSQL container for metadata.
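
    A typical two-container stack of this kind can be sketched as a Docker Compose file. This is an illustrative sketch only — the image names, ports, and environment variables below are placeholders, not any specific tool's documented configuration:

```yaml
# Illustrative two-container stack: a labeling app plus PostgreSQL for metadata.
# Image names and environment variables are placeholders.
services:
  app:
    image: example/labeling-tool:1.0     # placeholder image
    ports:
      - "8080:8080"                      # UI reachable at http://localhost:8080
    environment:
      DATABASE_URL: postgres://app:secret@db:5432/appdb
    volumes:
      - app-data:/data                   # persistent annotation state
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: appdb
    volumes:
      - db-data:/var/lib/postgresql/data

volumes:
  app-data:
  db-data:
```

    Even this minimal example shows where the operational surface comes from: named volumes to back up, a port to expose, and inter-container dependencies to keep healthy.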

    Installation Complexity

    Moderate. Prerequisites include Docker Engine (or Docker Desktop), Docker Compose, and for GPU workloads, the NVIDIA Container Toolkit with compatible drivers. On enterprise Linux systems, this often means coordinating with the IT team for package installation and Docker daemon configuration.

    Total setup time: 1–4 hours for a straightforward deployment, longer when corporate proxy settings, registry mirrors, or driver issues complicate things.

    Operational Overhead

    Medium. Docker containers need monitoring — are they running? Have they consumed too much memory? Is the volume mount configured correctly? Docker Compose restarts handle basic recovery, but persistent state management (volumes, bind mounts) requires attention.

    Docker daemon updates sometimes break container compatibility. NVIDIA Container Toolkit updates sometimes break GPU access. These are not daily problems, but they appear at the worst moments.

    Security Surface Area

    Larger than native desktop. Docker introduces several components: the Docker daemon (running as root), container networking, exposed ports, and volume mounts. Each is a potential attack vector. Container escape vulnerabilities, while rare, are a real concern for security-conscious organizations.

    Many enterprise security teams require a review of Docker images, their base layers, and any network exposure before approving deployment.

    Air-Gap Compatibility

    Possible but difficult. Air-gapped Docker deployment requires:

    1. Pre-pulling all container images on an internet-connected machine
    2. Saving images as tarballs (docker save)
    3. Transferring tarballs to the target machine via approved media
    4. Loading images (docker load)
    5. Ensuring all runtime dependencies (model weights, configuration) are included

    This process frequently fails due to missing image layers, missing runtime dependencies (like model files that are downloaded at container startup), or version mismatches between Docker Engine versions.
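
    The transfer steps above look roughly like the following shell session. Image names are placeholders, and the commands require Docker on both machines, so this is a reference sketch rather than a runnable script:

```shell
# On the internet-connected machine: pull every image the stack needs,
# including the database, then bundle them into a single tarball.
docker pull example/labeling-tool:1.0        # placeholder image names
docker pull postgres:16
docker save -o stack-images.tar example/labeling-tool:1.0 postgres:16

# Record a checksum so the transfer can be verified on the other side.
sha256sum stack-images.tar > stack-images.tar.sha256

# After copying both files to the air-gapped host via approved media:
sha256sum -c stack-images.tar.sha256         # verify the transfer
docker load -i stack-images.tar              # restore the images
docker image ls                              # confirm both images are present
```

    Note that `docker save` bundles images, not runtime state — model weights or configuration that a container downloads at startup must be transferred separately.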

    Domain Expert Accessibility

    Low. Non-technical users cannot self-serve Docker deployments. Even running the tool requires understanding of Docker commands or a wrapper script. Troubleshooting (container crashes, volume issues, port conflicts) requires Docker expertise.

    Hardware Access

    Indirect. GPU access requires the NVIDIA Container Toolkit and proper device passthrough configuration. GPU memory available inside the container may be less than the host's full capacity depending on configuration. NPU access from within containers is poorly supported in 2026.

    Update Management

    Pull new images, restart containers. In air-gapped environments, this repeats the full image transfer process.


    Kubernetes Orchestration

    Kubernetes deployments are typically used when multiple teams need concurrent access to data preparation tools, or when the tool is part of a larger ML platform.

    Installation Complexity

    High. A Kubernetes cluster requires either a managed service (which is cloud, not on-premise) or a self-managed deployment using kubeadm, k3s, or a commercial distribution. On-premise Kubernetes also needs a storage provisioner (Rook-Ceph, Longhorn, or NFS), a load balancer (MetalLB for bare metal), and an ingress controller.

    GPU scheduling requires the NVIDIA GPU Operator or manual device plugin configuration. Setting up GPU time-slicing or MIG (Multi-Instance GPU) adds further complexity.
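
    As one reference point, the NVIDIA GPU Operator is typically installed via Helm; the commands below follow the commonly documented pattern (check NVIDIA's documentation for current chart versions and options):

```shell
# Add NVIDIA's Helm repository and install the GPU Operator
# into its own namespace.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```

    In an air-gapped cluster, even this step presumes the operator's images and chart are already mirrored internally.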

    Total setup time: days to weeks, depending on the team's Kubernetes experience and the organization's change management process.

    Operational Overhead

    High. Kubernetes clusters require ongoing maintenance: node updates, certificate rotation, storage management, monitoring (Prometheus/Grafana stack), log aggregation, and backup. A dedicated platform engineering team is the norm, not the exception.

    Security Surface Area

    Large. Kubernetes introduces API servers, etcd, kubelet, container runtime, pod networking, service mesh (if used), ingress, and RBAC configuration. Each component is a potential vulnerability. Kubernetes security hardening is a discipline in itself.

    Air-Gap Compatibility

    Very difficult. Air-gapped Kubernetes requires:

    1. All container images pre-loaded into a local registry
    2. A local registry running on the air-gapped network
    3. Helm charts or manifests modified to point to the local registry
    4. All operator images and dependencies pre-loaded
    5. Certificate management without external CA access

    Organizations that successfully run air-gapped Kubernetes typically have dedicated platform teams with deep Kubernetes expertise.
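
    Steps 1–3 of that list can be sketched as follows. The internal registry hostname, image names, and Helm values are placeholders; the commands assume Docker and Helm are available and a registry is already running on the air-gapped network:

```shell
# On the connected machine: retag each required image to point at the
# internal registry, then bundle it as in the Docker case.
docker pull example/labeling-tool:1.0                       # placeholder
docker tag  example/labeling-tool:1.0 registry.internal:5000/labeling-tool:1.0
docker save -o images.tar registry.internal:5000/labeling-tool:1.0

# On the air-gapped network: load the tarball and push into the local registry.
docker load -i images.tar
docker push registry.internal:5000/labeling-tool:1.0

# Helm charts must then reference the internal registry, e.g.:
helm install labeling-tool ./chart \
  --set image.repository=registry.internal:5000/labeling-tool \
  --set image.tag=1.0
```

    Every chart, operator, and sidecar image in the deployment needs this treatment, which is why the image inventory itself becomes a maintained artifact.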

    Domain Expert Accessibility

    Very low. End users access the tool through a browser, which is fine. But deployment, maintenance, and troubleshooting are entirely the domain of platform engineers.

    Hardware Access

    Managed by the cluster. GPU access requires the NVIDIA device plugin, with resource requests/limits specified per pod. Multi-GPU configurations need careful scheduling to avoid resource contention. GPU memory is managed at the pod level.
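
    A pod requesting GPU time looks roughly like this. The `nvidia.com/gpu` resource name is the NVIDIA device plugin's standard resource key; the pod and image names are placeholders:

```yaml
# Illustrative pod spec requesting one GPU via the NVIDIA device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: labeling-inference
spec:
  containers:
    - name: worker
      image: example/labeling-tool:1.0   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1  # GPUs are requested in limits; whole units only
```

    GPUs cannot be requested fractionally this way — sharing a single GPU across pods requires time-slicing or MIG configuration on top of the device plugin.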

    Update Management

    Helm chart upgrades or manifest redeployment. In air-gapped environments, this means updating the local registry with new images before upgrading.


    Comparison Summary

    | Factor | Native Desktop | Docker | Kubernetes |
    | --- | --- | --- | --- |
    | Install time | 2–10 min | 1–4 hours | Days–weeks |
    | Ops overhead | None | Low–Medium | High |
    | Security surface | Small | Medium | Large |
    | Air-gap deployment | Straightforward | Difficult | Very difficult |
    | Domain expert use | Self-service | Needs support | Needs support |
    | GPU access | Direct | Passthrough | Device plugin |
    | Multi-user | No (single machine) | Limited | Yes |
    | Update process | Install new version | Pull new images | Registry + Helm |
    | Team needed | None | DevOps-light | Platform team |

    Who Needs What

    Single-User Data Preparation (Most Enterprise Engagements)

    The reality of enterprise data preparation: most projects involve 1–3 people preparing data for a specific use case. A service provider sends an ML engineer and a domain expert (or trains the client's domain expert) to label, clean, and augment data for a fine-tuning project.

    For this scenario, native desktop is the right deployment model. Setup is instant. The domain expert can work independently. Air-gapped operation is built in. There's no infrastructure to maintain after the engagement ends.

    Small Team Collaboration (3–10 People)

    When multiple people need to work on the same dataset concurrently — for example, a team of annotators labeling different subsets — the options are:

    • Native desktop with shared storage: Each user runs the application locally, reading from and writing to a shared NFS or SMB mount. Simple but requires coordination on file locking and task assignment.
    • Docker deployment: A centralized instance that team members access via browser. Works well when the organization already has Docker infrastructure. Adds deployment overhead but provides centralized state.

    Enterprise-Scale Multi-Team (10+ Concurrent Users)

    When an organization runs multiple data preparation projects simultaneously, with dedicated teams for each, Kubernetes becomes relevant. The platform team provisions namespaces per project, manages GPU scheduling across teams, and provides centralized monitoring.

    This is the use case Kubernetes was designed for. But it's also the rarest scenario in data preparation. Most organizations are not running 10 concurrent data prep projects.


    The Tools in Practice

    Several popular data preparation tools illustrate these deployment model trade-offs:

    Label Studio: Ships as a Docker image. Requires Docker Compose for the full stack (application + PostgreSQL). Air-gapped deployment requires pre-pulling images and ensuring the database initializes correctly without network access. Non-technical users cannot deploy it independently.

    IBM Data Prep Kit: Python-based toolkit. Requires a Python environment (or Docker) and often additional dependencies for specific pipeline components. Not a standalone application; it's a library that engineers integrate into custom pipelines.

    Prodigy: Python-based but runs as a local web server on the user's machine. Closer to the native desktop model in practice, though installation still requires Python environment management.

    Ertas Data Suite: Native desktop application built with Tauri 2.0. Installs as a standard application with no prerequisites beyond GPU drivers for accelerated inference. The five pipeline modules (Ingest, Clean, Label, Augment, Export) run entirely on the local machine with optional Ollama/llama.cpp integration for AI-assisted labeling.


    Making the Decision

    Start with the simplest deployment model that meets your actual requirements — not the requirements you might have in the future. For the majority of enterprise data preparation engagements, a native desktop application on a workstation with a decent GPU is the right answer. It deploys in minutes, operates in air-gapped environments without modification, and gives domain experts direct access to the tools they need.

    Docker becomes relevant when you need centralized access for a small team. Kubernetes becomes relevant when you're operating at platform scale. Most data preparation projects never reach either threshold.

    The fastest path to prepared data is the one with the fewest deployment obstacles between your team and the actual work.

