
    5 Questions to Ask Before Buying an On-Premise AI Data Platform

    A buyer's guide for evaluating on-premise AI data platforms: offline capability, accessibility, audit trails, export formats, and implementation support.

    Ertas Team

    "On-premise" has become a marketing checkbox. Vendors slap it on their feature list because they know enterprise buyers are asking for it. But the gap between "we offer on-premise deployment" and "our platform actually works well on your infrastructure, with your constraints, without phoning home" can be enormous.

    These are the five questions that separate genuine on-premise capability from a cloud platform awkwardly shoved into a Docker container and called "on-prem." Ask them during your evaluation, and pay attention to how vendors respond — hesitation, qualifications, and redirects tell you more than polished answers.


    Question 1: Does It Work Fully Offline, or Does It Phone Home?

    This is the question that eliminates the most vendors. Many platforms marketed as "on-premise" still require an internet connection for licensing validation, feature updates, telemetry reporting, or access to cloud-hosted model APIs.

    What "phone home" looks like in practice:

    • The software checks a licensing server on startup. If it cannot reach the server, it enters a degraded mode or stops working after a grace period.
    • AI-powered features (like auto-labeling or smart cleaning) route data through a cloud API. The platform is on your server, but your data is going to theirs.
    • Usage telemetry is collected and transmitted to the vendor. Even if no content data is sent, metadata about your workflows and data volumes is still leaving your network.
    • Updates require internet access, either to pull packages or to validate update tokens.

    What to ask:

    • "If I unplug the network cable, does every feature still work? Which features degrade or stop?"
    • "Does the platform make any outbound network requests? Can you provide a network traffic log from a running instance?"
    • "How does licensing work in a fully disconnected environment?"
    • "Are AI-assisted features (auto-labeling, smart suggestions) processed locally, or do they call an external API?"

    Why it matters: If you are buying on-premise because your data cannot leave your network — healthcare PHI, defense classified data, financial PII — then "on-premise except for this one API call" is not on-premise. A single outbound connection is a compliance violation in many regulated environments.

    Red flag: The vendor says "our platform is on-premise" but cannot clearly explain the licensing mechanism for air-gapped environments. This usually means they have not actually deployed in one.
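
    For reference, air-gapped licensing does not need to be exotic. One common pattern is a signed, time-limited license file that the platform validates locally against a public key shipped with the installer. Here is a minimal sketch of that pattern in Python using the cryptography package; the file layout and the key are illustrative assumptions, not any vendor's actual format.

    ```python
    import base64
    import json
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    # Placeholder key bytes; in practice the vendor's public key ships with the installer.
    VENDOR_PUBLIC_KEY = bytes.fromhex("00" * 32)

    def validate_license(path: str) -> dict:
        """Validate a signed license file entirely offline: no licensing server, no outbound call."""
        with open(path, encoding="utf-8") as f:
            blob = json.load(f)
        payload = json.dumps(blob["license"], sort_keys=True).encode()
        try:
            Ed25519PublicKey.from_public_bytes(VENDOR_PUBLIC_KEY).verify(
                base64.b64decode(blob["signature"]), payload
            )
        except InvalidSignature:
            raise SystemExit("License file is invalid or has been tampered with.")
        return blob["license"]  # e.g. customer, expiry date, enabled features

    ```

    A vendor that can describe something like this, and what happens when the license expires with no network in sight, has probably deployed in an air-gapped environment before.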


    Question 2: Who Can Use It — ML Engineers Only, or Domain Experts Too?

    Data preparation quality depends on domain expertise. The people who know whether a label is correct, a cleaning rule makes sense, or a data point is an outlier are rarely the same people who can write Python scripts or navigate a CLI.

    If only ML engineers can operate the platform, you have created a bottleneck: every labeling decision, every cleaning rule review, and every quality check has to route through a technical team that is already overbooked.

    What to ask:

    • "Can a domain expert with no coding background label data, review pipeline output, and flag quality issues?"
    • "What does the labeling interface look like? Can we see it with our data, not your demo data?"
    • "How are review and approval workflows handled? Can a domain expert approve labeled data without touching the pipeline configuration?"
    • "What is the typical onboarding time for a non-technical user?"

    Why it matters: The best training data comes from tight feedback loops between domain experts and the data pipeline. If the platform requires a data engineer to translate every domain expert's feedback into code, the feedback loop slows from minutes to days.

    Red flag: The vendor's demo shows only CLI interactions or notebook-style interfaces. When you ask about the UI for domain experts, they describe a "planned" feature or point to a basic web form that is clearly an afterthought.

    What good looks like: A platform where a radiologist can review labeled medical images, a contract attorney can correct clause classifications, or an insurance adjuster can validate claims categorizations — all without writing code or asking an engineer for help.


    Question 3: Is Every Transformation Logged in an Audit Trail?

    Audit trails in AI data preparation are not a nice-to-have. The EU AI Act (Article 10) requires documented data governance for high-risk AI systems. HIPAA requires audit logs for PHI access and transformation. SOC 2 requires evidence of data handling controls. Even if you are not in a regulated industry today, audit readiness is becoming a baseline expectation for enterprise AI.

    What "audit trail" should mean:

    • Every data record has a lineage: where it came from, what transformations were applied, who applied them, when
    • Every label has attribution: who labeled it, when, what the original value was if it was changed
    • Every pipeline configuration change is logged: who changed what rule, when, and what the previous configuration was
    • Audit logs are immutable: they cannot be edited or deleted, even by administrators
    • Logs are exportable in standard formats for compliance review
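
    To make that concrete, here is a minimal sketch of what one append-only lineage event could look like, in Python. The field names and the hash-chaining scheme are illustrative assumptions rather than any specific platform's schema; the point is that every event records who did what and when, and chains to the previous event so silent edits are detectable.

    ```python
    import hashlib
    import json
    from datetime import datetime, timezone

    def lineage_event(record_id, action, actor, detail, prev_hash):
        """One append-only audit event; hashing over the previous event makes tampering detectable."""
        event = {
            "record_id": record_id,   # the data record this event belongs to
            "action": action,         # e.g. "ingested", "pii_redacted", "label_changed"
            "actor": actor,           # the user or service account that performed it
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "detail": detail,         # old/new values, rule id, source path, ...
            "prev_hash": prev_hash,   # hash of the previous event in this record's chain
        }
        event["hash"] = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
        return event

    # A record's chain: ingestion, then a domain expert correcting a label.
    chain = [lineage_event("rec-0012", "ingested", "svc-ingest",
                           {"source": "/data/raw/contracts/0012.pdf"}, prev_hash=None)]
    chain.append(lineage_event("rec-0012", "label_changed", "jane.doe",
                               {"field": "clause_type", "old": "indemnity",
                                "new": "limitation_of_liability"},
                               prev_hash=chain[-1]["hash"]))

    ```

    Whatever the platform's internal representation, the test is the same: can it produce this history for any single record on demand, and can nobody, including an administrator, rewrite it?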

    What to ask:

    • "Can you show me the audit trail for a single data record — from source ingestion through every transformation to final export?"
    • "Are audit logs immutable? Can an administrator delete or modify them?"
    • "What format are audit logs exported in? Can they be integrated with our existing compliance tools?"
    • "If a regulator asks 'who touched this data and when,' can the platform answer that question in under 5 minutes?"

    Why it matters: Without a complete audit trail, you cannot demonstrate compliance, you cannot reproduce your training data pipeline, and you cannot debug quality issues. When a model behaves unexpectedly, the first question is "what data was it trained on?" Without lineage, you cannot answer that.

    Red flag: The vendor says they have "logging" but it is just application logs (errors and system events), not data-level audit trails.


    Question 4: What Formats Can It Export?

    The prepared data needs to go somewhere — into a model training framework, a fine-tuning platform, a RAG pipeline, or a data warehouse. If the platform exports in a proprietary format that only works with their tools, you have traded cloud vendor lock-in for on-premise vendor lock-in.

    What to ask:

    • "What export formats are supported? JSONL, Parquet, CSV, COCO, YOLO, custom schemas?"
    • "Can I define a custom export schema, or am I limited to predefined formats?"
    • "Is there a bulk export API, or is export a manual process?"
    • "If I stop using your platform, can I export all my data — including labels, transformations, and audit trails — in open formats?"

    Why it matters: Your ML stack will evolve. The framework you use for training today may not be the one you use in two years. If your prepared data is locked in a proprietary format, migrating to a new tool means re-doing the preparation work.

    Red flag: The vendor's export documentation is sparse, formats are limited, or full export requires professional services. Also watch for platforms that export data but not metadata (labels, transformations, lineage) — the data without the metadata is significantly less valuable.

    What good looks like: The platform exports in standard ML formats with full metadata, supports custom schemas, provides API-driven export for automation, and lets you export everything (including audit trails) in open formats if you decide to leave.


    Question 5: What Does Implementation Look Like — Self-Serve or Supported?

    An on-premise platform is software that runs on your hardware. Getting it from "runs" to "useful" is the gap where most projects stall. The question is whether the vendor helps you cross that gap or leaves you to figure it out.

    What to ask:

    • "What does a typical implementation look like? Timeline, effort, who is involved?"
    • "Do you offer on-site or forward deployment for implementation?"
    • "What happens after the software is installed? Who configures the first pipeline? Who trains our team?"
    • "What ongoing support is included? What costs extra?"
    • "Can you provide references from organizations with similar infrastructure and data types?"

    Why it matters: Enterprise AI data preparation is not install-and-go. Configuring pipelines for your specific data, integrating with your source systems, designing label schemas for your domain, and training your team to operate the system — this work is as important as the software itself.

    A vendor that drops a Docker image and a link to documentation is giving you a tool. A vendor that embeds with your team, configures the platform for your data, and trains your people is giving you a capability.

    Red flag: The vendor's implementation plan is "install the software and read the docs," or implementation is outsourced to a third-party systems integrator who has never used the product with your type of data.

    What good looks like: A defined implementation plan with clear milestones, direct access to the vendor's engineers (not just a support queue), hands-on training for your team, and a handoff process that leaves your team able to operate independently.


    Putting It All Together

    These five questions are not exhaustive, but they cover the areas where "on-premise" claims most often break down:

    1. Offline capability — does it actually work without internet?
    2. Accessibility — can the people who know the data actually use the tool?
    3. Audit trails — is every transformation logged and traceable?
    4. Export formats — can you get your data out in standard formats?
    5. Implementation — will the vendor help you get to production, or just hand you software?

    Use these questions early in your evaluation process. The answers will tell you quickly whether a vendor's "on-premise" claim is genuine or aspirational.


    Evaluating Ertas

    Ertas is built for genuine on-premise deployment: fully offline capable, no phone-home, open export formats, complete audit trails, and an interface that domain experts can use without engineering support. Our implementation model is forward deployment — our engineers embed with your team to configure and train.

    If you are evaluating on-premise AI data platforms, book a discovery call and bring these questions. We will answer them directly.
