
What 27 Enterprise AI Teams Told Us About Their Data Prep Problem
Based on 27 discovery calls across regulated industries, one problem kept surfacing before fine-tuning, RAG, or agents could even begin: data preparation. Here's what we heard.
We ran 27 discovery calls across regulated industries over a six-month period. The conversations spanned engineering and construction firms, healthcare organizations, legal practices, financial services teams, on-device AI startups, and AI agencies building for enterprise clients.
We asked about AI adoption goals, current tooling, blockers, and where time was actually being spent. We expected a variety of answers. We got a pattern so consistent it was almost unsettling.
Nine of the 27 teams named data preparation as their number-one AI pain point — unprompted, before we asked directly. The specific problems differed: file formats, regulatory constraints, labeling complexity, infrastructure restrictions. But the root cause was always the same: a missing layer between raw business data and AI-ready training data, and nobody had a good answer for how to build it.
This is what they told us.
The Teams We Talked To
The 27 calls broke down roughly as follows:
- Engineering and construction firms (4): Managing large document archives — BOQs (bills of quantities), specifications, engineering drawings, project reports — with years of accumulated data in PDFs, scanned files, and legacy formats.
- Healthcare organizations (5): Clinical notes, patient records, radiology reports, billing data. HIPAA compliance requirements meant cloud tools were effectively off the table.
- Legal practices and legal-tech firms (4): Contract libraries, case files, regulatory filings. Legal privilege and client confidentiality created restrictions similar to those in healthcare.
- Financial services and fintech (3): Transaction records, compliance documents, risk assessments. Regulatory audit trail requirements added a layer of complexity beyond what standard AI tooling handles.
- On-device and edge AI companies (4): Building AI products designed to run locally on hardware. Their own data preparation pipelines were blocking product development timelines.
- AI agencies (5): Building AI systems for enterprise clients. The problem they reported was usually a proxy for their clients' problems — they were absorbing the data prep complexity themselves.
- Early-stage AI startups (2): Note-taking, document intelligence, knowledge management. Smaller teams but the same data problem, compressed into founder time.
Across all of these, nine teams named data preparation as the primary bottleneck to their AI project — before model selection, before infrastructure, before compliance review. In most cases, they had already addressed those other areas. Data was what remained.
What "Data Prep" Actually Meant to Each Segment
One of the more interesting findings was that "data preparation" meant genuinely different things depending on the industry — but the pain it caused was identical.
For the engineering and construction firms, data prep meant converting a 700GB archive of PDF specifications, hand-drawn engineering documents, and scanned BOQs into structured data that could train a model to extract line items, quantities, and cost estimates. The AI Lead at one of these firms put it plainly:
"The problem is not fine-tuning but cleaning and preparing the diverse data."
The diversity is the challenge. A single project might involve PDFs with embedded tables, scanned blueprints, Excel files in proprietary formats, and handwritten notes. Getting from that to a clean, labeled dataset requires parsing, normalization, deduplication, and expert annotation — and there is no single tool that handles the full chain.
For healthcare teams, data prep meant something different: PHI redaction before any processing could begin, then structure extraction from clinical notes written in non-standard shorthand, then labeling by clinicians who are not data scientists. The compliance requirement is not incidental — it determines which tools are permissible and which are not.
For legal teams, the challenge was similar but with the added complexity of privilege. You cannot send client documents to a cloud API to parse them. You need parsing tools that run locally, labeling tools that domain experts (lawyers, not ML engineers) can actually operate, and an audit trail that can survive discovery.
For the edge AI companies, data prep was blocking product timelines. Their issue was annotation throughput — target classes change as products evolve, annotation tooling requires ML engineering to operate, and the dependency on engineers to run what is fundamentally a domain expert task was slowing everything down. The team at one edge AI startup told us:
"Data labeling is the primary challenge — target classes frequently change."
That last point — target classes frequently change — is underappreciated. In enterprise AI, the labeling schema is not fixed. It evolves as teams learn more about the problem. Every time it changes, the annotation tooling needs to be reconfigured, which requires ML engineering time. This makes the problem dynamic, not just large.
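To make that concrete, here is a minimal sketch of one way to treat the labeling schema as versioned data rather than tool configuration, so that a schema change becomes a data edit instead of an ML engineering task. Every name, class, and date in it is illustrative, not drawn from any of these teams:

```python
"""Minimal sketch: the labeling schema as versioned data rather than
tool configuration. All class names and dates are illustrative."""
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LabelSchema:
    version: str
    effective: date
    classes: tuple[str, ...]


# v1: the classes the team started with
SCHEMA_V1 = LabelSchema("1.0", date(2024, 1, 15),
                        ("line_item", "quantity", "unit_cost"))

# v2: the product evolved, so the target classes did too
SCHEMA_V2 = LabelSchema("2.0", date(2024, 4, 2),
                        ("line_item", "quantity", "unit_cost", "tax_code"))


def migrate_label(label: str, new: LabelSchema) -> str | None:
    """Carry a label forward if its class survives; None means re-review."""
    return label if label in new.classes else None
```

When the schema lives in data like this, adding a class touches one definition, and existing labels can be migrated or flagged for re-review mechanically rather than by reconfiguring the annotation tool.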
For AI agencies, the problem was that they were absorbing clients' data problems themselves. One agency founder told us:
"Clients in enterprise healthcare and legal are more likely to care about on-premises solutions."
The agency was not directly handling the data — they were designing the system that would. But their clients' on-premise requirements shaped every technology decision, and the fragmented landscape of available tools made that design process significantly harder than it needed to be.
The Five Types of Data Prep Problems
Across the 27 calls, five distinct categories of data preparation problems emerged. Most teams had at least two. Several had all five.
1. The Ingestion Problem
Raw data exists in formats that AI training pipelines cannot consume directly. PDFs, scanned images, legacy database exports, proprietary file formats, handwritten documents. Before any cleaning or labeling can happen, these files need to be parsed into structured text or structured data.
This is harder than it sounds. PDF parsing that works for clean digital PDFs fails on scanned documents. OCR that works on printed text fails on handwritten notes. Table extraction that works on simple tabular PDFs fails on complex multi-column engineering specifications. The variety of file formats in a typical enterprise document archive is enormous, and no single parsing tool handles all of them reliably.
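As a rough illustration of why no single tool covers the chain, here is a minimal sketch of the format fan-out an ingestion layer has to manage. Every parser function below is a deliberately unimplemented placeholder standing in for whatever library or OCR engine a team actually uses:

```python
"""Minimal sketch of ingestion fan-out: one parser per format class.
All parser functions are illustrative stubs, not real library calls."""
from pathlib import Path
from typing import Callable


def parse_digital_pdf(path: Path) -> str:
    raise NotImplementedError("stand-in for a PDF text-extraction library")


def parse_scanned_page(path: Path) -> str:
    raise NotImplementedError("stand-in for an OCR engine; print and handwriting differ")


def parse_spreadsheet(path: Path) -> str:
    raise NotImplementedError("stand-in for a spreadsheet reader plus table flattening")


PARSERS: dict[str, Callable[[Path], str]] = {
    ".pdf": parse_digital_pdf,   # fails on scans, which need the OCR path instead
    ".png": parse_scanned_page,
    ".tif": parse_scanned_page,
    ".xlsx": parse_spreadsheet,
}


def ingest(path: Path) -> str:
    try:
        return PARSERS[path.suffix.lower()](path)
    except KeyError:
        raise ValueError(f"no parser registered for {path.suffix!r}") from None
```

The registry grows with every new format, and the hardest cases do not even route cleanly: a scanned PDF carries a .pdf extension but needs the OCR path, which is exactly the kind of special case that accumulates in these pipelines.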
2. The Cleaning Problem
Raw parsed text is noisy. OCR errors, formatting artifacts, duplicate sections across documents, inconsistent terminology across teams or time periods. Before data can be labeled, it needs to be cleaned — and cleaning at enterprise scale requires either significant manual effort or sophisticated tooling that most teams do not have.
The CTO at one on-device AI company described the standard expectation well:
"Making the data cleanup process significantly easier, even if only 80% automated, would be a huge mover."
Note the "80%" — this is not a request for perfect automation. Teams know that some human review is always needed. What they need is for the first 80% to not require ML engineers writing custom Python scripts.
3. The Labeling Problem
For supervised learning and instruction tuning, data needs to be labeled. The people best placed to label it — the domain experts — are typically not technical enough to operate the available labeling tools without significant setup and support. The result is a dependency on ML engineers to run annotation pipelines whose real work should be happening on domain-expert time.
We encountered this pattern across healthcare (doctors who need to label clinical notes), legal (lawyers who need to annotate contract clauses), and construction (engineers who need to identify and categorize BOQ line items). In each case, the annotation tool itself was the bottleneck, not the expertise of the annotators.
4. The Compliance Problem
In regulated industries, data cannot move to the cloud for processing. HIPAA, GDPR, legal privilege, and internal data governance policies all impose constraints on where data can go and who can see it. Most commercial AI data preparation tools are cloud-native. This means regulated enterprises either accept the compliance risk, build their own tools, or fall back to manual processes.
A cybersecurity firm put this constraint in clear terms:
"Most AI tools process inference over the cloud, making the data essentially public."
This is not a fringe concern. It is a structural incompatibility between how most AI tooling is designed and where regulated enterprise data lives. The consequence is that regulated organizations either do AI differently or they do not do AI at all.
5. The Integration Problem
For the teams that had already assembled a toolchain — a parser here, an annotation tool there, a cleaning library on top — the problem was stitching them together. No shared data format. No shared audit trail. Custom glue code that breaks when any individual tool updates. ML engineering time spent maintaining plumbing rather than building models.
This is the problem that looks solved from the outside but is actively consuming resources inside. One note-taking AI startup founder told us simply:
"Data is the biggest issue."
Not training. Not inference. Not deployment. Data. And they had already tried the fragmented tool approach.
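One way to see the integration tax concretely is to ask what has to travel between the tools. Teams running stitched toolchains end up hand-rolling something like the record below, plus converters into and out of each tool's native format. The shape shown here is entirely illustrative:

```python
"""Illustrative sketch of the shared record a stitched toolchain lacks.
Each tool speaks its own format, so teams maintain this glue by hand."""
from dataclasses import dataclass, field


@dataclass
class Record:
    doc_id: str
    text: str
    labels: dict[str, str] = field(default_factory=dict)
    # Provenance fields every downstream step needs, which no shared format carries:
    source_path: str = ""
    pipeline_steps: list[str] = field(default_factory=list)  # e.g. ["parsed", "deduped"]
    annotator: str | None = None


def to_annotation_task(rec: Record) -> dict:
    """Converter into one tool's expected input; every tool pair needs its own."""
    return {"data": {"text": rec.text}, "meta": {"doc_id": rec.doc_id}}
```

Every converter like to_annotation_task is a place where provenance can silently drop on the floor, which is why the audit trail problem and the integration problem are really the same problem.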
Why Model Training Was Almost Never the Stated Problem
This is the finding that surprised us most when we stepped back and looked at the aggregate.
Model training — the activity that receives the most attention in the AI press, the most VC investment, and the most engineering talent — was rarely mentioned as a bottleneck. Teams were not saying "we need a better base model" or "we need better training infrastructure." They were saying: we cannot get to training because our data is not ready.
This aligns with the broader research. Industry consensus places 60-80% of ML project time on data preparation, not model training. Forrester and Capital One surveyed 500 enterprise data leaders and found that 73% identified data quality and preparation as the number-one barrier to AI success. The pattern we saw in 27 calls is consistent with what the research shows at scale.
The reason is structural: model training is a solved problem in the sense that the tools are mature, the infrastructure exists, and the process is well-understood. Data preparation is not solved. It is domain-specific, format-specific, compliance-specific, and organization-specific in ways that generalized tooling handles poorly.
What Teams Were Currently Doing
When we asked how teams were currently handling data preparation, the answers fell into three categories:
Category 1: Stitched toolchains. Three to seven individual tools, each handling one part of the pipeline, connected by custom code written and maintained by ML engineers. The most common stack: a document parser (often Docling or Unstructured.io), an annotation tool (usually Label Studio), and a cleaning library (typically Cleanlab or custom scripts). Synthetic data generation, when needed, added Distilabel or similar. Each tool has its own setup, its own data format, its own maintenance burden.
Category 2: Manual processes. In regulated industries where the toolchain approach ran into compliance walls, teams fell back to spreadsheets, manual review, and document-by-document processing. This works but does not scale beyond small pilots.
Category 3: Nothing yet. Several teams had not started data preparation at all. They had AI adoption goals, they had a rough sense of what training data they needed, and they were blocked at the first step. The tooling landscape was too fragmented, too technical, or too cloud-dependent for them to find an entry point.
None of these teams described their current approach as satisfactory. The stitched toolchain teams wanted to reduce engineering overhead. The manual process teams wanted to move faster. The teams who had not started wanted somewhere to begin.
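For a sense of what that engineering overhead looks like, here is a compressed, illustrative example of the glue for a single hop in the Category 1 stack: parsing documents and emitting annotation tasks. It assumes the commonly documented entry point of the unstructured library and a Label Studio-style task format; real versions accumulate error handling, retries, and format workarounds around this core:

```python
"""Illustrative glue for one hop in a stitched stack: parse with
unstructured, emit Label Studio-style task JSON. Assumes unstructured
is installed; a cleaning/quality step with a third tool would slot in
between the two stages."""
import json
from pathlib import Path

from unstructured.partition.auto import partition


def pdfs_to_tasks(pdf_dir: Path, out_file: Path) -> None:
    tasks = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        elements = partition(filename=str(pdf))                     # tool 1: parsing
        text = "\n".join(el.text for el in elements if el.text)
        tasks.append({"data": {"text": text, "source": pdf.name}})  # tool 2: task import
    out_file.write_text(json.dumps(tasks, indent=2))
```

A dozen lines on the page, but each one hides a failure mode: parser exceptions on malformed files, format drift when either tool updates, and no audit trail for what happened in between.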
The Missing Layer
What emerged from the 27 calls was a clear picture of what was absent from the enterprise AI tooling landscape: a unified, on-premise data preparation environment designed for regulated industries.
The individual components exist. Document parsers exist. Annotation tools exist. Quality scoring libraries exist. What does not exist — or did not, at the time of these calls — is a single environment that handles the full pipeline (ingest, clean, label, augment, export) without requiring ML engineering to glue tools together, without sending data outside the organization, and without demanding that domain experts learn to operate developer tools.
This is not a feature gap in any individual tool. It is a layer that the ecosystem has not built.
Implications for Enterprise AI Adoption
The picture these 27 calls painted is of an enterprise AI landscape where the limiting factor is rarely what the press focuses on. It is not model capability. It is not compute availability. It is not organizational will.
It is the gap between the data organizations have and the data their AI systems need — and the absence of tooling designed to close that gap in regulated, on-premise environments.
Most of the 65% of enterprise AI deployments currently stalling are stuck at this layer. The organizations that are succeeding are either investing heavily in ML engineering to build and maintain custom pipelines, or they are operating in segments where cloud tooling is acceptable and the data is already relatively clean.
For the rest — regulated industries, organizations with complex document archives, teams that need domain experts involved in annotation — the tooling gap is real. Acknowledging it clearly is the first step to closing it.
Your data is the bottleneck — not your models.
Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
Related Reading
- Enterprise AI Projects Fail at the Data Stage — Not the Model Stage — the structural reasons why data preparation is where most enterprise AI projects stall
- The Hidden Cost of Stitching Together Docling, Label Studio, and Cleanlab — a detailed breakdown of the integration tax paid by teams running fragmented tool stacks
- Your ML Engineers Shouldn't Be Doing This — why domain expert lockout is slowing down annotation pipelines across regulated industries
- What Is AI Data Readiness? The Assessment Every Enterprise Skips — what AI data readiness means and how to assess it before jumping to model selection
Turn unstructured data into AI-ready datasets — without it leaving the building.
On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.