
    The Hidden Cost of Stitching Together Docling, Label Studio, and Cleanlab

    Most enterprise AI teams use 3-7 tools for data preparation. The individual tools are good. The integration is the problem — and the cost is higher than most teams realize.

    Ertas Team

    Every tool in the standard enterprise data preparation stack is genuinely good at what it does. Docling, developed by IBM Research, is a serious document parsing library with excellent handling of complex PDFs and table extraction. Label Studio is a capable, extensible annotation platform that supports a wide range of task types. Cleanlab is a well-researched data quality library with sophisticated label error detection. Distilabel offers a flexible pipeline interface for synthetic data generation.

    The individual tools are not the problem. The integration is the problem — and it is more expensive than most teams realize before they are already inside it.

    The Typical Enterprise Stack

    Before we get to the costs, it helps to be precise about what the typical fragmented stack actually looks like.

    A team beginning an enterprise AI data preparation project in 2025 or 2026 assembles something like this:

    Document ingestion: Docling (IBM Research)
    Handles PDF parsing, table extraction, and conversion to structured formats like JSON or Markdown. Technically strong — particularly for research and technical documents. Requires Python, runs as a library or command-line tool. No GUI, no annotation capability, no quality management. Outputs structured text that needs to go somewhere.
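
    To make "runs as a library" concrete, a minimal Docling invocation looks something like the sketch below. The file path is a placeholder, and API details can vary between versions.

```python
from docling.document_converter import DocumentConverter

# Convert a PDF into Docling's structured document representation.
converter = DocumentConverter()
result = converter.convert("reports/q3_technical_report.pdf")  # placeholder path

# Export to Markdown (JSON export is also available). The parsed text still
# has to be handed off to whatever tool comes next in the stack.
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])
```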

    Data annotation: Label Studio
    Self-hosted via Docker, deployed as a web application. Supports text classification, named entity recognition, image annotation, and other task types through a configuration-based interface. Domain experts and ML engineers access it through a browser. Strong feature set, active community, solid documentation. Requires a server to run, a Docker installation to maintain, and some engineering effort to configure annotation schemas.
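
    Even a simple text classification project illustrates the "engineering effort to configure" point. A sketch using the legacy Label Studio Python SDK might look like the following; the URL, API key, and label names are illustrative, and the SDK interface has changed between major versions.

```python
from label_studio_sdk import Client

# Connect to a self-hosted Label Studio instance (placeholder URL and key).
ls = Client(url="http://labelstudio.internal:8080", api_key="YOUR_API_KEY")

# The annotation interface is defined as an XML labeling config.
label_config = """
<View>
  <Text name="text" value="$text"/>
  <Choices name="category" toName="text" choice="single">
    <Choice value="Contract"/>
    <Choice value="Invoice"/>
    <Choice value="Correspondence"/>
  </Choices>
</View>
"""

project = ls.start_project(title="Document triage", label_config=label_config)

# Tasks are imported as JSON whose fields must match the config ($text above).
project.import_tasks([{"text": "Parsed text from an upstream Docling step..."}])
```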

    Data quality: Cleanlab
    A Python library for identifying labeling errors and quality issues in datasets. Implements confident learning algorithms that can detect label inconsistencies at scale. Requires Python proficiency to operate — there is no GUI, no dashboard, no point-and-click workflow. The output is typically a dataframe of flagged examples that an engineer reviews and acts on.
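
    To make "requires Python proficiency" concrete: a typical Cleanlab run assumes you already have out-of-sample predicted probabilities from a model you trained yourself. A minimal sketch, with placeholder array files, looks roughly like this.

```python
import numpy as np
from cleanlab.filter import find_label_issues

# labels: integer class labels from annotators, shape (n_examples,)
# pred_probs: out-of-sample predicted probabilities from your own model,
#             shape (n_examples, n_classes), produced by separate training code.
labels = np.load("labels.npy")          # placeholder file
pred_probs = np.load("pred_probs.npy")  # placeholder file

issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)

# The output is a ranked array of suspect example indices; an engineer still
# has to map them back to documents and decide what to do with each one.
print(f"{len(issue_indices)} potential label errors flagged")
```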

    Synthetic data generation: Distilabel (Argilla)
    A pipeline-oriented framework for generating synthetic training data using language models. Designed for ML engineers comfortable with writing pipeline configurations in Python. No GUI. Outputs to standard formats but requires custom configuration for each use case.
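
    The "pipeline configurations in Python" point is worth seeing once. A minimal Distilabel pipeline, sketched against the 1.x interface (module paths and class names have shifted between releases, so treat this as illustrative rather than exact):

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

# Two steps: load seed instructions, then generate responses with an LLM.
with Pipeline(name="synthetic-qa") as pipeline:
    load_seeds = LoadDataFromDicts(
        data=[{"instruction": "Summarize the termination clause in plain language."}]
    )
    generate = TextGeneration(llm=OpenAILLM(model="gpt-4o-mini"))  # illustrative model
    load_seeds >> generate

if __name__ == "__main__":
    distiset = pipeline.run()  # a dataset-like object you still have to export downstream
```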

    Computer vision annotation: CVAT
    For teams working with images or video, CVAT handles annotation workflows that Label Studio does not cover as well. Adds another tool with its own deployment, its own user management, its own data format.

    This is five tools. Some teams add more — a sixth for format conversion, a seventh for data versioning. And crucially, none of them were designed to work together. Each produces output in its own format. Each requires its own setup. None shares state, schema, or audit trail with the others.

    What Each Tool Does Well

    It is worth being honest about the individual tools because the critique is specifically about integration, not capability.

    Docling is genuinely excellent for parsing complex scientific and technical PDFs. The IBM Research team has invested serious engineering effort in table detection, layout analysis, and format conversion. For teams parsing academic papers or structured technical reports, it performs very well.

    Label Studio's annotation configuration system is flexible and expressive. You can build annotation interfaces for unusual task types without much difficulty. The open-source community has contributed a large library of example configurations. If you have an ML engineer who can configure and maintain it, it is a capable platform.

    Cleanlab's confident learning algorithm is state-of-the-art for automated label error detection. In benchmark comparisons, it consistently identifies annotation mistakes that human reviewers miss. For teams with Python expertise and clean data pipelines feeding into it, it adds real value.

    These are tools built by capable teams for real problems. The fragmentation cost is not about their individual quality. It is about what happens when you need all of them to function as a coherent system.

    The Integration Problem

    When you connect these tools, you immediately encounter a set of problems that none of them solve individually.

    No shared data format. Docling outputs Markdown or JSON. Label Studio works with its own annotation JSON schema. Cleanlab expects a NumPy array of labels or a pandas DataFrame. Distilabel has its own pipeline format. Moving data between any two of these tools requires a conversion step — either a script you write and maintain, or a manual export-and-import cycle.

    This conversion code is not complicated to write. It is complicated to maintain. Every time a tool updates its output schema, your conversion code may break silently. Every time you change your annotation schema in Label Studio, you need to update the scripts that feed Cleanlab. Every new file format you need to support requires updates to the Docling parsing step and potentially to every downstream conversion.
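
    As an example of the glue code that accumulates, here is a sketch of a converter from Docling's Markdown output into Label Studio's task import format. The directory layout and field names are illustrative and match the labeling config sketched earlier.

```python
import json
from pathlib import Path

def docling_markdown_to_labelstudio_tasks(markdown_dir: str, out_path: str) -> None:
    """Wrap each parsed document as a Label Studio task: {"data": {"text": ...}}."""
    tasks = []
    for md_file in sorted(Path(markdown_dir).glob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        tasks.append({
            "data": {
                "text": text,
                "source_file": md_file.name,  # keep what provenance we can
            }
        })
    Path(out_path).write_text(json.dumps(tasks, indent=2), encoding="utf-8")

# Each script like this is more unowned infrastructure: it breaks quietly when
# Docling changes its output layout or the Label Studio config field is renamed.
docling_markdown_to_labelstudio_tasks("parsed_docs/", "labelstudio_tasks.json")
```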

    No shared audit trail. If a compliance auditor asks you to demonstrate that a specific training example was derived from a specific source document, reviewed by a specific annotator, and passed a specific quality threshold before being included in the training set — you cannot answer that question with a unified report. You have to reconstruct the answer from logs in five separate systems, assuming the logs are detailed enough.

    This is not a hypothetical. HIPAA audits, GDPR compliance reviews, EU AI Act Article 10 obligations, and internal information security audits all require data provenance documentation. The fragmented stack makes this expensive to produce and impossible to produce in real time.

    No shared schema. When your labeling schema changes — a category is renamed, a new entity type is added, a classification is split into two — you need to make that change in Label Studio's annotation interface, update the Cleanlab quality checks that depend on the schema, update any Distilabel prompts that reference category names, and update the export scripts that map label values to training format. A schema change that should take an afternoon takes a week.

    Dependency management across tools. Each tool has its own dependency chain. Docling, Cleanlab, and Distilabel are all Python libraries with their own sets of dependencies. They may require different Python versions, different versions of shared dependencies, or conflicting transitive requirements. Managing this in a shared environment is a known pain point — the standard answer is separate virtual environments or containers, which adds operational overhead.

    The Hidden Costs

    Let us try to make the cost concrete. These are estimates based on conversations with ML teams, not invoices — but they are grounded in real patterns.

    Initial setup cost: Getting a five-tool stack configured, connected, and producing usable output for a new project typically takes one to three weeks of senior ML engineer time. This includes deploying Label Studio, writing initial parsing scripts for Docling, configuring the quality pipeline in Cleanlab, and writing the glue code that connects them. At a fully-loaded cost of $150-200/hour for a senior ML engineer, that is roughly $6,000-$24,000 before a single example has been labeled.

    Ongoing maintenance cost: Once the stack is running, it requires ongoing maintenance. Tool updates need to be evaluated, glue code needs to be updated when schemas change, deployment issues need to be debugged. Based on team reports, this runs 4-8 hours per week for a moderately active data preparation workflow. That is $30,000-$60,000 per year in senior engineering time spent on plumbing.

    Debugging cost: When the output of your trained model is unexpectedly poor and you need to trace the issue back to a data problem, debugging across five tool boundaries is significantly harder than debugging within a single system. Teams report spending days on what should be hours-long investigations. A single data quality incident can cost 20-40 hours of engineering time to root-cause.

    Compliance documentation cost: If your organization needs to produce data lineage documentation for a regulatory audit, assembling that documentation from logs across five separate systems can take weeks. We have heard from teams that had to dedicate a full engineer-month to producing compliance documentation for a single audit.

    Domain expert lockout cost: Because every tool in this stack requires ML engineering to configure and operate, domain experts cannot participate directly in the annotation process without significant support. This means ML engineers spend time on annotation work they are not best qualified to do, and annotation quality suffers because the people with domain knowledge are not in the loop. This cost is real but harder to quantify — it shows up as additional annotation iterations, lower label quality, and slower model convergence.

    When the Fragmented Stack Is Acceptable

    The fragmented stack is not always the wrong choice. There are scenarios where it makes sense.

    If your team has dedicated ML engineering capacity that can absorb the integration overhead, the individual tools are capable and the cost is manageable. Research teams and large enterprise ML platforms with five or more dedicated engineers often run these stacks successfully.

    If your data preparation needs are stable — the same file formats, the same annotation schema, the same quality requirements — the integration overhead is a one-time cost rather than a recurring one. Stable workflows amortize the initial setup cost across many projects.

    If compliance requirements are not strict — cloud tools are permissible, audit trail documentation is not required — many of the compliance-specific costs disappear. The integration cost remains, but it is lower.

    If domain expert involvement is not needed — your annotation tasks can be handled by ML engineers or crowdsourced annotators — the domain expert lockout cost is less relevant.

    When It Becomes a Liability

    The fragmented stack becomes a genuine liability when:

    • Your document archive spans multiple file formats with different parsing requirements
    • Your annotation schema evolves as you learn more about the task
    • Compliance requires unified data lineage across the full pipeline
    • Domain experts need to be involved in annotation without ML engineering support
    • Your team's ML engineering capacity is limited and needs to be spent on model development, not data plumbing
    • You are operating in a regulated environment where cloud tooling is not permissible

    These are not edge cases. They describe the majority of enterprise AI deployments in regulated industries. For these teams, the fragmented stack is not just inconvenient — it is actively blocking progress.

    The CTO at one on-device AI company described the expectation precisely:

    "Making the data cleanup process significantly easier, even if only 80% automated, would be a huge mover."

    The "80% automated" framing is significant. Teams are not asking for magic. They are asking to not spend 40% of their ML engineering capacity on maintaining the connections between tools that should, by now, come connected.


    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.

