
    Docling + Label Studio + Cleanlab: The Hidden Integration Tax

    What it actually takes to stitch together Docling, Label Studio, and Cleanlab into a working data preparation pipeline — format conversion, audit trail gaps, and the custom scripts nobody wants to maintain.

    Ertas Team

    Docling for document parsing. Label Studio for annotation. Cleanlab for quality scoring. Each is excellent at what it does. Together, they form a common open-source data preparation stack.

    The problem isn't any individual tool — it's the integration between them. The format conversions, shared state management, audit trail gaps, and custom Python scripts required to make them work together represent a hidden tax that grows with every project.

    The Stack in Theory

    The appeal is straightforward:

    Docling (IBM Research): Parses PDFs, Word documents, and other formats into structured output. Handles tables, layout detection, and OCR. Open-source, well-maintained, 97.9% table extraction accuracy.

    Label Studio (HumanSignal): Annotation platform supporting text, images, audio, and video. Web-based interface, customizable labeling schemas, team management. Open-source with an enterprise tier.

    Cleanlab: Data quality scoring and label error detection. Identifies mislabeled examples, measures data quality, suggests corrections. Python library.

    In theory: parse with Docling → label with Label Studio → quality-check with Cleanlab → export.

    In practice, each arrow (→) represents days of engineering work.
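    On paper, the whole flow is a few lines of orchestration. In the sketch below, only Docling's DocumentConverter is a real API; every other function is a hypothetical placeholder for the glue code that none of the three tools ships:

```python
from docling.document_converter import DocumentConverter

# Placeholders for the glue code you have to write yourself.
def to_label_studio_tasks(parsed_docs): raise NotImplementedError("format conversion: yours")
def collect_annotations(tasks): raise NotImplementedError("Label Studio round-trip: yours")
def to_cleanlab_inputs(labelled): raise NotImplementedError("export + transform: yours")
def score_quality(features, labels): raise NotImplementedError("Cleanlab wiring: yours")
def export_dataset(labelled, issues): raise NotImplementedError("corrections + export: yours")

def prepare_dataset(pdf_paths):
    converter = DocumentConverter()                               # Docling: real API
    parsed = [converter.convert(p).document for p in pdf_paths]   # parse to DoclingDocument
    tasks = to_label_studio_tasks(parsed)                         # arrow 1
    labels = collect_annotations(tasks)                           # human annotation
    features, y = to_cleanlab_inputs(labels)                      # arrow 2
    issues = score_quality(features, y)                           # quality scoring
    return export_dataset(labels, issues)                         # arrow 3
```

    Every arrow in the diagram is a NotImplementedError in the code. The rest of this post is about what it takes to fill those in.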

    The Integration Points

    Docling → Label Studio

    Docling outputs structured documents in its own format (DoclingDocument). Label Studio expects its own import format: JSON with specific field mappings, or plain text/HTML.

    What you need to build:

    • A converter that transforms Docling's output into Label Studio's import format (sketched below)
    • Handling for different content types (extracted text, tables, images) — each needs different Label Studio template configuration
    • Metadata preservation — Docling's extraction confidence, page numbers, and source file references need to be carried through to Label Studio so annotators have context
    • Batch import logic for processing thousands of documents
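
    A minimal sketch of that converter, assuming a plain text labeling template that reads from data["text"]; the source_file field and output path are illustrative and must match whatever labeling config you actually use:

```python
# Sketch: convert Docling output into Label Studio's JSON task import format.
import json
from pathlib import Path
from docling.document_converter import DocumentConverter

def docling_to_label_studio(paths, out_file="tasks.json"):
    converter = DocumentConverter()
    tasks = []
    for path in paths:
        result = converter.convert(path)
        tasks.append({
            "data": {
                "text": result.document.export_to_markdown(),  # flattens tables and layout
                "source_file": Path(path).name,                # context for annotators
            }
        })
    Path(out_file).write_text(json.dumps(tasks, ensure_ascii=False, indent=2))
    return tasks
```

    Even this sketch already drops extraction confidence and page numbers, and tables arrive as flattened Markdown: exactly the kind of loss listed under "What goes wrong" below.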

    What goes wrong:

    • Docling updates change the output schema — your converter breaks
    • Rich formatting (tables, lists, nested structures) gets flattened during conversion
    • Large documents exceed Label Studio's recommended task size — you need custom chunking logic (see the sketch after this list)
    • Source file references (page 3 of document X) are lost during conversion, making it hard for annotators to verify extractions
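
    A hypothetical chunker for the oversized-document case: split on a character budget and keep the source reference on every piece so annotators can still trace text back to the original file. The 10,000-character budget is an arbitrary placeholder, not a Label Studio constant.

```python
# Sketch: split long extracted text into Label Studio-sized tasks while
# preserving the source reference on every chunk.
def chunk_for_label_studio(text, source_file, max_chars=10_000):
    tasks, buf, chunk_index = [], [], 0
    for para in text.split("\n\n"):
        # flush the buffer when adding the next paragraph would exceed the budget
        if buf and sum(len(p) for p in buf) + len(para) > max_chars:
            tasks.append({"data": {"text": "\n\n".join(buf),
                                   "source_file": source_file,
                                   "chunk_index": chunk_index}})
            chunk_index += 1
            buf = []
        buf.append(para)
    if buf:
        tasks.append({"data": {"text": "\n\n".join(buf),
                               "source_file": source_file,
                               "chunk_index": chunk_index}})
    return tasks
```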

    Label Studio → Cleanlab

    Label Studio exports annotations in JSON format. Cleanlab expects a pandas DataFrame or numpy arrays with features and labels.

    What you need to build:

    • An export pipeline that pulls completed annotations from Label Studio (via API or file export)
    • A transformer that converts Label Studio's annotation format into Cleanlab's expected input (see the sketch after this list)
    • Handling for partial annotations (not all documents may be labeled yet)
    • Logic to map Label Studio's potentially complex annotation structures (nested labels, relationships) to Cleanlab's flat label format
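
    A minimal sketch of that transformer, assuming a single-label classification template (a Choices tag) and one annotation per task; multi-annotator resolution, nested labels, and other template types all need extra handling:

```python
# Sketch: flatten a Label Studio JSON export into the label array Cleanlab expects.
import json
import numpy as np
import pandas as pd

def label_studio_export_to_frame(export_path):
    with open(export_path) as f:
        tasks = json.load(f)
    rows = []
    for task in tasks:
        annotations = task.get("annotations", [])
        if not annotations:                      # partially labeled project: skip for now
            continue
        choice = annotations[0]["result"][0]["value"]["choices"][0]
        rows.append({
            "task_id": task["id"],               # keep the Label Studio id for mapping scores back
            "text": task["data"]["text"],
            "label": choice,
        })
    return pd.DataFrame(rows)

df = label_studio_export_to_frame("export.json")
classes = sorted(df["label"].unique())
labels = np.array([classes.index(lbl) for lbl in df["label"]])
# `labels`, together with predicted probabilities from a model you train yourself,
# is what cleanlab.filter.find_label_issues(labels=..., pred_probs=...) consumes.
```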

    What goes wrong:

    • Label Studio's export format varies based on the annotation template used
    • Multi-annotator scenarios (multiple people labeling the same document) need to be resolved before Cleanlab can process them
    • Cleanlab's quality scores need to be mapped back to specific Label Studio tasks for review — this requires maintaining a mapping table

    Cleanlab → Corrections Workflow

    Cleanlab identifies potential label errors and quality issues. But the corrections need to happen in Label Studio.

    What you need to build:

    • A pipeline that takes Cleanlab's flagged items and creates review tasks in Label Studio (sketched below)
    • Logic to prioritize which flagged items need human review (not all low-confidence items are actually wrong)
    • A feedback loop that re-runs Cleanlab after corrections to verify improvement
    • Tracking of which items have been reviewed vs. pending
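
    A sketch of that first step, pushing flagged rows into a dedicated review project through Label Studio's task import endpoint. The URL, token, and project id are placeholders, and flagged_df is assumed to already carry Cleanlab's suggested labels:

```python
# Sketch: turn Cleanlab-flagged rows into review tasks in Label Studio via its REST API.
import requests

LS_URL = "http://localhost:8080"               # your Label Studio instance
API_KEY = "your-label-studio-token"            # a Label Studio access token
REVIEW_PROJECT_ID = 42                         # hypothetical review project

def create_review_tasks(flagged_df):
    # flagged_df: rows Cleanlab marked as likely errors, with columns
    # task_id, text, label, suggested_label (names assumed, not standard).
    tasks = [{
        "data": {
            "text": row.text,
            "original_task_id": int(row.task_id),      # the mapping table, in miniature
            "current_label": row.label,
            "cleanlab_suggestion": row.suggested_label,
        }
    } for row in flagged_df.itertuples()]
    resp = requests.post(
        f"{LS_URL}/api/projects/{REVIEW_PROJECT_ID}/import",
        headers={"Authorization": f"Token {API_KEY}"},
        json=tasks,
    )
    resp.raise_for_status()
    return resp.json()
```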

    What goes wrong:

    • The round-trip (export from Label Studio → analyze in Cleanlab → re-import to Label Studio for correction → re-export → re-analyze) involves 4+ data transformations, each a potential point of failure
    • Version tracking is manual — which version of the labels was Cleanlab run on? Are the current labels in Label Studio the corrected ones or the originals?

    The Audit Trail Gap

    This is the most consequential integration problem, especially for regulated industries.

    Each tool maintains its own logs:

    • Docling: Logs parsing events and extraction quality
    • Label Studio: Logs annotation events and user actions
    • Cleanlab: Logs quality analysis results

    But no tool logs what happens between tools:

    • When was Docling's output converted for Label Studio?
    • Which version of the conversion script was used?
    • Were any records dropped during format conversion?
    • When were Cleanlab's corrections applied back to Label Studio?
    • Who approved the final dataset for export?

    These cross-tool events are where audit trails break. And under the EU AI Act, HIPAA, or GDPR, these gaps can constitute compliance violations.

    Building a unified audit trail across three tools requires:

    • A custom logging framework that wraps every inter-tool operation (a minimal sketch follows this list)
    • Timestamp synchronization across tools
    • Record-level tracking (mapping IDs across tools)
    • An aggregation layer that presents a unified lineage view
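
    A minimal sketch of that wrapping layer, assuming each inter-tool step takes and returns a list of records; the field names and the JSONL sink are illustrative, not a standard:

```python
# Sketch: wrap each inter-tool operation so timestamps, the glue-script version,
# and record counts in/out land in one append-only JSONL audit log.
import functools
import json
from datetime import datetime, timezone

AUDIT_LOG = "pipeline_audit.jsonl"
PIPELINE_VERSION = "2025.01.0"                 # version of your conversion scripts

def audited_step(step_name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(records, *args, **kwargs):
            started = datetime.now(timezone.utc).isoformat()
            out = fn(records, *args, **kwargs)
            entry = {
                "step": step_name,
                "pipeline_version": PIPELINE_VERSION,
                "started_at": started,
                "finished_at": datetime.now(timezone.utc).isoformat(),
                "records_in": len(records),
                "records_out": len(out),        # surfaces silent drops during conversion
            }
            with open(AUDIT_LOG, "a") as f:
                f.write(json.dumps(entry) + "\n")
            return out
        return wrapper
    return decorator

@audited_step("docling_to_label_studio")
def convert_for_labeling(records):
    ...  # your converter from the first integration point
```

    This covers only the first requirement above; ID mapping, timestamp alignment, and the aggregation view are additional layers on top of it.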

    This is ~2-4 weeks of engineering work and ongoing maintenance as tools update.

    The Maintenance Burden

    Each tool updates independently:

    • Docling releases a new version → test converter compatibility → update if needed
    • Label Studio updates → test export pipeline → test import pipeline → update if needed
    • Cleanlab updates → test data transformation → update if needed

    On average, expect 2-3 breaking changes per year across the three tools. Each takes 1-3 days to diagnose and fix.

    The custom integration code (converters, transformers, audit logging, batch processing) also needs maintenance:

    • Bug fixes as edge cases are discovered
    • Performance optimization as data volumes grow
    • Documentation updates (if documentation exists)

    Total ongoing maintenance: 4-8 weeks/year of engineering time.

    The Alternative

    The integration tax exists because these tools were designed independently. Each is excellent at its specific function but not designed to work with the others.

    A unified platform that handles all three functions — parsing, annotation, and quality scoring — in a single system eliminates the integration tax entirely. No format conversion between stages. No cross-tool audit trail gaps. No converter scripts to maintain.

    Ertas Data Suite takes this approach: Ingest, Clean, Label, Augment, and Export all run in the same application, sharing the same data model and audit infrastructure. The result is zero integration code, continuous lineage, and domain expert access without Docker or Python.

    The individual tools in the stack are excellent. The tax is in the "+" signs between them.

