
    Prodigy + Docling + Custom Scripts: A Real Enterprise Stack Audit

    Walking through what a typical enterprise data preparation stack looks like in practice — Prodigy for annotation, Docling for parsing, custom scripts for everything else — and identifying the friction points.

    Ertas Team

    What does a real enterprise AI data preparation stack look like? Not the diagram on the architecture slide — the actual day-to-day reality of tools, scripts, and workarounds that an ML team operates.

    This is an audit of a representative stack: Prodigy for annotation, Docling for document parsing, and custom Python scripts for everything in between. Each tool is well-regarded in its category. The friction is in the gaps.

    The Stack

    Prodigy (Explosion AI) — $390-$10,000/year

    Prodigy is arguably the best annotation tool for NLP tasks. It's fast, scriptable, runs locally (important for sensitive data), and supports active learning. It's the tool that ML engineers who've used everything else usually prefer.

    What it does well:

    • Extremely efficient annotation interface (designed for speed)
    • Runs entirely locally — no cloud dependency, no Docker required
    • Active learning: suggests labels, learns from corrections
    • Python API for customization (see the recipe sketch after this list)
    • Supports NLP (NER, text classification, spans) and CV tasks
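
    Because Prodigy is a licensed Python package, customization happens in code. As a flavor of what that looks like, here is a minimal custom-recipe sketch; the recipe name, labels, and file paths are hypothetical, and the loader and decorator usage follow Prodigy's documented v1.x recipe API:

    import prodigy
    from prodigy.components.loaders import JSONL

    @prodigy.recipe(
        "contracts.ner",
        dataset=("Dataset to save annotations into", "positional", None, str),
        source=("Path to a JSONL file with a 'text' field", "positional", None, str),
    )
    def contracts_ner(dataset, source):
        # Stream tasks from disk; Prodigy serves them to the annotator one at a time.
        stream = JSONL(source)
        return {
            "dataset": dataset,
            "stream": stream,
            "view_id": "ner_manual",
            "config": {"labels": ["PERSON", "ORG", "DATE"]},
        }

    Saved as recipe.py, this would be launched with: prodigy contracts.ner my_dataset ./data.jsonl -F recipe.py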

    What it doesn't do:

    • No document parsing — expects text input, not PDFs
    • No data cleaning or quality scoring
    • No audit trail for compliance (designed for productivity, not governance)
    • Single-user focused — team features require custom orchestration
    • No multi-format export (annotations come out as Prodigy's own JSONL via db-out; converting to training formats is custom work, as the sketch after this list shows)
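
    Getting annotations back out is one of the custom-script jobs listed below: Prodigy's db-out command dumps a dataset as JSONL, and turning that into a training format is glue code. A minimal sketch, assuming an NER dataset and hypothetical file names:

    import json

    def load_ner_examples(path):
        """Convert Prodigy db-out JSONL into (text, entities) training tuples."""
        examples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record.get("answer") != "accept":
                    continue  # drop rejected and ignored tasks
                entities = [
                    (span["start"], span["end"], span["label"])
                    for span in record.get("spans", [])
                ]
                examples.append((record["text"], entities))
        return examples

    # First: prodigy db-out my_dataset > annotations.jsonl
    examples = load_ner_examples("annotations.jsonl")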

    Docling (IBM Research) — Free/Open-Source

    Docling is a strong document parser. It handles PDFs, Word documents, and other formats with good table extraction and layout detection.

    What it does well:

    • 97.9% table extraction accuracy (competitive with commercial tools)
    • Layout-aware parsing (headings, paragraphs, lists, tables)
    • Multiple output formats (Markdown, JSON, text)
    • Open-source, actively maintained by IBM Research

    What it doesn't do:

    • No labeling capability
    • No data cleaning, deduplication, or quality scoring
    • No PII detection or redaction
    • No audit trail
    • No GUI (command-line and Python library only)

    Custom Python Scripts — "Free"

    Everything between Docling and Prodigy — and everything after Prodigy — is custom code:

    • docling_to_prodigy.py — converts Docling output to Prodigy's input format
    • clean_extracted_text.py — deduplication, quality filtering, normalization
    • pii_detection.py — regex and NER-based PII detection
    • prodigy_export.py — exports Prodigy annotations to training format
    • quality_check.py — inter-annotator agreement, label distribution analysis
    • prepare_training_data.py — final formatting for model training

    Total: ~3,000-5,000 lines of Python across 8-12 scripts

    The Friction Points

    Friction Point 1: Docling → Prodigy Format Conversion

    Docling outputs documents as structured objects with sections, tables, and metadata. Prodigy expects a stream of records in JSONL format with a text field.

    The conversion script must:

    • Flatten document structure into annotation-sized chunks
    • Decide on chunking strategy (by page? by section? by paragraph?)
    • Preserve metadata (source file, page number, section) as Prodigy meta fields
    • Handle tables (convert to text? markdown? skip?)
    • Handle multi-page documents (one Prodigy task per page, or merge?)

    The decisions in this converter are not technical — they're domain-specific. Whether to chunk by section or paragraph affects annotation quality. Whether to include tables affects model coverage. These decisions should be made by domain experts, but they're encoded in a Python script maintained by an ML engineer.
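
    To make the gap concrete, here is a minimal sketch of such a converter, using Docling's documented Python API and a naive paragraph-level chunking strategy; the file names are hypothetical, and a real converter would handle tables and page boundaries explicitly:

    import json
    from pathlib import Path

    from docling.document_converter import DocumentConverter

    def convert_to_prodigy_jsonl(pdf_path: str, out_path: str) -> None:
        converter = DocumentConverter()
        result = converter.convert(pdf_path)
        markdown = result.document.export_to_markdown()

        # Naive chunking: split on blank lines, drop empty chunks.
        chunks = [c.strip() for c in markdown.split("\n\n") if c.strip()]

        with open(out_path, "w", encoding="utf-8") as f:
            for i, chunk in enumerate(chunks):
                record = {
                    "text": chunk,  # the field Prodigy expects
                    "meta": {"source": Path(pdf_path).name, "chunk": i},
                }
                f.write(json.dumps(record, ensure_ascii=False) + "\n")

    convert_to_prodigy_jsonl("contract.pdf", "contract.jsonl")

    Every hard-coded choice in this sketch (paragraph chunks, markdown tables, flat metadata) is one of the domain decisions described above.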

    Friction Point 2: Manual Quality Pipeline

    Between Docling extraction and Prodigy annotation, the data needs cleaning:

    • Deduplication (same document in multiple folders)
    • Quality filtering (OCR confidence below threshold → flag or exclude)
    • PII detection and redaction (before annotators see the data)
    • Normalization (encoding issues, whitespace, special characters)

    This is 1,000-2,000 lines of custom Python that nobody wants to write, nobody wants to maintain, and nobody has tested comprehensively.
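
    As a rough illustration of one slice of that code, here is a minimal sketch of the deduplication and normalization step, with hypothetical file names; PII detection and OCR-confidence filtering would sit alongside it:

    import hashlib
    import json
    import unicodedata

    def normalize(text: str) -> str:
        # Repair common encoding and whitespace issues before hashing.
        text = unicodedata.normalize("NFKC", text)
        return " ".join(text.split())

    def deduplicate(records):
        seen = set()
        for record in records:
            record["text"] = normalize(record["text"])
            digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # exact duplicate, e.g. the same file in two folders
            seen.add(digest)
            yield record

    with open("contract.jsonl", encoding="utf-8") as src, \
         open("contract.clean.jsonl", "w", encoding="utf-8") as dst:
        for record in deduplicate(json.loads(line) for line in src):
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")

    Note that this only catches exact duplicates after normalization; near-duplicate detection needs fuzzier hashing, which is where the untested edge cases accumulate.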

    Friction Point 3: Audit Trail Gaps

    For regulated industries, the audit trail looks like this:

    • Docling: Logs parsing events (if logging is configured)
    • Custom scripts: Log whatever the developer remembered to log (usually: nothing useful)
    • Prodigy: Logs annotation events with timestamps and session IDs

    What's missing:

    • When was the format conversion run? By whom?
    • What was the PII detection configuration? What was redacted?
    • Which version of each script was used?
    • How were quality thresholds set? Who approved them?

    These gaps are compliance risks under the EU AI Act and similar regulations.
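
    Teams that plug these gaps by hand usually do it with a run manifest written next to every output. A hypothetical sketch, assuming the scripts live in a git repository:

    import getpass
    import json
    import subprocess
    import sys
    from datetime import datetime, timezone

    def write_manifest(step: str, config: dict, out_path: str) -> None:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip()
        manifest = {
            "step": step,
            "run_at": datetime.now(timezone.utc).isoformat(),
            "run_by": getpass.getuser(),  # answers "by whom?"
            "git_commit": commit,         # answers "which version?"
            "config": config,             # answers "what configuration?"
            "python": sys.version,
        }
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(manifest, f, indent=2)

    write_manifest("pii_detection", {"confidence_threshold": 0.8}, "pii_run.manifest.json")

    Even then, "who approved the thresholds" lives outside the code, which is exactly the governance gap at issue here.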

    Friction Point 4: The Bus Factor

    In most enterprises using this stack, one ML engineer understands the full pipeline. They wrote the scripts, configured the tools, and handle the edge cases that arise during processing.

    If that person leaves:

    • The custom scripts have minimal documentation
    • The Prodigy configuration has undocumented conventions
    • The edge case handling is tribal knowledge
    • The next engineer needs 4-8 weeks to understand the pipeline

    This isn't a flaw of Prodigy or Docling — they're individual tools with good documentation. The bus factor risk is in the custom integration layer that connects them.

    Friction Point 5: Domain Expert Exclusion

    Prodigy is excellent for ML engineers. It's a Python-first tool with a command-line interface:

    prodigy ner.manual my_dataset blank:en ./data.jsonl --label PERSON,ORG,DATE
    

    A lawyer or doctor who needs to label domain-specific data cannot use this without an ML engineer setting up and running the session. This creates a dependency that bottlenecks labeling throughput.

    What a Unified Platform Changes

    The friction points above aren't caused by bad tools — they're caused by tool boundaries. Each tool is individually strong but not designed to work with the others.

    A unified platform like Ertas Data Suite eliminates these boundaries:

    • Document parsing feeds directly into cleaning (no format conversion)
    • Cleaning feeds directly into labeling (no custom scripts)
    • Labeling includes quality review (no separate quality pipeline)
    • Export generates compliance documentation (no audit trail gaps)
    • Domain experts use the same interface as ML engineers (no accessibility barrier)

    The trade-off: you lose Prodigy's best-in-class annotation speed and Docling's best-in-class table extraction. You gain pipeline continuity, audit trail completeness, and domain expert accessibility.

    For enterprise production pipelines in regulated industries, the pipeline-level benefits typically outweigh the tool-level trade-offs. For research and experimentation, the individual tools may remain the better choice.

    The stack is good. The gaps between the tools are where the cost lives.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
