
Prodigy + Docling + Custom Scripts: A Real Enterprise Stack Audit
Walking through what a typical enterprise data preparation stack looks like in practice — Prodigy for annotation, Docling for parsing, custom scripts for everything else — and identifying the friction points.
What does a real enterprise AI data preparation stack look like? Not the diagram on the architecture slide — the actual day-to-day reality of tools, scripts, and workarounds that an ML team operates.
This is an audit of a representative stack: Prodigy for annotation, Docling for document parsing, and custom Python scripts for everything in between. Each tool is well-regarded in its category. The friction is in the gaps.
The Stack
Prodigy (Explosion AI) — $390-$10,000/year
Prodigy is arguably the best annotation tool for NLP tasks. It's fast, scriptable, runs locally (important for sensitive data), and supports active learning. It's the tool that ML engineers who've used everything else usually prefer.
What it does well:
- Extremely efficient annotation interface (designed for speed)
- Runs entirely locally — no cloud dependency, no Docker required
- Active learning: suggests labels, learns from corrections
- Python API for customization
- Supports NLP (NER, text classification, spans) and CV tasks
What it doesn't do:
- No document parsing — expects text input, not PDFs
- No data cleaning or quality scoring
- No audit trail for compliance (designed for productivity, not governance)
- Single-user focused — team features require custom orchestration
- No multi-format export (outputs Prodigy's internal format)
Docling (IBM Research) — Free/Open-Source
Docling is a strong document parser. It handles PDFs, Word documents, and other formats with good table extraction and layout detection.
What it does well:
- 97.9% table extraction accuracy (competitive with commercial tools)
- Layout-aware parsing (headings, paragraphs, lists, tables)
- Multiple output formats (Markdown, JSON, text)
- Open-source, actively maintained by IBM Research
What it doesn't do:
- No labeling capability
- No data cleaning, deduplication, or quality scoring
- No PII detection or redaction
- No audit trail
- No GUI — command-line interface only
Custom Python Scripts — "Free"
Everything between Docling and Prodigy — and everything after Prodigy — is custom code:
- docling_to_prodigy.py — converts Docling output to Prodigy's input format
- clean_extracted_text.py — deduplication, quality filtering, normalization
- pii_detection.py — regex and NER-based PII detection
- prodigy_export.py — exports Prodigy annotations to training format
- quality_check.py — inter-annotator agreement, label distribution analysis
- prepare_training_data.py — final formatting for model training
Total: ~3,000-5,000 lines of Python across 8-12 scripts
The Friction Points
Friction Point 1: Docling → Prodigy Format Conversion
Docling outputs documents as structured objects with sections, tables, and metadata. Prodigy expects a stream of records in JSONL format with a text field.
The conversion script must:
- Flatten document structure into annotation-sized chunks
- Decide on chunking strategy (by page? by section? by paragraph?)
- Preserve metadata (source file, page number, section) as Prodigy meta fields
- Handle tables (convert to text? markdown? skip?)
- Handle multi-page documents (one Prodigy task per page, or merge?)
The decisions in this converter are not technical — they're domain-specific. Whether to chunk by section or paragraph affects annotation quality. Whether to include tables affects model coverage. These decisions should be made by domain experts, but they're encoded in a Python script maintained by an ML engineer.
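A minimal sketch of such a converter, assuming the Docling output has already been loaded into a plain dict — the section/paragraph field names here are illustrative, not Docling's actual schema, and the paragraph-level chunking is exactly the kind of domain decision discussed above:

```python
import json

def docling_to_prodigy_records(doc: dict):
    """Flatten a parsed document into Prodigy-style JSONL records.

    Chunks by paragraph; chunking by section or page would change
    annotation granularity. Field names are illustrative.
    """
    for section in doc.get("sections", []):
        for para in section.get("paragraphs", []):
            text = para.get("text", "").strip()
            if not text:
                continue  # skip empty extraction artifacts
            yield {
                "text": text,
                "meta": {
                    "source": doc.get("source_file"),
                    "page": para.get("page"),
                    "section": section.get("heading"),
                },
            }

# Hypothetical parsed document for demonstration:
doc = {
    "source_file": "report.pdf",
    "sections": [
        {"heading": "Intro",
         "paragraphs": [{"text": "Hello world.", "page": 1}]},
    ],
}
lines = [json.dumps(r) for r in docling_to_prodigy_records(doc)]
```

Prodigy consumes one JSON object per line with a `text` field; anything placed under `meta` is carried through to the annotation output, which is how source provenance survives the conversion.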
Friction Point 2: Manual Quality Pipeline
Between Docling extraction and Prodigy annotation, the data needs cleaning:
- Deduplication (same document in multiple folders)
- Quality filtering (OCR confidence below threshold → flag or exclude)
- PII detection and redaction (before annotators see the data)
- Normalization (encoding issues, whitespace, special characters)
This is 1,000-2,000 lines of custom Python that nobody wants to write, nobody wants to maintain, and nobody has tested comprehensively.
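The deduplication and quality-filtering core of that pipeline can be sketched in a few lines — this is a simplified illustration, and the 0.8 OCR-confidence threshold and the `ocr_confidence` field name are assumptions, not recommendations:

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    # Repair encoding variants and collapse whitespace before hashing.
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

def clean_records(records, min_ocr_confidence=0.8):
    """Drop low-confidence OCR output and exact duplicates (post-normalization).

    Threshold and field names are illustrative; a real pipeline might
    flag low-confidence records for review instead of excluding them.
    """
    seen = set()
    for rec in records:
        if rec.get("ocr_confidence", 1.0) < min_ocr_confidence:
            continue
        text = normalize(rec["text"])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content found in another folder/file
        seen.add(digest)
        yield {**rec, "text": text}

records = [
    {"text": "Same  paragraph.", "ocr_confidence": 0.95},
    {"text": "Same paragraph.", "ocr_confidence": 0.99},  # dup after normalization
    {"text": "Blurry scan.", "ocr_confidence": 0.42},     # below threshold
]
cleaned = list(clean_records(records))
```

Note what this sketch omits: near-duplicate detection, PII redaction, and any record of what was dropped and why — each of which adds hundreds of lines in practice.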
Friction Point 3: Audit Trail Gaps
For regulated industries, the audit trail looks like this:
- Docling: Logs parsing events (if logging is configured)
- Custom scripts: Log whatever the developer remembered to log (usually: nothing useful)
- Prodigy: Logs annotation events with timestamps and session IDs
What's missing:
- When was the format conversion run? By whom?
- What was the PII detection configuration? What was redacted?
- Which version of each script was used?
- How were quality thresholds set? Who approved them?
These gaps are compliance risks under the EU AI Act and similar regulations.
Friction Point 4: The Bus Factor
In most enterprises using this stack, one ML engineer understands the full pipeline. They wrote the scripts, configured the tools, and handle the edge cases that arise during processing.
If that person leaves:
- The custom scripts have minimal documentation
- The Prodigy configuration has undocumented conventions
- The edge case handling is tribal knowledge
- The next engineer needs 4-8 weeks to understand the pipeline
This isn't a flaw of Prodigy or Docling — they're individual tools with good documentation. The bus factor risk is in the custom integration layer that connects them.
Friction Point 5: Domain Expert Exclusion
Prodigy is excellent for ML engineers. It's a Python-first tool with a command-line interface:
prodigy ner.manual my_dataset blank:en ./data.jsonl --label PERSON,ORG,DATE
A lawyer or doctor who needs to label domain-specific data cannot use this without an ML engineer setting up and running the session. This creates a dependency that bottlenecks labeling throughput.
What a Unified Platform Changes
The friction points above aren't caused by bad tools — they're caused by tool boundaries. Each tool is individually strong but not designed to work with the others.
A unified platform like Ertas Data Suite eliminates these boundaries:
- Document parsing feeds directly into cleaning (no format conversion)
- Cleaning feeds directly into labeling (no custom scripts)
- Labeling includes quality review (no separate quality pipeline)
- Export generates compliance documentation (no audit trail gaps)
- Domain experts use the same interface as ML engineers (no accessibility barrier)
The trade-off: you lose Prodigy's best-in-class annotation speed and Docling's best-in-class table extraction. You gain pipeline continuity, audit trail completeness, and domain expert accessibility.
For enterprise production pipelines in regulated industries, the pipeline-level benefits typically outweigh the tool-level trade-offs. For research and experimentation, the individual tools may remain the better choice.
The stack is good. The gaps between the tools are where the cost lives.