
    Prodigy + Docling + Custom Scripts: A Real Enterprise Stack Audit

    Walking through what a typical enterprise data preparation stack looks like in practice — Prodigy for annotation, Docling for parsing, custom scripts for everything else — and identifying the friction points.

    Ertas Team

    What does a real enterprise AI data preparation stack look like? Not the diagram on the architecture slide — the actual day-to-day reality of tools, scripts, and workarounds that an ML team operates.

    This is an audit of a representative stack: Prodigy for annotation, Docling for document parsing, and custom Python scripts for everything in between. Each tool is well-regarded in its category. The friction is in the gaps.

    The Stack

    Prodigy (Explosion AI) — $390-$10,000/year

    Prodigy is arguably the best annotation tool for NLP tasks. It's fast, scriptable, runs locally (important for sensitive data), and supports active learning. It's the tool that ML engineers who've used everything else usually prefer.

    What it does well:

    • Extremely efficient annotation interface (designed for speed)
    • Runs entirely locally — no cloud dependency, no Docker required
    • Active learning: suggests labels, learns from corrections
    • Python API for customization (see the recipe sketch after this list)
    • Supports NLP (NER, text classification, spans) and CV tasks
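
    Because Prodigy is a licensed Python package, customization happens in code. As a flavor of what that looks like, here is a minimal custom-recipe sketch; the recipe name, labels, and file paths are hypothetical, and the loader and decorator usage follow Prodigy's documented v1.x recipe API:

    import prodigy
    from prodigy.components.loaders import JSONL

    @prodigy.recipe(
        "contracts.ner",
        dataset=("Dataset to save annotations into", "positional", None, str),
        source=("Path to a JSONL file with a 'text' field", "positional", None, str),
    )
    def contracts_ner(dataset, source):
        # Stream tasks from disk; Prodigy serves them to the annotator one at a time.
        stream = JSONL(source)
        return {
            "dataset": dataset,
            "stream": stream,
            "view_id": "ner_manual",
            "config": {"labels": ["PERSON", "ORG", "DATE"]},
        }

    Saved as recipe.py, this would be launched with: prodigy contracts.ner my_dataset ./data.jsonl -F recipe.py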

    What it doesn't do:

    • No document parsing — expects text input, not PDFs
    • No data cleaning or quality scoring
    • No audit trail for compliance (designed for productivity, not governance)
    • Single-user focused — team features require custom orchestration
    • No multi-format export (annotations come out as Prodigy's own JSONL via db-out; converting to training formats is custom work, as the sketch after this list shows)
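
    Getting annotations back out is one of the custom-script jobs listed below: Prodigy's db-out command dumps a dataset as JSONL, and turning that into a training format is glue code. A minimal sketch, assuming an NER dataset and hypothetical file names:

    import json

    def load_ner_examples(path):
        """Convert Prodigy db-out JSONL into (text, entities) training tuples."""
        examples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record.get("answer") != "accept":
                    continue  # drop rejected and ignored tasks
                entities = [
                    (span["start"], span["end"], span["label"])
                    for span in record.get("spans", [])
                ]
                examples.append((record["text"], entities))
        return examples

    # First: prodigy db-out my_dataset > annotations.jsonl
    examples = load_ner_examples("annotations.jsonl")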

    Docling (IBM Research) — Free/Open-Source

    Docling is a strong document parser. It handles PDFs, Word documents, and other formats with good table extraction and layout detection.

    What it does well:

    • 97.9% table extraction accuracy (competitive with commercial tools)
    • Layout-aware parsing (headings, paragraphs, lists, tables)
    • Multiple output formats (Markdown, JSON, text)
    • Open-source, actively maintained by IBM Research

    What it doesn't do:

    • No labeling capability
    • No data cleaning, deduplication, or quality scoring
    • No PII detection or redaction
    • No audit trail
    • No GUI (command-line and Python library only)

    Custom Python Scripts — "Free"

    Everything between Docling and Prodigy — and everything after Prodigy — is custom code:

    • docling_to_prodigy.py — converts Docling output to Prodigy's input format
    • clean_extracted_text.py — deduplication, quality filtering, normalization
    • pii_detection.py — regex and NER-based PII detection
    • prodigy_export.py — exports Prodigy annotations to training format
    • quality_check.py — inter-annotator agreement, label distribution analysis
    • prepare_training_data.py — final formatting for model training

    Total: ~3,000-5,000 lines of Python across 8-12 scripts

    The Friction Points

    Friction Point 1: Docling → Prodigy Format Conversion

    Docling outputs documents as structured objects with sections, tables, and metadata. Prodigy expects a stream of records in JSONL format with a text field.

    The conversion script must:

    • Flatten document structure into annotation-sized chunks
    • Decide on chunking strategy (by page? by section? by paragraph?)
    • Preserve metadata (source file, page number, section) as Prodigy meta fields
    • Handle tables (convert to text? markdown? skip?)
    • Handle multi-page documents (one Prodigy task per page, or merge?)

    The decisions in this converter are not technical — they're domain-specific. Whether to chunk by section or paragraph affects annotation quality. Whether to include tables affects model coverage. These decisions should be made by domain experts, but they're encoded in a Python script maintained by an ML engineer.
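
    To make the gap concrete, here is a minimal sketch of such a converter, using Docling's documented Python API and a naive paragraph-level chunking strategy; the file names are hypothetical, and a real converter would handle tables and page boundaries explicitly:

    import json
    from pathlib import Path

    from docling.document_converter import DocumentConverter

    def convert_to_prodigy_jsonl(pdf_path: str, out_path: str) -> None:
        converter = DocumentConverter()
        result = converter.convert(pdf_path)
        markdown = result.document.export_to_markdown()

        # Naive chunking: split on blank lines, drop empty chunks.
        chunks = [c.strip() for c in markdown.split("\n\n") if c.strip()]

        with open(out_path, "w", encoding="utf-8") as f:
            for i, chunk in enumerate(chunks):
                record = {
                    "text": chunk,  # the field Prodigy expects
                    "meta": {"source": Path(pdf_path).name, "chunk": i},
                }
                f.write(json.dumps(record, ensure_ascii=False) + "\n")

    convert_to_prodigy_jsonl("contract.pdf", "contract.jsonl")

    Every hard-coded choice in this sketch (paragraph chunks, markdown tables, flat metadata) is one of the domain decisions described above.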

    Friction Point 2: Manual Quality Pipeline

    Between Docling extraction and Prodigy annotation, the data needs cleaning:

    • Deduplication (same document in multiple folders)
    • Quality filtering (OCR confidence below threshold → flag or exclude)
    • PII detection and redaction (before annotators see the data)
    • Normalization (encoding issues, whitespace, special characters)

    This is 1,000-2,000 lines of custom Python that nobody wants to write, nobody wants to maintain, and nobody has tested comprehensively.
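
    As a rough illustration of one slice of that code, here is a minimal sketch of the deduplication and normalization step, with hypothetical file names; PII detection and OCR-confidence filtering would sit alongside it:

    import hashlib
    import json
    import unicodedata

    def normalize(text: str) -> str:
        # Repair common encoding and whitespace issues before hashing.
        text = unicodedata.normalize("NFKC", text)
        return " ".join(text.split())

    def deduplicate(records):
        seen = set()
        for record in records:
            record["text"] = normalize(record["text"])
            digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # exact duplicate, e.g. the same file in two folders
            seen.add(digest)
            yield record

    with open("contract.jsonl", encoding="utf-8") as src, \
         open("contract.clean.jsonl", "w", encoding="utf-8") as dst:
        for record in deduplicate(json.loads(line) for line in src):
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")

    Note that this only catches exact duplicates after normalization; near-duplicate detection needs fuzzier hashing, which is where the untested edge cases accumulate.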

    Friction Point 3: Audit Trail Gaps

    For regulated industries, the audit trail looks like this:

    • Docling: Logs parsing events (if logging is configured)
    • Custom scripts: Log whatever the developer remembered to log (usually: nothing useful)
    • Prodigy: Logs annotation events with timestamps and session IDs

    What's missing:

    • When was the format conversion run? By whom?
    • What was the PII detection configuration? What was redacted?
    • Which version of each script was used?
    • How were quality thresholds set? Who approved them?

    These gaps are compliance risks under the EU AI Act and similar regulations.
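
    Teams that plug these gaps by hand usually do it with a run manifest written next to every output. A hypothetical sketch, assuming the scripts live in a git repository:

    import getpass
    import json
    import subprocess
    import sys
    from datetime import datetime, timezone

    def write_manifest(step: str, config: dict, out_path: str) -> None:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip()
        manifest = {
            "step": step,
            "run_at": datetime.now(timezone.utc).isoformat(),
            "run_by": getpass.getuser(),  # answers "by whom?"
            "git_commit": commit,         # answers "which version?"
            "config": config,             # answers "what configuration?"
            "python": sys.version,
        }
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(manifest, f, indent=2)

    write_manifest("pii_detection", {"confidence_threshold": 0.8}, "pii_run.manifest.json")

    Even then, "who approved the thresholds" lives outside the code, which is exactly the governance gap at issue here.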

    Friction Point 4: The Bus Factor

    In most enterprises using this stack, one ML engineer understands the full pipeline. They wrote the scripts, configured the tools, and handle the edge cases that arise during processing.

    If that person leaves:

    • The custom scripts have minimal documentation
    • The Prodigy configuration has undocumented conventions
    • The edge case handling is tribal knowledge
    • The next engineer needs 4-8 weeks to understand the pipeline

    This isn't a flaw of Prodigy or Docling — they're individual tools with good documentation. The bus factor risk is in the custom integration layer that connects them.

    Friction Point 5: Domain Expert Exclusion

    Prodigy is excellent for ML engineers. It's a Python-first tool with a command-line interface:

    prodigy ner.manual my_dataset blank:en ./data.jsonl --label PERSON,ORG,DATE
    

    A lawyer or doctor who needs to label domain-specific data cannot use this without an ML engineer setting up and running the session. This creates a dependency that bottlenecks labeling throughput.

    What a Unified Platform Changes

    The friction points above aren't caused by bad tools — they're caused by tool boundaries. Each tool is individually strong but not designed to work with the others.

    A unified platform like Ertas Data Suite eliminates these boundaries:

    • Document parsing feeds directly into cleaning (no format conversion)
    • Cleaning feeds directly into labeling (no custom scripts)
    • Labeling includes quality review (no separate quality pipeline)
    • Export generates compliance documentation (no audit trail gaps)
    • Domain experts use the same interface as ML engineers (no accessibility barrier)

    The trade-off: you lose Prodigy's best-in-class annotation speed and Docling's best-in-class table extraction. You gain pipeline continuity, audit trail completeness, and domain expert accessibility.

    For enterprise production pipelines in regulated industries, the pipeline-level benefits typically outweigh the tool-level trade-offs. For research and experimentation, the individual tools may remain the better choice.

    The stack is good. The gaps between the tools are where the cost lives.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
