
# How to Scope a Data Preparation Engagement for Enterprise Fine-Tuning
A practical scoping framework for ML service providers — discovery questions, common mistakes, checklists, and engagement structure for data prep projects.
Scoping is where data preparation engagements succeed or fail. Underscope, and you absorb the cost overrun. Overscope, and you price yourself out of the deal. Get the scope wrong entirely, and you spend weeks building a pipeline for the wrong problem.
This is a practical guide for ML service providers — consultancies, system integrators, forward deployment teams — who deliver data preparation pipelines for enterprise fine-tuning projects. It covers the discovery framework, common mistakes, a scoping checklist, and a sample engagement structure.
## The Discovery Call Framework
The discovery call is your single best opportunity to understand what the engagement actually requires. Most service providers treat it as a sales conversation. Treat it as a technical interview instead.
### Questions About Data
- What data types exist? Documents (PDF, Word, scanned images), structured data (CSV, database exports), semi-structured data (JSON, XML), multimedia (audio, video, images). The answer determines your ingestion pipeline complexity.
- What is the total volume? 10 GB and 10 TB require fundamentally different approaches. Get specific numbers, not ranges.
- How many distinct formats? A single-format corpus (all PDFs) is straightforward. A multi-format corpus (PDFs + scanned images + spreadsheets + email exports) is 3–5x more complex.
- Where does the data currently live? On-premise file servers, cloud storage, legacy databases, email archives, physical filing cabinets. Each source has different extraction requirements.
- What is the data quality baseline? Has anyone looked at the data? Are there known quality issues? Has any cleaning been attempted?
### Questions About Compliance
- Which regulatory frameworks apply? HIPAA, GDPR, SOC 2, ITAR, CMMC, industry-specific regulations. Each imposes different constraints on how data can be processed and where.
- Can data leave the client's network? In regulated industries, the answer is almost always no. This determines your deployment model.
- Is there PII or PHI in the source data? If yes, you need a redaction or de-identification step before labeling.
- What audit trail requirements exist? Some clients need full data lineage for regulatory compliance. Others just need it for internal governance.
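When PII or PHI is present, the redaction step is worth prototyping during scoping, before committing to a vendor tool. A minimal sketch using regular expressions — the patterns and the `redact` helper are illustrative assumptions, not a production de-identifier, which needs far broader coverage (names, addresses, record numbers) plus a recall audit:

```python
import re

# Illustrative patterns only -- real de-identification needs many more
# entity types and a measured recall rate before anyone signs off on it.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each match with a typed placeholder, so downstream
    labelers still see *what kind* of value was removed."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact John at john.doe@acme.com or 555-867-5309."))
# -> Contact John at [EMAIL] or [PHONE].
```

Typed placeholders (rather than blank deletion) matter because the labeling taxonomy may depend on whether a field contained a contact detail at all.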
### Questions About the Target Use Case
- What is the model being trained to do? Classification, extraction, generation, summarization, something else. The use case determines the labeling taxonomy and output format.
- Who defined the labeling taxonomy? If the client has a taxonomy, you need to validate it. If they do not, you need to build one — and that is a separate work item.
- What is the target output format? JSONL, Parquet, HuggingFace datasets, custom format. Confirm this before you start.
- What does "done" look like? Get explicit acceptance criteria: dataset size, quality metrics, format requirements, documentation deliverables.
### Questions About the Client's Team
- Who will be involved from the client side? ML engineers, data engineers, domain experts, compliance officers. Each group has different needs.
- Will domain experts participate in labeling? If yes, your tooling needs to be accessible to non-technical users.
- Who will maintain the pipeline after handoff? This determines how you document and package the deliverable.
## Common Scoping Mistakes
### Underestimating Data Diversity
A client says "we have PDFs." You scope for PDF processing. When you arrive, the "PDFs" include scanned images with no OCR, born-digital PDFs with complex table layouts, PDFs with embedded forms, and PDFs that are actually Word documents saved as PDF. Each subtype requires different processing. Budget 2–3x your initial estimate for format diversity within a single stated format.
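A cheap first pass during discovery is to triage the stated "PDFs" by raw byte markers before building anything. The heuristic below is an assumption, not a parser: born-digital PDFs reference fonts, image-only scans mostly reference image XObjects, and compressed streams can hide both markers — so treat the result as a way to size the OCR workload, never as ground truth:

```python
def triage_pdf(raw: bytes) -> str:
    """Rough triage of a file's bytes into PDF subtypes.
    Heuristic only: object streams may compress these markers away."""
    if not raw.startswith(b"%PDF"):
        return "not-a-pdf"          # e.g. a Word file renamed to .pdf
    has_fonts = b"/Font" in raw
    has_images = b"/Image" in raw or b"/DCTDecode" in raw
    if has_images and not has_fonts:
        return "likely-scanned"     # route to the OCR branch of the pipeline
    if has_fonts:
        return "likely-born-digital"
    return "unknown"                # compressed streams: needs a real parser

# Demo on synthetic byte strings standing in for sampled files.
print(triage_pdf(b"%PDF-1.4 /XObject /Image /DCTDecode"))   # likely-scanned
print(triage_pdf(b"%PDF-1.5 /Font /BaseFont /Helvetica"))   # likely-born-digital
```

Running something this crude over a few hundred sampled files during discovery is often enough to catch the "half our PDFs are scans" surprise before you price the engagement.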
### Ignoring Compliance Requirements
Compliance requirements do not just constrain where you process data. They constrain how you process it, what tools you can use, what audit trail you must produce, and how you handle the data after the engagement ends. A client in healthcare who says "we need HIPAA compliance" is telling you that every tool in your pipeline must meet BAA requirements, every data transformation must be logged, and PHI must be redacted before any non-authorized person sees it.
### Assuming Clean Source Data
No enterprise data is clean. Even when the client says "our data is pretty clean," expect 15–30% of records to have quality issues: duplicate entries, inconsistent formatting, missing fields, encoding errors, corrupted files. Build data quality assessment into the first week of every engagement.
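That first-week assessment can be a simple scan that counts the common failure modes. A minimal sketch — the specific checks (exact duplicates, empty required fields, mojibake markers) are illustrative and should be extended per corpus:

```python
from collections import Counter

def quality_scan(rows, required_fields):
    """Baseline quality report over dict records: counts exact duplicates,
    missing required fields, and common encoding-damage markers."""
    issues = Counter()
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            issues["duplicate"] += 1
        seen.add(key)
        if any(not row.get(f, "").strip() for f in required_fields):
            issues["missing_field"] += 1
        # U+FFFD and stray "Ã" are typical signs of double-encoded UTF-8.
        if any("\ufffd" in v or "Ã" in v for v in row.values()):
            issues["encoding"] += 1
    return issues

rows = [
    {"id": "1", "text": "fine"},
    {"id": "1", "text": "fine"},       # exact duplicate
    {"id": "2", "text": ""},           # missing required field
    {"id": "3", "text": "cafÃ©"},      # mojibake for "café"
]
print(quality_scan(rows, required_fields=["id", "text"]))
```

The point is not the scan itself but the number it produces: a "% records with issues" figure you can put in the scope document and price against.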
### Scope Creep From Undefined Labeling Taxonomies
If the labeling taxonomy is not defined before the engagement starts, it will be defined during the engagement — incrementally, inconsistently, and expensively. Every taxonomy change requires relabeling previously completed work. Lock the taxonomy during scoping or budget for iteration.
## The Scoping Checklist
Use this checklist during and after discovery to ensure complete scoping.
### Data Inventory
- All data sources identified and documented
- Volume per source (GB/TB) confirmed
- Formats per source listed and validated (not just stated)
- Sample data accessed and reviewed
- Data quality baseline assessed (% records with issues)
### Compliance and Security
- Applicable regulatory frameworks identified
- Data residency requirements confirmed
- PII/PHI presence assessed
- Redaction or de-identification requirements defined
- Audit trail requirements documented
- Tool approval process understood (some clients require security review of any software installed on-premise)
### Labeling and Taxonomy
- Target use case clearly defined
- Labeling taxonomy defined and approved by client
- Edge cases in taxonomy discussed and documented
- Inter-annotator agreement expectations set
- Domain expert availability confirmed
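Inter-annotator agreement expectations are easier to set when tied to a concrete metric. Cohen's kappa is one common choice for two annotators: observed agreement corrected for the agreement you would expect by chance. A minimal sketch (the label values are hypothetical, and the formula is undefined when both annotators use a single label):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(a, b), 3))  # -> 0.333
```

Agreeing during scoping on both the metric and the threshold (e.g. "kappa above 0.7 on a calibration batch before full labeling starts") turns a vague quality promise into an acceptance criterion.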
### Pipeline and Output
- Target output format confirmed
- Quality metrics and acceptance criteria defined
- Export format validated against client's training pipeline
- Handoff requirements documented (who maintains the pipeline post-engagement)
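"Export format validated" should mean a script, not a visual check. For a JSONL target, a sketch of the validation pass — the required keys here (`prompt`, `completion`) are an assumed schema and must be confirmed against the client's actual training pipeline:

```python
import json

# Assumed schema -- replace with the keys the client's trainer expects.
REQUIRED_KEYS = {"prompt", "completion"}

def validate_jsonl(lines):
    """Return (line_number, error) pairs for records that fail to parse
    or are missing required keys; an empty list means the export passes."""
    errors = []
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append((i, f"invalid JSON: {e}"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            errors.append((i, f"missing keys: {sorted(missing)}"))
    return errors

sample = [
    '{"prompt": "Q1", "completion": "A1"}',
    '{"prompt": "Q2"}',
    'not json',
]
print(validate_jsonl(sample))
```

Running this against a small batch pushed through the client's actual training loader, before the full export, is what the checklist item means by "validated".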
### Timeline and Resources
- Client-side team availability confirmed
- Hardware/infrastructure availability confirmed
- Timeline milestones agreed
- Dependencies identified (e.g., waiting for data access, compliance review)
## How Scope Affects Pricing
The primary cost drivers for a data preparation engagement are:
| Cost Driver | Low Complexity | Medium Complexity | High Complexity |
|---|---|---|---|
| Data volume | < 50 GB | 50–500 GB | 500 GB+ |
| Format diversity | Single format | 2–3 formats | 4+ formats or multi-modal |
| Labeling complexity | Binary classification | Multi-class with 5–15 labels | Hierarchical taxonomy, 50+ labels |
| Compliance requirements | Standard data handling | Industry-specific (HIPAA, SOC 2) | Air-gapped, full audit trail |
| Output formats | Single target | 2–3 targets | Custom format with validation |
A low-complexity engagement (single format, small volume, simple labels, standard compliance) typically falls at the lower end of the $10K–$20K range. High-complexity engagements (multi-modal, large volume, complex taxonomy, strict compliance) can exceed $20K and may require phased delivery.
## Sample Engagement Structure
### Small Engagement (50 GB, single format, 2–3 week timeline)
| Phase | Duration | Deliverables |
|---|---|---|
| Discovery + Scoping | 3 days | Data inventory, compliance summary, scope document |
| Pipeline Setup + Ingestion | 3 days | Working pipeline, ingested data |
| Cleaning + Labeling | 1–2 weeks | Cleaned, labeled dataset |
| QA + Export + Handoff | 2 days | Validated dataset, lineage report, handoff documentation |
### Medium Engagement (200 GB, multi-format, 4–6 week timeline)
| Phase | Duration | Deliverables |
|---|---|---|
| Discovery + Scoping | 1 week | Data inventory, compliance summary, scope document, labeling taxonomy |
| Pipeline Setup + Ingestion | 1 week | Working pipeline, ingested data, format conversion validation |
| Cleaning + Labeling | 2–3 weeks | Cleaned, labeled dataset with QA checkpoints |
| Augmentation + QA | 3–5 days | Augmented dataset, quality metrics report |
| Export + Handoff | 3–5 days | Validated dataset, full lineage report, handoff documentation, team training |
## Reducing Scoping Uncertainty
The biggest source of scoping uncertainty is not knowing what the data actually looks like until you start processing it. Discovery calls reveal some surprises. The rest emerge during pipeline setup.
Using a unified platform that handles the full data preparation pipeline — from ingestion through export — significantly reduces this uncertainty. When all five stages (Ingest → Clean → Label → Augment → Export) run in a single tool, format surprises surface during ingestion rather than at the boundary between two separate tools. Ertas Data Suite is built for this workflow: it runs entirely on-premise, handles multi-format ingestion natively, and provides the audit trail that compliance-heavy engagements require.
The goal of good scoping is not to eliminate uncertainty — that is impossible with enterprise data. The goal is to identify where the uncertainty lives and build your engagement structure to absorb it without blowing the timeline or budget.
## Where This Fits
Scoping is the first step in a data preparation service practice. Get it right, and the rest of the engagement follows a predictable structure. Get it wrong, and every subsequent phase inherits the error — usually in the form of rework, scope creep, or a handoff that the client cannot maintain.