
# How to Scope a Data Preparation Engagement for Enterprise Fine-Tuning
A practical scoping framework for ML service providers — discovery questions, common mistakes, checklists, and engagement structure for data prep projects.
Scoping is where data preparation engagements succeed or fail. Underscope, and you absorb the cost overrun. Overscope, and you price yourself out of the deal. Get the scope wrong entirely, and you spend weeks building a pipeline for the wrong problem.
This is a practical guide for ML service providers — consultancies, system integrators, forward deployment teams — who deliver data preparation pipelines for enterprise fine-tuning projects. It covers the discovery framework, common mistakes, a scoping checklist, and a sample engagement structure.
## The Discovery Call Framework
The discovery call is your single best opportunity to understand what the engagement actually requires. Most service providers treat it as a sales conversation. Treat it as a technical interview instead.
### Questions About Data
- What data types exist? Documents (PDF, Word, scanned images), structured data (CSV, database exports), semi-structured data (JSON, XML), multimedia (audio, video, images). The answer determines your ingestion pipeline complexity.
- What is the total volume? 10 GB and 10 TB require fundamentally different approaches. Get specific numbers, not ranges.
- How many distinct formats? A single-format corpus (all PDFs) is straightforward. A multi-format corpus (PDFs + scanned images + spreadsheets + email exports) is 3–5x more complex.
- Where does the data currently live? On-premise file servers, cloud storage, legacy databases, email archives, physical filing cabinets. Each source has different extraction requirements.
- What is the data quality baseline? Has anyone looked at the data? Are there known quality issues? Has any cleaning been attempted?
### Questions About Compliance
- Which regulatory frameworks apply? HIPAA, GDPR, SOC 2, ITAR, CMMC, industry-specific regulations. Each imposes different constraints on how data can be processed and where.
- Can data leave the client's network? In regulated industries, the answer is almost always no. This determines your deployment model.
- Is there PII or PHI in the source data? If yes, you need a redaction or de-identification step before labeling.
- What audit trail requirements exist? Some clients need full data lineage for regulatory compliance. Others just need it for internal governance.
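When PII or PHI is present, the redaction step is worth prototyping during scoping, before committing to a vendor tool. A minimal sketch using regular expressions — the patterns and the `redact` helper are illustrative assumptions, not a production de-identifier, which needs far broader coverage (names, addresses, record numbers) plus a recall audit:

```python
import re

# Illustrative patterns only -- real de-identification needs many more
# entity types and a measured recall rate before anyone signs off on it.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each match with a typed placeholder, so downstream
    labelers still see *what kind* of value was removed."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact John at john.doe@acme.com or 555-867-5309."))
# -> Contact John at [EMAIL] or [PHONE].
```

Typed placeholders (rather than blank deletion) matter because the labeling taxonomy may depend on whether a field contained a contact detail at all.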
### Questions About the Target Use Case
- What is the model being trained to do? Classification, extraction, generation, summarization, something else. The use case determines the labeling taxonomy and output format.
- Who defined the labeling taxonomy? If the client has a taxonomy, you need to validate it. If they do not, you need to build one — and that is a separate work item.
- What is the target output format? JSONL, Parquet, HuggingFace datasets, custom format. Confirm this before you start.
- What does "done" look like? Get explicit acceptance criteria: dataset size, quality metrics, format requirements, documentation deliverables.
### Questions About the Client's Team
- Who will be involved from the client side? ML engineers, data engineers, domain experts, compliance officers. Each group has different needs.
- Will domain experts participate in labeling? If yes, your tooling needs to be accessible to non-technical users.
- Who will maintain the pipeline after handoff? This determines how you document and package the deliverable.
## Common Scoping Mistakes
### Underestimating Data Diversity
A client says "we have PDFs." You scope for PDF processing. When you arrive, the "PDFs" include scanned images with no OCR, born-digital PDFs with complex table layouts, PDFs with embedded forms, and PDFs that are actually Word documents saved as PDF. Each subtype requires different processing. Budget 2–3x your initial estimate for format diversity within a single stated format.
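A cheap first pass during discovery is to triage the stated "PDFs" by raw byte markers before building anything. The heuristic below is an assumption, not a parser: born-digital PDFs reference fonts, image-only scans mostly reference image XObjects, and compressed streams can hide both markers — so treat the result as a way to size the OCR workload, never as ground truth:

```python
def triage_pdf(raw: bytes) -> str:
    """Rough triage of a file's bytes into PDF subtypes.
    Heuristic only: object streams may compress these markers away."""
    if not raw.startswith(b"%PDF"):
        return "not-a-pdf"          # e.g. a Word file renamed to .pdf
    has_fonts = b"/Font" in raw
    has_images = b"/Image" in raw or b"/DCTDecode" in raw
    if has_images and not has_fonts:
        return "likely-scanned"     # route to the OCR branch of the pipeline
    if has_fonts:
        return "likely-born-digital"
    return "unknown"                # compressed streams: needs a real parser

# Demo on synthetic byte strings standing in for sampled files.
print(triage_pdf(b"%PDF-1.4 /XObject /Image /DCTDecode"))   # likely-scanned
print(triage_pdf(b"%PDF-1.5 /Font /BaseFont /Helvetica"))   # likely-born-digital
```

Running something this crude over a few hundred sampled files during discovery is often enough to catch the "half our PDFs are scans" surprise before you price the engagement.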
### Ignoring Compliance Requirements
Compliance requirements do not just constrain where you process data. They constrain how you process it, what tools you can use, what audit trail you must produce, and how you handle the data after the engagement ends. A client in healthcare who says "we need HIPAA compliance" is telling you that every tool in your pipeline must meet BAA requirements, every data transformation must be logged, and PHI must be redacted before any non-authorized person sees it.
### Assuming Clean Source Data
No enterprise data is clean. Even when the client says "our data is pretty clean," expect 15–30% of records to have quality issues: duplicate entries, inconsistent formatting, missing fields, encoding errors, corrupted files. Build data quality assessment into the first week of every engagement.
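That first-week assessment can be a simple scan that counts the common failure modes. A minimal sketch — the specific checks (exact duplicates, empty required fields, mojibake markers) are illustrative and should be extended per corpus:

```python
from collections import Counter

def quality_scan(rows, required_fields):
    """Baseline quality report over dict records: counts exact duplicates,
    missing required fields, and common encoding-damage markers."""
    issues = Counter()
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            issues["duplicate"] += 1
        seen.add(key)
        if any(not row.get(f, "").strip() for f in required_fields):
            issues["missing_field"] += 1
        # U+FFFD and stray "Ã" are typical signs of double-encoded UTF-8.
        if any("\ufffd" in v or "Ã" in v for v in row.values()):
            issues["encoding"] += 1
    return issues

rows = [
    {"id": "1", "text": "fine"},
    {"id": "1", "text": "fine"},       # exact duplicate
    {"id": "2", "text": ""},           # missing required field
    {"id": "3", "text": "cafÃ©"},      # mojibake for "café"
]
print(quality_scan(rows, required_fields=["id", "text"]))
```

The point is not the scan itself but the number it produces: a "% records with issues" figure you can put in the scope document and price against.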
### Scope Creep From Undefined Labeling Taxonomies
If the labeling taxonomy is not defined before the engagement starts, it will be defined during the engagement — incrementally, inconsistently, and expensively. Every taxonomy change requires relabeling previously completed work. Lock the taxonomy during scoping or budget for iteration.
## The Scoping Checklist
Use this checklist during and after discovery to ensure complete scoping.
### Data Inventory
- All data sources identified and documented
- Volume per source (GB/TB) confirmed
- Formats per source listed and validated (not just stated)
- Sample data accessed and reviewed
- Data quality baseline assessed (% records with issues)
### Compliance and Security
- Applicable regulatory frameworks identified
- Data residency requirements confirmed
- PII/PHI presence assessed
- Redaction or de-identification requirements defined
- Audit trail requirements documented
- Tool approval process understood (some clients require security review of any software installed on-premise)
### Labeling and Taxonomy
- Target use case clearly defined
- Labeling taxonomy defined and approved by client
- Edge cases in taxonomy discussed and documented
- Inter-annotator agreement expectations set
- Domain expert availability confirmed
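Inter-annotator agreement expectations are easier to set when tied to a concrete metric. Cohen's kappa is one common choice for two annotators: observed agreement corrected for the agreement you would expect by chance. A minimal sketch (the label values are hypothetical, and the formula is undefined when both annotators use a single label):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(a, b), 3))  # -> 0.333
```

Agreeing during scoping on both the metric and the threshold (e.g. "kappa above 0.7 on a calibration batch before full labeling starts") turns a vague quality promise into an acceptance criterion.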
### Pipeline and Output
- Target output format confirmed
- Quality metrics and acceptance criteria defined
- Export format validated against client's training pipeline
- Handoff requirements documented (who maintains the pipeline post-engagement)
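"Export format validated" should mean a script, not a visual check. For a JSONL target, a sketch of the validation pass — the required keys here (`prompt`, `completion`) are an assumed schema and must be confirmed against the client's actual training pipeline:

```python
import json

# Assumed schema -- replace with the keys the client's trainer expects.
REQUIRED_KEYS = {"prompt", "completion"}

def validate_jsonl(lines):
    """Return (line_number, error) pairs for records that fail to parse
    or are missing required keys; an empty list means the export passes."""
    errors = []
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append((i, f"invalid JSON: {e}"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            errors.append((i, f"missing keys: {sorted(missing)}"))
    return errors

sample = [
    '{"prompt": "Q1", "completion": "A1"}',
    '{"prompt": "Q2"}',
    'not json',
]
print(validate_jsonl(sample))
```

Running this against a small batch pushed through the client's actual training loader, before the full export, is what the checklist item means by "validated".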
### Timeline and Resources
- Client-side team availability confirmed
- Hardware/infrastructure availability confirmed
- Timeline milestones agreed
- Dependencies identified (e.g., waiting for data access, compliance review)
## How Scope Affects Pricing
The primary cost drivers for a data preparation engagement are:
| Cost Driver | Low Complexity | Medium Complexity | High Complexity |
|---|---|---|---|
| Data volume | < 50 GB | 50–500 GB | 500 GB+ |
| Format diversity | Single format | 2–3 formats | 4+ formats or multi-modal |
| Labeling complexity | Binary classification | Multi-class with 5–15 labels | Hierarchical taxonomy, 50+ labels |
| Compliance requirements | Standard data handling | Industry-specific (HIPAA, SOC 2) | Air-gapped, full audit trail |
| Output formats | Single target | 2–3 targets | Custom format with validation |
A low-complexity engagement (single format, small volume, simple labels, standard compliance) typically falls at the lower end of the $10K–$20K range. High-complexity engagements (multi-modal, large volume, complex taxonomy, strict compliance) can exceed $20K and may require phased delivery.
## Sample Engagement Structure
### Small Engagement (50 GB, single format, 2–3 week timeline)
| Phase | Duration | Deliverables |
|---|---|---|
| Discovery + Scoping | 3 days | Data inventory, compliance summary, scope document |
| Pipeline Setup + Ingestion | 3 days | Working pipeline, ingested data |
| Cleaning + Labeling | 1–2 weeks | Cleaned, labeled dataset |
| QA + Export + Handoff | 2 days | Validated dataset, lineage report, handoff documentation |
### Medium Engagement (200 GB, multi-format, 4–6 week timeline)
| Phase | Duration | Deliverables |
|---|---|---|
| Discovery + Scoping | 1 week | Data inventory, compliance summary, scope document, labeling taxonomy |
| Pipeline Setup + Ingestion | 1 week | Working pipeline, ingested data, format conversion validation |
| Cleaning + Labeling | 2–3 weeks | Cleaned, labeled dataset with QA checkpoints |
| Augmentation + QA | 3–5 days | Augmented dataset, quality metrics report |
| Export + Handoff | 3–5 days | Validated dataset, full lineage report, handoff documentation, team training |
## Reducing Scoping Uncertainty
The biggest source of scoping uncertainty is not knowing what the data actually looks like until you start processing it. Discovery calls reveal some surprises. The rest emerge during pipeline setup.
Using a unified platform that handles the full data preparation pipeline — from ingestion through export — significantly reduces this uncertainty. When all five stages (Ingest → Clean → Label → Augment → Export) run in a single tool, format surprises surface during ingestion rather than at the boundary between two separate tools. Ertas Data Suite is built for this workflow: it runs entirely on-premise, handles multi-format ingestion natively, and provides the audit trail that compliance-heavy engagements require.
The goal of good scoping is not to eliminate uncertainty — that is impossible with enterprise data. The goal is to identify where the uncertainty lives and build your engagement structure to absorb it without blowing the timeline or budget.
## Where This Fits
Scoping is the first step in a data preparation service practice. Get it right, and the rest of the engagement follows a predictable structure. Get it wrong, and every subsequent phase inherits the error — usually in the form of rework, scope creep, or a handoff that the client cannot maintain.