
What to Expect from a $10K–$20K AI Data Prep Engagement
Transparent breakdown of what a $10K–$20K AI data preparation engagement includes: scope, timeline, deliverables, and what drives cost up or down.
Enterprise AI pricing is opaque by design. Most vendors want to get you on a call before they discuss numbers. By the time you learn the price, you have already invested hours in demos and discovery sessions, and the sunk cost makes it harder to walk away.
We think that is backward. If you are budgeting for an AI data preparation engagement, you should know what $10K–$20K buys before you pick up the phone. This post is a transparent breakdown of what a typical engagement at this price point includes, how the work is structured, and what factors push the cost higher or lower.
What This Price Point Covers
A $10K–$20K engagement is scoped for a single data pipeline — one primary data source, one target output format, one use case. It is not an enterprise-wide data transformation. It is a focused, high-value engagement designed to take one specific dataset from raw to AI-ready.
Typical deliverables:
- A working data pipeline on your infrastructure
- Ingestion from your source system (database, file share, document management system)
- Cleaning and transformation rules tailored to your data
- Label schema designed with your domain experts
- Quality validation with measurable metrics
- Export in your required training format (JSONL, Parquet, COCO, etc.; see the JSONL sketch below)
- Documentation and team training for pipeline maintenance
- 30 days of post-engagement support
What it does not typically include at this price point: multi-source data integration, model training, ongoing managed services, or hardware procurement.
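To make the export deliverable concrete, here is a minimal sketch of the JSONL step. The records and field names are illustrative, not a fixed schema; in a real engagement the structure falls out of the label schema designed with your domain experts.

```python
import json

# Illustrative labeled records; the real field names come from your label schema.
records = [
    {"text": "Lessee shall maintain the premises in good repair.", "label": "maintenance_obligation"},
    {"text": "Either party may terminate with 30 days written notice.", "label": "termination_clause"},
]

# JSONL: one JSON object per line, a format most training frameworks ingest directly.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```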
The Timeline
Most engagements at this level run 4–6 weeks. Here is how the time typically breaks down:
Week 1: Discovery (~$2K–$3K of effort)
This is where the engagement either succeeds or fails. Discovery week is about understanding what you actually have, not what you think you have.
What happens:
- Data audit: What data exists, where it lives, what format it is in, how much there is (see the inventory sketch below)
- Environment setup: Access to your infrastructure, security credentials, network configuration
- Stakeholder interviews: Domain experts explain how the data is used, what matters, what does not
- Scope confirmation: The engagement scope is refined based on what the data audit reveals
What typically goes wrong: The data is in worse shape than expected. Source systems are undocumented. Access provisioning takes longer than planned. This is normal — discovery exists precisely to surface these issues before build starts.
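For a flavor of what the audit's first pass looks like, here is a minimal sketch that inventories a file share by format and size. The mount path is a placeholder, and a real audit also covers databases and document management systems, not just files:

```python
from collections import Counter
from pathlib import Path

# Placeholder mount point for the file share under audit.
root = Path("/mnt/fileshare")

format_counts = Counter()
total_bytes = 0
for path in root.rglob("*"):
    if path.is_file():
        format_counts[path.suffix.lower() or "(no extension)"] += 1
        total_bytes += path.stat().st_size

for suffix, count in format_counts.most_common():
    print(f"{suffix}: {count} files")
print(f"total: {total_bytes / 1e9:.1f} GB")
```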
Weeks 2–3: Pipeline Build (~$5K–$9K of effort)
The core engineering work. An engineer (or a pair of engineers for larger scopes) builds the pipeline on your infrastructure.
What happens:
- Ingestion pipeline: Connectors to your source systems, handling edge cases in data formats
- Cleaning rules: Deduplication, normalization, handling missing values, format standardization (sketched below)
- Labeling workflow: Label schema creation, annotation interface setup, domain expert onboarding
- Transformation logic: Converting raw data into the structure your ML pipeline needs
- Iterative review: Domain experts review output samples, provide feedback, refine rules
What typically goes wrong: Edge cases in data that were not visible during discovery. A document type that accounts for 5% of volume but 50% of complexity. Integration issues with legacy systems. Good engineers plan buffer time for this.
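To give the cleaning rules some shape, here is a minimal pandas sketch. The file and column names are assumptions for illustration; the actual rules are tailored to your data through the iterative review loop:

```python
import pandas as pd

# Assumed file and column names, for illustration only.
df = pd.read_csv("raw_contracts.csv")

# Deduplication: exact duplicates first, then duplicates on a business key.
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["contract_id"], keep="last")

# Normalization: consistent whitespace and casing in free-text fields.
df["title"] = df["title"].str.strip().str.lower()

# Missing values: drop rows missing required fields; coerce bad dates to NaT
# so they can be reviewed rather than silently breaking downstream steps.
df = df.dropna(subset=["contract_id", "body"])
df["effective_date"] = pd.to_datetime(df["effective_date"], errors="coerce")

df.to_parquet("clean_contracts.parquet", index=False)
```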
Week 4: Validation and Handoff (~$2K–$4K of effort)
The pipeline is tested, validated, and transferred to your team.
What happens:
- Quality metrics: Precision, recall, and agreement scores on labeled data (sketched below)
- Pipeline testing: End-to-end runs with production data volumes
- Documentation: Pipeline architecture, configuration, maintenance procedures
- Team training: Your engineers learn how to operate, modify, and extend the pipeline
- Handoff: Final delivery with acceptance criteria sign-off
What typically goes wrong: Validation reveals quality issues that require pipeline adjustments. This is why validation is a separate phase — it catches problems before handoff, not after.
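For a sense of what measurable quality metrics look like in code, here is a minimal scikit-learn sketch. The labels and samples are invented for illustration:

```python
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score

# Invented labels: gold-standard review vs. pipeline output on a validation sample.
gold     = ["clause_a", "clause_b", "clause_a", "clause_c", "clause_b"]
pipeline = ["clause_a", "clause_b", "clause_b", "clause_c", "clause_b"]

print("precision:", precision_score(gold, pipeline, average="macro", zero_division=0))
print("recall:   ", recall_score(gold, pipeline, average="macro", zero_division=0))

# Agreement: two annotators labeling the same sample (Cohen's kappa).
annotator_1 = ["clause_a", "clause_b", "clause_a", "clause_c", "clause_b"]
annotator_2 = ["clause_a", "clause_a", "clause_a", "clause_c", "clause_b"]
print("kappa:", cohen_kappa_score(annotator_1, annotator_2))
```

Low agreement between annotators often points to an ambiguous label schema rather than careless annotators, which is why agreement scores are worth tracking alongside precision and recall.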
What Drives Cost Up
Several factors push an engagement above $20K:
Multiple data sources. Each additional source system adds ingestion complexity, format handling, and integration testing. A second source means roughly 1.5x the work, not 2x, but it adds up.
Complex document types. Scanned PDFs with handwriting, multi-column layouts, embedded tables, or mixed languages require more sophisticated processing and more domain expert time.
Strict compliance requirements. HIPAA, ITAR, or EU AI Act compliance adds documentation overhead, access control configuration, audit trail setup, and often a compliance review step.
Air-gapped environments. Working in disconnected environments adds logistical overhead: software must be transferred physically, updates require sneakernet, and troubleshooting cannot rely on internet access.
Large data volumes. A 10,000-document pipeline is fundamentally different from a 500,000-document pipeline in terms of processing optimization, storage management, and validation sampling.
Undefined scope. If the engagement starts without clear goals, the discovery phase expands, build iterates more, and the timeline stretches. This is the most common cost driver and the most preventable.
What Drives Cost Down
Clean, structured source data. If your data is already in a database with consistent schemas, the ingestion and cleaning phases shrink dramatically.
Clear scope. An organization that knows exactly what it wants — "We need 50,000 contract clauses labeled with 12 categories in JSONL format" — eliminates days of scoping conversations.
Available domain experts. When your subject matter experts can commit dedicated time during the engagement, feedback loops tighten and the build phase moves faster.
Standard formats. If your output format is standard JSONL or Parquet and your source data is common (PDFs, CSVs, standard databases), less custom engineering is needed.
Existing infrastructure. If your compute environment is already set up with the necessary dependencies, environment setup time drops from days to hours.
How Payment Typically Works
Most engagements at this level follow a milestone-based payment structure:
- 30% at engagement start — covers discovery and setup
- 40% at build milestone — triggered when the pipeline is functional and processing data
- 30% at handoff — triggered when validation is complete and your team has been trained
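On a $15K engagement, that works out to $4,500 at kickoff, $6,000 when the pipeline is functional, and $4,500 at handoff.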
Some vendors offer project-based fixed pricing; others bill time-and-materials (T&M). Fixed pricing gives you cost certainty but less flexibility. T&M gives flexibility but requires trust and clear scope boundaries.
What $10K–$20K Does Not Buy
Setting expectations matters as much as describing what is included:
- It does not buy a full data platform. This is a pipeline for one use case, not an enterprise data infrastructure.
- It does not include model training. Data preparation and model training are separate disciplines. Some vendors bundle them; at this price point, most do not.
- It does not include ongoing operations. The engagement delivers a working pipeline and trains your team. Running it day-to-day is your responsibility, though many vendors offer support contracts.
- It does not guarantee model performance. Data preparation improves the probability of good model performance. It does not guarantee it. If someone promises that, ask harder questions.
Is It Worth It?
The honest answer: it depends on the alternative.
If your ML team is spending 3+ months manually preparing data, and an engineer's fully loaded cost is $15K/month, that is $45K or more of internal effort. A $15K engagement that delivers a working pipeline in 4 weeks pays for itself immediately.
If your data is already clean and structured, and your team has the skills to build the pipeline themselves, the engagement may not make sense. Not every organization needs external help.
The question is not "is $10K–$20K a lot of money?" It is "what is the cost of not doing this?" Delayed model training, stalled AI initiatives, or an ML team spending its time on data janitorial work instead of model development — those costs add up faster than most organizations realize.
Next Steps
If you are scoping an AI data preparation engagement and want a transparent conversation about what it would take for your specific situation, book a discovery call with Ertas. The call is 30 minutes, there is no pitch, and we will tell you honestly whether a $10K–$20K engagement fits your needs — or whether you need more, less, or something different entirely.