
What to Expect from a $10K–$20K AI Data Prep Engagement
Transparent breakdown of what a $10K–$20K AI data preparation engagement includes: scope, timeline, deliverables, and what drives cost up or down.
Enterprise AI pricing is opaque by design. Most vendors want to get you on a call before they discuss numbers. By the time you learn the price, you have already invested hours in demos and discovery sessions, and the sunk cost makes it harder to walk away.
We think that is backward. If you are budgeting for an AI data preparation engagement, you should know what $10K–$20K buys before you pick up the phone. This post is a transparent breakdown of what a typical engagement at this price point includes, how the work is structured, and what factors push the cost higher or lower.
What This Price Point Covers
A $10K–$20K engagement is scoped for a single data pipeline — one primary data source, one target output format, one use case. It is not an enterprise-wide data transformation. It is a focused, high-value engagement designed to take one specific dataset from raw to AI-ready.
Typical deliverables:
- A working data pipeline on your infrastructure
- Ingestion from your source system (database, file share, document management system)
- Cleaning and transformation rules tailored to your data
- Label schema designed with your domain experts
- Quality validation with measurable metrics
- Export in your required training format (JSONL, Parquet, COCO, etc.; see the JSONL sketch below)
- Documentation and team training for pipeline maintenance
- 30 days of post-engagement support
What it does not typically include at this price point: multi-source data integration, model training, ongoing managed services, or hardware procurement.
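To make the export deliverable concrete, here is a minimal sketch of the JSONL step. The records and field names are illustrative, not a fixed schema; in a real engagement the structure falls out of the label schema designed with your domain experts.

```python
import json

# Illustrative labeled records; the real field names come from your label schema.
records = [
    {"text": "Lessee shall maintain the premises in good repair.", "label": "maintenance_obligation"},
    {"text": "Either party may terminate with 30 days written notice.", "label": "termination_clause"},
]

# JSONL: one JSON object per line, a format most training frameworks ingest directly.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```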
The Timeline
Most engagements at this level run 4–6 weeks. Here is how the time typically breaks down:
Week 1: Discovery (~$2K–$3K of effort)
This is where the engagement either succeeds or fails. Discovery week is about understanding what you actually have, not what you think you have.
What happens:
- Data audit: What data exists, where it lives, what format it is in, how much there is (see the inventory sketch below)
- Environment setup: Access to your infrastructure, security credentials, network configuration
- Stakeholder interviews: Domain experts explain how the data is used, what matters, what does not
- Scope confirmation: The engagement scope is refined based on what the data audit reveals
What typically goes wrong: The data is in worse shape than expected. Source systems are undocumented. Access provisioning takes longer than planned. This is normal — discovery exists precisely to surface these issues before build starts.
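For a flavor of what the audit's first pass looks like, here is a minimal sketch that inventories a file share by format and size. The mount path is a placeholder, and a real audit also covers databases and document management systems, not just files:

```python
from collections import Counter
from pathlib import Path

# Placeholder mount point for the file share under audit.
root = Path("/mnt/fileshare")

format_counts = Counter()
total_bytes = 0
for path in root.rglob("*"):
    if path.is_file():
        format_counts[path.suffix.lower() or "(no extension)"] += 1
        total_bytes += path.stat().st_size

for suffix, count in format_counts.most_common():
    print(f"{suffix}: {count} files")
print(f"total: {total_bytes / 1e9:.1f} GB")
```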
Weeks 2–3: Pipeline Build (~$5K–$9K of effort)
The core engineering work. An engineer (or a pair of engineers for larger scopes) builds the pipeline on your infrastructure.
What happens:
- Ingestion pipeline: Connectors to your source systems, handling edge cases in data formats
- Cleaning rules: Deduplication, normalization, handling missing values, format standardization (sketched below)
- Labeling workflow: Label schema creation, annotation interface setup, domain expert onboarding
- Transformation logic: Converting raw data into the structure your ML pipeline needs
- Iterative review: Domain experts review output samples, provide feedback, refine rules
What typically goes wrong: Edge cases in data that were not visible during discovery. A document type that accounts for 5% of volume but 50% of complexity. Integration issues with legacy systems. Good engineers plan buffer time for this.
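To give the cleaning rules some shape, here is a minimal pandas sketch. The file and column names are assumptions for illustration; the actual rules are tailored to your data through the iterative review loop:

```python
import pandas as pd

# Assumed file and column names, for illustration only.
df = pd.read_csv("raw_contracts.csv")

# Deduplication: exact duplicates first, then duplicates on a business key.
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["contract_id"], keep="last")

# Normalization: consistent whitespace and casing in free-text fields.
df["title"] = df["title"].str.strip().str.lower()

# Missing values: drop rows missing required fields; coerce bad dates to NaT
# so they can be reviewed rather than silently breaking downstream steps.
df = df.dropna(subset=["contract_id", "body"])
df["effective_date"] = pd.to_datetime(df["effective_date"], errors="coerce")

df.to_parquet("clean_contracts.parquet", index=False)
```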
Week 4: Validation and Handoff (~$2K–$4K of effort)
The pipeline is tested, validated, and transferred to your team.
What happens:
- Quality metrics: Precision, recall, and agreement scores on labeled data (sketched below)
- Pipeline testing: End-to-end runs with production data volumes
- Documentation: Pipeline architecture, configuration, maintenance procedures
- Team training: Your engineers learn how to operate, modify, and extend the pipeline
- Handoff: Final delivery with acceptance criteria sign-off
What typically goes wrong: Validation reveals quality issues that require pipeline adjustments. This is why validation is a separate phase — it catches problems before handoff, not after.
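For a sense of what measurable quality metrics look like in code, here is a minimal scikit-learn sketch. The labels and samples are invented for illustration:

```python
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score

# Invented labels: gold-standard review vs. pipeline output on a validation sample.
gold     = ["clause_a", "clause_b", "clause_a", "clause_c", "clause_b"]
pipeline = ["clause_a", "clause_b", "clause_b", "clause_c", "clause_b"]

print("precision:", precision_score(gold, pipeline, average="macro", zero_division=0))
print("recall:   ", recall_score(gold, pipeline, average="macro", zero_division=0))

# Agreement: two annotators labeling the same sample (Cohen's kappa).
annotator_1 = ["clause_a", "clause_b", "clause_a", "clause_c", "clause_b"]
annotator_2 = ["clause_a", "clause_a", "clause_a", "clause_c", "clause_b"]
print("kappa:", cohen_kappa_score(annotator_1, annotator_2))
```

Low agreement between annotators often points to an ambiguous label schema rather than careless annotators, which is why agreement scores are worth tracking alongside precision and recall.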
What Drives Cost Up
Several factors push an engagement above $20K:
Multiple data sources. Each additional source system adds ingestion complexity, format handling, and integration testing. A second source means roughly 1.5x the work, not 2x, but it adds up.
Complex document types. Scanned PDFs with handwriting, multi-column layouts, embedded tables, or mixed languages require more sophisticated processing and more domain expert time.
Strict compliance requirements. HIPAA, ITAR, or EU AI Act compliance adds documentation overhead, access control configuration, audit trail setup, and often a compliance review step.
Air-gapped environments. Working in disconnected environments adds logistical overhead: software must be transferred physically, updates require sneakernet, and troubleshooting cannot rely on internet access.
Large data volumes. A 10,000-document pipeline is fundamentally different from a 500,000-document pipeline in terms of processing optimization, storage management, and validation sampling.
Undefined scope. If the engagement starts without clear goals, the discovery phase expands, build iterates more, and the timeline stretches. This is the most common cost driver and the most preventable.
What Drives Cost Down
Clean, structured source data. If your data is already in a database with consistent schemas, the ingestion and cleaning phases shrink dramatically.
Clear scope. An organization that knows exactly what it wants — "We need 50,000 contract clauses labeled with 12 categories in JSONL format" — eliminates days of scoping conversations.
Available domain experts. When your subject matter experts can commit dedicated time during the engagement, feedback loops tighten and the build phase moves faster.
Standard formats. If your output format is standard JSONL or Parquet and your source data is common (PDFs, CSVs, standard databases), less custom engineering is needed.
Existing infrastructure. If your compute environment is already set up with the necessary dependencies, environment setup time drops from days to hours.
How Payment Typically Works
Most engagements at this level follow a milestone-based payment structure:
- 30% at engagement start — covers discovery and setup
- 40% at build milestone — triggered when the pipeline is functional and processing data
- 30% at handoff — triggered when validation is complete and your team has been trained
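On a $15K engagement, that works out to $4,500 at kickoff, $6,000 when the pipeline is functional, and $4,500 at handoff.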
Some vendors offer project-based fixed pricing; others bill time-and-materials (T&M). Fixed pricing gives you cost certainty but less flexibility. T&M gives flexibility but requires trust and clear scope boundaries.
What $10K–$20K Does Not Buy
Setting expectations matters as much as describing what is included:
- It does not buy a full data platform. This is a pipeline for one use case, not an enterprise data infrastructure.
- It does not include model training. Data preparation and model training are separate disciplines. Some vendors bundle them; at this price point, most do not.
- It does not include ongoing operations. The engagement delivers a working pipeline and trains your team. Running it day-to-day is your responsibility, though many vendors offer support contracts.
- It does not guarantee model performance. Data preparation improves the probability of good model performance. It does not guarantee it. If someone promises that, ask harder questions.
Is It Worth It?
The honest answer: it depends on the alternative.
If your ML team is spending 3+ months manually preparing data, and an engineer's fully loaded cost is $15K/month, that is $45K or more of internal effort. A $15K engagement that delivers a working pipeline in 4 weeks pays for itself immediately.
If your data is already clean and structured, and your team has the skills to build the pipeline themselves, the engagement may not make sense. Not every organization needs external help.
The question is not "is $10K–$20K a lot of money?" It is "what is the cost of not doing this?" Delayed model training, stalled AI initiatives, or an ML team spending its time on data janitorial work instead of model development — those costs add up faster than most organizations realize.
Next Steps
If you are scoping an AI data preparation engagement and want a transparent conversation about what it would take for your specific situation, book a discovery call with Ertas. The call is 30 minutes, there is no pitch, and we will tell you honestly whether a $10K–$20K engagement fits your needs — or whether you need more, less, or something different entirely.