
The First 30 Days of an Enterprise AI Data Pipeline Build
A behind-the-scenes weekly walkthrough of building an enterprise AI data pipeline: what happens, what goes wrong, and what good looks like at each stage.
Enterprise AI timelines look clean on slide decks. Neat phases, smooth arrows, everything landing on schedule. Reality is messier. Data is in formats no one documented. Access provisioning takes a week instead of a day. The domain expert who was supposed to be available full-time is available for two hours on Thursday.
This is a week-by-week walkthrough of what actually happens during the first 30 days of building an enterprise AI data pipeline. Not the idealized version — the real one, including the parts that go sideways and how to handle them.
Week 1: Data Audit and Scoping
What Is Supposed to Happen
The forward deployed engineer arrives (physically or virtually), meets the team, and conducts a thorough audit of the data landscape. The goal is to answer three questions: What data exists? Where does it live? What shape is it in?
Simultaneously, the engineer works with IT to get environment access: compute resources, network credentials, storage access, security clearances.
What Actually Happens
Day 1-2: Introductions and access requests. The engineer meets stakeholders, gets a tour of the systems, and submits access requests. In most enterprises, this is where the first delay happens. Security reviews, background checks, and access provisioning in regulated environments can take 2-5 business days. Good engagements anticipate this and start access requests during the pre-engagement phase.
Day 3-4: Data discovery. The engineer starts mapping data sources. This is almost always surprising. The data the client described during the sales process is a subset of what actually exists. There are additional databases, legacy file shares, exports from systems that were decommissioned three years ago, and spreadsheets on someone's desktop that contain critical reference data.
Common discoveries:
- Data is spread across more systems than anyone realized
- File formats are inconsistent even within a single source
- Metadata is incomplete or unreliable
- Data volume is 3-10x what was estimated
- Some data is in formats that require specialized parsers (scanned PDFs, proprietary database exports, mainframe extracts)
Day 5: Scope adjustment. The original scope, defined during the sales process based on the client's description, is revised based on what the data audit actually found. This is not scope creep — it is scope correction. The work was always this size; the estimate just did not know it yet.
What Goes Wrong
Access delays are the most common Week 1 problem. If the engineer cannot access the data systems, everything stalls. The mitigation is starting access provisioning before the engagement officially begins.
The second most common issue: the primary stakeholder (the person who bought the engagement) has a different understanding of the data than the people who actually work with it. The stakeholder says, "Our contracts are all in a SharePoint folder." The contracts team says, "Well, the recent ones are in SharePoint. The ones from before 2022 are in the old document management system. And the amendments are in email."
Week 2: Pipeline Architecture and Ingestion Testing
What Is Supposed to Happen
Based on the Week 1 audit, the engineer designs the pipeline architecture and begins building the ingestion layer — the part that pulls data from source systems into the preparation environment.
What Actually Happens
Day 6-7: Architecture design. The engineer maps out the pipeline: source connectors, transformation steps, labeling workflow, export format. This is reviewed with the client's technical team. Architecture decisions made this week — where to process, how to handle errors, what to log — determine the pipeline's long-term maintainability.
Day 8-9: Ingestion build and testing. The first connectors are built and tested. This is where data format issues become concrete. A PDF parser works on 90% of documents but fails on the 10% that are scanned images. A database connector pulls records successfully, but the timestamps arrive in three different formats. A CSV export has embedded newlines that break the parser.
Each of these issues is solvable. But each one takes time, and they compound. An engineer who has done this before will not be surprised. An engineer doing it for the first time will underestimate the effort by 2-3x.
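As a concrete illustration, here is a minimal sketch of how an ingestion step might reconcile the mismatched timestamps above, assuming the source formats have already been identified during testing. The format strings and function name are illustrative, not a prescription:

```python
from datetime import datetime
from typing import Optional

# Illustrative source formats; the real ones are discovered during Week 2 testing.
KNOWN_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",   # ISO-style export from a modern API
    "%d/%m/%Y %H:%M",      # legacy database report
    "%m-%d-%y",            # spreadsheet extract, date only
]

def normalize_timestamp(raw: str) -> Optional[datetime]:
    """Try each known source format; return None so the record is flagged
    for review instead of silently dropped."""
    value = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None  # unparseable -> route to an exceptions queue

if __name__ == "__main__":
    for sample in ["2024-03-01T09:30:00", "01/03/2024 09:30", "03-01-24", "TBD"]:
        print(sample, "->", normalize_timestamp(sample))
```

The important design choice is the explicit None return: records that match no known format stay visible as exceptions rather than quietly disappearing from the dataset.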
Day 10: First end-to-end data flow. By the end of Week 2, raw data should be flowing from at least one source system into the preparation environment. It will not be clean. It will not be labeled. But it will be moving, and that is the foundation everything else builds on.
What Goes Wrong
Integration issues with legacy systems are the most common Week 2 problem. Modern APIs are predictable. Legacy database exports, proprietary file formats, and systems with no documentation are not. Budget extra time if your data lives in systems older than 10 years.
Performance can also surprise. A pipeline that processes 100 test documents in seconds may choke on 100,000 production documents. Week 2 is where these bottlenecks become visible.
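One common way to avoid that surprise is to process documents in fixed-size batches rather than loading the whole corpus into memory. A minimal sketch, assuming documents are identified by paths; the names and batch size are placeholders to be tuned against the real workload:

```python
from itertools import islice
from typing import Iterable, Iterator, List

def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield fixed-size batches so the pipeline never holds the full
    document set in memory at once."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

if __name__ == "__main__":
    # Stand-in for 100,000 production document paths.
    paths = (f"doc_{i}.pdf" for i in range(100_000))
    for batch in batched(paths, 500):
        pass  # in the real pipeline: parse, clean, and write this batch
    print("Processed in batches of 500 without loading everything at once.")
```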
Week 3: Cleaning Rules, Label Schema, and Domain Expert Onboarding
What Is Supposed to Happen
With data flowing into the pipeline, the focus shifts to transformation: cleaning rules that standardize data, and a label schema that domain experts will use to annotate data for model training.
What Actually Happens
Day 11-13: Cleaning and transformation rules. The engineer builds the rules that clean raw data: deduplication, normalization, handling missing values, format standardization, PII detection and redaction (if applicable). This is iterative — rules are written, tested against sample data, refined, and tested again.
The key insight: cleaning rules encode domain knowledge. A rule that says "if the date field contains 'TBD', treat it as null" is a domain decision, not a technical one. This is why domain experts need to be involved, not just engineers.
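Here is a sketch of what encoding that kind of domain decision can look like in code. The field names, sentinel values, and business key are placeholders; the real rules come out of the expert review sessions:

```python
from typing import Optional

# Placeholder sentinel values agreed with domain experts during review.
NULL_SENTINELS = {"TBD", "N/A", "UNKNOWN", ""}

def clean_date_field(raw: Optional[str]) -> Optional[str]:
    """Domain decision encoded as a rule: placeholder values mean 'no date',
    so they become None rather than being passed through as text."""
    if raw is None:
        return None
    if raw.strip().upper() in NULL_SENTINELS:
        return None
    return raw.strip()

def dedupe_records(records: list[dict]) -> list[dict]:
    """Keep the first occurrence per business key ('contract_id' is illustrative)."""
    seen: set[str] = set()
    unique = []
    for rec in records:
        key = rec.get("contract_id", "")
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

if __name__ == "__main__":
    print(clean_date_field("TBD"))         # None
    print(clean_date_field("2024-06-01"))  # "2024-06-01"
```

Because the rules live in code, domain experts can review them, the engineer can test them against sample data, and both can revise them as Week 3 iteration uncovers new edge cases.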
Day 14-15: Label schema design. The engineer works with domain experts to define the label schema — the categories, tags, or annotations that will be applied to the data. This is the most intellectually demanding part of the engagement.
A good label schema is:
- Exhaustive (covers all cases in the data)
- Mutually exclusive (no ambiguous overlaps)
- Practical (annotators can apply labels consistently)
- Aligned with the downstream model task
A bad label schema is obvious in retrospect but invisible during design. "Contract type" seems like a clear label until you encounter a document that is both an amendment and a renewal. "Severity" seems straightforward until two domain experts disagree on whether a finding is "moderate" or "high."
Day 15 continued: Domain expert onboarding. Domain experts are trained on the labeling interface and the label schema. They label a sample set. Inter-annotator agreement is measured. If agreement is low, the schema needs revision.
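Measuring agreement does not require custom tooling. A minimal sketch using scikit-learn's cohen_kappa_score, with made-up labels standing in for the engagement's real schema:

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned to the same sample documents by two domain experts.
# The label names are placeholders, not a recommended contract taxonomy.
expert_a = ["amendment", "renewal", "nda", "amendment", "renewal", "nda"]
expert_b = ["amendment", "renewal", "nda", "renewal",   "renewal", "nda"]

kappa = cohen_kappa_score(expert_a, expert_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Rule of thumb used later in this article: kappa below ~0.7 suggests the
# schema or the annotation guidelines need revision before full labeling.
if kappa < 0.7:
    print("Agreement too low -> revise the schema or clarify the guidelines")
```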
What Goes Wrong
Domain expert availability is the critical risk in Week 3. If the domain experts are too busy to participate, the label schema will be designed by engineers who do not understand the domain, and the resulting training data will be mediocre.
The other common issue: label schema disagreement. Different domain experts have different mental models. A senior attorney classifies contracts differently than a junior attorney. A cardiologist and a radiologist interpret the same imaging report differently. Resolving these disagreements takes time and diplomacy.
Week 4: Validation, Quality Metrics, and Compliance Setup
What Is Supposed to Happen
The pipeline is complete. Week 4 is about testing, measuring, and documenting.
What Actually Happens
Day 16-18: Pipeline validation. The full pipeline runs end-to-end with production-scale data. Quality metrics are measured:
- Ingestion completeness: What percentage of source records were successfully ingested?
- Cleaning accuracy: What percentage of transformations produced correct results?
- Label quality: What is the inter-annotator agreement? What is the precision/recall on a gold-standard sample?
- Export integrity: Does the output format match the specification? Can the downstream ML framework ingest it without errors?
Targets vary by use case, but typical benchmarks: 99%+ ingestion completeness, 95%+ cleaning accuracy, 85%+ inter-annotator agreement (Cohen's kappa > 0.7).
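A sketch of how two of those metrics might be computed from the validation run; the counts and gold-standard labels below are made up for illustration:

```python
from sklearn.metrics import precision_score, recall_score

# Ingestion completeness: ingested records versus records in the source systems.
source_record_count = 104_500      # illustrative count from source-system queries
ingested_record_count = 103_900    # illustrative count from pipeline logs
completeness = ingested_record_count / source_record_count
print(f"Ingestion completeness: {completeness:.1%}")  # target: 99%+

# Label quality against a gold-standard sample reviewed by a senior expert.
gold      = ["high", "moderate", "high", "low", "moderate", "high"]
predicted = ["high", "moderate", "moderate", "low", "moderate", "high"]
print("Precision (macro):", round(precision_score(gold, predicted, average="macro"), 2))
print("Recall (macro):   ", round(recall_score(gold, predicted, average="macro"), 2))
```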
Day 19: Compliance documentation. For regulated industries, the pipeline's audit trail is reviewed and documented: data lineage reports, access logs, transformation histories, PII handling records. This documentation is the deliverable that compliance teams care about most — and it is the one most often skipped or rushed.
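For a sense of scale, a single entry in that transformation history can be as simple as the sketch below. The field names are illustrative, not a statement of what Article 30 or any specific regulator requires:

```python
import json
from datetime import datetime, timezone

def lineage_record(source_id: str, step: str, detail: str) -> dict:
    """One entry in the transformation history: which source record was
    touched, by which pipeline step, and when."""
    return {
        "source_id": source_id,
        "step": step,
        "detail": detail,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    entry = lineage_record("contract-00042", "pii_redaction",
                           "redacted 2 email addresses")
    print(json.dumps(entry, indent=2))
```

Writing these records as the pipeline runs is far cheaper than reconstructing lineage afterwards, which is why skipping this documentation in Week 4 tends to be expensive later.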
Day 20: Handoff and training. The engineer conducts a structured handoff: pipeline walkthrough, configuration documentation, maintenance procedures, troubleshooting guide. The client's technical team should be able to run, monitor, and modify the pipeline independently after this session.
What Goes Wrong
Validation often reveals issues that require pipeline modifications. A cleaning rule that worked on sample data produces incorrect results at scale. A label category that seemed clear during design is ambiguous in practice. The export format has an edge case the downstream framework does not handle.
This is not failure. This is what validation is for. The risk is not that issues surface — it is that Week 4 does not have enough buffer to address them. Good engagements plan for at least 2 days of rework in Week 4.
After Day 30
The pipeline is live. Your team owns it. The vendor's engineer is available for 30-60 days of remote support, but the system is yours to operate.
The first 30 days are the hardest. The data surprises are behind you. The integration issues are resolved. The domain experts know how to use the system. From here, the work shifts from building to operating — running the pipeline, monitoring quality, and extending it to new data sources or use cases as needs evolve.
Planning Your First 30 Days
If you are preparing for an AI data pipeline build and want to understand what the first month looks like for your specific data and environment, book a discovery call with Ertas. We will walk through your data landscape, flag likely Week 1 surprises, and give you a realistic timeline — not the slide deck version.