
How to Scope an AI Data Preparation Project (RFP Template)
A practical RFP template for AI data preparation projects with section-by-section guidance on what to include and how to write requirements that get useful vendor responses.
A weak RFP gets weak responses. When enterprises issue requests for proposals for AI data preparation, the documents are often either too vague ("We need help with our data") or too rigid ("Must support exactly these 47 features"), and neither approach produces responses that help you make a good decision.
A strong RFP describes your situation accurately, states your requirements clearly, and gives vendors enough context to propose a realistic solution — including telling you what you have not thought of. This template is designed for AI data preparation specifically, not generic IT procurement. Use it as a starting point and adapt it to your organization.
Section 1: Project Overview
This section tells the vendor what you are trying to accomplish and why. It is not a requirements list — it is context.
What to include:
- Organization context: Your industry, size, and relevant regulatory environment (without disclosing sensitive details)
- AI program status: Are you starting from scratch, or do you have an existing ML pipeline that needs better data?
- Business objective: What will the prepared data be used for? Model training, fine-tuning, RAG, analytics?
- Why now: What triggered this project? A new compliance requirement, a failed pilot, a strategic AI initiative?
Strong example:
"We are a mid-size insurance company (2,000 employees, 15 states) preparing training data for a claims processing model. We have 8 years of historical claims documents in mixed formats. Our initial pilot using raw data produced unacceptable model accuracy. We need a structured data preparation pipeline to clean, label, and format this data for fine-tuning."
Vague example to avoid:
"We are looking for an AI data preparation solution to support our digital transformation journey."
The vague version tells the vendor nothing useful. They will either decline to respond or submit a generic proposal that does not address your needs.
Section 2: Data Description
The most important section. Vendors cannot scope work they do not understand, and the accuracy of their proposal depends entirely on how well you describe your data.
What to include:
- Data types: Documents (PDF, Word, scans), structured data (databases, spreadsheets), images, audio, video, or mixed
- Volume: Number of records, documents, or files. Total storage size. Growth rate.
- Current format: What format is the data in today? Be specific — "PDF" is better than "documents," and "scanned PDFs with OCR text layer" is better than "PDF"
- Quality assessment: Is the data clean or messy? Are there duplicates, missing fields, inconsistent formats? If you have done a data audit, include the findings.
- Source systems: Where does the data live? What databases, file shares, or applications?
- Sensitive data: Does the data contain PII, PHI, financial records, or classified information?
- Sample availability: Can you provide a representative sample to vendors during the evaluation? (This dramatically improves proposal quality.)
Strong example:
"Source data: ~120,000 insurance claims documents. 60% are typed PDFs, 30% are scanned PDFs (variable OCR quality), 10% are Word documents. Documents range from 1-50 pages. Data is stored in a SharePoint document library and an on-premise SQL Server database. Documents contain policyholder PII including names, addresses, SSNs, and medical information. We can provide a 500-document anonymized sample for evaluation."
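If you have not yet audited your data, even a rough file inventory produces the numbers this section asks for. A minimal sketch in Python, assuming the data sits on a file share (your sources may instead be SharePoint or a database, which need their own APIs):

```python
from collections import Counter
from pathlib import Path

def inventory(root: str) -> dict:
    """Summarize file counts, types, and total size under a directory.

    A rough sketch of the volume/format figures an RFP data
    description needs; the output shape is illustrative.
    """
    counts = Counter()
    bytes_by_ext = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(none)"
            counts[ext] += 1
            bytes_by_ext[ext] += path.stat().st_size
    total_files = sum(counts.values())
    return {
        "files": total_files,
        "total_bytes": sum(bytes_by_ext.values()),
        # Share of each file type, e.g. ".pdf": {"count": 72000, "share": 0.6}
        "by_type": {
            ext: {"count": n, "share": round(n / total_files, 3)}
            for ext, n in counts.most_common()
        },
    }
```

Running this over your document stores gives you the "60% typed PDFs, 30% scanned PDFs" breakdown in the example above, rather than a guess.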
Section 3: Compliance Requirements
State your compliance requirements explicitly. Do not assume the vendor will ask.
What to include:
- Regulatory frameworks: HIPAA, GDPR, EU AI Act, SOC 2, ITAR, FedRAMP, industry-specific regulations
- Data handling restrictions: Can data leave your network? Can it be processed in the cloud? Are there data residency requirements?
- Audit requirements: Do you need a full audit trail of every data transformation? Data lineage reports? Access logs?
- Access control requirements: Role-based access, SSO/LDAP integration, multi-factor authentication?
- Documentation requirements: What compliance documentation do you need the vendor to produce?
Strong example:
"Data is subject to HIPAA. All processing must occur on our infrastructure — no data egress to vendor systems or cloud environments. Full audit trail required for every transformation, including data lineage from source document to training record. Role-based access control with Active Directory integration. Vendor must produce a HIPAA compliance attestation for the data preparation workflow."
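Audit-trail requirements like the ones above are easier to write precisely if you can picture what one lineage record looks like. A hedged sketch in Python — the field names (`source_id`, `operation`, `actor`) are illustrative assumptions, not a standard; your auditors and compliance framework determine what must actually be captured:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    """One step in the lineage from source document to training record."""
    source_id: str   # e.g. a SharePoint document ID
    output_id: str   # e.g. the resulting training-record ID
    operation: str   # e.g. "ocr", "redact_pii", "label"
    actor: str       # user or service account that ran the step
    timestamp: str   # ISO 8601, UTC

def record_event(source_id: str, output_id: str,
                 operation: str, actor: str) -> dict:
    """Build one append-only audit entry for a single transformation."""
    ts = datetime.now(timezone.utc).isoformat()
    return asdict(LineageEvent(source_id, output_id, operation, actor, ts))
```

An RFP that says "every transformation must emit a record with at least these fields" gives vendors something concrete to cost, instead of a vague "full audit trail."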
Vague example to avoid:
"Must be compliant with applicable regulations."
This tells the vendor nothing. They will either assume minimal compliance or over-scope to cover every possibility.
Section 4: Pipeline Requirements
Describe what the data preparation pipeline needs to do, not how it should work technically. Let the vendor propose the architecture.
What to include:
- Ingestion: What source systems need connectors? What formats need to be supported?
- Cleaning: What cleaning operations are needed? Deduplication, normalization, format standardization, PII redaction?
- Labeling: Do you need a labeling workflow? How many label categories? Will your domain experts perform labeling, or do you need the vendor to provide annotators?
- Transformation: What transformations are needed? Text extraction from documents, entity recognition, classification, structuring?
- Export: What output format does your ML pipeline require? JSONL, Parquet, COCO, custom schema?
- Quality requirements: What quality metrics do you expect? Accuracy targets, inter-annotator agreement thresholds, completeness requirements?
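To make the export requirement concrete: if your ML pipeline consumes JSONL, each prepared record becomes one JSON object per line. A minimal sketch — the field names (`doc_id`, `text`, `label`) are illustrative, not a required schema; match whatever your training framework expects:

```python
import json

def to_jsonl(records, path):
    """Write prepared records as JSON Lines: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def from_jsonl(path):
    """Read a JSONL file back into a list of records (for spot checks)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Including a sample output record like this in the RFP removes a whole category of back-and-forth during vendor scoping.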
Section 5: Deployment Constraints
Where and how the solution must run.
What to include:
- Deployment model: On-premise, cloud, hybrid, or air-gapped?
- Infrastructure: What compute and storage resources are available? Operating system, container support, GPU availability?
- Network: Is internet access available? What network restrictions exist?
- Integration: What systems does the pipeline need to integrate with? ML frameworks, data warehouses, monitoring tools?
- Scalability: What volume does the pipeline need to handle at launch? In 12 months?
Section 6: Integration Requirements
How the data preparation pipeline fits into your broader technology stack.
What to include:
- Upstream systems: Where does source data come from? How is it updated? Batch or real-time?
- Downstream systems: Where does prepared data go? What ML frameworks or platforms will consume it?
- Existing tools: What data tools do you already use? Is the vendor expected to replace or complement them?
- APIs: Do you need API access for programmatic pipeline control?
Section 7: Timeline and Milestones
Be realistic. Aggressive timelines lead to cut corners.
What to include:
- Overall timeline: When does the pipeline need to be operational?
- Key milestones: Discovery, build, validation, handoff — what dates matter?
- Dependencies: What external factors could affect the timeline? (IT provisioning, domain expert availability, compliance reviews)
- Phasing: Is a phased approach acceptable? Pipeline v1 with core functionality, then iteration?
Strong example:
"Pipeline operational within 8 weeks of engagement start. Milestones: data audit complete by Week 2, ingestion pipeline functional by Week 4, labeling workflow operational by Week 6, full pipeline validated and handed off by Week 8. Domain experts available 10 hours/week during Weeks 3-7. IT provisioning will be completed before engagement start."
Section 8: Evaluation Criteria
Tell vendors how you will evaluate their proposals. This shapes the quality of responses.
What to include and suggested weights:
- Technical approach (30%): Does the proposed solution address the pipeline requirements? Is the architecture sound?
- Deployment model fit (20%): Does the solution work within your infrastructure and compliance constraints?
- Implementation plan (20%): Is the timeline realistic? Are milestones concrete? Is the team qualified?
- Pricing (15%): Is pricing transparent? Is the total cost of ownership clear, including implementation?
- References (15%): Has the vendor done similar work in your industry? Can they provide references?
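Stated weights turn vendor comparison into simple arithmetic: score each criterion, multiply by its weight, and sum. A sketch using the suggested weights above (the criterion keys and the 0-10 scale are illustrative; adjust the weights to your priorities, but keep them summing to 1.0 so totals stay comparable):

```python
WEIGHTS = {
    "technical_approach":  0.30,
    "deployment_fit":      0.20,
    "implementation_plan": 0.20,
    "pricing":             0.15,
    "references":          0.15,
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (0-10 scale) into one weighted total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"unscored criteria: {sorted(missing)}")
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 2)
```

Publishing the formula in the RFP also keeps your internal evaluation honest: every proposal gets scored the same way.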
What Makes an RFP Strong vs. Weak
Strong RFPs:
- Describe actual data with specifics (types, volumes, formats, quality)
- State compliance requirements explicitly
- Provide evaluation criteria and weights
- Offer a data sample for vendors to assess
- Include realistic timelines with acknowledged dependencies
- Separate must-haves from nice-to-haves
Weak RFPs:
- Describe data vaguely or not at all
- List features without context on how they will be used
- Omit compliance requirements or use generic language
- Demand unrealistic timelines without acknowledging constraints
- Provide no evaluation criteria, making vendor responses a guessing game
- Copy-paste from a generic IT procurement template
One More Thing
Before issuing the RFP, consider calling your top two or three vendors for a brief pre-RFP conversation. A 15-minute call where you describe the project and ask whether they are a fit will save everyone time. Some vendors will self-select out, and the ones who respond will provide better proposals because they understand the context.
If Ertas is on your evaluation list and you want to have that pre-RFP conversation, book a discovery call. We will tell you honestly whether the project fits our capabilities — and if it does not, we will say so before you spend time writing an RFP section for us.