
How to Scope an AI Data Preparation Project (RFP Template)
A practical RFP template for AI data preparation projects with section-by-section guidance on what to include and how to write requirements that get useful vendor responses.
A weak RFP gets weak responses. When enterprises issue requests for proposals for AI data preparation, the documents are often either too vague ("We need help with our data") or too rigid ("Must support exactly these 47 features"), and neither approach produces responses that help you make a good decision.
A strong RFP describes your situation accurately, states your requirements clearly, and gives vendors enough context to propose a realistic solution — including telling you what you have not thought of. This template is designed for AI data preparation specifically, not generic IT procurement. Use it as a starting point and adapt it to your organization.
Section 1: Project Overview
This section tells the vendor what you are trying to accomplish and why. It is not a requirements list — it is context.
What to include:
- Organization context: Your industry, size, and relevant regulatory environment (without disclosing sensitive details)
- AI program status: Are you starting from scratch, or do you have an existing ML pipeline that needs better data?
- Business objective: What will the prepared data be used for? Model training, fine-tuning, RAG, analytics?
- Why now: What triggered this project? A new compliance requirement, a failed pilot, a strategic AI initiative?
Strong example:
"We are a mid-size insurance company (2,000 employees, 15 states) preparing training data for a claims processing model. We have 8 years of historical claims documents in mixed formats. Our initial pilot using raw data produced unacceptable model accuracy. We need a structured data preparation pipeline to clean, label, and format this data for fine-tuning."
Vague example to avoid:
"We are looking for an AI data preparation solution to support our digital transformation journey."
The vague version tells the vendor nothing useful. They will either decline to respond or submit a generic proposal that does not address your needs.
Section 2: Data Description
The most important section. Vendors cannot scope work they do not understand, and the accuracy of their proposal depends entirely on how well you describe your data.
What to include:
- Data types: Documents (PDF, Word, scans), structured data (databases, spreadsheets), images, audio, video, or mixed
- Volume: Number of records, documents, or files. Total storage size. Growth rate.
- Current format: What format is the data in today? Be specific — "PDF" is better than "documents," and "scanned PDFs with OCR text layer" is better than "PDF"
- Quality assessment: Is the data clean or messy? Are there duplicates, missing fields, inconsistent formats? If you have done a data audit, include the findings.
- Source systems: Where does the data live? What databases, file shares, or applications?
- Sensitive data: Does the data contain PII, PHI, financial records, or classified information?
- Sample availability: Can you provide a representative sample to vendors during the evaluation? (This dramatically improves proposal quality.)
Strong example:
"Source data: ~120,000 insurance claims documents. 60% are typed PDFs, 30% are scanned PDFs (variable OCR quality), 10% are Word documents. Documents range from 1-50 pages. Data is stored in a SharePoint document library and an on-premise SQL Server database. Documents contain policyholder PII including names, addresses, SSNs, and medical information. We can provide a 500-document anonymized sample for evaluation."
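If you have not yet audited your data, even a rough file inventory produces the numbers this section asks for. A minimal sketch in Python, assuming the data sits on a file share (your sources may instead be SharePoint or a database, which need their own APIs):

```python
from collections import Counter
from pathlib import Path

def inventory(root: str) -> dict:
    """Summarize file counts, types, and total size under a directory.

    A rough sketch of the volume/format figures an RFP data
    description needs; the output shape is illustrative.
    """
    counts = Counter()
    bytes_by_ext = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(none)"
            counts[ext] += 1
            bytes_by_ext[ext] += path.stat().st_size
    total_files = sum(counts.values())
    return {
        "files": total_files,
        "total_bytes": sum(bytes_by_ext.values()),
        # Share of each file type, e.g. ".pdf": {"count": 72000, "share": 0.6}
        "by_type": {
            ext: {"count": n, "share": round(n / total_files, 3)}
            for ext, n in counts.most_common()
        },
    }
```

Running this over your document stores gives you the "60% typed PDFs, 30% scanned PDFs" breakdown in the example above, rather than a guess.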
Section 3: Compliance Requirements
State your compliance requirements explicitly. Do not assume the vendor will ask.
What to include:
- Regulatory frameworks: HIPAA, GDPR, EU AI Act, SOC 2, ITAR, FedRAMP, industry-specific regulations
- Data handling restrictions: Can data leave your network? Can it be processed in the cloud? Are there data residency requirements?
- Audit requirements: Do you need a full audit trail of every data transformation? Data lineage reports? Access logs?
- Access control requirements: Role-based access, SSO/LDAP integration, multi-factor authentication?
- Documentation requirements: What compliance documentation do you need the vendor to produce?
Strong example:
"Data is subject to HIPAA. All processing must occur on our infrastructure — no data egress to vendor systems or cloud environments. Full audit trail required for every transformation, including data lineage from source document to training record. Role-based access control with Active Directory integration. Vendor must produce a HIPAA compliance attestation for the data preparation workflow."
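Audit-trail requirements like the ones above are easier to write precisely if you can picture what one lineage record looks like. A hedged sketch in Python — the field names (`source_id`, `operation`, `actor`) are illustrative assumptions, not a standard; your auditors and compliance framework determine what must actually be captured:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    """One step in the lineage from source document to training record."""
    source_id: str   # e.g. a SharePoint document ID
    output_id: str   # e.g. the resulting training-record ID
    operation: str   # e.g. "ocr", "redact_pii", "label"
    actor: str       # user or service account that ran the step
    timestamp: str   # ISO 8601, UTC

def record_event(source_id: str, output_id: str,
                 operation: str, actor: str) -> dict:
    """Build one append-only audit entry for a single transformation."""
    ts = datetime.now(timezone.utc).isoformat()
    return asdict(LineageEvent(source_id, output_id, operation, actor, ts))
```

An RFP that says "every transformation must emit a record with at least these fields" gives vendors something concrete to cost, instead of a vague "full audit trail."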
Vague example to avoid:
"Must be compliant with applicable regulations."
This tells the vendor nothing. They will either assume minimal compliance or over-scope to cover every possibility.
Section 4: Pipeline Requirements
Describe what the data preparation pipeline needs to do, not how it should work technically. Let the vendor propose the architecture.
What to include:
- Ingestion: What source systems need connectors? What formats need to be supported?
- Cleaning: What cleaning operations are needed? Deduplication, normalization, format standardization, PII redaction?
- Labeling: Do you need a labeling workflow? How many label categories? Will your domain experts perform labeling, or do you need the vendor to provide annotators?
- Transformation: What transformations are needed? Text extraction from documents, entity recognition, classification, structuring?
- Export: What output format does your ML pipeline require? JSONL, Parquet, COCO, custom schema?
- Quality requirements: What quality metrics do you expect? Accuracy targets, inter-annotator agreement thresholds, completeness requirements?
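To make the export requirement concrete: if your ML pipeline consumes JSONL, each prepared record becomes one JSON object per line. A minimal sketch — the field names (`doc_id`, `text`, `label`) are illustrative, not a required schema; match whatever your training framework expects:

```python
import json

def to_jsonl(records, path):
    """Write prepared records as JSON Lines: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def from_jsonl(path):
    """Read a JSONL file back into a list of records (for spot checks)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Including a sample output record like this in the RFP removes a whole category of back-and-forth during vendor scoping.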
Section 5: Deployment Constraints
Where and how the solution must run.
What to include:
- Deployment model: On-premise, cloud, hybrid, or air-gapped?
- Infrastructure: What compute and storage resources are available? Operating system, container support, GPU availability?
- Network: Is internet access available? What network restrictions exist?
- Integration: What systems does the pipeline need to integrate with? ML frameworks, data warehouses, monitoring tools?
- Scalability: What volume does the pipeline need to handle at launch? In 12 months?
Section 6: Integration Requirements
How the data preparation pipeline fits into your broader technology stack.
What to include:
- Upstream systems: Where does source data come from? How is it updated? Batch or real-time?
- Downstream systems: Where does prepared data go? What ML frameworks or platforms will consume it?
- Existing tools: What data tools do you already use? Is the vendor expected to replace or complement them?
- APIs: Do you need API access for programmatic pipeline control?
Section 7: Timeline and Milestones
Be realistic. Aggressive timelines lead to cut corners.
What to include:
- Overall timeline: When does the pipeline need to be operational?
- Key milestones: Discovery, build, validation, handoff — what dates matter?
- Dependencies: What external factors could affect the timeline? (IT provisioning, domain expert availability, compliance reviews)
- Phasing: Is a phased approach acceptable? Pipeline v1 with core functionality, then iteration?
Strong example:
"Pipeline operational within 8 weeks of engagement start. Milestones: data audit complete by Week 2, ingestion pipeline functional by Week 4, labeling workflow operational by Week 6, full pipeline validated and handed off by Week 8. Domain experts available 10 hours/week during Weeks 3-7. IT provisioning will be completed before engagement start."
Section 8: Evaluation Criteria
Tell vendors how you will evaluate their proposals. This shapes the quality of responses.
What to include and suggested weights:
- Technical approach (30%): Does the proposed solution address the pipeline requirements? Is the architecture sound?
- Deployment model fit (20%): Does the solution work within your infrastructure and compliance constraints?
- Implementation plan (20%): Is the timeline realistic? Are milestones concrete? Is the team qualified?
- Pricing (15%): Is pricing transparent? Is the total cost of ownership clear, including implementation?
- References (15%): Has the vendor done similar work in your industry? Can they provide references?
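Stated weights turn vendor comparison into simple arithmetic: score each criterion, multiply by its weight, and sum. A sketch using the suggested weights above (the criterion keys and the 0-10 scale are illustrative; adjust the weights to your priorities, but keep them summing to 1.0 so totals stay comparable):

```python
WEIGHTS = {
    "technical_approach":  0.30,
    "deployment_fit":      0.20,
    "implementation_plan": 0.20,
    "pricing":             0.15,
    "references":          0.15,
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (0-10 scale) into one weighted total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"unscored criteria: {sorted(missing)}")
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 2)
```

Publishing the formula in the RFP also keeps your internal evaluation honest: every proposal gets scored the same way.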
What Makes an RFP Strong vs. Weak
Strong RFPs:
- Describe actual data with specifics (types, volumes, formats, quality)
- State compliance requirements explicitly
- Provide evaluation criteria and weights
- Offer a data sample for vendors to assess
- Include realistic timelines with acknowledged dependencies
- Separate must-haves from nice-to-haves
Weak RFPs:
- Describe data vaguely or not at all
- List features without context on how they will be used
- Omit compliance requirements or use generic language
- Demand unrealistic timelines without acknowledging constraints
- Provide no evaluation criteria, making vendor responses a guessing game
- Copy-paste from a generic IT procurement template
One More Thing
Before issuing the RFP, consider calling your top two or three vendors for a brief pre-RFP conversation. A 15-minute call where you describe the project and ask whether they are a fit will save everyone time. Some vendors will self-select out, and the ones who respond will provide better proposals because they understand the context.
If Ertas is on your evaluation list and you want to have that pre-RFP conversation, book a discovery call. We will tell you honestly whether the project fits our capabilities — and if it does not, we will say so before you spend time writing an RFP section for us.