
Building an Eval Dataset from Client Conversations
How to build a gold-standard evaluation dataset from real client interactions — extracting test cases from support tickets, sales calls, and production logs to measure fine-tuned model performance.
Every agency fine-tuning models for clients faces the same evaluation problem: how do you know the model actually works for the client's real use case?
You can write test cases by hand. You can generate synthetic evaluation examples. Both approaches share a fundamental weakness — they test what you think the model should handle, not what it actually encounters in production.
The best evaluation datasets come from real client conversations. Support tickets, chat logs, sales call transcripts, production failure reports — these are the raw materials for an eval set that measures what actually matters. They capture the messy, ambiguous, domain-specific inputs that synthetic data rarely reproduces.
This guide walks through the complete process: sourcing conversations, extracting test cases, labeling expected outputs, and maintaining your eval set over time.
Why Real Conversations Beat Synthetic Test Data
Synthetic eval sets are useful for getting started. But they have three blind spots that real conversation data fills:
Distribution accuracy. Synthetic data reflects what you imagine the distribution looks like. Real data reflects what it actually looks like. If 40% of a client's support tickets are billing disputes and 5% are technical issues, a synthetic test set often inverts this — generating equal numbers of each category because that seems "balanced." The result: your eval overstates performance on rare categories and understates performance on the categories that actually matter.
Linguistic realism. Real users do not write like prompt engineers. They use abbreviations, make typos, mix languages, reference context from previous interactions, and express frustration in ways that change the query's tone and implicit intent. Synthetic examples tend to be clean, complete, and context-free.
Edge case discovery. You cannot synthetically generate edge cases you have not thought of. Real conversation data contains edge cases that nobody anticipated — and those are precisely the cases that cause production failures.
A study of production ML systems across multiple industries found that models evaluated on synthetic test sets showed accuracy 8-15 percentage points higher than the same models evaluated on real production data. That gap is the difference between a demo that impresses and a deployment that works.
Source 1: Support Tickets
Support tickets are the single richest source of evaluation data for most client deployments. They represent real user problems, expressed in real user language, with (usually) a known resolution.
What to extract:
- Input: The customer's original message (first message in the ticket, before any agent interaction)
- Expected output: The correct response, resolution, or classification based on how the ticket was actually resolved
- Metadata: Ticket category, priority, resolution time, customer satisfaction score
Extraction process:
- Export the last 90 days of resolved tickets from the client's helpdesk (Zendesk, Freshdesk, Intercom, etc.)
- Filter to tickets resolved successfully (customer confirmed resolution or no reopening)
- For each ticket, extract the initial customer message as the input
- For classification tasks: use the assigned category/tag as the expected output
- For response generation: use the first agent response that resolved the issue as the expected output
- Remove tickets that required escalation or had unusual circumstances — these are edge cases worth testing separately but should not be in your core eval set
Volume target: 100-200 tickets gives you a robust eval set. If the client has lower volume, 50 is the minimum for meaningful accuracy estimates.
Watch out for: Agent responses often include greetings, apologies, and pleasantries that you probably do not want the model to reproduce. Strip these to isolate the substantive response, or keep them if the model is expected to replicate the full agent experience.
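Below is a minimal sketch of this extraction pipeline in Python. The column names (ticket_id, status, escalated, and so on) are hypothetical; adapt them to whatever your client's helpdesk actually exports.

import csv
import json

# Hypothetical column names; adjust to your helpdesk's actual export schema.
INPUT_CSV = "resolved_tickets_last_90d.csv"
OUTPUT_JSONL = "eval_candidates.jsonl"

with open(INPUT_CSV, newline="", encoding="utf-8") as f, \
        open(OUTPUT_JSONL, "w", encoding="utf-8") as out:
    for row in csv.DictReader(f):
        # Steps 2 and 6: keep only cleanly resolved, non-escalated tickets
        if row["status"] != "resolved" or row["reopened"] == "true":
            continue
        if row["escalated"] == "true":
            continue
        example = {
            "id": f"ticket-{row['ticket_id']}",
            "input": row["first_customer_message"],       # step 3
            "expected_output": row["assigned_category"],  # step 4: classification
            "source": "support_tickets",
            "metadata": {
                "priority": row["priority"],
                "resolution_time_hours": row["resolution_time_hours"],
            },
        }
        out.write(json.dumps(example) + "\n")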
Source 2: Chat Logs
If the client has an existing chatbot or live chat system, historical chat logs provide conversational evaluation data.
What to extract:
- Input: The user's message (or the full conversation up to a certain turn, if context matters)
- Expected output: The ideal response at that point in the conversation
- Context: Previous messages in the conversation, user metadata if available
Extraction process:
- Export chat transcripts from the last 60-90 days
- Identify conversations that reached a successful resolution (user thanked the agent, marked the issue as resolved, or did not return with the same question)
- Select specific turns within conversations as test cases — not every turn is equally useful
- Prioritize turns where the agent provided substantive information (not "Let me look into that" or "Please hold")
- For multi-turn evaluation, include the conversation history as context in the input
Volume target: 75-150 conversation turns. Focus on diversity — you want turns from different types of conversations, not 50 turns from 5 long conversations.
Watch out for: Chat logs often contain personal information (names, account numbers, email addresses). Anonymize before adding to your eval set. Replace real names with placeholders, redact account numbers, and swap email domains to example.com.
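Here is one way to turn a single transcript into a multi-turn test case. The transcript structure (a list of role/text dicts) and the filler-phrase filter are assumptions; adapt both to your chat export format.

# Transcript structure (a list of role/text dicts) is an assumption.
FILLER = {"let me look into that", "please hold", "one moment"}

def turn_to_eval_case(transcript, turn_index, case_id):
    """Conversation up to turn_index becomes context, the user message at
    turn_index becomes the input, the agent's next reply is the target."""
    if turn_index + 1 >= len(transcript):
        return None
    agent_reply = transcript[turn_index + 1]["text"]
    if agent_reply.strip().lower().rstrip(".") in FILLER:
        return None  # skip non-substantive turns
    return {
        "id": case_id,
        "context": [t["text"] for t in transcript[:turn_index]],
        "input": transcript[turn_index]["text"],
        "expected_output": agent_reply,
        "source": "chat_logs",
    }

transcript = [
    {"role": "user", "text": "My export keeps failing with error 42."},
    {"role": "agent", "text": "Error 42 means the file is over the size limit; split the export by month."},
]
print(turn_to_eval_case(transcript, 0, "chat-001"))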
Source 3: Sales Call Transcripts
For models that assist with sales enablement, lead qualification, or product recommendation, sales call transcripts are invaluable.
What to extract:
- Input: Customer questions, objections, or requirements as stated during the call
- Expected output: The correct product recommendation, objection handling, or information the salesperson provided
- Outcome data: Whether the deal closed, the deal size, and time to close — this lets you weight eval examples by business impact
Extraction process:
- Pull transcripts from the last 90 days (from Gong, Chorus, or similar call recording tools)
- Focus on calls that resulted in closed deals — these demonstrate successful interactions
- Identify 3-5 key moments per call: the initial needs assessment, a technical question, an objection, a pricing discussion, and the recommendation
- For each moment, extract the customer's statement as input and the sales rep's response as expected output
- Include 20-30% of examples from lost deals where the sales rep's response was still factually correct — the model should give accurate information regardless of deal outcome
Volume target: 50-100 extracted moments from 15-30 calls.
Watch out for: Sales reps sometimes make promises or claims that are not technically accurate. Verify the factual content of expected outputs before adding them to your eval set. An eval set built on incorrect expected outputs will train you to accept incorrect model outputs.
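If you captured outcome data, a simple weighting scheme lets the eval score reflect business impact rather than raw counts. The log-of-deal-size weighting below is one possible choice, not a prescription:

import math

def weighted_accuracy(results):
    """results: list of (is_correct, deal_size_usd) pairs."""
    total = sum(math.log1p(size) for _, size in results)
    correct = sum(math.log1p(size) for ok, size in results if ok)
    return correct / total

# Two correct answers on mid-size deals, one miss on a large deal
results = [(True, 50_000), (True, 5_000), (False, 120_000)]
print(f"weighted accuracy: {weighted_accuracy(results):.1%}")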
Source 4: Production Failure Logs
The most valuable eval examples come from cases where the current system failed. If the client already has a model in production (or a rule-based system, or a manual process that breaks), failure cases are gold.
What to extract:
- Input: The exact input that caused the failure
- Expected output: What should have happened (determined after the fact by a domain expert)
- Failure mode: How the system failed (wrong classification, hallucinated response, format error, timeout)
Extraction process:
- Collect reports of production failures, customer complaints, and escalations from the last 6 months
- Reconstruct the exact input that triggered each failure
- Have a domain expert determine the correct output for each case
- Categorize failures by type: accuracy errors, format errors, latency issues, edge case failures, safety violations
Volume target: Every failure case you can find. Even 10-20 failure cases are disproportionately valuable because they represent the exact scenarios where the model needs to improve.
Why failures matter most: Your core eval set tells you how well the model handles normal traffic. Your failure cases tell you whether the model has fixed the specific problems the client cares about most. When a client says "the model needs to be better," they usually mean "the model needs to stop making these specific mistakes" — and failure cases capture those specific mistakes.
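A small schema keeps failure cases queryable by failure mode later. The structure below is a hypothetical sketch, not a standard:

from dataclasses import asdict, dataclass
import json

FAILURE_MODES = {"wrong_classification", "hallucinated_response", "format_error", "timeout"}

@dataclass
class FailureCase:
    id: str
    input: str             # the exact input that triggered the failure
    expected_output: str   # determined after the fact by a domain expert
    failure_mode: str      # one of FAILURE_MODES
    reported: str          # date the failure was reported

    def __post_init__(self):
        if self.failure_mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode: {self.failure_mode}")

case = FailureCase("fail-001", "wheres my refund for ordr 88214!!",
                   "Route to billing with the order number attached",
                   "wrong_classification", "2026-01-08")
print(json.dumps(asdict(case)))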
The Extraction and Labeling Process
Step 1: Anonymize
Before anything else, strip personally identifiable information from all conversation data. This is non-negotiable, both for privacy compliance and because PII in your eval set can leak into model outputs.
Replace:
- Names with role identifiers (Customer, Agent, Manager)
- Email addresses with user@example.com
- Phone numbers with 555-0100 format
- Account/order numbers with generic IDs (ORDER-001, ACCT-A)
- Company-specific identifiers with generic labels
Automate this with regex patterns for the obvious cases (emails, phone numbers) and do a manual pass for domain-specific identifiers.
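A minimal sketch of that regex pass in Python. These patterns catch only the obvious cases; the manual pass for domain-specific identifiers is still required:

import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "user@example.com"),
    (re.compile(r"\border\s*#?\s*\d+\b", re.IGNORECASE), "ORDER-001"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "555-0100"),
]

def anonymize(text: str) -> str:
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(anonymize("I'm jane.doe@acme.com, call +1 (415) 555-2671 about order #88214"))
# -> "I'm user@example.com, call 555-0100 about ORDER-001"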
Step 2: Categorize
Tag each example with the type of task it represents. Common categories:
- Classification: The correct output is a label from a fixed set
- Extraction: The correct output is specific information pulled from the input
- Generation: The correct output is a natural language response
- Summarization: The correct output is a condensed version of the input
- Decision: The correct output is a recommendation or judgment
This categorization lets you analyze model performance by task type rather than as a single accuracy number. A model might be 95% accurate on classification but only 78% on generation — that distinction drives different improvement strategies.
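Computing that breakdown takes a few lines once each example carries its category tag. A sketch, assuming results are stored as simple dicts:

from collections import defaultdict

def accuracy_by_category(results):
    """results: list of dicts with "category" and "correct" keys."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += r["correct"]
    return {cat: hits[cat] / totals[cat] for cat in totals}

results = [
    {"category": "classification", "correct": True},
    {"category": "classification", "correct": True},
    {"category": "generation", "correct": False},
    {"category": "generation", "correct": True},
]
print(accuracy_by_category(results))  # {'classification': 1.0, 'generation': 0.5}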
Step 3: Label Expected Outputs
For each example, define what a correct output looks like. This is the hardest and most important step.
For classification tasks: the label is usually unambiguous. A billing dispute ticket gets the expected output "billing." Done.
For generation tasks: Write the ideal output. Then write 2-3 acceptable variations. Your scoring should accept any output that is semantically equivalent, not just exact matches.
For complex tasks: Define key criteria rather than an exact output. For example: "The response must (1) acknowledge the billing error, (2) state the correct amount, (3) describe the resolution steps, (4) provide a timeline." Score based on how many criteria the output meets.
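Here is what criteria-based scoring can look like for that billing example. The keyword checks are stand-ins; in practice each criterion might be judged by semantic matching or an LLM judge:

def criteria_score(output: str, criteria: list) -> float:
    """Each criterion is a (name, check_fn) pair; returns the fraction met."""
    met = sum(1 for _, check in criteria if check(output.lower()))
    return met / len(criteria)

criteria = [
    ("acknowledges billing error", lambda o: "billing error" in o or "overcharged" in o),
    ("states correct amount", lambda o: "$42.00" in o),
    ("describes resolution steps", lambda o: "refund" in o),
    ("provides timeline", lambda o: "business days" in o),
]

output = "You were overcharged; we will refund $42.00 within 5 business days."
print(f"{criteria_score(output, criteria):.0%} of criteria met")  # 100%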
Pro tip: Have two people independently label a subset of 20 examples. Measure their agreement rate. If they disagree on more than 15% of examples, your labeling criteria are too ambiguous — tighten the guidelines before labeling the rest.
Step 4: Format for Use
Store your eval set in JSONL format — one JSON object per line, each containing:
{
"id": "eval-001",
"input": "The customer message or query",
"expected_output": "The ideal model response",
"category": "classification",
"source": "support_tickets",
"difficulty": "standard",
"key_criteria": ["mentions refund policy", "provides timeline"],
"date_added": "2026-02-15"
}
Version this file. Every time you add or modify examples, increment the version. You need to know which eval set version produced which results.
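A small loader that validates the schema on every read catches drift early. A sketch, assuming the field names shown above and a hypothetical versioned filename:

import json

REQUIRED = {"id", "input", "expected_output", "category", "source"}

def load_eval_set(path: str) -> list:
    examples, seen_ids = [], set()
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            ex = json.loads(line)
            missing = REQUIRED - ex.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            if ex["id"] in seen_ids:
                raise ValueError(f"line {line_no}: duplicate id {ex['id']}")
            seen_ids.add(ex["id"])
            examples.append(ex)
    return examples

eval_set = load_eval_set("eval_set_v3.jsonl")  # hypothetical filename
print(f"{len(eval_set)} examples loaded")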
How Many Examples You Need
The short answer: 50-200 for a solid eval set.
The longer answer depends on what you need to measure:
- 50 examples: Enough to detect major issues (accuracy gaps larger than 10 percentage points). Confidence intervals are wide. Useful for initial sanity checks.
- 100 examples: Standard for most agency deployments. Gives you accuracy estimates within ±6-8 percentage points.
- 200 examples: High-confidence evaluation. Accuracy estimates within ±4-5 percentage points. Worth the investment for high-stakes deployments.
- 500+ examples: Enterprise-grade evaluation. Only necessary for mission-critical applications or when you need to break down performance across many sub-categories.
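Those margins come straight from the normal-approximation confidence interval for a proportion, which you can sanity-check yourself:

import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """95% normal-approximation margin for observed accuracy p on n examples."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (50, 100, 200, 500):
    print(f"n={n}: ±{margin_of_error(0.85, n) * 100:.1f} points at 85% accuracy")
# n=100 gives roughly ±7 points and n=200 roughly ±5, matching the ranges above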
Distribution within your eval set:
- 60% standard/common cases (representing the bulk of production traffic)
- 25% moderate complexity cases
- 15% edge cases and known failure modes
Do not overweight edge cases. An eval set with 50% edge cases will give you a pessimistic accuracy number that does not reflect real production performance. Edge cases matter, but they should be proportional to their actual frequency.
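One way to assemble that split is stratified sampling over a tagged candidate pool. A sketch, assuming each candidate carries a difficulty tag:

import random

TARGET_MIX = {"standard": 0.60, "moderate": 0.25, "edge_case": 0.15}

def stratified_sample(candidates, total, seed=42):
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    sample = []
    for difficulty, share in TARGET_MIX.items():
        pool = [c for c in candidates if c["difficulty"] == difficulty]
        k = min(round(total * share), len(pool))
        sample.extend(rng.sample(pool, k))
    return sample

candidates = [{"id": f"c{i}", "difficulty": d}
              for i, d in enumerate(["standard"] * 80 + ["moderate"] * 30 + ["edge_case"] * 20)]
print(len(stratified_sample(candidates, total=100)))  # 60 + 25 + 15 = 100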
Maintaining Your Eval Set Over Time
An eval set is not a one-time artifact. It needs maintenance.
Monthly additions: Add 10-20 new examples each month from recent production data. Focus on cases the model got wrong or cases that represent new patterns not covered by existing examples.
Quarterly review: Every 3 months, review the full eval set for stale examples. Remove cases that no longer represent the client's use case (products discontinued, policies changed, processes updated).
Version tracking: Every update gets a new version number. Record which eval set version was used for each model evaluation. This lets you track whether apparent performance changes are due to the model improving or the eval set changing.
Never train on eval data. This rule is so important it bears repeating. The moment an eval example appears in your training set, that example stops measuring generalization and starts measuring memorization. Keep eval and training data in separate files, separate directories, and separate version control.
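A cheap automated guard helps enforce this. The sketch below checks for verbatim overlap between two JSONL files (hypothetical filenames, and it assumes the training file also has an "input" field); near-duplicates need fuzzier matching, but exact-match leakage is the worst case and the easiest to catch:

import json

def assert_no_overlap(train_path: str, eval_path: str) -> None:
    def inputs(path):
        with open(path, encoding="utf-8") as f:
            return {json.loads(line)["input"].strip().lower() for line in f}

    leaked = inputs(train_path) & inputs(eval_path)
    if leaked:
        raise AssertionError(f"{len(leaked)} eval inputs found in training data")

assert_no_overlap("train_v7.jsonl", "eval_set_v3.jsonl")  # hypothetical filenames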
From Eval Set to Improvement Loop
The eval set is not just for scoring. It drives improvement.
- Run the model against the eval set and identify failure cases.
- Categorize failures: Which task types fail? Which input patterns?
- Collect or generate 50-100 new training examples targeting the failure categories.
- Retrain the model with the augmented training set.
- Re-run the eval set. Verify the failure cases are fixed.
- Verify that other categories did not regress.
This loop — evaluate, identify gaps, generate targeted data, retrain, re-evaluate — is the core of production model improvement. The eval set is what makes it systematic rather than guesswork.
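The regression check at the end of the loop is easy to automate once you score by category. A sketch comparing two eval runs:

def find_regressions(before: dict, after: dict, tolerance: float = 0.02):
    """before/after map category -> accuracy; tolerance absorbs noise."""
    return {
        cat: (before[cat], after[cat])
        for cat in before
        if after.get(cat, 0.0) < before[cat] - tolerance
    }

v1 = {"classification": 0.95, "generation": 0.78, "extraction": 0.90}
v2 = {"classification": 0.96, "generation": 0.85, "extraction": 0.84}
print(find_regressions(v1, v2))  # {'extraction': (0.9, 0.84)}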
Ertas Studio supports this workflow directly: upload your eval set, run evaluations with a click, compare results across model versions, and identify exactly which categories need more training data. The platform tracks your eval history so you can see improvement trends over time.
Further Reading
- Synthetic Data Generation for Fine-Tuning — How to supplement real conversation data with synthetic examples when you need more volume
- How to Evaluate Your Fine-Tuned Model — The evaluation approaches that use the dataset you build here