
    Building a Training Dataset from Your App's User Interactions

    Your app already generates the training data you need for fine-tuning. How to collect, clean, and format user interactions into a dataset that produces a high-quality on-device model.

    Ertas Team

    The best training data for your AI model comes from your own app. Your users' real interactions, real questions, and real content represent exactly the domain your model needs to learn. No synthetic data or public dataset will match the quality of data from your actual use case.

    This guide covers how to collect, clean, and format that data for fine-tuning.

    What Counts as Training Data

    Every user interaction in your app is a potential training example:

    | App Type | Raw Data | Training Example |
    | --- | --- | --- |
    | Customer support | User question + agent response | Q&A pair |
    | Note-taking | User notes + auto-generated summaries | Summarization pair |
    | Finance | Transaction description + assigned category | Classification pair |
    | Email | Incoming email + user's reply | Reply generation pair |
    | E-commerce | Product + user review | Sentiment pair |
    | Health | Symptom description + triage outcome | Classification pair |

    The pattern: any input-output pair where the "correct" output is known (either from explicit user action or expert judgment) is a training example.

    Data Collection Strategy

    Log user interactions that naturally produce input-output pairs:

    • Search queries + clicked results: The clicked result is the "correct" answer
    • Categorization actions: When a user assigns a category to content, that is a labeled example
    • Corrections: When a user edits an AI-generated response, the edited version is the "correct" output
    • Completions: When a user accepts a suggestion, that is a positive example
    // Log user corrections for training data
    function onAiResponseEdited(original: string, edited: string, context: string) {
      logTrainingExample({
        input: context,
        output: edited,  // The user's correction is the training target
        source: "user_correction",
        timestamp: Date.now(),
      });
    }
    

    Active Collection

    Prompt users to provide feedback that directly produces training data:

    • Thumbs up/down on AI responses: Filter for thumbs-up responses as positive examples
    • Correction interface: Let users fix AI responses; log the corrections
    • Template usage: When users select and use a template, the filled template is a training example

    Synthetic Augmentation

    Supplement real data with synthetic examples:

    1. Take your best real examples
    2. Use a larger model (GPT-4o, Claude Sonnet) to generate variations
    3. Validate synthetic examples against real ones
    4. Mix synthetic and real data (aim for at least 30% real data)
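Step 4 can be sketched as a small helper that caps the synthetic share so real examples stay at or above the 30% floor (the function name, seed, and ratio parameter are illustrative, not from any library):

```python
import random

def mix_datasets(real, synthetic, min_real_ratio=0.3, seed=42):
    """Combine real and synthetic examples, capping synthetic volume so
    real examples stay at or above min_real_ratio of the final set."""
    # With R real examples, at most R * (1/ratio - 1) synthetic ones fit.
    max_synthetic = int(len(real) * (1 / min_real_ratio - 1))
    rng = random.Random(seed)
    kept_synthetic = rng.sample(synthetic, min(len(synthetic), max_synthetic))
    mixed = real + kept_synthetic
    rng.shuffle(mixed)
    return mixed
```

If you have fewer synthetic examples than the cap allows, all of them are kept; the ratio constraint only ever discards synthetic data, never real data.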

    Privacy and Consent

    Before collecting any user data for training:

    1. Update your privacy policy to disclose that anonymized interaction data may be used to improve AI features
    2. Obtain consent where required (GDPR requires explicit consent for processing personal data)
    3. Provide opt-out for users who do not want their interactions used for training
    4. Anonymize data before using it for training. Remove names, emails, phone numbers, and other PII.

    Technical Anonymization

    import re

    # Minimal name list for illustration; production code should use an NER model
    KNOWN_NAMES = {"Alice", "Bob"}

    def replace_names(text: str, placeholder: str) -> str:
        for name in KNOWN_NAMES:
            text = re.sub(r'\b' + re.escape(name) + r'\b', placeholder, text)
        return text

    def anonymize(text: str) -> str:
        # Remove email addresses
        text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)
        # Remove US-style phone numbers
        text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
        # Remove names (requires NER or a name list)
        text = replace_names(text, '[NAME]')
        return text
    

    On-Device Collection

    The safest approach: collect training data on-device and only transmit anonymized, aggregated data. The raw interaction stays on the user's phone. Only the anonymized training example leaves the device.

    Data Cleaning

    Raw interaction data is noisy. Cleaning is the most important step in the pipeline.

    Quality Filters

    1. Remove too-short examples: Inputs under 10 characters or outputs under 20 characters rarely contain useful signal
    2. Remove duplicates: Exact and near-duplicate examples add noise
    3. Remove errors: Interactions where the app crashed or the user abandoned mid-flow
    4. Remove off-topic: Interactions that do not match your target task
    5. Remove PII that slipped through anonymization: Secondary pass with stricter patterns
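Filters 1 and 2 can be sketched in a few lines, assuming examples are (input, output) string pairs; the thresholds and function name are illustrative:

```python
def clean_examples(examples, min_input_len=10, min_output_len=20):
    """Apply length and duplicate filters to (input, output) pairs."""
    seen = set()
    cleaned = []
    for inp, out in examples:
        if len(inp.strip()) < min_input_len or len(out.strip()) < min_output_len:
            continue  # too short to carry useful signal
        # Normalize whitespace and case for near-duplicate detection
        key = (" ".join(inp.lower().split()), " ".join(out.lower().split()))
        if key in seen:
            continue  # duplicate of an example already kept
        seen.add(key)
        cleaned.append((inp, out))
    return cleaned
```

Error and off-topic filtering (filters 3 and 4) depend on your app's logging, so they are app-specific and omitted here.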

    Quality Scoring

    Not all examples are equally useful. Score each example:

    | Signal | Weight | Rationale |
    | --- | --- | --- |
    | User accepted the AI response | High | Direct positive signal |
    | User edited then accepted | Highest | The edit is the ideal output |
    | User rejected the AI response | Low (use sparingly) | Negative signal, useful for contrast |
    | Long, detailed interaction | Medium | More context for the model |
    | Common query pattern | Medium | High-frequency patterns matter most |
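One way to turn this scoring scheme into code is a lookup of weights, scoring each example by its strongest signal. The numeric values below are hypothetical placeholders, not calibrated figures:

```python
# Assumed weights for the signals above (hypothetical values; tune for your app)
SIGNAL_WEIGHTS = {
    "edited_then_accepted": 1.0,   # highest: the edit is the ideal output
    "accepted": 0.8,               # high: direct positive signal
    "long_interaction": 0.5,       # medium: more context for the model
    "common_pattern": 0.5,         # medium: high-frequency patterns matter
    "rejected": 0.1,               # low: use sparingly, for contrast
}

def score_example(signals):
    """Score an example by its strongest signal; unknown signals score 0."""
    return max((SIGNAL_WEIGHTS.get(s, 0.0) for s in signals), default=0.0)
```

Sorting by this score lets you keep the top-N examples when trimming a noisy dataset down to size.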

    Target Distribution

    Your training set should roughly match your production query distribution. If 40% of user queries are about topic A and 10% about topic B, your training set should reflect that ratio. Over-representing rare topics can skew the model.
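A minimal sketch of matching the production distribution, assuming you have examples grouped by topic and target ratios measured from production traffic (names are illustrative):

```python
import random

def match_distribution(examples_by_topic, target_ratios, total, seed=0):
    """Sample examples per topic so the training set mirrors production ratios.

    examples_by_topic: {topic: [examples]}
    target_ratios: {topic: fraction of production queries}
    total: desired overall dataset size
    """
    rng = random.Random(seed)
    sampled = []
    for topic, ratio in target_ratios.items():
        pool = examples_by_topic.get(topic, [])
        # Take at most the pool size; under-supplied topics just contribute less
        n = min(len(pool), round(total * ratio))
        sampled.extend(rng.sample(pool, n))
    return sampled
```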

    Formatting for Fine-Tuning

    Chat Format (Standard)

    Most fine-tuning frameworks expect the chat format:

    {"messages": [
      {"role": "system", "content": "You are an assistant for FitTracker app."},
      {"role": "user", "content": "How many calories in a banana?"},
      {"role": "assistant", "content": "A medium banana has about 105 calories, 27g carbs, 1.3g protein, and 0.4g fat."}
    ]}
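Converting logged (input, output) pairs into this JSONL chat format might look like the following sketch (function and parameter names are illustrative):

```python
import json

def to_chat_jsonl(examples, system_prompt, path):
    """Write (input, output) pairs as chat-format JSONL, one example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for user_text, assistant_text in examples:
            record = {"messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_text},
                {"role": "assistant", "content": assistant_text},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping the system prompt identical across all examples teaches the model a consistent persona; use the same prompt at inference time.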
    

    Multi-Turn Conversations

    For chat features, include the full conversation:

    {"messages": [
      {"role": "system", "content": "You are an assistant for FitTracker app."},
      {"role": "user", "content": "What should I eat before a workout?"},
      {"role": "assistant", "content": "A light meal 1-2 hours before works best. Good options: banana with peanut butter, oatmeal, or a small smoothie. Focus on easily digestible carbs."},
      {"role": "user", "content": "What about protein?"},
      {"role": "assistant", "content": "Add a small amount of protein: a scoop of whey in your smoothie, Greek yogurt with your oatmeal, or a handful of almonds. Keep it under 20g to avoid feeling heavy during the workout."}
    ]}
    

    Classification Format

    For classification tasks, the format is simpler:

    {"messages": [
      {"role": "user", "content": "Classify: Morning run in the park"},
      {"role": "assistant", "content": "Cardio"}
    ]}
    

    Dataset Size Guidelines

    | Task | Minimum | Good | Excellent |
    | --- | --- | --- | --- |
    | Classification (5-10 categories) | 200 | 500-1,000 | 2,000+ |
    | Q&A (bounded domain) | 300 | 1,000-2,000 | 3,000+ |
    | Chat (multi-turn) | 500 | 2,000-3,000 | 5,000+ |
    | Summarization | 300 | 1,000-2,000 | 3,000+ |
    | Content generation | 500 | 1,500-3,000 | 5,000+ |

    Quality matters more than quantity. 500 carefully curated examples outperform 5,000 noisy ones.

    The Pipeline

    1. Instrument your app to log interactions (with user consent)
    2. Accumulate data over 2-4 weeks of normal usage
    3. Export and anonymize the logged interactions
    4. Clean and filter using the quality criteria above
    5. Format into the chat JSON structure
    6. Split into training (90%) and evaluation (10%) sets
    7. Fine-tune using a platform like Ertas: upload the formatted dataset, select your base model, train with LoRA, export GGUF
    8. Evaluate on the held-out set
    9. Deploy the GGUF model on-device
    10. Iterate by collecting more data and retraining periodically
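Step 6, the 90/10 split, can be sketched as follows (seed and names are illustrative; shuffle before splitting so the evaluation set is not biased by collection order):

```python
import random

def split_dataset(examples, eval_fraction=0.1, seed=7):
    """Shuffle and split examples into training and held-out evaluation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)
```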

    Your app is generating training data right now. The question is whether you are capturing it.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

