
    Building a Training Dataset from Your App's User Interactions

    Your app already generates the training data you need for fine-tuning. How to collect, clean, and format user interactions into a dataset that produces a high-quality on-device model.

    Ertas Team

    The best training data for your AI model comes from your own app. Your users' real interactions, real questions, and real content represent exactly the domain your model needs to learn. No synthetic data or public dataset will match the quality of data from your actual use case.

    This guide covers how to collect, clean, and format that data for fine-tuning.

    What Counts as Training Data

    Every user interaction in your app is a potential training example:

    | App Type | Raw Data | Training Example |
    | --- | --- | --- |
    | Customer support | User question + agent response | Q&A pair |
    | Note-taking | User notes + auto-generated summaries | Summarization pair |
    | Finance | Transaction description + assigned category | Classification pair |
    | Email | Incoming email + user's reply | Reply generation pair |
    | E-commerce | Product + user review | Sentiment pair |
    | Health | Symptom description + triage outcome | Classification pair |

    The pattern: any input-output pair where the "correct" output is known (either from explicit user action or expert judgment) is a training example.

    Data Collection Strategy

    Log user interactions that naturally produce input-output pairs:

    • Search queries + clicked results: The clicked result is the "correct" answer
    • Categorization actions: When a user assigns a category to content, that is a labeled example
    • Corrections: When a user edits an AI-generated response, the edited version is the "correct" output
    • Completions: When a user accepts a suggestion, that is a positive example
    // Log user corrections for training data
    function onAiResponseEdited(original: string, edited: string, context: string) {
      logTrainingExample({
        input: context,
        output: edited,  // The user's correction is the training target
        source: "user_correction",
        timestamp: Date.now(),
      });
    }
    

    Active Collection

    Prompt users to provide feedback that directly produces training data:

    • Thumbs up/down on AI responses: Filter for thumbs-up responses as positive examples
    • Correction interface: Let users fix AI responses; log the corrections
    • Template usage: When users select and use a template, the filled template is a training example

    Synthetic Augmentation

    Supplement real data with synthetic examples:

    1. Take your best real examples
    2. Use a larger model (GPT-4o, Claude Sonnet) to generate variations
    3. Validate synthetic examples against real ones
    4. Mix synthetic and real data (aim for at least 30% real data)
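Step 4 can be sketched as a small helper that caps the synthetic share so real examples stay at or above the 30% floor (the function name, seed, and ratio parameter are illustrative, not from any library):

```python
import random

def mix_datasets(real, synthetic, min_real_ratio=0.3, seed=42):
    """Combine real and synthetic examples, capping synthetic volume so
    real examples stay at or above min_real_ratio of the final set."""
    # With R real examples, at most R * (1/ratio - 1) synthetic ones fit.
    max_synthetic = int(len(real) * (1 / min_real_ratio - 1))
    rng = random.Random(seed)
    kept_synthetic = rng.sample(synthetic, min(len(synthetic), max_synthetic))
    mixed = real + kept_synthetic
    rng.shuffle(mixed)
    return mixed
```

If you have fewer synthetic examples than the cap allows, all of them are kept; the ratio constraint only ever discards synthetic data, never real data.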

    Privacy and Consent

    Before collecting any user data for training:

    1. Update your privacy policy to disclose that anonymized interaction data may be used to improve AI features
    2. Obtain consent where required (GDPR requires explicit consent for processing personal data)
    3. Provide opt-out for users who do not want their interactions used for training
    4. Anonymize data before using it for training. Remove names, emails, phone numbers, and other PII.

    Technical Anonymization

    import re

    # Minimal name list for illustration; production code should use an NER model
    KNOWN_NAMES = {"Alice", "Bob"}

    def replace_names(text: str, placeholder: str) -> str:
        for name in KNOWN_NAMES:
            text = re.sub(r'\b' + re.escape(name) + r'\b', placeholder, text)
        return text

    def anonymize(text: str) -> str:
        # Remove email addresses
        text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)
        # Remove US-style phone numbers
        text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
        # Remove names (requires NER or a name list)
        text = replace_names(text, '[NAME]')
        return text
    

    On-Device Collection

    The safest approach: collect training data on-device and only transmit anonymized, aggregated data. The raw interaction stays on the user's phone. Only the anonymized training example leaves the device.

    Data Cleaning

    Raw interaction data is noisy. Cleaning is the most important step in the pipeline.

    Quality Filters

    1. Remove too-short examples: Inputs under 10 characters or outputs under 20 characters rarely contain useful signal
    2. Remove duplicates: Exact and near-duplicate examples add noise
    3. Remove errors: Interactions where the app crashed or the user abandoned mid-flow
    4. Remove off-topic: Interactions that do not match your target task
    5. Remove PII that slipped through anonymization: Secondary pass with stricter patterns
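Filters 1 and 2 can be sketched in a few lines, assuming examples are (input, output) string pairs; the thresholds and function name are illustrative:

```python
def clean_examples(examples, min_input_len=10, min_output_len=20):
    """Apply length and duplicate filters to (input, output) pairs."""
    seen = set()
    cleaned = []
    for inp, out in examples:
        if len(inp.strip()) < min_input_len or len(out.strip()) < min_output_len:
            continue  # too short to carry useful signal
        # Normalize whitespace and case for near-duplicate detection
        key = (" ".join(inp.lower().split()), " ".join(out.lower().split()))
        if key in seen:
            continue  # duplicate of an example already kept
        seen.add(key)
        cleaned.append((inp, out))
    return cleaned
```

Error and off-topic filtering (filters 3 and 4) depend on your app's logging, so they are app-specific and omitted here.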

    Quality Scoring

    Not all examples are equally useful. Score each example:

    | Signal | Weight | Rationale |
    | --- | --- | --- |
    | User accepted the AI response | High | Direct positive signal |
    | User edited then accepted | Highest | The edit is the ideal output |
    | User rejected the AI response | Low (use sparingly) | Negative signal, useful for contrast |
    | Long, detailed interaction | Medium | More context for the model |
    | Common query pattern | Medium | High-frequency patterns matter most |
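One way to turn this scoring scheme into code is a lookup of weights, scoring each example by its strongest signal. The numeric values below are hypothetical placeholders, not calibrated figures:

```python
# Assumed weights for the signals above (hypothetical values; tune for your app)
SIGNAL_WEIGHTS = {
    "edited_then_accepted": 1.0,   # highest: the edit is the ideal output
    "accepted": 0.8,               # high: direct positive signal
    "long_interaction": 0.5,       # medium: more context for the model
    "common_pattern": 0.5,         # medium: high-frequency patterns matter
    "rejected": 0.1,               # low: use sparingly, for contrast
}

def score_example(signals):
    """Score an example by its strongest signal; unknown signals score 0."""
    return max((SIGNAL_WEIGHTS.get(s, 0.0) for s in signals), default=0.0)
```

Sorting by this score lets you keep the top-N examples when trimming a noisy dataset down to size.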

    Target Distribution

    Your training set should roughly match your production query distribution. If 40% of user queries are about topic A and 10% about topic B, your training set should reflect that ratio. Over-representing rare topics can skew the model.
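A minimal sketch of matching the production distribution, assuming you have examples grouped by topic and target ratios measured from production traffic (names are illustrative):

```python
import random

def match_distribution(examples_by_topic, target_ratios, total, seed=0):
    """Sample examples per topic so the training set mirrors production ratios.

    examples_by_topic: {topic: [examples]}
    target_ratios: {topic: fraction of production queries}
    total: desired overall dataset size
    """
    rng = random.Random(seed)
    sampled = []
    for topic, ratio in target_ratios.items():
        pool = examples_by_topic.get(topic, [])
        # Take at most the pool size; under-supplied topics just contribute less
        n = min(len(pool), round(total * ratio))
        sampled.extend(rng.sample(pool, n))
    return sampled
```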

    Formatting for Fine-Tuning

    Chat Format (Standard)

    Most fine-tuning frameworks expect the chat format:

    {"messages": [
      {"role": "system", "content": "You are an assistant for FitTracker app."},
      {"role": "user", "content": "How many calories in a banana?"},
      {"role": "assistant", "content": "A medium banana has about 105 calories, 27g carbs, 1.3g protein, and 0.4g fat."}
    ]}
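Converting logged (input, output) pairs into this JSONL chat format might look like the following sketch (function and parameter names are illustrative):

```python
import json

def to_chat_jsonl(examples, system_prompt, path):
    """Write (input, output) pairs as chat-format JSONL, one example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for user_text, assistant_text in examples:
            record = {"messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_text},
                {"role": "assistant", "content": assistant_text},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping the system prompt identical across all examples teaches the model a consistent persona; use the same prompt at inference time.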
    

    Multi-Turn Conversations

    For chat features, include the full conversation:

    {"messages": [
      {"role": "system", "content": "You are an assistant for FitTracker app."},
      {"role": "user", "content": "What should I eat before a workout?"},
      {"role": "assistant", "content": "A light meal 1-2 hours before works best. Good options: banana with peanut butter, oatmeal, or a small smoothie. Focus on easily digestible carbs."},
      {"role": "user", "content": "What about protein?"},
      {"role": "assistant", "content": "Add a small amount of protein: a scoop of whey in your smoothie, Greek yogurt with your oatmeal, or a handful of almonds. Keep it under 20g to avoid feeling heavy during the workout."}
    ]}
    

    Classification Format

    For classification tasks, the format is simpler:

    {"messages": [
      {"role": "user", "content": "Classify: Morning run in the park"},
      {"role": "assistant", "content": "Cardio"}
    ]}
    

    Dataset Size Guidelines

    | Task | Minimum | Good | Excellent |
    | --- | --- | --- | --- |
    | Classification (5-10 categories) | 200 | 500-1,000 | 2,000+ |
    | Q&A (bounded domain) | 300 | 1,000-2,000 | 3,000+ |
    | Chat (multi-turn) | 500 | 2,000-3,000 | 5,000+ |
    | Summarization | 300 | 1,000-2,000 | 3,000+ |
    | Content generation | 500 | 1,500-3,000 | 5,000+ |

    Quality matters more than quantity. 500 carefully curated examples outperform 5,000 noisy ones.

    The Pipeline

    1. Instrument your app to log interactions (with user consent)
    2. Accumulate data over 2-4 weeks of normal usage
    3. Export and anonymize the logged interactions
    4. Clean and filter using the quality criteria above
    5. Format into the chat JSON structure
    6. Split into training (90%) and evaluation (10%) sets
    7. Fine-tune using a platform like Ertas: upload the formatted dataset, select your base model, train with LoRA, export GGUF
    8. Evaluate on the held-out set
    9. Deploy the GGUF model on-device
    10. Iterate by collecting more data and retraining periodically
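Step 6, the 90/10 split, can be sketched as follows (seed and names are illustrative; shuffle before splitting so the evaluation set is not biased by collection order):

```python
import random

def split_dataset(examples, eval_fraction=0.1, seed=7):
    """Shuffle and split examples into training and held-out evaluation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)
```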

    Your app is generating training data right now. The question is whether you are capturing it.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

