    On-Device Content Generation: AI Drafts That Work Offline
    Tags: content generation, on-device AI, drafting, mobile AI, implementation


    How to build AI-powered drafting features that work without internet. Email replies, message suggestions, note expansion, and content templates generated entirely on the user's device.

    Ertas Team

    Email reply suggestions. Message autocomplete. Note expansion. Social post drafts. These features share a pattern: the user provides a brief input, and the AI generates a longer, polished output.

    Content generation is the second most natural fit for on-device AI (after classification). It leverages the strength of language models while staying within the performance budget of mobile hardware.

    What On-Device Generation Handles Well

    Short-Form Content (Under 200 Words)

    | Use Case | Input | Output | Model Size |
    |---|---|---|---|
    | Email replies | Incoming email + "Accept" | 2-3 sentence reply | 3B |
    | Message suggestions | Conversation context | 3-5 reply options (1 sentence each) | 1-3B |
    | Note expansion | Bullet points | Paragraphs | 3B |
    | Social captions | Photo context + keywords | 1-2 sentence caption | 1-3B |
    | Comment responses | Post + user sentiment | 1-2 sentence response | 1-3B |
    | Form filling | Field labels + context | Suggested values | 1B |

    Short-form generation is the sweet spot. The model produces 50-200 tokens (1-3 sentences to a short paragraph) in 2-5 seconds on a 3B model, 1-3 seconds on a 1B model.

    Medium-Form Content (200-500 Words)

    | Use Case | Input | Output | Model Size |
    |---|---|---|---|
    | Email drafts | Subject + key points | Full email body | 3B |
    | Meeting summaries | Transcript excerpt | Summary paragraph | 3B |
    | Product descriptions | Product name + features | Marketing copy | 3B |
    | Blog outlines | Topic + audience | Structured outline | 3B |

    Medium-form takes 5-15 seconds on a 3B model. This is acceptable when the user explicitly requests a draft (tapping a "Draft email" button) but too slow for inline suggestions.

    Architecture Patterns

    One-Tap Drafts

    The highest-engagement pattern. Present the user with a button that generates a complete draft based on context:

    [Incoming email about scheduling a meeting]
    
        [Accept]  [Decline]  [Suggest Alternative]
    
        > Taps "Accept"
    
        Draft: "Thanks for reaching out. Tuesday at 2 PM works well for
        me. I'll send a calendar invite. Looking forward to it."
    
        [Send] [Edit]
    

    The AI generates the draft based on the action the user selected. No typing required. The user reviews and sends (or edits first).
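
    The flow above can be sketched as a mapping from the tapped action to a generation prompt. This is a minimal illustration; `ReplyAction` and `buildOneTapPrompt` are illustrative names, not a real API:

    ```kotlin
    // Hypothetical sketch of the one-tap draft flow: each button maps to an
    // instruction that is folded into the prompt alongside the incoming message.
    enum class ReplyAction(val instruction: String) {
        ACCEPT("Accept the request politely and confirm the details"),
        DECLINE("Decline politely with a brief reason"),
        SUGGEST_ALTERNATIVE("Propose a different time or option")
    }

    fun buildOneTapPrompt(incoming: String, action: ReplyAction): String = """
        Write a brief reply to this message.
        Action: ${action.instruction}

        Message: $incoming

        Reply:
    """.trimIndent()
    ```

    The model only sees the instruction text, so each action can be tuned independently without changing the UI.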

    Inline Autocomplete

    As the user types, suggest completions:

    User types: "Thanks for the update. I'll review the..."
    Suggestion (gray text): "...document and get back to you by Friday."
    [Tab to accept]
    

    Autocomplete requires the lowest latency. The suggestion must appear within 200-300ms of the user pausing. This is achievable with 1B models on flagship devices (35-50 tok/s, i.e. 7-10 tokens in 200ms).
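
    A key implementation detail is only firing the completion request after the user pauses, so keystrokes don't queue up redundant generations. A minimal, dependency-free sketch (in a real Android app you would likely use a coroutine debounce instead; `AutocompletePacer` is an illustrative name):

    ```kotlin
    // Timestamp-based pause detection: only issue a completion request when
    // no keystroke has arrived within the pause window.
    class AutocompletePacer(private val pauseMs: Long = 250) {
        private var lastKeystroke = 0L

        fun onKeystroke(nowMs: Long) {
            lastKeystroke = nowMs
        }

        // Poll from a timer: true once the user has paused long enough that a
        // short (~10 token) completion request is worth issuing.
        fun shouldRequest(nowMs: Long): Boolean = nowMs - lastKeystroke >= pauseMs
    }
    ```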

    Template Expansion

    The user selects a template and the AI fills in contextual details:

    Template: "Follow-up after meeting"
    Context: Meeting with Sarah about Q3 budget review
    
    Generated:
    "Hi Sarah, thanks for the productive discussion about the Q3
    budget review today. As discussed, I'll prepare the revised
    projections by next Wednesday. Let me know if you need anything
    else in the meantime."
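
    Template expansion reduces to prompt assembly: the template supplies the instruction, the app supplies the context. A small sketch (the `DraftTemplate` type and field names are assumptions for illustration):

    ```kotlin
    // Each template carries a fixed instruction; context is injected per use.
    data class DraftTemplate(val name: String, val instruction: String)

    val followUp = DraftTemplate(
        name = "Follow-up after meeting",
        instruction = "Write a short follow-up email referencing the meeting context."
    )

    fun buildTemplatePrompt(template: DraftTemplate, context: String): String = """
        ${template.instruction}

        Context: $context

        Email:
    """.trimIndent()
    ```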
    

    Implementation

    The Generation Interface

    // iOS: Draft generation
    class DraftGenerator {
        private let model: LlamaContext
    
        func generateReply(
            incomingMessage: String,
            action: ReplyAction,
            onToken: @escaping (String) -> Void
        ) async -> String {
            let prompt = buildPrompt(message: incomingMessage, action: action)
            var fullResponse = ""
    
            await model.generate(prompt: prompt, maxTokens: 256) { token in
                fullResponse += token
                onToken(token) // Stream to UI
            }
    
            return fullResponse
        }
    
        private func buildPrompt(message: String, action: ReplyAction) -> String {
            return """
            Write a brief reply to this message. Action: \(action.rawValue)
    
            Message: \(message)
    
            Reply:
            """
        }
    }
    

    Multiple Suggestions

    Generate 2-3 alternative drafts and let the user choose:

    // Android: Generate multiple suggestions
    suspend fun generateSuggestions(
        context: String,
        count: Int = 3
    ): List<String> {
        return (1..count).map {
            model.generate(
                prompt = buildSuggestionPrompt(context),
                maxTokens = 64,
                temperature = 0.8f // Higher temperature for variety
            )
        }
    }
    

    Use temperature 0.7-0.9 for variety between suggestions. Lower temperature (0.1-0.3) for situations where you want one consistent, high-quality draft.

    Context Management

    Good drafts require good context. Provide the model with:

    • The content being replied to (email, message, post)
    • The user's action or intent (accept, decline, ask question)
    • Relevant metadata (sender name, subject, date)
    • The user's writing style (from training data)

    Keep total context under 500 tokens for fast generation.
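
    One way to enforce that budget is to rank context pieces by importance and drop the tail once the estimate is exhausted. A rough sketch, assuming a chars/4 token heuristic (a real tokenizer gives exact counts):

    ```kotlin
    // Keep assembled context under a token budget, most-important parts first.
    const val MAX_CONTEXT_TOKENS = 500

    // Rough approximation: ~4 characters per token for English text.
    fun approxTokens(text: String): Int = text.length / 4

    fun assembleContext(parts: List<String>): String {
        val kept = mutableListOf<String>()
        var budget = MAX_CONTEXT_TOKENS
        for (part in parts) { // parts ordered most- to least-important
            val cost = approxTokens(part)
            if (cost > budget) break // stop once the budget is exhausted
            kept += part
            budget -= cost
        }
        return kept.joinToString("\n\n")
    }
    ```

    Ordering matters: put the message being replied to and the user's intent first, metadata and style hints last, so the drops degrade gracefully.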

    Fine-Tuning for Content Generation

    Base models generate generic content. Fine-tuned models generate content in your app's style and for your app's specific use cases.

    Training Data Sources

    1. Existing user content (with permission): How do your users currently write emails, messages, and notes? Their style is the training target.
    2. Synthetic examples: Generate training pairs using a larger model, then validate for quality.
    3. Template-based generation: Create templates for common scenarios and generate variations.

    Training Focus Areas

    | Area | Training Examples | Impact |
    |---|---|---|
    | Tone/style consistency | 200-500 | High (makes output feel native) |
    | Domain vocabulary | 100-300 | High (uses correct terminology) |
    | Format adherence | 200-500 | High (correct structure every time) |
    | Length control | 100-200 | Medium (stays within target length) |
    | Edge case handling | 100-200 | Medium (graceful fallback) |

    Total: 700-1,700 training examples for a well-tuned content generation model.

    Platforms like Ertas handle the full pipeline. Upload your training conversations, fine-tune with LoRA, export GGUF. The model learns your app's content style and generates drafts that feel native.

    Performance Expectations

    Generation Speed (3B Model, Q4_K_M)

    | Device | Short Draft (50 tokens) | Medium Draft (200 tokens) |
    |---|---|---|
    | iPhone 16 Pro | 1.5-2.5s | 6-10s |
    | iPhone 15 | 2-3.5s | 8-14s |
    | Galaxy S24 | 1.5-2.5s | 6-10s |
    | Mid-range Android | 3-5s | 12-20s |

    Generation Speed (1B Model, Q4_K_M)

    | Device | Short Draft (50 tokens) | Suggestion (20 tokens) |
    |---|---|---|
    | iPhone 16 Pro | 1-1.5s | 0.4-0.6s |
    | iPhone 15 | 1.5-2s | 0.5-0.8s |
    | Galaxy S24 | 1-1.5s | 0.4-0.6s |
    | Mid-range Android | 2-3s | 0.8-1.2s |

    For inline autocomplete, use a 1B model. For full draft generation, use 3B.
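
    If an app ships both models, the routing can be a single lookup keyed on the task type. A minimal sketch (enum and function names are illustrative):

    ```kotlin
    // Route each generation task to the model size its latency budget allows.
    enum class GenTask { AUTOCOMPLETE, SHORT_DRAFT, MEDIUM_DRAFT }

    fun modelFor(task: GenTask): String = when (task) {
        GenTask.AUTOCOMPLETE -> "1B" // must stay under ~300ms per suggestion
        GenTask.SHORT_DRAFT -> "3B"  // 1.5-2.5s is fine after an explicit tap
        GenTask.MEDIUM_DRAFT -> "3B" // user requested a full draft and will wait
    }
    ```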

    Quality Considerations

    Hallucination Management

    Content generation models can invent details. For drafts, this means generating names, dates, or facts that were not in the context.

    Mitigation:

    • Provide complete context in the prompt (do not expect the model to know facts not given)
    • Use the fine-tuned model's tendency to follow training patterns (fine-tuned models hallucinate less on in-domain tasks)
    • Add a "Review before sending" step in the UI
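
    A cheap automated backstop (a heuristic sketch, not a production check) is to flag drafts containing numbers that never appeared in the context, since invented dates and times are a common failure mode in reply drafts:

    ```kotlin
    // Flag numbers (including times like "2:30") in the draft that are
    // absent from the provided context — candidates for invented details.
    fun suspiciousNumbers(draft: String, context: String): List<String> {
        val numberRe = Regex("""\d+(?::\d+)?""")
        val known = numberRe.findAll(context).map { it.value }.toSet()
        return numberRe.findAll(draft)
            .map { it.value }
            .filter { it !in known }
            .toList()
    }
    ```

    A non-empty result can trigger a stronger "Review before sending" prompt rather than blocking the draft outright.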

    Length Control

    Fine-tune with examples at your target length. If you want 2-3 sentence replies, train on 2-3 sentence examples. The model learns the expected output length from training data, not from instructions.

    Regeneration

    Always offer a "Regenerate" button. If the first draft misses the mark, the user can get a new one. With temperature above 0, each generation produces different output.

    The combination of instant generation, offline support, and zero per-use cost makes on-device content generation a high-value feature for any app where users write.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
