
On-Device Content Generation: AI Drafts That Work Offline
How to build AI-powered drafting features that work without internet. Email replies, message suggestions, note expansion, and content templates generated entirely on the user's device.
Email reply suggestions. Message autocomplete. Note expansion. Social post drafts. These features share a pattern: the user provides a brief input, and the AI generates a longer, polished output.
Content generation is the second most natural fit for on-device AI (after classification). It leverages the strength of language models while staying within the performance budget of mobile hardware.
What On-Device Generation Handles Well
Short-Form Content (Under 200 Words)
| Use Case | Input | Output | Model Size |
|---|---|---|---|
| Email replies | Incoming email + "Accept" | 2-3 sentence reply | 3B |
| Message suggestions | Conversation context | 3-5 reply options (1 sentence each) | 1-3B |
| Note expansion | Bullet points | Paragraphs | 3B |
| Social captions | Photo context + keywords | 1-2 sentence caption | 1-3B |
| Comment responses | Post + user sentiment | 1-2 sentence response | 1-3B |
| Form filling | Field labels + context | Suggested values | 1B |
Short-form generation is the sweet spot. Producing 50-200 tokens (one to three sentences up to a short paragraph) takes 2-5 seconds on a 3B model and 1-3 seconds on a 1B model.
Medium-Form Content (200-500 Words)
| Use Case | Input | Output | Model Size |
|---|---|---|---|
| Email drafts | Subject + key points | Full email body | 3B |
| Meeting summaries | Transcript excerpt | Summary paragraph | 3B |
| Product descriptions | Product name + features | Marketing copy | 3B |
| Blog outlines | Topic + audience | Structured outline | 3B |
Medium-form takes 5-15 seconds on a 3B model. This is acceptable when the user explicitly requests a draft (tapping a "Draft email" button) but too slow for inline suggestions.
Architecture Patterns
One-Tap Drafts
The highest-engagement pattern. Present the user with a button that generates a complete draft based on context:
[Incoming email about scheduling a meeting]
[Accept] [Decline] [Suggest Alternative]
> Taps "Accept"
Draft: "Thanks for reaching out. Tuesday at 2 PM works well for
me. I'll send a calendar invite. Looking forward to it."
[Send] [Edit]
The AI generates the draft based on the action the user selected. No typing required. The user reviews and sends (or edits first).
Inline Autocomplete
As the user types, suggest completions:
User types: "Thanks for the update. I'll review the..."
Suggestion (gray text): "...document and get back to you by Friday."
[Tab to accept]
Autocomplete requires the lowest latency. The suggestion must appear within 200-300ms of the user pausing. This is achievable with 1B models on flagship devices (35-50 tok/s yields 7-10 tokens, roughly 5-8 words, in 200ms).
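One way to hit that window is to debounce keystrokes and cancel any in-flight generation the moment the user resumes typing. A minimal sketch using Swift structured concurrency; the completeText closure is a placeholder for whatever wraps your 1B model:

// Debounced inline autocomplete: wait for a typing pause, then generate.
final class AutocompleteController {
    private let completeText: (String) async -> String  // wraps the 1B model
    private var pendingTask: Task<Void, Never>?

    init(completeText: @escaping (String) async -> String) {
        self.completeText = completeText
    }

    func textDidChange(_ text: String, showSuggestion: @escaping (String) -> Void) {
        pendingTask?.cancel()  // A new keystroke invalidates the old request
        pendingTask = Task { [completeText] in
            // Only fire after the user has paused for ~250 ms
            try? await Task.sleep(nanoseconds: 250_000_000)
            guard !Task.isCancelled else { return }
            let suggestion = await completeText(text)
            guard !Task.isCancelled else { return }
            await MainActor.run { showSuggestion(suggestion) }
        }
    }
}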
Template Expansion
The user selects a template and the AI fills in contextual details:
Template: "Follow-up after meeting"
Context: Meeting with Sarah about Q3 budget review
Generated:
"Hi Sarah, thanks for the productive discussion about the Q3
budget review today. As discussed, I'll prepare the revised
projections by next Wednesday. Let me know if you need anything
else in the meantime."
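Under the hood, a template is just a prompt scaffold that the context fills in. A minimal sketch; the DraftTemplate struct and instruction wording are illustrative, not a fixed API:

// A template pairs a fixed instruction with a slot the context fills in.
struct DraftTemplate {
    let name: String
    let instruction: String
}

func buildTemplatePrompt(template: DraftTemplate, context: String) -> String {
    return """
    \(template.instruction)
    Context: \(context)
    Draft:
    """
}

// Usage:
let followUp = DraftTemplate(
    name: "Follow-up after meeting",
    instruction: "Write a brief, friendly follow-up email after a meeting."
)
let prompt = buildTemplatePrompt(
    template: followUp,
    context: "Meeting with Sarah about Q3 budget review"
)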
Implementation
The Generation Interface
// iOS: Draft generation
// ReplyAction mirrors the one-tap actions shown above.
enum ReplyAction: String {
    case accept = "Accept"
    case decline = "Decline"
    case suggestAlternative = "Suggest an alternative"
}

class DraftGenerator {
    private let model: LlamaContext

    init(model: LlamaContext) {
        self.model = model
    }

    func generateReply(
        incomingMessage: String,
        action: ReplyAction,
        onToken: @escaping (String) -> Void
    ) async -> String {
        let prompt = buildPrompt(message: incomingMessage, action: action)
        var fullResponse = ""
        await model.generate(prompt: prompt, maxTokens: 256) { token in
            fullResponse += token
            onToken(token) // Stream each token to the UI as it arrives
        }
        return fullResponse
    }

    private func buildPrompt(message: String, action: ReplyAction) -> String {
        return """
        Write a brief reply to this message. Action: \(action.rawValue)
        Message: \(message)
        Reply:
        """
    }
}
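Calling it from the UI layer is straightforward. In this sketch, llamaContext, email.body, and draftLabel are placeholders for your own model handle and view code:

// Usage: stream the draft into the UI as it generates
let generator = DraftGenerator(model: llamaContext)
let draft = await generator.generateReply(
    incomingMessage: email.body,
    action: .accept
) { token in
    Task { @MainActor in draftLabel.text = (draftLabel.text ?? "") + token }
}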
Multiple Suggestions
Generate 2-3 alternative drafts and let the user choose:
// Android: Generate multiple suggestions
suspend fun generateSuggestions(
    context: String,
    count: Int = 3
): List<String> {
    return (1..count).map {
        model.generate(
            prompt = buildSuggestionPrompt(context),
            maxTokens = 64,
            temperature = 0.8f // Higher temperature for variety
        )
    }
}
Use a temperature of 0.7-0.9 for variety between suggestions, and a lower temperature (0.1-0.3) when you want one consistent, high-quality draft.
Context Management
Good drafts require good context. Provide the model with:
- The content being replied to (email, message, post)
- The user's action or intent (accept, decline, ask question)
- Relevant metadata (sender name, subject, date)
- The user's writing style (learned through fine-tuning, covered below)
Keep total context under 500 tokens for fast generation.
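A practical way to stay under that budget is to assemble the pieces in priority order and truncate the lowest-priority one first. A sketch, assuming the rough heuristic of ~4 characters per token:

// Assemble prompt context in priority order, truncating to a token budget.
func assembleContext(
    message: String,      // the content being replied to
    intent: String,       // the user's action, e.g. "Accept"
    metadata: String,     // sender, subject, date
    tokenBudget: Int = 500
) -> String {
    let charBudget = tokenBudget * 4  // ~4 characters per token
    var context = "Action: \(intent)\n\(metadata)\n"
    let remaining = max(0, charBudget - context.count)
    // Truncate the message body rather than the intent or metadata
    context += "Message: \(String(message.prefix(remaining)))"
    return context
}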
Fine-Tuning for Content Generation
Base models generate generic content. Fine-tuned models generate content in your app's style and for your app's specific use cases.
Training Data Sources
- Existing user content (with permission): How do your users currently write emails, messages, and notes? Their style is the training target.
- Synthetic examples: Generate training pairs using a larger model, then validate for quality.
- Template-based generation: Create templates for common scenarios and generate variations.
Training Focus Areas
| Area | Training Examples | Impact |
|---|---|---|
| Tone/style consistency | 200-500 | High (makes output feel native) |
| Domain vocabulary | 100-300 | High (uses correct terminology) |
| Format adherence | 200-500 | High (correct structure every time) |
| Length control | 100-200 | Medium (stays within target length) |
| Edge case handling | 100-200 | Medium (graceful fallback) |
Total: 700-1,700 training examples for a well-tuned content generation model.
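The exact schema depends on your fine-tuning tooling, but each training example typically pairs the context the model will see at inference time with the draft it should produce. An illustrative JSONL pair:

{"prompt": "Write a brief reply to this message. Action: Accept\nMessage: Can we move our sync to Tuesday at 2 PM?\nReply:", "completion": "Thanks for the heads-up. Tuesday at 2 PM works for me. I'll update the invite."}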
Platforms like Ertas handle the full pipeline. Upload your training conversations, fine-tune with LoRA, export GGUF. The model learns your app's content style and generates drafts that feel native.
Performance Expectations
Generation Speed (3B Model, Q4_K_M)
| Device | Short Draft (50 tokens) | Medium Draft (200 tokens) |
|---|---|---|
| iPhone 16 Pro | 1.5-2.5s | 6-10s |
| iPhone 15 | 2-3.5s | 8-14s |
| Galaxy S24 | 1.5-2.5s | 6-10s |
| Mid-range Android | 3-5s | 12-20s |
Generation Speed (1B Model, Q4_K_M)
| Device | Short Draft (50 tokens) | Suggestion (20 tokens) |
|---|---|---|
| iPhone 16 Pro | 1-1.5s | 0.4-0.6s |
| iPhone 15 | 1.5-2s | 0.5-0.8s |
| Galaxy S24 | 1-1.5s | 0.4-0.6s |
| Mid-range Android | 2-3s | 0.8-1.2s |
For inline autocomplete, use a 1B model. For full draft generation, use 3B.
Quality Considerations
Hallucination Management
Content generation models can invent details. For drafts, this means generating names, dates, or facts that were not in the context.
Mitigation:
- Provide complete context in the prompt (do not expect the model to know facts not given)
- Use the fine-tuned model's tendency to follow training patterns (fine-tuned models hallucinate less on in-domain tasks)
- Add a "Review before sending" step in the UI
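A lightweight guard for that review step is to flag drafts that mention specifics absent from the source context. This sketch checks weekdays and clock times; the heuristic is illustrative, and a real app might extend it to names with NLTagger:

import Foundation

// Flag a draft for manual review if it contains a day-of-week or clock
// time that never appeared in the source context (a hallucination risk).
func draftNeedsReview(draft: String, context: String) -> Bool {
    let pattern = #"\b(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday|\d{1,2}(:\d{2})?\s?(AM|PM|am|pm))\b"#
    let regex = try! NSRegularExpression(pattern: pattern)
    let matches = regex.matches(in: draft, range: NSRange(draft.startIndex..., in: draft))
    return matches.contains { match in
        guard let range = Range(match.range, in: draft) else { return false }
        let specific = String(draft[range])
        return !context.localizedCaseInsensitiveContains(specific)
    }
}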
Length Control
Fine-tune with examples at your target length. If you want 2-3 sentence replies, train on 2-3 sentence examples. The model learns the expected output length from training data, not from instructions.
Regeneration
Always offer a "Regenerate" button. If the first draft misses the mark, the user can get a new one. With temperature above 0, each generation produces different output.
The combination of instant generation, offline support, and zero per-use cost makes on-device content generation a high-value feature for any app where users write.