
On-Device Content Generation: AI Drafts That Work Offline
How to build AI-powered drafting features that work without internet. Email replies, message suggestions, note expansion, and content templates generated entirely on the user's device.
Email reply suggestions. Message autocomplete. Note expansion. Social post drafts. These features share a pattern: the user provides a brief input, and the AI generates a longer, polished output.
Content generation is the second most natural fit for on-device AI (after classification). It leverages the strength of language models while staying within the performance budget of mobile hardware.
What On-Device Generation Handles Well
Short-Form Content (Under 200 Words)
| Use Case | Input | Output | Model Size |
|---|---|---|---|
| Email replies | Incoming email + "Accept" | 2-3 sentence reply | 3B |
| Message suggestions | Conversation context | 3-5 reply options (1 sentence each) | 1-3B |
| Note expansion | Bullet points | Paragraphs | 3B |
| Social captions | Photo context + keywords | 1-2 sentence caption | 1-3B |
| Comment responses | Post + user sentiment | 1-2 sentence response | 1-3B |
| Form filling | Field labels + context | Suggested values | 1B |
Short-form generation is the sweet spot. Producing 50-200 tokens (one to three sentences up to a short paragraph) takes 2-5 seconds on a 3B model and 1-3 seconds on a 1B model.
Medium-Form Content (200-500 Words)
| Use Case | Input | Output | Model Size |
|---|---|---|---|
| Email drafts | Subject + key points | Full email body | 3B |
| Meeting summaries | Transcript excerpt | Summary paragraph | 3B |
| Product descriptions | Product name + features | Marketing copy | 3B |
| Blog outlines | Topic + audience | Structured outline | 3B |
Medium-form takes 5-15 seconds on a 3B model. This is acceptable when the user explicitly requests a draft (tapping a "Draft email" button) but too slow for inline suggestions.
Architecture Patterns
One-Tap Drafts
The highest-engagement pattern. Present the user with a button that generates a complete draft based on context:
[Incoming email about scheduling a meeting]
[Accept] [Decline] [Suggest Alternative]
> Taps "Accept"
Draft: "Thanks for reaching out. Tuesday at 2 PM works well for
me. I'll send a calendar invite. Looking forward to it."
[Send] [Edit]
The AI generates the draft based on the action the user selected. No typing required. The user reviews and sends (or edits first).
Inline Autocomplete
As the user types, suggest completions:
User types: "Thanks for the update. I'll review the..."
Suggestion (gray text): "...document and get back to you by Friday."
[Tab to accept]
Autocomplete requires the lowest latency. The suggestion must appear within 200-300ms of the user pausing. This is achievable with 1B models on flagship devices (35-50 tok/s yields 7-10 tokens, roughly 5-8 words, in 200ms).
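One way to hit that window is to debounce keystrokes and cancel any in-flight generation the moment the user resumes typing. A minimal sketch using Swift structured concurrency; the completeText closure is a placeholder for whatever wraps your 1B model:

// Debounced inline autocomplete: wait for a typing pause, then generate.
final class AutocompleteController {
    private let completeText: (String) async -> String  // wraps the 1B model
    private var pendingTask: Task<Void, Never>?

    init(completeText: @escaping (String) async -> String) {
        self.completeText = completeText
    }

    func textDidChange(_ text: String, showSuggestion: @escaping (String) -> Void) {
        pendingTask?.cancel()  // A new keystroke invalidates the old request
        pendingTask = Task { [completeText] in
            // Only fire after the user has paused for ~250 ms
            try? await Task.sleep(nanoseconds: 250_000_000)
            guard !Task.isCancelled else { return }
            let suggestion = await completeText(text)
            guard !Task.isCancelled else { return }
            await MainActor.run { showSuggestion(suggestion) }
        }
    }
}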
Template Expansion
The user selects a template and the AI fills in contextual details:
Template: "Follow-up after meeting"
Context: Meeting with Sarah about Q3 budget review
Generated:
"Hi Sarah, thanks for the productive discussion about the Q3
budget review today. As discussed, I'll prepare the revised
projections by next Wednesday. Let me know if you need anything
else in the meantime."
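Under the hood, a template is just a prompt scaffold that the context fills in. A minimal sketch; the DraftTemplate struct and instruction wording are illustrative, not a fixed API:

// A template pairs a fixed instruction with a slot the context fills in.
struct DraftTemplate {
    let name: String
    let instruction: String
}

func buildTemplatePrompt(template: DraftTemplate, context: String) -> String {
    return """
    \(template.instruction)
    Context: \(context)
    Draft:
    """
}

// Usage:
let followUp = DraftTemplate(
    name: "Follow-up after meeting",
    instruction: "Write a brief, friendly follow-up email after a meeting."
)
let prompt = buildTemplatePrompt(
    template: followUp,
    context: "Meeting with Sarah about Q3 budget review"
)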
Implementation
The Generation Interface
// iOS: Draft generation
// ReplyAction mirrors the one-tap actions shown above.
enum ReplyAction: String {
    case accept = "Accept"
    case decline = "Decline"
    case suggestAlternative = "Suggest an alternative"
}

class DraftGenerator {
    private let model: LlamaContext

    init(model: LlamaContext) {
        self.model = model
    }

    func generateReply(
        incomingMessage: String,
        action: ReplyAction,
        onToken: @escaping (String) -> Void
    ) async -> String {
        let prompt = buildPrompt(message: incomingMessage, action: action)
        var fullResponse = ""
        await model.generate(prompt: prompt, maxTokens: 256) { token in
            fullResponse += token
            onToken(token) // Stream each token to the UI as it arrives
        }
        return fullResponse
    }

    private func buildPrompt(message: String, action: ReplyAction) -> String {
        return """
        Write a brief reply to this message. Action: \(action.rawValue)
        Message: \(message)
        Reply:
        """
    }
}
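Calling it from the UI layer is straightforward. In this sketch, llamaContext, email.body, and draftLabel are placeholders for your own model handle and view code:

// Usage: stream the draft into the UI as it generates
let generator = DraftGenerator(model: llamaContext)
let draft = await generator.generateReply(
    incomingMessage: email.body,
    action: .accept
) { token in
    Task { @MainActor in draftLabel.text = (draftLabel.text ?? "") + token }
}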
Multiple Suggestions
Generate 2-3 alternative drafts and let the user choose:
// Android: Generate multiple suggestions
suspend fun generateSuggestions(
    context: String,
    count: Int = 3
): List<String> {
    return (1..count).map {
        model.generate(
            prompt = buildSuggestionPrompt(context),
            maxTokens = 64,
            temperature = 0.8f // Higher temperature for variety
        )
    }
}
Use a temperature of 0.7-0.9 for variety between suggestions, and a lower temperature (0.1-0.3) when you want one consistent, high-quality draft.
Context Management
Good drafts require good context. Provide the model with:
- The content being replied to (email, message, post)
- The user's action or intent (accept, decline, ask question)
- Relevant metadata (sender name, subject, date)
- The user's writing style (learned through fine-tuning, covered below)
Keep total context under 500 tokens for fast generation.
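A practical way to stay under that budget is to assemble the pieces in priority order and truncate the lowest-priority one first. A sketch, assuming the rough heuristic of ~4 characters per token:

// Assemble prompt context in priority order, truncating to a token budget.
func assembleContext(
    message: String,      // the content being replied to
    intent: String,       // the user's action, e.g. "Accept"
    metadata: String,     // sender, subject, date
    tokenBudget: Int = 500
) -> String {
    let charBudget = tokenBudget * 4  // ~4 characters per token
    var context = "Action: \(intent)\n\(metadata)\n"
    let remaining = max(0, charBudget - context.count)
    // Truncate the message body rather than the intent or metadata
    context += "Message: \(String(message.prefix(remaining)))"
    return context
}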
Fine-Tuning for Content Generation
Base models generate generic content. Fine-tuned models generate content in your app's style and for your app's specific use cases.
Training Data Sources
- Existing user content (with permission): How do your users currently write emails, messages, and notes? Their style is the training target.
- Synthetic examples: Generate training pairs using a larger model, then validate for quality.
- Template-based generation: Create templates for common scenarios and generate variations.
Training Focus Areas
| Area | Training Examples | Impact |
|---|---|---|
| Tone/style consistency | 200-500 | High (makes output feel native) |
| Domain vocabulary | 100-300 | High (uses correct terminology) |
| Format adherence | 200-500 | High (correct structure every time) |
| Length control | 100-200 | Medium (stays within target length) |
| Edge case handling | 100-200 | Medium (graceful fallback) |
Total: 700-1,700 training examples for a well-tuned content generation model.
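The exact schema depends on your fine-tuning tooling, but each training example typically pairs the context the model will see at inference time with the draft it should produce. An illustrative JSONL pair:

{"prompt": "Write a brief reply to this message. Action: Accept\nMessage: Can we move our sync to Tuesday at 2 PM?\nReply:", "completion": "Thanks for the heads-up. Tuesday at 2 PM works for me. I'll update the invite."}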
Platforms like Ertas handle the full pipeline. Upload your training conversations, fine-tune with LoRA, export GGUF. The model learns your app's content style and generates drafts that feel native.
Performance Expectations
Generation Speed (3B Model, Q4_K_M)
| Device | Short Draft (50 tokens) | Medium Draft (200 tokens) |
|---|---|---|
| iPhone 16 Pro | 1.5-2.5s | 6-10s |
| iPhone 15 | 2-3.5s | 8-14s |
| Galaxy S24 | 1.5-2.5s | 6-10s |
| Mid-range Android | 3-5s | 12-20s |
Generation Speed (1B Model, Q4_K_M)
| Device | Short Draft (50 tokens) | Suggestion (20 tokens) |
|---|---|---|
| iPhone 16 Pro | 1-1.5s | 0.4-0.6s |
| iPhone 15 | 1.5-2s | 0.5-0.8s |
| Galaxy S24 | 1-1.5s | 0.4-0.6s |
| Mid-range Android | 2-3s | 0.8-1.2s |
For inline autocomplete, use a 1B model. For full draft generation, use 3B.
Quality Considerations
Hallucination Management
Content generation models can invent details. For drafts, this means generating names, dates, or facts that were not in the context.
Mitigation:
- Provide complete context in the prompt (do not expect the model to know facts not given)
- Use the fine-tuned model's tendency to follow training patterns (fine-tuned models hallucinate less on in-domain tasks)
- Add a "Review before sending" step in the UI
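A lightweight guard for that review step is to flag drafts that mention specifics absent from the source context. This sketch checks weekdays and clock times; the heuristic is illustrative, and a real app might extend it to names with NLTagger:

import Foundation

// Flag a draft for manual review if it contains a day-of-week or clock
// time that never appeared in the source context (a hallucination risk).
func draftNeedsReview(draft: String, context: String) -> Bool {
    let pattern = #"\b(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday|\d{1,2}(:\d{2})?\s?(AM|PM|am|pm))\b"#
    let regex = try! NSRegularExpression(pattern: pattern)
    let matches = regex.matches(in: draft, range: NSRange(draft.startIndex..., in: draft))
    return matches.contains { match in
        guard let range = Range(match.range, in: draft) else { return false }
        let specific = String(draft[range])
        return !context.localizedCaseInsensitiveContains(specific)
    }
}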
Length Control
Fine-tune with examples at your target length. If you want 2-3 sentence replies, train on 2-3 sentence examples. The model learns the expected output length from training data, not from instructions.
Regeneration
Always offer a "Regenerate" button. If the first draft misses the mark, the user can get a new one. With temperature above 0, each generation produces different output.
The combination of instant generation, offline support, and zero per-use cost makes on-device content generation a high-value feature for any app where users write.