
Building an On-Device AI Assistant for Your Mobile App
Architecture patterns for building a conversational AI assistant that runs entirely on the user's device. Model selection, conversation management, UI patterns, and production considerations.
An on-device AI assistant is a conversational interface powered by a language model running locally on the user's phone. No cloud API. No network dependency. Instant responses. Complete privacy.
This guide covers the architecture from model selection through production deployment.
Architecture Overview
The on-device assistant has four layers:
- Model layer: llama.cpp loads and runs the GGUF model
- Conversation layer: Manages chat history, system prompts, and context windows
- Interface layer: Chat UI with streaming token display
- State layer: Persists conversations, manages model lifecycle
Model Selection for Chat
Conversational AI benefits from larger models. A fine-tuned 3B model is the recommended starting point:
| Model Size | Chat Quality | Multi-Turn Coherence | Recommended For |
|---|---|---|---|
| 1B | Adequate | 2-3 turns | Simple Q&A, FAQ bots |
| 3B | Good | 5-8 turns | Full chat assistants |
Fine-tuning is essential for chat. A base 3B model will generate generic responses. A fine-tuned 3B model speaks in your brand voice, knows your product, and handles your specific use cases.
Conversation Management
The Prompt Template
Every model family has a specific chat template. For Llama 3.2:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are the RecipeHelper assistant. Help users find and modify recipes. Always include prep and cook times.<|eot_id|><|start_header_id|>user<|end_header_id|>
Quick dinner for two?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
When fine-tuning, the training framework handles template formatting automatically. At inference time, your app must format the conversation into this template before sending to llama.cpp.
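This formatting step can be sketched in Kotlin (matching the Android snippets later in this post). The `Turn` and `Role` types are illustrative, not part of any library, and the exact header/newline placement should be checked against the template your model was trained with:

```kotlin
// Illustrative types for a chat turn; redefined here so the sketch is self-contained.
enum class Role { USER, ASSISTANT }
data class Turn(val role: Role, val content: String)

// Format a conversation into the Llama 3.2 chat template before
// handing the string to llama.cpp.
fun buildLlama3Prompt(systemPrompt: String, turns: List<Turn>): String {
    val sb = StringBuilder("<|begin_of_text|>")
    sb.append("<|start_header_id|>system<|end_header_id|>\n\n")
    sb.append(systemPrompt).append("<|eot_id|>")
    for (turn in turns) {
        val role = if (turn.role == Role.USER) "user" else "assistant"
        sb.append("<|start_header_id|>$role<|end_header_id|>\n\n")
        sb.append(turn.content).append("<|eot_id|>")
    }
    // Leave the assistant header open so the model generates the reply
    sb.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return sb.toString()
}
```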
Context Window Management
Mobile models typically use 2048-4096 token context windows (configurable, but larger windows use more memory and slow down inference). A conversation can exceed this quickly.
Sliding window approach: Keep the system prompt and the most recent N turns. Drop the oldest turns when the window fills:
func buildPrompt(systemPrompt: String, turns: [Turn], maxTokens: Int) -> String {
    var prompt = formatSystem(systemPrompt)
    var tokenCount = countTokens(prompt)

    // Add turns from newest to oldest; stop when the window is full
    var includedTurns: [Turn] = []
    for turn in turns.reversed() {
        let turnTokens = countTokens(formatTurn(turn))
        if tokenCount + turnTokens > maxTokens - 512 { break } // Reserve 512 tokens for the response
        includedTurns.insert(turn, at: 0)
        tokenCount += turnTokens
    }

    for turn in includedTurns {
        prompt += formatTurn(turn)
    }
    return prompt
}
Summary approach: When the conversation exceeds the window, summarize older turns into a compact context and prepend it to the system prompt. This preserves key information while staying within the token budget. However, this requires an additional inference call.
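The bookkeeping around the summary approach can be sketched as follows. The `summarize` parameter stands in for a hypothetical call into the local model (an extra inference pass) and is an assumption, not a real API; turns are simplified to pre-formatted strings:

```kotlin
// Sketch: compress older turns into a summary, keep recent turns verbatim.
// `summarize` is a stand-in for an extra inference call into the local model.
fun compactHistory(
    systemPrompt: String,
    turns: List<String>,
    keepRecent: Int,
    summarize: (List<String>) -> String
): Pair<String, List<String>> {
    if (turns.size <= keepRecent) return systemPrompt to turns
    val older = turns.dropLast(keepRecent)
    val recent = turns.takeLast(keepRecent)
    // Fold the summary into the system prompt so it survives the sliding window
    val augmented = "$systemPrompt\n\nConversation so far: ${summarize(older)}"
    return augmented to recent
}
```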
Conversation Persistence
Store conversations in local storage (Core Data on iOS, Room on Android) so users can resume where they left off:
- Save each message (role, content, timestamp)
- Save conversation metadata (title, created date, last active)
- Limit stored conversations to prevent unbounded storage growth
- Allow users to delete conversations
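A plain-Kotlin sketch of this persistence model, with a pruning rule for the storage cap. In a real app these would be Room entities on Android or Core Data objects on iOS; the field names here are illustrative:

```kotlin
// Illustrative persistence model; in production these map to Room / Core Data.
data class StoredMessage(val role: String, val content: String, val timestampMs: Long)

data class Conversation(
    val id: String,
    val title: String,           // e.g. derived from the first user message
    val createdMs: Long,
    val lastActiveMs: Long,
    val messages: List<StoredMessage> = emptyList()
)

// Cap stored conversations, dropping the least recently active first.
fun prune(conversations: List<Conversation>, maxStored: Int): List<Conversation> =
    conversations.sortedByDescending { it.lastActiveMs }.take(maxStored)
```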
Streaming UI
Chat interfaces should display tokens as they are generated. This creates the perception of a fast, responsive assistant.
iOS (SwiftUI)
struct ChatView: View {
    @StateObject var viewModel = ChatViewModel()

    var body: some View {
        VStack {
            ScrollView {
                ForEach(viewModel.messages) { message in
                    MessageBubble(message: message)
                }
                if !viewModel.streamingText.isEmpty {
                    StreamingBubble(text: viewModel.streamingText)
                }
            }
            .onChange(of: viewModel.streamingText) { _ in
                // Auto-scroll to bottom
            }
            HStack {
                TextField("Message", text: $viewModel.input)
                Button("Send") { viewModel.send() }
            }
        }
    }
}
Android (Compose)
@Composable
fun ChatScreen(viewModel: ChatViewModel) {
    val messages by viewModel.messages.collectAsState()
    val streaming by viewModel.streamingText.collectAsState()
    var input by remember { mutableStateOf("") }

    Column {
        LazyColumn(modifier = Modifier.weight(1f)) {
            items(messages) { message ->
                MessageBubble(message)
            }
            if (streaming.isNotEmpty()) {
                item { StreamingBubble(streaming) }
            }
        }
        Row {
            TextField(value = input, onValueChange = { input = it })
            Button(onClick = { viewModel.send(input); input = "" }) { Text("Send") }
        }
    }
}
Token Display Cadence
llama.cpp generates tokens one at a time. Displaying each token individually can cause visual jitter. Buffer 2-3 tokens before updating the UI for smoother text appearance:
private var tokenBuffer = StringBuilder()
private var bufferCount = 0

fun onToken(token: String) {
    tokenBuffer.append(token)
    bufferCount++
    if (bufferCount >= 3 || token.contains("\n")) {
        flush()
    }
}

// Call when generation completes so trailing buffered tokens are not dropped
fun onGenerationComplete() {
    flush()
}

private fun flush() {
    if (tokenBuffer.isEmpty()) return
    updateUI(tokenBuffer.toString()) // Appends the buffered text to the streaming bubble
    tokenBuffer.clear()
    bufferCount = 0
}
Model Lifecycle Management
Loading and Unloading
Model loading takes 1-3 seconds depending on model size and device. Unloading is instant. Manage the lifecycle to balance responsiveness with memory:
- Load on first AI interaction: Do not load at app launch. Load when the user opens the chat feature.
- Keep loaded during active session: While the user is in the chat, keep the model in memory.
- Unload on navigate away: When the user leaves the chat screen, unload the model to free RAM.
- Handle memory warnings: Register for system memory warnings and unload the model if triggered.
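These rules can be sketched as a small lifecycle manager. `ModelRuntime` here is an assumed wrapper around your llama.cpp bindings, not a real API:

```kotlin
// `ModelRuntime` is a hypothetical wrapper over llama.cpp bindings.
interface ModelRuntime {
    fun load(path: String)
    fun unload()
}

class ModelLifecycle(private val runtime: ModelRuntime, private val modelPath: String) {
    var isLoaded = false
        private set

    // Called on the first AI interaction, not at app launch.
    fun ensureLoaded() {
        if (!isLoaded) { runtime.load(modelPath); isLoaded = true }
    }

    // Called when the user leaves the chat screen or on a system memory warning.
    fun release() {
        if (isLoaded) { runtime.unload(); isLoaded = false }
    }
}
```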
Loading Indicator
Show a brief loading state (1-3 seconds) when the model first loads. After that, responses start generating immediately. Users are accustomed to brief loading states for new features.
Fine-Tuning for Your Assistant
The quality gap between a base model and a fine-tuned model is dramatic for chat:
| Metric | Base 3B | Fine-Tuned 3B |
|---|---|---|
| On-topic responses | 60-70% | 92-96% |
| Format adherence | 55-65% | 94-98% |
| Domain accuracy | 50-60% | 88-94% |
| Tone consistency | 40-50% | 90-95% |
Training Data for Chat
Create training examples that cover:
- Common questions: The 50-100 questions users ask most frequently
- Edge cases: Questions that are out of scope, with graceful redirect responses
- Multi-turn patterns: Conversations that span 3-5 turns showing natural follow-up
- Style examples: Responses in your brand voice with your preferred formatting
500-2,000 training conversations produce a high-quality chat assistant. Platforms like Ertas handle the training pipeline: upload conversation examples, fine-tune with LoRA, export GGUF.
Production Considerations
Response Quality Guardrails
On-device models can generate off-topic or incorrect responses. Implement lightweight guardrails:
- Input validation: Check for extremely long inputs or obvious non-text content
- Output monitoring: Log response topics (locally) to identify quality issues
- Feedback mechanism: Let users flag bad responses. Use this feedback to improve training data
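A minimal input-validation sketch along these lines; the length cap and noise threshold are illustrative numbers, not recommendations from any library:

```kotlin
// Minimal input guardrail sketch; thresholds are illustrative.
fun validateInput(text: String, maxChars: Int = 2000): Boolean {
    if (text.isBlank() || text.length > maxChars) return false
    // Reject inputs that are mostly non-text noise (e.g. pasted binary data)
    val textLike = text.count { it.isLetterOrDigit() || it.isWhitespace() }
    return textLike.toDouble() / text.length > 0.5
}
```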
Performance Monitoring
Track on-device metrics:
- Time to first token (should be under 300ms)
- Tokens per second (should stay above 10)
- Model load time
- Memory usage during inference
- Crash rate during AI interactions
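The first two metrics can be captured with a small helper around each generation. This is a sketch; timestamps are passed in by the caller so the logic stays testable:

```kotlin
// Tracks time-to-first-token and tokens/sec for one generation.
class GenerationMetrics(private val startMs: Long) {
    var firstTokenMs: Long? = null   // target: under 300ms
        private set
    private var tokenCount = 0

    fun onToken(nowMs: Long) {
        if (firstTokenMs == null) firstTokenMs = nowMs - startMs
        tokenCount++
    }

    // Target: above 10 tokens/sec
    fun tokensPerSecond(nowMs: Long): Double {
        val elapsed = (nowMs - startMs).coerceAtLeast(1L)
        return tokenCount * 1000.0 / elapsed
    }
}
```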
Model Updates
Push model improvements via your normal model delivery pipeline:
- Check for updates on app launch (when connected)
- Download in background
- Swap the model file on next chat session start
- Keep the old model as fallback until the new one is validated
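The swap-with-fallback step might look like the following sketch. The `validate` callback is a hypothetical check (e.g. load the candidate and run a test prompt); it is not part of any real delivery API:

```kotlin
import java.io.File

// Swap to a downloaded candidate model, keeping the old file as fallback
// until the candidate passes validation. `validate` is a hypothetical check.
fun swapModel(current: File, candidate: File, validate: (File) -> Boolean): File {
    if (!candidate.exists()) return current
    return if (validate(candidate)) {
        current.delete()   // Candidate validated: old model no longer needed
        candidate
    } else {
        candidate.delete() // Validation failed: discard candidate, keep fallback
        current
    }
}
```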
The result is a chat assistant that works everywhere, responds instantly, costs nothing per conversation, and keeps user data completely private.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.