
Building an On-Device AI Assistant for Your Mobile App
Architecture patterns for building a conversational AI assistant that runs entirely on the user's device. Model selection, conversation management, UI patterns, and production considerations.
An on-device AI assistant is a conversational interface powered by a language model running locally on the user's phone. No cloud API. No network dependency. Instant responses. Complete privacy.
This guide covers the architecture from model selection through production deployment.
Architecture Overview
The on-device assistant has four layers:
- Model layer: llama.cpp loads and runs the GGUF model
- Conversation layer: Manages chat history, system prompts, and context windows
- Interface layer: Chat UI with streaming token display
- State layer: Persists conversations, manages model lifecycle
Model Selection for Chat
Conversational AI benefits from larger models. A fine-tuned 3B model is the recommended starting point:
| Model Size | Chat Quality | Multi-Turn Coherence | Recommended For |
|---|---|---|---|
| 1B | Adequate | 2-3 turns | Simple Q&A, FAQ bots |
| 3B | Good | 5-8 turns | Full chat assistants |
Fine-tuning is essential for chat. A base 3B model will generate generic responses. A fine-tuned 3B model speaks in your brand voice, knows your product, and handles your specific use cases.
Conversation Management
The Prompt Template
Every model family has a specific chat template. For Llama 3.2:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are the RecipeHelper assistant. Help users find and modify recipes. Always include prep and cook times.<|eot_id|><|start_header_id|>user<|end_header_id|>
Quick dinner for two?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
When fine-tuning, the training framework handles template formatting automatically. At inference time, your app must format the conversation into this template before sending to llama.cpp.
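This formatting step can be sketched in Kotlin (matching the Android snippets later in this post). The `Turn` and `Role` types are illustrative, not part of any library, and the exact header/newline placement should be checked against the template your model was trained with:

```kotlin
// Illustrative types for a chat turn; redefined here so the sketch is self-contained.
enum class Role { USER, ASSISTANT }
data class Turn(val role: Role, val content: String)

// Format a conversation into the Llama 3.2 chat template before
// handing the string to llama.cpp.
fun buildLlama3Prompt(systemPrompt: String, turns: List<Turn>): String {
    val sb = StringBuilder("<|begin_of_text|>")
    sb.append("<|start_header_id|>system<|end_header_id|>\n\n")
    sb.append(systemPrompt).append("<|eot_id|>")
    for (turn in turns) {
        val role = if (turn.role == Role.USER) "user" else "assistant"
        sb.append("<|start_header_id|>$role<|end_header_id|>\n\n")
        sb.append(turn.content).append("<|eot_id|>")
    }
    // Leave the assistant header open so the model generates the reply
    sb.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return sb.toString()
}
```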
Context Window Management
Mobile models typically use 2048-4096 token context windows (configurable, but larger windows use more memory and slow down inference). A conversation can exceed this quickly.
Sliding window approach: Keep the system prompt and the most recent N turns. Drop the oldest turns when the window fills:
func buildPrompt(systemPrompt: String, turns: [Turn], maxTokens: Int) -> String {
    var prompt = formatSystem(systemPrompt)
    var tokenCount = countTokens(prompt)

    // Add turns from newest to oldest; stop when the window is full
    var includedTurns: [Turn] = []
    for turn in turns.reversed() {
        let turnTokens = countTokens(formatTurn(turn))
        if tokenCount + turnTokens > maxTokens - 512 { break } // Reserve 512 tokens for the response
        includedTurns.insert(turn, at: 0)
        tokenCount += turnTokens
    }

    for turn in includedTurns {
        prompt += formatTurn(turn)
    }
    return prompt
}
Summary approach: When the conversation exceeds the window, summarize older turns into a compact context and prepend it to the system prompt. This preserves key information while staying within the token budget. However, this requires an additional inference call.
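The bookkeeping around the summary approach can be sketched as follows. The `summarize` parameter stands in for a hypothetical call into the local model (an extra inference pass) and is an assumption, not a real API; turns are simplified to pre-formatted strings:

```kotlin
// Sketch: compress older turns into a summary, keep recent turns verbatim.
// `summarize` is a stand-in for an extra inference call into the local model.
fun compactHistory(
    systemPrompt: String,
    turns: List<String>,
    keepRecent: Int,
    summarize: (List<String>) -> String
): Pair<String, List<String>> {
    if (turns.size <= keepRecent) return systemPrompt to turns
    val older = turns.dropLast(keepRecent)
    val recent = turns.takeLast(keepRecent)
    // Fold the summary into the system prompt so it survives the sliding window
    val augmented = "$systemPrompt\n\nConversation so far: ${summarize(older)}"
    return augmented to recent
}
```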
Conversation Persistence
Store conversations in local storage (Core Data on iOS, Room on Android) so users can resume where they left off:
- Save each message (role, content, timestamp)
- Save conversation metadata (title, created date, last active)
- Limit stored conversations to prevent unbounded storage growth
- Allow users to delete conversations
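A plain-Kotlin sketch of this persistence model, with a pruning rule for the storage cap. In a real app these would be Room entities on Android or Core Data objects on iOS; the field names here are illustrative:

```kotlin
// Illustrative persistence model; in production these map to Room / Core Data.
data class StoredMessage(val role: String, val content: String, val timestampMs: Long)

data class Conversation(
    val id: String,
    val title: String,           // e.g. derived from the first user message
    val createdMs: Long,
    val lastActiveMs: Long,
    val messages: List<StoredMessage> = emptyList()
)

// Cap stored conversations, dropping the least recently active first.
fun prune(conversations: List<Conversation>, maxStored: Int): List<Conversation> =
    conversations.sortedByDescending { it.lastActiveMs }.take(maxStored)
```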
Streaming UI
Chat interfaces should display tokens as they are generated. This creates the perception of a fast, responsive assistant.
iOS (SwiftUI)
struct ChatView: View {
    @StateObject var viewModel = ChatViewModel()

    var body: some View {
        VStack {
            ScrollView {
                ForEach(viewModel.messages) { message in
                    MessageBubble(message: message)
                }
                if !viewModel.streamingText.isEmpty {
                    StreamingBubble(text: viewModel.streamingText)
                }
            }
            .onChange(of: viewModel.streamingText) { _ in
                // Auto-scroll to bottom
            }
            HStack {
                TextField("Message", text: $viewModel.input)
                Button("Send") { viewModel.send() }
            }
        }
    }
}
Android (Compose)
@Composable
fun ChatScreen(viewModel: ChatViewModel) {
    val messages by viewModel.messages.collectAsState()
    val streaming by viewModel.streamingText.collectAsState()
    var input by remember { mutableStateOf("") }

    Column {
        LazyColumn(modifier = Modifier.weight(1f)) {
            items(messages) { message ->
                MessageBubble(message)
            }
            if (streaming.isNotEmpty()) {
                item { StreamingBubble(streaming) }
            }
        }
        Row {
            TextField(value = input, onValueChange = { input = it })
            Button(onClick = { viewModel.send(input); input = "" }) { Text("Send") }
        }
    }
}
Token Display Cadence
llama.cpp generates tokens one at a time. Displaying each token individually can cause visual jitter. Buffer 2-3 tokens before updating the UI for smoother text appearance:
private var tokenBuffer = StringBuilder()
private var bufferCount = 0

fun onToken(token: String) {
    tokenBuffer.append(token)
    bufferCount++
    if (bufferCount >= 3 || token.contains("\n")) {
        flush()
    }
}

// Call when generation completes so trailing buffered tokens are not dropped
fun onGenerationComplete() {
    flush()
}

private fun flush() {
    if (tokenBuffer.isEmpty()) return
    updateUI(tokenBuffer.toString()) // Appends the buffered text to the streaming bubble
    tokenBuffer.clear()
    bufferCount = 0
}
Model Lifecycle Management
Loading and Unloading
Model loading takes 1-3 seconds depending on model size and device. Unloading is instant. Manage the lifecycle to balance responsiveness with memory:
- Load on first AI interaction: Do not load at app launch. Load when the user opens the chat feature.
- Keep loaded during active session: While the user is in the chat, keep the model in memory.
- Unload on navigate away: When the user leaves the chat screen, unload the model to free RAM.
- Handle memory warnings: Register for system memory warnings and unload the model if triggered.
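These rules can be sketched as a small lifecycle manager. `ModelRuntime` here is an assumed wrapper around your llama.cpp bindings, not a real API:

```kotlin
// `ModelRuntime` is a hypothetical wrapper over llama.cpp bindings.
interface ModelRuntime {
    fun load(path: String)
    fun unload()
}

class ModelLifecycle(private val runtime: ModelRuntime, private val modelPath: String) {
    var isLoaded = false
        private set

    // Called on the first AI interaction, not at app launch.
    fun ensureLoaded() {
        if (!isLoaded) { runtime.load(modelPath); isLoaded = true }
    }

    // Called when the user leaves the chat screen or on a system memory warning.
    fun release() {
        if (isLoaded) { runtime.unload(); isLoaded = false }
    }
}
```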
Loading Indicator
Show a brief loading state (1-3 seconds) when the model first loads. After that, responses start generating immediately. Users are accustomed to brief loading states for new features.
Fine-Tuning for Your Assistant
The quality gap between a base model and a fine-tuned model is dramatic for chat:
| Metric | Base 3B | Fine-Tuned 3B |
|---|---|---|
| On-topic responses | 60-70% | 92-96% |
| Format adherence | 55-65% | 94-98% |
| Domain accuracy | 50-60% | 88-94% |
| Tone consistency | 40-50% | 90-95% |
Training Data for Chat
Create training examples that cover:
- Common questions: The 50-100 questions users ask most frequently
- Edge cases: Questions that are out of scope, with graceful redirect responses
- Multi-turn patterns: Conversations that span 3-5 turns showing natural follow-up
- Style examples: Responses in your brand voice with your preferred formatting
500-2,000 training conversations produce a high-quality chat assistant. Platforms like Ertas handle the training pipeline: upload conversation examples, fine-tune with LoRA, export GGUF.
Production Considerations
Response Quality Guardrails
On-device models can generate off-topic or incorrect responses. Implement lightweight guardrails:
- Input validation: Check for extremely long inputs or obvious non-text content
- Output monitoring: Log response topics (locally) to identify quality issues
- Feedback mechanism: Let users flag bad responses. Use this feedback to improve training data
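A minimal input-validation sketch along these lines; the length cap and noise threshold are illustrative numbers, not recommendations from any library:

```kotlin
// Minimal input guardrail sketch; thresholds are illustrative.
fun validateInput(text: String, maxChars: Int = 2000): Boolean {
    if (text.isBlank() || text.length > maxChars) return false
    // Reject inputs that are mostly non-text noise (e.g. pasted binary data)
    val textLike = text.count { it.isLetterOrDigit() || it.isWhitespace() }
    return textLike.toDouble() / text.length > 0.5
}
```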
Performance Monitoring
Track on-device metrics:
- Time to first token (should be under 300ms)
- Tokens per second (should stay above 10)
- Model load time
- Memory usage during inference
- Crash rate during AI interactions
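The first two metrics can be captured with a small helper around each generation. This is a sketch; timestamps are passed in by the caller so the logic stays testable:

```kotlin
// Tracks time-to-first-token and tokens/sec for one generation.
class GenerationMetrics(private val startMs: Long) {
    var firstTokenMs: Long? = null   // target: under 300ms
        private set
    private var tokenCount = 0

    fun onToken(nowMs: Long) {
        if (firstTokenMs == null) firstTokenMs = nowMs - startMs
        tokenCount++
    }

    // Target: above 10 tokens/sec
    fun tokensPerSecond(nowMs: Long): Double {
        val elapsed = (nowMs - startMs).coerceAtLeast(1L)
        return tokenCount * 1000.0 / elapsed
    }
}
```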
Model Updates
Push model improvements via your normal model delivery pipeline:
- Check for updates on app launch (when connected)
- Download in background
- Swap the model file on next chat session start
- Keep the old model as fallback until the new one is validated
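The swap-with-fallback step might look like the following sketch. The `validate` callback is a hypothetical check (e.g. load the candidate and run a test prompt); it is not part of any real delivery API:

```kotlin
import java.io.File

// Swap to a downloaded candidate model, keeping the old file as fallback
// until the candidate passes validation. `validate` is a hypothetical check.
fun swapModel(current: File, candidate: File, validate: (File) -> Boolean): File {
    if (!candidate.exists()) return current
    return if (validate(candidate)) {
        current.delete()   // Candidate validated: old model no longer needed
        candidate
    } else {
        candidate.delete() // Validation failed: discard candidate, keep fallback
        current
    }
}
```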
The result is a chat assistant that works everywhere, responds instantly, costs nothing per conversation, and keeps user data completely private.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.