
    Building an On-Device AI Assistant for Your Mobile App

    Architecture patterns for building a conversational AI assistant that runs entirely on the user's device. Model selection, conversation management, UI patterns, and production considerations.

Ertas Team

    An on-device AI assistant is a conversational interface powered by a language model running locally on the user's phone. No cloud API. No network dependency. Instant responses. Complete privacy.

    This guide covers the architecture from model selection through production deployment.

    Architecture Overview

    The on-device assistant has four layers:

    1. Model layer: llama.cpp loads and runs the GGUF model
    2. Conversation layer: Manages chat history, system prompts, and context windows
    3. Interface layer: Chat UI with streaming token display
    4. State layer: Persists conversations, manages model lifecycle
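
    As a rough sketch, layers 1, 2, and 4 might map to Swift protocols like these (all names are illustrative, not a prescribed API; Turn and StoredConversation are defined in later sections, and the interface layer is the chat UI shown below):

    import Foundation

    protocol ModelLayer {
        func load(from url: URL) throws
        func unload()
        // Streams generated tokens to the callback as they arrive
        func generate(prompt: String, onToken: @escaping (String) -> Void)
    }

    protocol ConversationLayer {
        func buildPrompt(systemPrompt: String, turns: [Turn], maxTokens: Int) -> String
    }

    protocol StateLayer {
        func save(_ conversation: StoredConversation)
        func loadAll() -> [StoredConversation]
    }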

    Model Selection for Chat

    Conversational AI benefits from larger models. A 3B fine-tuned model is the recommended starting point:

    Model Size | Chat Quality | Multi-Turn Coherence | Recommended For
    1B         | Adequate     | 2-3 turns            | Simple Q&A, FAQ bots
    3B         | Good         | 5-8 turns            | Full chat assistants

    Fine-tuning is essential for chat. A base 3B model will generate generic responses. A fine-tuned 3B model speaks in your brand voice, knows your product, and handles your specific use cases.

    Conversation Management

    The Prompt Template

    Every model family has a specific chat template. For Llama 3.2:

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    
    You are the RecipeHelper assistant. Help users find and modify recipes. Always include prep and cook times.<|eot_id|><|start_header_id|>user<|end_header_id|>
    
    Quick dinner for two?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    

    When fine-tuning, the training framework handles template formatting automatically. At inference time, your app must format the conversation into this template before sending to llama.cpp.
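
    As a sketch of that formatting step for Llama 3.2 (Turn is a simple role/content pair; these helper names match the sliding-window example below):

    struct Turn {
        let role: String     // "user" or "assistant"
        let content: String
    }

    func formatSystem(_ systemPrompt: String) -> String {
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n\(systemPrompt)<|eot_id|>"
    }

    func formatTurn(_ turn: Turn) -> String {
        "<|start_header_id|>\(turn.role)<|end_header_id|>\n\n\(turn.content)<|eot_id|>"
    }

    // The prompt must end with the assistant header to cue generation
    func assistantCue() -> String {
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    }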

    Context Window Management

    Mobile models typically use 2048-4096 token context windows (configurable, but larger windows use more memory and slow down inference). A conversation can exceed this quickly.

    Sliding window approach: Keep the system prompt and the most recent N turns. Drop the oldest turns when the window fills:

    func buildPrompt(systemPrompt: String, turns: [Turn], maxTokens: Int) -> String {
        var prompt = formatSystem(systemPrompt)
        // Token counts must come from the model's own tokenizer (llama.cpp exposes one)
        var tokenCount = countTokens(prompt)

        // Add turns from newest to oldest, stop when window is full
        var includedTurns: [Turn] = []
        for turn in turns.reversed() {
            let turnTokens = countTokens(formatTurn(turn))
            if tokenCount + turnTokens > maxTokens - 512 { break } // Reserve 512 tokens for the response
            includedTurns.insert(turn, at: 0)
            tokenCount += turnTokens
        }

        // includedTurns is already in chronological order
        for turn in includedTurns {
            prompt += formatTurn(turn)
        }

        // End with the assistant header so the model generates a reply
        return prompt + assistantCue()
    }
    

    Summary approach: When the conversation exceeds the window, summarize older turns into a compact context and prepend it to the system prompt. This preserves key information while staying within the token budget. However, this requires an additional inference call.
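
    A minimal sketch of that extra call, assuming a hypothetical blocking complete(prompt:) method on your model wrapper; the returned summary is prepended to the system prompt on later buildPrompt calls:

    func summarizeDroppedTurns(_ dropped: [Turn], model: LlamaModel) -> String {
        let transcript = dropped
            .map { "\($0.role): \($0.content)" }
            .joined(separator: "\n")
        let prompt = "Summarize the key facts, preferences, and decisions in this conversation in three sentences:\n\n" + transcript
        // One extra inference call; run it off the main thread
        return model.complete(prompt: prompt) // hypothetical API
    }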

    Conversation Persistence

    Store conversations in local storage (Core Data on iOS, Room on Android) so users can resume where they left off:

    • Save each message (role, content, timestamp)
    • Save conversation metadata (title, created date, last active)
    • Limit stored conversations to prevent unbounded storage growth
    • Allow users to delete conversations
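
    A minimal schema sketch as plain Codable types; in practice, map these onto Core Data entities or Room tables:

    import Foundation

    struct StoredMessage: Codable {
        let role: String      // "user" or "assistant"
        let content: String
        let timestamp: Date
    }

    struct StoredConversation: Codable, Identifiable {
        let id: UUID
        var title: String
        let createdAt: Date
        var lastActiveAt: Date
        var messages: [StoredMessage]
    }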

    Streaming UI

Chat interfaces should display tokens as the model generates them. This creates the perception of a fast, responsive assistant.

    iOS (SwiftUI)

    struct ChatView: View {
        @StateObject var viewModel = ChatViewModel()

        var body: some View {
            VStack {
                ScrollView {
                    ForEach(viewModel.messages) { message in
                        MessageBubble(message: message)
                    }
                    if !viewModel.streamingText.isEmpty {
                        StreamingBubble(text: viewModel.streamingText)
                    }
                }
                .onChange(of: viewModel.streamingText) { _ in
                    // Auto-scroll to bottom as tokens arrive
                }

                HStack {
                    TextField("Message", text: $viewModel.input)
                    Button("Send") { viewModel.send() }
                }
            }
        }
    }
    

    Android (Compose)

    @Composable
    fun ChatScreen(viewModel: ChatViewModel) {
        val messages by viewModel.messages.collectAsState()
        val streaming by viewModel.streamingText.collectAsState()
        var input by remember { mutableStateOf("") }

        Column {
            LazyColumn(modifier = Modifier.weight(1f)) {
                items(messages) { message ->
                    MessageBubble(message)
                }
                if (streaming.isNotEmpty()) {
                    item { StreamingBubble(streaming) }
                }
            }

            Row {
                TextField(value = input, onValueChange = { input = it })
                Button(onClick = {
                    viewModel.send(input)
                    input = ""
                }) { Text("Send") }
            }
        }
    }
    

    Token Display Cadence

    llama.cpp generates tokens one at a time. Displaying each token individually can cause visual jitter. Buffer 2-3 tokens before updating the UI for smoother text appearance:

    private val tokenBuffer = StringBuilder()
    private var bufferCount = 0

    fun onToken(token: String) {
        tokenBuffer.append(token)
        bufferCount++
        // Flush every 3 tokens, or immediately on a line break
        if (bufferCount >= 3 || token.contains("\n")) {
            updateUI(tokenBuffer.toString()) // appends the buffered chunk to the visible message
            tokenBuffer.clear()
            bufferCount = 0
        }
    }

    fun onGenerationComplete() {
        // Flush any tokens still in the buffer so the tail of the response is not lost
        if (tokenBuffer.isNotEmpty()) {
            updateUI(tokenBuffer.toString())
            tokenBuffer.clear()
            bufferCount = 0
        }
    }
    

    Model Lifecycle Management

    Loading and Unloading

    Model loading takes 1-3 seconds depending on model size and device. Unloading is instant. Manage the lifecycle to balance responsiveness with memory:

    • Load on first AI interaction: Do not load at app launch. Load when the user opens the chat feature.
    • Keep loaded during active session: While the user is in the chat, keep the model in memory.
    • Unload on navigate away: When the user leaves the chat screen, unload the model to free RAM.
    • Handle memory warnings: Register for system memory warnings and unload the model if triggered.
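
    A sketch of that lifecycle on iOS, assuming a hypothetical LlamaModel wrapper around llama.cpp:

    import UIKit

    final class ModelLifecycle {
        private var model: LlamaModel?           // hypothetical llama.cpp wrapper
        private var memoryObserver: NSObjectProtocol?
        private let modelURL: URL

        init(modelURL: URL) {
            self.modelURL = modelURL
            // Unload immediately if the system signals memory pressure
            memoryObserver = NotificationCenter.default.addObserver(
                forName: UIApplication.didReceiveMemoryWarningNotification,
                object: nil,
                queue: .main
            ) { [weak self] _ in self?.model = nil }
        }

        // Load lazily on first AI interaction, not at app launch
        func activeModel() throws -> LlamaModel {
            if let model { return model }
            let loaded = try LlamaModel(contentsOf: modelURL)  // hypothetical initializer
            model = loaded
            return loaded
        }

        // Call when the user leaves the chat screen to free RAM
        func chatDidClose() {
            model = nil
        }
    }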

    Loading Indicator

    Show a brief loading state (1-3 seconds) when the model first loads. After that, responses start generating immediately. Users are accustomed to brief loading states for new features.

    Fine-Tuning for Your Assistant

    The quality gap between a base model and a fine-tuned model is dramatic for chat:

    Metric             | Base 3B | Fine-Tuned 3B
    On-topic responses | 60-70%  | 92-96%
    Format adherence   | 55-65%  | 94-98%
    Domain accuracy    | 50-60%  | 88-94%
    Tone consistency   | 40-50%  | 90-95%

    Training Data for Chat

    Create training examples that cover:

    1. Common questions: The 50-100 questions users ask most frequently
    2. Edge cases: Questions that are out of scope, with graceful redirect responses
    3. Multi-turn patterns: Conversations that span 3-5 turns showing natural follow-up
    4. Style examples: Responses in your brand voice with your preferred formatting

    500-2,000 training conversations produce a high-quality chat assistant. Platforms like Ertas handle the training pipeline: upload conversation examples, fine-tune with LoRA, export GGUF.
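
    For illustration, one training conversation in the common chat-messages JSON format (the exact schema depends on your training framework; JSONL files put one conversation per line):

    {"messages": [
      {"role": "system", "content": "You are the RecipeHelper assistant. Help users find and modify recipes. Always include prep and cook times."},
      {"role": "user", "content": "Quick dinner for two?"},
      {"role": "assistant", "content": "Try garlic shrimp pasta. Prep: 10 minutes, cook: 10 minutes. Want the full recipe?"}
    ]}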

    Production Considerations

    Response Quality Guardrails

    On-device models can generate off-topic or incorrect responses. Implement lightweight guardrails:

    • Input validation: Check for extremely long inputs or obvious non-text content
    • Output monitoring: Log response topics (locally) to identify quality issues
    • Feedback mechanism: Let users flag bad responses. Use this feedback to improve training data
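
    A lightweight input check might look like this; the length cap and letter-ratio threshold are arbitrary, so tune them per app:

    import Foundation

    func isValidInput(_ text: String) -> Bool {
        let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
        guard !trimmed.isEmpty, trimmed.count <= 2_000 else { return false }

        // Heuristic: reject inputs that are mostly non-letter content
        // (pasted binary, emoji floods); loosen this for math-heavy domains
        let scalars = trimmed.unicodeScalars
        let letterCount = scalars.filter { CharacterSet.letters.contains($0) }.count
        return Double(letterCount) / Double(scalars.count) >= 0.3
    }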

    Performance Monitoring

    Track on-device metrics:

    • Time to first token (should be under 300ms)
    • Tokens per second (should stay above 10)
    • Model load time
    • Memory usage during inference
    • Crash rate during AI interactions
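
    A sketch of collecting the first two metrics from the streaming token callback:

    import Foundation

    final class InferenceMetrics {
        private var start = Date()
        private var firstTokenAt: Date?
        private var tokenCount = 0

        func generationStarted() {
            start = Date()
            firstTokenAt = nil
            tokenCount = 0
        }

        func tokenReceived() {
            if firstTokenAt == nil { firstTokenAt = Date() }
            tokenCount += 1
        }

        func generationFinished() {
            guard let firstTokenAt, tokenCount > 1 else { return }
            let ttft = firstTokenAt.timeIntervalSince(start)                      // target: < 300 ms
            let tps = Double(tokenCount) / Date().timeIntervalSince(firstTokenAt) // target: > 10
            // Log locally or feed into your analytics pipeline
            print("TTFT: \(Int(ttft * 1000)) ms, \(String(format: "%.1f", tps)) tok/s")
        }
    }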

    Model Updates

    Push model improvements via your normal model delivery pipeline:

    • Check for updates on app launch (when connected)
    • Download in background
    • Swap the model file on next chat session start
    • Keep the old model as fallback until the new one is validated
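
    A sketch of the swap-with-fallback step using FileManager; the validation logic itself is app-specific:

    import Foundation

    func activateDownloadedModel(downloaded: URL, active: URL, backup: URL) throws {
        let fm = FileManager.default
        // Keep the current model as a fallback until the new one is validated
        if fm.fileExists(atPath: backup.path) { try fm.removeItem(at: backup) }
        if fm.fileExists(atPath: active.path) { try fm.moveItem(at: active, to: backup) }
        try fm.moveItem(at: downloaded, to: active)
    }

    func rollBackModel(active: URL, backup: URL) throws {
        let fm = FileManager.default
        // Restore the previous model if the new one fails validation
        try? fm.removeItem(at: active)
        try fm.moveItem(at: backup, to: active)
    }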

    The result is a chat assistant that works everywhere, responds instantly, costs nothing per conversation, and keeps user data completely private.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
