
    Offline AI: Building Mobile Features That Work Without Internet

    How to build AI features that work without an internet connection. On-device models, offline-first architecture patterns, and the use cases where offline AI is not optional.

    Ertas Team

    Cloud AI has a hard dependency: the internet. When the connection drops, the feature breaks. For many mobile use cases, this is not just an inconvenience. It is a deal-breaker.

    A travel translation app that fails without roaming data. A field service assistant that stops working at a construction site with spotty coverage. A health app that cannot process data in a remote clinic. A note-taking app that loses its AI features on a plane.

    Offline AI is not a nice-to-have for these apps. It is a core requirement. And with on-device inference, it is entirely achievable.

    Where Offline AI Is Essential

    Travel and International

    Users traveling internationally often avoid roaming data charges. Airport WiFi is unreliable. Translation, navigation assistance, and travel planning need to work without a connection. A language app that only translates when connected to WiFi is useless at the exact moment users need it most.

    Field Work

    Construction sites, agricultural fields, mining operations, remote inspections. These environments frequently have poor or no cellular coverage. Workers using mobile apps for documentation, measurement, quality inspection, or reporting need AI features that work on-site.

    Healthcare

    Patient data should not leave the device for privacy and regulatory reasons (HIPAA, GDPR). Clinical decision support, note transcription, medical coding: these features must work in clinic environments where WiFi may be restricted or unreliable.

    Developing Markets

    Over 3 billion people worldwide have intermittent internet access. Apps targeting these markets cannot assume always-on connectivity. AI features that work offline have a dramatically larger addressable market.

    Everyday Interruptions

    Subways, elevators, airplanes, basements, rural areas. Even in developed markets, connectivity gaps are constant. An AI feature that shows a loading spinner or error in these moments trains users to not rely on it.

    The Architecture: Offline-First with On-Device Inference

    The key insight is architectural. Instead of building cloud-first with an offline fallback (which is fragile), build offline-first with optional cloud augmentation.

    Model Delivery

    The GGUF model file needs to be on the device before AI features work. Two approaches:

    Bundle with app: Include the model in the app binary or as an app asset. The user downloads it with the app. Simplest approach for models under 200MB (fits within typical cellular download limits). For larger models, use iOS On-Demand Resources or Play Asset Delivery on Android.

    Post-install download: App installs without the model. On first launch, detect the missing model and prompt the user to download it over WiFi. This keeps the initial app download small while still delivering the model before it is needed.

    // iOS: Check for model and prompt download
    func ensureModelAvailable() -> Bool {
        let modelPath = FileManager.default
            .urls(for: .documentDirectory, in: .userDomainMask)[0]
            .appendingPathComponent("model.gguf")
    
        if FileManager.default.fileExists(atPath: modelPath.path) {
            return true
        }
    
        // Show download UI
        showModelDownloadPrompt()
        return false
    }
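
    For the post-install path, a minimal download sketch using URLSession. The model URL would come from your own CDN or manifest; `modelDestination` and the endpoint are illustrative assumptions, not a fixed API:

```swift
import Foundation

/// Destination for the model file in the app's Documents directory.
func modelDestination() -> URL {
    FileManager.default
        .urls(for: .documentDirectory, in: .userDomainMask)[0]
        .appendingPathComponent("model.gguf")
}

/// Downloads the GGUF model and moves it into place.
func downloadModel(from url: URL,
                   completion: @escaping (Result<URL, Error>) -> Void) {
    let task = URLSession.shared.downloadTask(with: url) { tempURL, _, error in
        if let error = error { return completion(.failure(error)) }
        guard let tempURL = tempURL else { return }
        do {
            let dest = modelDestination()
            // Clear any stale or partial copy from a previous attempt
            try? FileManager.default.removeItem(at: dest)
            try FileManager.default.moveItem(at: tempURL, to: dest)
            completion(.success(dest))
        } catch {
            completion(.failure(error))
        }
    }
    task.resume()
}
```

    In production you would use a background URLSessionConfiguration so the download survives app suspension, and verify the checksum before marking the model ready.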
    

    Local Inference

    Once the model is on-device, inference is entirely local. llama.cpp handles the computation using the device's CPU and GPU (Metal on iOS, Vulkan on Android). No network call required.

    The inference pipeline:

    1. User provides input (text, voice transcription, etc.)
    2. Input is tokenized locally
    3. Model generates response token by token
    4. Tokens are decoded and displayed in the UI

    Every step happens on the device. The internet is not involved.
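
    In code, the loop looks roughly like this. InferenceContext is a hypothetical protocol standing in for a llama.cpp binding; the real C API (llama_tokenize, llama_decode, and friends) has a different surface, but the shape of the pipeline is the same:

```swift
/// Hypothetical sketch of the calls a llama.cpp binding exposes.
/// Names are illustrative, not the actual C API.
protocol InferenceContext {
    func tokenize(_ text: String) -> [Int32]
    func eval(_ tokens: [Int32])
    func sampleNext() -> Int32?            // nil ends generation
    func isEndOfSequence(_ token: Int32) -> Bool
    func decode(_ token: Int32) -> String
}

func generate(context: InferenceContext, prompt: String,
              onToken: (String) -> Void) {
    let tokens = context.tokenize(prompt)  // steps 1-2: tokenize locally
    context.eval(tokens)                   // prefill the prompt
    while let t = context.sampleNext(), !context.isEndOfSequence(t) {
        onToken(context.decode(t))         // steps 3-4: stream to the UI
    }
}
```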

    Sync-When-Available

    For features that benefit from cloud connectivity (analytics, model updates, user data sync), use a queue-and-sync pattern:

    • Queue analytics events locally (SQLite, Core Data, Room)
    • When connectivity returns, sync the queue to your backend
    • Check for model updates on app launch when connected
    • Never block the user experience on network availability
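
    The pattern can be sketched as a small queue that only drains when a send succeeds. This in-memory version is a simplification; a real app would persist the queue (SQLite, Core Data) and trigger flush from a connectivity callback such as NWPathMonitor:

```swift
/// Minimal queue-and-sync sketch. Events accumulate locally and are
/// only cleared once a batch upload reports success.
final class EventQueue {
    private var pending: [String] = []

    func enqueue(_ event: String) {
        pending.append(event)   // never blocks on the network
    }

    /// `send` posts a batch to your backend and returns true on success.
    func flush(send: ([String]) -> Bool) {
        guard !pending.isEmpty else { return }
        if send(pending) {
            pending.removeAll()
        } // on failure, events stay queued for the next attempt
    }

    var count: Int { pending.count }
}
```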

    What Works Offline

    Feature                 | Offline Feasible? | Model Size Needed | Notes
    Text classification     | Yes               | 1B (~600MB)       | Fast, small model sufficient
    Chat assistant          | Yes               | 3B (~1.7GB)       | Quality generation needs 3B
    Summarization           | Yes               | 3B (~1.7GB)       | Good quality at 3B
    Translation             | Yes               | 1-3B              | Depends on language pair
    Content drafting        | Yes               | 3B (~1.7GB)       | Email replies, notes, messages
    Autocomplete            | Yes               | 1B (~600MB)       | Speed matters, 1B is fast
    Sentiment analysis      | Yes               | 1B (~600MB)       | Simple classification task
    Named entity extraction | Yes               | 1-3B              | Structured output task
    Web search              | No                | N/A               | Requires live internet data
    Real-time data retrieval| No                | N/A               | Requires server connection
    Image generation        | Partially         | 1-2GB+            | Possible but memory-intensive

    The pattern: any task that processes the user's input and generates output from the model's trained knowledge works offline. Tasks that require external, real-time data do not.

    Model Management for Offline-First

    Version Checking

    // Android: Check model version on app launch
    suspend fun checkModelUpdate() {
        if (!isNetworkAvailable()) return
    
        val manifest = fetchModelManifest() // JSON from your CDN
        val currentVersion = prefs.getString("model_version", "")
    
        if (manifest.version != currentVersion) {
            // Queue background download
            scheduleModelDownload(manifest.url, manifest.checksum)
        }
    }
    

    Storage Management

    GGUF models are 600MB-1.7GB. Storage management matters:

    • Check available space before download
    • Show the model's size to the user before they download
    • Allow users to delete and re-download the model
    • On iOS, mark the model file so it is not backed up to iCloud (it can be re-downloaded)
    • Handle low storage warnings by offering to remove the model
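
    The space check and the iCloud backup exclusion both use standard Foundation resource values on iOS:

```swift
import Foundation

/// Free bytes the system will allow for an important user-initiated
/// download, or nil if the value is unavailable.
func availableBytes(at url: URL) -> Int64? {
    let values = try? url.resourceValues(
        forKeys: [.volumeAvailableCapacityForImportantUsageKey])
    return values?.volumeAvailableCapacityForImportantUsage
}

/// Excludes the model file from iCloud backup; it can be re-downloaded.
func excludeFromBackup(_ fileURL: URL) throws {
    var url = fileURL
    var values = URLResourceValues()
    values.isExcludedFromBackup = true
    try url.setResourceValues(values)
}
```

    Call availableBytes before scheduling a download and compare against the model size from your manifest, leaving headroom for the rest of the app.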

    Integrity Verification

    Always verify the model file after download:

    // SHA256 verification
    import Foundation
    import CryptoKit
    
    func verifyModel(at path: URL, expectedHash: String) -> Bool {
        // Memory-map the file instead of copying up to 1.7GB into RAM
        guard let data = try? Data(contentsOf: path, options: .mappedIfSafe) else { return false }
        let hash = SHA256.hash(data: data)
        let hashString = hash.map { String(format: "%02x", $0) }.joined()
        return hashString == expectedHash
    }
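
    An alternative that bounds memory explicitly is CryptoKit's incremental API, hashing the file in fixed-size chunks:

```swift
import Foundation
import CryptoKit

/// Streams the file through SHA256 in 1MB chunks so a multi-gigabyte
/// model never has to fit in memory at once.
func streamingSHA256(of url: URL) throws -> String {
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }
    var hasher = SHA256()
    while let chunk = try handle.read(upToCount: 1 << 20), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    return hasher.finalize().map { String(format: "%02x", $0) }.joined()
}
```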
    

    UX Patterns for Offline AI

    Model Download State

    • Show a clear progress indicator during download
    • Allow pausing and resuming
    • Resume automatically on network reconnection
    • Show the download size before starting

    Feature Availability

    • When the model is not yet downloaded, show a clear message explaining why the AI feature is unavailable and how to enable it
    • When the model is loaded and ready, show no special indicator. Offline AI should feel native.
    • Never show "offline mode" badges or degraded UI. The on-device model IS the feature.

    Graceful Fallback

    For hybrid apps that use both on-device and cloud AI:

    • Default to on-device for all supported tasks
    • If connected and the task exceeds on-device capability, optionally route to cloud
    • Never fail a user request because the cloud is unavailable if on-device can handle it
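
    The routing logic is a few lines. The three closures here are placeholders for your own capability check and inference paths, not a prescribed API:

```swift
/// Hybrid routing sketch: default to on-device, use the cloud only as
/// augmentation, and never hard-fail just because the network is gone.
func route(task: String,
           isConnected: Bool,
           onDeviceCanHandle: (String) -> Bool,
           runOnDevice: (String) -> String,
           runInCloud: (String) -> String) -> String {
    if onDeviceCanHandle(task) {
        return runOnDevice(task)   // works offline, no round trip
    }
    if isConnected {
        return runInCloud(task)    // optional augmentation
    }
    // Cloud unavailable and task exceeds on-device capability:
    // return a best-effort local result rather than an error.
    return runOnDevice(task)
}
```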

    Building the Offline-First Pipeline

    The path to offline AI:

    1. Choose your model: 1B for simple tasks (classification, autocomplete), 3B for generation and chat
    2. Fine-tune on your domain: Use your app's data to train a model that excels at your specific tasks. Platforms like Ertas provide a visual fine-tuning pipeline that exports GGUF files ready for mobile deployment.
    3. Integrate llama.cpp: Add the inference library to your iOS or Android project
    4. Implement model delivery: Bundle or post-install download
    5. Build offline-first: Design every AI interaction to work without network, add cloud augmentation only where essential

    The result is an AI feature that works everywhere your users go. Subway. Airplane. Construction site. Rural clinic. No loading spinners. No error messages. Just instant, private, reliable AI.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
