    On-Device Semantic Search: AI-Powered Search Without a Server
    semantic search · on-device AI · embeddings · mobile AI · implementation

    How to build semantic search that runs entirely on the user's phone. Local embeddings, vector similarity, and natural language queries across user content without a server or API.

    Ertas Team

    Keyword search fails when users do not know the exact words. "That email about the budget meeting last week" does not match an email with subject "Q3 Financial Review." Semantic search understands meaning, not just keywords.

    The standard approach puts semantic search on a server with a vector database. But for mobile apps where user content is local (notes, messages, photos, documents), sending that content to a server defeats the purpose of keeping it on-device.

    On-device semantic search keeps everything local. The embeddings model runs on the phone. The vector index lives in local storage. The search query never leaves the device.

    How Semantic Search Works

    1. Indexing: Each piece of content is converted to an embedding vector (a list of numbers that represents its meaning) using a small model
    2. Storage: The embedding vectors are stored alongside the content in a local database
    3. Querying: The user's search query is converted to an embedding vector using the same model
    4. Matching: The query vector is compared against all stored vectors using cosine similarity
    5. Ranking: Results are returned ranked by similarity score

    The magic is in the embeddings. Two pieces of text about the same topic produce similar vectors, even if they share no keywords.

    The Embedding Model

    On-device embedding models are small and fast. Unlike generative LLMs (600MB-1.7GB), embedding models are typically 20-80MB:

    Model                 | Size | Dimensions | Speed (iPhone 15)
    all-MiniLM-L6-v2      | 23MB | 384        | 500+ embeddings/sec
    nomic-embed-text-v1.5 | 55MB | 768        | 200+ embeddings/sec
    bge-small-en-v1.5     | 33MB | 384        | 400+ embeddings/sec

    At 200-500 embeddings per second, indexing 1,000 notes takes 2-5 seconds. Query embedding is near-instant (under 5ms).

    Running the Embedding Model

    You can run embedding models via:

    ONNX Runtime Mobile: Supports embedding models in ONNX format. Available for iOS (via Swift) and Android (via Kotlin). The most mature option for mobile embedding inference.

    // iOS with ONNX Runtime (simplified; most models also expect an attention_mask input)
    let session = try ORTSession(env: env, modelPath: embeddingModelPath, sessionOptions: nil)
    let inputTensor = try ORTValue(tensorData: tokenizedInput, elementType: .int64, shape: shape)
    let outputs = try session.run(withInputs: ["input_ids": inputTensor],
                                  outputNames: ["embeddings"],
                                  runOptions: nil)
    let embedding = try outputs["embeddings"]!.tensorData()
    
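    Note that most sentence-embedding models output one vector per input token, not a single sentence vector. The usual recipe is to mean-pool the token vectors (skipping padding) and L2-normalize the result. A minimal Kotlin sketch, with illustrative names:

```kotlin
import kotlin.math.sqrt

// Mean-pool per-token vectors into one sentence embedding, then L2-normalize.
// tokenVectors: one FloatArray per token; mask: 1f for real tokens, 0f for padding.
fun meanPool(tokenVectors: List<FloatArray>, mask: FloatArray): FloatArray {
    val dim = tokenVectors.first().size
    val pooled = FloatArray(dim)
    var count = 0f
    tokenVectors.forEachIndexed { i, vec ->
        if (mask[i] > 0f) {
            count += 1f
            for (d in 0 until dim) pooled[d] += vec[d]
        }
    }
    for (d in 0 until dim) pooled[d] /= count
    // L2-normalize so cosine similarity later reduces to a plain dot product
    val norm = sqrt(pooled.sumOf { (it * it).toDouble() }).toFloat()
    for (d in 0 until dim) pooled[d] /= norm
    return pooled
}
```

    Normalizing at indexing time is a small optimization that pays off at query time, since every comparison becomes a dot product.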

    llama.cpp embedding mode: llama.cpp can generate embeddings from GGUF models using the --embedding flag. This lets you use the same inference engine for both generation and embedding.

    Vector Storage

    SQLite with Custom Extension

    The simplest approach for mobile: store vectors as BLOBs in SQLite and compute similarity in application code.

    // Android: Store embedding
    fun storeEmbedding(db: SQLiteDatabase, contentId: Long, embedding: FloatArray) {
        val blob = ByteBuffer.allocate(embedding.size * 4)
        embedding.forEach { blob.putFloat(it) }
        db.execSQL(
            "INSERT INTO embeddings (content_id, vector) VALUES (?, ?)",
            arrayOf(contentId, blob.array())
        )
    }
    
    // Search by similarity (linear scan over all stored vectors)
    fun search(db: SQLiteDatabase, queryEmbedding: FloatArray, limit: Int): List<SearchResult> {
        val results = mutableListOf<SearchResult>()
        db.rawQuery("SELECT content_id, vector FROM embeddings", null).use { cursor ->
            while (cursor.moveToNext()) {
                val blob = cursor.getBlob(1)
                val stored = FloatArray(blob.size / 4)
                ByteBuffer.wrap(blob).asFloatBuffer().get(stored)
    
                val similarity = cosineSimilarity(queryEmbedding, stored)
                results.add(SearchResult(cursor.getLong(0), similarity))
            }
        }
        return results.sortedByDescending { it.similarity }.take(limit)
    }
    

    This is simple and works for collections up to ~10,000 items. Beyond that, the linear scan becomes slow.
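    The cosineSimilarity helper the scan relies on is only a few lines. A minimal sketch; if vectors are L2-normalized when stored, the same comparison reduces to a plain dot product:

```kotlin
import kotlin.math.sqrt

// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1].
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Vector dimensions must match" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}
```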

    SQLite with Vector Extension

    For larger collections, use a SQLite vector extension that supports approximate nearest neighbor (ANN) search:

    • sqlite-vss: SQLite extension using Faiss for vector search. Supports iOS and Android.
    • sqlite-vec: Lightweight vector search extension designed for embedded use.

    These extensions create an index over the vectors, enabling sub-millisecond search over hundreds of thousands of items.

    The Full Pipeline

    Step 1: Index Content

    When the user creates or modifies content (note, message, document), generate and store its embedding:

    func indexContent(_ content: Content) async {
        let embedding = await embeddingModel.encode(content.text)
        database.storeEmbedding(contentId: content.id, vector: embedding)
    }
    

    Run indexing in the background. Users should not wait for embeddings to be computed.
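    One simple way to keep embedding work off the UI thread is a single background worker that drains a queue of changed content. A sketch using java.util.concurrent; embed and store are stand-ins for the model call and the database write:

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Single background worker: content changes are queued and embedded off the UI thread.
class BackgroundIndexer(
    private val embed: (String) -> FloatArray,        // embedding model call
    private val store: (Long, FloatArray) -> Unit     // database write
) {
    private val worker = Executors.newSingleThreadExecutor()

    fun onContentChanged(contentId: Long, text: String) {
        worker.execute {
            val vector = embed(text)  // roughly 2-5ms per item at 200-500 embeddings/sec
            store(contentId, vector)
        }
    }

    fun shutdown() {
        worker.shutdown()
        worker.awaitTermination(5, TimeUnit.SECONDS)
    }
}
```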

    Step 2: Search

    When the user enters a search query:

    func search(query: String) async -> [Content] {
        let queryEmbedding = await embeddingModel.encode(query)
        let results = database.similaritySearch(queryEmbedding, limit: 10)
        return results.map { fetchContent($0.contentId) }
    }
    

    The search returns results ranked by semantic similarity. "Budget meeting notes" matches "Q3 Financial Review" because the embeddings capture the semantic relationship.

    Step 3: Hybrid Search

    Combine semantic search with keyword search for the best results:

    1. Run keyword search (SQLite FTS5) for exact matches
    2. Run semantic search for meaning-based matches
    3. Merge and deduplicate results
    4. Rank by combined score (keyword matches boosted)

    This handles both exact queries ("meeting with John") and fuzzy queries ("that email about the project timeline").
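    The merge-and-rank step can be sketched as a simple score combination. The 0.3 keyword boost is an illustrative starting value, not a tuned constant:

```kotlin
data class Ranked(val contentId: Long, val score: Float)

// Merge keyword and semantic hits; items found by both get a boosted combined score.
fun mergeResults(
    keywordHits: List<Ranked>,
    semanticHits: List<Ranked>,
    limit: Int
): List<Ranked> {
    val combined = HashMap<Long, Float>()
    semanticHits.forEach { combined[it.contentId] = it.score }
    keywordHits.forEach {
        // exact keyword matches are boosted on top of any semantic score
        combined[it.contentId] = (combined[it.contentId] ?: 0f) + it.score + 0.3f
    }
    return combined.entries
        .map { Ranked(it.key, it.value) }
        .sortedByDescending { it.score }
        .take(limit)
}
```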

    Performance Budget

    Component                       | Storage | RAM                       | Speed
    Embedding model                 | 23-55MB | 50-100MB during inference | 200-500 embeddings/sec
    Vector index (10K items, 384d)  | ~15MB   | ~15MB                     | Under 5ms per search
    Vector index (100K items, 384d) | ~150MB  | ~30MB (with ANN index)    | Under 10ms per search

    Total additional footprint for semantic search: 40-200MB storage, 65-130MB RAM during search. This is a fraction of what a generative LLM requires, making it practical even on constrained devices.

    Use Cases

    Note-Taking Apps

    Search across all notes by meaning. "Meeting notes from last week about the product launch" finds relevant notes regardless of exact wording.

    Email Clients

    Find emails by topic, not just sender or subject. "Conversation about the contract renewal" surfaces the right thread.

    Photo Apps

    Combine with image captioning (on-device) to enable text-based photo search. "Sunset at the beach" finds matching photos even without manual tags.

    Document Managers

    Search across PDFs, documents, and files by content and meaning.

    Combining with On-Device LLMs

    Semantic search pairs naturally with on-device generative models. Use the search results as context for the LLM:

    1. User asks a question
    2. Semantic search retrieves relevant content from their data
    3. The LLM generates an answer using the retrieved content as context

    This is on-device RAG. No server needed. The entire pipeline (embedding, search, generation) runs locally.
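    The three steps can be wired together in a few lines. A sketch in Kotlin; searchTopK and generate are stand-ins for the semantic search and local LLM components described above:

```kotlin
// On-device RAG sketch: retrieve with semantic search, then hand the hits
// to a local generative model as context.
fun answerQuestion(
    question: String,
    searchTopK: (String, Int) -> List<String>,  // semantic search over local content
    generate: (String) -> String                // on-device LLM
): String {
    val context = searchTopK(question, 5).joinToString("\n---\n")
    val prompt = "Answer using only the context below.\n\n" +
        "Context:\n$context\n\n" +
        "Question: $question"
    return generate(prompt)
}
```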

    For the generative component, fine-tune a model on your domain data using a platform like Ertas. The fine-tuned model combined with local semantic search creates a powerful, fully private AI assistant.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
