
On-Device Semantic Search: AI-Powered Search Without a Server
How to build semantic search that runs entirely on the user's phone. Local embeddings, vector similarity, and natural language queries across user content without a server or API.
Keyword search fails when users do not know the exact words. "That email about the budget meeting last week" does not match an email with subject "Q3 Financial Review." Semantic search understands meaning, not just keywords.
The standard approach puts semantic search on a server with a vector database. But for mobile apps where user content is local (notes, messages, photos, documents), sending that content to a server defeats the purpose of keeping it on-device.
On-device semantic search keeps everything local. The embeddings model runs on the phone. The vector index lives in local storage. The search query never leaves the device.
How Semantic Search Works
- Indexing: Each piece of content is converted to an embedding vector (a list of numbers that represents its meaning) using a small model
- Storage: The embedding vectors are stored alongside the content in a local database
- Querying: The user's search query is converted to an embedding vector using the same model
- Matching: The query vector is compared against all stored vectors using cosine similarity
- Ranking: Results are returned ranked by similarity score
The magic is in the embeddings. Two pieces of text about the same topic produce similar vectors, even if they share no keywords.
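In code, the matching step is only a few lines. A minimal Swift sketch of cosine similarity over raw embedding vectors, the same math every storage option below relies on:
// Cosine similarity: dot(a, b) / (|a| * |b|). 1.0 means same direction
// (same meaning); values near 0 mean unrelated. If your model emits
// normalized vectors, this reduces to a plain dot product.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embedding dimensions must match")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (normA.squareRoot() * normB.squareRoot())
}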
The Embedding Model
On-device embedding models are small and fast. Unlike generative LLMs (600MB-1.7GB), embedding models are typically 20-80MB:
| Model | Size | Dimensions | Speed (iPhone 15) |
|---|---|---|---|
| all-MiniLM-L6-v2 | 23MB | 384 | 500+ embeddings/sec |
| nomic-embed-text-v1.5 | 55MB | 768 | 200+ embeddings/sec |
| bge-small-en-v1.5 | 33MB | 384 | 400+ embeddings/sec |
At 200-500 embeddings per second, indexing 1,000 notes takes 2-5 seconds. Query embedding is near-instant (under 5ms).
Running the Embedding Model
You can run embedding models via:
ONNX Runtime Mobile: Supports embedding models in ONNX format. Available for iOS (via Swift) and Android (via Kotlin). The most mature option for mobile embedding inference.
// iOS with ONNX Runtime (input/output names vary by model export;
// BERT-style models usually also take an attention_mask input)
let env = try ORTEnv(loggingLevel: .warning)
let session = try ORTSession(env: env, modelPath: embeddingModelPath, sessionOptions: nil)
let inputTensor = try ORTValue(tensorData: tokenizedInput, elementType: .int64, shape: shape)
let outputs = try session.run(withInputs: ["input_ids": inputTensor],
                              outputNames: ["embeddings"],
                              runOptions: nil)
let embedding = try outputs["embeddings"]!.tensorData()
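One caveat: many sentence-embedding exports (including all-MiniLM-L6-v2) return one vector per token, and the sentence embedding is the mean over those vectors. A sketch of that pooling step, assuming a row-major [tokenCount x dims] float buffer and ignoring the attention mask for brevity:
// Mean pooling: average per-token vectors into one sentence vector.
// Production code should weight by the attention mask to skip padding.
func meanPool(tokenVectors: [Float], tokenCount: Int, dims: Int) -> [Float] {
    var pooled = [Float](repeating: 0, count: dims)
    for t in 0..<tokenCount {
        for d in 0..<dims {
            pooled[d] += tokenVectors[t * dims + d]
        }
    }
    return pooled.map { $0 / Float(tokenCount) }
}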
llama.cpp embedding mode: llama.cpp can generate embeddings from GGUF models using the embedding flag. This lets you use the same inference engine for both generation and embedding.
Vector Storage
SQLite with a Brute-Force Scan
The simplest approach for mobile: store vectors as BLOBs in SQLite and compute similarity in application code.
import java.nio.ByteBuffer

// Android: store an embedding as a BLOB (4 bytes per float)
fun storeEmbedding(db: SQLiteDatabase, contentId: Long, embedding: FloatArray) {
    val blob = ByteBuffer.allocate(embedding.size * 4)
    embedding.forEach { blob.putFloat(it) }
    db.execSQL(
        "INSERT INTO embeddings (content_id, vector) VALUES (?, ?)",
        arrayOf(contentId, blob.array())
    )
}
// Search by similarity: a linear scan over every stored vector
data class SearchResult(val contentId: Long, val similarity: Float)

fun search(db: SQLiteDatabase, queryEmbedding: FloatArray, limit: Int): List<SearchResult> {
    val results = mutableListOf<SearchResult>()
    db.rawQuery("SELECT content_id, vector FROM embeddings", null).use { cursor ->
        while (cursor.moveToNext()) {
            // Decode the BLOB back into floats (same byte order as storage)
            val blob = cursor.getBlob(1)
            val stored = FloatArray(blob.size / 4)
            ByteBuffer.wrap(blob).asFloatBuffer().get(stored)
            results.add(SearchResult(cursor.getLong(0), cosineSimilarity(queryEmbedding, stored)))
        }
    }
    return results.sortedByDescending { it.similarity }.take(limit)
}

// Cosine similarity, as described above
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (kotlin.math.sqrt(na) * kotlin.math.sqrt(nb))
}
This is simple and works for collections up to ~10,000 items. Beyond that, the linear scan becomes slow.
SQLite with Vector Extension
For larger collections, use a SQLite vector extension that supports approximate nearest neighbor (ANN) search:
- sqlite-vss: SQLite extension using Faiss for vector search. Supports iOS and Android.
- sqlite-vec: Lightweight vector search extension designed for embedded use.
These extensions create an index over the vectors, enabling sub-millisecond search over hundreds of thousands of items.
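As an illustration of what that looks like in practice, here is a nearest-neighbor query issued through the standard SQLite3 C API from Swift, assuming sqlite-vec is compiled into the app's SQLite build. The table name is hypothetical, and the exact query syntax should be verified against the extension's documentation:
import SQLite3

// Assumes: db is an OpaquePointer from sqlite3_open, and a virtual table
// created with: CREATE VIRTUAL TABLE vec_embeddings USING vec0(embedding float[384])
let SQLITE_TRANSIENT = unsafeBitCast(-1, to: sqlite3_destructor_type.self)
let sql = """
SELECT rowid, distance FROM vec_embeddings
WHERE embedding MATCH ? ORDER BY distance LIMIT 10
"""
var stmt: OpaquePointer?
if sqlite3_prepare_v2(db, sql, -1, &stmt, nil) == SQLITE_OK {
    // Bind the query embedding as a BLOB of 32-bit floats
    queryEmbedding.withUnsafeBytes { bytes in
        _ = sqlite3_bind_blob(stmt, 1, bytes.baseAddress, Int32(bytes.count), SQLITE_TRANSIENT)
    }
    while sqlite3_step(stmt) == SQLITE_ROW {
        let contentId = sqlite3_column_int64(stmt, 0)  // rowid maps to your content
        let distance = sqlite3_column_double(stmt, 1)  // smaller = more similar
        print(contentId, distance)
    }
}
sqlite3_finalize(stmt)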
The Full Pipeline
Step 1: Index Content
When the user creates or modifies content (note, message, document), generate and store its embedding:
func indexContent(_ content: Content) async {
    // Encode with the on-device model, then persist vector + content ID
    let embedding = await embeddingModel.encode(content.text)
    database.storeEmbedding(contentId: content.id, vector: embedding)
}
Run indexing in the background. Users should not wait for embeddings to be computed.
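A sketch of one way to do that, reusing indexContent from above inside a detached, low-priority task (assuming it is safe to call off the main actor):
func indexAllInBackground(_ items: [Content]) {
    // Detached background task: the UI never waits on embedding work
    Task.detached(priority: .background) {
        for item in items {
            await indexContent(item)
        }
    }
}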
Step 2: Search
When the user enters a search query:
func search(query: String) async -> [Content] {
    // The query must be embedded with the same model used at index time
    let queryEmbedding = await embeddingModel.encode(query)
    let results = database.similaritySearch(queryEmbedding, limit: 10)
    return results.map { fetchContent($0.contentId) }
}
The search returns results ranked by semantic similarity. "Budget meeting notes" matches "Q3 Financial Review" because the embeddings capture the semantic relationship.
Step 3: Hybrid Search
Combine semantic search with keyword search for the best results:
- Run keyword search (SQLite FTS5) for exact matches
- Run semantic search for meaning-based matches
- Merge and deduplicate results
- Rank by combined score (keyword matches boosted)
This handles both exact queries ("meeting with John") and fuzzy queries ("that email about the project timeline").
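A sketch of the merge-and-rank step. The FTS5 and semantic queries are assumed to have already run; the 0.3 keyword boost is an illustrative starting point, not a tuned value:
struct RankedResult {
    let contentId: Int64
    let score: Double
}

func mergeResults(keywordIds: [Int64],
                  semantic: [(contentId: Int64, similarity: Double)]) -> [RankedResult] {
    // Start from semantic similarity scores...
    var scores: [Int64: Double] = [:]
    for hit in semantic {
        scores[hit.contentId] = hit.similarity
    }
    // ...then boost anything the keyword search also matched (this also
    // keeps keyword-only hits that semantic search missed)
    for id in keywordIds {
        scores[id, default: 0] += 0.3
    }
    return scores
        .map { RankedResult(contentId: $0.key, score: $0.value) }
        .sorted { $0.score > $1.score }
}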
Performance Budget
| Component | Storage | RAM | Speed |
|---|---|---|---|
| Embedding model | 23-55MB | 50-100MB during inference | 200-500 embeddings/sec |
| Vector index (10K items, 384d) | ~15MB | ~15MB | Under 5ms per search |
| Vector index (100K items, 384d) | ~150MB | ~30MB (with ANN index) | Under 10ms per search |
Total additional footprint for semantic search: 40-200MB storage, 65-130MB RAM during search. The index sizes follow directly from the vector math: 384 dimensions × 4 bytes ≈ 1.5KB per item, so 10,000 items cost roughly 15MB. This is a fraction of what a generative LLM requires, making it practical even on constrained devices.
Use Cases
Note-Taking Apps
Search across all notes by meaning. "Meeting notes from last week about the product launch" finds relevant notes regardless of exact wording.
Email Clients
Find emails by topic, not just sender or subject. "Conversation about the contract renewal" surfaces the right thread.
Photo Apps
Combine with image captioning (on-device) to enable text-based photo search. "Sunset at the beach" finds matching photos even without manual tags.
Document Managers
Search across PDFs, documents, and files by content and meaning.
Combining with On-Device LLMs
Semantic search pairs naturally with on-device generative models. Use the search results as context for the LLM:
- User asks a question
- Semantic search retrieves relevant content from their data
- The LLM generates an answer using the retrieved content as context
This is on-device RAG. No server needed. The entire pipeline (embedding, search, generation) runs locally.
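Wired together, the loop is short. A sketch, where search is the function from Step 2 and llm.generate stands in for whatever on-device generation wrapper you use; the three-result context window is an arbitrary choice:
func answer(question: String) async -> String {
    // 1. Retrieve: semantic search over the user's local content
    let hits = await search(query: question).prefix(3)
    let context = hits.map { $0.text }.joined(separator: "\n---\n")
    // 2. Generate: the local LLM answers using the retrieved context
    let prompt = "Use these notes to answer.\n\(context)\n\nQuestion: \(question)"
    return await llm.generate(prompt: prompt)
}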
For the generative component, fine-tune a model on your domain data using a platform like Ertas. The fine-tuned model combined with local semantic search creates a powerful, fully private AI assistant.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.