
On-Device Semantic Search: AI-Powered Search Without a Server
How to build semantic search that runs entirely on the user's phone. Local embeddings, vector similarity, and natural language queries across user content without a server or API.
Keyword search fails when users do not know the exact words. "That email about the budget meeting last week" does not match an email with subject "Q3 Financial Review." Semantic search understands meaning, not just keywords.
The standard approach puts semantic search on a server with a vector database. But for mobile apps where user content is local (notes, messages, photos, documents), sending that content to a server defeats the purpose of keeping it on-device.
On-device semantic search keeps everything local. The embeddings model runs on the phone. The vector index lives in local storage. The search query never leaves the device.
How Semantic Search Works
- Indexing: Each piece of content is converted to an embedding vector (a list of numbers that represents its meaning) using a small model
- Storage: The embedding vectors are stored alongside the content in a local database
- Querying: The user's search query is converted to an embedding vector using the same model
- Matching: The query vector is compared against all stored vectors using cosine similarity
- Ranking: Results are returned ranked by similarity score
The magic is in the embeddings. Two pieces of text about the same topic produce similar vectors, even if they share no keywords.
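In code, the matching step is only a few lines. A minimal Swift sketch of cosine similarity over raw embedding vectors, the same math every storage option below relies on:
// Cosine similarity: dot(a, b) / (|a| * |b|). 1.0 means same direction
// (same meaning); values near 0 mean unrelated. If your model emits
// normalized vectors, this reduces to a plain dot product.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embedding dimensions must match")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (normA.squareRoot() * normB.squareRoot())
}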
The Embedding Model
On-device embedding models are small and fast. Unlike generative LLMs (600MB-1.7GB), embedding models are typically 20-80MB:
| Model | Size | Dimensions | Speed (iPhone 15) |
|---|---|---|---|
| all-MiniLM-L6-v2 | 23MB | 384 | 500+ embeddings/sec |
| nomic-embed-text-v1.5 | 55MB | 768 | 200+ embeddings/sec |
| bge-small-en-v1.5 | 33MB | 384 | 400+ embeddings/sec |
At 200-500 embeddings per second, indexing 1,000 notes takes 2-5 seconds. Query embedding is near-instant (under 5ms).
Running the Embedding Model
You can run embedding models via:
ONNX Runtime Mobile: Supports embedding models in ONNX format. Available for iOS (via Swift) and Android (via Kotlin). The most mature option for mobile embedding inference.
// iOS with ONNX Runtime (input/output names vary by model export;
// BERT-style models usually also take an attention_mask input)
let env = try ORTEnv(loggingLevel: .warning)
let session = try ORTSession(env: env, modelPath: embeddingModelPath, sessionOptions: nil)
let inputTensor = try ORTValue(tensorData: tokenizedInput, elementType: .int64, shape: shape)
let outputs = try session.run(withInputs: ["input_ids": inputTensor],
                              outputNames: ["embeddings"],
                              runOptions: nil)
let embedding = try outputs["embeddings"]!.tensorData()
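One caveat: many sentence-embedding exports (including all-MiniLM-L6-v2) return one vector per token, and the sentence embedding is the mean over those vectors. A sketch of that pooling step, assuming a row-major [tokenCount x dims] float buffer and ignoring the attention mask for brevity:
// Mean pooling: average per-token vectors into one sentence vector.
// Production code should weight by the attention mask to skip padding.
func meanPool(tokenVectors: [Float], tokenCount: Int, dims: Int) -> [Float] {
    var pooled = [Float](repeating: 0, count: dims)
    for t in 0..<tokenCount {
        for d in 0..<dims {
            pooled[d] += tokenVectors[t * dims + d]
        }
    }
    return pooled.map { $0 / Float(tokenCount) }
}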
llama.cpp embedding mode: llama.cpp can generate embeddings from GGUF models using the embedding flag. This lets you use the same inference engine for both generation and embedding.
Vector Storage
SQLite with a Brute-Force Scan
The simplest approach for mobile: store vectors as BLOBs in SQLite and compute similarity in application code.
import java.nio.ByteBuffer

// Android: store an embedding as a BLOB (4 bytes per float)
fun storeEmbedding(db: SQLiteDatabase, contentId: Long, embedding: FloatArray) {
    val blob = ByteBuffer.allocate(embedding.size * 4)
    embedding.forEach { blob.putFloat(it) }
    db.execSQL(
        "INSERT INTO embeddings (content_id, vector) VALUES (?, ?)",
        arrayOf(contentId, blob.array())
    )
}
// Search by similarity: a linear scan over every stored vector
data class SearchResult(val contentId: Long, val similarity: Float)

fun search(db: SQLiteDatabase, queryEmbedding: FloatArray, limit: Int): List<SearchResult> {
    val results = mutableListOf<SearchResult>()
    db.rawQuery("SELECT content_id, vector FROM embeddings", null).use { cursor ->
        while (cursor.moveToNext()) {
            // Decode the BLOB back into floats (same byte order as storage)
            val blob = cursor.getBlob(1)
            val stored = FloatArray(blob.size / 4)
            ByteBuffer.wrap(blob).asFloatBuffer().get(stored)
            results.add(SearchResult(cursor.getLong(0), cosineSimilarity(queryEmbedding, stored)))
        }
    }
    return results.sortedByDescending { it.similarity }.take(limit)
}

// Cosine similarity, as described above
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (kotlin.math.sqrt(na) * kotlin.math.sqrt(nb))
}
This is simple and works for collections up to ~10,000 items. Beyond that, the linear scan becomes slow.
SQLite with Vector Extension
For larger collections, use a SQLite vector extension that supports approximate nearest neighbor (ANN) search:
- sqlite-vss: SQLite extension using Faiss for vector search. Supports iOS and Android.
- sqlite-vec: Lightweight vector search extension designed for embedded use.
These extensions create an index over the vectors, enabling sub-millisecond search over hundreds of thousands of items.
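As an illustration of what that looks like in practice, here is a nearest-neighbor query issued through the standard SQLite3 C API from Swift, assuming sqlite-vec is compiled into the app's SQLite build. The table name is hypothetical, and the exact query syntax should be verified against the extension's documentation:
import SQLite3

// Assumes: db is an OpaquePointer from sqlite3_open, and a virtual table
// created with: CREATE VIRTUAL TABLE vec_embeddings USING vec0(embedding float[384])
let SQLITE_TRANSIENT = unsafeBitCast(-1, to: sqlite3_destructor_type.self)
let sql = """
SELECT rowid, distance FROM vec_embeddings
WHERE embedding MATCH ? ORDER BY distance LIMIT 10
"""
var stmt: OpaquePointer?
if sqlite3_prepare_v2(db, sql, -1, &stmt, nil) == SQLITE_OK {
    // Bind the query embedding as a BLOB of 32-bit floats
    queryEmbedding.withUnsafeBytes { bytes in
        _ = sqlite3_bind_blob(stmt, 1, bytes.baseAddress, Int32(bytes.count), SQLITE_TRANSIENT)
    }
    while sqlite3_step(stmt) == SQLITE_ROW {
        let contentId = sqlite3_column_int64(stmt, 0)  // rowid maps to your content
        let distance = sqlite3_column_double(stmt, 1)  // smaller = more similar
        print(contentId, distance)
    }
}
sqlite3_finalize(stmt)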
The Full Pipeline
Step 1: Index Content
When the user creates or modifies content (note, message, document), generate and store its embedding:
func indexContent(_ content: Content) async {
    // Encode with the on-device model, then persist vector + content ID
    let embedding = await embeddingModel.encode(content.text)
    database.storeEmbedding(contentId: content.id, vector: embedding)
}
Run indexing in the background. Users should not wait for embeddings to be computed.
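A sketch of one way to do that, reusing indexContent from above inside a detached, low-priority task (assuming it is safe to call off the main actor):
func indexAllInBackground(_ items: [Content]) {
    // Detached background task: the UI never waits on embedding work
    Task.detached(priority: .background) {
        for item in items {
            await indexContent(item)
        }
    }
}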
Step 2: Search
When the user enters a search query:
func search(query: String) async -> [Content] {
    // The query must be embedded with the same model used at index time
    let queryEmbedding = await embeddingModel.encode(query)
    let results = database.similaritySearch(queryEmbedding, limit: 10)
    return results.map { fetchContent($0.contentId) }
}
The search returns results ranked by semantic similarity. "Budget meeting notes" matches "Q3 Financial Review" because the embeddings capture the semantic relationship.
Step 3: Hybrid Search
Combine semantic search with keyword search for the best results:
- Run keyword search (SQLite FTS5) for exact matches
- Run semantic search for meaning-based matches
- Merge and deduplicate results
- Rank by combined score (keyword matches boosted)
This handles both exact queries ("meeting with John") and fuzzy queries ("that email about the project timeline").
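A sketch of the merge-and-rank step. The FTS5 and semantic queries are assumed to have already run; the 0.3 keyword boost is an illustrative starting point, not a tuned value:
struct RankedResult {
    let contentId: Int64
    let score: Double
}

func mergeResults(keywordIds: [Int64],
                  semantic: [(contentId: Int64, similarity: Double)]) -> [RankedResult] {
    // Start from semantic similarity scores...
    var scores: [Int64: Double] = [:]
    for hit in semantic {
        scores[hit.contentId] = hit.similarity
    }
    // ...then boost anything the keyword search also matched (this also
    // keeps keyword-only hits that semantic search missed)
    for id in keywordIds {
        scores[id, default: 0] += 0.3
    }
    return scores
        .map { RankedResult(contentId: $0.key, score: $0.value) }
        .sorted { $0.score > $1.score }
}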
Performance Budget
| Component | Storage | RAM | Speed |
|---|---|---|---|
| Embedding model | 23-55MB | 50-100MB during inference | 200-500 embeddings/sec |
| Vector index (10K items, 384d) | ~15MB | ~15MB | Under 5ms per search |
| Vector index (100K items, 384d) | ~150MB | ~30MB (with ANN index) | Under 10ms per search |
Total additional footprint for semantic search: 40-200MB storage, 65-130MB RAM during search. The index sizes follow directly from the vector math: 384 dimensions × 4 bytes ≈ 1.5KB per item, so 10,000 items cost roughly 15MB. This is a fraction of what a generative LLM requires, making it practical even on constrained devices.
Use Cases
Note-Taking Apps
Search across all notes by meaning. "Meeting notes from last week about the product launch" finds relevant notes regardless of exact wording.
Email Clients
Find emails by topic, not just sender or subject. "Conversation about the contract renewal" surfaces the right thread.
Photo Apps
Combine with image captioning (on-device) to enable text-based photo search. "Sunset at the beach" finds matching photos even without manual tags.
Document Managers
Search across PDFs, documents, and files by content and meaning.
Combining with On-Device LLMs
Semantic search pairs naturally with on-device generative models. Use the search results as context for the LLM:
- User asks a question
- Semantic search retrieves relevant content from their data
- The LLM generates an answer using the retrieved content as context
This is on-device RAG. No server needed. The entire pipeline (embedding, search, generation) runs locally.
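Wired together, the loop is short. A sketch, where search is the function from Step 2 and llm.generate stands in for whatever on-device generation wrapper you use; the three-result context window is an arbitrary choice:
func answer(question: String) async -> String {
    // 1. Retrieve: semantic search over the user's local content
    let hits = await search(query: question).prefix(3)
    let context = hits.map { $0.text }.joined(separator: "\n---\n")
    // 2. Generate: the local LLM answers using the retrieved context
    let prompt = "Use these notes to answer.\n\(context)\n\nQuestion: \(question)"
    return await llm.generate(prompt: prompt)
}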
For the generative component, fine-tune a model on your domain data using a platform like Ertas. The fine-tuned model combined with local semantic search creates a powerful, fully private AI assistant.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.