
    Fine-Tuning vs RAG for Mobile: Why RAG Still Needs a Server

    RAG is the go-to solution for giving AI domain knowledge. But on mobile, RAG reintroduces the server dependency you are trying to eliminate. Fine-tuning bakes the knowledge into the model itself.

Ertas Team

    Retrieval-augmented generation (RAG) is the standard answer to "how do I give my AI domain knowledge?" Retrieve relevant documents, inject them into the prompt, let the model answer with context. It works well for web applications with server infrastructure.

    On mobile, RAG has a structural problem. The retrieval step requires a vector database. That database either lives on a server (reintroducing the server dependency) or on the device (consuming significant storage and RAM). Neither option is clean for mobile.

    Fine-tuning takes a different approach. Instead of looking up knowledge at inference time, you bake it into the model weights during training. The model knows your domain without retrieving anything.

    How RAG Works (And Why It Needs Infrastructure)

    The standard RAG pipeline:

    1. Index phase: Chunk your documents, generate embedding vectors, store in a vector database
    2. Query phase: Convert the user's question to an embedding, search the vector DB for similar chunks, retrieve the top 3-5 results
    3. Generation phase: Inject the retrieved chunks into the prompt alongside the user's question, send to the LLM
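To make the pipeline concrete, here is a minimal sketch of the query and generation phases in Python. It assumes sentence-transformers for embeddings and uses an in-memory array in place of a real vector database; the model choice and example chunks are illustrative only.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Small embedding model (the 50-100MB class discussed below); any embedding
# model works the same way.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Index phase (done once, offline): chunk documents and store their vectors.
chunks = [
    "Reset your password from Settings > Account.",
    "Refunds are processed within 5 business days.",
    "Enable dark mode under Settings > Appearance.",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 3) -> list[str]:
    """Query phase: cosine similarity between the question and every chunk."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    sims = chunk_vecs @ q                  # normalized vectors, so dot product = cosine
    top = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in top]

def build_prompt(question: str) -> str:
    """Generation phase: inject the retrieved chunks alongside the question."""
    context = "\n\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Every piece of this (the embedding model, the stored vectors, the similarity search) has to live somewhere, and that is exactly where mobile gets complicated.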

    The Server Dependency

Step 2 requires a vector similarity search. In server deployments this is handled by a vector database such as Pinecone, Weaviate, or Postgres with pgvector, all of which run as server-side services.

    For mobile, you have two options:

    Server-side RAG: The user's question is sent to your server, which performs the retrieval and calls the LLM API. This is the most common architecture, but it means every query requires a network round trip. You are back to cloud dependency with all its problems: latency, offline failure, privacy concerns, and ongoing infrastructure costs.

    On-device RAG: Store the vector database locally on the phone. This eliminates the server but creates new problems:

    • The vector database consumes 100MB-1GB+ of additional storage depending on your corpus size
    • Embedding generation for queries requires running a separate model (typically 50-100MB)
• SQLite with vector extensions is the most practical on-device option, but its search and indexing capabilities fall well short of a dedicated vector database
    • The total on-device footprint (LLM + embeddings model + vector DB) can exceed 3-4GB
    • Updating the knowledge base requires downloading new vectors, not just a model swap
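To see where the first bullet's storage estimate comes from, a back-of-the-envelope calculation is enough. The chunk count and embedding size below are assumptions, not measurements:

```python
# Rough storage for the raw vectors alone.
chunks = 50_000          # assumed corpus size after chunking
dims = 768               # assumed embedding dimensionality
bytes_per_float = 4      # float32

vector_mb = chunks * dims * bytes_per_float / 1e6
print(f"Raw vectors: {vector_mb:.0f} MB")  # ~154 MB before chunk text, metadata,
                                           # and index overhead are added
```

Add the chunk text, the index structures, and a 50-100MB embedding model, and the footprint climbs quickly toward the gigabyte range.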

    How Fine-Tuning Works

    Fine-tuning teaches the model your domain knowledge during training:

    1. Data preparation: Format your domain knowledge as question-answer pairs, conversations, or instruction-response examples
    2. Training: Run LoRA fine-tuning on a base model (1-3B parameters) with your data
    3. Export: Convert to GGUF, quantize, deploy on-device
    4. Inference: The model answers from learned knowledge. No retrieval step.
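For a sense of what steps 1 and 2 look like in code, here is a minimal LoRA setup using the Hugging Face transformers and peft libraries. The base model name and hyperparameters are placeholders, not recommendations:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-1.5B"                       # any 1-3B base model
tokenizer = AutoTokenizer.from_pretrained(base)  # used when tokenizing your examples
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,      # typical small-adapter settings
    target_modules=["q_proj", "v_proj"],         # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()               # a few million trainable params, billions frozen

# Train on your instruction-response examples with your preferred trainer,
# then merge the adapter and convert the result to GGUF with llama.cpp's
# conversion tooling for on-device deployment.
```

Because LoRA trains only small adapter matrices on top of a frozen base model, the job fits on a single GPU, which is a large part of why it is the default technique for this workflow.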

    The model weight file is the only artifact. No vector database, no embeddings model, no retrieval infrastructure.

    The Comparison

| Factor | RAG (Server) | RAG (On-Device) | Fine-Tuning |
|---|---|---|---|
| Server required | Yes | No | No |
| Offline support | No | Yes | Yes |
| On-device storage | N/A | 1-4GB (LLM + vectors + embeddings) | 600MB-1.7GB (LLM only) |
| Knowledge updates | Update vector DB | Re-download vectors | Re-download model |
| Latency per query | 500-3,000ms (network) | 200-500ms (retrieval + generation) | 50-200ms (generation only) |
| Infrastructure cost | Vector DB + API costs | None | None |
| Privacy | Data sent to server | On-device | On-device |
| Knowledge freshness | Real-time | Periodic updates | Periodic updates |

    When RAG Is Better

    RAG has genuine advantages in specific scenarios:

    Rapidly changing knowledge base. If your content changes daily (news, inventory, pricing), fine-tuning cannot keep up. RAG retrieves the latest documents. But consider: how many mobile AI features actually need real-time knowledge updates?

    Very large knowledge bases. If your domain knowledge spans millions of documents, fine-tuning a 1-3B model cannot absorb it all. RAG retrieves the relevant subset. But again: how many mobile apps need to search millions of documents locally?

    Source attribution. RAG can point to the specific document that informed its answer. Fine-tuned models cannot easily cite their sources. If your feature requires "here is the source" alongside the answer, RAG has an edge.

    When Fine-Tuning Is Better

    For most mobile AI use cases, fine-tuning wins:

    Your knowledge is stable. Product documentation, company policies, domain terminology, industry rules. These change monthly or quarterly, not daily. Fine-tuning absorbs this knowledge cleanly.

    Your task is specific. Classification, summarization, Q&A about a bounded domain, content generation in a specific style. These tasks benefit from deep domain knowledge baked into the model, not from document retrieval.

    You need offline support. Fine-tuned models work identically offline. RAG (even on-device) is slower and more complex offline.

    You want simplicity. One file (the GGUF model) vs three components (LLM + embeddings model + vector DB). Less complexity means fewer things to break.

    You care about speed. Fine-tuned inference has no retrieval step. Time to first token is 50-200ms vs 200-500ms for on-device RAG.

    The Hybrid Approach

    Some mobile apps benefit from a hybrid:

    • Fine-tuned model for domain knowledge, style, and task execution
    • Local search (keyword or simple SQLite FTS5) for user-specific content (notes, messages, documents)
    • Server-side RAG as an optional enhancement when connected, for tasks that exceed the on-device model's knowledge

    The fine-tuned model handles 90% of queries. Local search handles user-specific content. Server-side RAG handles the rare edge cases that need broader knowledge, but only when connectivity is available.
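As a sketch of the local-search component, SQLite's FTS5 full-text index covers user content without any embeddings at all. The example is Python for brevity and uses made-up table names and data; on-device, the same SQL runs through Android's or iOS's SQLite bindings, assuming the build includes FTS5 (most modern ones do):

```python
import sqlite3

db = sqlite3.connect("user_content.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS notes USING fts5(title, body)")
db.execute("INSERT INTO notes VALUES (?, ?)", ("Trip", "Flight lands at 9pm Friday"))
db.commit()

def search_notes(query: str, limit: int = 5) -> list[tuple[str, str]]:
    """Keyword match over the user's notes: no embeddings, no vector DB."""
    rows = db.execute(
        "SELECT title, body FROM notes WHERE notes MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()

print(search_notes("flight friday"))  # [('Trip', 'Flight lands at 9pm Friday')]
```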

    Practical Example

    Scenario: A customer support chatbot for a mobile app.

RAG approach: Store your 500 support articles in a vector database. On each user question, retrieve the 3 most relevant articles, inject them into the prompt, and generate a response. This requires either server infrastructure or keeping the vectors and an embedding model on-device.

    Fine-tuning approach: Convert your 500 support articles into 2,000 Q&A training examples. Fine-tune a 3B model. The model learns your product, your terminology, and your support style. Deploy the 1.7GB GGUF file. No retrieval needed.
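Step 1 of that path is mostly data formatting. Here is a minimal sketch of turning support content into chat-style training examples in JSONL, the kind of format most fine-tuning pipelines accept; the records and file name are made up:

```python
import json

examples = [
    {"messages": [
        {"role": "user", "content": "How do I request a refund?"},
        {"role": "assistant", "content": "Open Settings > Billing and tap "
            "'Request refund'. Refunds are processed within 5 business days."},
    ]},
    # ...one entry per Q&A pair; each article typically yields several
]

with open("support_qa.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Each article yielding a handful of pairs is how 500 articles become roughly 2,000 training examples.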

    Result: The fine-tuned model answers from internalized knowledge. Response latency: 100-200ms. Storage: 1.7GB (model only). Offline: works. Infrastructure cost: $0/month after deployment.

    Platforms like Ertas simplify the fine-tuning path: upload your support articles or Q&A pairs, train with LoRA, export GGUF, deploy. The domain knowledge is in the model, not in a database.

    The Decision

    If your mobile app needs AI with domain knowledge:

    1. Is the knowledge stable (updated monthly or less)? Fine-tune.
    2. Does the app need to work offline? Fine-tune.
    3. Is the knowledge base under 10,000 documents? Fine-tune.
    4. Does the knowledge change daily? Consider server-side RAG with fine-tuned fallback.
    5. Do you need source citation? Consider RAG for those specific features.

    For the majority of mobile AI use cases, fine-tuning is simpler, faster, cheaper, and more reliable than RAG.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
