
    Fine-Tuning vs RAG for Mobile: Why RAG Still Needs a Server

    RAG is the go-to solution for giving AI domain knowledge. But on mobile, RAG reintroduces the server dependency you are trying to eliminate. Fine-tuning bakes the knowledge into the model itself.

Ertas Team

    Retrieval-augmented generation (RAG) is the standard answer to "how do I give my AI domain knowledge?" Retrieve relevant documents, inject them into the prompt, let the model answer with context. It works well for web applications with server infrastructure.

    On mobile, RAG has a structural problem. The retrieval step requires a vector database. That database either lives on a server (reintroducing the server dependency) or on the device (consuming significant storage and RAM). Neither option is clean for mobile.

    Fine-tuning takes a different approach. Instead of looking up knowledge at inference time, you bake it into the model weights during training. The model knows your domain without retrieving anything.

    How RAG Works (And Why It Needs Infrastructure)

    The standard RAG pipeline:

    1. Index phase: Chunk your documents, generate embedding vectors, store in a vector database
    2. Query phase: Convert the user's question to an embedding, search the vector DB for similar chunks, retrieve the top 3-5 results
    3. Generation phase: Inject the retrieved chunks into the prompt alongside the user's question, send to the LLM
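To make the pipeline concrete, here is a minimal sketch of the query and generation phases in Python. It assumes sentence-transformers for embeddings and uses an in-memory array in place of a real vector database; the model choice and example chunks are illustrative only.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Small embedding model (the 50-100MB class discussed below); any embedding
# model works the same way.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Index phase (done once, offline): chunk documents and store their vectors.
chunks = [
    "Reset your password from Settings > Account.",
    "Refunds are processed within 5 business days.",
    "Enable dark mode under Settings > Appearance.",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 3) -> list[str]:
    """Query phase: cosine similarity between the question and every chunk."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    sims = chunk_vecs @ q                  # normalized vectors, so dot product = cosine
    top = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in top]

def build_prompt(question: str) -> str:
    """Generation phase: inject the retrieved chunks alongside the question."""
    context = "\n\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Every piece of this (the embedding model, the stored vectors, the similarity search) has to live somewhere, and that is exactly where mobile gets complicated.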

    The Server Dependency

Step 2 requires a vector similarity search. In server deployments this is handled by a vector database such as Pinecone, Weaviate, or Postgres with pgvector, all of which run as server-side services.

    For mobile, you have two options:

    Server-side RAG: The user's question is sent to your server, which performs the retrieval and calls the LLM API. This is the most common architecture, but it means every query requires a network round trip. You are back to cloud dependency with all its problems: latency, offline failure, privacy concerns, and ongoing infrastructure costs.

    On-device RAG: Store the vector database locally on the phone. This eliminates the server but creates new problems:

    • The vector database consumes 100MB-1GB+ of additional storage depending on your corpus size
    • Embedding generation for queries requires running a separate model (typically 50-100MB)
• SQLite with vector extensions is the most practical on-device option, but its search and indexing capabilities fall well short of a dedicated vector database
    • The total on-device footprint (LLM + embeddings model + vector DB) can exceed 3-4GB
    • Updating the knowledge base requires downloading new vectors, not just a model swap
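To see where the first bullet's storage estimate comes from, a back-of-the-envelope calculation is enough. The chunk count and embedding size below are assumptions, not measurements:

```python
# Rough storage for the raw vectors alone.
chunks = 50_000          # assumed corpus size after chunking
dims = 768               # assumed embedding dimensionality
bytes_per_float = 4      # float32

vector_mb = chunks * dims * bytes_per_float / 1e6
print(f"Raw vectors: {vector_mb:.0f} MB")  # ~154 MB before chunk text, metadata,
                                           # and index overhead are added
```

Add the chunk text, the index structures, and a 50-100MB embedding model, and the footprint climbs quickly toward the gigabyte range.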

    How Fine-Tuning Works

    Fine-tuning teaches the model your domain knowledge during training:

    1. Data preparation: Format your domain knowledge as question-answer pairs, conversations, or instruction-response examples
    2. Training: Run LoRA fine-tuning on a base model (1-3B parameters) with your data
    3. Export: Convert to GGUF, quantize, deploy on-device
    4. Inference: The model answers from learned knowledge. No retrieval step.
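For a sense of what steps 1 and 2 look like in code, here is a minimal LoRA setup using the Hugging Face transformers and peft libraries. The base model name and hyperparameters are placeholders, not recommendations:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-1.5B"                       # any 1-3B base model
tokenizer = AutoTokenizer.from_pretrained(base)  # used when tokenizing your examples
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,      # typical small-adapter settings
    target_modules=["q_proj", "v_proj"],         # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()               # a few million trainable params, billions frozen

# Train on your instruction-response examples with your preferred trainer,
# then merge the adapter and convert the result to GGUF with llama.cpp's
# conversion tooling for on-device deployment.
```

Because LoRA trains only small adapter matrices on top of a frozen base model, the job fits on a single GPU, which is a large part of why it is the default technique for this workflow.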

    The model weight file is the only artifact. No vector database, no embeddings model, no retrieval infrastructure.

    The Comparison

| Factor | RAG (Server) | RAG (On-Device) | Fine-Tuning |
|---|---|---|---|
| Server required | Yes | No | No |
| Offline support | No | Yes | Yes |
| On-device storage | N/A | 1-4GB (LLM + vectors + embeddings) | 600MB-1.7GB (LLM only) |
| Knowledge updates | Update vector DB | Re-download vectors | Re-download model |
| Latency per query | 500-3,000ms (network) | 200-500ms (retrieval + generation) | 50-200ms (generation only) |
| Infrastructure cost | Vector DB + API costs | None | None |
| Privacy | Data sent to server | On-device | On-device |
| Knowledge freshness | Real-time | Periodic updates | Periodic updates |

    When RAG Is Better

    RAG has genuine advantages in specific scenarios:

    Rapidly changing knowledge base. If your content changes daily (news, inventory, pricing), fine-tuning cannot keep up. RAG retrieves the latest documents. But consider: how many mobile AI features actually need real-time knowledge updates?

    Very large knowledge bases. If your domain knowledge spans millions of documents, fine-tuning a 1-3B model cannot absorb it all. RAG retrieves the relevant subset. But again: how many mobile apps need to search millions of documents locally?

    Source attribution. RAG can point to the specific document that informed its answer. Fine-tuned models cannot easily cite their sources. If your feature requires "here is the source" alongside the answer, RAG has an edge.

    When Fine-Tuning Is Better

    For most mobile AI use cases, fine-tuning wins:

    Your knowledge is stable. Product documentation, company policies, domain terminology, industry rules. These change monthly or quarterly, not daily. Fine-tuning absorbs this knowledge cleanly.

    Your task is specific. Classification, summarization, Q&A about a bounded domain, content generation in a specific style. These tasks benefit from deep domain knowledge baked into the model, not from document retrieval.

    You need offline support. Fine-tuned models work identically offline. RAG (even on-device) is slower and more complex offline.

    You want simplicity. One file (the GGUF model) vs three components (LLM + embeddings model + vector DB). Less complexity means fewer things to break.

    You care about speed. Fine-tuned inference has no retrieval step. Time to first token is 50-200ms vs 200-500ms for on-device RAG.

    The Hybrid Approach

    Some mobile apps benefit from a hybrid:

    • Fine-tuned model for domain knowledge, style, and task execution
    • Local search (keyword or simple SQLite FTS5) for user-specific content (notes, messages, documents)
    • Server-side RAG as an optional enhancement when connected, for tasks that exceed the on-device model's knowledge

    The fine-tuned model handles 90% of queries. Local search handles user-specific content. Server-side RAG handles the rare edge cases that need broader knowledge, but only when connectivity is available.
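As a sketch of the local-search component, SQLite's FTS5 full-text index covers user content without any embeddings at all. The example is Python for brevity and uses made-up table names and data; on-device, the same SQL runs through Android's or iOS's SQLite bindings, assuming the build includes FTS5 (most modern ones do):

```python
import sqlite3

db = sqlite3.connect("user_content.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS notes USING fts5(title, body)")
db.execute("INSERT INTO notes VALUES (?, ?)", ("Trip", "Flight lands at 9pm Friday"))
db.commit()

def search_notes(query: str, limit: int = 5) -> list[tuple[str, str]]:
    """Keyword match over the user's notes: no embeddings, no vector DB."""
    rows = db.execute(
        "SELECT title, body FROM notes WHERE notes MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()

print(search_notes("flight friday"))  # [('Trip', 'Flight lands at 9pm Friday')]
```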

    Practical Example

    Scenario: A customer support chatbot for a mobile app.

RAG approach: Store your 500 support articles in a vector database. On each user question, retrieve the 3 most relevant articles, inject them into the prompt, and generate a response. This requires either server infrastructure or keeping the vectors and an embedding model on-device.

    Fine-tuning approach: Convert your 500 support articles into 2,000 Q&A training examples. Fine-tune a 3B model. The model learns your product, your terminology, and your support style. Deploy the 1.7GB GGUF file. No retrieval needed.
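Step 1 of that path is mostly data formatting. Here is a minimal sketch of turning support content into chat-style training examples in JSONL, the kind of format most fine-tuning pipelines accept; the records and file name are made up:

```python
import json

examples = [
    {"messages": [
        {"role": "user", "content": "How do I request a refund?"},
        {"role": "assistant", "content": "Open Settings > Billing and tap "
            "'Request refund'. Refunds are processed within 5 business days."},
    ]},
    # ...one entry per Q&A pair; each article typically yields several
]

with open("support_qa.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Each article yielding a handful of pairs is how 500 articles become roughly 2,000 training examples.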

    Result: The fine-tuned model answers from internalized knowledge. Response latency: 100-200ms. Storage: 1.7GB (model only). Offline: works. Infrastructure cost: $0/month after deployment.

    Platforms like Ertas simplify the fine-tuning path: upload your support articles or Q&A pairs, train with LoRA, export GGUF, deploy. The domain knowledge is in the model, not in a database.

    The Decision

    If your mobile app needs AI with domain knowledge:

    1. Is the knowledge stable (updated monthly or less)? Fine-tune.
    2. Does the app need to work offline? Fine-tune.
    3. Is the knowledge base under 10,000 documents? Fine-tune.
    4. Does the knowledge change daily? Consider server-side RAG with fine-tuned fallback.
    5. Do you need source citation? Consider RAG for those specific features.

    For the majority of mobile AI use cases, fine-tuning is simpler, faster, cheaper, and more reliable than RAG.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
