Performance tips

    Cross-platform tips for fast and predictable on-device inference: lifecycle, caching, sampling, GPU offload, and memory hygiene.

    Most performance wins on-device come from a small number of habits that apply across platforms: load the model once, cache aggressively, size your output budget realistically, and let llama.cpp pick its own thread count. The exotic levers (manual mmap tuning, speculative decoding, batch decoding) rarely move the needle for a single-user chat app, while the basics here move it a lot.

    This page is a cross-platform reference. Platform-specific notes live in iOS, Android, Desktop, and Web. For the size-vs-quality story behind quantization choices, see Quantization.

    Load the model once

    A Q4_K_M model takes 1 to 3 seconds to load and allocates 0.5 to 9 GB of native RAM, depending on size. Load it exactly once, at app startup or during a splash screen, and reuse the loaded model and session across every inference call.

    // Good: load at splash, hold for app lifetime
    class LlamaService {
      LlamaEngine? _engine;
      ChatSession? _session;
      bool _isLoaded = false;
    
      Future<void> loadModel(String path) async { /* ... */ }
      Future<String> generate(String prompt) async { /* uses cached _engine */ }
    }

    // Bad: lazy-load on first user request
    Future<String> generateBadly(String prompt) async {
      final engine = LlamaEngine(LlamaBackend()); // 700 MB allocation
      await engine.loadModel(path);                // 1 to 3 second wait
      // ... user is staring at a frozen UI
    }

    Lazy-loading on first request makes the user's first interaction unpredictable: load time is invisible to them, so the wait gets blamed on the AI, not on initialization. Splash-screen load lets you show a progress UI that sets expectations.

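    If the OS kills or trims your app while it is backgrounded (common on Android under memory pressure, possible on iOS), do not assume the model is still loaded on resume: always check isLoaded before each generation call and reload defensively. A killed process loses the loaded model outright, and in cases where the native side has been freed but the wrapper survives, the language-level object stays valid while the native pointer is dead; without the guard, the first post-resume call crashes.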
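
    The defensive check can be sketched as a small guard that every generation call goes through. This is a minimal sketch, not the binding's API: ModelGuard and its isLoaded and load callbacks are hypothetical names standing in for whatever your llama.cpp binding exposes.

    ```dart
    /// Hypothetical wrapper: [isLoaded] and [load] stand in for the binding's
    /// real calls; they are assumptions for illustration only.
    class ModelGuard {
      ModelGuard({required this.isLoaded, required this.load});

      final bool Function() isLoaded;
      final Future<void> Function() load;

      Future<void>? _pending; // in-flight reload shared by concurrent callers

      /// Reloads the model if it is no longer loaded, then runs [action].
      Future<T> run<T>(Future<T> Function() action) async {
        if (!isLoaded()) {
          // Start at most one reload, even if several calls arrive at once.
          _pending ??= load().whenComplete(() => _pending = null);
          await _pending;
        }
        return action();
      }
    }
    ```

    Sharing one in-flight reload matters because a resumed app often fires several requests at once; without it, each request would start its own multi-second load.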

    Reuse the chat session, reset between unrelated calls

    The chat session holds the KV (key-value) cache for the conversation. Reusing the session across turns of the same conversation is what makes successive turns fast: only the new tokens need to be processed, while earlier tokens reuse their cached keys and values.

    But the session also holds the system prompt and conversation history that was processed into the cache. If you start a new, unrelated conversation, the old context contaminates the new one:

    // Good: reset between unrelated calls
    final reply = await session.generate(userPrompt);
    session.reset(); // clears KV cache and history
    return reply.trim();

    // Bad: history leaks into the next call
    final reply1 = await session.generate("Translate to French: Hello.");
    // session still holds the Hello/Bonjour exchange in its KV cache
    final reply2 = await session.generate("What is 2 + 2?");
    // model has "translate to French" context, may answer "Deux" instead of "4"

    Forgetting session.reset() is the single most common bug when integrating an FFI binding to llama.cpp. The failure mode is subtle: outputs get progressively worse instead of erroring, and the cause is invisible without instrumenting the session state.
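
    One way to make the reset hard to forget is to route unrelated calls through a helper that resets in a finally block, so the session is cleared even when generation throws. A minimal sketch, assuming the session exposes generate and reset roughly as in the snippets above; generateIsolated is a hypothetical helper name.

    ```dart
    /// Runs one self-contained request and always resets afterwards, even if
    /// generation throws, so no context leaks into the next call.
    /// [generate] and [reset] are passed in as functions so the sketch does not
    /// depend on any particular binding's session type.
    Future<String> generateIsolated(
      Future<String> Function(String prompt) generate,
      void Function() reset,
      String prompt,
    ) async {
      try {
        final reply = await generate(prompt);
        return reply.trim();
      } finally {
        reset(); // clears KV cache and history on success and on failure
      }
    }
    ```

    With a session object like the one above, a call would look like generateIsolated(session.generate, session.reset, userPrompt). If your binding's reset is asynchronous, await it inside the finally block instead.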

    Cache results aggressively

    For repeated inputs (same prompt, same sampling parameters, same model), a previously generated output is usually as good as a fresh one, and with greedy decoding or a fixed seed the model would typically reproduce it anyway. A file-based output cache turns a 1 to 3 second inference into a ~10 ms disk read.

    Suggested cache key shape: {normalized_prompt_hash}_{model_version}_{key_params}.json. Include the model version so a model update invalidates stale cached outputs.

    Future<String> generateOrCache(String prompt) async {
      final key = cacheKey(prompt, modelVersion: "v3");
      final cached = await cache.read(key);
      if (cached != null) return cached;
    
      final fresh = await llamaService.generate(prompt);
      await cache.write(key, fresh);
      return fresh;
    }
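
    The cacheKey helper used above is not defined on this page. One possible shape, following the suggested key format, is sketched below. The normalization rule, the choice of parameters in the key, and the 32-bit FNV-1a hash are all assumptions made to keep the sketch dependency-free; a longer hash such as SHA-256 is a safer choice when the cache grows large.

    ```dart
    import 'dart:convert';

    /// 32-bit FNV-1a hash of [text] as 8 hex digits.
    String fnv1a32Hex(String text) {
      var hash = 0x811c9dc5; // FNV offset basis
      for (final byte in utf8.encode(text)) {
        // Multiply by the FNV prime and wrap to 32 bits.
        hash = ((hash ^ byte) * 0x01000193) & 0xFFFFFFFF;
      }
      return hash.toRadixString(16).padLeft(8, '0');
    }

    /// Collapses whitespace so trivially different prompts share one entry.
    String normalizePrompt(String prompt) =>
        prompt.trim().replaceAll(RegExp(r'\s+'), ' ');

    /// Builds `{normalized_prompt_hash}_{model_version}_{key_params}.json`.
    String cacheKey(
      String prompt, {
      required String modelVersion,
      double temperature = 0.7,
      int maxTokens = 256,
    }) {
      final hash = fnv1a32Hex(normalizePrompt(prompt));
      return '${hash}_${modelVersion}_t${temperature}_n$maxTokens.json';
    }
    ```

    Only parameters that change the output belong in the key. Lowercasing is deliberately left out of the normalization, since case can change what the model is being asked.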

    Use file-based JSON or a small embedded DB for cache storage. Do not use SharedPreferences (Android) or UserDefaults (iOS) for the output cache; both load the entire backing store into memory on first access, which works fine for a few settings but turns into a memory bomb at hundreds of cached outputs. SQLite is overkill for a key-value cache; plain files in the app's private files directory (getFilesDir() on Android), one per key, are enough.

    Clear the cache when the model version updates. Stale outputs from a previous fine-tune do not match the new model's style or quality, and the inconsistency is hard to debug from user reports.
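
    A one-file-per-key cache with version-based cleanup fits in a few lines of dart:io. This is a sketch under assumptions: FileOutputCache is a hypothetical class, the directory is supplied by the caller (on Flutter, for example, one obtained through the path_provider package), and keys are assumed to embed the model version as `_{model_version}_`, as in the key shape suggested earlier.

    ```dart
    import 'dart:io';

    /// One-file-per-key output cache rooted at [dir].
    class FileOutputCache {
      FileOutputCache(this.dir);

      final Directory dir;

      File _file(String key) => File('${dir.path}${Platform.pathSeparator}$key');

      /// Returns the cached text, or null on a miss.
      Future<String?> read(String key) async {
        final file = _file(key);
        if (!await file.exists()) return null;
        return file.readAsString();
      }

      Future<void> write(String key, String value) async {
        await dir.create(recursive: true);
        await _file(key).writeAsString(value);
      }

      /// Deletes entries whose file name does not contain `_{modelVersion}_`
      /// and returns how many were removed.
      Future<int> evictOtherVersions(String modelVersion) async {
        if (!await dir.exists()) return 0;
        final entries = await dir.list().toList();
        var removed = 0;
        for (final entity in entries) {
          final name = entity.uri.pathSegments.last;
          if (entity is File && !name.contains('_${modelVersion}_')) {
            await entity.delete();
            removed++;
          }
        }
        return removed;
      }
    }
    ```

    Calling evictOtherVersions once at startup, right after the model version is known, keeps stale outputs from accumulating on disk even though versioned keys already prevent them from being served.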

    Size maxTokens to the expected output

    maxTokens (or nPredict, depending on the binding) is a performance knob, not just a quality knob. Each generated token takes a roughly fixed amount of time at a given hardware/quantization combination, so worst-case latency scales with the budget. Generation normally ends at the stop token, but when the model misses it, a 256-token budget on a task that produces 30-token answers lets it spend almost 90% of that budget on filler the user has to wait for.

    Set maxTokens to roughly the longest output your task actually produces, plus a small buffer for variation. Per task class:

    • Single-answer / classification / extraction: 32 to 128 tokens. The model should commit to an answer quickly; longer budgets invite hedging or filler.
    • Single-paragraph response (chat, support, persona reply): 64 to 256 tokens. Most fine-tunes fall in this range.
    • Multi-paragraph explanation or summary: 256 to 512 tokens. Headroom for the response without inviting rambling.
    • Code generation or long-form content: 512 to 1024 tokens. Necessary for tasks with consistently long targets; evaluate whether you actually need this much.

    The "1.5x the longest training-data output" heuristic from the on-device-AI playbook works as a baseline: it gives the model headroom for natural variation without space to ramble. If your training data has long-tail outputs much larger than the typical case, anchor on the p90 length rather than the absolute max, then check that the truncated cases at the tail are not the ones users will care about.
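
    The heuristic is easy to compute from the token lengths of your training outputs. The sketch below is illustrative: suggestMaxTokens and truncatedShare are hypothetical helpers, and the nearest-rank percentile is one reasonable definition among several.

    ```dart
    /// Suggests a maxTokens budget from observed output lengths (in tokens).
    /// [percentile] is 1..100: 100 anchors on the longest output, 90 on the p90.
    int suggestMaxTokens(
      List<int> outputLengths, {
      int percentile = 100,
      double headroom = 1.5,
    }) {
      if (outputLengths.isEmpty) {
        throw ArgumentError.value(outputLengths, 'outputLengths', 'must not be empty');
      }
      if (percentile < 1 || percentile > 100) {
        throw RangeError.range(percentile, 1, 100, 'percentile');
      }
      final sorted = [...outputLengths]..sort();
      // Nearest-rank percentile, computed with integer ceiling division.
      final rank = (percentile * sorted.length + 99) ~/ 100;
      final anchor = sorted[rank - 1];
      return (anchor * headroom).ceil();
    }

    /// Fraction of samples longer than [maxTokens], i.e. outputs that would be cut off.
    double truncatedShare(List<int> outputLengths, int maxTokens) =>
        outputLengths.where((n) => n > maxTokens).length / outputLengths.length;
    ```

    For lengths of 20, 25, 30, 30, 35, 40, 45, 50, 60, and 400 tokens, anchoring on the maximum suggests 600 tokens, while anchoring on the p90 suggests 90 and truncates one sample in ten. Whether that one long sample matters to users is the judgment call described above.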

    Leave sampling parameters at defaults

    Ertas bakes temperature and top_p into the bundled Ollama Modelfile based on what you set in Training Config (default 0.7 and 0.9). For most fine-tuned models, the defaults are fine. The strong heuristic from the on-device-AI playbook:

    If you need heavy parameter tuning to get acceptable output, the issue is almost certainly in your training data, not in the parameters.

    That said, two adjustments are common and safe:

    • Lower temperature (0.3 to 0.5) for structured-output or classification tasks where deterministic, on-format outputs matter more than creativity.
    • Higher temperature (0.9 to 1.1) for creative tasks where you actively want variety across generations.

    Past a temperature of 1.2, outputs degrade into nonsense for most small models. Past a top_p of 0.95, you start sampling from the very long tail and risk off-distribution words.
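
    If the app overrides sampling at runtime, keeping the overrides in one table with a sanity check makes accidental drift easy to spot. A minimal sketch: TaskKind, SamplingParams, presetFor, and samplingWarnings are hypothetical names, and the preset values are simply midpoints of the ranges above, not tuned recommendations.

    ```dart
    enum TaskKind { structured, general, creative }

    class SamplingParams {
      const SamplingParams(this.temperature, this.topP);
      final double temperature;
      final double topP;
    }

    // Starting points only; the midpoints are an arbitrary choice.
    const _presets = <TaskKind, SamplingParams>{
      TaskKind.structured: SamplingParams(0.4, 0.9),
      TaskKind.general: SamplingParams(0.7, 0.9),
      TaskKind.creative: SamplingParams(1.0, 0.9),
    };

    SamplingParams presetFor(TaskKind kind) => _presets[kind]!;

    /// Returns warnings for values past the thresholds where quality usually drops.
    List<String> samplingWarnings(SamplingParams p) => [
          if (p.temperature > 1.2)
            'temperature ${p.temperature} > 1.2: output is likely incoherent',
          if (p.topP > 0.95)
            'top_p ${p.topP} > 0.95: sampling from the long tail',
        ];
    ```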

    Manage context length thoughtfully

    Larger context windows let you pass more text in, but they cost more memory and slow attention down. The KV cache scales linearly with context length; doubling the context roughly doubles the per-token attention cost during generation.

    For a typical chat or extraction task, 1024 to 4096 tokens of context is the right range. The trade-off:

    • 1024 tokens: smallest KV cache, fastest inference, lowest memory. Forces aggressive truncation of conversation history. Right for single-turn tasks.
    • 2048 tokens: the sweet spot for chat fine-tunes with a few turns of history. Most production fine-tunes land here.
    • 4096 tokens: useful for long single-turn prompts (long documents to summarize, code files to analyze) at roughly 2x the KV cache memory of 2048.

    Reaching for 8K or 32K contexts on a small model is rarely worthwhile: effective attention quality on small models degrades well before the nominal context limit, while KV cache memory keeps growing linearly with context length, so you pay for capacity the model cannot use well.
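
    The linear growth is easy to quantify with the standard transformer estimate: two tensors (keys and values) per layer, each holding context length x KV heads x head dimension elements. The sketch below assumes an unquantized f16 cache; kvCacheBytes is a hypothetical helper, and the model shape in the usage note is illustrative rather than taken from any specific model on this page.

    ```dart
    /// Estimated KV cache size in bytes:
    /// 2 (keys and values) x layers x context x KV heads x head dim x bytes per element.
    int kvCacheBytes({
      required int nLayers,
      required int nKvHeads,
      required int headDim,
      required int contextLength,
      int bytesPerElement = 2, // f16 cache entries
    }) =>
        2 * nLayers * contextLength * nKvHeads * headDim * bytesPerElement;

    /// Converts a byte count to mebibytes for display.
    double toMiB(int bytes) => bytes / (1024 * 1024);
    ```

    For an illustrative small model with 16 layers, 8 KV heads, and a head dimension of 64, the estimate is 64 MiB at 2048 tokens, 128 MiB at 4096, and 1 GiB at 32768, all on top of the weights themselves.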

    If you must hold a long conversation history:

    • Truncate the oldest messages rather than passing the entire history every turn.
    • Summarize older context into a system-message prefix once it reaches a token budget.
    • Use the model's own chat template for proper turn boundaries; do not concatenate raw strings.
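
    The first strategy, dropping the oldest messages, can be sketched as follows. ChatMessage, roughTokenCount, and trimHistory are hypothetical names, and the 4-characters-per-token estimate is a crude assumption for English text; use the binding's tokenizer for real counts.

    ```dart
    class ChatMessage {
      const ChatMessage(this.role, this.content);
      final String role; // 'system', 'user', or 'assistant'
      final String content;
    }

    /// Crude token estimate: about 4 characters per token.
    int roughTokenCount(String text) => (text.length / 4).ceil();

    /// Keeps all system messages plus the newest turns that fit in [budget] tokens.
    List<ChatMessage> trimHistory(
      List<ChatMessage> history,
      int budget, {
      int Function(String) countTokens = roughTokenCount,
    }) {
      final system = history.where((m) => m.role == 'system').toList();
      final turns = history.where((m) => m.role != 'system').toList();

      var used = system.fold<int>(0, (sum, m) => sum + countTokens(m.content));
      final kept = <ChatMessage>[];
      // Walk from newest to oldest and stop at the first message that does not fit.
      for (final message in turns.reversed) {
        final cost = countTokens(message.content);
        if (used + cost > budget) break;
        used += cost;
        kept.add(message);
      }
      return [...system, ...kept.reversed];
    }
    ```

    The trimmed list still has to go through the model's chat template before inference. Leave room in the budget for the reply: the context window has to hold both the prompt and the tokens generated from it.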

    GPU offload (cross-platform)

    GPU acceleration speeds up inference roughly 2 to 4x over CPU-only on most hardware, with diminishing returns at small model sizes:

    | Platform | Acceleration | Notes |
    | --- | --- | --- |
    | Apple Silicon (macOS, iOS) | Metal | Enabled by default in current llama.cpp builds. The biggest single-step performance win available. |
    | NVIDIA (Windows, Linux) | CUDA | Requires a CUDA-enabled build. Most third-party bindings have a CUDA tag. |
    | AMD (Windows, Linux) | Vulkan or ROCm | Vulkan is more portable; ROCm is faster but Linux-specific. |
    | Intel iGPU | SYCL / Level Zero | Available in llama.cpp; gain is modest for small models. |
    | Android GPU | Vulkan | Inconsistent across vendors; the practical default is CPU-only via NEON SIMD. |
    | iOS GPU | Metal | Effectively required for usable performance on iPhone 14 and newer. |

    GPU offload is more battery-intensive than CPU on mobile devices. For background or scheduled inference (e.g. nightly summaries), CPU often wins on energy per token. For interactive chat, GPU is worth the battery cost.
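
    The table and the battery trade-off can be folded into one decision function. This is a sketch under assumptions: chooseGpuLayers is a hypothetical helper, the result is meant for a llama.cpp-style GPU layer count setting (n_gpu_layers in llama.cpp itself; your binding may name it differently), and 99 is used only as a conventional "more layers than the model has" value.

    ```dart
    /// Picks a GPU layer count. 0 means CPU-only; a value larger than the
    /// model's layer count offloads every layer.
    int chooseGpuLayers({
      required String os, // e.g. Platform.operatingSystem from dart:io
      bool gpuBuildAvailable = false, // CUDA/Vulkan/ROCm/SYCL build on Windows or Linux
      bool backgroundJob = false,
    }) {
      const cpuOnly = 0;
      const offloadAll = 99;
      final mobile = os == 'ios' || os == 'android';

      // Background work on battery: CPU often wins on energy per token.
      if (mobile && backgroundJob) return cpuOnly;

      switch (os) {
        case 'ios':
        case 'macos':
          return offloadAll; // Metal
        case 'android':
          return cpuOnly; // Vulkan support is inconsistent across vendors
        case 'windows':
        case 'linux':
          return gpuBuildAvailable ? offloadAll : cpuOnly;
        default:
          return cpuOnly;
      }
    }
    ```

    Taking the OS as a string keeps the function testable off-device; in the app you would pass Platform.operatingSystem.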

    Memory hygiene

    • The loaded model is held in native memory that the platform's garbage collector cannot see. Skipping dispose() leaks the full model size. Wire it into your app's lifecycle (e.g. Flutter State.dispose(), iOS applicationWillTerminate, Electron app.on('window-all-closed')).
    • Never keep two models loaded simultaneously unless you have explicitly budgeted for the doubled memory. Most consumer devices cannot spare 1.4 GB for two 1B models at once. Call dispose() on model A before loading model B.
    • Memory-mapped loading is the llama.cpp default. The file is mapped into the process's address space; the OS pages weights in as needed. This makes load time fast but does not reduce the working-set RAM; at steady state, all weights are resident.
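
    The dispose-before-load rule can be enforced structurally by funnelling every load through a single slot. A minimal sketch: ModelSlot is a hypothetical class, M is whatever type the binding returns for a loaded model, and disposeModel is a placeholder for the binding's real dispose call.

    ```dart
    /// Holds at most one loaded model of type [M].
    class ModelSlot<M> {
      ModelSlot(this.disposeModel);

      /// Frees a model's native memory (placeholder for the binding's call).
      final Future<void> Function(M model) disposeModel;

      M? _current;

      M? get current => _current;

      /// Disposes the current model (if any) before loading the next one,
      /// so peak memory never includes two models.
      Future<M> swap(Future<M> Function() loadNext) async {
        await release();
        final next = await loadNext();
        _current = next;
        return next;
      }

      /// Frees the current model; safe to call more than once from lifecycle hooks.
      Future<void> release() async {
        final model = _current;
        _current = null;
        if (model != null) await disposeModel(model);
      }
    }
    ```

    Calling release() from the lifecycle hooks listed above covers the leak case, and because it clears the slot before disposing, a second call is a no-op rather than a double free.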

    Anti-patterns to avoid

    • Tuning sampling parameters to mask training-data problems. Heavy temperature/top-p tweaking to "fix" bad output is a smell; the dataset usually needs improvement instead.
    • Lazy-loading on first user request. Predictable load times beat invisible ones.
    • Bundling the model inside the app's install package. See Model delivery and UX for why post-install download wins.
    • Skipping dispose() in test code. Memory leaks in test suites compound and produce flaky CI.
    • Catching exceptions silently around inference. A failed loadModel() should not result in a generic "AI unavailable" toast; users deserve to know what went wrong so they can retry.

    What's next