Web

    Run an Ertas GGUF in the browser via WebAssembly or WebGPU, and the honest limits of each path today.

    The browser is the least mature on-device target for a fine-tuned GGUF. WebAssembly inference works but is meaningfully slower than native. WebGPU inference is fast, but the fastest WebGPU engine (WebLLM) does not load GGUF directly; you have to convert and re-quantise the model into a separate format. Storage quotas, multi-threading constraints, and memory ceilings all bite in ways they do not on mobile and desktop.

    This page covers the two practical paths today (wllama, which loads GGUF through WebAssembly with an optional WebGPU backend, and WebLLM, which runs on WebGPU), the browser-side delivery and storage story, and when to fall back to server-side inference. For other Ship targets, see iOS, Android, and Desktop.

    Path 1: wllama (WebAssembly + WebGPU, GGUF-native)

    wllama is a WebAssembly port of llama.cpp that loads .gguf files directly. As of v3, wllama also has a WebGPU acceleration path alongside the WASM SIMD backend. It is the most direct browser path for an Ertas-exported GGUF: no conversion step, no separate quantisation format.

    npm install @wllama/wllama

    import { Wllama } from "@wllama/wllama";

    // Map each WASM build to the URL where your app serves it.
    const wllama = new Wllama({
      "single-thread/wllama.wasm": "/wllama/single-thread/wllama.wasm",
      "multi-thread/wllama.wasm": "/wllama/multi-thread/wllama.wasm",
    });

    // Download the GGUF and load it into the WASM runtime.
    await wllama.loadModelFromUrl("https://yourcdn.com/model.gguf");

    // Generate up to 64 tokens from the prompt.
    const output = await wllama.createCompletion("Hello.", {
      nPredict: 64,
    });
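
    A large model download can look like a hang, so it helps to surface progress while the file loads. The sketch below is a pure helper that turns byte counts into a status label. The name formatLoadProgress is hypothetical, and the wiring to a progressCallback load option in the trailing comment is an assumption to verify against the wllama version you install.

```javascript
// Hypothetical helper: turn byte counts into a short progress label.
function formatLoadProgress(loaded, total) {
  const mib = (bytes) => (bytes / (1024 * 1024)).toFixed(1);
  // Some servers omit Content-Length, so total may be 0 or undefined.
  if (!total || total <= 0) {
    return `Loading model: ${mib(loaded)} MiB`;
  }
  const percent = Math.min(100, Math.round((loaded / total) * 100));
  return `Loading model: ${percent}% (${mib(loaded)} of ${mib(total)} MiB)`;
}

// Assumed wiring (check the option name against your wllama version):
// await wllama.loadModelFromUrl(url, {
//   progressCallback: ({ loaded, total }) => {
//     statusEl.textContent = formatLoadProgress(loaded, total);
//   },
// });
```

    Keeping the formatting separate from the loader call means the label logic can be unit-tested without a browser or a real model file.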

    Practical wllama considerations:

    • Multi-threading requires SharedArrayBuffer, which in turn requires the page to be served with the Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers. Without them, wllama falls back to single-threaded mode (much slower). Configure your host to serve these headers on the routes that load the model.
    • SIMD support varies by browser. Modern Chromium, Firefox, and Safari all support the SIMD instructions wllama uses, but older browsers will refuse to load the WASM. Set a minimum browser version in your app's compatibility check.
    • Model load time scales with model size, not just WASM compilation. As a rough guide, a 1 GB GGUF takes 30 to 60 seconds to load on a typical laptop, more on lower-end hardware. Show a clear progress UI; users will close the tab if the page appears frozen.
    • Inference speed varies sharply by backend. Pure WASM SIMD inference is typically 5 to 10x slower than native (Phi-3 mini at around 3 to 8 tok/s on a typical laptop). The newer WebGPU path is closer to WebLLM's native-GPU performance on browsers where it works. Acceptable for short outputs and interactive UI; long-form generation is still a stretch even on WebGPU.
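
    The header requirement in the first bullet can be checked at runtime before loading the model. The following is a minimal sketch, assuming the standard crossOriginIsolated global and a hypothetical helper name detectThreadMode; it only reports which mode is possible and does not configure wllama itself.

```javascript
// The two response headers that enable cross-origin isolation.
const ISOLATION_HEADERS = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Hypothetical helper: report whether multi-threaded WASM can run.
// `env` defaults to globalThis so tests can pass a stand-in object.
function detectThreadMode(env = globalThis) {
  const hasSharedMemory = typeof env.SharedArrayBuffer === "function";
  // crossOriginIsolated is true only when both headers above were served.
  const isolated = env.crossOriginIsolated === true;
  return hasSharedMemory && isolated ? "multi-thread" : "single-thread";
}
```

    Logging a warning when this returns "single-thread" in development is a cheap way to catch a host that silently dropped one of the headers.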

    Path 2: WebLLM (WebGPU, format conversion required)

    WebLLM by MLC is the fastest browser inference engine, using WebGPU to run TVM-compiled kernels. It is significantly faster than wllama's WASM backend (often 2 to 4x), but it does not load GGUF directly; models must be converted and re-quantised to MLC's format.

    WebLLM works well when:

    • Your target audience uses modern browsers with WebGPU enabled (current Chrome and Edge on desktop and Android; Safari Technology Preview; not yet stable Safari).
    • You have an offline conversion step to produce the MLC artifacts from your Ertas-exported LoRA.
    • You can budget the conversion as part of your release pipeline.

    WebLLM is the path to pick if browser performance is the constraint and you are willing to maintain a second model artifact format alongside the GGUF. The MLC project publishes a Python conversion pipeline you would run against the merged-and-saved Hugging Face checkpoint (use the local-conversion workflow from Quantization) to produce the WebLLM bundle.
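
    Once the MLC artifacts exist, WebLLM needs to be told where to find them. The sketch below builds that configuration object. The helper name buildWebLlmAppConfig and the example URLs are hypothetical, and the field names (model, model_id, model_lib) are assumed from WebLLM's model_list records, so confirm them against the @mlc-ai/web-llm version you install.

```javascript
// Hypothetical helper: build a WebLLM appConfig entry for a custom model.
// Field names are assumed from WebLLM's model_list records; confirm them
// against the @mlc-ai/web-llm version you install.
function buildWebLlmAppConfig({ modelId, weightsUrl, modelLibUrl }) {
  for (const [name, value] of Object.entries({ modelId, weightsUrl, modelLibUrl })) {
    if (typeof value !== "string" || value.length === 0) {
      throw new TypeError(`${name} must be a non-empty string`);
    }
  }
  return {
    model_list: [
      {
        model: weightsUrl,      // URL of the converted MLC weights
        model_id: modelId,      // the ID you pass when creating the engine
        model_lib: modelLibUrl, // URL of the compiled WebGPU .wasm library
      },
    ],
  };
}

// Assumed usage with WebLLM (not executed here):
// import { CreateMLCEngine } from "@mlc-ai/web-llm";
// const engine = await CreateMLCEngine("my-model-MLC", { appConfig });
// const reply = await engine.chat.completions.create({
//   messages: [{ role: "user", content: "Hello." }],
// });
```

    Validating the three values up front turns a missing artifact URL into an immediate error instead of a failed download partway through engine start-up.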

    Browser storage

    The model file is too large to fit in localStorage. Three storage options:

    Storage | Where to use it
    Cache API | Fetch the GGUF from your CDN and cache.put() it. Survives across sessions, fast to read. Good default for most apps.
    IndexedDB | Stores the model as a blob. More flexible (allows partial reads) but more complex API. wllama supports loading from IndexedDB directly.
    Origin Private File System (OPFS) | Faster than IndexedDB on supported browsers, but support is incomplete (Safari is the lagging one). Use as a progressive enhancement, with IndexedDB or Cache API as the fallback.
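
    The Cache API option can be sketched as a fetch-once helper. This is a minimal illustration, not a complete download manager: the function name getCachedModel and the cache name "ertas-models-v1" are hypothetical, and the cache and fetch dependencies are injectable so the logic can be exercised outside a browser.

```javascript
// Hypothetical helper: return the model Response from the Cache API,
// downloading and storing it on first use. `cacheStorage` and `fetchFn`
// default to the browser globals and can be replaced in tests.
async function getCachedModel(url, cacheStorage = globalThis.caches, fetchFn = globalThis.fetch) {
  const cache = await cacheStorage.open("ertas-models-v1"); // cache name is an assumption
  const hit = await cache.match(url);
  if (hit) return hit;

  const response = await fetchFn(url);
  if (!response.ok) {
    throw new Error(`Model download failed: HTTP ${response.status}`);
  }
  // A Response body can be read once, so store a clone and return the original.
  await cache.put(url, response.clone());
  return response;
}
```

    The returned Response can be converted with response.blob() or response.arrayBuffer() before it is handed to the inference engine. Versioning the cache name gives you a simple way to invalidate old model files on release.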

    Browser storage quotas are enforced per origin and vary by engine:

    • Chromium-based browsers (Chrome, Edge): up to 60% of total disk for an origin, in both best-effort and persistent modes.
    • Firefox: best-effort mode is the smaller of 10% of disk or 10 GiB; persistent storage extends up to 50% of disk on user gesture.
    • Safari (macOS 14+ / iOS 17+): approximately 60% of total disk for browser apps, similar to Chromium. Older Safari versions gave an initial 1 GiB quota and prompted for more interactively.
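
    To make the percentages above concrete, the arithmetic can be written as a small estimator. This is a sketch of the rules of thumb listed in the bullets only; approxOriginQuota is a hypothetical name, and the real quota should always be read from the browser at runtime.

```javascript
const GIB = 1024 ** 3;

// Approximate per-origin quota from the figures listed above.
// These are rules of thumb, not values read from the browser.
function approxOriginQuota(engine, diskBytes, persistent = false) {
  switch (engine) {
    case "chromium":
      return 0.6 * diskBytes;
    case "firefox":
      return persistent ? 0.5 * diskBytes : Math.min(0.1 * diskBytes, 10 * GIB);
    case "safari": // macOS 14+ / iOS 17+, browser apps
      return 0.6 * diskBytes;
    default:
      throw new RangeError(`Unknown engine: ${engine}`);
  }
}
```

    On a 256 GiB disk this works out to roughly 153.6 GiB for Chromium and 10 GiB for Firefox in best-effort mode, so a single 1 GB model fits comfortably on current browsers while several large models on Firefox may not.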

    The browser-storage landscape has converged on roughly 60% of disk for the major engines on current OS versions, but older OS and Safari combinations, and non-browser apps that embed WebKit (~15% quota), are still in the field. Before downloading, always check the quota with navigator.storage.estimate(), and request persistent storage with navigator.storage.persist() so the browser is less likely to evict your model under storage pressure.
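
    The quota check and persistence request can be combined into one pre-download step. The sketch below assumes the standard navigator.storage interface; the helper name prepareStorage and the shape of its return value are hypothetical.

```javascript
// Hypothetical helper: check free quota, then request persistence.
// `storage` defaults to navigator.storage; pass a stand-in in tests.
async function prepareStorage(modelBytes, storage = globalThis.navigator?.storage) {
  if (!storage || typeof storage.estimate !== "function") {
    return { fits: false, persisted: false, reason: "Storage API unavailable" };
  }
  const { quota = 0, usage = 0 } = await storage.estimate();
  const free = quota - usage;
  if (free < modelBytes) {
    return { fits: false, persisted: false, reason: "Not enough quota" };
  }
  // persist() resolves to a boolean; the browser may decline the request.
  const persisted = typeof storage.persist === "function" ? await storage.persist() : false;
  return { fits: true, persisted, reason: null };
}
```

    The values from estimate() are approximate, so leaving some headroom above the model size is a sensible precaution. A result with fits set to false is a natural trigger for the server-side fallback.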

    Coming soon: ONNX export and transformers.js support. transformers.js is the most popular client-side ML library and ships excellent browser ergonomics, but it consumes ONNX rather than GGUF. Ertas's exports are GGUF-only today; an ONNX export option is on the roadmap and would open the transformers.js path. Until it ships, browser deployments stay on wllama or WebLLM.

    When to fall back to server-side

    Browser-side on-device inference is the right choice when:

    • The model is small (1B class). 3B+ models are usable on high-end machines but a poor experience on mid-range laptops.
    • The use case is interactive and tolerates modest throughput: roughly 3 to 8 tok/s on wllama's WASM SIMD backend, or 10 to 25 tok/s on the WebGPU backends (wllama v3 and WebLLM both reach this range where WebGPU is available). Actual speed varies by browser, GPU, and model size.
    • Privacy or offline capability is a core product feature you want to advertise.

    For everything else, server-side inference behind your own API is more reliable. The same Ertas-exported GGUF runs on a llama.cpp HTTP server, Ollama in server mode, or vLLM (with conversion). See the FAQ deployment section for the conversion paths.
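
    On the server path, the client only needs to send a chat request to your API. The sketch below builds such a request for an OpenAI-compatible /v1/chat/completions route, which llama.cpp's HTTP server exposes. The helper name buildChatRequest and the example base URL are hypothetical, and the code assumes your own API forwards that route unchanged.

```javascript
// Hypothetical helper: build a request for an OpenAI-compatible
// /v1/chat/completions endpoint, which llama.cpp's HTTP server exposes.
function buildChatRequest(baseUrl, prompt, maxTokens = 64) {
  const url = new URL("/v1/chat/completions", baseUrl).toString();
  return {
    url,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        messages: [{ role: "user", content: prompt }],
        max_tokens: maxTokens,
      }),
    },
  };
}

// Usage: const { url, init } = buildChatRequest("https://api.example.com", "Hello.");
//        const reply = await fetch(url, init).then((r) => r.json());
```

    Some servers also expect a model field in the body, so add one if your backend requires it.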

    A reasonable production pattern: server-side as the default, browser as a progressive enhancement on capable devices. Detect WebGPU support and fall through to the server when it is unavailable; users with the right hardware get the privacy and latency wins, users without get a working baseline.
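
    The detect-and-fall-through pattern can be sketched as a single async check. This assumes the standard navigator.gpu WebGPU entry point; the helper name chooseInferenceBackend and its return labels are hypothetical, and a production check would likely also consider model size and available storage.

```javascript
// Hypothetical helper: pick where inference should run.
// `nav` defaults to the browser's navigator; pass a stand-in in tests.
async function chooseInferenceBackend(nav = globalThis.navigator) {
  // navigator.gpu is only defined where the browser exposes WebGPU.
  if (!nav || !nav.gpu || typeof nav.gpu.requestAdapter !== "function") {
    return "server";
  }
  try {
    // requestAdapter() resolves to null when no suitable GPU is available.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "browser-webgpu" : "server";
  } catch {
    return "server";
  }
}
```

    Treating every failure as "server" keeps the baseline working: a browser without WebGPU, a blocked GPU, or a thrown error all end up on the same reliable path.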

    What's next