Model delivery and UX

How to get a 0.5 to 9 GB GGUF onto a user's device without breaking install conversion, plus the first-run UX patterns that actually work.

Every Ertas-trained GGUF is between roughly 0.5 GB and 9 GB. That is too big to bundle inside an app for most stores, and large enough that the download UX is a substantial part of the product, not an afterthought. This page is about everything that happens between "the GGUF exists in Hub" and "the app can run inference against it."

For per-platform integration code (how the app actually loads and queries the model after it is on disk), see iOS, Android, Desktop, and Web.

Bundle vs post-install download

For a fine-tuned model of roughly 100 MB or more, post-install download is almost always the right answer. Bundling the model inside the app's install package has three costs that compound:

Install conversion drops as install size grows. Adding several gigabytes to the install download is the worst conversion cost you can pay; it happens before the user has seen any value.
Store-imposed size limits. Google Play caps base APKs at 200 MB. Apple's App Store accepts larger binaries but trims them aggressively via App Thinning, and very large apps trip Apple's "consider Wi-Fi" download warning.
Updates become painful. Every model revision forces a full app update, and the user pays the full app download again unless you have configured incremental updates carefully.

Bundling does make sense when:

The model is under about 30 MB (small classifier or embedding model).
The model is mission-critical for first inference and cannot be deferred.
You have a clear reason to keep the model artifact opaque.

Most Ertas-trained fine-tunes fail all three. Default to post-install download. A small "bootstrap" model bundled with the app is a useful pattern when first-inference latency really matters; see "Demo-mode bootstrap" below.

Per-platform delivery mechanisms

Each major platform has at least one platform-native delivery path and a public-artifact-host path. Pick based on whether your model is private to the app or a public asset.

Android

Mechanism	When to use
Play Asset Delivery (PAD)	Default for app-internal models distributed only through Google Play. Three modes: install-time, fast-follow, on-demand.
Hugging Face (or other public host)	Default when the GGUF is a public artifact you want other builders to be able to grab. Free, simple, model has its own URL.
Custom CDN (R2, B2, S3 + CloudFront, etc.)	When you need signed URLs, regional routing, or A/B by user segment. Defer until you have a specific reason.
APK Expansion Files (.obb)	Deprecated. Do not use for new apps.
AICore / Gemini Nano	Side-steps the delivery problem entirely, but limits you to Google-tuned models. Not a fit for custom fine-tunes today.

A common hybrid: PAD as the default Play install path, with Hugging Face as a fallback for sideloaded installs and independent fetches. This lets you ship the seamless Play UX without locking the model behind the store. See Android for code.

iOS

Mechanism	When to use
Background `URLSession` direct download	The default. Works with Hugging Face URLs, your CDN, or any HTTPS endpoint. Survives app suspend; the system completes the transfer and wakes the app when done. This is the path the Ertas reference Flutter POC uses.
On-Demand Resources (ODR)	Apple's equivalent of PAD. Apple servers handle delivery, retry, and incremental updates. Useful if you ship exclusively through the App Store and want Apple-managed caching, but the configuration overhead is heavy for a single large GGUF. Most production apps stay on direct download.
App Thinning / asset slicing	Useful for shipping multiple per-device variants of a small model. Not a primary path for a single 2 GB+ GGUF.
Custom CDN	Same trade-offs as Android. Defer until you have a specific reason.

A wrinkle worth knowing: the App Store presents a "consider Wi-Fi" warning to users when an in-app download exceeds Apple's threshold (around 200 MB or higher; Apple updates the exact number periodically). It is not a hard block, but it is friction. Surface a Wi-Fi-only consent screen in the app before triggering the download to set expectations. See iOS for code.

A few additional iOS specifics worth knowing:

URLSessionConfiguration.background(withIdentifier:).isDiscretionary = true lets iOS defer the download until the device is on Wi-Fi, plugged in, and has low CPU load. Use it for non-time-sensitive prefetch where you can wait hours for the model to arrive.
App Groups (group.com.yourapp.shared) give you a shared container accessible from both your main app and any extensions or widgets, so you only need to store one copy of the model even if multiple targets use it.
App Clips are capped at 15 MB total, which is far too small for any Ertas-trained GGUF. Plan to prompt the user to install the full app before invoking any model-dependent feature.

Desktop (macOS, Windows, Linux)

Desktop platforms have no equivalent of PAD or ODR. Large model files ship as direct downloads from your CDN, Hugging Face, or wherever the artifact lives. The user-facing UX is simpler because desktop users tolerate larger downloads (or even multi-gigabyte installers) better than mobile users do.

If your distribution is through the Mac App Store or Microsoft Store, those stores impose their own size considerations:

Mac App Store: 200 GB uncompressed app size limit per Apple's current published maximum. Effectively unlimited for any single Ertas-trained GGUF, so you could bundle the model in the app if you wanted. The Background Assets framework hosts larger separate assets outside the build if needed.
Microsoft Store (MSIX): current Microsoft documentation does not clearly publish a size cap, but very large packages take longer to publish, longer to download, and longer to validate. Plan to keep MSIX packages under a few GB and download the model post-install for predictable submission times.

Outside the stores (direct DMG, MSI, AppImage), there is no platform-imposed cap. The model can ship inside the installer or be downloaded on first launch.

The Ertas GGUF bundle is Ollama-ready out of the box, so the easiest desktop integration is to assume the user has Ollama installed (or prompt them to) and run the bundled install.sh or install.bat to register the model. That sidesteps the model-delivery problem entirely; Ollama manages model storage in ~/.ollama/models.

Web

The browser is the most constrained platform for on-device AI. Two practical paths today:

WebLLM (MLC): native WebGPU inference for a small set of supported models. Does not load GGUF directly today; models are pre-quantised to WebLLM's MLC format. The conversion is not part of Ertas's export pipeline.
WASM ports of llama.cpp (wllama and friends): can load GGUF directly via Cache API or IndexedDB. Slower than native, but works on browsers without WebGPU.

Web is the least mature on-device path. If your product is browser-first and AI is core, plan to keep the model size small (1B-class at most) and to fall back to a server-side model on browsers that cannot run inference. See Web.

First-run UX

The single biggest determinant of "did the model successfully reach the user" is the first-run UX, not the underlying delivery mechanism. Two patterns to know:

Full-app gate (do not default to this). Block the entire app behind a "Downloading model..." screen. Simple to build, bad for retention. Users have not seen any value yet, so any download wait competes directly with "close it and forget."

Feature-level placeholders (recommended). Let the user explore non-AI surfaces of the app while the model downloads in the background. Each AI-dependent feature shows a clear placeholder explaining what is coming when the download finishes. Production apps shipping large native assets converge on this pattern.

The exception is single-purpose AI apps where the AI feature is the only feature. Even then, a tiny demo model bundled in the app lets the user experience the feature once before being asked to commit to the full download.

Required UX elements

Before the download starts:

Model size in megabytes. Render the actual byte count as MB to one decimal place (e.g. "770.4 MB") rather than rounding aggressively. Storage-constrained users need to make an informed choice.
Free disk space available. Read the device's free space. If the model will not fit, block with a clear message rather than starting a download that will fail mid-way.
Time estimate based on connection type. Best-effort using a connectivity probe or the connection type reported by the platform.
Why this download is needed. One sentence explaining what feature it enables. Reduces the "wait, what am I downloading" anxiety.
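
The size and free-space items above can be sketched in a few lines. This is a minimal, platform-neutral Python illustration, not Ertas code: `format_mb`, `fits_on_disk`, `free_bytes_at`, and the 10% headroom are hypothetical names and choices, and it assumes decimal megabytes (1 MB = 1,000,000 bytes). On a device you would call the platform's own free-space API rather than `shutil.disk_usage`.

```python
import shutil

BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def format_mb(size_bytes: int) -> str:
    """Render a byte count as MB to one decimal place, e.g. '770.4 MB'."""
    return f"{size_bytes / BYTES_PER_MB:.1f} MB"


def fits_on_disk(size_bytes: int, free_bytes: int, headroom: float = 0.10) -> bool:
    """True if the model plus a safety margin fits in the free space.

    The 10% headroom is an illustrative default, not a platform requirement.
    """
    return free_bytes >= size_bytes * (1 + headroom)


def free_bytes_at(path: str) -> int:
    """Free space on the volume holding `path` (desktop stand-in for the device API)."""
    return shutil.disk_usage(path).free
```

A consent screen could show `format_mb(model_size)` next to `format_mb(free_bytes_at(data_dir))` and disable the download button when `fits_on_disk` returns `False`.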

During the download:

Progress percentage and byte count. "342 MB of 770 MB" plus a percentage bar.
Estimated time remaining. Compute from rolling-average throughput over the last 10 seconds. Avoid showing "calculating..." for more than 5 seconds.
Pause / Cancel controls. Allow pause without losing progress. Cancel should require explicit confirmation since it discards the work.
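
The rolling-average estimate above might look like the following. This is a sketch under stated assumptions: `EtaEstimator` and its method names are hypothetical, the caller supplies timestamps (for example from a monotonic clock), and throughput is taken between the oldest and newest samples inside the window.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class EtaEstimator:
    """Estimate time remaining from throughput over a sliding time window."""

    def __init__(self, total_bytes: int, window_seconds: float = 10.0) -> None:
        self.total_bytes = total_bytes
        self.window_seconds = window_seconds
        # (timestamp_seconds, bytes_downloaded_so_far) samples, oldest first.
        self._samples: Deque[Tuple[float, int]] = deque()

    def record(self, now: float, bytes_done: int) -> None:
        """Add a progress sample and drop samples older than the window."""
        self._samples.append((now, bytes_done))
        while self._samples and now - self._samples[0][0] > self.window_seconds:
            self._samples.popleft()

    def seconds_remaining(self) -> Optional[float]:
        """Return the ETA in seconds, or None if there is not enough data yet."""
        if len(self._samples) < 2:
            return None
        (t0, b0), (t1, b1) = self._samples[0], self._samples[-1]
        if t1 <= t0 or b1 <= b0:
            return None  # stalled or clock did not advance; nothing to extrapolate
        throughput = (b1 - b0) / (t1 - t0)  # bytes per second over the window
        return (self.total_bytes - b1) / throughput
```

Returning `None` rather than a guess lets the UI decide what to show while the window fills or the transfer stalls.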

Wi-Fi-only default

Default the download to Wi-Fi-only with a visible toggle. Cellular data is expensive in most markets, and a 1 GB+ download over cellular can easily exceed a monthly cap. The toggle should be a single tap, not buried in settings. If the user enables cellular, surface a one-time confirmation showing the size again; after that, respect the toggle silently.
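
The toggle-plus-one-time-confirmation rule above reduces to a small decision function. The sketch below is illustrative only: `DownloadPrefs`, `next_action`, and the connection labels `"wifi"`, `"cellular"`, and `"none"` are hypothetical, and a real app would map them from the platform's connectivity API.

```python
from dataclasses import dataclass


@dataclass
class DownloadPrefs:
    allow_cellular: bool = False      # Wi-Fi-only by default
    cellular_confirmed: bool = False  # set once the user accepts the size prompt


def next_action(connection: str, prefs: DownloadPrefs) -> str:
    """Decide what to do for the current connection type.

    Returns "download", "confirm_cellular", or "wait".
    """
    if connection == "wifi":
        return "download"
    if connection == "cellular" and prefs.allow_cellular:
        # Ask once, showing the size again; afterwards respect the toggle silently.
        return "download" if prefs.cellular_confirmed else "confirm_cellular"
    return "wait"
```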

Demo-mode bootstrap

A pattern worth considering: ship a tiny (~50 to 100 MB) bootstrap model inside the app so users get one inference instantly. Trigger the full download in the background. The user upgrades to the production-quality model when they keep using the app.

The trade-off: users who never upgrade keep using the lower-quality model. If the demo model's output is noticeably worse, you are systematically delivering a worse product to that segment. It is worth A/B testing whether the demo-model output is "good enough" for the average user or whether it visibly underperforms.

Background download mechanics

Use platform primitives rather than rolling your own HTTP client. The OS will kill your background process, doze modes will throttle network connections, and connectivity changes need careful handling. The platform handles all of this.

Platform	Recommended primitive
Android	`WorkManager` for orchestration + `DownloadManager` for the transfer. Survives process death, doze mode, network changes.
iOS	`URLSession` configured with `URLSessionConfiguration.background(withIdentifier:)`. Survives app suspend; the system completes the transfer and wakes the app when done.
Desktop (Electron, Tauri)	The framework's HTTP client + a `download` event listener. No OS-level orchestration needed; the app process is long-lived.
Web	`fetch()` with the Cache API or `IndexedDB` for storage. A service worker can retry on connectivity restoration.

HTTP Range requests give byte-accurate resume on every platform:

Track the byte offset of the last verified chunk written to disk.
On resume, send Range: bytes=<offset>- and append to the file.
Update the offset only after the write to disk is confirmed.

If the server does not support Range requests, your resume strategy is "start over." Verify support during the initial probe.
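
The probe and the three resume steps above can be sketched with Python's standard `urllib`. This illustrates the protocol only and is not a replacement for the platform primitives. The function names are hypothetical, the size of the staging file stands in for the tracked offset, and some servers honour Range without advertising `Accept-Ranges` or reject `HEAD` requests. A request whose offset already equals the full file size typically fails with HTTP 416, which this sketch does not handle.

```python
import os
import urllib.request


def supports_range(url: str) -> bool:
    """Probe with a HEAD request; True if the server advertises byte ranges."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Accept-Ranges", "").lower() == "bytes"


def range_header(offset: int) -> dict:
    """Header asking for everything from `offset` to the end of the file."""
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def resume_download(url: str, staging_path: str, chunk_size: int = 1 << 20) -> int:
    """Resume into `staging_path` and return the number of bytes now on disk."""
    offset = os.path.getsize(staging_path) if os.path.exists(staging_path) else 0
    request = urllib.request.Request(url, headers=range_header(offset))
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the range; 200 means it sent the whole
        # file, so the partial data is discarded and the write starts over.
        resumed = response.status == 206
        written = offset if resumed else 0
        with open(staging_path, "ab" if resumed else "wb") as out:
            while chunk := response.read(chunk_size):
                out.write(chunk)
                out.flush()
                os.fsync(out.fileno())  # confirm the write before counting it
                written += len(chunk)
    return written
```

Checking the status code on every resume, not only during the initial probe, guards against a CDN edge that silently ignores the Range header.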

Storage location

Platform	Where to store the model
Android	`Context.getFilesDir()` (internal app storage). Do not use `getCacheDir()`; the OS evicts it under storage pressure. Do not use external storage unless you have a specific reason.
iOS	The app's Documents directory (`FileManager.default.urls(for: .documentDirectory, ...)`). Set the `isExcludedFromBackup` resource value (`URLResourceKey.isExcludedFromBackupKey`) to `true` to keep the model out of iCloud backups.
Desktop (Electron, Tauri)	The platform's user-data directory (`app.getPath('userData')` in Electron, `data_dir()` from the Rust `dirs` crate in a Tauri app).
Web	Origin Private File System (OPFS) for browsers that support it, or IndexedDB / Cache API. Quota is per-origin and limited; check it before downloading.

A useful file layout that works on every platform:

<app-data-dir>/models/<model-name>/
    model.gguf           // active, verified model
    model.gguf.tmp       // download in progress
    manifest.json        // version, hash, source URL, size

The manifest is the source of truth for "do I need to download" decisions on subsequent launches.
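
A launch-time check against that layout might look like this. It is a minimal sketch that assumes `manifest.json` carries a `sha256` field, written only after the model passed verification. The field name and the `needs_download` helper are hypothetical.

```python
import json
from pathlib import Path


def needs_download(model_dir: Path, expected_sha256: str) -> bool:
    """True if the verified model for `expected_sha256` is not already on disk."""
    manifest_path = model_dir / "manifest.json"
    model_path = model_dir / "model.gguf"
    if not manifest_path.is_file() or not model_path.is_file():
        return True
    try:
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return True  # unreadable manifest: treat as "no model installed"
    if not isinstance(manifest, dict):
        return True
    # Compare the recorded hash with the one the app expects for this release.
    return str(manifest.get("sha256", "")).lower() != expected_sha256.lower()
```

This reads only the manifest, so it stays cheap on every launch; re-hashing a multi-gigabyte file is left to the post-download verification step.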

Integrity: the atomic download pattern

Every platform should follow the same pattern to ensure the active model file is never a partial download:

Download to a `.tmp` path

Never write directly to the active model path. Use model.gguf.tmp (or equivalent) as the staging file.

Verify with SHA-256

After the download completes (full content-length received), compute SHA-256 over the staging file and compare against the expected hash from a manifest or hardcoded constant.

Atomic rename on success

If the hash matches, rename .tmp to model.gguf. On the same filesystem, this rename is atomic on every major OS: it either happened or it did not, never half. Update the manifest with the version and hash.

Delete on failure

If the hash mismatches, delete the staging file, surface a clear error, allow retry. After 3 consecutive checksum failures, surface a diagnostic mode (network type, ISP, etc.) so you can investigate whether a specific carrier is mutating content.
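
The four steps above can be sketched as follows, in platform-neutral Python. The helper names `sha256_of` and `promote_if_valid` are hypothetical, the expected hash is assumed to be a hex string from a source you control, and the manifest update and retry counting are left out for brevity.

```python
import hashlib
import os


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so a multi-GB model never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def promote_if_valid(staging_path: str, active_path: str, expected_sha256: str) -> bool:
    """Verify the staging file, then atomically promote it or delete it."""
    if sha256_of(staging_path) == expected_sha256.lower():
        # os.replace is an atomic rename when both paths are on one filesystem.
        os.replace(staging_path, active_path)
        return True
    os.remove(staging_path)  # hash mismatch: discard rather than resume a bad file
    return False
```

Because the staging file sits next to the active file in the same directory, the rename stays on one filesystem, which is what makes it atomic.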

This matters because:

A partial file that becomes the active model means silent corruption at inference time. A damaged GGUF may still load (the format is forgiving about trailing data) but generate garbage output. Users cannot distinguish a corrupted model from "the AI is just bad."
Bit-flips during transfer are rare but real, especially on flaky cellular or aggressive corporate proxies that mutate content.
Atomic move semantics give you crash safety. If the device powers off mid-rename, the move either happened or did not.

The expected hash should come from a source you control: a server-side manifest (JSON at a known URL alongside the artifact) for models you want to rotate, or a constant hardcoded in the app for models that never change. Do not derive the hash from anything in the artifact itself; that defeats the integrity check.

Coming soon: Ertas-published artifact manifests. Ertas does not publish a SHA-256 alongside the GGUF download today. Until it does, compute one yourself on the first download and pin it in your app's build pipeline. See Verifying exports.

Edge cases worth handling explicitly

Scenario	What to do
App killed mid-download	Resume from the last verified byte position on next launch. Use platform primitives (WorkManager, background `URLSession`) and this is mostly automatic.
Airplane mode or network loss mid-download	Pause cleanly, surface "Waiting for network," resume automatically when connectivity returns.
Disk full mid-download	Detect before starting (free-space check during consent screen) and during the download (catch IO errors on write). Surface "Free up X MB to install AI features" rather than a generic error.
Checksum mismatch	Delete the staging file, allow retry. After 3 consecutive failures, log diagnostics.
Wi-Fi turning into cellular mid-download	Respect the Wi-Fi-only setting; pause when Wi-Fi drops, resume when it returns.
Old model present from previous version	Check the manifest hash against the expected hash on every launch. If they do not match, re-download. Do not assume the file is the right version.
App reinstalled	Internal app storage is cleared on uninstall. Plan for a fresh download path; do not assume the model survives reinstall.

What's next

iOS

Integration code, ODR setup, App Store submission notes.

Android

Integration code, PAD setup, Play Console submission notes.

Desktop

Ollama, Electron, Tauri, native llama.cpp paths.

Performance tips

Context length, threading, KV cache.