
Why Your Fine-Tuning Dataset Won't Work for On-Device AI — And How to Fix It
Most fine-tuning datasets are built for large cloud models. When distilled to 0.5B–1B models for mobile NPUs, the data distribution breaks. Here's why, and how to build datasets that actually work for on-device deployment.
You fine-tuned a 70B model on your enterprise data. It performs well. Now you distill it into a 0.5B model for deployment on mobile NPUs. The accuracy drops from 92% to 61%.
This is not a distillation problem. It is a data problem.
Most fine-tuning datasets today are optimized for large models. The examples are long, complex, and assume the model has billions of parameters to encode nuanced patterns. When you compress that knowledge into a model with 140x fewer parameters, the dataset becomes a liability rather than an asset.
The fix is not better distillation techniques. It is building datasets that are designed for the target model from the start.
Why Large-Model Datasets Fail at Small Scale
A 70B model has roughly 70 billion parameters. A 0.5B model has 500 million. That is a 140:1 ratio. Consider what that means for learning capacity.
Attention head limits. A 70B model might have 64 attention heads across 80 layers. A 0.5B model might have 16 heads across 24 layers. Complex multi-step reasoning chains that a 70B model handles effortlessly exceed the attention capacity of a 0.5B model. Training examples that require 5-step reasoning waste capacity on a model that can reliably handle 2–3 steps.
Context window constraints. Large models in production often use 8K–32K token contexts. On-device models typically operate with 512–2048 token contexts due to memory constraints. If your training data averages 3,000 tokens per example, the model learns patterns that extend beyond its production context window. It is learning skills it can never use.
Vocabulary utilization. Small models have smaller effective vocabularies. Technical jargon, rare terminology, and domain-specific abbreviations that a 70B model handles through its massive embedding space become noise for a 0.5B model. Training data with 50,000 unique tokens is asking a small model to spread its limited capacity too thin.
Distribution sensitivity. A 70B model handles class imbalance gracefully. If 80% of training examples are category A and 20% are category B, the large model still learns category B adequately. A 0.5B model in the same scenario may effectively ignore the minority class. Distribution imbalance at small scale produces models that only work for the majority case.
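All four failure modes are measurable before you train anything. The minimal sketch below audits token lengths, vocabulary spread, and class balance in one pass; the JSONL schema (`text` and `label` fields), the file name, and the Qwen2-0.5B tokenizer are illustrative stand-ins for your own student model and data:

```python
# Minimal dataset audit for small-model distillation.
# Assumes a JSONL file with "text" and "label" fields (hypothetical schema)
# and a tokenizer matching the student model (Qwen2-0.5B as an example).
import json
from collections import Counter

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")

lengths, vocab, labels = [], set(), Counter()
with open("train.jsonl") as f:
    for line in f:
        example = json.loads(line)
        ids = tokenizer(example["text"])["input_ids"]
        lengths.append(len(ids))
        vocab.update(ids)
        labels[example["label"]] += 1

lengths.sort()
p10 = lengths[int(0.1 * len(lengths))]
p90 = lengths[int(0.9 * len(lengths))]
print(f"token length p10/p90: {p10}/{p90}")   # compare to production context window
print(f"unique tokens: {len(vocab)}")          # vocabulary spread
print(f"class balance: {labels.most_common()}")  # compare to production class mix
```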
The Pipeline Is Not Train → Deploy
The standard enterprise AI pipeline assumes: prepare data → train model → deploy. This works when training and deployment use the same architecture.
For on-device AI, the actual pipeline is:
1. Teacher model (70B+) defines the quality ceiling
2. Synthetic data generation using the teacher, calibrated for the student
3. Data filtering to remove examples beyond the student's capacity
4. Fine-tuning the student model (0.5B–8B)
5. Quantization for the target hardware (Q4/Q5/Q8)
6. Runtime export (ExecuTorch, LiteRT, ONNX, Qualcomm AI Hub)
7. On-device validation against production constraints
Data preparation spans steps 2 and 3 — and these steps determine the outcome more than the fine-tuning itself. A well-prepared dataset with mediocre fine-tuning hyperparameters outperforms a poorly prepared dataset with optimal hyperparameters at this scale.
What Distillation-Aware Data Prep Looks Like
Step 1: Define target constraints before touching the data.
Before you prepare a single training example, document:
- Target model size (0.5B, 1B, 3B, 8B)
- Target hardware (Snapdragon NPU, Apple Neural Engine, Intel NPU, etc.)
- Production context window (512, 1024, 2048 tokens)
- Latency budget (maximum acceptable inference time per request)
- Quantization level (Q4, Q5, Q8)
These constraints shape every data decision that follows.
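One way to keep these constraints from staying aspirational is to encode them as a single config object that every later step reads. A minimal sketch; the field names and default values are illustrative, not a fixed schema:

```python
# Target-model constraints, declared once and consumed by every later step.
# All field names and values below are illustrative placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class TargetConstraints:
    model_size: str = "0.5B"           # 0.5B, 1B, 3B, or 8B
    hardware: str = "snapdragon-npu"   # target NPU / accelerator
    context_window: int = 1024         # production context, in tokens
    latency_budget_ms: int = 200       # max inference time per request
    quantization: str = "Q4"           # Q4, Q5, or Q8
    max_reasoning_steps: int = 3       # cap for generated examples

CONSTRAINTS = TargetConstraints()
```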
Step 2: Generate synthetic data at the right complexity level.
Use your teacher model (70B+) to generate training examples, but constrain the generation:
- Maximum output length: match the student's production context window
- Reasoning depth: limit to 2–3 step chains for sub-1B models, 3–5 steps for 3B–8B
- Vocabulary: restrict to the terms the student model will encounter in production
- Format consistency: use identical output templates across all examples
A 70B teacher generating a 500-word contract analysis is useless for training a 0.5B model that will produce 50-word classifications in production.
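As a concrete sketch of constrained generation, the following assumes a teacher served locally behind an OpenAI-compatible endpoint (for example via vLLM). The endpoint URL, model name, and prompt wording are all assumptions to adapt, not a fixed recipe:

```python
# Sketch: constrained synthetic generation from a locally hosted teacher
# behind an OpenAI-compatible endpoint. Endpoint, model name, and prompt
# wording are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

SYSTEM = (
    "You generate training examples for a small on-device classifier. "
    "Hard limits: at most 3 reasoning steps, plain production vocabulary, "
    "and the exact output template: LABEL: <label>\nREASON: <one sentence>."
)

def generate_example(seed_text: str, max_tokens: int = 256) -> str:
    # max_tokens mirrors the student's production context budget.
    response = client.chat.completions.create(
        model="teacher-70b",  # whatever your local server exposes
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Input: {seed_text}"},
        ],
        max_tokens=max_tokens,
        temperature=0.7,
    )
    return response.choices[0].message.content
```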
Step 3: Filter aggressively.
For large models, more data is usually better. For sub-1B models, more data can be actively harmful if it dilutes the distribution.
Apply:
- Length filtering: remove examples outside the 10th–90th percentile of your production input distribution
- Complexity scoring: use perplexity from the student model itself — high-perplexity examples are beyond its capacity
- Deduplication: at small scale, near-duplicates consume disproportionate capacity
- Domain relevance scoring: score every example against the specific task the model will perform
- Balance enforcement: ensure class distributions match expected production distributions
Target 5,000–20,000 high-quality examples for a sub-1B model. This is counterintuitive — but 10,000 perfectly-calibrated examples consistently outperform 100,000 noisy ones at this scale.
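A minimal sketch of three of these filters, using the student model itself for the perplexity check. The schema matches the audit sketch above, and the length and perplexity thresholds are placeholders you would calibrate against your own production distribution:

```python
# Sketch of the filtering pass: length window, crude near-duplicate removal,
# and a student-perplexity cap. Thresholds are illustrative.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    loss = student(ids, labels=ids).loss  # mean cross-entropy over tokens
    return math.exp(loss.item())

def keep(example: dict, lo: int, hi: int, max_ppl: float, seen: set) -> bool:
    n = len(tokenizer(example["text"])["input_ids"])
    if not lo <= n <= hi:                    # length filter (10th–90th pct)
        return False
    key = " ".join(example["text"].lower().split())[:200]
    if key in seen:                          # crude near-duplicate check
        return False
    seen.add(key)
    return perplexity(example["text"]) <= max_ppl  # beyond-capacity filter
```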
Step 4: Validate on target hardware before scaling.
Take your filtered dataset, fine-tune a small sample (1,000 examples), deploy on the actual target device, and measure real-world performance. If accuracy is below threshold, the issue is almost always data distribution — not model architecture or training hyperparameters.
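The measurement half of this step can be a simple gate. In the sketch below, run_on_device is a hypothetical hook around your exported model's inference call (ExecuTorch, LiteRT, or ONNX Runtime), and the 0.85 threshold is a placeholder:

```python
# Sketch of a pre-scaling validation gate. run_on_device stands in for
# whatever inference call your runtime exposes on the target device.
import random

def validation_gate(dataset: list, run_on_device, threshold: float = 0.85) -> bool:
    # Hold out a small slice of the filtered data for on-device measurement.
    holdout = random.sample(dataset, k=min(200, len(dataset)))
    correct = sum(run_on_device(ex["text"]) == ex["label"] for ex in holdout)
    accuracy = correct / len(holdout)
    print(f"on-device accuracy: {accuracy:.2%} (threshold {threshold:.0%})")
    return accuracy >= threshold
```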
The On-Premise Requirement
There is an additional wrinkle for enterprise teams: the source data for these datasets is usually sensitive. Clinical notes, legal documents, financial records, proprietary business data.
You cannot send 700GB of construction BOQs (bills of quantities) to a cloud annotation tool just because your target deployment is on-device. The training data preparation must happen on-premise even when the final model runs on-device.
This creates a workflow where:
- Data preparation happens on-premise (no data egress)
- Fine-tuning happens on cloud GPUs (the prepared training set and model weights move, not the raw source data)
- Deployment happens on-device (inference data stays local)
Each step has a different infrastructure requirement, but the data preparation step — the one that determines whether the on-device model actually works — must be fully air-gapped.
What Ertas Data Suite Does Here
Ertas Data Suite runs as a native desktop application, entirely on-premise. For on-device AI data preparation specifically:
The Clean module provides quality scoring, length filtering, and deduplication calibrated to your target model size. Set your target parameters (0.5B, 1024 context window, Q4 quantization) and the quality scores adjust to flag examples that exceed the model's capacity.
The Augment module generates synthetic training data using local LLMs, with generation constraints matched to your student model's specifications. No data leaves the building. No cloud API calls. The synthetic data is designed for the model that will actually use it.
The Export module outputs JSONL formatted for your fine-tuning framework, with metadata tracking which examples passed which quality filters — so you can iterate on the dataset when on-device performance does not meet targets.
Book a Discovery Call to discuss your on-device AI data preparation requirements and see how Ertas Data Suite fits your pipeline.
Turn unstructured data into AI-ready datasets — without it leaving the building.
On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.