
Synthetic Data Generation Optimized for Small-Model Distillation
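When you build 0.5B-1B models for mobile NPU deployment, synthetic data quality matters far more than it does at larger scales. This article covers how to generate, filter, and validate synthetic training data designed specifically for small-model distillation.
Building a 0.5B-parameter model for mobile NPU deployment is fundamentally different from building a 70B model for cloud inference. The model is 140 times smaller, and its tolerance for noisy or misaligned training data is close to zero.
The standard approach, in which a large teacher model generates examples that are then used to train a small student model, is acceptable when the student has 7B-13B parameters. It breaks down when the student has 0.5B-1B parameters, because the text the teacher generates is more complex than the student can reproduce.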
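Key filtering strategies
Length distribution matching: match the input/output length distribution seen in production.
Complexity scoring: compute perplexity with the student model and discard examples above a threshold.
Domain relevance scoring: filter by embedding similarity.
Format consistency enforcement: zero tolerance for format variation.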
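The four filters above can be sketched as a single pass over the generated examples. This is a minimal illustration under stated assumptions, not the article's actual pipeline: each example is assumed to already carry per-token log-probabilities scored by the student model (`student_logprobs`), an embedding vector (`embedding`), and a raw output string (`output`). The field names, the thresholds, the JSON format rule, and the use of a simple length range as a stand-in for full distribution matching are all hypothetical choices made for the example.

```python
import json
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities (assumed to come from the student)."""
    if not token_logprobs:
        return float("inf")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_valid_format(output, required_keys):
    """Hypothetical format rule: output must be a JSON object with exactly these keys."""
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and set(parsed) == set(required_keys)


def filter_examples(examples, *, min_len, max_len, max_ppl,
                    domain_centroid, min_sim, required_keys):
    """Keep only examples that pass all four filters."""
    kept = []
    for ex in examples:
        n_tokens = len(ex["student_logprobs"])
        # Length matching, simplified to a range taken from production traffic.
        if not (min_len <= n_tokens <= max_len):
            continue
        # Complexity scoring: drop examples the student finds too surprising.
        if perplexity(ex["student_logprobs"]) > max_ppl:
            continue
        # Domain relevance: similarity to a centroid of in-domain embeddings.
        if cosine(ex["embedding"], domain_centroid) < min_sim:
            continue
        # Format consistency: no tolerance for deviations.
        if not is_valid_format(ex["output"], required_keys):
            continue
        kept.append(ex)
    return kept
```

In practice the thresholds would be tuned against a held-out set evaluated on the target device, and the length filter would more likely resample to match a histogram than apply hard bounds.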
Data
| Metric | Naive approach | Optimized approach |
|---|---|---|
| Synthetic examples generated | 100,000 | 100,000 |
| Final training set | 80,000 | 20,000 |
| On-device accuracy | 61-68% | 84-91% |
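The optimized approach uses 75% fewer training examples and reaches accuracy that is 20-30 percentage points higher. Quality over quantity is not an empty slogan: below 1B parameters, it is the entire strategy.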
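As a quick check on the dataset-size figures, the retention rates and the 75% reduction follow directly from the table. The variable names below are illustrative only.

```python
generated = 100_000       # synthetic examples generated in both approaches
naive_kept = 80_000       # final training set, naive approach
optimized_kept = 20_000   # final training set, optimized approach

naive_retention = naive_kept / generated          # share of examples kept
optimized_retention = optimized_kept / generated  # share of examples kept
reduction = 1 - optimized_kept / naive_kept       # fewer examples vs. naive

print(f"naive keeps {naive_retention:.0%}, optimized keeps {optimized_retention:.0%}")
print(f"optimized set is {reduction:.0%} smaller")
```

Put differently, the optimized filters discard four out of every five generated examples.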