
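Model Routing in Production: When to Use Fine-Tuning vs API vs RAG
Fine-tuning, RAG, and cloud APIs each solve a different problem. This is a practical framework for routing each request to the right approach, and for combining all three in one system.
Most production AI systems should not handle every request with a single approach. Teams that build profitable AI features use all three, routing each request to the approach best suited to that specific job.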
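The three approaches
- Fine-tuned model: fixed infrastructure cost, near-zero cost per request. Suited to high-volume, well-defined, repetitive tasks.
- RAG: moderate cost. Suited to tasks that need access to a large, changing knowledge base.
- Cloud API: highest cost per request. Suited to complex reasoning and creative tasks.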
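One way to combine the three approaches in a single system is a small dispatch table. The sketch below is illustrative only: the handler functions are stand-ins for real backends, and the task-type names (`classification`, `document_qa`, and so on) are hypothetical labels, not categories from this article. Falling back to the cloud API for unknown task types is an assumed conservative default.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


# Hypothetical backends. In a real system these would call a fine-tuned
# model, a retrieval pipeline, and a hosted LLM API respectively.
def fine_tuned_handler(prompt: str) -> str:
    return f"[fine-tuned] {prompt}"


def rag_handler(prompt: str) -> str:
    return f"[rag] {prompt}"


def cloud_api_handler(prompt: str) -> str:
    return f"[cloud-api] {prompt}"


# Assumed mapping from task type to backend, following the three bullets above:
# repetitive tasks -> fine-tuned, knowledge lookups -> RAG, reasoning -> cloud API.
ROUTES: Dict[str, Handler] = {
    "classification": fine_tuned_handler,
    "extraction": fine_tuned_handler,
    "document_qa": rag_handler,
    "open_ended_reasoning": cloud_api_handler,
}


def route(task_type: str, prompt: str) -> str:
    """Dispatch to the mapped backend; unknown task types fall back to the cloud API."""
    handler = ROUTES.get(task_type, cloud_api_handler)
    return handler(prompt)
```

Keeping the mapping in data rather than in branching code means a task type can be moved from the cloud API to a fine-tuned model by changing one entry.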
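Cost analysis
At 300,000 requests/month:
- Everything on the cloud API: AU$9,000/month
- Routed (60% fine-tuned / 25% RAG / 15% API): AU$4,250/month (53% saved)
At 600,000 requests/month the gap widens: AU$18,000 versus AU$5,800.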
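The figures above can be checked with a few lines of arithmetic. The sketch below recomputes the savings percentages and then, as a back-of-the-envelope inference that the article does not state, fits a straight line (fixed cost plus a blended per-request cost) through the two routed data points. The linear model, the unchanged 60/25/15 mix at both volumes, and the derived break-even volume are all assumptions.

```python
from typing import Tuple


def savings_pct(all_api: float, routed: float) -> float:
    """Percentage saved by routing, relative to sending everything to the cloud API."""
    return (all_api - routed) / all_api * 100


def fit_linear(v1: float, c1: float, v2: float, c2: float) -> Tuple[float, float]:
    """Fit cost = fixed + per_request * volume through two (volume, cost) points."""
    per_request = (c2 - c1) / (v2 - v1)
    fixed = c1 - per_request * v1
    return fixed, per_request


# AU$ per request implied by the all-API figure (9,000 / 300,000 = 0.03).
api_price = 9_000 / 300_000

print(round(savings_pct(9_000, 4_250), 1))   # 52.8
print(round(savings_pct(18_000, 5_800), 1))  # 67.8

# ASSUMPTION: routed cost grows linearly with volume between the two quoted points.
fixed, per_request = fit_linear(300_000, 4_250, 600_000, 5_800)

# Volume at which the routed setup would cost the same as all-API under that model.
break_even = fixed / (api_price - per_request)
print(round(fixed), round(per_request, 4), round(break_even))
```

Under these assumptions the routed setup behaves like roughly AU$2,700 of fixed monthly cost plus about half a cent per request, which is why the savings grow with volume.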
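When RAG beats fine-tuning
- The knowledge base changes faster than you can retrain
- The corpus is too large to train into a model
- Users ask about specific documents
- You need citations and traceability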
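When fine-tuning beats RAG
- The task is about how to respond, not what to respond with
- Latency matters (50-200 ms versus 600 ms+)
- Volume is high and the task is repetitive
- You need zero per-request cost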
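The two checklists above can be turned into a per-request decision rule. This is a minimal sketch, assuming each request has already been tagged with the traits below; the `RequestTraits` fields are hypothetical names, and the precedence (RAG criteria first, then complex reasoning, then fine-tuning) is one possible ordering rather than a rule from the article. The 600 ms threshold reuses the latency figure quoted in the list.

```python
from dataclasses import dataclass


@dataclass
class RequestTraits:
    """Hypothetical per-request flags; how they get set is out of scope here."""
    needs_citations: bool = False
    asks_about_specific_document: bool = False
    knowledge_changes_faster_than_retraining: bool = False
    corpus_too_large_to_train: bool = False
    high_volume_repetitive: bool = False
    complex_reasoning: bool = False
    latency_budget_ms: int = 1000


def choose_method(t: RequestTraits) -> str:
    """Return 'rag', 'fine_tuned', or 'cloud_api' for one request."""
    # Any RAG criterion means the answer depends on retrieved content.
    if (
        t.needs_citations
        or t.asks_about_specific_document
        or t.knowledge_changes_faster_than_retraining
        or t.corpus_too_large_to_train
    ):
        return "rag"
    # Complex reasoning goes to the cloud API regardless of volume (assumed precedence).
    if t.complex_reasoning:
        return "cloud_api"
    # Repetitive work or a latency budget under 600 ms favors the fine-tuned model.
    if t.high_volume_repetitive or t.latency_budget_ms < 600:
        return "fine_tuned"
    # Nothing matched: stay conservative and use the cloud API.
    return "cloud_api"
```

For example, `choose_method(RequestTraits(needs_citations=True))` returns `"rag"`, while `choose_method(RequestTraits(high_volume_repetitive=True))` returns `"fine_tuned"`.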
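The iteration loop
Month 1: route conservatively → Month 2: analyze traffic and move suitable requests to local models → Month 3: expand fine-tuning → Month 6: most stable traffic runs locally. The end state for a mature SaaS product: 60-80% fine-tuned, 15-25% RAG, 5-15% cloud API.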
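The analysis step in this loop can start as a simple comparison of the current traffic mix against the end-state ranges quoted above. The sketch below assumes the router logs one method label per request; the label names and the sample log are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# End-state ranges quoted above for a mature SaaS product (share of requests).
TARGET_BANDS: Dict[str, Tuple[float, float]] = {
    "fine_tuned": (0.60, 0.80),
    "rag": (0.15, 0.25),
    "cloud_api": (0.05, 0.15),
}


def traffic_mix(methods: Iterable[str]) -> Dict[str, float]:
    """Share of logged requests handled by each method."""
    counts = Counter(methods)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in TARGET_BANDS}
    return {name: counts.get(name, 0) / total for name in TARGET_BANDS}


def outside_band(mix: Dict[str, float]) -> Dict[str, float]:
    """Methods whose current share falls outside the target range."""
    return {
        name: share
        for name, share in mix.items()
        if not (TARGET_BANDS[name][0] <= share <= TARGET_BANDS[name][1])
    }


# Hypothetical early log: routing is still conservative, so the cloud API dominates.
log = ["cloud_api"] * 50 + ["rag"] * 20 + ["fine_tuned"] * 30
mix = traffic_mix(log)
print(outside_band(mix))  # {'fine_tuned': 0.3, 'cloud_api': 0.5}
```

A report like this only shows how far the mix is from the target; deciding which request types to move to a local model still requires looking at what the cloud API traffic actually contains.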
Further reading

Fine-Tuning vs RAG: When to Use Each (and When to Combine Them)
Fine-tuning and retrieval-augmented generation solve different problems. This guide explains when to use each approach, the trade-offs involved, and how to combine them for the best results.

Fine-Tuning vs RAG for Mobile: Why RAG Still Needs a Server
RAG is the go-to solution for giving AI domain knowledge. But on mobile, RAG reintroduces the server dependency you are trying to eliminate. Fine-tuning bakes the knowledge into the model itself.

SLM-First Architecture: The 80/20 Routing Strategy That Cuts AI Costs 75%
Most AI features don't need GPT-4. An SLM-first architecture routes 80% of requests to fine-tuned local models and 20% to cloud APIs — cutting costs by 60-75% while maintaining quality.