
API Logs to Training Data: Using Your Cloud AI History to Fine-Tune
Your existing cloud AI API logs are a ready-made training dataset. How to extract, clean, and format API interaction logs into fine-tuning data for an on-device model.
If you are currently using a cloud AI API (OpenAI, Anthropic, Google Gemini), you are already generating training data. Every API call you make contains an input (the user's request) and an output (the model's response). That is a training example.
Your API logs are the fastest path from cloud AI to on-device AI. You do not need to create a dataset from scratch. You already have one.
What API Logs Contain
A typical API call log entry:
{
  "timestamp": "2026-03-15T14:22:03Z",
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "system",
      "content": "You are the shopping assistant for StyleApp..."
    },
    {
      "role": "user",
      "content": "Find me a blue dress for a summer wedding"
    },
    {
      "role": "assistant",
      "content": "Here are some suggestions for a summer wedding...\n\n1. Floral midi dress in navy blue...\n2. Light blue chiffon maxi dress..."
    }
  ],
  "tokens_used": {"input": 1842, "output": 387},
  "latency_ms": 1203
}
This log entry is already in the exact format needed for fine-tuning. The messages array is a training conversation.
Extraction Pipeline
Step 1: Export Your Logs
Where your logs live depends on your architecture:
If you log API calls yourself: Export from your database (PostgreSQL, MongoDB, etc.) or log aggregation service (Datadog, CloudWatch, etc.).
If you use OpenAI's API: The API does not store logs by default. You need your own logging middleware. If you do not have one, set it up now. Every future API call is a potential training example.
# Simple logging middleware example
import json
import datetime

def log_api_call(request_messages, response_content, model, tokens):
    log_entry = {
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "messages": request_messages + [
            {"role": "assistant", "content": response_content}
        ],
        "tokens_used": tokens,
    }
    with open("api_logs.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
Step 2: Filter for Quality
Not every API response is good training data. Filter out:
Failed responses: Timeout errors, malformed output, refusals.
Low-quality outputs: Responses where the user immediately retried (indicating dissatisfaction), or where the output was truncated.
Outliers: Unusually long or short responses that do not represent typical behavior.
Off-task interactions: If users occasionally ask off-topic questions, exclude those unless you want the model to handle them.
def is_quality_example(log_entry):
    messages = log_entry["messages"]
    assistant_msg = next(
        (m for m in reversed(messages) if m["role"] == "assistant"), None
    )
    if not assistant_msg:
        return False
    content = assistant_msg["content"]
    # Filter too-short responses
    if len(content) < 50:
        return False
    # Filter error responses
    if "I apologize" in content and "I cannot" in content:
        return False
    # Filter truncated responses
    if log_entry.get("finish_reason") == "length":
        return False
    return True
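The "user immediately retried" signal from the filter list above is not captured by the per-entry check. It can be approximated by comparing consecutive user prompts: a near-duplicate prompt arriving shortly after the previous one suggests the first answer missed. A minimal sketch, where the similarity threshold and time window are illustrative assumptions you should tune on your own logs:

```python
import difflib
from datetime import datetime, timedelta

def flag_retries(log_entries, similarity=0.9, window_minutes=5):
    """Return indices of entries whose response was likely retried.

    If the next call's user prompt is near-identical and arrives within
    the time window, treat the earlier entry as a failed attempt and
    exclude it from the training set.
    """
    flagged = set()
    for i in range(len(log_entries) - 1):
        cur, nxt = log_entries[i], log_entries[i + 1]
        cur_user = next((m["content"] for m in cur["messages"] if m["role"] == "user"), "")
        nxt_user = next((m["content"] for m in nxt["messages"] if m["role"] == "user"), "")
        t_cur = datetime.fromisoformat(cur["timestamp"])
        t_nxt = datetime.fromisoformat(nxt["timestamp"])
        close_in_time = t_nxt - t_cur <= timedelta(minutes=window_minutes)
        ratio = difflib.SequenceMatcher(None, cur_user, nxt_user).ratio()
        if close_in_time and ratio >= similarity:
            flagged.add(i)
    return flagged
```

This assumes logs are sorted by timestamp per user; in a multi-user system, group by user ID first.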
Step 3: Remove the System Prompt (Optional)
A fine-tuned model learns its behavior from the training examples themselves, so you may not need the system prompt in the training data: the model internalizes the instructions.
Two approaches:
Keep the system prompt: The model learns to follow these specific instructions. Good if your system prompt is short and stable.
Remove the system prompt: The model learns the behavior pattern without explicit instructions. Good if your system prompt is long (saves tokens in training) or if you want the behavior to be intrinsic rather than instruction-dependent.
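If you take the second approach, dropping the system message is a small transform applied to each example. A minimal sketch:

```python
def strip_system_prompt(example):
    """Return a copy of a training example without its system messages."""
    return {
        "messages": [m for m in example["messages"] if m["role"] != "system"]
    }
```

Run this once over the whole dataset before writing the training file; the user and assistant turns are left untouched.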
Step 4: Anonymize
Remove PII from the training data:
import re

def anonymize_message(content):
    # Emails
    content = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', content)
    # Phone numbers
    content = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', content)
    # Credit card numbers
    content = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', content)
    # Addresses (basic pattern)
    content = re.sub(r'\d+\s+[\w\s]+(?:Street|St|Avenue|Ave|Road|Rd|Drive|Dr)\b', '[ADDRESS]', content)
    return content

def anonymize_log(log_entry):
    for msg in log_entry["messages"]:
        msg["content"] = anonymize_message(msg["content"])
    return log_entry
Step 5: Format for Training
Convert your filtered, anonymized logs to the standard fine-tuning format:
def load_logs(path):
    # Stream log entries from the JSONL file written by the logging middleware
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def log_to_training_example(log_entry):
    messages = []
    for msg in log_entry["messages"]:
        if msg["role"] in ("system", "user", "assistant"):
            messages.append({
                "role": msg["role"],
                "content": msg["content"]
            })
    return {"messages": messages}

# Process all logs
training_data = []
for log_entry in load_logs("api_logs.jsonl"):
    if is_quality_example(log_entry):
        anonymized = anonymize_log(log_entry)
        example = log_to_training_example(anonymized)
        training_data.append(example)

# Write training file
with open("training_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")
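Before uploading, it is worth a quick sanity pass over the training file: one malformed line can fail an entire fine-tuning job. A minimal validation sketch; the checks are illustrative, not any platform's official schema:

```python
import json

def validate_training_file(path):
    """Yield (line_number, problem) for examples that look malformed."""
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                yield i, "invalid JSON"
                continue
            messages = example.get("messages", [])
            if not messages:
                yield i, "no messages"
            elif messages[-1]["role"] != "assistant":
                yield i, "does not end with an assistant message"
            elif not any(m["role"] == "user" for m in messages):
                yield i, "no user message"
```

An empty result means every line parsed and every conversation ends with the assistant turn the model is supposed to learn.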
How Many Logs Do You Need?
| API Calls Per Day | Time to Collect 1,000 Examples | Time to Collect 5,000 Examples |
|---|---|---|
| 100 | 2-3 weeks (after quality filtering) | 2-3 months |
| 500 | 3-5 days | 2-3 weeks |
| 1,000 | 2-3 days | 1-2 weeks |
| 5,000 | 1 day | 1-2 days |
Assume 50-70% of raw API calls survive quality filtering. At 500 calls per day, you accumulate 250-350 quality examples daily.
For most tasks, 1,000 quality examples are sufficient for a well-performing fine-tuned model. You can start training within days to weeks of setting up logging.
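The table above follows from simple arithmetic. A sketch of the calculation, assuming the 50-70% post-filtering survival rate:

```python
import math

def days_to_collect(target_examples, calls_per_day, survival_rate):
    """Days of logging needed to accumulate target_examples after filtering."""
    return math.ceil(target_examples / (calls_per_day * survival_rate))
```

For example, at 500 calls per day and a 60% survival rate, 1,000 examples take `days_to_collect(1000, 500, 0.6)` = 4 days of logging, consistent with the 3-5 day row.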
The Distillation Advantage
When you fine-tune a small model (1-3B) on outputs from a larger model (GPT-4o, Claude Sonnet), you are performing knowledge distillation. The small model learns to reproduce the behavior of the large model on your specific task.
The result: a 3B fine-tuned model that matches the large model's performance on your domain tasks while running on-device. This is not theoretical. Research and production deployments consistently show that fine-tuned small models match or exceed prompted large models on narrow, domain-specific tasks.
Your API logs are the distillation dataset. The large model has already done the work. You just need to teach a small model to replicate it.
From Logs to Deployment
The end-to-end pipeline:
- Set up API logging (if not already in place)
- Accumulate 1,000+ quality examples (days to weeks)
- Extract, filter, anonymize, and format the logs
- Upload to a fine-tuning platform like Ertas
- Select a base model (Llama 3.2 3B recommended)
- Fine-tune with LoRA (30 min - 3 hours)
- Export GGUF (Q4_K_M quantization)
- Integrate llama.cpp in your mobile app
- A/B test the on-device model against your cloud API
- Migrate when the on-device model meets your quality bar
Your API logs are not just a cost center. They are the bridge to on-device AI.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.