
API Logs to Training Data: Using Your Cloud AI History to Fine-Tune
Your existing cloud AI API logs are a ready-made training dataset. How to extract, clean, and format API interaction logs into fine-tuning data for an on-device model.
If you are currently using a cloud AI API (OpenAI, Anthropic, Google Gemini), you are already generating training data. Every API call you make contains an input (the user's request) and an output (the model's response). That is a training example.
Your API logs are the fastest path from cloud AI to on-device AI. You do not need to create a dataset from scratch. You already have one.
What API Logs Contain
A typical API call log entry:
{
  "timestamp": "2026-03-15T14:22:03Z",
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "system",
      "content": "You are the shopping assistant for StyleApp..."
    },
    {
      "role": "user",
      "content": "Find me a blue dress for a summer wedding"
    },
    {
      "role": "assistant",
      "content": "Here are some suggestions for a summer wedding...\n\n1. Floral midi dress in navy blue...\n2. Light blue chiffon maxi dress..."
    }
  ],
  "tokens_used": {"input": 1842, "output": 387},
  "latency_ms": 1203
}
This log entry is already in the exact format needed for fine-tuning. The messages array is a training conversation.
Extraction Pipeline
Step 1: Export Your Logs
Where your logs live depends on your architecture:
If you log API calls yourself: Export from your database (PostgreSQL, MongoDB, etc.) or log aggregation service (Datadog, CloudWatch, etc.).
If you use OpenAI's API: The API does not store logs by default. You need your own logging middleware. If you do not have one, set it up now. Every future API call is a potential training example.
# Simple logging middleware example
import json
import datetime

def log_api_call(request_messages, response_content, model, tokens):
    log_entry = {
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "messages": request_messages + [
            {"role": "assistant", "content": response_content}
        ],
        "tokens_used": tokens,
    }
    with open("api_logs.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
Step 2: Filter for Quality
Not every API response is good training data. Filter out:
Failed responses: Timeout errors, malformed output, refusals.
Low-quality outputs: Responses where the user immediately retried (indicating dissatisfaction), or where the output was truncated.
Outliers: Unusually long or short responses that do not represent typical behavior.
Off-task interactions: If users occasionally ask off-topic questions, exclude those unless you want the model to handle them.
def is_quality_example(log_entry):
    messages = log_entry["messages"]
    assistant_msg = next(
        (m for m in reversed(messages) if m["role"] == "assistant"), None
    )
    if not assistant_msg:
        return False
    content = assistant_msg["content"]
    # Filter too-short responses
    if len(content) < 50:
        return False
    # Filter error responses
    if "I apologize" in content and "I cannot" in content:
        return False
    # Filter truncated responses
    if log_entry.get("finish_reason") == "length":
        return False
    return True
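The "user immediately retried" signal from the filter list above is not captured by the per-entry check. It can be approximated by comparing consecutive user prompts: a near-duplicate prompt arriving shortly after the previous one suggests the first answer missed. A minimal sketch, where the similarity threshold and time window are illustrative assumptions you should tune on your own logs:

```python
import difflib
from datetime import datetime, timedelta

def flag_retries(log_entries, similarity=0.9, window_minutes=5):
    """Return indices of entries whose response was likely retried.

    If the next call's user prompt is near-identical and arrives within
    the time window, treat the earlier entry as a failed attempt and
    exclude it from the training set.
    """
    flagged = set()
    for i in range(len(log_entries) - 1):
        cur, nxt = log_entries[i], log_entries[i + 1]
        cur_user = next((m["content"] for m in cur["messages"] if m["role"] == "user"), "")
        nxt_user = next((m["content"] for m in nxt["messages"] if m["role"] == "user"), "")
        t_cur = datetime.fromisoformat(cur["timestamp"])
        t_nxt = datetime.fromisoformat(nxt["timestamp"])
        close_in_time = t_nxt - t_cur <= timedelta(minutes=window_minutes)
        ratio = difflib.SequenceMatcher(None, cur_user, nxt_user).ratio()
        if close_in_time and ratio >= similarity:
            flagged.add(i)
    return flagged
```

This assumes logs are sorted by timestamp per user; in a multi-user system, group by user ID first.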
Step 3: Remove the System Prompt (Optional)
A fine-tuned model learns its behavior from the training examples themselves, so you may not need the system prompt in the training data: the model internalizes the instructions.
Two approaches:
Keep the system prompt: The model learns to follow these specific instructions. Good if your system prompt is short and stable.
Remove the system prompt: The model learns the behavior pattern without explicit instructions. Good if your system prompt is long (saves tokens in training) or if you want the behavior to be intrinsic rather than instruction-dependent.
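If you take the second approach, dropping the system message is a small transform applied to each example. A minimal sketch:

```python
def strip_system_prompt(example):
    """Return a copy of a training example without its system messages."""
    return {
        "messages": [m for m in example["messages"] if m["role"] != "system"]
    }
```

Run this once over the whole dataset before writing the training file; the user and assistant turns are left untouched.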
Step 4: Anonymize
Remove PII from the training data:
import re

def anonymize_message(content):
    # Emails
    content = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', content)
    # Phone numbers
    content = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', content)
    # Credit card numbers
    content = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', content)
    # Addresses (basic pattern)
    content = re.sub(r'\d+\s+[\w\s]+(?:Street|St|Avenue|Ave|Road|Rd|Drive|Dr)\b', '[ADDRESS]', content)
    return content

def anonymize_log(log_entry):
    for msg in log_entry["messages"]:
        msg["content"] = anonymize_message(msg["content"])
    return log_entry
Step 5: Format for Training
Convert your filtered, anonymized logs to the standard fine-tuning format:
def load_logs(path):
    # Stream log entries from the JSONL file written by the logging middleware
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def log_to_training_example(log_entry):
    messages = []
    for msg in log_entry["messages"]:
        if msg["role"] in ("system", "user", "assistant"):
            messages.append({
                "role": msg["role"],
                "content": msg["content"]
            })
    return {"messages": messages}

# Process all logs
training_data = []
for log_entry in load_logs("api_logs.jsonl"):
    if is_quality_example(log_entry):
        anonymized = anonymize_log(log_entry)
        example = log_to_training_example(anonymized)
        training_data.append(example)

# Write training file
with open("training_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")
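Before uploading, it is worth a quick sanity pass over the training file: one malformed line can fail an entire fine-tuning job. A minimal validation sketch; the checks are illustrative, not any platform's official schema:

```python
import json

def validate_training_file(path):
    """Yield (line_number, problem) for examples that look malformed."""
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                yield i, "invalid JSON"
                continue
            messages = example.get("messages", [])
            if not messages:
                yield i, "no messages"
            elif messages[-1]["role"] != "assistant":
                yield i, "does not end with an assistant message"
            elif not any(m["role"] == "user" for m in messages):
                yield i, "no user message"
```

An empty result means every line parsed and every conversation ends with the assistant turn the model is supposed to learn.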
How Many Logs Do You Need?
| API Calls Per Day | Time to Collect 1,000 Examples | Time to Collect 5,000 Examples |
|---|---|---|
| 100 | 2-3 weeks (after quality filtering) | 2-3 months |
| 500 | 3-5 days | 2-3 weeks |
| 1,000 | 2-3 days | 1-2 weeks |
| 5,000 | 1 day | 1-2 days |
Assume 50-70% of raw API calls survive quality filtering. At 500 calls per day, you accumulate 250-350 quality examples daily.
For most tasks, 1,000 quality examples are sufficient for a well-performing fine-tuned model. You can start training within days to weeks of setting up logging.
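The table above follows from simple arithmetic. A sketch of the calculation, assuming the 50-70% post-filtering survival rate:

```python
import math

def days_to_collect(target_examples, calls_per_day, survival_rate):
    """Days of logging needed to accumulate target_examples after filtering."""
    return math.ceil(target_examples / (calls_per_day * survival_rate))
```

For example, at 500 calls per day and a 60% survival rate, 1,000 examples take `days_to_collect(1000, 500, 0.6)` = 4 days of logging, consistent with the 3-5 day row.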
The Distillation Advantage
When you fine-tune a small model (1-3B) on outputs from a larger model (GPT-4o, Claude Sonnet), you are performing knowledge distillation. The small model learns to reproduce the behavior of the large model on your specific task.
The result: a 3B fine-tuned model that matches the large model's performance on your domain tasks while running on-device. This is not theoretical. Research and production deployments consistently show that fine-tuned small models match or exceed prompted large models on narrow, domain-specific tasks.
Your API logs are the distillation dataset. The large model has already done the work. You just need to teach a small model to replicate it.
From Logs to Deployment
The end-to-end pipeline:
- Set up API logging (if not already in place)
- Accumulate 1,000+ quality examples (days to weeks)
- Extract, filter, anonymize, and format the logs
- Upload to a fine-tuning platform like Ertas
- Select a base model (Llama 3.2 3B recommended)
- Fine-tune with LoRA (30 min - 3 hours)
- Export GGUF (Q4_K_M quantization)
- Integrate llama.cpp in your mobile app
- A/B test the on-device model against your cloud API
- Migrate when the on-device model meets your quality bar
Your API logs are not just a cost center. They are the bridge to on-device AI.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.