
Fine-Tuning for Structured Output: Beyond JSON Mode to Guaranteed Schemas
JSON mode gets you valid JSON. Fine-tuning gets you guaranteed schema compliance — every field, every type, every time. Here's how to train models that output exactly the structure your app expects.
Your app expects a JSON object with exactly 8 fields, specific types for each field, enums for two of them, and a nested array of objects with their own schema. You ask GPT-4 to produce it. Most of the time, you get what you asked for. Sometimes you get 7 fields. Occasionally you get a string where you expected an integer. Once in a while, the model invents an enum value that doesn't exist.
At 95% schema compliance, 1 in 20 API calls produces output your parser can't handle. If your app makes 10,000 structured output calls per day, that's 500 failures. Every day. Your error handling code becomes more complex than your actual business logic. You add retries, fallback prompts, post-processing fixers. The system works, but it's fragile — held together by duct tape and retry loops.
Fine-tuning changes the equation. A model trained on 500-1,000 examples of your exact schema doesn't drift, doesn't hallucinate fields, doesn't invent enum values. Schema compliance goes from 95% with prompting to 99.5%+ with fine-tuning. At 10,000 calls per day, that's the difference between 500 failures and 50.
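The arithmetic behind those numbers is simply failure rate times call volume. A quick sketch:

```python
def daily_failures(calls_per_day: int, compliance_rate: float) -> int:
    """Expected parser failures per day at a given schema compliance rate."""
    return round(calls_per_day * (1 - compliance_rate))

daily_failures(10_000, 0.95)   # prompting: 500 failures/day
daily_failures(10_000, 0.995)  # fine-tuning: 50 failures/day
```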
The Structured Output Spectrum
There's a hierarchy of structured output approaches, from least reliable to most:
Level 1: Prompt-Based ("Please output JSON")
Ask the model to produce JSON in your prompt. Include an example. Hope for the best.
- Compliance rate: 80-90%
- Failure modes: Invalid JSON (missing quotes, trailing commas), missing fields, wrong types, extra fields, markdown wrapping
- When to use: Prototyping only
Level 2: JSON Mode
OpenAI's JSON mode, or equivalent settings in other APIs. Forces the model to output syntactically valid JSON.
- Compliance rate: 95-98% (valid JSON, but schema compliance varies)
- Failure modes: Valid JSON that doesn't match your schema — missing required fields, wrong field names, type mismatches, extra fields
- When to use: When you need valid JSON but can tolerate schema drift
Level 3: Function Calling / Structured Outputs API
OpenAI's structured outputs with a JSON schema, or function-calling endpoints. The API enforces the schema at the decoding level.
- Compliance rate: 99%+ for schema structure
- Failure modes: Correct structure but wrong values — hallucinated enum values, semantically wrong content, empty strings where content is expected
- When to use: When you need schema compliance from cloud APIs and can accept the per-token cost
Level 4: Fine-Tuned Model
A model trained on hundreds of examples of your exact schema. Knows the field names, types, valid values, and semantic expectations.
- Compliance rate: 99.5-99.9%
- Failure modes: Rare edge cases on inputs far outside training distribution
- When to use: Production systems with high volume where reliability and cost matter
Level 5: Fine-Tuned Model + Constrained Decoding
Fine-tuned model with constrained decoding (llama.cpp grammar, Outlines, or guidance) that makes invalid tokens impossible.
- Compliance rate: 100% structural compliance
- Failure modes: Structurally perfect JSON with semantically wrong values (rare with fine-tuning)
- When to use: When you need zero structural failures and are running local inference
Why Prompting Hits a Ceiling
The fundamental problem with prompt-based structured output: the model is generating tokens sequentially, and nothing prevents it from generating a token that violates your schema.
When you prompt GPT-4 to output {"status": "approved" | "denied" | "pending"}, the model might generate "status": "approved" on one call and "status": "Approved" on the next. Or "status": "approve". Each is a valid JSON string — the model has no constraint that says "only these three exact values are acceptable."
Longer prompts with more detailed schemas help, but they hit diminishing returns:
- 1-line schema description: ~85% compliance
- Detailed schema with examples: ~92% compliance
- Schema + few-shot examples + explicit constraints: ~95-97% compliance
You can't prompt your way to 99.5%. The model's token generation is probabilistic. No matter how good your prompt is, the probability of generating the wrong token is never exactly zero.
Fine-tuning changes the model's probability distribution directly. After training on 500 examples where status is always exactly "approved", "denied", or "pending", the model assigns near-zero probability to any other value for that field. The compliance comes from the weights, not the prompt.
Building Schema-Compliant Training Datasets
The quality of your training data determines the quality of your schema compliance. Here's how to build a dataset that produces reliable structured output.
Step 1: Define Your Schema Formally
Start with a JSON Schema specification:
```json
{
  "type": "object",
  "required": ["id", "status", "score", "category", "findings", "metadata"],
  "properties": {
    "id": {"type": "string", "pattern": "^RPT-[0-9]{6}$"},
    "status": {"type": "string", "enum": ["approved", "denied", "pending", "review"]},
    "score": {"type": "number", "minimum": 0, "maximum": 100},
    "category": {"type": "string", "enum": ["financial", "operational", "compliance", "technical"]},
    "findings": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["description", "severity", "recommendation"],
        "properties": {
          "description": {"type": "string"},
          "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
          "recommendation": {"type": "string"}
        }
      }
    },
    "metadata": {
      "type": "object",
      "required": ["analyst", "date", "version"],
      "properties": {
        "analyst": {"type": "string"},
        "date": {"type": "string", "format": "date"},
        "version": {"type": "string"}
      }
    }
  }
}
```
Step 2: Generate Diverse, Valid Examples
You need 500-1,000 training examples. Each must be a perfectly valid instance of your schema. There are three approaches to generating them:
From production data: If you have existing data that follows this schema (from a database, from previous API outputs), convert it to training examples. This is the best source because it reflects real-world value distributions.
From GPT-4 with validation: Use GPT-4 to generate diverse examples, then validate each one against your JSON Schema. Discard any that don't validate. At 95% compliance from GPT-4, you'll discard about 1 in 20 — acceptable for dataset generation. Run the validated examples through your actual parser to double-check.
Programmatic generation: Write a script that generates random valid instances of your schema. This guarantees structural validity but may lack semantic coherence. Use it to supplement real examples, not as your only source.
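For the GPT-4-with-validation route, the discard step can be sketched with stdlib-only checks. This is a hand-rolled stand-in for running the full schema through a JSON Schema validator; the `id`/`status`/`score` rules mirror the schema above:

```python
import json
import re

# Minimal validation for a few fields of the report schema above.
# In practice you would validate the complete JSON Schema with a
# dedicated validator library; this sketch shows the filtering pattern.
ID_PATTERN = re.compile(r"^RPT-[0-9]{6}$")
STATUS_ENUM = {"approved", "denied", "pending", "review"}

def is_valid(raw: str) -> bool:
    """Return True if `raw` parses as JSON and satisfies the field rules."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False  # invalid JSON: discard
    if not isinstance(obj, dict):
        return False
    if not isinstance(obj.get("id"), str) or not ID_PATTERN.match(obj["id"]):
        return False
    if obj.get("status") not in STATUS_ENUM:
        return False
    score = obj.get("score")
    if not isinstance(score, (int, float)) or not 0 <= score <= 100:
        return False
    return True

def filter_valid(candidates):
    """Keep only candidates that validate; expect to lose roughly 1 in 20."""
    return [c for c in candidates if is_valid(c)]
```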
Step 3: Vary Inputs, Not Just Outputs
Your training data needs diverse inputs paired with schema-compliant outputs. If every input is "Analyze this financial report," the model learns the schema but can't generalize to new input phrasings.
Include variations:
- Different input lengths (one sentence to three paragraphs)
- Different input formats (formal requests, casual questions, structured instructions)
- Different levels of context (complete information vs. partial information)
- Edge cases (ambiguous inputs, inputs missing information, inputs that should produce empty arrays)
Step 4: Include Edge Cases Explicitly
The cases where prompted models fail most are edge cases. Make sure your training data covers:
- Empty arrays: when `findings` should be `[]` because there are no findings
- Minimum/maximum values: `score: 0` and `score: 100` — models tend to avoid extremes
- Long string values: fields with 200+ character values (models sometimes truncate)
- Special characters: quotes, newlines, unicode in string fields
- Null handling: if your schema allows nullable fields, include examples with null values
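A coverage check over your dataset catches gaps before training. This sketch counts how many outputs exercise each edge case, using the field names from the report schema and the thresholds from the checklist above:

```python
def edge_case_coverage(outputs):
    """Count how many training outputs exercise each edge case.

    `outputs` is a list of parsed assistant-output dicts following
    the report schema above."""
    stats = {"empty_findings": 0, "extreme_score": 0, "long_strings": 0}
    for out in outputs:
        if out.get("findings") == []:
            stats["empty_findings"] += 1
        if out.get("score") in (0, 100):
            stats["extreme_score"] += 1
        if any(len(f.get("description", "")) >= 200
               for f in out.get("findings", [])):
            stats["long_strings"] += 1
    return stats
```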
Step 5: Format as Training Pairs
Structure each example as a system/user/assistant conversation:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a report analysis system. Analyze the provided report and output a structured assessment."
    },
    {
      "role": "user",
      "content": "Report: Q3 revenue declined 12% due to supply chain disruptions. Two compliance violations were identified in the procurement process. The technical infrastructure handled the load increase from the holiday promotion without incidents."
    },
    {
      "role": "assistant",
      "content": "{\"id\": \"RPT-004521\", \"status\": \"review\", \"score\": 62, \"category\": \"operational\", \"findings\": [{\"description\": \"Revenue decline of 12% attributed to supply chain disruptions\", \"severity\": \"high\", \"recommendation\": \"Diversify supplier base and establish backup procurement channels\"}, {\"description\": \"Two compliance violations in procurement process\", \"severity\": \"medium\", \"recommendation\": \"Audit procurement workflows and retrain staff on compliance requirements\"}], \"metadata\": {\"analyst\": \"system\", \"date\": \"2026-03-15\", \"version\": \"1.0\"}}"
    }
  ]
}
```
The assistant's response must be raw JSON — no markdown formatting, no explanatory text, no code blocks. Train the model to output JSON and nothing else.
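A small helper keeps every JSONL line consistent: one record per example, assistant content serialized as minified JSON with no markdown wrapping. A sketch:

```python
import json

def make_training_pair(system_prompt: str, user_input: str,
                       output_obj: dict) -> str:
    """Serialize one training example as a JSONL line.

    The assistant content is minified JSON: no indentation, no code
    fences, no explanatory text around it."""
    assistant = json.dumps(output_obj, separators=(",", ":"))
    record = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": assistant},
        ]
    }
    return json.dumps(record)
```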
Measuring Schema Compliance
Don't rely on "it looks right." Measure compliance programmatically:
Structural Compliance
Run every output through a JSON Schema validator. Count the percentage that pass. This catches missing fields, wrong types, invalid enum values, and schema structure violations.
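The metric itself is one line once you have a validator. This sketch assumes you supply `validate` as any callable that returns True or False for a raw output string:

```python
def structural_compliance(outputs, validate) -> float:
    """Fraction of raw model outputs that pass the schema validator.

    `validate` is a caller-supplied callable, e.g. a wrapper around a
    JSON Schema validator."""
    if not outputs:
        return 0.0
    return sum(1 for o in outputs if validate(o)) / len(outputs)
```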
Semantic Compliance
Structural validity doesn't mean the content is correct. A model can output {"status": "approved", "score": 0} — structurally valid but semantically suspect (approved with a score of 0?). Build semantic checks:
- Do enum values correlate correctly with other fields?
- Are numeric values in realistic ranges?
- Do string values contain relevant content (not lorem ipsum or empty strings)?
- Are array lengths appropriate for the input?
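Checks like these compose into a simple flagging function. The thresholds below are illustrative assumptions, not rules from the schema:

```python
def semantic_flags(report: dict) -> list:
    """Flag structurally valid outputs whose values look semantically off.

    The specific thresholds (score < 40, description < 10 chars) are
    example values; tune them to your own domain."""
    flags = []
    # Cross-field correlation: "approved" with a very low score is suspect
    if report.get("status") == "approved" and report.get("score", 100) < 40:
        flags.append("approved with low score")
    # Content checks: empty or placeholder strings where prose is expected
    for finding in report.get("findings", []):
        if len(finding.get("description", "").strip()) < 10:
            flags.append("thin finding description")
    return flags
```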
Comparison Metrics
After fine-tuning, compare against your baseline:
| Metric | Prompted GPT-4o | Fine-Tuned 8B | Fine-Tuned 8B + Grammar |
|---|---|---|---|
| Valid JSON | 99.5% | 99.8% | 100% |
| Schema compliance | 93-96% | 99.2-99.7% | 100% |
| Semantic accuracy | 90-94% | 93-97% | 93-97% |
| Avg. tokens/response | 350 | 280 | 280 |
| Cost per 1K calls | $2.50-$8.00 | $0 (local) | $0 (local) |
The fine-tuned model is more schema-compliant, uses fewer tokens (it doesn't pad responses with unnecessary verbosity), and costs nothing per call.
Combining Fine-Tuning with Constrained Decoding
For applications where even 99.5% isn't enough — financial transactions, medical records, legal documents — combine fine-tuning with constrained decoding.
Constrained decoding (also called grammar-guided generation) restricts the model's output tokens to only those that produce valid output according to a grammar specification. llama.cpp supports this via GBNF grammars. Outlines and guidance provide similar functionality for Python-based inference.
A GBNF grammar for your schema:
```
root ::= "{" ws "\"id\":" ws string "," ws "\"status\":" ws status "," ws "\"score\":" ws number "," ws "\"category\":" ws category "," ws "\"findings\":" ws findings "," ws "\"metadata\":" ws metadata ws "}"
status ::= "\"approved\"" | "\"denied\"" | "\"pending\"" | "\"review\""
category ::= "\"financial\"" | "\"operational\"" | "\"compliance\"" | "\"technical\""
...
```
With a fine-tuned model + grammar, you get:
- 100% structural compliance (grammar prevents invalid structures)
- 99.5%+ semantic accuracy (fine-tuning teaches correct values)
- Zero per-token cost (local inference)
- Fast inference (grammar constraints actually speed up generation slightly by reducing the token search space)
The grammar handles the structure. The fine-tuning handles the content. Together, they produce output that your parser can trust unconditionally.
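The core mechanism can be illustrated with a toy: at each step, mask the candidate tokens to those that keep the partial output a prefix of something the grammar allows. This sketch constrains a single enum field, character by character; real engines like llama.cpp's GBNF support do the same thing over the full grammar at the tokenizer level:

```python
def allowed_next_chars(partial: str, choices) -> set:
    """Characters that keep `partial` a prefix of at least one valid choice."""
    return {c[len(partial)] for c in choices
            if c.startswith(partial) and len(c) > len(partial)}

def constrained_decode(score_fn, choices) -> str:
    """Greedily decode one enum value, masking to grammar-legal characters.

    `score_fn(prefix, char)` stands in for the model's logits; illegal
    characters are never considered, no matter how high they score."""
    out = ""
    while out not in choices:
        legal = allowed_next_chars(out, choices)
        out += max(legal, key=lambda ch: score_fn(out, ch))
    return out
```

Note that even a scorer that strongly prefers an out-of-grammar character can only ever emit a legal value, which is exactly the 100%-structural-compliance guarantee.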
Performance Impact
Fine-tuned models generating structured output are typically faster than prompted models, for two reasons:
1. Shorter outputs: Prompted models often include explanatory text, markdown formatting, or meta-commentary around the JSON. Fine-tuned models output raw JSON only. This reduces output tokens by 20-40%.
2. More confident generation: When a model "knows" the schema (from fine-tuning), it generates each token with higher confidence. Less backtracking in the sampling process. Measurably lower time-to-first-token and faster token generation rate.
Benchmark on a customer support ticket classification schema (8 fields, 2 enums, 1 nested object):
| Setup | Avg. Output Tokens | Avg. Latency | Structural Validity |
|---|---|---|---|
| GPT-4o + prompt | 420 tokens | 1.8s | 96.2% |
| GPT-4o + structured outputs API | 310 tokens | 1.4s | 99.8% |
| Fine-tuned 8B (Ollama) | 270 tokens | 0.4s | 99.5% |
| Fine-tuned 8B + grammar | 265 tokens | 0.35s | 100% |
The fine-tuned local model is over 4x faster than prompted GPT-4o and produces shorter, more reliable output.
Common Mistakes in Schema Fine-Tuning
Mistake 1: Inconsistent Training Data
If 80% of your training examples use "date": "2026-03-15" and 20% use "date": "March 15, 2026", the model learns both formats and probabilistically switches between them. Every field, every format, every convention must be 100% consistent across all training examples.
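One fix is to normalize every value at dataset-build time rather than trusting your sources. A sketch for the date field, coercing both formats from the example above to ISO 8601:

```python
from datetime import datetime

def normalize_date(raw: str) -> str:
    """Coerce known date formats to a single canonical ISO 8601 form."""
    for fmt in ("%Y-%m-%d", "%B %d, %Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognized date format: {raw!r}")
```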
Mistake 2: Not Training on Empty/Null Cases
If your schema allows optional fields or empty arrays, but all training examples have populated values, the model will hallucinate values rather than output null or []. Include 10-15% of examples with each optional field set to null and each array empty.
Mistake 3: Overly Homogeneous Inputs
If all your training inputs are roughly the same length, complexity, and topic, the model overfits to that input distribution. When it sees a significantly different input in production, schema compliance drops. Vary your inputs aggressively.
Mistake 4: Training on Pretty-Printed JSON
If you train on formatted JSON with indentation, the model wastes tokens on whitespace. Train on minified JSON: {"id":"RPT-004521","status":"approved",...}. This reduces output tokens by 15-25% and improves generation speed.
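Minification is a one-liner with the stdlib: re-serialize with `separators=(",", ":")` to strip all whitespace between tokens before writing training targets:

```python
import json

pretty = """{
  "id": "RPT-004521",
  "status": "approved",
  "score": 62
}"""

# Round-trip through the parser to drop all formatting whitespace
minified = json.dumps(json.loads(pretty), separators=(",", ":"))
# minified == '{"id":"RPT-004521","status":"approved","score":62}'
```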
Mistake 5: Not Validating the Training Data
If even one training example has a schema violation, the model learns that violations are acceptable. Validate every single training example against your JSON Schema before including it in the dataset. Automate this. No exceptions.
Getting Started
- Define your target schema as a formal JSON Schema
- Collect or generate 500-1,000 validated training examples
- Format as JSONL training pairs (system/user/assistant)
- Validate every example programmatically — discard any with schema violations
- Fine-tune on Ertas — Llama 3.1 8B or Qwen 2.5 7B are both strong choices for structured output
- Evaluate on 100+ held-out examples, measuring structural and semantic compliance
- Deploy via Ollama, optionally with GBNF grammar for guaranteed structure
- Monitor production outputs and add failure cases to your training data for the next iteration
The goal is simple: your app's JSON parser should never fail because of the model. Fine-tuning gets you there.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Fine-Tuning for JSON Output — Foundational guide to training models that produce valid, parseable JSON.
- Fine-Tuning for Tool Calling: How to Build Reliable AI Agents with Small Models — Structured output applied specifically to tool-calling schemas in agent pipelines.
- How to Fine-Tune an LLM — End-to-end walkthrough of the fine-tuning process from dataset preparation to deployment.