
Fine-Tuning for Structured Output: Beyond JSON Mode to Guaranteed Schemas
JSON mode gets you valid JSON. Fine-tuning gets you guaranteed schema compliance — every field, every type, every time. Here's how to train models that output exactly the structure your app expects.
Your app expects a JSON object with exactly 8 fields, specific types for each field, enums for two of them, and a nested array of objects with their own schema. You ask GPT-4 to produce it. Most of the time, you get what you asked for. Sometimes you get 7 fields. Occasionally you get a string where you expected an integer. Once in a while, the model invents an enum value that doesn't exist.
At 95% schema compliance, 1 in 20 API calls produces output your parser can't handle. If your app makes 10,000 structured output calls per day, that's 500 failures. Every day. Your error handling code becomes more complex than your actual business logic. You add retries, fallback prompts, post-processing fixers. The system works, but it's fragile — held together by duct tape and retry loops.
Fine-tuning changes the equation. A model trained on 500-1,000 examples of your exact schema doesn't drift, doesn't hallucinate fields, doesn't invent enum values. Schema compliance goes from 95% with prompting to 99.5%+ with fine-tuning. At 10,000 calls per day, that's the difference between 500 failures and 50.
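The arithmetic behind those numbers is simply failure rate times call volume. A quick sketch:

```python
def daily_failures(calls_per_day: int, compliance_rate: float) -> int:
    """Expected parser failures per day at a given schema compliance rate."""
    return round(calls_per_day * (1 - compliance_rate))

daily_failures(10_000, 0.95)   # prompting: 500 failures/day
daily_failures(10_000, 0.995)  # fine-tuning: 50 failures/day
```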
The Structured Output Spectrum
There's a hierarchy of structured output approaches, from least reliable to most:
Level 1: Prompt-Based ("Please output JSON")
Ask the model to produce JSON in your prompt. Include an example. Hope for the best.
- Compliance rate: 80-90%
- Failure modes: Invalid JSON (missing quotes, trailing commas), missing fields, wrong types, extra fields, markdown wrapping
- When to use: Prototyping only
Level 2: JSON Mode
OpenAI's JSON mode, or equivalent settings in other APIs. Forces the model to output syntactically valid JSON.
- Compliance rate: 95-98% (valid JSON, but schema compliance varies)
- Failure modes: Valid JSON that doesn't match your schema — missing required fields, wrong field names, type mismatches, extra fields
- When to use: When you need valid JSON but can tolerate schema drift
Level 3: Function Calling / Structured Outputs API
OpenAI's structured outputs with a JSON schema, or function-calling endpoints. The API enforces the schema at the decoding level.
- Compliance rate: 99%+ for schema structure
- Failure modes: Correct structure but wrong values — hallucinated enum values, semantically wrong content, empty strings where content is expected
- When to use: When you need schema compliance from cloud APIs and can accept the per-token cost
Level 4: Fine-Tuned Model
A model trained on hundreds of examples of your exact schema. Knows the field names, types, valid values, and semantic expectations.
- Compliance rate: 99.5-99.9%
- Failure modes: Rare edge cases on inputs far outside training distribution
- When to use: Production systems with high volume where reliability and cost matter
Level 5: Fine-Tuned Model + Constrained Decoding
Fine-tuned model with constrained decoding (llama.cpp grammar, Outlines, or guidance) that makes invalid tokens impossible.
- Compliance rate: 100% structural compliance
- Failure modes: Structurally perfect JSON with semantically wrong values (rare with fine-tuning)
- When to use: When you need zero structural failures and are running local inference
Why Prompting Hits a Ceiling
The fundamental problem with prompt-based structured output: the model is generating tokens sequentially, and nothing prevents it from generating a token that violates your schema.
When you prompt GPT-4 to output {"status": "approved" | "denied" | "pending"}, the model might generate "status": "approved" on one call and "status": "Approved" on the next. Or "status": "approve". Each is a valid JSON string — the model has no constraint that says "only these three exact values are acceptable."
Longer prompts with more detailed schemas help, but they hit diminishing returns:
- 1-line schema description: ~85% compliance
- Detailed schema with examples: ~92% compliance
- Schema + few-shot examples + explicit constraints: ~95-97% compliance
You can't prompt your way to 99.5%. The model's token generation is probabilistic. No matter how good your prompt is, the probability of generating the wrong token is never exactly zero.
Fine-tuning changes the model's probability distribution directly. After training on 500 examples where status is always exactly "approved", "denied", or "pending", the model assigns near-zero probability to any other value for that field. The compliance comes from the weights, not the prompt.
Building Schema-Compliant Training Datasets
The quality of your training data determines the quality of your schema compliance. Here's how to build a dataset that produces reliable structured output.
Step 1: Define Your Schema Formally
Start with a JSON Schema specification:
```json
{
  "type": "object",
  "required": ["id", "status", "score", "category", "findings", "metadata"],
  "properties": {
    "id": {"type": "string", "pattern": "^RPT-[0-9]{6}$"},
    "status": {"type": "string", "enum": ["approved", "denied", "pending", "review"]},
    "score": {"type": "number", "minimum": 0, "maximum": 100},
    "category": {"type": "string", "enum": ["financial", "operational", "compliance", "technical"]},
    "findings": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["description", "severity", "recommendation"],
        "properties": {
          "description": {"type": "string"},
          "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
          "recommendation": {"type": "string"}
        }
      }
    },
    "metadata": {
      "type": "object",
      "required": ["analyst", "date", "version"],
      "properties": {
        "analyst": {"type": "string"},
        "date": {"type": "string", "format": "date"},
        "version": {"type": "string"}
      }
    }
  }
}
```
Step 2: Generate Diverse, Valid Examples
You need 500-1,000 training examples. Each must be a perfectly valid instance of your schema. There are three approaches to generating them:
From production data: If you have existing data that follows this schema (from a database, from previous API outputs), convert it to training examples. This is the best source because it reflects real-world value distributions.
From GPT-4 with validation: Use GPT-4 to generate diverse examples, then validate each one against your JSON Schema. Discard any that don't validate. At 95% compliance from GPT-4, you'll discard about 1 in 20 — acceptable for dataset generation. Run the validated examples through your actual parser to double-check.
Programmatic generation: Write a script that generates random valid instances of your schema. This guarantees structural validity but may lack semantic coherence. Use it to supplement real examples, not as your only source.
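For the GPT-4-with-validation route, the discard step can be sketched with stdlib-only checks. This is a hand-rolled stand-in for running the full schema through a JSON Schema validator; the `id`/`status`/`score` rules mirror the schema above:

```python
import json
import re

# Minimal validation for a few fields of the report schema above.
# In practice you would validate the complete JSON Schema with a
# dedicated validator library; this sketch shows the filtering pattern.
ID_PATTERN = re.compile(r"^RPT-[0-9]{6}$")
STATUS_ENUM = {"approved", "denied", "pending", "review"}

def is_valid(raw: str) -> bool:
    """Return True if `raw` parses as JSON and satisfies the field rules."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False  # invalid JSON: discard
    if not isinstance(obj, dict):
        return False
    if not isinstance(obj.get("id"), str) or not ID_PATTERN.match(obj["id"]):
        return False
    if obj.get("status") not in STATUS_ENUM:
        return False
    score = obj.get("score")
    if not isinstance(score, (int, float)) or not 0 <= score <= 100:
        return False
    return True

def filter_valid(candidates):
    """Keep only candidates that validate; expect to lose roughly 1 in 20."""
    return [c for c in candidates if is_valid(c)]
```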
Step 3: Vary Inputs, Not Just Outputs
Your training data needs diverse inputs paired with schema-compliant outputs. If every input is "Analyze this financial report," the model learns the schema but can't generalize to new input phrasings.
Include variations:
- Different input lengths (one sentence to three paragraphs)
- Different input formats (formal requests, casual questions, structured instructions)
- Different levels of context (complete information vs. partial information)
- Edge cases (ambiguous inputs, inputs missing information, inputs that should produce empty arrays)
Step 4: Include Edge Cases Explicitly
The cases where prompted models fail most are edge cases. Make sure your training data covers:
- Empty arrays: when `findings` should be `[]` because there are no findings
- Minimum/maximum values: `score: 0` and `score: 100` — models tend to avoid extremes
- Long string values: fields with 200+ character values (models sometimes truncate)
- Special characters: quotes, newlines, unicode in string fields
- Null handling: if your schema allows nullable fields, include examples with null values
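A coverage check over your dataset catches gaps before training. This sketch counts how many outputs exercise each edge case, using the field names from the report schema and the thresholds from the checklist above:

```python
def edge_case_coverage(outputs):
    """Count how many training outputs exercise each edge case.

    `outputs` is a list of parsed assistant-output dicts following
    the report schema above."""
    stats = {"empty_findings": 0, "extreme_score": 0, "long_strings": 0}
    for out in outputs:
        if out.get("findings") == []:
            stats["empty_findings"] += 1
        if out.get("score") in (0, 100):
            stats["extreme_score"] += 1
        if any(len(f.get("description", "")) >= 200
               for f in out.get("findings", [])):
            stats["long_strings"] += 1
    return stats
```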
Step 5: Format as Training Pairs
Structure each example as a system/user/assistant conversation:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a report analysis system. Analyze the provided report and output a structured assessment."
    },
    {
      "role": "user",
      "content": "Report: Q3 revenue declined 12% due to supply chain disruptions. Two compliance violations were identified in the procurement process. The technical infrastructure handled the load increase from the holiday promotion without incidents."
    },
    {
      "role": "assistant",
      "content": "{\"id\": \"RPT-004521\", \"status\": \"review\", \"score\": 62, \"category\": \"operational\", \"findings\": [{\"description\": \"Revenue decline of 12% attributed to supply chain disruptions\", \"severity\": \"high\", \"recommendation\": \"Diversify supplier base and establish backup procurement channels\"}, {\"description\": \"Two compliance violations in procurement process\", \"severity\": \"medium\", \"recommendation\": \"Audit procurement workflows and retrain staff on compliance requirements\"}], \"metadata\": {\"analyst\": \"system\", \"date\": \"2026-03-15\", \"version\": \"1.0\"}}"
    }
  ]
}
```
The assistant's response must be raw JSON — no markdown formatting, no explanatory text, no code blocks. Train the model to output JSON and nothing else.
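A small helper keeps every JSONL line consistent: one record per example, assistant content serialized as minified JSON with no markdown wrapping. A sketch:

```python
import json

def make_training_pair(system_prompt: str, user_input: str,
                       output_obj: dict) -> str:
    """Serialize one training example as a JSONL line.

    The assistant content is minified JSON: no indentation, no code
    fences, no explanatory text around it."""
    assistant = json.dumps(output_obj, separators=(",", ":"))
    record = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": assistant},
        ]
    }
    return json.dumps(record)
```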
Measuring Schema Compliance
Don't rely on "it looks right." Measure compliance programmatically:
Structural Compliance
Run every output through a JSON Schema validator. Count the percentage that pass. This catches missing fields, wrong types, invalid enum values, and schema structure violations.
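The metric itself is one line once you have a validator. This sketch assumes you supply `validate` as any callable that returns True or False for a raw output string:

```python
def structural_compliance(outputs, validate) -> float:
    """Fraction of raw model outputs that pass the schema validator.

    `validate` is a caller-supplied callable, e.g. a wrapper around a
    JSON Schema validator."""
    if not outputs:
        return 0.0
    return sum(1 for o in outputs if validate(o)) / len(outputs)
```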
Semantic Compliance
Structural validity doesn't mean the content is correct. A model can output {"status": "approved", "score": 0} — structurally valid but semantically suspect (approved with a score of 0?). Build semantic checks:
- Do enum values correlate correctly with other fields?
- Are numeric values in realistic ranges?
- Do string values contain relevant content (not lorem ipsum or empty strings)?
- Are array lengths appropriate for the input?
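Checks like these compose into a simple flagging function. The thresholds below are illustrative assumptions, not rules from the schema:

```python
def semantic_flags(report: dict) -> list:
    """Flag structurally valid outputs whose values look semantically off.

    The specific thresholds (score < 40, description < 10 chars) are
    example values; tune them to your own domain."""
    flags = []
    # Cross-field correlation: "approved" with a very low score is suspect
    if report.get("status") == "approved" and report.get("score", 100) < 40:
        flags.append("approved with low score")
    # Content checks: empty or placeholder strings where prose is expected
    for finding in report.get("findings", []):
        if len(finding.get("description", "").strip()) < 10:
            flags.append("thin finding description")
    return flags
```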
Comparison Metrics
After fine-tuning, compare against your baseline:
| Metric | Prompted GPT-4o | Fine-Tuned 8B | Fine-Tuned 8B + Grammar |
|---|---|---|---|
| Valid JSON | 99.5% | 99.8% | 100% |
| Schema compliance | 93-96% | 99.2-99.7% | 100% |
| Semantic accuracy | 90-94% | 93-97% | 93-97% |
| Avg. tokens/response | 350 | 280 | 280 |
| Cost per 1K calls | $2.50-$8.00 | $0 (local) | $0 (local) |
The fine-tuned model is more schema-compliant, uses fewer tokens (it doesn't pad responses with unnecessary verbosity), and costs nothing per call.
Combining Fine-Tuning with Constrained Decoding
For applications where even 99.5% isn't enough — financial transactions, medical records, legal documents — combine fine-tuning with constrained decoding.
Constrained decoding (also called grammar-guided generation) restricts the model's output tokens to only those that produce valid output according to a grammar specification. llama.cpp supports this via GBNF grammars. Outlines and guidance provide similar functionality for Python-based inference.
A GBNF grammar for your schema:
```
root ::= "{" ws "\"id\":" ws string "," ws "\"status\":" ws status "," ws "\"score\":" ws number "," ws "\"category\":" ws category "," ws "\"findings\":" ws findings "," ws "\"metadata\":" ws metadata ws "}"
status ::= "\"approved\"" | "\"denied\"" | "\"pending\"" | "\"review\""
category ::= "\"financial\"" | "\"operational\"" | "\"compliance\"" | "\"technical\""
...
```
With a fine-tuned model + grammar, you get:
- 100% structural compliance (grammar prevents invalid structures)
- 99.5%+ semantic accuracy (fine-tuning teaches correct values)
- Zero per-token cost (local inference)
- Fast inference (grammar constraints actually speed up generation slightly by reducing the token search space)
The grammar handles the structure. The fine-tuning handles the content. Together, they produce output that your parser can trust unconditionally.
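The core mechanism can be illustrated with a toy: at each step, mask the candidate tokens to those that keep the partial output a prefix of something the grammar allows. This sketch constrains a single enum field, character by character; real engines like llama.cpp's GBNF support do the same thing over the full grammar at the tokenizer level:

```python
def allowed_next_chars(partial: str, choices) -> set:
    """Characters that keep `partial` a prefix of at least one valid choice."""
    return {c[len(partial)] for c in choices
            if c.startswith(partial) and len(c) > len(partial)}

def constrained_decode(score_fn, choices) -> str:
    """Greedily decode one enum value, masking to grammar-legal characters.

    `score_fn(prefix, char)` stands in for the model's logits; illegal
    characters are never considered, no matter how high they score."""
    out = ""
    while out not in choices:
        legal = allowed_next_chars(out, choices)
        out += max(legal, key=lambda ch: score_fn(out, ch))
    return out
```

Note that even a scorer that strongly prefers an out-of-grammar character can only ever emit a legal value, which is exactly the 100%-structural-compliance guarantee.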
Performance Impact
Fine-tuned models generating structured output are typically faster than prompted models, for two reasons:
1. Shorter outputs: Prompted models often include explanatory text, markdown formatting, or meta-commentary around the JSON. Fine-tuned models output raw JSON only. This reduces output tokens by 20-40%.
2. More confident generation: When a model "knows" the schema (from fine-tuning), it generates each token with higher confidence. Less backtracking in the sampling process. Measurably lower time-to-first-token and faster token generation rate.
Benchmark on a customer support ticket classification schema (8 fields, 2 enums, 1 nested object):
| Setup | Avg. Output Tokens | Avg. Latency | Structural Validity |
|---|---|---|---|
| GPT-4o + prompt | 420 tokens | 1.8s | 96.2% |
| GPT-4o + structured outputs API | 310 tokens | 1.4s | 99.8% |
| Fine-tuned 8B (Ollama) | 270 tokens | 0.4s | 99.5% |
| Fine-tuned 8B + grammar | 265 tokens | 0.35s | 100% |
The fine-tuned local model is over 4x faster than prompted GPT-4o and produces shorter, more reliable output.
Common Mistakes in Schema Fine-Tuning
Mistake 1: Inconsistent Training Data
If 80% of your training examples use "date": "2026-03-15" and 20% use "date": "March 15, 2026", the model learns both formats and probabilistically switches between them. Every field, every format, every convention must be 100% consistent across all training examples.
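One fix is to normalize every value at dataset-build time rather than trusting your sources. A sketch for the date field, coercing both formats from the example above to ISO 8601:

```python
from datetime import datetime

def normalize_date(raw: str) -> str:
    """Coerce known date formats to a single canonical ISO 8601 form."""
    for fmt in ("%Y-%m-%d", "%B %d, %Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognized date format: {raw!r}")
```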
Mistake 2: Not Training on Empty/Null Cases
If your schema allows optional fields or empty arrays, but all training examples have populated values, the model will hallucinate values rather than output null or []. Include 10-15% of examples with each optional field set to null and each array empty.
Mistake 3: Overly Homogeneous Inputs
If all your training inputs are roughly the same length, complexity, and topic, the model overfits to that input distribution. When it sees a significantly different input in production, schema compliance drops. Vary your inputs aggressively.
Mistake 4: Training on Pretty-Printed JSON
If you train on formatted JSON with indentation, the model wastes tokens on whitespace. Train on minified JSON: {"id":"RPT-004521","status":"approved",...}. This reduces output tokens by 15-25% and improves generation speed.
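Minification is a one-liner with the stdlib: re-serialize with `separators=(",", ":")` to strip all whitespace between tokens before writing training targets:

```python
import json

pretty = """{
  "id": "RPT-004521",
  "status": "approved",
  "score": 62
}"""

# Round-trip through the parser to drop all formatting whitespace
minified = json.dumps(json.loads(pretty), separators=(",", ":"))
# minified == '{"id":"RPT-004521","status":"approved","score":62}'
```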
Mistake 5: Not Validating the Training Data
If even one training example has a schema violation, the model learns that violations are acceptable. Validate every single training example against your JSON Schema before including it in the dataset. Automate this. No exceptions.
Getting Started
- Define your target schema as a formal JSON Schema
- Collect or generate 500-1,000 validated training examples
- Format as JSONL training pairs (system/user/assistant)
- Validate every example programmatically — discard any with schema violations
- Fine-tune on Ertas — Llama 3.1 8B or Qwen 2.5 7B are both strong choices for structured output
- Evaluate on 100+ held-out examples, measuring structural and semantic compliance
- Deploy via Ollama, optionally with GBNF grammar for guaranteed structure
- Monitor production outputs and add failure cases to your training data for the next iteration
The goal is simple: your app's JSON parser should never fail because of the model. Fine-tuning gets you there.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Fine-Tuning for JSON Output — Foundational guide to training models that produce valid, parseable JSON.
- Fine-Tuning for Tool Calling: How to Build Reliable AI Agents with Small Models — Structured output applied specifically to tool-calling schemas in agent pipelines.
- How to Fine-Tune an LLM — End-to-end walkthrough of the fine-tuning process from dataset preparation to deployment.