    Fine-Tuning for Structured Output: Beyond JSON Mode to Guaranteed Schemas


    JSON mode gets you valid JSON. Fine-tuning gets you guaranteed schema compliance — every field, every type, every time. Here's how to train models that output exactly the structure your app expects.

    Ertas Team

    Your app expects a JSON object with exactly 8 fields, specific types for each field, enums for two of them, and a nested array of objects with their own schema. You ask GPT-4 to produce it. Most of the time, you get what you asked for. Sometimes you get 7 fields. Occasionally you get a string where you expected an integer. Once in a while, the model invents an enum value that doesn't exist.

    At 95% schema compliance, 1 in 20 API calls produces output your parser can't handle. If your app makes 10,000 structured output calls per day, that's 500 failures. Every day. Your error-handling code becomes more complex than your actual business logic. You add retries, fallback prompts, post-processing fixers. The system works, but it's fragile — held together by duct tape and retry loops.

    Fine-tuning changes the equation. A model trained on 500-1,000 examples of your exact schema doesn't drift, doesn't hallucinate fields, doesn't invent enum values. Schema compliance goes from 95% with prompting to 99.5%+ with fine-tuning. At 10,000 calls per day, that's the difference between 500 failures and 50.

    The Structured Output Spectrum

    There's a hierarchy of structured output approaches, from least reliable to most:

    Level 1: Prompt-Based ("Please output JSON")

    Ask the model to produce JSON in your prompt. Include an example. Hope for the best.

    • Compliance rate: 80-90%
    • Failure modes: Invalid JSON (missing quotes, trailing commas), missing fields, wrong types, extra fields, markdown wrapping
    • When to use: Prototyping only

    Level 2: JSON Mode

    OpenAI's JSON mode, or equivalent settings in other APIs. Forces the model to output syntactically valid JSON.

    • Compliance rate: 95-98% (valid JSON, but schema compliance varies)
    • Failure modes: Valid JSON that doesn't match your schema — missing required fields, wrong field names, type mismatches, extra fields
    • When to use: When you need valid JSON but can tolerate schema drift

    Level 3: Function Calling / Structured Outputs API

    OpenAI's structured outputs with a JSON schema, or function-calling endpoints. The API enforces the schema at the decoding level.

    • Compliance rate: 99%+ for schema structure
    • Failure modes: Correct structure but wrong values — hallucinated enum values, semantically wrong content, empty strings where content is expected
    • When to use: When you need schema compliance from cloud APIs and can accept the per-token cost

    Level 4: Fine-Tuned Model

    A model trained on hundreds of examples of your exact schema. Knows the field names, types, valid values, and semantic expectations.

    • Compliance rate: 99.5-99.9%
    • Failure modes: Rare edge cases on inputs far outside training distribution
    • When to use: Production systems with high volume where reliability and cost matter

    Level 5: Fine-Tuned Model + Constrained Decoding

    Fine-tuned model with constrained decoding (llama.cpp grammar, Outlines, or guidance) that makes invalid tokens impossible.

    • Compliance rate: 100% structural compliance
    • Failure modes: Structurally perfect JSON with semantically wrong values (rare with fine-tuning)
    • When to use: When you need zero structural failures and are running local inference

    Why Prompting Hits a Ceiling

    The fundamental problem with prompt-based structured output: the model is generating tokens sequentially, and nothing prevents it from generating a token that violates your schema.

    When you prompt GPT-4 to output {"status": "approved" | "denied" | "pending"}, the model might generate "status": "approved" on one call and "status": "Approved" on the next. Or "status": "approve". Each is a valid JSON string — the model has no constraint that says "only these three exact values are acceptable."

    Longer prompts with more detailed schemas help, but they hit diminishing returns:

    • 1-line schema description: ~85% compliance
    • Detailed schema with examples: ~92% compliance
    • Schema + few-shot examples + explicit constraints: ~95-97% compliance

    You can't prompt your way to 99.5%. The model's token generation is probabilistic. No matter how good your prompt is, the probability of generating the wrong token is never exactly zero.

    Fine-tuning changes the model's probability distribution directly. After training on 500 examples where status is always exactly "approved", "denied", or "pending", the model assigns near-zero probability to any other value for that field. The compliance comes from the weights, not the prompt.

    Building Schema-Compliant Training Datasets

    The quality of your training data determines the quality of your schema compliance. Here's how to build a dataset that produces reliable structured output.

    Step 1: Define Your Schema Formally

    Start with a JSON Schema specification:

    {
      "type": "object",
      "required": ["id", "status", "score", "category", "findings", "metadata"],
      "properties": {
        "id": {"type": "string", "pattern": "^RPT-[0-9]{6}$"},
        "status": {"type": "string", "enum": ["approved", "denied", "pending", "review"]},
        "score": {"type": "number", "minimum": 0, "maximum": 100},
        "category": {"type": "string", "enum": ["financial", "operational", "compliance", "technical"]},
        "findings": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["description", "severity", "recommendation"],
            "properties": {
              "description": {"type": "string"},
              "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
              "recommendation": {"type": "string"}
            }
          }
        },
        "metadata": {
          "type": "object",
          "required": ["analyst", "date", "version"],
          "properties": {
            "analyst": {"type": "string"},
            "date": {"type": "string", "format": "date"},
            "version": {"type": "string"}
          }
        }
      }
    }
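
    To screen candidate instances against a schema like this, a full JSON Schema validator (such as the jsonschema library) is the right production tool. As a dependency-free sketch of the same idea, with field names and constraints taken from the schema above and the validation logic deliberately simplified:

```python
import re

STATUSES = {"approved", "denied", "pending", "review"}
CATEGORIES = {"financial", "operational", "compliance", "technical"}
SEVERITIES = {"low", "medium", "high", "critical"}

def validate_report(obj: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the instance passes."""
    errors = []
    for field in ("id", "status", "score", "category", "findings", "metadata"):
        if field not in obj:
            errors.append(f"missing required field: {field}")
    if "id" in obj and not re.fullmatch(r"RPT-[0-9]{6}", str(obj["id"])):
        errors.append("id does not match ^RPT-[0-9]{6}$")
    if obj.get("status") not in STATUSES:
        errors.append(f"invalid status: {obj.get('status')!r}")
    if "score" in obj and not (isinstance(obj["score"], (int, float)) and 0 <= obj["score"] <= 100):
        errors.append("score must be a number in [0, 100]")
    if obj.get("category") not in CATEGORIES:
        errors.append(f"invalid category: {obj.get('category')!r}")
    for i, finding in enumerate(obj.get("findings", [])):
        if finding.get("severity") not in SEVERITIES:
            errors.append(f"findings[{i}]: invalid severity")
    return errors
```

    The loop shape is what matters: every candidate instance goes through the validator, and anything with a non-empty error list is rejected.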
    

    Step 2: Generate Diverse, Valid Examples

    You need 500-1,000 training examples. Each must be a perfectly valid instance of your schema. There are three approaches to generating them:

    From production data: If you have existing data that follows this schema (from a database, from previous API outputs), convert it to training examples. This is the best source because it reflects real-world value distributions.

    From GPT-4 with validation: Use GPT-4 to generate diverse examples, then validate each one against your JSON Schema. Discard any that don't validate. At 95% compliance from GPT-4, you'll discard about 1 in 20 — acceptable for dataset generation. Run the validated examples through your actual parser to double-check.

    Programmatic generation: Write a script that generates random valid instances of your schema. This guarantees structural validity but may lack semantic coherence. Use it to supplement real examples, not as your only source.
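
    As a sketch of the programmatic approach, a seeded generator over the schema above might look like this. The placeholder strings are stand-ins rather than real report content, which is exactly why such examples should only supplement real data:

```python
import random

def random_report(rng: random.Random) -> dict:
    """Generate one structurally valid (if semantically bland) schema instance."""
    n_findings = rng.choice([0, 1, 1, 2, 3])  # include empty arrays deliberately
    return {
        "id": f"RPT-{rng.randint(0, 999999):06d}",
        "status": rng.choice(["approved", "denied", "pending", "review"]),
        "score": rng.randint(0, 100),  # occasionally hits the 0 and 100 extremes
        "category": rng.choice(["financial", "operational", "compliance", "technical"]),
        "findings": [
            {
                "description": f"Placeholder finding {i}",
                "severity": rng.choice(["low", "medium", "high", "critical"]),
                "recommendation": f"Placeholder recommendation {i}",
            }
            for i in range(n_findings)
        ],
        "metadata": {"analyst": "system", "date": "2026-03-15", "version": "1.0"},
    }

rng = random.Random(42)
samples = [random_report(rng) for _ in range(500)]
```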

    Step 3: Vary Inputs, Not Just Outputs

    Your training data needs diverse inputs paired with schema-compliant outputs. If every input is "Analyze this financial report," the model learns the schema but can't generalize to new input phrasings.

    Include variations:

    • Different input lengths (one sentence to three paragraphs)
    • Different input formats (formal requests, casual questions, structured instructions)
    • Different levels of context (complete information vs. partial information)
    • Edge cases (ambiguous inputs, inputs missing information, inputs that should produce empty arrays)

    Step 4: Include Edge Cases Explicitly

    The cases where prompted models fail most are edge cases. Make sure your training data covers:

    • Empty arrays: When findings should be [] because there are no findings
    • Minimum/maximum values: score: 0 and score: 100 — models tend to avoid extremes
    • Long string values: Fields with 200+ character values (models sometimes truncate)
    • Special characters: Quotes, newlines, unicode in string fields
    • Null handling: If your schema allows nullable fields, include examples with null values
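
    One way to keep yourself honest about coverage is to audit the dataset before training. A minimal sketch, with the edge cases mirroring the checklist above (the 200-character threshold comes from it):

```python
def audit_edge_cases(examples: list[dict]) -> dict:
    """Count how many schema instances cover each edge case from the checklist."""
    counts = {"empty_findings": 0, "score_0": 0, "score_100": 0,
              "long_strings": 0, "special_chars": 0}
    for ex in examples:
        if ex.get("findings") == []:
            counts["empty_findings"] += 1
        if ex.get("score") == 0:
            counts["score_0"] += 1
        if ex.get("score") == 100:
            counts["score_100"] += 1
        strings = [f.get("description", "") for f in ex.get("findings", [])]
        if any(len(s) >= 200 for s in strings):
            counts["long_strings"] += 1
        if any(c in s for s in strings for c in '"\n'):
            counts["special_chars"] += 1
    return counts
```

    If any counter is zero, the model will never have seen that case, and that is where production compliance will slip first.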

    Step 5: Format as Training Pairs

    Structure each example as a system/user/assistant conversation:

    {
      "messages": [
        {
          "role": "system",
          "content": "You are a report analysis system. Analyze the provided report and output a structured assessment."
        },
        {
          "role": "user",
          "content": "Report: Q3 revenue declined 12% due to supply chain disruptions. Two compliance violations were identified in the procurement process. The technical infrastructure handled the load increase from the holiday promotion without incidents."
        },
        {
          "role": "assistant",
          "content": "{\"id\": \"RPT-004521\", \"status\": \"review\", \"score\": 62, \"category\": \"operational\", \"findings\": [{\"description\": \"Revenue decline of 12% attributed to supply chain disruptions\", \"severity\": \"high\", \"recommendation\": \"Diversify supplier base and establish backup procurement channels\"}, {\"description\": \"Two compliance violations in procurement process\", \"severity\": \"medium\", \"recommendation\": \"Audit procurement workflows and retrain staff on compliance requirements\"}], \"metadata\": {\"analyst\": \"system\", \"date\": \"2026-03-15\", \"version\": \"1.0\"}}"
        }
      ]
    }
    

    The assistant's response must be raw JSON — no markdown formatting, no explanatory text, no code blocks. Train the model to output JSON and nothing else.
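
    Converting validated (input, instance) pairs into this chat format takes a few lines of stdlib Python. A sketch, assuming the system prompt from the example above:

```python
import json

SYSTEM_PROMPT = ("You are a report analysis system. Analyze the provided report "
                 "and output a structured assessment.")

def to_jsonl(pairs: list[tuple[str, dict]], path: str) -> None:
    """Write (user_input, schema_instance) pairs as chat-format JSONL records.

    The assistant turn is minified JSON with no markdown or commentary.
    """
    with open(path, "w", encoding="utf-8") as f:
        for user_input, instance in pairs:
            record = {
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": user_input},
                    # separators=(",", ":") strips all whitespace from the JSON
                    {"role": "assistant",
                     "content": json.dumps(instance, separators=(",", ":"))},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```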

    Measuring Schema Compliance

    Don't rely on "it looks right." Measure compliance programmatically:

    Structural Compliance

    Run every output through a JSON Schema validator. Count the percentage that pass. This catches missing fields, wrong types, invalid enum values, and schema structure violations.
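
    The measurement loop can be as simple as the following sketch, which uses a required-fields check as a stand-in for a full JSON Schema validator:

```python
import json

REQUIRED = {"id", "status", "score", "category", "findings", "metadata"}

def structural_compliance(raw_outputs: list[str]) -> float:
    """Fraction of raw model outputs that parse as JSON and contain every required field."""
    passed = 0
    for raw in raw_outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not even valid JSON
        if isinstance(obj, dict) and REQUIRED <= obj.keys():
            passed += 1
    return passed / len(raw_outputs) if raw_outputs else 0.0
```

    Tracking this number over time, on the same held-out inputs, is what tells you whether a new fine-tune actually improved compliance.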

    Semantic Compliance

    Structural validity doesn't mean the content is correct. A model can output {"status": "approved", "score": 0} — structurally valid but semantically suspect (approved with a score of 0?). Build semantic checks:

    • Do enum values correlate correctly with other fields?
    • Are numeric values in realistic ranges?
    • Do string values contain relevant content (not lorem ipsum or empty strings)?
    • Are array lengths appropriate for the input?
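
    Checks like these can be expressed as per-instance assertions. A sketch, with illustrative thresholds that you would tune for your own schema:

```python
def semantic_flags(obj: dict) -> list[str]:
    """Flag structurally valid instances whose values look semantically suspect."""
    flags = []
    status, score = obj.get("status"), obj.get("score", 0)
    # Cross-field correlation: an approval with a rock-bottom score is suspicious.
    if status == "approved" and score < 30:
        flags.append("approved with unusually low score")
    if status == "denied" and score > 80:
        flags.append("denied with unusually high score")
    for i, finding in enumerate(obj.get("findings", [])):
        desc = finding.get("description", "")
        if not desc.strip():
            flags.append(f"findings[{i}]: empty description")
        if "lorem ipsum" in desc.lower():
            flags.append(f"findings[{i}]: placeholder text")
    return flags
```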

    Comparison Metrics

    After fine-tuning, compare against your baseline:

    Metric                 Prompted GPT-4o   Fine-Tuned 8B   Fine-Tuned 8B + Grammar
    Valid JSON             99.5%             99.8%           100%
    Schema compliance      93-96%            99.2-99.7%      100%
    Semantic accuracy      90-94%            93-97%          93-97%
    Avg. tokens/response   350               280             280
    Cost per 1K calls      $2.50-$8.00       $0 (local)      $0 (local)

    The fine-tuned model is more schema-compliant, uses fewer tokens (it doesn't pad responses with unnecessary verbosity), and costs nothing per call.

    Combining Fine-Tuning with Constrained Decoding

    For applications where even 99.5% isn't enough — financial transactions, medical records, legal documents — combine fine-tuning with constrained decoding.

    Constrained decoding (also called grammar-guided generation) restricts the model's output tokens to only those that produce valid output according to a grammar specification. llama.cpp supports this via GBNF grammars. Outlines and guidance provide similar functionality for Python-based inference.

    A GBNF grammar for your schema:

    root   ::= "{" ws "\"id\":" ws string "," ws "\"status\":" ws status "," ws "\"score\":" ws number "," ws "\"category\":" ws category "," ws "\"findings\":" ws findings "," ws "\"metadata\":" ws metadata ws "}"
    status ::= "\"approved\"" | "\"denied\"" | "\"pending\"" | "\"review\""
    category ::= "\"financial\"" | "\"operational\"" | "\"compliance\"" | "\"technical\""
    ...
    

    With a fine-tuned model + grammar, you get:

    • 100% structural compliance (grammar prevents invalid structures)
    • 99.5%+ semantic accuracy (fine-tuning teaches correct values)
    • Zero per-token cost (local inference)
    • Fast inference (grammar constraints actually speed up generation slightly by reducing the token search space)

    The grammar handles the structure. The fine-tuning handles the content. Together, they produce output that your parser can trust unconditionally.
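
    Real constrained decoders (llama.cpp's GBNF engine, Outlines) mask logits over the tokenizer's vocabulary. A character-level toy conveys the core idea for the status enum, where each step only permits characters that can still complete a legal value:

```python
STATUSES = ["approved", "denied", "pending", "review"]

def allowed_next_chars(prefix: str) -> set[str]:
    """Characters the grammar permits after `prefix` when emitting a status value."""
    return {s[len(prefix)] for s in STATUSES
            if s.startswith(prefix) and len(s) > len(prefix)}

def constrained_pick(prefix: str, model_ranking: list[str]) -> str:
    """Take the model's highest-ranked character that the grammar allows."""
    allowed = allowed_next_chars(prefix)
    for ch in model_ranking:
        if ch in allowed:
            return ch
    raise ValueError("no legal continuation")
```

    Even if the model ranks an illegal character like "A" first (reaching for "Approved"), the mask forces a legal one, which is why structural compliance hits 100% regardless of what the model prefers.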

    Performance Impact

    Fine-tuned models generating structured output are typically faster than prompted models, for two reasons:

    1. Shorter outputs: Prompted models often include explanatory text, markdown formatting, or meta-commentary around the JSON. Fine-tuned models output raw JSON only. This reduces output tokens by 20-40%.

    2. More confident generation: When a model "knows" the schema (from fine-tuning), it generates each token with higher confidence. Less backtracking in the sampling process. Measurably lower time-to-first-token and faster token generation rate.

    Benchmark on a customer support ticket classification schema (8 fields, 2 enums, 1 nested object):

    Setup                             Avg. Output Tokens   Avg. Latency   Structural Validity
    GPT-4o + prompt                   420 tokens           1.8s           96.2%
    GPT-4o + structured outputs API   310 tokens           1.4s           99.8%
    Fine-tuned 8B (Ollama)            270 tokens           0.4s           99.5%
    Fine-tuned 8B + grammar           265 tokens           0.35s          100%

    The fine-tuned local model is 4x faster and produces shorter, more reliable output.

    Common Mistakes in Schema Fine-Tuning

    Mistake 1: Inconsistent Training Data

    If 80% of your training examples use "date": "2026-03-15" and 20% use "date": "March 15, 2026", the model learns both formats and probabilistically switches between them. Every field, every format, every convention must be 100% consistent across all training examples.

    Mistake 2: Not Training on Empty/Null Cases

    If your schema allows optional fields or empty arrays, but all training examples have populated values, the model will hallucinate values rather than output null or []. Include 10-15% of examples with each optional field set to null and each array empty.

    Mistake 3: Overly Homogeneous Inputs

    If all your training inputs are roughly the same length, complexity, and topic, the model overfits to that input distribution. When it sees a significantly different input in production, schema compliance drops. Vary your inputs aggressively.

    Mistake 4: Training on Pretty-Printed JSON

    If you train on formatted JSON with indentation, the model wastes tokens on whitespace. Train on minified JSON: {"id":"RPT-004521","status":"approved",...}. This reduces output tokens by 15-25% and improves generation speed.
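
    In Python, minification is one json.dumps argument away. A quick comparison:

```python
import json

instance = {"id": "RPT-004521", "status": "approved",
            "findings": [{"description": "Example", "severity": "low"}]}

pretty = json.dumps(instance, indent=2)
minified = json.dumps(instance, separators=(",", ":"))

# Pretty-printed JSON spends characters (and therefore tokens) on pure whitespace.
savings = 1 - len(minified) / len(pretty)
print(f"{len(pretty)} chars pretty vs {len(minified)} minified ({savings:.0%} smaller)")
```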

    Mistake 5: Not Validating the Training Data

    If even one training example has a schema violation, the model learns that violations are acceptable. Validate every single training example against your JSON Schema before including it in the dataset. Automate this. No exceptions.

    Getting Started

    1. Define your target schema as a formal JSON Schema
    2. Collect or generate 500-1,000 validated training examples
    3. Format as JSONL training pairs (system/user/assistant)
    4. Validate every example programmatically — discard any with schema violations
    5. Fine-tune on Ertas — Llama 3.1 8B or Qwen 2.5 7B are both strong choices for structured output
    6. Evaluate on 100+ held-out examples, measuring structural and semantic compliance
    7. Deploy via Ollama, optionally with GBNF grammar for guaranteed structure
    8. Monitor production outputs and add failure cases to your training data for the next iteration

    The goal is simple: your app's JSON parser should never fail because of the model. Fine-tuning gets you there.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
