
Fine-Tuning for Better JSON Output: Why Small Models Struggle and How to Fix It
How fine-tuning dramatically improves JSON output reliability in small models — from 60% valid JSON to 99%+ compliance, with practical techniques for structured output tasks.
If you have tried to get a 7B or 8B parameter model to produce reliable JSON, you know the problem. You write a careful prompt specifying the exact schema. The model produces valid JSON 60-70% of the time. The other 30-40% is a mix of trailing commas, missing closing braces, unquoted keys, and the occasional hallucinated field that was not in your schema.
This is not a prompting problem. It is a training data problem. And fine-tuning fixes it decisively.
The JSON Problem in Small Models
Large frontier models — GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro — produce valid JSON with high reliability. They were trained on massive datasets with extensive structured output examples, and their sheer parameter count gives them enough capacity to internalize JSON syntax alongside everything else they know.
Small models (1B-8B parameters) face a different situation. They saw JSON during pre-training, but their limited capacity means the structural rules compete with language fluency for the same weights. The result is a model that "sort of" knows JSON — it gets the general shape right but fails on details.
Here are the concrete failure modes, measured on Llama 3.1 8B base with a structured extraction task across 1,000 test inputs:
| Failure Type | Frequency |
|---|---|
| Valid JSON | 64.2% |
| Missing closing brace/bracket | 12.1% |
| Trailing comma after last element | 8.7% |
| Unquoted string values | 5.3% |
| Extra text before/after JSON | 4.9% |
| Wrong field names | 3.1% |
| Truncated output (incomplete) | 1.7% |
That 64.2% success rate is not production-ready. Even with retry logic, you are wasting tokens and adding latency. And some failure modes — like wrong field names or extra text wrapping the JSON — are hard to recover from programmatically.
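The cost of that retry logic is easy to see in code. Here is a minimal sketch of the usual workaround, where `generate` is a placeholder for your model call:

```python
import json

def generate_with_retries(generate, prompt, max_retries=3):
    """Retry generation until the output parses as JSON.

    `generate` is a stand-in for your model call (hypothetical).
    Every retry burns a full generation's worth of tokens and latency,
    and failures like wrong field names still parse "successfully",
    so this loop cannot catch them.
    """
    for _ in range(max_retries):
        raw = generate(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue
    raise ValueError(f"No valid JSON after {max_retries} attempts")
```

At a 64% validity rate, a three-attempt loop still fails outright about 5% of the time, and the average request pays for extra generations.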
Why Small Models Struggle with JSON
JSON syntax is rigid and unforgiving. Natural language is tolerant of small errors — a misspelled word or a grammar mistake rarely changes meaning. JSON has zero tolerance. A single missing quote, one extra comma, or a misplaced bracket produces invalid output that parsers reject entirely. Small models optimized for language fluency do not weight syntactic precision highly enough.
Smaller context windows lose track of nesting. When a model generates a deeply nested JSON object, it needs to remember every open brace and bracket to close them correctly. In a 3-level nested object with arrays, the model might be 200+ tokens into generation before it needs to close the outermost brace. Small models with limited attention capacity lose track of this structure.
Pre-training data mixes JSON with surrounding text. In the training corpus, JSON often appears inside markdown code blocks, documentation, API responses with headers, and other wrappers. The model learns that JSON "usually" has text around it, which is why it sometimes generates explanatory text before or after the JSON object.
Token boundaries do not align with JSON syntax. The tokenizer splits `{"name":` into tokens that do not map cleanly to JSON structural elements. The model must learn JSON structure at a level of abstraction above its token vocabulary, which requires more capacity than small models have to spare.
Why Fine-Tuning Fixes It
Fine-tuning on thousands of correct JSON examples does something that prompting cannot: it shifts the model's weights to prioritize structural correctness for your specific schema.
When you fine-tune on 2,000 examples of correctly formatted JSON, the model learns:
- The exact fields in your schema. It stops hallucinating field names because it has never seen any other fields in training.
- Correct nesting patterns. Seeing thousands of examples of properly nested objects builds robust structural understanding that generalizes to new inputs.
- Where the JSON starts and ends. Training examples that are pure JSON (no wrapper text) teach the model to output JSON directly.
- Edge case handling. Empty arrays, null values, strings containing special characters — all become reliable when explicitly represented in training data.
The difference is dramatic. Here are results from fine-tuning Llama 3.1 8B with LoRA (rank 16) on 2,000 examples of a product catalog extraction task:
| Metric | Base Model + Prompt | Fine-Tuned |
|---|---|---|
| Valid JSON | 64.2% | 99.2% |
| Correct schema (all fields present) | 58.1% | 98.7% |
| Correct field values | 71.3% | 94.6% |
| Median latency | 1.2s | 0.8s |
The fine-tuned model is not just producing valid JSON — it is producing the right JSON. Field names match the schema exactly. Value types are correct. The output parses cleanly on the first attempt 99.2% of the time.
Building the Training Dataset
The quality of your training data determines the quality of your JSON output. Here is the systematic approach:
Define Your Schema Precisely
Start with a JSON schema definition. Be explicit about every field:
```json
{
  "type": "object",
  "properties": {
    "product_name": { "type": "string" },
    "price": { "type": "number" },
    "currency": { "type": "string", "enum": ["USD", "EUR", "GBP"] },
    "features": { "type": "array", "items": { "type": "string" } },
    "in_stock": { "type": "boolean" },
    "metadata": {
      "type": "object",
      "properties": {
        "sku": { "type": "string" },
        "weight_kg": { "type": ["number", "null"] }
      }
    }
  },
  "required": ["product_name", "price", "currency", "features", "in_stock"]
}
```
Generate Diverse Examples
Use a frontier model (GPT-4o, Claude) to generate training pairs. The input is the raw text; the output is the correctly formatted JSON. Generate at least 2,000 examples and aim for diversity:
- Vary input length and complexity. Short product descriptions, long ones, ones with missing information.
- Include every edge case. Empty feature arrays. Null weights. Prices in every supported currency. Products with 1 feature and products with 20.
- Include adversarial inputs. Text that mentions JSON, text with brackets and braces in the content, text that could confuse the model about where the JSON should start.
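One way to get that diversity systematically is to cycle generation prompts through explicit variation axes rather than asking the frontier model for "2,000 examples" in one shot. The axes and phrasing below are illustrative, not prescriptive:

```python
import itertools

# Illustrative diversity axes; adjust to your schema and domain.
LENGTHS = ["a one-sentence", "a detailed multi-paragraph"]
CURRENCIES = ["USD", "EUR", "GBP"]
QUIRKS = ["", " Omit the weight.", " Include braces in the description text."]

def build_prompts(n):
    """Yield n generation prompts that cycle through every combination
    of length, currency, and edge-case quirk, so no axis is neglected.

    Send each prompt to a frontier model and keep the
    (description, JSON) pair it returns.
    """
    combos = itertools.cycle(itertools.product(LENGTHS, CURRENCIES, QUIRKS))
    for _ in range(n):
        length, currency, quirk = next(combos)
        yield (f"Write {length} product description priced in {currency}, "
               f"then the matching JSON object.{quirk}")
```

Cycling through the full cross-product guarantees that rare combinations (say, a long description with a missing weight in GBP) appear in the dataset instead of depending on the generator's whims.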
Validate Every Training Example
This is the step most people skip, and it is the most important. Run every training example through a JSON validator and a schema validator:
```python
import json
import jsonschema

clean_examples = []
for example in training_data:
    try:
        # 1. Is it valid JSON?
        parsed = json.loads(example["output"])
        # 2. Does it match the schema?
        jsonschema.validate(parsed, your_schema)
        # 3. Are the values reasonable?
        assert parsed["price"] > 0
        assert len(parsed["product_name"]) > 0
    except (json.JSONDecodeError, jsonschema.ValidationError, AssertionError):
        continue  # reject the malformed example
    clean_examples.append(example)
```
Reject any example that fails validation. A single malformed example in training teaches the model that malformed output is acceptable. Zero tolerance on training data quality.
Balance Your Edge Cases
If 95% of your training examples have non-null weight_kg and only 5% have null, the model will underperform on null cases. Deliberately oversample edge cases:
- 10-15% of examples should have empty arrays
- 10-15% should have null optional fields
- 5-10% should have very long string values
- 5-10% should have deeply nested structures
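A quick audit function makes the imbalance visible before you train. This sketch checks two of the ratios above against the product-catalog schema's field names; extend it with whichever edge cases matter for your schema:

```python
def audit_edge_cases(examples):
    """Report the share of key edge cases in parsed training outputs.

    Field names (`features`, `metadata.weight_kg`) follow the example
    schema; compare the ratios against the 10-15% targets above and
    oversample whichever cases fall short.
    """
    n = len(examples)
    return {
        "empty_features": sum(ex.get("features") == [] for ex in examples) / n,
        "null_weight": sum(
            ex.get("metadata", {}).get("weight_kg") is None for ex in examples
        ) / n,
    }
```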
Constrained Decoding: The Perfect Complement
Fine-tuning teaches the model to want to produce valid JSON. Constrained decoding (also called grammar-based sampling) forces the model to produce valid JSON by restricting token selection at each generation step to only tokens that maintain valid JSON syntax.
Used alone, constrained decoding has a problem: the model might produce syntactically valid JSON that is semantically wrong — correct braces and brackets, but wrong field names or values. The model's "intention" was to produce natural language, and the grammar constraint forced it into a JSON shape.
Used together, fine-tuning + constrained decoding gives you near-perfect results:
- Fine-tuning ensures the model "intends" to produce the right JSON schema with correct values
- Constrained decoding catches the remaining 0.8% of syntactic errors
This combination achieves 99.9%+ valid JSON rates in practice. On Ertas Deploy, grammar-based sampling is available as a configuration option for any deployed model.
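To make the mechanism concrete, here is a toy illustration. Production implementations (llama.cpp grammars, libraries like Outlines) operate over the model's full token vocabulary; this sketch only tracks bracket nesting, but the principle is the same: at each step, compute which continuations keep the output grammatical and mask out the rest.

```python
def legal_closer(prefix: str):
    """Return the only bracket that may legally close the innermost
    open scope of a JSON prefix, or None if we are inside a string
    literal or nothing is open.

    A constrained decoder applies a check like this at every
    generation step, zeroing the probability of any candidate token
    that would violate the grammar.
    """
    stack, in_string, escaped = [], False, False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]":
            stack.pop()
    if in_string or not stack:
        return None
    return "}" if stack[-1] == "{" else "]"
```

For example, given the prefix `{"features": [1, 2`, only `]` may close the current scope; a decoder enforcing this can never emit the "missing closing bracket" failure from the table above.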
Practical Tutorial: Fine-Tuning for an API Response Schema
Here is the end-to-end process for fine-tuning a model to produce API-compatible JSON responses:
1. Define the task. Extract structured product information from unstructured product descriptions and output a JSON object matching your API's response schema.
2. Prepare training data. Generate 2,500 input-output pairs using a frontier model. Validate all outputs against your schema. After validation and filtering, you have 2,200 clean pairs.
3. Format as JSONL. Each line contains an instruction, input, and output:
```json
{"instruction": "Extract product information as JSON.", "input": "The Aeropress Go travel coffee maker...", "output": "{\"product_name\": \"Aeropress Go\", \"price\": 39.95, ...}"}
```
4. Fine-tune. On Ertas Studio: upload the JSONL, select Llama 3.1 8B as the base model, set LoRA rank to 16, learning rate to 2e-4, and train for 3 epochs. Training takes approximately 45 minutes on a single A100.
5. Evaluate. Run 200 held-out test examples through the fine-tuned model. Measure JSON validity rate, schema compliance, and value accuracy. Compare against the base model with your best prompt.
6. Deploy with grammar constraints. Deploy on Ertas Deploy with JSON grammar sampling enabled. This catches the last fraction of a percent of syntax errors.
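The evaluation in step 5 can be sketched in a few lines. The `REQUIRED` list mirrors the schema's required fields from earlier; run the same function over base-model and fine-tuned outputs to build your own comparison table:

```python
import json

# Required fields from the example schema defined earlier.
REQUIRED = ["product_name", "price", "currency", "features", "in_stock"]

def evaluate(raw_outputs, required=REQUIRED):
    """Score held-out generations on JSON validity and schema compliance."""
    valid = compliant = 0
    for raw in raw_outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # counts against the validity rate
        valid += 1
        if all(field in parsed for field in required):
            compliant += 1
    n = len(raw_outputs)
    return {"valid_json": valid / n, "schema_ok": compliant / n}
```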
Implications for Tool Use and Function Calling
JSON output reliability is not just about data extraction. It is the foundation for:
Agent tool calling. When an AI agent needs to call functions, it generates JSON arguments. A model that produces invalid JSON 35% of the time is an agent that fails 35% of the time. Fine-tuning the model on your specific function signatures makes tool calling reliable.
Multi-step workflows. In pipelines where one model's JSON output feeds into another model or service, a single malformed response breaks the entire chain. Fine-tuning makes pipelines robust.
Client-facing APIs. If you are building a product where the AI model's output is part of your API response, JSON reliability is a product quality issue. Your customers do not care why the JSON is malformed — they just see an error.
Structured logging and analytics. Models that produce consistent JSON schemas enable downstream analytics. When the schema is reliable, you can pipe model outputs directly into databases and dashboards without an error-handling layer.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
The Bottom Line
Small models are capable of producing perfectly reliable JSON output. They just need the right training signal. A 2,000-example fine-tuning dataset, built with proper validation and edge case coverage, transforms a 64% JSON validity rate into 99%+. Add constrained decoding and you are at 99.9%.
The investment is small: a few hours of data preparation, $20-50 in generation costs for training data, and 45-90 minutes of training time. The payoff is a model that produces production-ready structured output every single time.
Related reading:
- Fine-Tune a Model for Your App — broader guide to application-specific fine-tuning
- How to Fine-Tune an LLM: The Complete Guide — the technical fundamentals of the fine-tuning process