Structured data extraction
Fine-tune a 3B-class model to turn OCR'd receipts and invoices into strict JSON on-device, with recovery patterns for the times OCR is half-wrong.
Many of the most useful on-device AI features are extraction: take a photograph, an email, a PDF, or a chunk of free text, and turn it into fields an app can act on. The shape is consistent across use cases (receipts, invoices, business cards, intake forms, lab reports): noisy text in, strict JSON out, no fluff between the braces. The fine-tune is where you teach the model that strict means strict.
This recipe uses a worked example: imagine Ramp wants to extract line items from photographed receipts on-device, so users can submit expenses without an image of the receipt leaving the phone. The pipeline is OCR-then-LLM (the camera produces an image, Vision Framework or ML Kit gives you text, the fine-tuned model turns the text into JSON). Ramp is the anchor because expense-receipt extraction is the most-asked enterprise-extraction pattern and the privacy story for on-device is strong; substitute any vertical where you need fields out of paperwork.
Recipe also fits
The Ramp-shaped example is one of many. The same shape applies to:
- Expense management and corporate-card apps (Expensify, Brex, Pleo) that extract receipt and invoice fields.
- Accounting and bookkeeping tools (QuickBooks, Xero, FreshBooks) that ingest vendor bills.
- Invoice and AP processing (Bill.com, Stampli, Tipalti) where the field structure is well-defined and the input is OCR'd text.
- CRM lead extraction from business cards, sign-up forms, or scraped web data.
- Medical record digitisation (intake forms, lab results) where privacy makes on-device extraction a feature, not just a cost saving.
- Legal document intake and HR onboarding paperwork pipelines.
If your product turns paperwork into fields users can act on, this is the recipe to start with.
When this is the right fit
This recipe is the right fit when:
- The output schema is well-defined and stable. You can write down the JSON keys, types, and value formats once and stick with them. Vague schemas like "extract whatever's interesting" do not train well.
- The input is text, not pixels. The model takes OCR output, not images. If you need image-to-JSON in one step, that is a vision-language model (LLaVA, Qwen 2.5 VL) and a different recipe. The OCR-then-LLM split lets you use a smaller LLM and a separate, mature OCR layer.
- You need JSON the app can parse without retries. Strict JSON, valid against your schema, every time. The dataset and probe set both work hard on this.
- You can tolerate field-by-field accuracy in the 85 to 95% range on a clean OCR pass, falling toward 70% on messy receipts. If you need higher reliability, route to a frontier API or to a human in the loop.
It is not the right fit when:
- The schema changes weekly. Each schema change means a retrain. If your data shape is in flux, use a frontier API with structured output until the schema settles.
- You need multimodal reasoning (the model has to look at the photo, not just the OCR output). Extraction-from-image is a different task; see the model catalogue for vision-language options.
- The text input is unconstrained free-form and the right answer requires deep world knowledge ("extract the company's strategic priorities from this 50-page annual report"). That is a summarisation task, not extraction; see Document summariser.
The schema
Pick the schema and write it down before you write a single training row. For the Ramp-style receipt scenario:
{
  "vendor": "Joe's Coffee",
  "date": "2026-05-19",
  "currency": "USD",
  "subtotal": 12.50,
  "tax": 1.06,
  "tip": 2.00,
  "total": 15.56,
  "line_items": [
    {"name": "Latte", "quantity": 1, "unit_price": 5.50, "total": 5.50},
    {"name": "Croissant", "quantity": 1, "unit_price": 7.00, "total": 7.00}
  ],
  "payment_method": "card_ending_4242",
  "notes": null
}
Decisions baked into this schema:
- Dates in ISO 8601, not the receipt's format. The model normalises.
- Numbers as numbers, not strings. JSON numbers; the model emits 12.50, not "12.50".
- Currency as ISO 4217 (USD, EUR, JPY), inferred from receipt cues (symbols, country, language).
- Nullable fields explicitly null, not omitted. A consistent shape is easier to parse and easier to validate against the schema.
- line_items may be an empty array for receipts where individual items are illegible.
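The decisions above can be turned into a small checker that runs over labels during dataset prep and over model output at evaluation time. The following is a minimal sketch using only the Python standard library; the names `RECEIPT_SPEC`, `LINE_ITEM_SPEC`, `check_fields`, and `validate_receipt` are illustrative assumptions, not part of Ertas, and the currency check only tests the three-letter shape rather than the full ISO 4217 code list.

```python
import re
from datetime import date

NUMBER = (int, float)
# Top-level keys and the Python type each non-null value must have.
RECEIPT_SPEC = {
    "vendor": str, "date": str, "currency": str,
    "subtotal": NUMBER, "tax": NUMBER, "tip": NUMBER, "total": NUMBER,
    "line_items": list, "payment_method": str, "notes": str,
}
LINE_ITEM_SPEC = {"name": str, "quantity": NUMBER, "unit_price": NUMBER, "total": NUMBER}


def check_fields(obj, spec, prefix=""):
    """Compare one JSON object against a key -> type spec; null is always allowed."""
    errors = []
    for key in sorted(set(obj) ^ set(spec)):
        errors.append(f"{prefix}{key}: missing or unexpected key")
    for key, expected in spec.items():
        value = obj.get(key)
        if value is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{prefix}{key}: wrong type {type(value).__name__}")
    return errors


def validate_receipt(receipt):
    """Return a list of problems; an empty list means the receipt conforms."""
    if not isinstance(receipt, dict):
        return ["top level is not a JSON object"]
    errors = check_fields(receipt, RECEIPT_SPEC)
    day = receipt.get("date")
    if isinstance(day, str):
        try:
            if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", day):
                raise ValueError(day)
            date.fromisoformat(day)  # rejects impossible dates such as 2026-02-30
        except ValueError:
            errors.append("date: not an ISO 8601 calendar date (YYYY-MM-DD)")
    currency = receipt.get("currency")
    if isinstance(currency, str) and not re.fullmatch(r"[A-Z]{3}", currency):
        errors.append("currency: not a three-letter ISO 4217 style code")
    items = receipt.get("line_items")
    if not isinstance(items, list):
        errors.append("line_items: must be an array, [] when nothing is legible")
    else:
        for i, item in enumerate(items):
            if isinstance(item, dict):
                errors += check_fields(item, LINE_ITEM_SPEC, f"line_items[{i}].")
            else:
                errors.append(f"line_items[{i}]: not an object")
    return errors
```

Returning a list of problems instead of raising on the first one makes it easier to count which schema rule fails most often across a batch.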
The dataset
For the Ramp-style scenario, 5,000 rows is a reasonable starting point. Tune up if quality lags or down if authoring at scale is hard. Each row is one (OCR'd receipt text, JSON output) pair. A typical mix of sources:
| Source | Share | Notes |
|---|---|---|
| Real receipts (anonymised) | 50% (~2,500) | If you have an existing receipt corpus, this is the highest-signal data. Strip PII before training |
| Public receipt datasets | 20% (~1,000) | CORD, SROIE, and similar academic datasets have annotated receipt images and OCR-style text |
| Synthetic from invoice templates | 20% (~1,000) | Generate receipts from templated layouts with randomised vendors, dates, items |
| Adversarial OCR noise rows | 10% (~500) | Take clean rows and machine-introduce common OCR errors (rn → m, 1 → l, 5 → S, missing characters, line wraps) |
License-check each named dataset (CORD, SROIE, and similar academic receipt corpora) before commercial use; each carries its own academic-use terms and they have changed over time.
The adversarial subset is the difference between a model that handles a clean-receipt demo and a model that handles a wrinkled-receipt-photographed-in-bad-light reality.
Dataset format
Use the input/output schema. The instruction is the same every row; the input is the OCR text, the output is the JSON.
{
  "instruction": "Extract receipt fields. Output strict JSON matching the schema. Use null for missing fields. Dates in ISO 8601. Currency as ISO 4217. Numbers as numbers, not strings.",
  "input": "JOE'S COFFEE\n123 Main St\n2026-05-19 14:32\n\nLatte 5.50\nCroissant 7.00\n\nSubtotal 12.50\nTax (8.5%) 1.06\nTip 2.00\nTOTAL 15.56\n\nVisa **** 4242",
  "output": "{\"vendor\": \"Joe's Coffee\", \"date\": \"2026-05-19\", \"currency\": \"USD\", \"subtotal\": 12.50, \"tax\": 1.06, \"tip\": 2.00, \"total\": 15.56, \"line_items\": [{\"name\": \"Latte\", \"quantity\": 1, \"unit_price\": 5.50, \"total\": 5.50}, {\"name\": \"Croissant\", \"quantity\": 1, \"unit_price\": 7.00, \"total\": 7.00}], \"payment_method\": \"card_ending_4242\", \"notes\": null}"
}
A few non-obvious decisions:
- The output is one JSON string on one line. No pretty-printing. Pretty-printed training output produces pretty-printed inference output, which is slower (more tokens) and more failure-prone (newlines inside JSON are easy to break).
- Every row has the full schema keys. No omissions. If the receipt does not show a tip line, tip is 0 or null, not absent (pick one convention and apply it to every row; probe 2 in this recipe expects null). This trains the model to produce the same shape every time.
- Currency inferred from cues, not just from the symbol. A receipt that says "USD" gets USD; a receipt that shows "£" gets GBP; a receipt with no currency cue at all defaults to null rather than guessing.
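One way to enforce the one-line, full-key rules is to build every row through a single helper. The following is a sketch under assumptions: `make_row` and `SCHEMA_KEYS` are hypothetical names, and the exact file format your training upload expects should be checked separately.

```python
import json

INSTRUCTION = (
    "Extract receipt fields. Output strict JSON matching the schema. "
    "Use null for missing fields. Dates in ISO 8601. Currency as ISO 4217. "
    "Numbers as numbers, not strings."
)
SCHEMA_KEYS = [
    "vendor", "date", "currency", "subtotal", "tax", "tip", "total",
    "line_items", "payment_method", "notes",
]


def make_row(ocr_text, fields):
    """Build one training row with every schema key present and one-line output."""
    unknown = set(fields) - set(SCHEMA_KEYS)
    if unknown:
        raise ValueError(f"fields not in the schema: {sorted(unknown)}")
    # Fill every key in a fixed order; anything the annotator left out becomes null.
    full = {key: fields.get(key) for key in SCHEMA_KEYS}
    if full["line_items"] is None:
        full["line_items"] = []  # the schema uses an empty array, not null
    # No indent argument, so the output stays on a single line.
    # ensure_ascii=False keeps non-English item names verbatim.
    # allow_nan=False refuses NaN and Infinity, which are not valid JSON.
    output = json.dumps(full, ensure_ascii=False, allow_nan=False)
    return {"instruction": INSTRUCTION, "input": ocr_text, "output": output}


row = make_row("JOE'S COFFEE\nTOTAL 15.56", {"vendor": "Joe's Coffee", "total": 15.56})
```

Note that `json.dumps` writes 12.50 as 12.5. Both are the same JSON number, so the parsed value is unchanged; only the token sequence the model sees differs from the hand-written sample.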
Adversarial OCR noise
The 10% adversarial subset is the highest-value 500 rows in the dataset. Apply a small set of transformations to clean rows:
| Transformation | Example |
|---|---|
| rn → m | "morning" becomes "moming" |
| 1 ↔ l, 0 → O | "Total 10.00" becomes "Tota1 lO.OO" |
| Missing characters | "Subtotal" becomes "Subttal" or "Subtoal" |
| Line wraps mid-word | "Croissant 7.00" becomes "Crois-\nsant 7.00" |
| Double-spaced columns | "Latte 5.50" becomes "Latte     5.50" |
| Cut-off bottom | The receipt's TOTAL line is missing entirely; the model has to compute it or null it |
The model that has trained on these is meaningfully more robust at runtime. The model that has not learned the boundary between "infer the missing value" and "null the missing value" tends to hallucinate values that look plausible but are not in the text.
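The character-level transformations in the table can be generated mechanically from clean rows. The following is a minimal sketch; `add_ocr_noise`, `cut_off_bottom`, and the probability defaults are illustrative assumptions rather than tuned values. The JSON label stays unchanged for character noise, since the model is meant to recover the clean value. For a cut-off row, the label's total has to be edited separately to null (or to the computed sum, whichever convention the dataset uses).

```python
import random

# Substitutions taken from the table above: (clean text, OCR-style misread).
CHAR_SWAPS = [("rn", "m"), ("1", "l"), ("0", "O"), ("5", "S")]


def add_ocr_noise(text, rng, swap_prob=0.3, drop_prob=0.02):
    """Return a noisy copy of text; rng is a random.Random for reproducibility."""
    for src, dst in CHAR_SWAPS:
        # Decide independently for each occurrence whether to misread it.
        parts = text.split(src)
        rebuilt = parts[0]
        for part in parts[1:]:
            rebuilt += (dst if rng.random() < swap_prob else src) + part
        text = rebuilt
    # Drop occasional characters, but keep newlines so the line layout survives.
    return "".join(ch for ch in text if ch == "\n" or rng.random() >= drop_prob)


def cut_off_bottom(text, n_lines=1):
    """Simulate a photo that misses the last n_lines of the receipt."""
    lines = text.split("\n")
    return "\n".join(lines[:-n_lines]) if len(lines) > n_lines else text


rng = random.Random(42)  # fixed seed so the adversarial subset can be regenerated
noisy = add_ocr_noise("Latte 5.50\nCroissant 7.00\nTOTAL 12.50", rng)
```

Seeding the generator per dataset build keeps the adversarial subset stable between training runs, which makes before/after comparisons fair.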
Strict JSON in training equals strict JSON in production. The dataset rule: every output row is parseable by a strict JSON parser. If a single row has a trailing comma, the model will sometimes produce trailing commas. Run every output row through json.loads (or your language's equivalent) during dataset prep and drop the unparseable ones.
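A sketch of that filtering step in Python follows. One caveat worth knowing: `json.loads` accepts `NaN`, `Infinity`, and duplicate keys by default, so the sketch tightens it with `parse_constant` and `object_pairs_hook`. The helper names (`is_strict_json_object`, `filter_rows`) are hypothetical, and rows are assumed to be dicts with an "output" string as in the dataset format above.

```python
import json


def _reject_constant(name):
    # Called by json.loads for NaN, Infinity and -Infinity, none of which are valid JSON.
    raise ValueError(f"non-standard JSON constant: {name}")


def _reject_duplicates(pairs):
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate key in object")
    return dict(pairs)


def is_strict_json_object(text):
    """True only if text parses as a single JSON object under the stricter rules."""
    try:
        parsed = json.loads(
            text,
            parse_constant=_reject_constant,
            object_pairs_hook=_reject_duplicates,
        )
    except (ValueError, TypeError):  # JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(parsed, dict)


def filter_rows(rows):
    """Split rows into (kept, dropped) based on whether row['output'] is strict JSON."""
    kept, dropped = [], []
    for row in rows:
        (kept if is_strict_json_object(row.get("output")) else dropped).append(row)
    return kept, dropped
```

Keeping the dropped rows instead of discarding them silently lets you inspect whether one annotation tool or template is responsible for most of the breakage.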
Optional: a DPO pass for JSON discipline
If your supervised fine-tune still produces broken JSON at, say, 2% of inferences (one in fifty), a small Direct Preference Optimisation (DPO) pass tightens it noticeably. Take 500 of your existing rows, generate "chosen" outputs (your existing strict JSON) and "rejected" outputs (deliberately broken: trailing commas, unquoted keys, missing brackets, paraphrased numbers). DPO teaches the model to prefer the strict shape over the broken one even on out-of-distribution inputs. See SFT vs DPO for the broader pattern; the DPO feature in Ertas is on the roadmap.
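Building the rejected side of each pair can be automated from the existing rows. The sketch below covers three of the breakages named above (trailing comma, unquoted keys, missing closing brace); paraphrased numbers would need their own corruption. The `prompt`/`chosen`/`rejected` field names follow a common open-source DPO convention and are an assumption here, since the Ertas DPO format is not yet published.

```python
import json
import random
import re


def trailing_comma(text):
    return text[:-1] + ",}"  # {"a": 1}  ->  {"a": 1,}


def unquoted_keys(text):
    return re.sub(r'"(\w+)":', r"\1:", text)  # {"a": 1}  ->  {a: 1}


def missing_bracket(text):
    return text[:-1]  # {"a": 1}  ->  {"a": 1


CORRUPTIONS = [trailing_comma, unquoted_keys, missing_bracket]


def is_valid_json(text):
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


def make_dpo_pair(row, rng):
    """Turn one SFT row into a preference pair with a deliberately broken 'rejected'."""
    chosen = row["output"].strip()
    if not (is_valid_json(chosen) and chosen.endswith("}")):
        raise ValueError("chosen output must be a strict JSON object")
    rejected = rng.choice(CORRUPTIONS)(chosen)
    # Guard against a corruption that accidentally still parses (for example on "{}").
    if is_valid_json(rejected):
        raise ValueError("corruption did not break the JSON")
    prompt = f'{row["instruction"]}\n\n{row["input"]}'
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Because the rejected output differs from the chosen one only in the syntax error, the preference signal is about JSON discipline, not about the extracted values.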
The base model
Pick Qwen 2.5 3B Instruct from the Ertas catalogue. The reasoning:
- Qwen 2.5's base instruction tuning handles structured-output tasks reliably. Out of the box it produces valid JSON more often than equivalent-size Llama and Gemma bases on these tasks.
- At Q4_K_M, the GGUF is about 2.1 GB. That fits comfortably on the iPhone and mid-tier Android phones a typical expense app needs to support.
- The 32k context window is plenty for receipts; even an itemised hotel folio rarely runs past 2,000 OCR'd tokens.
GPU tier: Qwen 2.5 3B Instruct trains on a T4, fitting the Free plan. Paid-plan upgrade options are documented two paragraphs down (Qwen 2.5 7B Instruct or equivalent A10G models).
If you need to go smaller (cheaper Android devices, web target), Qwen 2.5 1.5B at Q4_K_M (~1 GB) is the next step down. The JSON discipline holds at this size; complex receipt schemas (long itemisation, multi-currency) start to slip. Pair it with the DPO pass.
If your schema is much more complex (long invoices with 50+ line items, multi-page documents), a 7B-class model such as Qwen 2.5 7B Instruct may be warranted; that needs a paid-plan GPU tier. Test the 3B first; it is usually enough.
Training config
For 5,000 rows of (OCR, JSON) pairs, start with:
| Setting | Value | Why |
|---|---|---|
| Schema | input/output | Single-task fit |
| Epochs | 4 | More than the other recipes; JSON discipline needs the extra passes to stabilise |
| Learning rate | 2e-4 | Standard SFT default |
| LoRA rank | 16 | The model is learning both the schema and the OCR-noise recovery |
| LoRA alpha | 32 | 2x rank |
| Batch size | 4 | OCR text is short; comfortable batch size |
| Grad accumulation | 4 | Effective batch 16 |
| Warmup | 5% of steps | Standard |
| Max sequence length | 2048 | Sufficient for typical receipts plus the JSON output |
Wall-clock time and credit cost depend on the GPU tier and dataset size. Ertas's Training Config picker shows an estimate before you press play; see Credits and usage for the current rates.
The four-epoch choice is the most important difference from the other recipes. Structured-output tasks benefit from the extra reinforcement; the model needs to see the JSON shape often enough that it becomes the default behaviour. Three epochs leave about 1 to 2% of inferences with subtle JSON breakages; four epochs are usually enough to drop that under 0.5%.
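To see what these settings mean in optimiser steps, the arithmetic can be sketched as below. The function name is illustrative, and the exact step count depends on how a trainer treats the final partial batch; this sketch assumes it is kept (rounded up).

```python
import math


def training_schedule(rows, epochs, batch_size, grad_accum, warmup_fraction):
    """Derive step counts from the table's settings."""
    effective_batch = batch_size * grad_accum            # rows consumed per optimiser step
    steps_per_epoch = math.ceil(rows / effective_batch)  # last partial batch still counts
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_fraction)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


plan = training_schedule(rows=5000, epochs=4, batch_size=4, grad_accum=4, warmup_fraction=0.05)
print(plan)
# {'effective_batch': 16, 'steps_per_epoch': 313, 'total_steps': 1252, 'warmup_steps': 63}
```

So the fourth epoch adds roughly 313 optimiser steps on top of a three-epoch run, and warmup covers about the first 63 steps.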
Integration: iOS via llamadart + Vision Framework
Ramp's iOS app uses Vision Framework for OCR (it ships in iOS, runs on the Neural Engine, and produces good output for receipts). The fine-tuned model takes the Vision output and produces the JSON.
// ReceiptExtractor.swift
import UIKit
import Vision
import LlamaDart // hypothetical iOS binding, see Ship: iOS

enum ExtractorError: Error {
    case invalidImage
    case userRecoveryNeeded(ocrText: String)
}

class ReceiptExtractor {
    private let engine: LlamaEngine
    private let session: ChatSession

    init(modelPath: String) async throws {
        // Load the model into a local first, then assign the stored properties.
        let engine = LlamaEngine()
        try await engine.loadModel(path: modelPath)
        self.engine = engine
        session = ChatSession(engine: engine)
    }

    func extract(from image: UIImage) async throws -> Receipt {
        let ocrText = try await runOCR(on: image)
        let json = try await runLLM(on: ocrText)
        // The schema uses snake_case keys (line_items, payment_method);
        // map them onto the camelCase Swift properties.
        let decoder = JSONDecoder()
        decoder.keyDecodingStrategy = .convertFromSnakeCase
        return try decoder.decode(Receipt.self, from: Data(json.utf8))
    }

    private func runOCR(on image: UIImage) async throws -> String {
        guard let cgImage = image.cgImage else { throw ExtractorError.invalidImage }
        let request = VNRecognizeTextRequest()
        request.recognitionLevel = .accurate
        request.usesLanguageCorrection = false // false avoids autocorrecting receipt text
        try VNImageRequestHandler(cgImage: cgImage).perform([request])
        guard let observations = request.results else { return "" }
        // Keep the top candidate per detected line, one OCR line per text line.
        return observations
            .compactMap { $0.topCandidates(1).first?.string }
            .joined(separator: "\n")
    }

    private func runLLM(on ocrText: String) async throws -> String {
        // Same instruction text as the training rows, followed by the OCR text.
        let prompt = """
        Extract receipt fields. Output strict JSON matching the schema. \
        Use null for missing fields. Dates in ISO 8601. Currency as ISO 4217. \
        Numbers as numbers, not strings.
        \(ocrText)
        """
        let response = try await session.generate(
            prompt: prompt,
            options: GenerateOptions(temperature: 0.1, topP: 0.9, maxTokens: 800)
        )
        session.reset() // each receipt is independent; do not carry chat history over
        return response.trimmingCharacters(in: .whitespacesAndNewlines)
    }
}

struct Receipt: Decodable {
    let vendor: String?
    let date: String?
    let currency: String?
    let subtotal: Double?
    let tax: Double?
    let tip: Double?
    let total: Double?
    let lineItems: [LineItem]
    let paymentMethod: String?
    let notes: String?
}

// Mirrors one entry of line_items in the schema.
struct LineItem: Decodable {
    let name: String?
    let quantity: Double?
    let unitPrice: Double?
    let total: Double?
}
Key choices:
- usesLanguageCorrection: false for OCR. Receipts have product names, vendor names, and codes that Vision's spell-correction wrecks if left on.
- Temperature 0.1: low. JSON output should be near-deterministic.
- maxTokens: 800: caps the JSON length. A receipt with 30 line items takes about that.
- Decode through a strict Codable struct. If the model's output does not decode, you find out immediately at the boundary; you do not ship JSON to the next layer that pretends to be schema-compliant.
Recovery patterns
Even at 0.5% JSON failures, a high-volume expense app sees them. Two recovery patterns worth wiring in:
- Re-run with stricter sampling. If the first inference fails to parse, re-run with temperature: 0.0 (greedy). This catches most of the failures, and because greedy decoding is deterministic, repeating the retry on the same input gives the same result.
- Surface to user. After two failures, show the OCR text in a form view and let the user type the fields. The model becomes a productivity boost, not a single point of failure.
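func extractWithRecovery(from image: UIImage) async throws -> Receipt {
    do {
        return try await extract(from: image)
    } catch {
        // First attempt failed to parse. Retry greedy.
        // runLLMGreedy(on:) is assumed to be runLLM(on:) with temperature 0.0.
        let ocrText = try await runOCR(on: image)
        let json = try await runLLMGreedy(on: ocrText)
        let decoder = JSONDecoder()
        decoder.keyDecodingStrategy = .convertFromSnakeCase // snake_case keys to camelCase properties
        do {
            return try decoder.decode(Receipt.self, from: Data(json.utf8))
        } catch {
            // Second failure: hand the OCR text to the manual-entry form.
            throw ExtractorError.userRecoveryNeeded(ocrText: ocrText)
        }
    }
}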
Probe set
Ten prompts that exercise the failure modes. Each is an OCR output; pass criteria are about the JSON.
| # | Receipt type | Pass criteria |
|---|---|---|
| 1 | Clean US restaurant receipt | All fields populated, math reconciles (subtotal + tax + tip = total) |
| 2 | European grocery (EUR, no tip line) | currency is EUR, tip is null, line items populated |
| 3 | Taxi receipt (no itemisation) | line_items is [], total is populated, payment method captured |
| 4 | Hotel folio (15+ line items across two days) | All line items captured; date is the checkout date |
| 5 | Receipt with tip written by hand at the bottom | The handwritten amount is in tip; the model recognises the digit even with OCR ambiguity |
| 6 | Receipt with the TOTAL line cut off | total is null, not invented; other fields populated correctly |
| 7 | Receipt where OCR misread 5.50 as S.50 | The model recovers to 5.50 using context (subtotal math) or flags as null |
| 8 | Bilingual receipt (English vendor name, Japanese line items) | English fields preserved, Japanese line item names preserved verbatim, currency is JPY |
| 9 | Adversarial: blurred receipt with three plausible vendor names in the OCR text | Model picks the most likely vendor; does not concatenate |
| 10 | Adversarial: OCR text containing a stray trailing comma (for example, after the last line item) | Model output JSON does NOT have a trailing comma; passes a strict parser |
Most of these should pass cleanly on a properly trained model. Probe 6 (null on missing data instead of invention) and probe 10 (strict JSON discipline) are the canonical pass/fail tells.
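Several of the pass criteria can be checked mechanically once the model output for each probe has been collected. The sketch below covers strict parsing, expected field values (including the null expected by probe 6), and the subtotal + tax + tip = total reconciliation from probe 1. `run_probe` and `math_reconciles` are hypothetical helper names, and the half-cent tolerance is an assumption to absorb floating-point rounding.

```python
import json

AMOUNT_PARTS = ("subtotal", "tax", "tip")


def math_reconciles(receipt, tolerance=0.005):
    """True/False when the check applies, None when subtotal or total is null."""
    subtotal, total = receipt.get("subtotal"), receipt.get("total")
    if subtotal is None or total is None:
        return None
    try:
        # A null tax or tip counts as zero for the purpose of the sum.
        expected = sum(receipt.get(key) or 0 for key in AMOUNT_PARTS)
        return abs(expected - total) <= tolerance
    except TypeError:  # a string where a number should be
        return False


def run_probe(output_text, expected_fields, require_math=False):
    """Return a list of failures for one probe; an empty list means it passed."""
    try:
        receipt = json.loads(output_text)
    except ValueError:
        return ["output is not strict JSON"]
    if not isinstance(receipt, dict):
        return ["output is not a JSON object"]
    failures = []
    for key, want in expected_fields.items():
        if key not in receipt:
            failures.append(f"{key}: key is absent")
        elif receipt[key] != want:
            failures.append(f"{key}: expected {want!r}, got {receipt[key]!r}")
    if require_math and math_reconciles(receipt) is not True:
        failures.append("subtotal + tax + tip does not equal total")
    return failures
```

For example, probe 6 would be run as `run_probe(model_output, {"total": None})`, and probe 10 passes only if the first check (strict parsing) does not fail.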
Limits
- OCR quality is the ceiling. If the OCR step misses 40% of the text, no amount of model quality recovers it. The model is only as good as the text it sees; budget engineering time for OCR tuning (lighting hints, retry capture, perspective correction).
- Schema rigidity. Adding a new field to the schema means a retrain. Plan schema changes carefully; a "miscellaneous" string field as an escape hatch in v1 saves retrains later.
- Long documents. This recipe is sized for receipts (up to about 2,000 OCR'd tokens). Multi-page invoices need a different pattern: chunk the input, extract per page, merge in app code.
- Hallucination on missing data. Even with adversarial training, the model occasionally invents a value when it should null. The probe set catches systematic cases; surface low-confidence fields in the UI for user confirmation.
- No image input. This recipe is text-in, text-out. If you need image-in, you need a vision-language model and a different integration path. See the model catalogue for VLM options as they ship.
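For the long-document limit above, the "extract per page, merge in app code" pattern could look like the following sketch. It assumes each page has already been run through the model separately, that header fields (vendor, date, currency) appear on the earliest page that mentions them, and that totals are printed on the last page that carries them. `merge_pages` is a hypothetical helper, and real invoices may need different merge rules.

```python
SCHEMA_KEYS = [
    "vendor", "date", "currency", "subtotal", "tax", "tip", "total",
    "line_items", "payment_method", "notes",
]
AMOUNT_FIELDS = {"subtotal", "tax", "tip", "total"}


def first_non_null(values):
    """Return the first value that is not None, or None if there is none."""
    return next((value for value in values if value is not None), None)


def merge_pages(pages):
    """Merge per-page extraction results (a list of dicts, in page order)."""
    merged = {}
    for key in SCHEMA_KEYS:
        if key == "line_items":
            # Concatenate items in page order; a null or missing list counts as empty.
            merged[key] = [item for page in pages for item in (page.get(key) or [])]
        elif key in AMOUNT_FIELDS:
            # Assumption: totals live on the last page that prints them.
            merged[key] = first_non_null(page.get(key) for page in reversed(pages))
        else:
            # Assumption: header fields come from the earliest page that has them.
            merged[key] = first_non_null(page.get(key) for page in pages)
    return merged
```

The merged result has the same shape as a single-page extraction, so the same schema validation and decoding path can be reused downstream.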