Structured data extraction

    Fine-tune a 3B-class model to turn OCR'd receipts and invoices into strict JSON on-device, with recovery patterns for the times OCR is half-wrong.

    Many of the most useful on-device AI features are extraction: take a photograph, an email, a PDF, or a chunk of free text, and turn it into fields an app can act on. The shape is consistent across use cases (receipts, invoices, business cards, intake forms, lab reports): noisy text in, strict JSON out, no fluff between the braces. The fine-tune is where you teach the model that strict means strict.

    This recipe uses a worked example: imagine Ramp wants to extract line items from photographed receipts on-device, so users can submit expenses without an image of the receipt leaving the phone. The pipeline is OCR-then-LLM (the camera produces an image, Vision Framework or ML Kit gives you text, the fine-tuned model turns the text into JSON). Ramp is the anchor because expense-receipt extraction is the most-asked enterprise-extraction pattern and the privacy story for on-device is strong; substitute any vertical where you need fields out of paperwork.

    Recipe also fits

    The Ramp-shaped example is one of many. The same shape applies to:

    • Expense management and corporate-card apps (Expensify, Brex, Pleo) that extract receipt and invoice fields.
    • Accounting and bookkeeping tools (QuickBooks, Xero, FreshBooks) that ingest vendor bills.
    • Invoice and AP processing (Bill.com, Stampli, Tipalti) where the field structure is well-defined and the input is OCR'd text.
    • CRM lead extraction from business cards, sign-up forms, or scraped web data.
    • Medical record digitisation (intake forms, lab results) where privacy makes on-device extraction a feature, not just a cost saving.
    • Legal document intake and HR onboarding paperwork pipelines.

    If your product turns paperwork into fields users can act on, this is the recipe to start with.

    When this is the right fit

    This recipe is the right fit when:

    • The output schema is well-defined and stable. You can write down the JSON keys, types, and value formats once and stick with them. Vague schemas like "extract whatever's interesting" do not train well.
    • The input is text, not pixels. The model takes OCR output, not images. If you need image-to-JSON in one step, that is a job for a vision-language model (LLaVA, Qwen 2.5 VL) and a different recipe. The OCR-then-LLM split lets you use a smaller LLM and a separate, mature OCR layer.
    • You need JSON the app can parse without retries. Strict JSON, valid against your schema, every time. The dataset and probe set both work hard on this.
    • You can tolerate field-by-field accuracy in the 85 to 95% range on a clean OCR pass, falling toward 70% on messy receipts. If you need higher reliability, route to a frontier API or to a human in the loop.

    It is not the right fit when:

    • The schema changes weekly. Each schema change means a retrain. If your data shape is in flux, use a frontier API with structured output until the schema settles.
    • You need multimodal reasoning (the model has to look at the photo, not just the OCR output). Extraction-from-image is a different task; see the model catalogue for vision-language options.
    • The text input is unconstrained free-form and the right answer requires deep world knowledge ("extract the company's strategic priorities from this 50-page annual report"). That is a summarisation task, not extraction; see Document summariser.

    The schema

    Pick the schema and write it down before you write a single training row. For the Ramp-style receipt scenario:

    {
      "vendor": "Joe's Coffee",
      "date": "2026-05-19",
      "currency": "USD",
      "subtotal": 12.50,
      "tax": 1.06,
      "tip": 2.00,
      "total": 15.56,
      "line_items": [
        {"name": "Latte", "quantity": 1, "unit_price": 5.50, "total": 5.50},
        {"name": "Croissant", "quantity": 1, "unit_price": 7.00, "total": 7.00}
      ],
      "payment_method": "card_ending_4242",
      "notes": null
    }

    Decisions baked into this schema:

    • Dates in ISO 8601, not the receipt's format. The model normalises.
    • Numbers as numbers, not strings. JSON numbers; the model emits 12.50, not "12.50".
    • Currency as ISO 4217 (USD, EUR, JPY), inferred from receipt cues (symbols, country, language).
    • Nullable fields explicitly null, not omitted. A consistent shape is easier to parse and easier to validate against the schema.
    • line_items may be an empty array for receipts where individual items are illegible.
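
    One way to hold the line on these decisions is to check every label against them before it reaches the training set. The sketch below is a minimal, standard-library Python checker; schema_errors and its helper names are hypothetical, and the rules encode only what the list above states (fixed keys, numbers as numbers, ISO 8601 dates, three-letter currency codes, line_items always an array). It checks the shape of a currency code, not membership in the ISO 4217 list.

```python
import re
from datetime import date

# Key names come from the schema above; the checker itself is a hypothetical helper.
TOP_LEVEL_KEYS = {"vendor", "date", "currency", "subtotal", "tax", "tip",
                  "total", "line_items", "payment_method", "notes"}
LINE_ITEM_KEYS = {"name", "quantity", "unit_price", "total"}


def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def is_iso_date(value):
    # Require the YYYY-MM-DD shape first, then let datetime check the calendar.
    if not isinstance(value, str) or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def schema_errors(receipt):
    """Return a list of schema violations; an empty list means the label passes."""
    if not isinstance(receipt, dict):
        return ["top level must be an object"]
    errors = []
    if set(receipt) != TOP_LEVEL_KEYS:
        errors.append("top-level keys differ from the schema")
    for key in ("subtotal", "tax", "tip", "total"):
        if receipt.get(key) is not None and not is_number(receipt[key]):
            errors.append(f"{key} must be a number or null")
    if receipt.get("date") is not None and not is_iso_date(receipt["date"]):
        errors.append("date must be YYYY-MM-DD or null")
    currency = receipt.get("currency")
    if currency is not None and not (
            isinstance(currency, str) and re.fullmatch(r"[A-Z]{3}", currency)):
        errors.append("currency must be a three-letter code or null")
    items = receipt.get("line_items")
    if not isinstance(items, list):
        errors.append("line_items must be an array (possibly empty)")
    elif any(not isinstance(i, dict) or set(i) != LINE_ITEM_KEYS for i in items):
        errors.append("line item keys differ from the schema")
    return errors
```

    Running the same checker over training labels and over probe outputs keeps one definition of "valid against the schema" in one place.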

    The dataset

    For the Ramp-style scenario, 5,000 rows is a reasonable starting point. Tune up if quality lags or down if authoring at scale is hard. Each row is one (OCR'd receipt text, JSON output) pair. A typical mix of sources:

    Source | Share | Notes
    Real receipts (anonymised) | 50% (~2,500) | If you have an existing receipt corpus, this is the highest-signal data. Strip PII before training
    Public receipt datasets | 20% (~1,000) | CORD, SROIE, and similar academic datasets have annotated receipt images and OCR-style text
    Synthetic from invoice templates | 20% (~1,000) | Generate receipts from templated layouts with randomised vendors, dates, items
    Adversarial OCR noise rows | 10% (~500) | Take clean rows and machine-introduce common OCR errors (rn → m, 1 → l, 5 → S, missing characters, line wraps)

    License-check each named dataset (CORD, SROIE, and similar academic receipt corpora) before commercial use; each carries its own academic-use terms and they have changed over time.

    The adversarial subset is the difference between a model that handles a clean-receipt demo and a model that handles a wrinkled-receipt-photographed-in-bad-light reality.
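
    The synthetic share can come from a very small generator. The sketch below is one possible shape, not a prescribed tool: the vendor and menu pools, the fixed 8.5% tax rate, and the single layout are all made-up assumptions, and a real generator would rotate many layouts. It returns the OCR-style text and the matching label together so the two cannot drift apart.

```python
import random
from datetime import date, timedelta

# Hypothetical pools; a real generator would draw from much larger lists.
VENDORS = ["Joe's Coffee", "Corner Deli", "Harbor Books"]
MENU = [("Latte", 5.50), ("Croissant", 7.00), ("Bagel", 3.25), ("Notebook", 9.99)]
TAX_RATE = 0.085  # assumed flat rate, for illustration only


def synth_receipt(rng):
    """Return one (ocr_text, label) pair rendered from a single fixed layout."""
    vendor = rng.choice(VENDORS)
    day = date(2026, 1, 1) + timedelta(days=rng.randrange(365))
    items = []
    for name, price in rng.sample(MENU, k=rng.randint(1, 3)):
        qty = rng.randint(1, 3)
        items.append({"name": name, "quantity": qty,
                      "unit_price": price, "total": round(qty * price, 2)})
    subtotal = round(sum(item["total"] for item in items), 2)
    tax = round(subtotal * TAX_RATE, 2)
    total = round(subtotal + tax, 2)
    last4 = f"{rng.randrange(10000):04d}"

    lines = [vendor.upper(), day.isoformat(), ""]
    lines += [f"{item['quantity']} x {item['name']:<12}{item['total']:>8.2f}"
              for item in items]
    lines += ["",
              f"{'Subtotal':<16}{subtotal:>8.2f}",
              f"{'Tax (8.5%)':<16}{tax:>8.2f}",
              f"{'TOTAL USD':<16}{total:>8.2f}",  # explicit currency cue
              "",
              f"Visa **** {last4}"]
    label = {"vendor": vendor, "date": day.isoformat(), "currency": "USD",
             "subtotal": subtotal, "tax": tax,
             "tip": None,  # this layout has no tip line
             "total": total, "line_items": items,
             "payment_method": f"card_ending_{last4}", "notes": None}
    return "\n".join(lines), label
```

    Passing a seeded random.Random makes the synthetic set reproducible, which helps when you need to regenerate it after a schema change.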

    Dataset format

    Use the input/output schema. The instruction is the same every row; the input is the OCR text, the output is the JSON.

    {
      "instruction": "Extract receipt fields. Output strict JSON matching the schema. Use null for missing fields. Dates in ISO 8601. Currency as ISO 4217. Numbers as numbers, not strings.",
      "input": "JOE'S COFFEE\n123 Main St\n2026-05-19 14:32\n\nLatte         5.50\nCroissant     7.00\n\nSubtotal     12.50\nTax (8.5%)    1.06\nTip           2.00\nTOTAL        15.56\n\nVisa **** 4242",
      "output": "{\"vendor\": \"Joe's Coffee\", \"date\": \"2026-05-19\", \"currency\": \"USD\", \"subtotal\": 12.50, \"tax\": 1.06, \"tip\": 2.00, \"total\": 15.56, \"line_items\": [{\"name\": \"Latte\", \"quantity\": 1, \"unit_price\": 5.50, \"total\": 5.50}, {\"name\": \"Croissant\", \"quantity\": 1, \"unit_price\": 7.00, \"total\": 7.00}], \"payment_method\": \"card_ending_4242\", \"notes\": null}"
    }

    A few non-obvious decisions:

    • The output is one JSON string on one line. No pretty-printing. Pretty-printed training output produces pretty-printed inference output, which is slower (more tokens) and more failure-prone (newlines inside JSON are easy to break).
    • Every row has the full schema keys. No omissions. If the receipt does not show a tip line, tip is null (or 0, if you pick that convention and apply it to every row), not absent. This trains the model to produce the same shape every time.
    • Currency inferred from cues, not just from the symbol. A receipt that says "USD" gets USD; a receipt that shows "£" gets GBP; a receipt with no currency cue at all defaults to null rather than guessing.
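
    The first two decisions are easy to enforce mechanically when rows are written. A minimal sketch, assuming rows are built in Python; make_row is a hypothetical helper. One thing to be aware of: Python serialises 12.50 as 12.5, which is the same JSON number, so the labels will not show trailing zeros.

```python
import json

INSTRUCTION = ("Extract receipt fields. Output strict JSON matching the schema. "
               "Use null for missing fields. Dates in ISO 8601. "
               "Currency as ISO 4217. Numbers as numbers, not strings.")
SCHEMA_KEYS = ["vendor", "date", "currency", "subtotal", "tax", "tip", "total",
               "line_items", "payment_method", "notes"]


def make_row(ocr_text, fields):
    """Build one training row; keys missing from `fields` become explicit nulls."""
    unknown = set(fields) - set(SCHEMA_KEYS)
    if unknown:
        raise ValueError(f"keys not in schema: {sorted(unknown)}")
    full = {key: fields.get(key) for key in SCHEMA_KEYS}
    if full["line_items"] is None:
        full["line_items"] = []  # the schema uses an empty array, never null
    # No indent argument, so the output is a single line.
    # ensure_ascii=False keeps non-Latin item names verbatim.
    output = json.dumps(full, ensure_ascii=False, allow_nan=False)
    return {"instruction": INSTRUCTION, "input": ocr_text, "output": output}
```

    Because every row goes through the same key list, a label with a missing or misspelled key fails at authoring time rather than surfacing later as inconsistent model output.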

    Adversarial OCR noise

    The 10% adversarial subset is the highest-value 500 rows in the dataset. Apply a small set of transformations to clean rows:

    Transformation | Example
    rn → m | "morning" becomes "moming"
    1 ↔ l, 0 → O | "Total 10.00" becomes "Tota1 lO.OO"
    Missing characters | "Subtotal" becomes "Subttal" or "Subtoal"
    Line wraps mid-word | "Croissant 7.00" becomes "Crois-\nsant 7.00"
    Double-spaced columns | "Latte 5.50" becomes "Latte    5.50" (extra spaces between columns)
    Cut-off bottom | The receipt's TOTAL line is missing entirely; the model has to compute it or null it

    The model that has trained on these is meaningfully more robust at runtime. The model that has not learned the boundary between "infer the missing value" and "null the missing value" tends to hallucinate values that look plausible but are not in the text.
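
    A noise pass along these lines can be a few dozen lines of code. The sketch below covers four of the transformations; add_noise, its kind names, and the 30% per-character swap rate are assumptions for illustration. The important detail is that the label changes with the text only when information is truly gone: a misread digit keeps its clean label (the model should recover it), while a removed TOTAL line sets total to null. This sketch takes the "null it" branch rather than computing the total.

```python
import random

LOOKALIKES = {"1": "l", "0": "O", "5": "S"}
KINDS = ["rn_to_m", "lookalikes", "drop_char", "cut_total"]


def add_noise(ocr_text, receipt, rng, kind=None, rate=0.3):
    """Return (noisy_text, label) after applying one transformation.

    `kind` is one of KINDS; if omitted, one is chosen at random.
    """
    kind = kind or rng.choice(KINDS)
    label = dict(receipt)  # shallow copy so the clean row is left untouched
    if kind == "rn_to_m":
        text = ocr_text.replace("rn", "m")
    elif kind == "lookalikes":
        # Swap each look-alike character with probability `rate`.
        text = "".join(LOOKALIKES[c] if c in LOOKALIKES and rng.random() < rate else c
                       for c in ocr_text)
    elif kind == "drop_char":
        # Delete one randomly chosen letter (no-op if there are no letters).
        letters = [i for i, c in enumerate(ocr_text) if c.isalpha()]
        i = rng.choice(letters) if letters else len(ocr_text)
        text = ocr_text[:i] + ocr_text[i + 1:]
    elif kind == "cut_total":
        # Remove the TOTAL line and null the label to match.
        lines = ocr_text.split("\n")
        text = "\n".join(line for line in lines
                         if not line.strip().upper().startswith("TOTAL"))
        label["total"] = None
    else:
        raise ValueError(f"unknown kind: {kind}")
    return text, label
```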

    Strict JSON in training equals strict JSON in production. The dataset rule: every output row is parseable by a strict JSON parser. If a single row has a trailing comma, the model will sometimes produce trailing commas. Run every output row through json.loads (or your language's equivalent) during dataset prep and drop the unparseable ones.
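
    A minimal version of that filter, assuming rows are dicts with an "output" string as in the dataset format above. Note that Python's json.loads is slightly lenient by default: it accepts NaN and Infinity and silently keeps the last of any duplicate keys. The hooks below close both gaps; the function names are hypothetical.

```python
import json


def _reject_constant(name):
    # json.loads accepts NaN, Infinity and -Infinity by default; strict JSON does not.
    raise ValueError(f"non-standard constant: {name}")


def _reject_duplicates(pairs):
    # Called once per JSON object with its (key, value) pairs in order.
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate key in object")
    return dict(pairs)


def is_strict_json_object(text):
    """True if `text` parses as strict JSON and the top level is an object."""
    try:
        value = json.loads(text, parse_constant=_reject_constant,
                           object_pairs_hook=_reject_duplicates)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(value, dict)


def keep_parseable(rows):
    """Return (kept_rows, dropped_count)."""
    kept = [row for row in rows if is_strict_json_object(row["output"])]
    return kept, len(rows) - len(kept)
```

    Logging the dropped count on every prep run is worthwhile: a sudden jump usually points at a broken labelling step rather than at individual bad rows.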

    Optional: a DPO pass for JSON discipline

    If your supervised fine-tune still produces broken JSON at, say, 2% of inferences (one in fifty), a small Direct Preference Optimisation (DPO) pass tightens it noticeably. Take 500 of your existing rows, generate "chosen" outputs (your existing strict JSON) and "rejected" outputs (deliberately broken: trailing commas, unquoted keys, missing brackets, paraphrased numbers). DPO teaches the model to prefer the strict shape over the broken one even on out-of-distribution inputs. See SFT vs DPO for the broader pattern; the DPO feature in Ertas is on the roadmap.
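
    Building the preference pairs is mostly string surgery on outputs you already have. The sketch below shows one breaker per failure type named above. The prompt/chosen/rejected field names follow a convention common in open-source DPO tooling and are an assumption here, not a documented Ertas format; the breaker functions are hypothetical and assume single-line outputs that end in "}".

```python
import json
import re


def add_trailing_comma(s):
    # {"a": 1} becomes {"a": 1,}
    return s[:-1] + ",}"


def unquote_first_key(s):
    # {"vendor": ...} becomes {vendor: ...}
    return re.sub(r'"(\w+)":', r"\1:", s, count=1)


def drop_closing_brace(s):
    # {"a": 1} becomes {"a": 1
    return s[:-1]


def dollar_string_number(s):
    # First bare number becomes a "$"-prefixed string: 15.56 becomes "$15.56".
    # This one still parses; it breaks the schema's types rather than the syntax.
    return re.sub(r': (-?\d+(?:\.\d+)?)', r': "$\1"', s, count=1)


BREAKERS = [add_trailing_comma, unquote_first_key, drop_closing_brace,
            dollar_string_number]


def make_dpo_pair(row, breaker):
    """Pair the strict output (chosen) with a deliberately broken copy (rejected)."""
    chosen = row["output"]
    rejected = breaker(chosen)
    if rejected == chosen:
        raise ValueError("breaker left the output unchanged")
    return {"prompt": row["instruction"] + "\n\n" + row["input"],
            "chosen": chosen, "rejected": rejected}
```

    Cycling through the breakers across the 500 rows gives each failure type a roughly equal share of the rejected examples.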

    The base model

    Pick Qwen 2.5 3B Instruct from the Ertas catalogue. The reasoning:

    • Qwen 2.5's base instruction tuning handles structured-output tasks reliably. Out of the box it produces valid JSON more often than equivalent-size Llama and Gemma bases on these tasks.
    • At Q4_K_M, the GGUF is about 2.1 GB. That fits comfortably on the iPhone and mid-tier Android phones a typical expense app needs to support.
    • The 32k context window is plenty for receipts; even an itemised hotel folio rarely runs past 2,000 OCR'd tokens.

    GPU tier: Qwen 2.5 3B Instruct trains on a T4, fitting the Free plan. Paid-plan upgrade options are documented two paragraphs down (Qwen 2.5 7B Instruct or equivalent A10G models).

    If you need to go smaller (cheaper Android devices, web target), Qwen 2.5 1.5B at Q4_K_M (~1 GB) is the next step down. The JSON discipline holds at this size; complex receipt schemas (long itemisation, multi-currency) start to slip. Pair it with the DPO pass.

    If your schema is much more complex (long invoices with 50+ line items, multi-page documents), a 7B-class model such as Qwen 2.5 7B Instruct may be warranted. Test the 3B first; it is usually enough.

    Training config

    For 5,000 rows of (OCR, JSON) pairs, start with:

    Setting | Value | Why
    Schema | input/output | Single-task fit
    Epochs | 4 | More than the other recipes; JSON discipline needs the extra passes to stabilise
    Learning rate | 2e-4 | Standard SFT default
    LoRA rank | 16 | The model is learning both the schema and the OCR-noise recovery
    LoRA alpha | 32 | 2x rank
    Batch size | 4 | OCR text is short; comfortable batch size
    Grad accumulation | 4 | Effective batch 16
    Warmup | 5% of steps | Standard
    Max sequence length | 2048 | Sufficient for typical receipts plus the JSON output
    Wall-clock time and credit cost depend on the GPU tier and dataset size. Ertas's Training Config picker shows an estimate before you press play; see Credits and usage for the current rates.

    The four-epoch choice is the most important difference from the other recipes. Structured-output tasks benefit from the extra reinforcement; the model needs to see the JSON shape often enough that it becomes the default behaviour. Three epochs leaves about 1 to 2% of inferences with subtle JSON breakages; four epochs is usually enough to drop that under 0.5%.
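
    To make the step arithmetic behind these settings concrete, here is a small worked calculation. The dictionary keys are illustrative names, not Ertas configuration keys, and the sketch assumes the final partial batch of each epoch still counts as a step.

```python
import math

# Values from the training config table; key names are hypothetical.
CONFIG = {
    "epochs": 4,
    "learning_rate": 2e-4,
    "lora_rank": 16,
    "lora_alpha": 32,
    "batch_size": 4,
    "grad_accumulation": 4,
    "warmup_fraction": 0.05,
    "max_sequence_length": 2048,
}


def schedule(num_rows, config):
    """Derive optimiser-step counts from the dataset size and the config."""
    effective_batch = config["batch_size"] * config["grad_accumulation"]
    steps_per_epoch = math.ceil(num_rows / effective_batch)
    total_steps = steps_per_epoch * config["epochs"]
    warmup_steps = round(total_steps * config["warmup_fraction"])
    return {"effective_batch": effective_batch,
            "steps_per_epoch": steps_per_epoch,
            "total_steps": total_steps,
            "warmup_steps": warmup_steps}
```

    For 5,000 rows this gives an effective batch of 16, 313 steps per epoch, 1,252 steps in total, and 63 warmup steps. Dropping to three epochs removes 313 of those steps.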

    Integration: iOS via llamadart + Vision Framework

    Ramp's iOS app uses Vision Framework for OCR (it ships in iOS, runs on the Neural Engine, and produces good output for receipts). The fine-tuned model takes the Vision output and produces the JSON.

    // ReceiptExtractor.swift
    import UIKit
    import Vision
    import LlamaDart // hypothetical iOS binding, see Ship: iOS
    
    enum ExtractorError: Error {
        case invalidImage
        case userRecoveryNeeded(ocrText: String)
    }
    
    class ReceiptExtractor {
        private let engine: LlamaEngine
        private let session: ChatSession
    
        init(modelPath: String) async throws {
            // Load through a local so no property is read before initialisation completes.
            let engine = LlamaEngine()
            try await engine.loadModel(path: modelPath)
            self.engine = engine
            self.session = ChatSession(engine: engine)
        }
    
        func extract(from image: UIImage) async throws -> Receipt {
            let ocrText = try await runOCR(on: image)
            let json = try await runLLM(on: ocrText)
            // The schema uses snake_case keys (line_items, payment_method, unit_price),
            // so map them onto the camelCase Swift properties.
            let decoder = JSONDecoder()
            decoder.keyDecodingStrategy = .convertFromSnakeCase
            return try decoder.decode(Receipt.self, from: Data(json.utf8))
        }
    
        private func runOCR(on image: UIImage) async throws -> String {
            guard let cgImage = image.cgImage else { throw ExtractorError.invalidImage }
            let request = VNRecognizeTextRequest()
            request.recognitionLevel = .accurate
            request.usesLanguageCorrection = false // false avoids autocorrecting receipt text
            try VNImageRequestHandler(cgImage: cgImage).perform([request])
            guard let observations = request.results else { return "" }
            // Keep the best candidate per detected line, one line per row.
            return observations
                .compactMap { $0.topCandidates(1).first?.string }
                .joined(separator: "\n")
        }
    
        private func runLLM(on ocrText: String) async throws -> String {
            // Same instruction text as the training rows, followed by the OCR output.
            let prompt = """
            Extract receipt fields. Output strict JSON matching the schema. \
            Use null for missing fields. Dates in ISO 8601. Currency as ISO 4217. \
            Numbers as numbers, not strings.
    
            \(ocrText)
            """
            let response = try await session.generate(
                prompt: prompt,
                options: GenerateOptions(temperature: 0.1, topP: 0.9, maxTokens: 800)
            )
            session.reset() // each receipt is independent; do not carry context over
            return response.trimmingCharacters(in: .whitespacesAndNewlines)
        }
    }
    
    struct Receipt: Decodable {
        let vendor: String?
        let date: String?
        let currency: String?
        let subtotal: Double?
        let tax: Double?
        let tip: Double?
        let total: Double?
        let lineItems: [LineItem]
        let paymentMethod: String?
        let notes: String?
    }
    
    // Mirrors one entry of line_items in the schema.
    struct LineItem: Decodable {
        let name: String
        let quantity: Double
        let unitPrice: Double
        let total: Double
    }

    Key choices:

    • usesLanguageCorrection: false for OCR. Receipts have product names, vendor names, and codes that Vision's spell-correction wrecks if left on.
    • Temperature 0.1: low. JSON output should be near-deterministic.
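    • maxTokens: 800: caps the JSON length. A receipt with 30 line items takes about that. Output that hits the cap is truncated and fails to decode, so raise the cap if you expect longer receipts.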
    • Decode through a strict Codable struct. If the model's output does not decode, you find out immediately at the boundary; you do not ship JSON to the next layer that pretends to be schema-compliant.

    Recovery patterns

    Even at 0.5% JSON failures, a high-volume expense app sees them. Two recovery patterns worth wiring in:

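    1. Re-run with stricter sampling. If the first inference fails to parse, re-run with temperature: 0.0 (greedy). This catches most of the failures. Greedy decoding is deterministic, so a further greedy retry would return the same output; retry once, then escalate.
    2. Surface to user. After two failures, show the OCR text in a form view and let the user type the fields. The model becomes a productivity boost, not a single point of failure.

    func extractWithRecovery(from image: UIImage) async throws -> Receipt {
        // Run OCR once; both attempts read the same text.
        let ocrText = try await runOCR(on: image)
        let decoder = JSONDecoder()
        decoder.keyDecodingStrategy = .convertFromSnakeCase // schema keys are snake_case
        do {
            let json = try await runLLM(on: ocrText)
            return try decoder.decode(Receipt.self, from: Data(json.utf8))
        } catch {
            // First attempt failed. Retry greedy. runLLMGreedy is assumed to be
            // runLLM with temperature 0.0; it is not shown here.
            do {
                let json = try await runLLMGreedy(on: ocrText)
                return try decoder.decode(Receipt.self, from: Data(json.utf8))
            } catch {
                // Second failure: hand the OCR text to the UI for manual entry.
                throw ExtractorError.userRecoveryNeeded(ocrText: ocrText)
            }
        }
    }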

    Probe set

    Ten prompts that exercise the failure modes. Each is an OCR output; pass criteria are about the JSON.

    # | Receipt type | Pass criteria
    1 | Clean US restaurant receipt | All fields populated, math reconciles (subtotal + tax + tip = total)
    2 | European grocery (EUR, no tip line) | currency is EUR, tip is null, line items populated
    3 | Taxi receipt (no itemisation) | line_items is [], total is populated, payment method captured
    4 | Hotel folio (15+ line items across two days) | All line items captured; date is the checkout date
    5 | Receipt with tip written by hand at the bottom | The handwritten amount is in tip; the model recognises the digit even with OCR ambiguity
    6 | Receipt with the TOTAL line cut off | total is null, not invented; other fields populated correctly
    7 | Receipt where OCR misread 5.50 as S.50 | The model recovers to 5.50 using context (subtotal math) or flags as null
    8 | Bilingual receipt (English vendor name, Japanese line items) | English fields preserved, Japanese line item names preserved verbatim, currency is JPY
    9 | Adversarial: blurred receipt with three plausible vendor names in the OCR text | Model picks the most likely vendor; does not concatenate
    10 | Adversarial: OCR text that itself contains a stray trailing comma (for example at the end of a line-item row) | Model output JSON does NOT have a trailing comma; passes a strict parser

    Most of these should pass cleanly on a properly trained model. Probe 6 (null on missing data instead of invention) and probe 10 (strict JSON discipline) are the canonical pass/fail tells.
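
    Several of these criteria can be checked automatically. The sketch below covers the reconciliation check from probe 1 and the null-total check from probe 6; the function names, the half-cent tolerance, and the choice to treat a null tip as zero are assumptions. Probe 10 reduces to running the raw output through a strict JSON parser.

```python
import json


def reconciles(receipt, tolerance=0.005):
    """Probe 1: subtotal + tax + tip should equal total (a null tip counts as 0)."""
    subtotal, tax, total = (receipt.get(k) for k in ("subtotal", "tax", "total"))
    if subtotal is None or tax is None or total is None:
        return False
    tip = receipt.get("tip") or 0
    return abs(subtotal + tax + tip - total) <= tolerance


def passes_null_total_probe(raw_output, expected):
    """Probe 6: total must be present and null; every other expected field must match."""
    try:
        got = json.loads(raw_output)
    except ValueError:
        return False  # unparseable output fails the probe outright
    if not isinstance(got, dict):
        return False
    if got.get("total", "missing") is not None:
        return False  # total is absent or was invented
    return all(got.get(key) == value
               for key, value in expected.items() if key != "total")
```

    Checks like these run on the raw model string, before any repair or retry logic, so the probe measures the model rather than the recovery path.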

    Limits

    • OCR quality is the ceiling. If the OCR step misses 40% of the text, no amount of model quality recovers it. The model is only as good as the text it sees; budget engineering time for OCR tuning (lighting hints, retry capture, perspective correction).
    • Schema rigidity. Adding a new field to the schema means a retrain. Plan schema changes carefully; a "miscellaneous" string field as an escape hatch in v1 saves retrains later.
    • Long documents. This recipe is sized for receipts (up to about 2,000 OCR'd tokens). Multi-page invoices need a different pattern: chunk the input, extract per page, merge in app code.
    • Hallucination on missing data. Even with adversarial training, the model occasionally invents a value when it should null. The probe set catches systematic cases; surface low-confidence fields in the UI for user confirmation.
    • No image input. This recipe is text-in, text-out. If you need image-in, you need a vision-language model and a different integration path. See the model catalogue for VLM options as they ship.
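
    For the long-document limit, the merge step after per-page extraction can stay simple. The sketch below is one possible policy, written in Python for brevity (the same logic ports to Swift or Kotlin app code). The rules are assumptions, not part of the recipe: header fields come from the first page that has them, amounts come from the last page that has them (totals usually sit at the end), and line items are concatenated in page order.

```python
HEADER_KEYS = ("vendor", "date", "currency", "payment_method", "notes")
AMOUNT_KEYS = ("subtotal", "tax", "tip", "total")


def first_non_null(values):
    """Return the first value that is not None, or None if there is none."""
    return next((v for v in values if v is not None), None)


def merge_pages(pages):
    """Merge per-page extractions (a list of dicts in page order) into one receipt."""
    merged = {}
    for key in HEADER_KEYS:
        merged[key] = first_non_null(page.get(key) for page in pages)
    for key in AMOUNT_KEYS:
        # Scan from the last page backwards.
        merged[key] = first_non_null(page.get(key) for page in reversed(pages))
    merged["line_items"] = [item
                            for page in pages
                            for item in (page.get("line_items") or [])]
    return merged
```

    A merged result is worth re-checking for reconciliation (line items against subtotal), since a page that failed extraction shows up as missing items rather than as an error.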

    What's next