
How to Prepare Enterprise Training Data for Small Model Fine-Tuning
A five-stage practical guide to converting unstructured enterprise documents — PDFs, Word files, scanned forms — into clean JSONL training data for small language model fine-tuning.
The model is not the hard part. Training infrastructure is not the hard part. The hard part of enterprise fine-tuning is getting the training data ready.
Enterprise data lives in PDFs, Word documents, Excel spreadsheets, scanned forms, email attachments, and legacy database exports. Fine-tuning a language model requires structured JSONL files containing prompt/completion pairs — clean, consistent, correctly formatted text with clear instruction-response mappings.
The gap between those two states is the data preparation challenge. It is where most enterprise AI projects spend 60-80% of their time. It is where the most common and costly mistakes happen. And it is where the difference between a model that works in production and a model that fails at deployment is determined.
This guide covers the five stages of preparing enterprise training data for small model fine-tuning: ingest, clean, label, augment, and export. Each stage maps to a concrete set of operations with specific tools, quality metrics, and failure modes.
Stage 1: Ingest — Parse Documents into Structured Text
The first stage converts source documents into machine-readable text with preserved structure. This sounds simple. It is not.
Digital PDFs
PDFs created by software (exported from Word, generated by ERP systems, produced by CAD tools) contain embedded text. But the text is stored as positioned glyphs, not as paragraphs and tables. A PDF that looks like a clean three-column table to a human reader is stored internally as hundreds of individual text fragments at specific X-Y coordinates, with no table object connecting them.
Table reconstruction requires grouping text fragments by spatial proximity, identifying column boundaries from alignment patterns, and assembling rows from fragments that share a vertical coordinate within a tolerance band. Get the tolerance wrong and you merge rows that should be separate, or split a single row across two.
Multi-column layouts present similar challenges. A two-column document's text fragments, read in naive left-to-right order, interleave content from both columns — producing nonsensical text.
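To make the row-grouping concrete, here is a minimal sketch of the tolerance-band approach, assuming fragments arrive as (x, y, text) tuples from a positional PDF parser (pdfplumber's extract_words() output maps onto this shape); the 3-point tolerance is illustrative and must be tuned per document family.

```python
# Minimal sketch: group positioned text fragments into rows using a vertical
# tolerance band, then order each row left-to-right. Fragments are assumed to
# arrive as (x, y, text) tuples from a positional PDF parser; the 3-point
# tolerance is illustrative and must be tuned per document family.
def group_into_rows(fragments, y_tolerance=3.0):
    """fragments: iterable of (x, y, text); returns rows ordered top to bottom."""
    rows = []  # each entry: [band_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: f[1]):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))   # within tolerance: same row
        else:
            rows.append([y, [(x, text)]])   # vertical gap too large: new row
    return [[t for _, t in sorted(cells)] for _, cells in rows]

# Two fragments at y=101.2 and y=102.8 land in one row; y=140.0 starts a new one.
print(group_into_rows([(72, 101.2, "Concrete C30"), (310, 102.8, "450"), (72, 140.0, "Rebar")]))
```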
Scanned Documents
Scanned documents require OCR before any text processing. Modern OCR engines (Tesseract 5, PaddleOCR, EasyOCR) achieve 95-99% character accuracy on clean scans of printed text. But enterprise archives contain:
- Faded copies from older photocopiers, reducing character contrast
- Skewed scans where pages were fed at an angle
- Stamps and signatures overlaying printed text
- Handwritten annotations mixed with printed content
- Multi-generation copies — a copy of a copy of a fax
Each of these degrades OCR accuracy. A scan that is 98% accurate at the character level produces roughly one error per 50 characters, so a 500-word document (roughly 2,500-3,000 characters) contains 50-60 character-level errors. Some of those errors corrupt numbers (changing "1,234" to "1,2B4"), which is catastrophic for financial or quantity data.
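One cheap mitigation is to flag likely number corruption before it reaches training data. A minimal sketch, assuming a simple regex heuristic; the pattern is illustrative, not exhaustive:

```python
# Minimal sketch: flag OCR tokens that look like numbers corrupted by character
# substitutions (digits with letters mixed in), so affected pages can be routed
# to manual review before they reach training data. The pattern is illustrative.
import re

SUSPECT_NUMBER = re.compile(r"\b\d[\d,.]*[A-Za-z]+[\d,.]*\b")

def suspect_numeric_tokens(text):
    return SUSPECT_NUMBER.findall(text)

print(suspect_numeric_tokens("Total quantity 1,2B4 at rate 45.50"))  # ['1,2B4']
```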
Spreadsheets and Structured Files
Excel files, CSV exports, and database dumps are already structured — but inconsistently. Common issues:
- Merged cells in Excel that break column alignment
- Hidden rows and columns containing auxiliary data or formulas
- Multiple sheets with different schemas in one workbook
- Encoding mismatches between exported CSV files and the consuming system
- Date format ambiguity — is "03/06/26" March 6 or June 3?
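Date ambiguity in particular should not be resolved by guessing. A minimal sketch that refuses to parse unless the source system's format is declared; the format map and system names are assumptions for illustration:

```python
# Minimal sketch: resolve "03/06/26"-style ambiguity by requiring an explicit,
# declared format per source system rather than guessing. The format map and
# source system names are illustrative assumptions.
from datetime import datetime

SOURCE_DATE_FORMATS = {
    "erp_export": "%d/%m/%y",   # this system is known to emit day-first dates
    "us_payroll": "%m/%d/%y",   # this one is month-first
}

def parse_date(raw, source_system):
    fmt = SOURCE_DATE_FORMATS.get(source_system)
    if fmt is None:
        raise ValueError(f"No declared date format for source '{source_system}'; refusing to guess")
    return datetime.strptime(raw, fmt).date().isoformat()  # normalize to ISO 8601

print(parse_date("03/06/26", "erp_export"))  # 2026-06-03
```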
Ingest Output Format
The output of the ingest stage should be a standardized intermediate format — one document per JSON file, containing:
{
  "document_id": "DOC-2026-00142",
  "source_file": "contract_amendment_3.pdf",
  "source_type": "digital_pdf",
  "pages": [
    {
      "page_number": 1,
      "text_blocks": [...],
      "tables": [...],
      "metadata": {"ocr_confidence": null, "layout_type": "single_column"}
    }
  ],
  "ingest_timestamp": "2026-03-06T14:30:00Z",
  "ingest_version": "1.2.0"
}
This intermediate format decouples parsing from downstream processing. When you improve your OCR or table extraction, you re-run ingest without touching the rest of the pipeline.
Stage 2: Clean — Remove Noise, Standardize, De-Identify
Raw parsed output contains noise that will degrade model training if not removed. The cleaning stage has three objectives: remove artifacts, standardize formats, and redact sensitive information.
Artifact Removal
- Headers and footers. Repeated page headers ("CONFIDENTIAL — Company X") and footers ("Page 3 of 47") appear in every page's text output. If left in, the model trains on hundreds of instances of "Page N of M" — learning to reproduce boilerplate instead of extracting useful content.
- OCR artifacts. Misrecognized characters, especially in tables: pipe characters read as "l" or "I", parentheses read as "C" or ")", degree symbols read as "o".
- Duplicate documents. Large archives contain multiple copies — original, amended, signed, countersigned. Without deduplication, the model overweights these documents. Use content-based hashing (not filename-based) to detect near-duplicates. Exact hashing misses documents that differ only in a header or page number; fuzzy hashing (MinHash, SimHash) catches these.
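A minimal sketch of the fuzzy-matching idea, using character shingles and Jaccard similarity; at archive scale the pairwise comparison would be swapped for MinHash or SimHash (for example via the datasketch library), and the 5-character shingle size is illustrative:

```python
# Minimal sketch of fuzzy near-duplicate detection: character shingles plus
# Jaccard similarity. At archive scale the pairwise comparison would be replaced
# by MinHash/SimHash (for example via the datasketch library); the 5-character
# shingle size and any similarity threshold are illustrative.
def shingles(text, k=5):
    text = " ".join(text.lower().split())  # collapse whitespace before shingling
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

doc_a = "Contract Amendment No. 3 between Acme Construction LLC and the City of Portland."
doc_b = "Page 2 of 47\nContract Amendment No. 3 between Acme Construction LLC and the City of Portland."
print(round(jaccard(doc_a, doc_b), 2))  # high score despite the added footer; an exact hash would miss it
```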
Format Standardization
Inconsistent formatting across source documents creates noise in training data:
| Element | Variations Found | Standardized Form |
|---|---|---|
| Dates | "03/06/2026", "6 Mar 2026", "March 6, 2026", "2026-03-06" | ISO 8601: "2026-03-06" |
| Numbers | "1,234.50", "1.234,50", "1234.5" | Locale-neutral: "1234.50" |
| Currency | "$1,234", "USD 1,234", "1,234 USD" | Symbol-prefix: "$1234.00" |
| Units | "m3", "M3", "CUM", "cum", "cubic meters" | Abbreviated: "m³" |
| Percentages | "15%", "15 %", "15 percent", "0.15" | Symbol: "15%" |
Standardization must be deterministic and reversible. Log every transformation so you can trace a cleaned value back to its original form. This traceability is not optional — it is a regulatory requirement under the EU AI Act.
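A minimal sketch of a logged, reversible transformation, assuming a single per-field rule and an append-only log; the field names, rule name, and log schema are illustrative:

```python
# Minimal sketch: standardize a value and log the transformation so every
# cleaned value can be traced back to its original form. The field names,
# rule name, and log structure are illustrative.
import json
from datetime import datetime, timezone

TRANSFORM_LOG = []

def standardize_number(raw, document_id, field):
    cleaned = raw.replace(",", "")  # "1,234.50" -> "1234.50" (locale-neutral)
    TRANSFORM_LOG.append({
        "document_id": document_id,
        "field": field,
        "original": raw,
        "standardized": cleaned,
        "rule": "number_locale_neutral",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return cleaned

print(standardize_number("1,234.50", "DOC-2026-00142", "line_item_rate"))
print(json.dumps(TRANSFORM_LOG[-1], indent=2))
```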
PII and PHI De-Identification
This is the step that enterprise teams most frequently skip, underestimate, or do incorrectly. The consequences of getting it wrong range from compliance fines to criminal liability.
What must be redacted:
- PII (all industries): Names, addresses, phone numbers, email addresses, national ID numbers, bank account numbers, dates of birth
- PHI (healthcare): All of the above plus medical record numbers, health plan beneficiary numbers, device identifiers, biometric identifiers, and any other of the 18 HIPAA identifiers
- Financial identifiers: Account numbers, SWIFT codes, tax identification numbers
Redaction methods:
- Replacement with typed placeholders. Replace "John Smith" with "[PERSON_NAME_1]", not with a generic "[REDACTED]". Typed placeholders preserve the semantic structure of the text. The model learns that a person name appears in that position, even though it does not learn the specific name.
- Consistent replacement within documents. If "John Smith" appears 15 times in a contract, all 15 instances must be replaced with the same placeholder. Inconsistent replacement — "[PERSON_NAME_1]" in one place and "[PERSON_NAME_2]" in another for the same entity — teaches the model incorrect entity relationships.
- Date shifting. For healthcare data, dates must be shifted by a consistent random offset within each patient record. A patient's admission date of March 1 and discharge date of March 5 might become January 15 and January 19 — the 4-day duration is preserved, but the actual dates are not recoverable.
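A minimal sketch of the typed-placeholder replacement and date shifting described above, assuming entity spans are supplied by an upstream NER pass; the entity types, record IDs, and offset range are illustrative:

```python
# Minimal sketch: typed, per-document-consistent placeholder replacement and
# per-record date shifting. Entity spans are assumed to come from an upstream
# NER pass; entity types, record IDs, and the offset range are illustrative.
import random
from datetime import date, timedelta

def replace_entities(text, entities):
    """entities: list of (surface_form, entity_type); same surface form -> same placeholder."""
    mapping, counters = {}, {}
    for surface, etype in entities:
        if surface not in mapping:
            counters[etype] = counters.get(etype, 0) + 1
            mapping[surface] = f"[{etype}_{counters[etype]}]"
        text = text.replace(surface, mapping[surface])
    return text, mapping

def shift_dates(dates, record_id, max_days=180):
    """One consistent random offset per record: durations survive, real dates do not."""
    offset = timedelta(days=random.Random(record_id).randint(-max_days, max_days))
    return [d + offset for d in dates]

text, mapping = replace_entities(
    "John Smith signed on behalf of Acme. John Smith is the site manager.",
    [("John Smith", "PERSON_NAME"), ("Acme", "ORG_NAME")],
)
print(text)  # both mentions of John Smith become [PERSON_NAME_1]
print(shift_dates([date(2026, 3, 1), date(2026, 3, 5)], record_id="patient-042"))
```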
Validation: After de-identification, run a secondary scan to catch missed PII. Named entity recognition models, regex patterns for structured identifiers (SSNs, phone numbers, emails), and dictionary-based matching for known names all serve as verification layers. A PII scan pass rate above 99.5% is the minimum threshold for regulated industries.
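A minimal sketch of the regex layer of such a scan, run over already de-identified text; the patterns shown (US SSN, email, international phone) are illustrative and would be extended per jurisdiction:

```python
# Minimal sketch of the regex layer of a secondary PII scan, run over already
# de-identified text; any hit fails the record. The patterns (US SSN, email,
# international phone) are illustrative and would be extended per jurisdiction.
import re

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\+\d{1,3}[\s-]?\d{2,4}(?:[\s-]?\d{2,4}){2,3}"),
}

def scan_for_pii(text):
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items() if pattern.search(text)}

print(scan_for_pii("Contact [PERSON_NAME_1] at j.smith@example.com for access."))  # email hit -> record fails
```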
Stage 3: Label — Domain Experts Create Training Examples
Labeling is the stage that converts cleaned documents into training data. A domain expert reviews documents and produces examples of correct extractions that the model will learn to replicate.
What a Training Example Looks Like
Each training example is an instruction-response pair:
Instruction (prompt): The input the model will receive at inference time — typically a parsed document or document section, preceded by a task instruction.
Response (completion): The structured output the model should produce — typically JSON containing the extracted fields.
{
  "instruction": "Extract all contract parties, effective date, and termination clause from the following contract text:\n\n[cleaned document text]",
  "response": "{\"parties\": [{\"name\": \"Acme Construction LLC\", \"role\": \"contractor\"}, {\"name\": \"City of Portland\", \"role\": \"owner\"}], \"effective_date\": \"2026-01-15\", \"termination\": {\"type\": \"for_convenience\", \"notice_period_days\": 30, \"clause_reference\": \"Section 14.2\"}}"
}
Why Domain Experts, Not ML Engineers
This is a point where teams consistently make the wrong decision. ML engineers are available. Domain experts are expensive and busy. The temptation is to have the ML team do the labeling.
The result: accuracy drops 15-20%.
An ML engineer labeling a construction BOQ does not know that a "PC Sum" is a prime cost sum, a budget allocation rather than a priced line item. They do not know that "Measured Work" and "Daywork" are fundamentally different pricing mechanisms. They label what they see literally, missing the domain semantics that make the extraction useful.
A quantity surveyor labels the same BOQ with implicit knowledge of how the data will be used downstream. They know which fields are critical (quantity, unit, rate) and which are informational (specification references). They know when a line item is a main item versus a sub-item based on numbering conventions that are not documented anywhere — they are industry practice.
The same pattern holds across industries. A paralegal labels contracts better than an ML engineer. A clinical coder labels medical notes better than a data scientist. A financial analyst labels financial statements better than a software developer.
Budget for domain expert time. It is the single highest-impact investment in a fine-tuning project.
Quality Over Quantity
The relationship between training data volume and model accuracy is not linear. It follows a curve with diminishing returns:
| Labeled Examples | Typical Accuracy | Marginal Gain |
|---|---|---|
| 50 | 70-75% | Baseline |
| 100 | 78-82% | +8-10% |
| 250 | 85-88% | +5-7% |
| 500 | 90-93% | +4-6% |
| 1,000 | 93-95% | +2-3% |
| 2,000 | 94-96% | +1-2% |
| 5,000 | 95-96% | <1% |
The diminishing returns above 500-1,000 examples mean that quality matters far more than quantity at enterprise scales. 500 high-quality examples labeled by a domain expert consistently outperform 10,000 noisy examples labeled by crowd workers or junior staff.
Quality Control on Labeled Data
Every labeling project needs quality control. Three mechanisms work together:
- Inter-annotator agreement. Have two domain experts independently label the same 50 documents. Measure agreement rate. For field extraction tasks, agreement above 90% indicates clear labeling guidelines. Below 85% means the guidelines are ambiguous and need revision before proceeding.
- Spot-check reviews. A senior domain expert reviews a random 10% sample of labeled examples weekly. Errors caught at this stage are cheap to fix. Errors caught after training — when the model reproduces them — are expensive.
- Automated consistency checks. Scripted checks that catch mechanical errors: missing required fields, values outside expected ranges (a quantity of -500, a date in 1900), format violations in the response JSON. These catch typos and formatting mistakes that domain experts make when fatigued.
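A minimal sketch of the first and third mechanisms, assuming field-level labels; the field names and example values are illustrative, and a chance-corrected measure such as Cohen's kappa (for example sklearn.metrics.cohen_kappa_score) can replace raw agreement where needed:

```python
# Minimal sketch: field-level agreement between two annotators plus a couple of
# automated consistency checks. Field names, example values, and thresholds are
# illustrative.
def field_agreement(labels_a, labels_b, fields):
    """labels_a/labels_b: lists of dicts (one per document) from two annotators."""
    matches = total = 0
    for a, b in zip(labels_a, labels_b):
        for f in fields:
            total += 1
            matches += a.get(f) == b.get(f)
    return matches / total

def consistency_errors(example):
    errors = []
    if example.get("quantity") is not None and example["quantity"] < 0:
        errors.append("negative quantity")
    if not example.get("unit"):
        errors.append("missing unit")
    return errors

a = [{"quantity": 450, "unit": "m³", "rate": 92.0}]
b = [{"quantity": 450, "unit": "m³", "rate": 90.0}]
print(field_agreement(a, b, ["quantity", "unit", "rate"]))  # ~0.67: rate disagrees
print(consistency_errors({"quantity": -500, "unit": ""}))   # ['negative quantity', 'missing unit']
```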
Stage 4: Augment — Expand the Training Set with Synthetic Variations
500 labeled examples may not cover the full distribution of documents the model will encounter in production. Synthetic data augmentation expands the training set while maintaining domain accuracy.
Augmentation Techniques
Template variation. Take a labeled example and generate variations by:
- Rephrasing the instruction (10-15 variations of the same extraction task)
- Reordering fields in tabular data
- Adding or removing optional fields that appear in some documents but not all
LLM-assisted generation. Use a local LLM to generate synthetic documents based on the patterns in your labeled examples. Provide 5 real examples and ask the model to generate 20 new documents with similar structure but different content. Then have a domain expert verify a sample (20-30%) of the generated examples.
Perturbation. Introduce realistic noise: OCR-style character errors, missing fields, truncated text. This teaches the model to handle imperfect input, which it will encounter in production.
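A minimal sketch of OCR-style perturbation, applied to the instruction text only so the expected response stays clean; the confusion map and perturbation rate are illustrative:

```python
# Minimal sketch of perturbation augmentation: inject OCR-style character
# substitutions into the instruction text while leaving the expected response
# untouched. The confusion map and perturbation rate are illustrative.
import random

OCR_CONFUSIONS = {"0": "O", "1": "l", "5": "S", "8": "B", "|": "I", "O": "0", "l": "1"}

def perturb(text, rate=0.02, seed=0):
    rng = random.Random(seed)
    chars = [
        OCR_CONFUSIONS[c] if c in OCR_CONFUSIONS and rng.random() < rate else c
        for c in text
    ]
    return "".join(chars)

example = {"instruction": "Extract line items:\nItem 1.01 Concrete C30 450 m3 @ 92.50", "response": "{...}"}
augmented = {"instruction": perturb(example["instruction"], rate=0.1), "response": example["response"]}
print(augmented["instruction"])
```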
Augmentation Ratios
A conservative augmentation strategy:
- 500 real labeled examples → base training set
- 1,000 template variations → expand instruction diversity
- 500 synthetic documents → expand content diversity
- Total: ~2,000 training examples from 500 expert-labeled originals
Do not augment beyond 4x-5x the original labeled set. Over-augmentation dilutes the signal from real examples with noise from synthetic ones. The model starts learning the patterns of the augmentation process rather than the patterns of real documents.
Distribution Balance
Check that the augmented dataset maintains class balance. If 80% of your labeled BOQ items are concrete and steel, and only 5% are electrical and mechanical, the model will be excellent at extracting civil works items and poor at extracting MEP items.
Balance the augmented set by over-sampling under-represented categories and under-sampling over-represented ones. Target no class representing more than 3x the frequency of the smallest class.
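A minimal sketch of capping over-represented classes at the 3:1 ratio; the "category" key is an assumed metadata field, and under-represented classes would additionally be over-sampled through augmentation:

```python
# Minimal sketch: cap over-represented categories so no class exceeds 3x the
# size of the smallest. The "category" metadata key is an illustrative
# assumption; under-represented classes would also be over-sampled via augmentation.
import random
from collections import defaultdict

def rebalance(examples, max_ratio=3, seed=0):
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex["category"]].append(ex)
    cap = max_ratio * min(len(items) for items in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, cap) if len(items) > cap else items)
    return balanced

data = [{"category": "civil"}] * 80 + [{"category": "mep"}] * 5
print(len(rebalance(data)))  # 20: 5 mep examples kept, civil capped at 15
```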
Stage 5: Export — Convert to Training-Ready JSONL
The final stage converts the augmented dataset into the format required by the fine-tuning framework.
JSONL Format
Most fine-tuning frameworks (Hugging Face TRL, Axolotl, LLaMA-Factory) accept JSONL — one JSON object per line, each containing an instruction and response:
{"instruction": "Extract line items from...", "response": "{\"items\": [...]}"}
{"instruction": "Identify parties in...", "response": "{\"parties\": [...]}"}
Some frameworks use a messages format for chat-style fine-tuning:
{"messages": [{"role": "system", "content": "You are a document extraction assistant."}, {"role": "user", "content": "Extract..."}, {"role": "assistant", "content": "{...}"}]}
Train/Validation Split
Split the data 90/10 or 85/15 into training and validation sets. The split must be stratified: each document type and extraction task should be proportionally represented in both sets. Do not split at the individual example level; ensure that all examples derived from a single source document land in the same split, to prevent leakage between training and validation.
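A minimal sketch of a leakage-safe split that shuffles source documents (not individual examples) within each document type; the metadata keys are illustrative, and scikit-learn's GroupShuffleSplit provides a comparable group-aware split:

```python
# Minimal sketch of a leakage-safe split: shuffle at the source-document level
# within each document type, then assign all examples from a held-out document
# to validation. The "doc_type" and "source_document" keys are illustrative.
import random
from collections import defaultdict

def split_by_document(examples, val_fraction=0.1, seed=42):
    docs_by_type = defaultdict(set)
    for ex in examples:
        docs_by_type[ex["doc_type"]].add(ex["source_document"])
    rng = random.Random(seed)
    val_docs = set()
    for doc_type, docs in docs_by_type.items():
        docs = sorted(docs)
        rng.shuffle(docs)
        val_docs.update(docs[: max(1, int(len(docs) * val_fraction))])
    train = [ex for ex in examples if ex["source_document"] not in val_docs]
    val = [ex for ex in examples if ex["source_document"] in val_docs]
    return train, val
```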
Metadata for Traceability
Include metadata fields in the export for audit and debugging purposes:
{
  "instruction": "...",
  "response": "...",
  "metadata": {
    "source_document": "DOC-2026-00142",
    "labeled_by": "expert_qs_03",
    "labeled_date": "2026-02-15",
    "augmentation_type": "none",
    "quality_review": "passed",
    "pii_scan": "clean"
  }
}
This metadata is not used during training — it is stripped before the data reaches the model. But it is essential for:
- Debugging model errors. When the model makes a mistake on a specific document type, trace it back to the training examples for that type.
- Regulatory compliance. The EU AI Act Article 10 requires documentation of training data: what data was used, how it was prepared, what quality measures were applied. The metadata log provides this documentation.
- Retraining decisions. When new document types are added, the metadata tells you which existing examples are relevant and which new examples are needed.
Common Mistakes and How to Avoid Them
Mistake 1: Skipping De-Identification
"We'll do it later" is the most expensive sentence in enterprise AI. PHI in training data is a compliance violation the moment it enters the training pipeline — not when the model is deployed. HIPAA, GDPR, and sector-specific regulations all apply to data processing, not just data storage or model inference.
Fix: De-identification runs in Stage 2, before any labeling begins. No exceptions.
Mistake 2: Using Non-Expert Annotators
Hiring crowd workers or using junior staff to label domain-specific data saves money in the short term and costs significantly more in the long term. The model learns the annotators' mistakes. Fixing annotation errors after training means re-labeling and re-training — doubling the total cost.
Fix: Budget for domain expert time from the start. If the budget constrains the number of examples, label fewer examples at higher quality. 300 expert-labeled examples outperform 3,000 crowd-labeled examples for enterprise document extraction.
Mistake 3: No Quality Control on Labeled Data
Without quality control, annotation errors compound. If 10% of examples contain labeling errors, and those errors are inconsistent (different annotators making different mistakes), the model receives contradictory training signals. Accuracy plateaus well below what the model architecture can achieve.
Fix: Implement inter-annotator agreement checks, spot-check reviews, and automated consistency validation. Measure quality before training. If label agreement is below 90%, fix the labeling guidelines and re-label the disagreed examples before proceeding.
Mistake 4: Training on Volume Over Quality
More data is not always better. 10,000 training examples with inconsistent formatting, mixed quality, and unbalanced class distribution produce a worse model than 500 examples with clean formatting, consistent quality, and balanced distribution.
Fix: Set a quality threshold. Every training example must pass automated checks (format valid, fields present, values in range) and a sample must pass expert review (semantically correct, domain-appropriate). Remove examples that fail. A smaller clean dataset beats a larger noisy one.
Mistake 5: No Audit Trail
Without an audit trail, you cannot explain what data the model was trained on, how it was prepared, or what quality measures were applied. This is a compliance failure under the EU AI Act, and a practical failure when you need to debug model behavior or retrain on updated data.
Fix: Log every pipeline step with timestamps, versions, and parameters. Tag every training example with its provenance. Store the logs alongside the training data. When an auditor — internal or regulatory — asks "what data was this model trained on and how was it prepared," the answer should be a query against the metadata, not a reconstruction from memory.
Data Quality Metrics: What to Measure
Before starting fine-tuning, validate the training data against these metrics:
| Metric | Target | What It Measures |
|---|---|---|
| Label agreement rate | >90% | Consistency between annotators |
| Class balance ratio | <3:1 max:min | Distribution of document/extraction types |
| Example diversity score | >0.7 (cosine) | Variety in training examples |
| PII scan pass rate | >99.5% | Completeness of de-identification |
| Format validation pass rate | 100% | Structural correctness of JSONL output |
| Duplicate rate | <2% | Near-duplicate examples in dataset |
| Instruction length variance | CV <0.4 | Consistency of input formatting |
| Response completeness | >98% | Percentage of examples with all required fields |
If any metric falls below its target, fix the data before training. Training on substandard data wastes compute and produces a model that needs to be retrained anyway.
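A minimal sketch of a pre-training quality gate covering two of these metrics, format validation and response completeness; the required fields and thresholds are illustrative:

```python
# Minimal sketch of a pre-training quality gate over a JSONL file, checking
# format validation and response completeness. Required fields and the
# completeness threshold are illustrative.
import json

REQUIRED_RESPONSE_FIELDS = {"parties", "effective_date", "termination"}

def quality_gate(jsonl_path, completeness_target=0.98):
    total = valid_format = complete = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            total += 1
            try:
                record = json.loads(line)
                response = json.loads(record["response"])
                valid_format += 1
            except (json.JSONDecodeError, KeyError):
                continue
            if isinstance(response, dict) and REQUIRED_RESPONSE_FIELDS <= response.keys():
                complete += 1
    return {
        "format_validation_pass_rate": valid_format / total,
        "response_completeness": complete / total,
        "passed": valid_format == total and complete / total >= completeness_target,
    }
```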
The Preparation Timeline
For a typical enterprise fine-tuning project targeting 500 labeled examples:
| Stage | Duration | Effort | Bottleneck |
|---|---|---|---|
| Ingest | 1-2 weeks | Engineering | Document format variety |
| Clean | 1-2 weeks | Engineering + Legal | PII identification and redaction rules |
| Label | 3-5 weeks | Domain experts | Expert availability |
| Augment | 1 week | Engineering | Quality review of synthetic data |
| Export | 2-3 days | Engineering | Format validation |
| Total | 7-11 weeks | | |
The labeling stage dominates the timeline. Plan for it. Book domain expert time in advance. Prepare labeling guidelines and tooling before the experts start. Every hour of expert time wasted on unclear instructions or broken tools is an hour you cannot get back.
Data preparation is not glamorous work. It is not the part of the AI project that gets featured in vendor demos or conference talks. But it is the part that determines whether the model works. Get the data right, and a 7B model will outperform a 70B model on your specific task. Get the data wrong, and no amount of model size or training compute will compensate.