    From 700GB of PDFs to a 500-Example Fine-Tuning Dataset: The Data Reduction Pipeline

    You have terabytes of enterprise documents. Your fine-tuned model only needs 500-5,000 high-quality examples. Here's the systematic pipeline for reducing massive document archives into precision training datasets.

    Ertas Team

    A construction company came to us with 700GB of PDFs — bills of quantities, technical specifications, architectural drawings, site reports, and project correspondence accumulated over 15 years. They wanted to fine-tune a model for two tasks: document classification (identify document type from first page) and entity extraction (pull key project details from specifications).

    They asked: "How do we use all this data for training?"

    The answer: you don't. You reduce it.

    Fine-tuning a language model requires 500-5,000 carefully curated examples for most enterprise tasks. Using "all the data" introduces noise, contradictions, redundancy, and formatting inconsistencies that actively damage model performance. The goal is not to maximize data volume — it is to distill 700GB of raw documents into the 2,000-3,000 examples that teach the model exactly what it needs to learn.

    This article walks through the five-stage reduction pipeline that transforms massive document archives into precision training datasets.

    The Scale Mismatch

    The numbers put the challenge in perspective:

    • 700GB of PDFs ≈ 140,000 documents at 5MB average
    • 140,000 documents ≈ 14 million pages at 100 pages average
    • 14 million pages ≈ 7 billion tokens at 500 tokens/page
    • Fine-tuning needs ≈ 2,000 examples at ~500 tokens average = 1 million tokens

    You need 0.014% of the available data. The other 99.986% is either redundant, irrelevant, outdated, or too noisy to improve training.

    The reduction pipeline must find the right 0.014% — the examples that are representative, accurate, diverse, and formatted correctly. This is not random sampling. It is systematic curation.

    Stage 1: Triage

    Input: 700GB of raw documents
    Output: ~200GB of potentially relevant documents
    Reduction: ~70%

    Triage sorts the document archive into "keep," "discard," and "review" piles. The goal is to eliminate obviously irrelevant material before any expensive processing begins.

    Automated Triage

    Duplicate removal. Enterprise archives contain massive duplication — the same specification distributed to 15 subcontractors, the same drawing saved in 4 versions with minor filename variations. Content hashing (MD5 or SHA-256 of the file) catches exact duplicates. For construction companies, we typically see 15-30% exact duplicates in unmanaged archives.
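
    A minimal sketch of the exact-duplicate pass, assuming the archive sits in a directory tree of PDFs (the SHA-256 choice and the paths are illustrative):

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash file contents in chunks so large PDFs never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(archive_root: Path) -> dict[str, list[Path]]:
    """Group PDFs by content hash; any group with more than one file is a duplicate set."""
    groups: dict[str, list[Path]] = {}
    for pdf in archive_root.rglob("*.pdf"):
        groups.setdefault(sha256_of_file(pdf), []).append(pdf)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

    Keep one file per group and log the rest. Renamed copies with identical bytes are caught here; re-scanned or re-exported copies are not, which is why near-duplicate removal reappears in Stage 4.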

    File type filtering. Not all PDFs contain useful training content. Filter out: blank pages, cover sheets with only a logo, table of contents pages, placeholder documents, and corrupted files that fail to parse. Automated checks: page count (reject 0-page PDFs), file size (reject files under 10KB, likely blank), text extractability (reject files where no text can be extracted).
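
    These checks are straightforward to script. A hedged sketch, assuming the pypdf library for parsing (any parser that exposes page counts and text extraction works; the 10KB floor comes from the rule above):

```python
from pathlib import Path

from pypdf import PdfReader
from pypdf.errors import PdfReadError

MIN_FILE_SIZE_BYTES = 10 * 1024  # files under 10KB are almost always blank or placeholders


def passes_automated_triage(path: Path) -> bool:
    """Cheap keep/discard checks: file size, parseability, page count, text extractability."""
    if path.stat().st_size < MIN_FILE_SIZE_BYTES:
        return False
    try:
        reader = PdfReader(path)
    except PdfReadError:
        return False  # corrupted file that fails to parse
    if len(reader.pages) == 0:
        return False
    # Keep the file only if one of the first few pages yields extractable text.
    for i, page in enumerate(reader.pages):
        if i >= 3:
            break
        if (page.extract_text() or "").strip():
            return True
    return False
```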

    Date filtering. Documents older than a defined cutoff may not reflect current standards, terminology, or business practices. If your model will process current documents, training on 15-year-old specifications with outdated standards can introduce stale patterns. A reasonable cutoff for most enterprises: keep documents from the last 5-7 years unless older documents are specifically needed.

    Document type classification. Use a zero-shot classifier or simple keyword matching to classify documents by type: specification, drawing, correspondence, report, contract, invoice. Keep only the document types relevant to your training task. For a document classification model, you need examples of all document types; for an entity extraction model focused on specifications, you only need specifications.
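
    For the keyword-matching variant, a minimal sketch (the keyword lists and type names are illustrative placeholders, not a recommended taxonomy):

```python
# Illustrative keyword map; in practice the lists come from domain experts.
DOC_TYPE_KEYWORDS = {
    "technical_specification": ["section 0", "astm", "shall conform", "submittals"],
    "correspondence": ["dear", "yours sincerely", "re:"],
    "invoice": ["invoice number", "amount due", "payment terms"],
    "site_report": ["site visit", "observed on site", "weather conditions"],
}


def classify_by_keywords(first_page_text: str) -> str:
    """Score each document type by keyword hits on the first page; route to 'review' when nothing matches."""
    text = first_page_text.lower()
    scores = {
        doc_type: sum(keyword in text for keyword in keywords)
        for doc_type, keywords in DOC_TYPE_KEYWORDS.items()
    }
    best_type, best_score = max(scores.items(), key=lambda item: item[1])
    return best_type if best_score > 0 else "review"
```

    Documents routed to "review" feed the manual triage step described below.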

    Manual Triage Review

    Automated triage handles the bulk reduction, but 10-15% of documents land in the "review" pile — documents where automated classification is uncertain. A domain expert reviews these, spending 15-30 seconds per document deciding "keep" or "discard."

    At this rate, reviewing 5,000 uncertain documents takes approximately 30 hours. This is the most cost-effective domain expert time in the entire pipeline — each minute eliminates multiple documents from downstream processing.

    Stage 2: Extract

    Input: ~200GB of relevant documents
    Output: ~5GB of relevant content sections
    Reduction: ~97%

    Extraction pulls the specific sections from documents that contain training-relevant content. Not full documents — specific passages, tables, clauses, or pages.

    Section-Level Extraction

    Most enterprise documents are 80% boilerplate and 20% unique content. A 200-page construction specification contains:

    • 40 pages of standard terms and conditions (same across every project)
    • 30 pages of general requirements (mostly standardized)
    • 80 pages of technical specifications (unique, high-value for training)
    • 30 pages of appendices (drawings, schedules — may or may not be relevant)
    • 20 pages of cover pages, contents, and blank pages

    For training, the 80 pages of technical specifications are gold. The 40 pages of standard terms are noise — every project has the same boilerplate, so including it just teaches the model to output boilerplate.

    Section extraction methods:

    • Heading-based extraction: Parse the document structure and extract sections by heading (a sketch follows this list). "Section 03300 - Cast-in-Place Concrete" is relevant; "Section 00100 - Instructions to Bidders" is not.
    • Keyword-based extraction: Extract pages containing domain-specific keywords that indicate relevant content. Filter out pages with only administrative or procedural content.
    • Layout-based extraction: Use the layout detector to identify pages with high content density (text + tables) versus low content density (mostly whitespace, headers, or images).
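
    As a concrete illustration of the heading-based method, a sketch that splits a specification on CSI-style headings and keeps only the divisions judged relevant (the heading regex and the division whitelist are assumptions to adapt to your own documents):

```python
import re

# Matches CSI-style headings such as "SECTION 03300 - CAST-IN-PLACE CONCRETE".
HEADING_PATTERN = re.compile(r"^SECTION\s+(\d{5})\s*[-–]\s*(.+)$", re.IGNORECASE | re.MULTILINE)

# Divisions treated as training-relevant here; "00" (procurement) and "01" (general requirements)
# are the boilerplate this stage is trying to drop.
RELEVANT_DIVISIONS = {"03", "04", "05", "07", "09"}


def extract_relevant_sections(document_text: str) -> list[tuple[str, str]]:
    """Split on section headings and keep only sections whose division prefix is relevant."""
    headings = list(HEADING_PATTERN.finditer(document_text))
    sections: list[tuple[str, str]] = []
    for i, match in enumerate(headings):
        body_end = headings[i + 1].start() if i + 1 < len(headings) else len(document_text)
        number, title = match.group(1), match.group(2).strip()
        if number[:2] in RELEVANT_DIVISIONS:
            sections.append((f"{number} - {title}", document_text[match.end():body_end].strip()))
    return sections
```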

    Content Quality Filter

    After section extraction, filter the extracted content for quality:

    • Remove sections with under 50% readable text (likely image-heavy pages that need different processing)
    • Remove sections that are identical or near-identical to previously extracted sections (cross-document deduplication)
    • Remove sections in languages your model won't handle
    • Remove sections with heavy redaction or censoring (insufficient content to learn from)

    Stage 3: Transform

    Input: ~5GB of relevant content sections
    Output: ~50MB of candidate training pairs
    Reduction: ~99%

    This is where extracted content becomes training data. Each relevant section is transformed into an input/output pair that matches the format the model will use in production.

    For Document Classification

    Transform each document's first page (or representative section) into a classification example:

    {
      "input": "Classify the following document excerpt:\n\n[first 500 tokens of document]",
      "output": "document_type: technical_specification\nconfidence_reason: Contains section headings with CSI format (Section 03300), material specifications, and compliance references to ASTM standards."
    }
    

    Starting from 140,000 raw documents, deduplication and triage might leave roughly 30,000 unique first pages across 12 document types. Transform all of them into candidate classification pairs.

    For Entity Extraction

    Transform relevant sections into extraction examples:

    {
      "input": "Extract project details from the following specification section:\n\n[specification text]",
      "output": {
        "project_name": "Westfield Commercial Center Phase 2",
        "specification_section": "03300 - Cast-in-Place Concrete",
        "concrete_grade": "C30/37",
        "slump_requirement": "100mm ± 25mm",
        "curing_period": "7 days minimum",
        "referenced_standards": ["ASTM C150", "ASTM C33", "ACI 318"]
      }
    }
    

    This transformation requires domain expertise. An ML engineer can structure the input format, but a construction engineer must identify and verify the correct entity values. This is the most labor-intensive stage.

    Creating Training Pairs

    The domain expert's involvement at this stage typically follows this workflow:

    1. ML engineer extracts candidate sections and structures them as inputs
    2. ML engineer uses an LLM to generate draft outputs (entity extraction suggestions)
    3. Domain expert reviews each draft, correcting errors and filling gaps
    4. ML engineer validates the final pairs against the output schema

    Time estimate: reviewing and correcting a draft extraction takes 2-5 minutes per example. For 3,000 candidate pairs, budget 100-250 hours of domain expert time. Spread across 3 experts over 4 weeks, that's 8-20 hours per week per expert.

    Stage 4: Curate

    Input: ~50MB of candidate training pairs (e.g., 10,000 pairs)
    Output: ~10MB of high-quality training pairs (e.g., 2,000 pairs)
    Reduction: ~80%

    Not all candidate pairs are suitable for training. Curation filters for quality, balance, and diversity.

    Quality Filtering

    Run all candidate pairs through quality checks:

    Label accuracy: Have a second domain expert review a random 15% of pairs. If agreement is below 90%, the labeling guidelines need revision and conflicting examples need re-review.

    Format compliance: Validate every output against the expected schema. Reject pairs with missing fields, incorrect data types, or malformed structure.
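
    A sketch of that format-compliance check, assuming the jsonschema library and an illustrative fragment of the extraction schema (a real schema would enumerate all entity fields and their types):

```python
from jsonschema import ValidationError, validate

# Illustrative fragment of the extraction output schema.
EXTRACTION_SCHEMA = {
    "type": "object",
    "required": ["project_name", "specification_section", "referenced_standards"],
    "properties": {
        "project_name": {"type": "string"},
        "specification_section": {"type": "string"},
        "referenced_standards": {"type": "array", "items": {"type": "string"}},
    },
}


def is_schema_compliant(output: dict) -> bool:
    """Reject pairs whose output is missing required fields, has wrong types, or is malformed."""
    try:
        validate(instance=output, schema=EXTRACTION_SCHEMA)
        return True
    except ValidationError:
        return False
```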

    Near-duplicate removal: Embed all input texts and remove pairs where the input cosine similarity exceeds 0.95. From 10,000 candidates, near-duplicate removal typically eliminates 20-40%.
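
    A sketch of that near-duplicate filter, assuming the sentence-transformers package for embeddings (the model name is an illustrative default). The greedy pass is quadratic, which is acceptable at the 10,000-candidate scale:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

SIMILARITY_THRESHOLD = 0.95


def keep_non_duplicates(inputs: list[str]) -> list[int]:
    """Return indices of inputs to keep, greedily dropping any input too similar to one already kept."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    embeddings = model.encode(inputs, normalize_embeddings=True)  # unit norm, so dot product == cosine similarity
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(float(np.dot(emb, embeddings[j])) < SIMILARITY_THRESHOLD for j in kept):
            kept.append(i)
    return kept
```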

    Class Balancing

    For classification tasks, count examples per class. If "technical_specification" has 4,000 examples and "site_report" has 200, the model will overwhelmingly learn to predict "technical_specification."

    Balance by downsampling overrepresented classes and keeping all examples from underrepresented classes. The target: no class below 5% of the final dataset. If a class genuinely occurs rarely, oversample it or create additional examples specifically for that class.
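
    A minimal downsampling sketch, assuming each candidate pair carries a "label" field (the per-class cap would be chosen so that no class ends up below the 5% floor):

```python
import random
from collections import defaultdict


def downsample_to_cap(pairs: list[dict], cap_per_class: int, seed: int = 13) -> list[dict]:
    """Cap overrepresented classes while keeping every example from rare classes."""
    by_class: dict[str, list[dict]] = defaultdict(list)
    for pair in pairs:
        by_class[pair["label"]].append(pair)  # assumes a "label" key on every pair
    rng = random.Random(seed)
    balanced: list[dict] = []
    for label, examples in by_class.items():
        if len(examples) > cap_per_class:
            examples = rng.sample(examples, cap_per_class)
        balanced.extend(examples)
    return balanced
```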

    Diversity Maximization

    Among the remaining candidates, select for maximum diversity:

    • Different projects (not 500 examples from the same construction project)
    • Different document templates (not all from the same engineering firm)
    • Different complexity levels (simple specifications and complex multi-system specifications)
    • Different edge cases (unusual formats, non-standard terminology, multi-language documents)

    A practical approach: cluster the candidate pairs using sentence embeddings, then select examples that cover the full cluster space — more examples from underrepresented clusters, fewer from dense clusters.
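
    One hedged way to implement that, reusing the embeddings from the deduplication step with scikit-learn's KMeans (cluster count and per-cluster quota are illustrative knobs, not tuned values):

```python
import numpy as np
from sklearn.cluster import KMeans


def select_diverse(embeddings: np.ndarray, target_size: int, n_clusters: int = 50, seed: int = 13) -> list[int]:
    """Take a roughly equal quota from each cluster, which over-weights sparse clusters
    relative to their share of the candidates."""
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    quota = max(1, target_size // n_clusters)
    selected: list[int] = []
    for cluster_id in range(n_clusters):
        members = np.where(kmeans.labels_ == cluster_id)[0]
        selected.extend(members[:quota].tolist())
    return selected[:target_size]
```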

    Stage 5: Validate

    Input: ~2,000 curated training pairs
    Output: ~2,000 validated training pairs (or fewer, if validation reveals issues)
    Reduction: 0-10% (validation removes the last quality issues)

    The final stage is expert review of the complete curated dataset.

    Expert Review

    A domain expert reviews the final dataset, not by checking every example (they already reviewed individual examples in Stage 3), but by examining the dataset as a whole:

    • Coverage check: Are all document types represented? Are all entity types present? Are edge cases included?
    • Consistency check: Are similar documents labeled consistently? If two nearly identical specifications have different entity extractions, one is wrong.
    • Realism check: Does the dataset reflect what the model will actually encounter in production? Or is it biased toward easy examples?

    Edge Case Audit

    Specifically review the edge cases identified during earlier stages:

    • Documents with non-standard formatting
    • Documents in unusual templates
    • Documents with incomplete information
    • Documents with conflicting information
    • Documents at the boundaries between categories

    Ensure at least 3-5 examples of each edge case type are in the final dataset.

    Format Verification

    Final automated check: run every example through the output schema validator one last time. This catches any formatting issues introduced during the review and correction process.

    Real-World Results

    The construction company's 700GB archive reduced as follows:

    Stage | Volume | Documents/Examples | Reduction
    --- | --- | --- | ---
    Raw archive | 700GB | ~140,000 documents |
    After triage | 195GB | ~45,000 documents | 72%
    After extraction | 4.8GB | ~120,000 sections | 97.5%
    After transformation | 42MB | 8,200 candidate pairs |
    After curation | 8.5MB | 1,800 classification + 3,200 extraction pairs |
    After validation | 8.1MB | 1,750 classification + 3,100 extraction pairs |

    Total time: 4 weeks with 2 ML engineers and 3 domain experts.

    The resulting model achieved 94% accuracy on document classification (12 categories) and 89% accuracy on entity extraction (15 entity types). This was with a 7B parameter model fine-tuned on fewer than 5,000 examples — extracted from 700GB of source material.

    Common Mistakes

    Skipping triage and trying to process everything. Processing 700GB of PDFs through a parsing pipeline takes weeks of compute time and produces massive amounts of irrelevant data. Triage eliminates 70% of that cost upfront.

    Random sampling instead of systematic curation. Randomly selecting 2,000 documents from 140,000 produces a dataset that mirrors the archive's distribution — which is usually dominated by a few document types and a few projects. Systematic curation ensures coverage of all types and diversity across sources.

    Letting ML engineers create ground truth without domain experts. The transformation stage requires domain knowledge to produce correct outputs. An ML engineer can structure the pairs, but a construction engineer must verify the entity values.

    Including outdated documents without flagging. Older documents may reference obsolete standards or use deprecated terminology. If included, flag them so the model learns current practices, not historical ones.

    Ertas Data Suite supports each stage of the data reduction pipeline. Automated triage classifies and deduplicates documents at ingestion. Section-level extraction identifies and isolates relevant content. The labeling interface supports efficient domain expert review of candidate training pairs. Quality scoring, deduplication, and class balancing are built into the curation stage. And the final validation workflow ensures the dataset meets quality thresholds before export. The entire pipeline runs on-premise, keeping 700GB of sensitive enterprise documents within your infrastructure.

