
Synthetic Data Generation in Air-Gapped Environments for Fine-Tuning
How to generate synthetic training data in air-gapped environments — covering paraphrasing, instruction generation, DPO pairs, and seed expansion using local LLMs only.
Enterprise datasets are small. Not "we wish we had more" small — structurally small. A hospital might have 1,500 relevant radiology reports. A law firm might have 800 contracts of the specific type needed. A bank might have 3,000 transaction narratives that fit the classification task.
For fine-tuning, this is often insufficient. Most fine-tuning approaches produce better results with 5,000-50,000 training examples, depending on task complexity. When real data is scarce, synthetic data generation fills the gap — using models to create additional training examples that expand coverage, balance class distribution, and introduce variation.
In air-gapped environments, where no network traffic is allowed, all generation must use local models. This guide covers the practical techniques, the workflow, and the limitations.
Why Synthetic Data Matters for Service Providers
Service providers delivering fine-tuning to enterprise clients face the small-data problem on nearly every project. Enterprise clients have enough data to demonstrate the task but rarely enough for robust model training.
The options are:
- Train on what you have — Works for simple tasks with large datasets. Produces brittle models when data is scarce.
- Collect more real data — Ideal but slow. Requires domain expert time that clients may not have. May take months to accumulate enough.
- Generate synthetic data — Expands the dataset immediately using the existing real data as seeds. Available now, with controllable quality.
Synthetic data isn't a replacement for real data. It's a force multiplier. A dataset of 1,500 real examples augmented with 5,000 synthetic examples typically produces better fine-tuning results than the 1,500 real examples alone — provided the synthetic data is filtered for quality.
Techniques for Synthetic Data Generation
Paraphrasing
Take an existing training example and generate variations that preserve the meaning but change the surface form. This is the simplest and safest augmentation technique.
How it works: Prompt a local LLM with an existing example and ask for 3-5 paraphrases. Filter generated paraphrases for similarity (too similar = no benefit, too different = semantic drift).
When to use: When you need more training volume but the label distribution is already balanced. Paraphrasing doesn't change the distribution — it just increases density.
Quality controls:
- Semantic similarity between original and paraphrase should be 0.7-0.9 (measured with a local embedding model)
- Exact or near-exact copies should be discarded
- Domain-specific terminology should be preserved, not paraphrased away
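A minimal sketch of this loop, assuming Ollama is serving both a generation model and an embedding model locally (the llama3.1:8b and nomic-embed-text tags are placeholders for whatever you have pre-loaded):

```python
import math
import requests

OLLAMA = "http://localhost:11434"
GEN_MODEL = "llama3.1:8b"        # placeholder: any local instruct model
EMB_MODEL = "nomic-embed-text"   # placeholder: any local embedding model

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Call Ollama's /api/generate endpoint and return the full response text."""
    r = requests.post(f"{OLLAMA}/api/generate", json={
        "model": GEN_MODEL, "prompt": prompt, "stream": False,
        "options": {"temperature": temperature},
    })
    r.raise_for_status()
    return r.json()["response"]

def embed(text: str) -> list[float]:
    """Call Ollama's /api/embeddings endpoint and return the vector."""
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": EMB_MODEL, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def paraphrases(example: str, n: int = 5) -> list[str]:
    prompt = (
        f"Rewrite the following text in {n} different ways. Preserve the meaning "
        f"and all domain-specific terminology; vary only the phrasing. "
        f"Return one rewrite per line, with no numbering.\n\nText: {example}"
    )
    candidates = [ln.strip() for ln in generate(prompt).splitlines() if ln.strip()]
    base = embed(example)
    # Keep only candidates inside the 0.7-0.9 similarity band from above.
    return [c for c in candidates if 0.7 <= cosine(base, embed(c)) <= 0.9]
```

The generate(), embed(), and cosine() helpers are reused by the sketches later in this guide.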
Instruction Generation from Documents
Transform raw documents into instruction/completion training pairs. This is the primary technique for building fine-tuning datasets from document collections.
How it works: Given a source document, prompt the model to generate questions or instructions that the document can answer. Then generate (or extract) the completion from the document.
Example: Given a contract clause about termination rights, generate:
- "What are the termination conditions in this agreement?"
- "Summarize the early termination clause."
- "Under what circumstances can either party terminate?"
Each question-answer pair becomes a training example.
When to use: When the client has documents but not instruction/completion pairs. This is the most common enterprise scenario — organizations have knowledge in documents but not in the format fine-tuning requires.
Quality controls:
- Generated questions must be answerable from the source document
- Completions must be factually grounded in the document, not hallucinated
- Questions should vary in type (factual, analytical, summarization) and complexity
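A sketch of this step, reusing the generate() helper from the paraphrasing sketch. The JSON output format and the 0.8 token-overlap grounding threshold are illustrative assumptions; local models break strict JSON often enough that the parse-failure path matters:

```python
import json

def qa_pairs(document: str, n: int = 3) -> list[dict]:
    """Generate question/answer pairs from one document, keeping only grounded ones."""
    prompt = (
        f"Read the document below and write {n} questions it can answer, each "
        f"with the answer taken from the document. Respond as a JSON list of "
        f'{{"question": "...", "answer": "..."}} objects and nothing else.\n\n'
        f"Document:\n{document}"
    )
    raw = generate(prompt, temperature=0.7)
    try:
        pairs = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed output is common with small models; retry or skip

    grounded = []
    doc_lower = document.lower()
    for p in pairs:
        q, a = p.get("question"), p.get("answer")
        if not q or not a:
            continue
        # Crude grounding check: most answer tokens should appear in the source.
        tokens = a.lower().split()
        overlap = sum(t in doc_lower for t in tokens) / max(len(tokens), 1)
        if overlap >= 0.8:
            grounded.append({"instruction": q, "completion": a})
    return grounded
```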
DPO Pair Creation
Direct Preference Optimization (DPO) training requires pairs of responses where one is preferred over the other. Generating these pairs synthetically is valuable when you want to steer model behavior — prefer concise answers, prefer formal tone, prefer answers that cite sources.
How it works: For a given instruction, generate two responses: one following the desired behavior and one violating it. Label the pair as chosen/rejected.
When to use: When the fine-tuning objective includes behavioral alignment (tone, format, safety, citation behavior) beyond just factual accuracy.
Quality controls:
- The difference between chosen and rejected should be clear and consistent
- Both responses should be fluent — the rejected response shouldn't be obviously broken
- Preference direction should align with a documented style guide
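A minimal sketch, again reusing generate(). The style rule here is a hypothetical stand-in for a documented preference; the rejected response is produced by prompting against the rule, so it stays fluent while clearly violating the target behavior:

```python
STYLE_RULE = "Answer in at most three sentences and cite the source section."

def dpo_pair(instruction: str, context: str) -> dict:
    """Generate one chosen/rejected pair for DPO training."""
    chosen = generate(
        f"{STYLE_RULE}\n\nContext:\n{context}\n\nInstruction: {instruction}"
    )
    rejected = generate(
        f"Answer at length and do not cite any sources.\n\n"
        f"Context:\n{context}\n\nInstruction: {instruction}"
    )
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}
```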
Seed Example Expansion
Start with a small set of high-quality, human-verified examples and generate additional examples that match the pattern, distribution, and quality of the seeds.
How it works: Provide 10-20 seed examples as context. Prompt the model to generate new examples following the same pattern. Filter for quality and deduplication.
When to use: When you have a small number of expert-created examples and need to scale up. Works well for specialized tasks where the pattern is consistent (e.g., clinical note summarization, contract clause extraction).
Quality controls:
- Generated examples should match the distribution of seed examples (topic, length, complexity)
- Manual review of a random sample (10-20%) to verify quality
- Semantic similarity between generated examples and seeds should be in a defined range (not too close, not too far)
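A few-shot expansion sketch, assuming seeds are instruction/completion dicts; the Instruction:/Completion: delimiters are an arbitrary convention for parsing the model's output:

```python
import random

def expand_once(seeds: list[dict], k_shot: int = 10) -> dict | None:
    """Generate one new example that follows the pattern of the seed set."""
    shots = random.sample(seeds, min(k_shot, len(seeds)))
    shot_text = "\n\n".join(
        f"Instruction: {s['instruction']}\nCompletion: {s['completion']}"
        for s in shots
    )
    prompt = (
        "Below are examples of one task. Write ONE new example in exactly the "
        "same format, style, and length, on a new but closely related topic.\n\n"
        f"{shot_text}\n\nInstruction:"
    )
    raw = "Instruction:" + generate(prompt, temperature=0.9)
    if "Completion:" not in raw:
        return None  # malformed output; discard and regenerate
    instr, comp = raw.split("Completion:", 1)
    return {"instruction": instr.replace("Instruction:", "").strip(),
            "completion": comp.strip()}
```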
Air-Gapped Constraints
All of the above techniques work with cloud APIs. The constraint in air-gapped environments is that generation must use local models exclusively. This introduces specific limitations and considerations:
Model capability ceiling: Local models (7B-70B parameters) are less capable than frontier API models (GPT-4, Claude) for generation tasks. Generated text quality is lower, hallucination rates are higher, and instruction following is less reliable.
Mitigation: Stricter quality filtering. Generate more examples than you need and aggressively filter to the top 60-70% by quality score.
Throughput constraints: Generating 10,000 synthetic examples on a single GPU takes hours to days, depending on model size and output length. Plan generation time into the project timeline.
Mitigation: Use smaller models (7B-8B) for initial generation at high throughput, then use a larger model (13B-70B) for quality filtering. The generation model doesn't need to be perfect — the filter does.
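A sketch of that two-stage pattern: the judge calls a larger pre-loaded model (the llama3.1:70b tag is a placeholder) at temperature 0, and only candidates scoring 4 or 5 survive:

```python
import requests

JUDGE_MODEL = "llama3.1:70b"  # placeholder: the largest model you pre-loaded

def judge(example: dict) -> int:
    """Score a candidate example 1-5 with a larger local model."""
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": JUDGE_MODEL, "stream": False,
        "options": {"temperature": 0.0},  # deterministic scoring
        "prompt": (
            "Rate the following training example from 1 to 5 for factual "
            "correctness, fluency, and task relevance. Reply with a single digit.\n\n"
            f"Instruction: {example['instruction']}\n"
            f"Completion: {example['completion']}"
        ),
    })
    r.raise_for_status()
    digits = [c for c in r.json()["response"] if c.isdigit()]
    return int(digits[0]) if digits else 1  # unparseable verdict counts as low

# candidates = output of the high-throughput 7B-8B generation pass
# kept = [ex for ex in candidates if judge(ex) >= 4]
```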
No model updates: In an air-gapped environment, you can't download new model weights during the project. Pre-load all models you might need before the network is disconnected.
Comparison: Synthetic Data Tools
Distilabel (Argilla)
Open-source library for synthetic data generation using LLMs. Pipeline-based — define generation steps as a directed graph.
Strengths: Flexible, supports multiple LLM backends, well-documented. Weaknesses: Requires Python expertise to configure. No GUI. Pipeline definitions are code, not configuration. Domain experts can't use it independently.
Gretel
Commercial synthetic data platform focused on privacy-safe data generation. Supports tabular and text data.
Strengths: Strong privacy guarantees, good for tabular data augmentation. Weaknesses: Cloud/hybrid deployment model — not suitable for fully air-gapped environments. Commercial license.
Custom Scripts
Many teams write custom generation scripts — a Python loop calling Ollama's API with prompt templates and quality filters.
Strengths: Complete control, no dependencies beyond the LLM runtime. Weaknesses: Maintenance burden, no built-in quality metrics, no audit trail, not reusable across projects.
Practical Workflow
A step-by-step workflow for synthetic data generation in an air-gapped environment:
Step 1: Select seed examples (1-2 hours)
From your labeled dataset, select 20-50 high-quality, representative examples. These should cover the full range of categories, complexity levels, and formats in your dataset.
Step 2: Configure local LLM (30 minutes)
Deploy the generation model via Ollama or llama.cpp. Test inference speed and output quality with a few sample prompts. Adjust temperature (0.7-0.9 works well for generation) and max tokens.
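A quick smoke test against Ollama's /api/generate endpoint before committing to a long batch run; the eval_count and eval_duration fields in the response give a rough tokens-per-second figure (the model tag is whatever you pre-loaded):

```python
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.1:8b",  # placeholder
    "prompt": "Paraphrase: The agreement terminates upon 30 days written notice.",
    "stream": False,
    "options": {"temperature": 0.8, "num_predict": 256},
})
resp.raise_for_status()
data = resp.json()
# eval_duration is reported in nanoseconds
print(f"{data['eval_count'] / (data['eval_duration'] / 1e9):.1f} tokens/sec")
print(data["response"])
```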
Step 3: Design generation prompts (2-4 hours)
Write and test prompts for each generation technique you'll use. Test against 20 seed examples. Iterate until output quality is consistent.
Step 4: Generate at scale (4-24 hours, depending on volume)
Run batch generation for all techniques. Target 3-5x the volume you need — you'll filter down later.
Step 5: Quality filter (2-4 hours)
Apply automated quality filters:
- Semantic similarity to seeds (keep 0.6-0.9 range)
- Deduplication against real data and within synthetic data (a hashing sketch follows this list)
- Heuristic quality checks (length, coherence, format compliance)
- Optional: use a larger local model as a quality judge
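For the deduplication pass, a cheap first step is hashing a normalized form of each example before any embedding-based comparison; a sketch, assuming instruction/completion dicts:

```python
import hashlib
import re

def norm_key(text: str) -> str:
    """Hash of a lowercased, punctuation-stripped form of the text."""
    return hashlib.sha256(
        re.sub(r"\W+", " ", text.lower()).strip().encode()
    ).hexdigest()

def dedupe(real: list[dict], synthetic: list[dict]) -> list[dict]:
    """Drop synthetic examples that duplicate real data or each other."""
    seen = {norm_key(ex["instruction"] + ex["completion"]) for ex in real}
    unique = []
    for ex in synthetic:
        key = norm_key(ex["instruction"] + ex["completion"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```

This only catches exact and near-exact duplicates; semantically redundant examples still need the embedding-based similarity check above.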
Step 6: Human review (2-8 hours)
Domain experts review a random 10-20% sample of filtered synthetic data. Reject examples that are factually wrong, off-topic, or stylistically inconsistent.
Step 7: Merge with real data (30 minutes)
Combine filtered synthetic data with real labeled data. Mark synthetic examples with a metadata flag (for traceability). Typical final ratio: 20-40% synthetic, 60-80% real.
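A sketch of the merge, tagging provenance with a source field and capping synthetic volume at the 40% ceiling:

```python
import json

def merge_to_jsonl(real: list[dict], synthetic: list[dict],
                   path: str, max_synth_ratio: float = 0.4) -> None:
    """Write real + synthetic examples to JSONL with a provenance flag."""
    # Largest synthetic count keeping synthetic/(real+synthetic) <= max ratio
    cap = int(len(real) * max_synth_ratio / (1 - max_synth_ratio))
    synthetic = synthetic[:cap]
    with open(path, "w") as f:
        for ex in real:
            f.write(json.dumps({**ex, "source": "real"}) + "\n")
        for ex in synthetic:
            f.write(json.dumps({**ex, "source": "synthetic"}) + "\n")
```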
Quality Filtering: The Non-Negotiable Step
Synthetic data without quality filtering is worse than no synthetic data. Unfiltered generated text introduces hallucinations, factual errors, and distribution shifts that degrade model performance.
Minimum filtering pipeline:
- Format compliance: Does the generated example match the required schema?
- Deduplication: Is this example distinct from all other examples (real and synthetic)?
- Semantic relevance: Is this example on-topic for the training task?
- Factual grounding: For examples generated from source documents, can the answer be verified against the source?
- Diversity check: Does the synthetic set cover the same distribution as the real data, or is it clustered? (A heuristic sketch follows.)
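One workable heuristic for that last check, reusing embed() and cosine() from the paraphrasing sketch: compare the mean pairwise similarity of a sample of synthetic examples against the same statistic for real examples. The 0.1 margin is an assumption to tune per dataset:

```python
import random

def mean_pairwise_sim(texts: list[str], sample: int = 50) -> float:
    """Average cosine similarity over a random sample of text pairs."""
    vecs = [embed(t) for t in random.sample(texts, min(sample, len(texts)))]
    sims = [cosine(a, b) for i, a in enumerate(vecs) for b in vecs[i + 1:]]
    return sum(sims) / max(len(sims), 1)

# real, synthetic = lists of {"instruction": ..., "completion": ...} dicts
# real_sim = mean_pairwise_sim([ex["completion"] for ex in real])
# syn_sim = mean_pairwise_sim([ex["completion"] for ex in synthetic])
# if syn_sim > real_sim + 0.1:
#     print("Warning: synthetic set is clustered relative to real data")
```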
Ertas Data Suite's Augment module handles synthetic data generation using local LLMs (via Ollama/llama.cpp) with built-in quality filtering and deduplication. Generation prompts are configured through a visual interface, and every generated example is tagged with its source seed, generation method, and quality scores — all logged to the project audit trail.
Connecting to the Pipeline
Augmented data (real + synthetic) feeds into export, where the combined dataset is formatted for the target use case — JSONL for fine-tuning, chunked text for RAG, or other formats as needed.
For the complete pipeline overview, see How to Build an On-Premise Data Preparation Pipeline for LLM Fine-Tuning.