
Optimizing Local LLM Inference for Data Labeling and Augmentation Tasks
Practical guide to optimizing local LLM inference for data prep — model selection, quantization trade-offs, batch strategies, and throughput tuning for labeling and augmentation.
Local LLM inference transforms data preparation from a fully manual process into a human-in-the-loop workflow. Instead of labeling every document from scratch, an ML engineer or domain expert reviews and corrects AI-generated pre-annotations. The quality of those pre-annotations — and the speed at which they're generated — depends on how well the local inference stack is configured.
This guide covers practical optimization for using local LLMs in data labeling and augmentation: which models to choose, how quantization affects labeling accuracy, how to structure prompts for consistent outputs, and where the throughput bottlenecks are.
Model Selection for Data Prep Tasks
Not all models are equally suited for data preparation work. The requirements differ from conversational AI:
- Instruction following: The model must follow structured output instructions consistently. If you ask for JSON with specific keys, it should produce JSON with those keys — every time, not 95% of the time.
- Short, structured outputs: Most labeling tasks produce outputs of 10–200 tokens (a label, a JSON object, a brief extraction). Models optimized for long-form generation are wasted here.
- Domain vocabulary: The model should handle technical terminology without paraphrasing it. Medical codes, legal citations, and engineering terms need to pass through verbatim.
Recommended Models by Task
| Task | Recommended Models | Size | Notes |
|---|---|---|---|
| Text classification | Mistral 7B Instruct, Llama 3.1 8B Instruct | 7–8B | Fast, accurate for category assignment |
| Named entity extraction | Qwen 2.5 14B Instruct, Llama 3.1 8B Instruct | 8–14B | 14B improves accuracy on uncommon entities |
| Sentiment / topic analysis | Mistral 7B Instruct, Phi-3 Mini | 3.8–7B | Simple tasks; smaller models work fine |
| Document summarization | Qwen 2.5 14B Instruct, Llama 3.1 8B Instruct | 8–14B | Longer output; 14B produces more coherent summaries |
| Synthetic data generation | Qwen 2.5 14B Instruct, Mistral 7B Instruct | 7–14B | Quality scales with model size for generation |
| Multi-label classification | Qwen 2.5 14B Instruct | 14B | Better at handling multiple simultaneous labels |
The 7B sweet spot: For most classification and extraction tasks, 7B instruction-tuned models deliver 90–95% of the accuracy of larger models at 2–3x the throughput. Start with 7B. Move to 14B only when you measure an accuracy gap on your specific task.
Quantization Trade-Offs
Quantization reduces model precision from 16-bit floating point to lower bit widths, shrinking model size and increasing inference speed. The trade-off is accuracy.
Quantization Levels Compared
| Quantization | Size (7B model) | VRAM | Speed (RTX 4070) | Quality Impact |
|---|---|---|---|---|
| F16 (no quant) | ~14 GB | ~15 GB | ~20 tok/s | Baseline |
| Q8_0 | ~7.5 GB | ~8 GB | ~35 tok/s | Negligible loss |
| Q6_K | ~5.8 GB | ~6.5 GB | ~42 tok/s | Minimal loss |
| Q5_K_M | ~5.1 GB | ~5.8 GB | ~48 tok/s | Slight loss on nuanced tasks |
| Q4_K_M | ~4.3 GB | ~5 GB | ~55 tok/s | Measurable loss on complex extraction |
| Q4_0 | ~3.8 GB | ~4.5 GB | ~58 tok/s | Noticeable degradation |
| Q3_K_M | ~3.3 GB | ~4 GB | ~62 tok/s | Significant quality reduction |
| Q2_K | ~2.7 GB | ~3.5 GB | ~65 tok/s | Not recommended for data prep |
Quantization Recommendations for Data Prep
Classification tasks (binary, multi-class): Q4_K_M is fine. Classification outputs are short and constrained. The model either picks the right category or it doesn't — and Q4 preserves enough precision for this decision.
Entity extraction: Q5_K_M or Q8_0 preferred. Extraction requires the model to identify and reproduce specific strings from the input. More aggressive quantization can cause subtle token-level errors (misspellings, truncated entities) that are expensive to catch in review.
Synthetic data generation: Q5_K_M minimum. Generated text quality degrades noticeably below Q5 — sentences become less coherent, technical terminology gets garbled, and the output requires more human editing.
General recommendation: Start with Q4_K_M for initial testing. If accuracy on your specific task is acceptable, stay there for throughput. If you see quality issues, step up to Q5_K_M or Q8_0. Don't go below Q4_K_M for data preparation work.
Inference Backends: Ollama vs llama.cpp vs vLLM
Ollama
Best for: Most data preparation setups. Easy model management, OpenAI-compatible API, automatic GPU detection.
Ollama wraps llama.cpp with a model registry and HTTP server. The overhead is minimal — an HTTP request/response adds under 1ms per inference call, which is negligible compared to the 100ms–10s that inference itself takes.
Configuration for data prep:
```bash
# Set concurrent request limit (default is 1)
OLLAMA_NUM_PARALLEL=4 ollama serve

# Pull a model
ollama pull mistral:7b-instruct-v0.3-q4_K_M

# Or pull a specific quantization
ollama pull qwen2.5:14b-instruct-q5_K_M
```
Key setting: OLLAMA_NUM_PARALLEL controls how many requests Ollama processes concurrently. For data prep, set this to 2–4 if your GPU has enough VRAM. Each parallel request loads an additional context window into VRAM.
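As a concrete starting point, here is a minimal sketch of a single labeling call against Ollama's OpenAI-compatible endpoint. It assumes Ollama is serving on its default port (11434) and that the requests library is installed; the model tag and prompt text are illustrative:

```python
import requests

# Ollama exposes an OpenAI-compatible chat endpoint on its default port
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def classify(document_text: str) -> str:
    """Send one document to the local model and return the raw label text."""
    response = requests.post(OLLAMA_URL, json={
        "model": "mistral:7b-instruct-v0.3-q4_K_M",  # illustrative model tag
        "messages": [
            {"role": "system", "content": (
                "Classify the document as one of: contract, invoice, "
                "correspondence, report, other. Output only the category "
                "name in lowercase.")},
            {"role": "user", "content": document_text},
        ],
        "temperature": 0,   # deterministic output for labeling
        "max_tokens": 10,   # labels are short; cap generation
    }, timeout=120)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"].strip()

print(classify("Invoice #4521 - Amount Due: $1,200..."))
```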
llama.cpp (Direct)
Best for: Air-gapped environments where Ollama's model registry is unreachable, or when you need fine-grained control over inference parameters.
llama.cpp's llama-server provides an OpenAI-compatible endpoint without Ollama's model management layer. You point it directly at a GGUF file.
```bash
# Start server with specific model and tuned parameters
llama-server \
  --model ./models/mistral-7b-instruct-v0.3.Q4_K_M.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 99 \
  --parallel 4 \
  --batch-size 512
```
Key parameters:
- --ctx-size: Context window size. For document labeling, 4096 tokens handles most single-document tasks. Increase to 8192 or 16384 for long documents.
- --n-gpu-layers: Number of model layers to offload to the GPU. Set to 99 to offload everything. Reduce if you run out of VRAM.
- --parallel: Concurrent request slots. Each slot reserves context memory.
- --batch-size: Tokens processed per batch during prompt evaluation. Higher values speed up prompt processing but use more memory.
vLLM
Best for: High-throughput batch processing with multiple concurrent requests. vLLM's PagedAttention mechanism handles many concurrent requests more efficiently than llama.cpp.
Trade-offs: Heavier installation (Python + PyTorch + CUDA). More complex setup. Less suitable for interactive labeling workflows where you process one document at a time. Overkill for most data preparation scenarios unless you're running batch inference on 100K+ documents.
Recommendation for most service providers: Use Ollama. It's the simplest path to working local inference. Switch to llama.cpp direct if you're in an air-gapped environment. Consider vLLM only if batch throughput is a measured bottleneck.
Batch Inference Strategies
Data labeling and augmentation naturally lend themselves to batch processing. You have N documents to label — process them as efficiently as possible.
Sequential Single-Document Processing
Process one document at a time. Simple, easy to debug, easy to resume if interrupted.
Throughput: For a 7B Q4 model generating ~50 tokens per label, expect 30–50 labels per minute on an RTX 4070. That's 1,800–3,000 labels per hour.
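A sketch of this pattern, reusing the hypothetical classify helper from the Ollama section above and writing each result to a JSONL file so an interrupted run can resume where it left off (the file path and document structure are illustrative):

```python
import json
from pathlib import Path

def label_all(documents: dict[str, str], out_path: str = "labels.jsonl") -> None:
    """Label documents one at a time, skipping any already in the output file."""
    out_file = Path(out_path)
    done = set()
    if out_file.exists():
        done = {json.loads(line)["id"] for line in out_file.open()}

    with out_file.open("a") as f:
        for doc_id, text in documents.items():
            if doc_id in done:
                continue  # resume support: labeled in a previous run
            label = classify(text)
            f.write(json.dumps({"id": doc_id, "label": label}) + "\n")
            f.flush()  # persist immediately; an interrupt loses at most one result
```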
Parallel Batch Processing
Send multiple inference requests concurrently. Ollama's OLLAMA_NUM_PARALLEL or llama.cpp's --parallel enables this.
Throughput improvement: 2–4 parallel requests typically improve throughput by 1.5–3x (not linear, because GPU compute is shared). With 4 parallel requests on an RTX 4070, expect 4,500–8,000 labels per hour.
VRAM cost: Each parallel slot reserves context memory. At 4096 context size, each slot consumes ~200–400 MB of VRAM (varies by model). Ensure you have headroom.
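One way to drive the parallel slots from the client side is a thread pool sized to match the server's slot count. A minimal sketch, again reusing the hypothetical classify helper:

```python
from concurrent.futures import ThreadPoolExecutor

def label_parallel(documents: dict[str, str], workers: int = 4) -> dict[str, str]:
    """Label documents with up to `workers` concurrent requests.

    Match `workers` to OLLAMA_NUM_PARALLEL (or llama-server's --parallel);
    client threads beyond the server's slot count just queue.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        labels = pool.map(classify, documents.values())
    return dict(zip(documents.keys(), labels))
```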
Prompt Batching (Multiple Documents per Request)
Include multiple short documents in a single prompt, asking the model to label all of them. This amortizes the per-request overhead.
```text
Label each document with one of: [contract, invoice, correspondence, report].
Respond with a JSON array.

Document 1: "Dear Sir, we are writing to confirm..."
Document 2: "Invoice #4521 - Amount Due..."
Document 3: "Q3 Performance Summary..."
```
Trade-off: Higher throughput (fewer requests), but more complex error handling. If the model hallucinates on one document in the batch, you need to identify and re-process it. Best for simple classification tasks where errors are easy to detect.
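A sketch of the error handling this implies: parse the batch response, and if it is malformed or the label count doesn't match, fall back to re-processing documents individually. Here classify_raw is a hypothetical helper that sends a prompt and returns the raw completion text, like classify above:

```python
import json

def label_batch(docs: list[str]) -> list[str]:
    """Label several short documents in one request; fall back on failure."""
    numbered = "\n".join(f'Document {i + 1}: "{d}"' for i, d in enumerate(docs))
    prompt = (
        "Label each document with one of: [contract, invoice, correspondence, report].\n"
        "Respond with a JSON array of lowercase labels, one per document, in order.\n\n"
        + numbered
    )
    raw = classify_raw(prompt)  # hypothetical helper returning raw completion text
    try:
        labels = json.loads(raw)
        if isinstance(labels, list) and len(labels) == len(docs):
            return labels
    except json.JSONDecodeError:
        pass
    # Malformed batch output: re-process each document individually
    return [classify(d) for d in docs]
```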
Context Window Management
Document labeling often requires the full document text as input context. This creates a tension: longer documents need larger context windows, but larger context windows slow inference.
Context Size vs. Speed
| Context Size | Prompt Eval Speed | Generation Speed | VRAM per Slot |
|---|---|---|---|
| 2048 tokens | Fast | Fast | ~100–200 MB |
| 4096 tokens | Fast | Fast | ~200–400 MB |
| 8192 tokens | Moderate | Slightly slower | ~400–800 MB |
| 16384 tokens | Slower | Noticeably slower | ~800 MB–1.6 GB |
| 32768 tokens | Much slower | Slower | ~1.6–3.2 GB |
Practical approach: Set context size to match your actual document lengths, not the model's maximum. If 95% of your documents fit in 4096 tokens, set the context to 4096. For the 5% that are longer, either truncate (label based on the first N tokens) or process them separately with a larger context.
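A small routing sketch for that split, using the rough heuristic of ~4 characters per token for English text; the reserve for the prompt template and generated output is an assumption to adjust per task:

```python
def route_by_length(documents: list[str], ctx_tokens: int = 4096, reserve: int = 600):
    """Split documents into those that fit the context window and those that don't.

    Uses the rough ~4 characters/token heuristic for English; `reserve`
    leaves room for the prompt template and the generated output.
    """
    budget_chars = (ctx_tokens - reserve) * 4
    fits = [d for d in documents if len(d) <= budget_chars]
    too_long = [d for d in documents if len(d) > budget_chars]
    return fits, too_long
```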
Chunking for Long Documents
Documents exceeding the context window need chunking. For labeling tasks:
- Split the document into overlapping chunks (e.g., 3000 tokens with a 500-token overlap)
- Label each chunk independently
- Merge labels, resolving conflicts in overlap regions
This works well for entity extraction and classification but poorly for tasks requiring global document understanding (like summarization or document-level sentiment).
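A minimal sketch of the chunk-and-merge approach for entity extraction, again using the ~4 characters/token approximation (a real pipeline would use the model's tokenizer for exact counts):

```python
def chunk_text(text: str, chunk_tokens: int = 3000, overlap_tokens: int = 500) -> list[str]:
    """Split text into overlapping chunks using the ~4 chars/token heuristic."""
    chunk_chars = chunk_tokens * 4
    step = (chunk_tokens - overlap_tokens) * 4
    return [text[i:i + chunk_chars]
            for i in range(0, max(len(text) - overlap_tokens * 4, 1), step)]

def merge_entities(chunk_results: list[set[str]]) -> set[str]:
    """Union entities found across chunks; the overlap regions produce
    duplicates by design, and the set merge deduplicates them."""
    merged: set[str] = set()
    for entities in chunk_results:
        merged |= entities
    return merged
```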
Prompt Engineering for Consistent Labels
Data preparation demands consistency. A classification prompt that produces "Contract" for one document and "contract" for an identical document creates downstream problems.
Structured Output Prompting
```text
You are a document classifier. Classify the following document into exactly one category.

Categories: contract, invoice, correspondence, report, other

Rules:
- Output ONLY the category name in lowercase
- No explanation, no punctuation, no additional text
- If uncertain, output "other"

Document:
{document_text}

Category:
```
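Even with rules this strict, outputs occasionally drift (capitalization, trailing punctuation, stray quotes). A small normalization step, sketched below, catches most of it and flags anything unexpected for review:

```python
VALID_LABELS = {"contract", "invoice", "correspondence", "report", "other"}

def normalize_label(raw: str) -> str | None:
    """Lowercase and strip model output; return None (flag for human review)
    if the result is not one of the allowed categories."""
    label = raw.strip().strip('."\'').lower()
    return label if label in VALID_LABELS else None
```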
JSON Output for Complex Labels
```text
Extract entities from the following document. Output valid JSON only.

Schema:
{"parties": [string], "date": string|null, "amount": string|null, "type": string}

Rules:
- Use null for missing fields
- Dates in ISO 8601 format (YYYY-MM-DD)
- Amounts include currency symbol
- No markdown, no explanation

Document:
{document_text}

JSON:
```
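JSON outputs deserve the same defensive parsing. A sketch that validates against the schema above and returns None for anything malformed, so the document can be queued for re-processing or manual review:

```python
import json

FENCE = "`" * 3  # models sometimes wrap JSON in markdown fences despite instructions

def parse_extraction(raw: str) -> dict | None:
    """Parse and minimally validate the model's JSON output."""
    cleaned = (raw.strip()
               .removeprefix(FENCE + "json").removeprefix(FENCE)
               .removesuffix(FENCE).strip())
    try:
        obj = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    required = {"parties", "date", "amount", "type"}
    if not isinstance(obj, dict) or not required <= obj.keys():
        return None
    return obj
```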
Key Prompt Principles for Data Prep
- Constrain the output space: List all valid labels explicitly. The model should choose from your list, not invent labels.
- Specify the output format exactly: "lowercase only", "JSON only", "no explanation". Repeat the constraint.
- Provide 2–3 examples for ambiguous categories. Few-shot examples are the most effective way to improve label consistency.
- Set temperature to 0 for classification and extraction tasks. You want deterministic output, not creative variation.
Throughput Estimates by Configuration
Realistic throughput numbers for common data prep tasks:
| Task | Model | Quant | Hardware | Throughput |
|---|---|---|---|---|
| Binary classification | Mistral 7B | Q4_K_M | RTX 4070 | ~3,000 docs/hour |
| Multi-class (5 cats) | Mistral 7B | Q4_K_M | RTX 4070 | ~2,500 docs/hour |
| Entity extraction (3–5 entities) | Qwen 2.5 14B | Q5_K_M | RTX 4080 | ~1,200 docs/hour |
| Document summarization (100 words) | Qwen 2.5 14B | Q4_K_M | RTX 4080 | ~400 docs/hour |
| Synthetic generation (500 words) | Qwen 2.5 14B | Q4_K_M | RTX 4080 | ~120 docs/hour |
These assume single-document processing. Parallel inference (2–4 concurrent requests) improves these numbers by 1.5–3x.
The Accuracy vs. Speed Trade-Off
Pre-annotation doesn't need to be perfect. It needs to be good enough that human review and correction is faster than labeling from scratch.
The threshold: if pre-annotations are >80% correct, human reviewers spend most of their time confirming correct labels (fast) rather than correcting wrong ones (slow). At >90% accuracy, the workflow is dominated by confirmation clicks.
Practical implication: Don't over-optimize for accuracy at the cost of throughput. A 7B Q4 model that's 85% accurate at 3,000 docs/hour is more useful than a 14B Q8 model that's 92% accurate at 800 docs/hour — because the human review time saved by the extra 7% accuracy doesn't offset the 3.7x throughput reduction.
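A back-of-envelope sketch of that comparison, treating inference and review as sequential for simplicity; the per-label review times (2 seconds to confirm, 10 seconds to correct) are illustrative assumptions:

```python
def review_hours(docs: int, accuracy: float, confirm_s: float = 2.0, fix_s: float = 10.0) -> float:
    """Human review time: confirming correct labels is fast, fixing wrong ones is slow."""
    return (docs * accuracy * confirm_s + docs * (1 - accuracy) * fix_s) / 3600

docs = 10_000
fast = docs / 3000 + review_hours(docs, 0.85)  # 7B Q4: ~3.3 h inference + ~8.9 h review
slow = docs / 800 + review_hours(docs, 0.92)   # 14B Q8: ~12.5 h inference + ~7.3 h review
print(f"7B Q4 total: {fast:.1f} h vs 14B Q8 total: {slow:.1f} h")  # ~12.2 h vs ~19.8 h
```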
Measure accuracy on a sample of your actual data before committing to a model configuration.
Configuration in Practice
Ertas Data Suite integrates local LLM inference as a built-in co-pilot for labeling and augmentation. The application communicates with Ollama or llama.cpp running on localhost, using the model configuration you specify. Prompt templates for classification, extraction, and generation are built in, with the option to customize prompts per project.
The combination of a native desktop application with a local inference backend means there's no network hop between the labeling interface and the model. Click a document, see the pre-annotation, accept or correct — all happening on local hardware with no data leaving the machine.
For service providers optimizing their data preparation workflow, the biggest lever is model selection and quantization, not exotic infrastructure. Get the right model at the right quantization on a workstation with adequate VRAM, and the throughput follows.