
Optimizing Local LLM Inference for Data Labeling and Augmentation Tasks
Practical guide to optimizing local LLM inference for data prep — model selection, quantization trade-offs, batch strategies, and throughput tuning for labeling and augmentation.
Local LLM inference transforms data preparation from a fully manual process into a human-in-the-loop workflow. Instead of labeling every document from scratch, an ML engineer or domain expert reviews and corrects AI-generated pre-annotations. The quality of those pre-annotations — and the speed at which they're generated — depends on how well the local inference stack is configured.
This guide covers practical optimization for using local LLMs in data labeling and augmentation: which models to choose, how quantization affects labeling accuracy, how to structure prompts for consistent outputs, and where the throughput bottlenecks are.
Model Selection for Data Prep Tasks
Not all models are equally suited for data preparation work. The requirements differ from conversational AI:
- Instruction following: The model must follow structured output instructions consistently. If you ask for JSON with specific keys, it should produce JSON with those keys — every time, not 95% of the time.
- Short, structured outputs: Most labeling tasks produce outputs of 10–200 tokens (a label, a JSON object, a brief extraction). Models optimized for long-form generation are wasted here.
- Domain vocabulary: The model should handle technical terminology without paraphrasing it. Medical codes, legal citations, and engineering terms need to pass through verbatim.
Recommended Models by Task
| Task | Recommended Models | Size | Notes |
|---|---|---|---|
| Text classification | Mistral 7B Instruct, Llama 3.1 8B Instruct | 7–8B | Fast, accurate for category assignment |
| Named entity extraction | Qwen 2.5 14B Instruct, Llama 3.1 8B Instruct | 8–14B | 14B improves accuracy on uncommon entities |
| Sentiment / topic analysis | Mistral 7B Instruct, Phi-3 Mini | 3.8–7B | Simple tasks; smaller models work fine |
| Document summarization | Qwen 2.5 14B Instruct, Llama 3.1 8B Instruct | 8–14B | Longer output; 14B produces more coherent summaries |
| Synthetic data generation | Qwen 2.5 14B Instruct, Mistral 7B Instruct | 7–14B | Quality scales with model size for generation |
| Multi-label classification | Qwen 2.5 14B Instruct | 14B | Better at handling multiple simultaneous labels |
The 7B sweet spot: For most classification and extraction tasks, 7B instruction-tuned models deliver 90–95% of the accuracy of larger models at 2–3x the throughput. Start with 7B. Move to 14B only when you measure an accuracy gap on your specific task.
Quantization Trade-Offs
Quantization reduces model precision from 16-bit floating point to lower bit widths, shrinking model size and increasing inference speed. The trade-off is accuracy.
Quantization Levels Compared
| Quantization | Size (7B model) | VRAM | Speed (RTX 4070) | Quality Impact |
|---|---|---|---|---|
| F16 (no quant) | ~14 GB | ~15 GB | ~20 tok/s | Baseline |
| Q8_0 | ~7.5 GB | ~8 GB | ~35 tok/s | Negligible loss |
| Q6_K | ~5.8 GB | ~6.5 GB | ~42 tok/s | Minimal loss |
| Q5_K_M | ~5.1 GB | ~5.8 GB | ~48 tok/s | Slight loss on nuanced tasks |
| Q4_K_M | ~4.3 GB | ~5 GB | ~55 tok/s | Measurable loss on complex extraction |
| Q4_0 | ~3.8 GB | ~4.5 GB | ~58 tok/s | Noticeable degradation |
| Q3_K_M | ~3.3 GB | ~4 GB | ~62 tok/s | Significant quality reduction |
| Q2_K | ~2.7 GB | ~3.5 GB | ~65 tok/s | Not recommended for data prep |
Quantization Recommendations for Data Prep
Classification tasks (binary, multi-class): Q4_K_M is fine. Classification outputs are short and constrained. The model either picks the right category or it doesn't — and Q4 preserves enough precision for this decision.
Entity extraction: Q5_K_M or Q8_0 preferred. Extraction requires the model to identify and reproduce specific strings from the input. More aggressive quantization can cause subtle token-level errors (misspellings, truncated entities) that are expensive to catch in review.
Synthetic data generation: Q5_K_M minimum. Generated text quality degrades noticeably below Q5 — sentences become less coherent, technical terminology gets garbled, and the output requires more human editing.
General recommendation: Start with Q4_K_M for initial testing. If accuracy on your specific task is acceptable, stay there for throughput. If you see quality issues, step up to Q5_K_M or Q8_0. Don't go below Q4_K_M for data preparation work.
Inference Backends: Ollama vs llama.cpp vs vLLM
Ollama
Best for: Most data preparation setups. Easy model management, OpenAI-compatible API, automatic GPU detection.
Ollama wraps llama.cpp with a model registry and HTTP server. The overhead is minimal — an HTTP request/response adds under 1ms per inference call, which is negligible compared to the 100ms–10s that inference itself takes.
Configuration for data prep:
```bash
# Set concurrent request limit (default is 1)
OLLAMA_NUM_PARALLEL=4 ollama serve

# Pull a model
ollama pull mistral:7b-instruct-v0.3-q4_K_M

# Or pull a specific quantization
ollama pull qwen2.5:14b-instruct-q5_K_M
```
Key setting: OLLAMA_NUM_PARALLEL controls how many requests Ollama processes concurrently. For data prep, set this to 2–4 if your GPU has enough VRAM. Each parallel request loads an additional context window into VRAM.
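As a concrete starting point, here is a minimal sketch of a single labeling call against Ollama's OpenAI-compatible endpoint. It assumes Ollama is serving on its default port (11434) and that the requests library is installed; the model tag and prompt text are illustrative:

```python
import requests

# Ollama exposes an OpenAI-compatible chat endpoint on its default port
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def classify(document_text: str) -> str:
    """Send one document to the local model and return the raw label text."""
    response = requests.post(OLLAMA_URL, json={
        "model": "mistral:7b-instruct-v0.3-q4_K_M",  # illustrative model tag
        "messages": [
            {"role": "system", "content": (
                "Classify the document as one of: contract, invoice, "
                "correspondence, report, other. Output only the category "
                "name in lowercase.")},
            {"role": "user", "content": document_text},
        ],
        "temperature": 0,   # deterministic output for labeling
        "max_tokens": 10,   # labels are short; cap generation
    }, timeout=120)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"].strip()

print(classify("Invoice #4521 - Amount Due: $1,200..."))
```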
llama.cpp (Direct)
Best for: Air-gapped environments where Ollama's model registry is unreachable, or when you need fine-grained control over inference parameters.
llama.cpp's llama-server provides an OpenAI-compatible endpoint without Ollama's model management layer. You point it directly at a GGUF file.
```bash
# Start server with specific model and tuned parameters
llama-server \
  --model ./models/mistral-7b-instruct-v0.3.Q4_K_M.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 99 \
  --parallel 4 \
  --batch-size 512
```
Key parameters:
- --ctx-size: Context window size. For document labeling, 4096 tokens handles most single-document tasks. Increase to 8192 or 16384 for long documents.
- --n-gpu-layers: Number of model layers to offload to the GPU. Set to 99 to offload everything. Reduce if you run out of VRAM.
- --parallel: Concurrent request slots. Each slot reserves context memory.
- --batch-size: Tokens processed per batch during prompt evaluation. Higher values speed up prompt processing but use more memory.
vLLM
Best for: High-throughput batch processing with multiple concurrent requests. vLLM's PagedAttention mechanism handles many concurrent requests more efficiently than llama.cpp.
Trade-offs: Heavier installation (Python + PyTorch + CUDA). More complex setup. Less suitable for interactive labeling workflows where you process one document at a time. Overkill for most data preparation scenarios unless you're running batch inference on 100K+ documents.
Recommendation for most service providers: Use Ollama. It's the simplest path to working local inference. Switch to llama.cpp direct if you're in an air-gapped environment. Consider vLLM only if batch throughput is a measured bottleneck.
Batch Inference Strategies
Data labeling and augmentation naturally lend themselves to batch processing. You have N documents to label — process them as efficiently as possible.
Sequential Single-Document Processing
Process one document at a time. Simple, easy to debug, easy to resume if interrupted.
Throughput: For a 7B Q4 model generating ~50 tokens per label, expect 30–50 labels per minute on an RTX 4070. That's 1,800–3,000 labels per hour.
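A sketch of this pattern, reusing the hypothetical classify helper from the Ollama section above and writing each result to a JSONL file so an interrupted run can resume where it left off (the file path and document structure are illustrative):

```python
import json
from pathlib import Path

def label_all(documents: dict[str, str], out_path: str = "labels.jsonl") -> None:
    """Label documents one at a time, skipping any already in the output file."""
    out_file = Path(out_path)
    done = set()
    if out_file.exists():
        done = {json.loads(line)["id"] for line in out_file.open()}

    with out_file.open("a") as f:
        for doc_id, text in documents.items():
            if doc_id in done:
                continue  # resume support: labeled in a previous run
            label = classify(text)
            f.write(json.dumps({"id": doc_id, "label": label}) + "\n")
            f.flush()  # persist immediately; an interrupt loses at most one result
```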
Parallel Batch Processing
Send multiple inference requests concurrently. Ollama's OLLAMA_NUM_PARALLEL or llama.cpp's --parallel enables this.
Throughput improvement: 2–4 parallel requests typically improve throughput by 1.5–3x (not linear, because GPU compute is shared). With 4 parallel requests on an RTX 4070, expect 4,500–8,000 labels per hour.
VRAM cost: Each parallel slot reserves context memory. At 4096 context size, each slot consumes ~200–400 MB of VRAM (varies by model). Ensure you have headroom.
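One way to drive the parallel slots from the client side is a thread pool sized to match the server's slot count. A minimal sketch, again reusing the hypothetical classify helper:

```python
from concurrent.futures import ThreadPoolExecutor

def label_parallel(documents: dict[str, str], workers: int = 4) -> dict[str, str]:
    """Label documents with up to `workers` concurrent requests.

    Match `workers` to OLLAMA_NUM_PARALLEL (or llama-server's --parallel);
    client threads beyond the server's slot count just queue.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        labels = pool.map(classify, documents.values())
    return dict(zip(documents.keys(), labels))
```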
Prompt Batching (Multiple Documents per Request)
Include multiple short documents in a single prompt, asking the model to label all of them. This amortizes the per-request overhead.
```text
Label each document with one of: [contract, invoice, correspondence, report].
Respond with a JSON array.

Document 1: "Dear Sir, we are writing to confirm..."
Document 2: "Invoice #4521 - Amount Due..."
Document 3: "Q3 Performance Summary..."
```
Trade-off: Higher throughput (fewer requests), but more complex error handling. If the model hallucinates on one document in the batch, you need to identify and re-process it. Best for simple classification tasks where errors are easy to detect.
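A sketch of the error handling this implies: parse the batch response, and if it is malformed or the label count doesn't match, fall back to re-processing documents individually. Here classify_raw is a hypothetical helper that sends a prompt and returns the raw completion text, like classify above:

```python
import json

def label_batch(docs: list[str]) -> list[str]:
    """Label several short documents in one request; fall back on failure."""
    numbered = "\n".join(f'Document {i + 1}: "{d}"' for i, d in enumerate(docs))
    prompt = (
        "Label each document with one of: [contract, invoice, correspondence, report].\n"
        "Respond with a JSON array of lowercase labels, one per document, in order.\n\n"
        + numbered
    )
    raw = classify_raw(prompt)  # hypothetical helper returning raw completion text
    try:
        labels = json.loads(raw)
        if isinstance(labels, list) and len(labels) == len(docs):
            return labels
    except json.JSONDecodeError:
        pass
    # Malformed batch output: re-process each document individually
    return [classify(d) for d in docs]
```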
Context Window Management
Document labeling often requires the full document text as input context. This creates a tension: longer documents need larger context windows, but larger context windows slow inference.
Context Size vs. Speed
| Context Size | Prompt Eval Speed | Generation Speed | VRAM per Slot |
|---|---|---|---|
| 2048 tokens | Fast | Fast | ~100–200 MB |
| 4096 tokens | Fast | Fast | ~200–400 MB |
| 8192 tokens | Moderate | Slightly slower | ~400–800 MB |
| 16384 tokens | Slower | Noticeably slower | ~800 MB–1.6 GB |
| 32768 tokens | Much slower | Slower | ~1.6–3.2 GB |
Practical approach: Set context size to match your actual document lengths, not the model's maximum. If 95% of your documents fit in 4096 tokens, set the context to 4096. For the 5% that are longer, either truncate (label based on the first N tokens) or process them separately with a larger context.
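A small routing sketch for that split, using the rough heuristic of ~4 characters per token for English text; the reserve for the prompt template and generated output is an assumption to adjust per task:

```python
def route_by_length(documents: list[str], ctx_tokens: int = 4096, reserve: int = 600):
    """Split documents into those that fit the context window and those that don't.

    Uses the rough ~4 characters/token heuristic for English; `reserve`
    leaves room for the prompt template and the generated output.
    """
    budget_chars = (ctx_tokens - reserve) * 4
    fits = [d for d in documents if len(d) <= budget_chars]
    too_long = [d for d in documents if len(d) > budget_chars]
    return fits, too_long
```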
Chunking for Long Documents
Documents exceeding the context window need chunking. For labeling tasks:
- Split the document into overlapping chunks (e.g., 3000 tokens with a 500-token overlap)
- Label each chunk independently
- Merge labels, resolving conflicts in overlap regions
This works well for entity extraction and classification but poorly for tasks requiring global document understanding (like summarization or document-level sentiment).
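A minimal sketch of the chunk-and-merge approach for entity extraction, again using the ~4 characters/token approximation (a real pipeline would use the model's tokenizer for exact counts):

```python
def chunk_text(text: str, chunk_tokens: int = 3000, overlap_tokens: int = 500) -> list[str]:
    """Split text into overlapping chunks using the ~4 chars/token heuristic."""
    chunk_chars = chunk_tokens * 4
    step = (chunk_tokens - overlap_tokens) * 4
    return [text[i:i + chunk_chars]
            for i in range(0, max(len(text) - overlap_tokens * 4, 1), step)]

def merge_entities(chunk_results: list[set[str]]) -> set[str]:
    """Union entities found across chunks; the overlap regions produce
    duplicates by design, and the set merge deduplicates them."""
    merged: set[str] = set()
    for entities in chunk_results:
        merged |= entities
    return merged
```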
Prompt Engineering for Consistent Labels
Data preparation demands consistency. A classification prompt that produces "Contract" for one document and "contract" for an identical document creates downstream problems.
Structured Output Prompting
```text
You are a document classifier. Classify the following document into exactly one category.

Categories: contract, invoice, correspondence, report, other

Rules:
- Output ONLY the category name in lowercase
- No explanation, no punctuation, no additional text
- If uncertain, output "other"

Document:
{document_text}

Category:
```
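Even with rules this strict, outputs occasionally drift (capitalization, trailing punctuation, stray quotes). A small normalization step, sketched below, catches most of it and flags anything unexpected for review:

```python
VALID_LABELS = {"contract", "invoice", "correspondence", "report", "other"}

def normalize_label(raw: str) -> str | None:
    """Lowercase and strip model output; return None (flag for human review)
    if the result is not one of the allowed categories."""
    label = raw.strip().strip('."\'').lower()
    return label if label in VALID_LABELS else None
```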
JSON Output for Complex Labels
```text
Extract entities from the following document. Output valid JSON only.

Schema:
{"parties": [string], "date": string|null, "amount": string|null, "type": string}

Rules:
- Use null for missing fields
- Dates in ISO 8601 format (YYYY-MM-DD)
- Amounts include currency symbol
- No markdown, no explanation

Document:
{document_text}

JSON:
```
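JSON outputs deserve the same defensive parsing. A sketch that validates against the schema above and returns None for anything malformed, so the document can be queued for re-processing or manual review:

```python
import json

FENCE = "`" * 3  # models sometimes wrap JSON in markdown fences despite instructions

def parse_extraction(raw: str) -> dict | None:
    """Parse and minimally validate the model's JSON output."""
    cleaned = (raw.strip()
               .removeprefix(FENCE + "json").removeprefix(FENCE)
               .removesuffix(FENCE).strip())
    try:
        obj = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    required = {"parties", "date", "amount", "type"}
    if not isinstance(obj, dict) or not required <= obj.keys():
        return None
    return obj
```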
Key Prompt Principles for Data Prep
- Constrain the output space: List all valid labels explicitly. The model should choose from your list, not invent labels.
- Specify the output format exactly: "lowercase only", "JSON only", "no explanation". Repeat the constraint.
- Provide 2–3 examples for ambiguous categories. Few-shot examples are the most effective way to improve label consistency.
- Set temperature to 0 for classification and extraction tasks. You want deterministic output, not creative variation.
Throughput Estimates by Configuration
Realistic throughput numbers for common data prep tasks:
| Task | Model | Quant | Hardware | Throughput |
|---|---|---|---|---|
| Binary classification | Mistral 7B | Q4_K_M | RTX 4070 | ~3,000 docs/hour |
| Multi-class (5 cats) | Mistral 7B | Q4_K_M | RTX 4070 | ~2,500 docs/hour |
| Entity extraction (3–5 entities) | Qwen 2.5 14B | Q5_K_M | RTX 4080 | ~1,200 docs/hour |
| Document summarization (100 words) | Qwen 2.5 14B | Q4_K_M | RTX 4080 | ~400 docs/hour |
| Synthetic generation (500 words) | Qwen 2.5 14B | Q4_K_M | RTX 4080 | ~120 docs/hour |
These assume single-document processing. Parallel inference (2–4 concurrent requests) improves these numbers by 1.5–3x.
The Accuracy vs. Speed Trade-Off
Pre-annotation doesn't need to be perfect. It needs to be good enough that human review and correction is faster than labeling from scratch.
The threshold: if pre-annotations are >80% correct, human reviewers spend most of their time confirming correct labels (fast) rather than correcting wrong ones (slow). At >90% accuracy, the workflow is dominated by confirmation clicks.
Practical implication: Don't over-optimize for accuracy at the cost of throughput. A 7B Q4 model that's 85% accurate at 3,000 docs/hour is more useful than a 14B Q8 model that's 92% accurate at 800 docs/hour — because the human review time saved by the extra 7% accuracy doesn't offset the 3.7x throughput reduction.
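A back-of-envelope sketch of that comparison, treating inference and review as sequential for simplicity; the per-label review times (2 seconds to confirm, 10 seconds to correct) are illustrative assumptions:

```python
def review_hours(docs: int, accuracy: float, confirm_s: float = 2.0, fix_s: float = 10.0) -> float:
    """Human review time: confirming correct labels is fast, fixing wrong ones is slow."""
    return (docs * accuracy * confirm_s + docs * (1 - accuracy) * fix_s) / 3600

docs = 10_000
fast = docs / 3000 + review_hours(docs, 0.85)  # 7B Q4: ~3.3 h inference + ~8.9 h review
slow = docs / 800 + review_hours(docs, 0.92)   # 14B Q8: ~12.5 h inference + ~7.3 h review
print(f"7B Q4 total: {fast:.1f} h vs 14B Q8 total: {slow:.1f} h")  # ~12.2 h vs ~19.8 h
```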
Measure accuracy on a sample of your actual data before committing to a model configuration.
Configuration in Practice
Ertas Data Suite integrates local LLM inference as a built-in co-pilot for labeling and augmentation. The application communicates with Ollama or llama.cpp running on localhost, using the model configuration you specify. Prompt templates for classification, extraction, and generation are built in, with the option to customize prompts per project.
The combination of a native desktop application with a local inference backend means there's no network hop between the labeling interface and the model. Click a document, see the pre-annotation, accept or correct — all happening on local hardware with no data leaving the machine.
For service providers optimizing their data preparation workflow, the biggest lever is model selection and quantization, not exotic infrastructure. Get the right model at the right quantization on a workstation with adequate VRAM, and the throughput follows.