
    Optimizing Local LLM Inference for Data Labeling and Augmentation Tasks

    Practical guide to optimizing local LLM inference for data prep — model selection, quantization trade-offs, batch strategies, and throughput tuning for labeling and augmentation.

Ertas Team

    Local LLM inference transforms data preparation from a fully manual process into a human-in-the-loop workflow. Instead of labeling every document from scratch, an ML engineer or domain expert reviews and corrects AI-generated pre-annotations. The quality of those pre-annotations — and the speed at which they're generated — depends on how well the local inference stack is configured.

    This guide covers practical optimization for using local LLMs in data labeling and augmentation: which models to choose, how quantization affects labeling accuracy, how to structure prompts for consistent outputs, and where the throughput bottlenecks are.


    Model Selection for Data Prep Tasks

    Not all models are equally suited for data preparation work. The requirements differ from conversational AI:

    • Instruction following: The model must follow structured output instructions consistently. If you ask for JSON with specific keys, it should produce JSON with those keys — every time, not 95% of the time.
    • Short, structured outputs: Most labeling tasks produce outputs of 10–200 tokens (a label, a JSON object, a brief extraction). Models optimized for long-form generation are wasted here.
    • Domain vocabulary: The model should handle technical terminology without paraphrasing it. Medical codes, legal citations, and engineering terms need to pass through verbatim.

| Task | Recommended Models | Size | Notes |
|---|---|---|---|
| Text classification | Mistral 7B Instruct, Llama 3.1 8B Instruct | 7–8B | Fast, accurate for category assignment |
| Named entity extraction | Qwen 2.5 14B Instruct, Llama 3.1 8B Instruct | 8–14B | 14B improves accuracy on uncommon entities |
| Sentiment / topic analysis | Mistral 7B Instruct, Phi-3 Mini | 3.8–7B | Simple tasks; smaller models work fine |
| Document summarization | Qwen 2.5 14B Instruct, Llama 3.1 8B Instruct | 8–14B | Longer output; 14B produces more coherent summaries |
| Synthetic data generation | Qwen 2.5 14B Instruct, Mistral 7B Instruct | 7–14B | Quality scales with model size for generation |
| Multi-label classification | Qwen 2.5 14B Instruct | 14B | Better at handling multiple simultaneous labels |

    The 7B sweet spot: For most classification and extraction tasks, 7B instruction-tuned models deliver 90–95% of the accuracy of larger models at 2–3x the throughput. Start with 7B. Move to 14B only when you measure an accuracy gap on your specific task.
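A quick way to measure that gap: label a reviewed sample with both models and compare agreement against the human labels. A minimal sketch (sample size and label values are illustrative):

```python
def agreement(predicted, gold):
    """Fraction of documents where the model's label matches the reference label."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Label the same human-reviewed sample (say, 200 documents) with the 7B and
# the 14B model, then compare: if the 14B gain is within noise on your task,
# stay on 7B and keep the 2-3x throughput.
```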


    Quantization Trade-Offs

    Quantization reduces model precision from 16-bit floating point to lower bit widths, shrinking model size and increasing inference speed. The trade-off is accuracy.

    Quantization Levels Compared

| Quantization | Size (7B model) | VRAM | Speed (RTX 4070) | Quality Impact |
|---|---|---|---|---|
| F16 (no quant) | ~14 GB | ~15 GB | ~20 tok/s | Baseline |
| Q8_0 | ~7.5 GB | ~8 GB | ~35 tok/s | Negligible loss |
| Q6_K | ~5.8 GB | ~6.5 GB | ~42 tok/s | Minimal loss |
| Q5_K_M | ~5.1 GB | ~5.8 GB | ~48 tok/s | Slight loss on nuanced tasks |
| Q4_K_M | ~4.3 GB | ~5 GB | ~55 tok/s | Measurable loss on complex extraction |
| Q4_0 | ~3.8 GB | ~4.5 GB | ~58 tok/s | Noticeable degradation |
| Q3_K_M | ~3.3 GB | ~4 GB | ~62 tok/s | Significant quality reduction |
| Q2_K | ~2.7 GB | ~3.5 GB | ~65 tok/s | Not recommended for data prep |

    Quantization Recommendations for Data Prep

    Classification tasks (binary, multi-class): Q4_K_M is fine. Classification outputs are short and constrained. The model either picks the right category or it doesn't — and Q4 preserves enough precision for this decision.

    Entity extraction: Q5_K_M or Q8_0 preferred. Extraction requires the model to identify and reproduce specific strings from the input. Lower quantization can cause subtle token-level errors (misspellings, truncated entities) that are expensive to catch in review.

    Synthetic data generation: Q5_K_M minimum. Generated text quality degrades noticeably below Q5 — sentences become less coherent, technical terminology gets garbled, and the output requires more human editing.

    General recommendation: Start with Q4_K_M for initial testing. If accuracy on your specific task is acceptable, stay there for throughput. If you see quality issues, step up to Q5_K_M or Q8_0. Don't go below Q4_K_M for data preparation work.


    Inference Backends: Ollama vs llama.cpp vs vLLM

    Ollama

    Best for: Most data preparation setups. Easy model management, OpenAI-compatible API, automatic GPU detection.

    Ollama wraps llama.cpp with a model registry and HTTP server. The overhead is minimal — an HTTP request/response adds under 1ms per inference call, which is negligible compared to the 100ms–10s that inference itself takes.

    Configuration for data prep:

```shell
# Set concurrent request limit (default is 1)
OLLAMA_NUM_PARALLEL=4 ollama serve

# Pull a model
ollama pull mistral:7b-instruct-v0.3-q4_K_M

# Or pull a specific quantization
ollama pull qwen2.5:14b-instruct-q5_K_M
```

    Key setting: OLLAMA_NUM_PARALLEL controls how many requests Ollama processes concurrently. For data prep, set this to 2–4 if your GPU has enough VRAM. Each parallel request loads an additional context window into VRAM.
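The parallel slots are easiest to exploit from a small client that fans requests out concurrently. A minimal sketch against Ollama's default `/api/generate` endpoint; the prompt text and model tag are illustrative, swap in your own:

```python
import concurrent.futures
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def label_one(text, model="mistral:7b-instruct-v0.3-q4_K_M"):
    """Send one classification prompt to Ollama and return the raw label text."""
    payload = json.dumps({
        "model": model,
        "prompt": f"Classify this document as contract, invoice, "
                  f"correspondence, or report:\n\n{text}\n\nCategory:",
        "stream": False,
        "options": {"temperature": 0},  # deterministic labels
    }).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

def label_batch(docs, workers=4, call=label_one):
    """Fan out requests; match workers to OLLAMA_NUM_PARALLEL so none queue."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(call, docs))
```

Keeping `workers` equal to `OLLAMA_NUM_PARALLEL` avoids requests sitting in Ollama's internal queue while still saturating the GPU.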

    llama.cpp (Direct)

    Best for: Air-gapped environments where Ollama's model registry is unreachable, or when you need fine-grained control over inference parameters.

    llama.cpp's llama-server provides an OpenAI-compatible endpoint without Ollama's model management layer. You point it directly at a GGUF file.

```shell
# Start server with specific model and tuned parameters
llama-server \
  --model ./models/mistral-7b-instruct-v0.3.Q4_K_M.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 99 \
  --parallel 4 \
  --batch-size 512
```

    Key parameters:

    • --ctx-size: Context window size. For document labeling, 4096 tokens handles most single-document tasks. Increase to 8192 or 16384 for long documents.
    • --n-gpu-layers: Number of model layers to offload to GPU. Set to 99 to offload everything. Reduce if you run out of VRAM.
    • --parallel: Concurrent request slots. Each slot reserves context memory.
    • --batch-size: Tokens processed per batch during prompt evaluation. Higher values speed up prompt processing but use more memory.

    vLLM

    Best for: High-throughput batch processing with multiple concurrent requests. vLLM's PagedAttention mechanism handles many concurrent requests more efficiently than llama.cpp.

    Trade-offs: Heavier installation (Python + PyTorch + CUDA). More complex setup. Less suitable for interactive labeling workflows where you process one document at a time. Overkill for most data preparation scenarios unless you're running batch inference on 100K+ documents.

    Recommendation for most service providers: Use Ollama. It's the simplest path to working local inference. Switch to llama.cpp direct if you're in an air-gapped environment. Consider vLLM only if batch throughput is a measured bottleneck.


    Batch Inference Strategies

    Data labeling and augmentation naturally lend themselves to batch processing. You have N documents to label — process them as efficiently as possible.

    Sequential Single-Document Processing

    Process one document at a time. Simple, easy to debug, easy to resume if interrupted.

    Throughput: For a 7B Q4 model generating ~50 tokens per label, expect 30–50 labels per minute on an RTX 4070. That's 1,800–3,000 labels per hour.
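At multi-hour runtimes, resumability matters more than raw speed. A sketch of the sequential loop with crash-safe resume, assuming documents arrive as an id-to-text dict and labels are appended to a JSONL file; `call` stands in for whatever inference client you use:

```python
import json
import pathlib

def label_corpus(docs, call, out_path="labels.jsonl"):
    """Label docs (a dict of id -> text) one at a time, appending results to
    a JSONL file. On restart, ids already present in the file are skipped,
    so an interrupted run resumes where it left off."""
    out = pathlib.Path(out_path)
    done = set()
    if out.exists():
        with out.open() as f:
            done = {json.loads(line)["id"] for line in f}
    with out.open("a") as f:
        for doc_id, text in docs.items():
            if doc_id in done:
                continue
            f.write(json.dumps({"id": doc_id, "label": call(text)}) + "\n")
            f.flush()  # make each completed label durable immediately
```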

    Parallel Batch Processing

    Send multiple inference requests concurrently. Ollama's OLLAMA_NUM_PARALLEL or llama.cpp's --parallel enables this.

    Throughput improvement: 2–4 parallel requests typically improve throughput by 1.5–3x (not linear, because GPU compute is shared). With 4 parallel requests on an RTX 4070, expect 4,500–8,000 labels per hour.

    VRAM cost: Each parallel slot reserves context memory. At 4096 context size, each slot consumes ~200–400 MB of VRAM (varies by model). Ensure you have headroom.

    Prompt Batching (Multiple Documents per Request)

    Include multiple short documents in a single prompt, asking the model to label all of them. This amortizes the per-request overhead.

```
Label each document with one of: [contract, invoice, correspondence, report].
Respond with a JSON array.

Document 1: "Dear Sir, we are writing to confirm..."
Document 2: "Invoice #4521 - Amount Due..."
Document 3: "Q3 Performance Summary..."
```

    Trade-off: Higher throughput (fewer requests), but more complex error handling. If the model hallucinates on one document in the batch, you need to identify and re-process it. Best for simple classification tasks where errors are easy to detect.
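The batching and the error check can be sketched as follows, using the category list from the example above. A `None` return from `parse_batch_labels` signals the caller to fall back to one-document-per-request for that batch:

```python
import json

CATEGORIES = ["contract", "invoice", "correspondence", "report"]

def build_batch_prompt(docs):
    """Pack several short documents into one classification prompt."""
    header = ("Label each document with one of: ["
              + ", ".join(CATEGORIES) + "].\n"
              "Respond with a JSON array of category strings, one per document.\n\n")
    return header + "\n".join(f'Document {i}: "{d}"' for i, d in enumerate(docs, 1))

def parse_batch_labels(raw, n_docs):
    """Validate the model's batch response. Returns the label list, or None
    if the response is malformed, the count is wrong, or a label is invented."""
    try:
        labels = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if (not isinstance(labels, list) or len(labels) != n_docs
            or any(label not in CATEGORIES for label in labels)):
        return None
    return labels
```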


    Context Window Management

    Document labeling often requires the full document text as input context. This creates a tension: longer documents need larger context windows, but larger context windows slow inference.

    Context Size vs. Speed

| Context Size | Prompt Eval Speed | Generation Speed | VRAM per Slot |
|---|---|---|---|
| 2048 tokens | Fast | Fast | ~100–200 MB |
| 4096 tokens | Fast | Fast | ~200–400 MB |
| 8192 tokens | Moderate | Slightly slower | ~400–800 MB |
| 16384 tokens | Slower | Noticeably slower | ~800 MB–1.6 GB |
| 32768 tokens | Much slower | Slower | ~1.6–3.2 GB |

    Practical approach: Set context size to match your actual document lengths, not the model's maximum. If 95% of your documents fit in 4096 tokens, set the context to 4096. For the 5% that are longer, either truncate (label based on the first N tokens) or process them separately with a larger context.

    Chunking for Long Documents

    Documents exceeding the context window need chunking. For labeling tasks:

    1. Split the document into overlapping chunks (e.g., 3000 tokens with 500 token overlap)
    2. Label each chunk independently
    3. Merge labels, resolving conflicts in overlap regions

    This works well for entity extraction and classification but poorly for tasks requiring global document understanding (like summarization or document-level sentiment).
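Steps 1 and 3 can be sketched as follows for an extraction task, where merging means deduplicating entities found in overlapping chunks; the token lists stand in for output from a real tokenizer:

```python
def chunk_tokens(tokens, size=3000, overlap=500):
    """Split a token list into overlapping chunks (step 1 above)."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk already reaches the end of the document
    return chunks

def merge_entities(per_chunk_entities):
    """Union entity labels across chunks, dropping duplicates that appear in
    overlap regions while preserving first-seen order (step 3 above)."""
    seen, merged = set(), []
    for entities in per_chunk_entities:
        for entity in entities:
            if entity not in seen:
                seen.add(entity)
                merged.append(entity)
    return merged
```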


    Prompt Engineering for Consistent Labels

    Data preparation demands consistency. A classification prompt that produces "Contract" for one document and "contract" for an identical document creates downstream problems.

    Structured Output Prompting

```
You are a document classifier. Classify the following document into exactly one category.

Categories: contract, invoice, correspondence, report, other

Rules:
- Output ONLY the category name in lowercase
- No explanation, no punctuation, no additional text
- If uncertain, output "other"

Document:
{document_text}

Category:
```

    JSON Output for Complex Labels

```
Extract entities from the following document. Output valid JSON only.

Schema:
{"parties": [string], "date": string|null, "amount": string|null, "type": string}

Rules:
- Use null for missing fields
- Dates in ISO 8601 format (YYYY-MM-DD)
- Amounts include currency symbol
- No markdown, no explanation

Document:
{document_text}

JSON:
```

    Key Prompt Principles for Data Prep

    1. Constrain the output space: List all valid labels explicitly. The model should choose from your list, not invent labels.
    2. Specify the output format exactly: "lowercase only", "JSON only", "no explanation". Repeat the constraint.
    3. Provide 2–3 examples for ambiguous categories. Few-shot examples are the most effective way to improve label consistency.
    4. Set temperature to 0 for classification and extraction tasks. You want deterministic output, not creative variation.
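Even at temperature 0, it's worth validating each response against the closed label set rather than trusting the raw string. A minimal guard, using the categories from the classifier prompt above:

```python
VALID_LABELS = {"contract", "invoice", "correspondence", "report", "other"}

def normalize_label(raw):
    """Map a model response onto the closed label set. Anything unexpected
    becomes 'other' so it surfaces in human review instead of silently
    creating a new category downstream."""
    label = raw.strip().strip('".').lower()
    return label if label in VALID_LABELS else "other"
```

This catches the "Contract" vs "contract" inconsistency mentioned above, plus stray quotes and trailing punctuation.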

    Throughput Estimates by Configuration

    Realistic throughput numbers for common data prep tasks:

| Task | Model | Quant | Hardware | Throughput |
|---|---|---|---|---|
| Binary classification | Mistral 7B | Q4_K_M | RTX 4070 | ~3,000 docs/hour |
| Multi-class (5 cats) | Mistral 7B | Q4_K_M | RTX 4070 | ~2,500 docs/hour |
| Entity extraction (3–5 entities) | Qwen 2.5 14B | Q5_K_M | RTX 4080 | ~1,200 docs/hour |
| Document summarization (100 words) | Qwen 2.5 14B | Q4_K_M | RTX 4080 | ~400 docs/hour |
| Synthetic generation (500 words) | Qwen 2.5 14B | Q4_K_M | RTX 4080 | ~120 docs/hour |

    These assume single-document processing. Parallel inference (2–4 concurrent requests) improves these numbers by 1.5–3x.


    The Accuracy vs. Speed Trade-Off

    Pre-annotation doesn't need to be perfect. It needs to be good enough that human review and correction is faster than labeling from scratch.

    The threshold: if pre-annotations are >80% correct, human reviewers spend most of their time confirming correct labels (fast) rather than correcting wrong ones (slow). At >90% accuracy, the workflow is dominated by confirmation clicks.

    Practical implication: Don't over-optimize for accuracy at the cost of throughput. A 7B Q4 model that's 85% accurate at 3,000 docs/hour is more useful than a 14B Q8 model that's 92% accurate at 800 docs/hour — because the human review time saved by the extra 7% accuracy doesn't offset the 3.7x throughput reduction.

    Measure accuracy on a sample of your actual data before committing to a model configuration.
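The arithmetic behind that comparison can be checked directly. The review-time constants below (2 seconds to confirm a correct label, 15 seconds to fix a wrong one) are assumptions, not measurements; plug in your own:

```python
def total_hours(n_docs, docs_per_hour, accuracy, confirm_s=2.0, correct_s=15.0):
    """Machine inference time plus human review time for a pre-annotation pass."""
    machine = n_docs / docs_per_hour
    human = (n_docs * accuracy * confirm_s
             + n_docs * (1 - accuracy) * correct_s) / 3600
    return machine + human

# 10,000 documents: 7B Q4 at 85% / 3,000 docs/hr vs 14B Q8 at 92% / 800 docs/hr
fast = total_hours(10_000, 3_000, 0.85)  # ~14.3 hours end to end
slow = total_hours(10_000, 800, 0.92)    # ~20.9 hours end to end
```

Under these assumptions the faster, less accurate model finishes the full pass roughly six hours sooner, which is the point of the paragraph above.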


    Configuration in Practice

    Ertas Data Suite integrates local LLM inference as a built-in co-pilot for labeling and augmentation. The application communicates with Ollama or llama.cpp running on localhost, using the model configuration you specify. Prompt templates for classification, extraction, and generation are built in, with the option to customize prompts per project.

    The combination of a native desktop application with a local inference backend means there's no network hop between the labeling interface and the model. Click a document, see the pre-annotation, accept or correct — all happening on local hardware with no data leaving the machine.

    For service providers optimizing their data preparation workflow, the biggest lever is model selection and quantization, not exotic infrastructure. Get the right model at the right quantization on a workstation with adequate VRAM, and the throughput follows.
