    Local LLM-Assisted Data Labeling Without Data Egress

    How to use local LLMs via Ollama and llama.cpp for AI-assisted data labeling — covering pre-annotation, quality checks, and active learning without sending data off-premise.

    Ertas Team

    Data labeling is the most labor-intensive stage of the data preparation pipeline. A 10,000-example dataset with complex labeling requirements can take a team of annotators weeks. Multiply that by the number of client projects a service provider handles in a year, and labeling becomes the primary bottleneck to throughput.

    Cloud-based labeling APIs (OpenAI, Anthropic, Google) can accelerate this dramatically — a model can pre-annotate thousands of records in minutes. But for regulated enterprise clients, sending data to cloud APIs is not an option. The data cannot leave the building.

    The practical alternative: use local LLMs running on-premise to assist with labeling. Not to replace human annotators, but to reduce the workload per annotator by 40-60%. This guide covers the setup, model selection, and workflow for local LLM-assisted labeling.


    What Local LLMs Can Do for Labeling

    Local LLMs assist labeling in three ways:

    1. Pre-Annotation (Draft Labels)

    The model generates a proposed label for each record. A human annotator then reviews and corrects the proposal instead of labeling from scratch.

    For a text classification task with 10 categories, a well-prompted local 7B model typically achieves 60-80% accuracy on draft labels. That means 60-80% of records need only verification (fast), not labeling from scratch (slow). The time savings are substantial — annotator throughput roughly doubles.

    For more complex tasks (entity extraction, multi-label classification, instruction/completion pair generation), accuracy varies more, but even 40% correct pre-annotations save significant time.
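
    As a concrete sketch of the pre-annotation loop: the snippet below asks a local Ollama server (setup is covered later in this post) for one draft label per record. The category set, prompt wording, and fallback behavior are illustrative assumptions, not a fixed recipe.

    ```python
    import json
    import urllib.request

    OLLAMA_URL = "http://localhost:11434/api/generate"
    CATEGORIES = ["billing", "technical", "account", "shipping", "other"]  # hypothetical schema

    def ollama_generate(prompt: str, model: str = "llama3.1:8b", temperature: float = 0.0) -> str:
        """One non-streaming completion from a local Ollama server."""
        payload = {"model": model, "prompt": prompt, "stream": False,
                   "options": {"temperature": temperature}}
        req = urllib.request.Request(OLLAMA_URL, data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["response"]

    def draft_label(text: str) -> str:
        """Generate a draft label constrained to the known category set."""
        prompt = ("Classify the record into exactly one category.\n"
                  f"Categories: {', '.join(CATEGORIES)}\n"
                  f"Record: {text}\n"
                  "Answer with the category name only.")
        answer = ollama_generate(prompt).strip().lower()
        return answer if answer in CATEGORIES else "other"  # safe fallback if the model drifts
    ```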

    2. Label Quality Checks

    After human annotators apply labels, the model reviews for consistency:

    • Does this label match the content?
    • Is this label consistent with how similar records were labeled?
    • Are there annotation patterns that suggest fatigue or systematic error?

    This catches errors that would otherwise survive into the training set. Human annotators operating at speed make mistakes — typically 5-15% error rate depending on task complexity and annotator expertise. A quality check pass catches a significant fraction of those.
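
    A hedged sketch of the first check above, reusing the ollama_generate helper from the pre-annotation example; the AGREE/DISAGREE convention is an assumption, not a fixed protocol.

    ```python
    def check_label(text: str, label: str) -> bool:
        """Second-pass review: does the human-applied label fit the record?

        Records where the model disagrees get flagged for re-review rather
        than changed automatically.
        """
        prompt = (f"Record: {text}\n"
                  f"Applied label: {label}\n"
                  "Does the label match the record? Answer AGREE or DISAGREE.")
        verdict = ollama_generate(prompt)  # helper from the pre-annotation sketch
        return verdict.strip().upper().startswith("AGREE")
    ```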

    3. Active Learning Prioritization

    Not all unlabeled records are equally informative for model training. Active learning uses model uncertainty to prioritize which records should be labeled next — focusing annotator time on the records that will most improve model performance.

    With a local LLM, you can compute prediction confidence for each unlabeled record and present the most uncertain records first. This produces a better training set per unit of annotator effort.
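
    Ollama's generate endpoint does not return token log-probabilities, so a simple stand-in for confidence is self-consistency: sample the draft label several times at a nonzero temperature and use label agreement as the score. A sketch, again reusing the helper above:

    ```python
    from collections import Counter

    def label_confidence(text: str, samples: int = 5) -> tuple[str, float]:
        """Majority draft label and its agreement fraction across samples.

        Agreement across temperature-0.7 samples stands in for prediction
        confidence; low agreement marks records worth labeling by hand first.
        """
        prompt = ("Classify the record into exactly one category.\n"
                  f"Categories: {', '.join(CATEGORIES)}\n"
                  f"Record: {text}\n"
                  "Answer with the category name only.")
        votes = Counter(ollama_generate(prompt, temperature=0.7).strip().lower()
                        for _ in range(samples))
        label, count = votes.most_common(1)[0]
        return label, count / samples

    # Active-learning ordering: surface the most uncertain records first.
    # queue = sorted(unlabeled_texts, key=lambda t: label_confidence(t)[1])
    ```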


    Setting Up Local LLM Inference

    Two practical options for running LLMs locally:

    Ollama

    Ollama provides the simplest path to local model inference. Install the binary, pull a model, and access it via a local API endpoint.
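
    For example, after `ollama pull llama3.1:8b`, a quick check that the server is reachable and the model is present (default port assumed):

    ```python
    import json
    import urllib.request

    # GET /api/tags lists the models available on the local Ollama server.
    with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
        models = [m["name"] for m in json.load(resp)["models"]]

    print(models)  # expect something like ['llama3.1:8b']
    ```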

    Hardware requirements for labeling tasks:

    • 7B models (Mistral 7B, Llama 3 8B): 8 GB RAM minimum, 16 GB recommended. Runs on CPU but GPU acceleration dramatically improves throughput.
    • 13B models: 16 GB RAM minimum. Notably better at complex labeling tasks.
    • 70B+ models: Requires serious GPU infrastructure (48+ GB VRAM). Usually overkill for labeling assistance.

    For most labeling use cases, a 7B-8B instruction-following model provides the best throughput-to-accuracy ratio.

    llama.cpp

    More control, more configuration. llama.cpp runs GGUF-quantized models directly on CPU or GPU with fine-grained control over context length, batch size, and quantization level.

    Relevant for service providers who need to:

    • Run on hardware without CUDA-capable GPUs (Apple Silicon, AMD, CPU-only servers)
    • Maximize throughput on specific hardware
    • Deploy in environments where installing Ollama isn't possible
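
    A minimal equivalent through the llama-cpp-python bindings, assuming a GGUF model file is already on disk (the path and quantization level below are placeholders):

    ```python
    from llama_cpp import Llama

    # n_gpu_layers=0 keeps inference on CPU; -1 offloads every layer to the GPU.
    llm = Llama(model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",
                n_ctx=2048, n_gpu_layers=0, verbose=False)

    out = llm("Classify the record into exactly one category.\n"
              "Categories: billing, technical, account, shipping, other\n"
              "Record: My invoice total is wrong.\n"
              "Answer with the category name only.",
              max_tokens=8, temperature=0)
    print(out["choices"][0]["text"].strip())
    ```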

    Model Selection for Labeling Tasks

    Not all models are equally suited to labeling. The key property is instruction following — the model needs to reliably produce structured output in the format you specify.

    Model | Size | Instruction Following | Structured Output | Labeling Accuracy (typical)
    --- | --- | --- | --- | ---
    Llama 3.1 8B Instruct | 8B | Excellent | Good | 65-80%
    Mistral 7B Instruct v0.3 | 7B | Very Good | Good | 60-75%
    Qwen 2.5 7B Instruct | 7B | Very Good | Very Good | 65-80%
    Phi-3.5 Mini Instruct | 3.8B | Good | Fair | 50-65%
    Llama 3.1 70B Instruct | 70B | Excellent | Excellent | 80-90%

    The accuracy ranges are estimates for a typical text classification task with 5-10 categories. Your mileage will vary based on domain, task complexity, and prompt design.
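
    One way to lean on a model's structured-output ability is Ollama's format: "json" option, which constrains decoding to valid JSON that you then validate against your label schema. A sketch reusing OLLAMA_URL and CATEGORIES from the pre-annotation example; the schema and retry policy are assumptions.

    ```python
    def draft_label_json(text: str, model: str = "llama3.1:8b", retries: int = 2) -> dict | None:
        """Request a JSON-constrained draft and validate it against the schema."""
        payload = {"model": model,
                   "prompt": (f'Classify the record. Respond as JSON with keys "label" '
                              f'(one of {CATEGORIES}) and "rationale".\nRecord: {text}'),
                   "stream": False,
                   "format": "json",  # Ollama constrains decoding to valid JSON
                   "options": {"temperature": 0}}
        for _ in range(retries + 1):
            req = urllib.request.Request(OLLAMA_URL, data=json.dumps(payload).encode(),
                                         headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req) as resp:
                draft = json.loads(json.load(resp)["response"])
            if draft.get("label") in CATEGORIES:  # schema check; retry on drift
                return draft
        return None  # persistent failures go straight to manual labeling
    ```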


    Batch vs. Interactive Labeling

    Two workflow patterns:

    Batch Pre-Annotation

    Run the model over the entire unlabeled dataset, generating draft labels for all records. Annotators then work through the queue, verifying or correcting each draft.

    Advantages: Maximizes GPU utilization. Annotators always have a queue of pre-annotated records ready to review. Simple to implement.

    Disadvantages: Initial batch processing takes time (hours for large datasets on modest hardware). Draft labels are generated without benefit of any human corrections — the model doesn't improve during the batch.

    Interactive Co-Pilot Labeling

    The model generates a draft label in real time as the annotator opens each record. The annotator sees the suggestion immediately and accepts, modifies, or rejects it.

    Advantages: Feels more natural. The prompt can incorporate recently labeled examples (few-shot), improving accuracy as the session progresses.

    Disadvantages: Requires low-latency inference (sub-second per record). Throughput is capped by single-record inference speed. On CPU-only hardware with a 7B model, latency may be 5-15 seconds per record — acceptable for simple tasks, frustrating for fast annotators.
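
    The few-shot mechanics are straightforward: rebuild the prompt per record from the annotator's latest confirmed decisions. A sketch with a hypothetical session store, reusing CATEGORIES from earlier:

    ```python
    def copilot_prompt(text: str, recent: list[tuple[str, str]], k: int = 4) -> str:
        """Few-shot prompt seeded with the annotator's most recent confirmed labels."""
        shots = "\n\n".join(f"Record: {r}\nLabel: {lab}" for r, lab in recent[-k:])
        return (f"Categories: {', '.join(CATEGORIES)}\n\n"
                f"{shots}\n\n"
                f"Record: {text}\nLabel:")

    # After each human decision, append (record, label) to `recent` so the next
    # suggestion reflects the session's actual corrections.
    ```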

    For most service provider workflows, batch pre-annotation is the practical starting point. Switch to interactive co-pilot labeling when hardware supports sub-second inference.


    Comparison: Local LLM Labeling vs. Existing Tools

    Label Studio

    The most widely deployed open-source annotation tool. Label Studio provides a web-based interface for multiple annotation types (classification, NER, bounding boxes, etc.) with project management, multi-annotator support, and basic ML backend integration.

    Strengths: Mature, flexible, supports many annotation types. Weaknesses: Self-hosted deployment adds operational complexity. ML backend integration (for pre-annotation) requires custom code. No built-in local LLM support — you need to build the bridge yourself.

    Prodigy

    Explosion's commercial annotation tool. Built for efficiency — designed around active learning and rapid annotation workflows.

    Strengths: Fast annotation interface, built-in active learning, good NLP integration. Weaknesses: Commercial license required. Runs as a locally served, single-user tool by default, which limits multi-annotator workflows. Python-centric — domain experts need technical assistance to configure.

    Cloud Labeling Services (Scale AI, Labelbox)

    Enterprise-grade labeling platforms with workforce management, quality control, and model-in-the-loop features.

    Strengths: Powerful, scalable, well-integrated quality management. Weaknesses: Data must leave the client's infrastructure. Not an option for regulated industries with zero-egress requirements.


    Practical Workflow: From Unlabeled to Training-Ready

    Here's a realistic workflow for a service provider handling a labeling project for a regulated enterprise client:

    Phase 1: Setup (Day 1)

    • Deploy local LLM inference (Ollama or llama.cpp) on client hardware
    • Design labeling schema with domain experts
    • Write and test labeling prompts against a 50-record sample
    • Measure pre-annotation accuracy and iterate on prompts until accuracy exceeds 60% (see the sketch below)
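
    That accuracy gate is a few lines of code; a sketch reusing draft_label from earlier, with the record field names as assumptions:

    ```python
    def preannotation_accuracy(sample: list[dict]) -> float:
        """Fraction of draft labels matching expert gold labels on the pilot sample."""
        hits = sum(draft_label(r["text"]) == r["gold_label"] for r in sample)
        return hits / len(sample)

    # Iterate on the prompt until this clears 0.60 before starting Phase 2.
    ```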

    Phase 2: Batch Pre-Annotation (Day 2)

    • Run the model over the full dataset
    • Generate draft labels with confidence scores
    • Flag low-confidence records for priority human review (see the sketch below)
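
    A sketch of the Phase 2 output, reusing label_confidence from the active-learning section: pre-annotate every record and write a review queue sorted least-confident first. Field names and the 0.6 threshold are assumptions.

    ```python
    import csv

    def build_review_queue(records: list[dict], out_path: str = "review_queue.csv",
                           threshold: float = 0.6) -> None:
        """Draft-label the full dataset and queue low-confidence records first."""
        rows = []
        for r in records:
            label, conf = label_confidence(r["text"])  # self-consistency confidence
            rows.append({"id": r["id"], "draft_label": label, "confidence": conf,
                         "priority_review": conf < threshold})
        rows.sort(key=lambda row: row["confidence"])  # least confident first
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    ```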

    Phase 3: Human Review (Days 3-10+)

    • Domain experts review pre-annotated records
    • High-confidence correct labels: verify and approve (fast)
    • Low-confidence or incorrect labels: correct manually
    • Track annotator agreement on overlapping records

    Phase 4: Quality Assurance (Ongoing)

    • Run the local LLM as a quality checker on completed labels
    • Flag inconsistencies for re-review
    • Compute inter-annotator agreement metrics (see the kappa sketch below)
    • Export quality report for the audit trail
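
    For the agreement metric, Cohen's kappa over records labeled by two annotators is a standard starting point; a minimal sketch:

    ```python
    from collections import Counter

    def cohens_kappa(a: list[str], b: list[str]) -> float:
        """Cohen's kappa for two annotators' labels on the same overlapping records."""
        n = len(a)
        p_o = sum(x == y for x, y in zip(a, b)) / n     # observed agreement
        ca, cb = Counter(a), Counter(b)
        p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)  # agreement expected by chance
        return (p_o - p_e) / (1 - p_e)
    ```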

    Phase 5: Iteration

    • After the initial labeling round, use the labeled data to improve prompts
    • Re-run pre-annotation on remaining unlabeled records with improved prompts
    • Each iteration typically improves pre-annotation accuracy by 5-10%

    Hardware Recommendations

    For a service provider deploying labeling infrastructure at a client site:

    Scenario | Hardware | Model | Expected Throughput
    --- | --- | --- | ---
    Budget / CPU-only | 32 GB RAM workstation | Llama 3.1 8B Q4 | 50-100 records/hour (batch)
    Mid-range | NVIDIA RTX 4090 (24 GB) | Llama 3.1 8B Q8 | 500-1,000 records/hour (batch)
    Production | NVIDIA A100 (40 GB) | Llama 3.1 70B Q4 | 200-400 records/hour (batch, higher accuracy)
    Apple Silicon | M3 Max (64 GB unified) | Llama 3.1 8B Q8 | 200-400 records/hour (batch)

    These throughput numbers are for a typical text classification task with 200-token input records and 50-token output. Entity extraction and instruction generation tasks are slower.


    What This Enables

    Ertas Data Suite's Label module integrates local LLM-assisted labeling directly into the data preparation pipeline. The built-in co-pilot runs via Ollama or llama.cpp, supports batch pre-annotation and interactive labeling, and logs every label decision to the project audit trail. Domain experts work in a visual interface — no Python, no command line, no configuration files.

    The key advantage over assembling Label Studio + Ollama + custom glue code: everything runs in a single application with a unified data model. Labels applied in the Label module feed directly into augmentation and export without file format conversions or data transfers.


    Connecting to the Pipeline

    Labeled data feeds into augmentation, where synthetic data generation expands the dataset — especially important when real labeled data is scarce (the typical enterprise case).

    For the complete pipeline overview, see How to Build an On-Premise Data Preparation Pipeline for LLM Fine-Tuning.
