Ertas for Document Classification

Fine-tune AI models that automatically categorize documents by type, department, urgency, or custom taxonomies — with accuracy that generic models cannot match.

The Challenge

Organizations process thousands of documents daily — contracts, invoices, correspondence, reports, applications, and compliance filings — and routing each document to the right team or workflow depends on accurate classification. Manual classification is slow, inconsistent, and scales poorly. When a single misrouted document can delay a legal filing or lose a time-sensitive business opportunity, the cost of errors is significant.

Generic AI models struggle with document classification in specialized domains because they lack context about an organization's specific document taxonomy. A general model might distinguish between an invoice and a contract, but it cannot reliably differentiate between a master services agreement and a statement of work, or between a regulatory filing and an internal compliance memo. These fine-grained distinctions require domain knowledge that can only come from training on the organization's actual document corpus — exactly the kind of task that fine-tuning is designed to solve.

The Solution

Ertas enables organizations to fine-tune classification models on their own document taxonomy using real examples from their archives. With Ertas Studio, teams upload labeled document samples in JSONL format — where each entry maps document text to its correct category — and train a lightweight LoRA adapter that teaches the model to recognize the specific patterns, vocabulary, and structural cues that distinguish each document type in their taxonomy.

The fine-tuned model can be deployed as a classification endpoint through Ollama, vLLM, or Ertas Cloud, processing incoming documents in real time with sub-second latency. Because the model runs on your infrastructure, sensitive document content never leaves your network. Ertas Vault ensures that all training data and model artifacts are encrypted and access-controlled, meeting the data governance requirements of regulated industries. As the document taxonomy evolves — new categories are added, existing ones are split or merged — teams can retrain the model in Ertas Studio with updated examples and redeploy without any application changes.

Key Features

Studio

Custom Taxonomy Training

Train classification models on your organization's exact document taxonomy using labeled examples. Support for hierarchical categories, multi-label classification, and confidence scoring per category.

Hub

Pre-Trained Document Models

Start from base models on Hub that already understand document structure — headers, footers, tables, signatures — so your fine-tuning focuses on classification accuracy rather than basic document comprehension.

Cloud

Real-Time Classification API

Deploy your classifier as a low-latency REST endpoint through Cloud. Process documents on arrival with sub-second classification and route them automatically to downstream workflows.

Vault

Secure Document Processing

Vault ensures all training documents and inference data are encrypted at rest and in transit. Configurable retention policies automatically purge processed documents after classification.

Example Workflow

A large insurance company receives 10,000+ documents daily across email, fax, and web portal channels. The documents include new claims, policy amendments, medical records, adjuster reports, and legal correspondence — each requiring routing to a different department. The team exports 50,000 labeled document examples from their archive and uploads them to Ertas Vault. In Ertas Studio, they fine-tune a 7B model with a LoRA adapter targeting their 28-category taxonomy. After training, the model achieves 96% classification accuracy on a held-out test set — compared to 71% from a generic model. The classifier is deployed as an API endpoint behind their document intake system, automatically routing each incoming document to the correct department queue with a confidence score. Documents below the confidence threshold are flagged for human review, creating a feedback loop that generates additional training data for future model improvements.