OpenVINO + Ertas
Deploy fine-tuned models from Ertas on Intel CPUs, GPUs, and NPUs with the OpenVINO inference optimization toolkit, enabling efficient local inference without requiring NVIDIA hardware.
Overview
OpenVINO is Intel's open-source toolkit for optimizing and deploying deep learning models across Intel hardware — from Xeon server CPUs and Arc GPUs to the neural processing units (NPUs) embedded in Core Ultra laptops. For large language models, OpenVINO applies weight compression, kernel optimization, and hardware-specific compilation to deliver competitive inference performance on Intel silicon. This is significant because Intel hardware is ubiquitous: most enterprise servers, developer workstations, and laptops run Intel processors, yet the LLM deployment conversation has been dominated by NVIDIA GPUs.
OpenVINO's LLM support includes INT4 and INT8 weight compression, speculative decoding, continuous batching, and integration with Hugging Face's Optimum library for streamlined model conversion. The toolkit also provides an OpenAI-compatible serving layer through its Model Server component, making deployed models accessible to standard client applications. For organizations with existing Intel infrastructure — or those looking to avoid GPU procurement bottlenecks and costs — OpenVINO offers a practical path to local LLM inference using the hardware already in their data centers and on their employees' desks.
How Ertas Integrates
Ertas Studio handles the model customization step, fine-tuning a base model on your domain-specific data to create a specialist model for your use case. OpenVINO then handles the deployment optimization step, converting that fine-tuned model into an Intel-optimized format that runs efficiently on your existing hardware. This pairing is especially valuable for enterprises that have Intel server fleets and want to deploy custom AI models without procuring scarce GPU capacity.
The workflow connects naturally: fine-tune in Ertas Studio, export the model in Hugging Face format, and use OpenVINO's conversion tools (or Hugging Face Optimum Intel) to compile it for your target Intel hardware. Apply INT4 weight compression to fit larger models in available memory, and deploy through OpenVINO Model Server with an OpenAI-compatible endpoint. Your applications connect to this endpoint just as they would to any cloud AI API — but the model is fine-tuned on your data, running on your Intel hardware, with predictable costs and complete data control.
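For example, a minimal client-side sketch is shown below. It assumes the openai Python package, an OpenVINO Model Server exposing its OpenAI-compatible API locally on port 8000, and a hypothetical served model name of ertas-finetune; adjust the base URL, path prefix, and model name to match your deployment.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local OpenVINO Model Server.
# The /v3 prefix, port, and model name below are deployment-specific assumptions.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="not-used")

response = client.chat.completions.create(
    model="ertas-finetune",  # hypothetical name of your served fine-tuned model
    messages=[{"role": "user", "content": "Draft a reply to a customer asking about our return policy."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```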
Getting Started
Step 1: Fine-tune a model in Ertas Studio
Prepare your domain-specific dataset and train a fine-tuned model in Ertas Studio. Select a base model with a parameter count appropriate for your Intel hardware — 7B to 13B models work well on modern Xeon servers with sufficient RAM.
Step 2: Export and convert to OpenVINO format
Export the fine-tuned model from Ertas in Hugging Face safetensors format. Use Optimum Intel or OpenVINO's model converter to compile it into OpenVINO's intermediate representation (IR) with INT4 or INT8 weight compression.
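A conversion sketch using Optimum Intel follows; it assumes the optimum-intel package with OpenVINO support installed, and the local paths and compression settings are illustrative placeholders rather than tuned recommendations.

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

hf_path = "./ertas-export"   # exported fine-tune in Hugging Face format (hypothetical path)
ov_path = "./ertas-ov-int4"  # output directory for the OpenVINO IR

# INT4 weight compression; bits/group_size/ratio are example values, not recommendations.
quant_config = OVWeightQuantizationConfig(bits=4, group_size=128, ratio=0.8)

# export=True converts the checkpoint to OpenVINO IR while loading it.
model = OVModelForCausalLM.from_pretrained(
    hf_path, export=True, quantization_config=quant_config
)
model.save_pretrained(ov_path)

# Keep the tokenizer alongside the compressed model for inference and serving.
tokenizer = AutoTokenizer.from_pretrained(hf_path)
tokenizer.save_pretrained(ov_path)
```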
Step 3: Benchmark on your target hardware
Run OpenVINO's benchmark tool to measure inference throughput and latency on your specific Intel hardware. Test with representative prompts from your use case to verify both performance and output quality after compression.
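Alongside OpenVINO's own benchmarking tooling, a simple timing loop can give a first read on generation speed and output quality. The sketch below assumes the directory produced in the previous step; the prompts are placeholders for your own representative test set.

```python
import time
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

ov_path = "./ertas-ov-int4"  # hypothetical path from the conversion step
model = OVModelForCausalLM.from_pretrained(ov_path, device="CPU")  # or "GPU"
tokenizer = AutoTokenizer.from_pretrained(ov_path)

prompts = [
    "Replace with a representative prompt from your use case.",
    "Add several more to cover typical input lengths.",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=128)
    elapsed = time.perf_counter() - start

    # Report tokens/second and print the completion so quality can be eyeballed.
    prompt_len = inputs["input_ids"].shape[-1]
    new_tokens = output.shape[-1] - prompt_len
    print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")
    print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))
```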
Step 4: Deploy with OpenVINO Model Server
Load the optimized model into OpenVINO Model Server, which provides REST and gRPC endpoints compatible with the OpenAI API format. Configure context length, batching parameters, and resource allocation for your serving environment.
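The serving configuration itself is defined in Model Server's own config files and launch options. Once the server is running, a small smoke test like the following can confirm the endpoints respond; the port, the KServe-style health path, the /v3 chat path, and the model name are assumptions based on a typical Model Server setup, so adjust them to your deployment.

```python
import requests

base_url = "http://localhost:8000"  # assumed REST port for the Model Server

# KServe-style health endpoint exposed by OpenVINO Model Server over REST.
ready = requests.get(f"{base_url}/v2/health/ready", timeout=5)
print("server ready:", ready.status_code == 200)

# Minimal request against the OpenAI-compatible chat endpoint (path assumed to be /v3).
resp = requests.post(
    f"{base_url}/v3/chat/completions",
    json={
        "model": "ertas-finetune",  # hypothetical served model name
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```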
Step 5: Integrate and iterate
Connect your applications to the OpenVINO Model Server endpoint. Monitor output quality and performance in production. Fine-tune improved versions in Ertas when you need to expand the model's domain knowledge or correct recurring issues.
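One way to track quality and latency in production is a thin wrapper around each request that logs prompts, responses, and timings for later review. The sketch below reuses the client setup from the earlier example and writes JSON lines to a local file; the endpoint, model name, file path, and fields are illustrative.

```python
import json
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="not-used")  # assumed endpoint

def monitored_chat(prompt: str, log_path: str = "inference_log.jsonl") -> str:
    """Send a chat request and append prompt, response, and latency to a JSONL log."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="ertas-finetune",  # hypothetical served model name
        messages=[{"role": "user", "content": prompt}],
    )
    latency = time.perf_counter() - start
    answer = response.choices[0].message.content

    with open(log_path, "a") as f:
        record = {"prompt": prompt, "response": answer, "latency_s": round(latency, 3)}
        f.write(json.dumps(record) + "\n")
    return answer

print(monitored_chat("Draft a short answer to a common customer question."))
```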
Benefits
- Deploy fine-tuned models on Intel CPUs, GPUs, and NPUs without NVIDIA hardware
- Leverage existing Intel server infrastructure already present in most enterprise data centers
- INT4 weight compression enables running larger models within available system memory
- OpenAI-compatible serving endpoint for seamless integration with standard client libraries
- Predictable, fixed infrastructure costs with no per-token API charges
- Support for Intel Core Ultra NPUs brings efficient inference to developer laptops