ONNX Format Guide

    Open Neural Network Exchange format for cross-platform inference

    Specification

    ONNX (Open Neural Network Exchange) is an open standard format for representing machine learning models. Originally developed by Microsoft and Facebook in 2017 and now governed by the Linux Foundation, ONNX defines a common set of operators, data types, and a computation graph format that enables models trained in one framework to be deployed in another. The format uses Protocol Buffers (protobuf) for serialization and supports a wide range of model architectures including convolutional neural networks, recurrent networks, transformers, and classical ML models.

    An ONNX model is represented as a directed acyclic graph (DAG) where nodes represent operations (convolution, matrix multiplication, activation functions, etc.), edges represent tensors flowing between operations, and the graph has defined inputs and outputs. The ONNX operator set (opset) is versioned, allowing models to specify which version of the operator definitions they require. As of opset 21, ONNX defines over 200 operators covering neural network operations, mathematical functions, tensor manipulation, and control flow.
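
    Because opsets are versioned, a serialized model declares which operator-set versions it requires, and the official onnx Python package can inspect or migrate that declaration. A minimal sketch ("model.onnx" is a placeholder path):

    python
    import onnx
    from onnx import version_converter

    model = onnx.load("model.onnx")  # placeholder path
    print([(o.domain or "ai.onnx", o.version) for o in model.opset_import])

    # Walk the DAG: each node is one operation with named tensor edges
    for node in model.graph.node:
        print(f"{node.op_type}: {list(node.input)} -> {list(node.output)}")

    # Migrate the default-domain opset (e.g., to opset 21) and re-validate
    converted = version_converter.convert_version(model, 21)
    onnx.checker.check_model(converted)
    onnx.save(converted, "model_opset21.onnx")
    Inspecting a model's declared opsets and nodes, then upgrading it with the version converter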

    The format includes an extensible metadata system for model documentation, training information, and custom properties. ONNX also supports quantized data types (INT8, UINT8) and mixed-precision representations, enabling model optimization for deployment on resource-constrained devices. The ONNX Model Zoo provides a collection of pre-trained models in ONNX format covering image classification, object detection, NLP, speech recognition, and other common tasks.
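
    Both facilities are scriptable: metadata_props can be written through the onnx Python bindings, and ONNX Runtime ships a quantization toolkit. A minimal sketch of each (paths are placeholder values):

    python
    import onnx
    from onnxruntime.quantization import quantize_dynamic, QuantType

    # Attach a custom metadata property to a model
    model = onnx.load("model.onnx")  # placeholder path
    prop = model.metadata_props.add()
    prop.key, prop.value = "dataset", "sst-2"
    onnx.save(model, "model_meta.onnx")

    # Dynamic quantization: weights stored as INT8, activations quantized at runtime
    quantize_dynamic(
        model_input="model_meta.onnx",
        model_output="model_int8.onnx",
        weight_type=QuantType.QInt8,
    )
    Writing metadata_props and producing an INT8 model with ONNX Runtime's dynamic quantization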

    When to Use ONNX

    ONNX is the right choice when you need cross-platform model deployment — training in PyTorch or TensorFlow and deploying on different hardware or runtime environments. ONNX Runtime (ORT) provides optimized inference across CPUs (x86, ARM), GPUs (NVIDIA, AMD, Intel), NPUs, and specialized accelerators. If your deployment target includes Windows applications (via DirectML), mobile devices (via ONNX Runtime Mobile), web browsers (via ONNX Runtime Web/WebAssembly), or edge devices, ONNX provides a single model format that works across all of them.
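
    Targeting different hardware is a matter of selecting execution providers at session creation. A sketch (CUDAExecutionProvider assumes the onnxruntime-gpu build; "model.onnx" is a placeholder path):

    python
    import onnxruntime as ort

    print(ort.get_available_providers())  # what this build/machine can run

    # Providers are tried in order; ops the first EP can't run fall back to the next
    session = ort.InferenceSession(
        "model.onnx",
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    Choosing execution providers to target specific hardware from the same ONNX file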

    Choose ONNX when inference performance is critical and you want to benefit from ONNX Runtime's extensive graph optimizations including operator fusion, constant folding, layout optimization, and quantization. ONNX Runtime consistently ranks among the fastest inference engines across various model architectures and hardware platforms. It is particularly strong for transformer encoder models (BERT, RoBERTa, DeBERTa) used in classification, NER, and embedding tasks.
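
    These optimizations are applied when the session loads the model; the level is controlled through SessionOptions, and the optimized graph can be serialized for inspection. A sketch with placeholder paths:

    python
    import onnxruntime as ort

    opts = ort.SessionOptions()
    # ORT_ENABLE_ALL = basic (constant folding, redundancy elimination)
    # + extended (operator fusion) + layout optimizations
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    opts.optimized_model_filepath = "model_optimized.onnx"  # dump the optimized graph

    session = ort.InferenceSession(
        "model.onnx", sess_options=opts, providers=["CPUExecutionProvider"]
    )
    Controlling ONNX Runtime's graph optimization level and saving the optimized model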

    ONNX is less suitable for very large generative language models (over 7B parameters) where specialized inference engines like vLLM, TensorRT-LLM, or llama.cpp provide better performance through techniques like continuous batching, PagedAttention, and speculative decoding. ONNX also has a conversion gap — not all PyTorch operations are supported by the ONNX exporter, particularly dynamic control flow and certain custom operators, which can require workarounds during export.

    Schema / Structure

    protobuf
    // ONNX Model structure (simplified from onnx.proto3)
    message ModelProto {
      int64 ir_version = 1;           // Serialized IR version
      repeated OperatorSetIdProto opset_import = 8;
      string producer_name = 2;       // e.g., "pytorch"
      string producer_version = 3;
      string domain = 4;
      int64 model_version = 5;
      string doc_string = 6;
      GraphProto graph = 7;           // The computation graph
      repeated StringStringEntryProto metadata_props = 14;
    }
    
    message GraphProto {
      repeated NodeProto node = 1;      // Operations in the graph
      string name = 2;
      repeated TensorProto initializer = 5; // Pretrained weights
      repeated ValueInfoProto input = 11;
      repeated ValueInfoProto output = 12;
    }
    
    message NodeProto {
      repeated string input = 1;
      repeated string output = 2;
      string op_type = 4;              // e.g., "Conv", "MatMul", "Relu"
      repeated AttributeProto attribute = 5;
    }
    Simplified ONNX protobuf schema showing model, graph, and node structure
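
    These fields map directly onto the onnx Python bindings, which is a quick way to see where a model's weights and I/O signatures live. A sketch ("model.onnx" is again a placeholder):

    python
    import onnx

    model = onnx.load("model.onnx")  # placeholder path
    print(model.producer_name, model.producer_version)  # ModelProto fields

    graph = model.graph
    # Trained weights live in graph.initializer, separate from the node list
    print(f"{len(graph.node)} nodes, {len(graph.initializer)} weight tensors")
    for init in list(graph.initializer)[:3]:
        print(f"  {init.name}: dims={list(init.dims)}")
    print("inputs:", [i.name for i in graph.input])
    print("outputs:", [o.name for o in graph.output])
    Reading ModelProto and GraphProto fields through the onnx Python bindings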

    Example Data

    python
    import torch
    import onnx
    import onnxruntime as ort
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    
    # Export a HuggingFace model to ONNX
    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    
    dummy_input = tokenizer("This is a test", return_tensors="pt")
    torch.onnx.export(
        model,
        (dummy_input["input_ids"], dummy_input["attention_mask"]),
        "sentiment_model.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["logits"],
        dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                      "attention_mask": {0: "batch", 1: "seq"},
                      "logits": {0: "batch"}},
        opset_version=17,
    )
    
    # Run inference with ONNX Runtime
    session = ort.InferenceSession("sentiment_model.onnx", providers=["CPUExecutionProvider"])
    inputs = tokenizer("Great product, highly recommend!", return_tensors="np")
    outputs = session.run(None, dict(inputs))
    print(f"Logits: {outputs[0]}")  # [[negative_score, positive_score]]
    Exporting a HuggingFace sentiment model to ONNX and running inference with ONNX Runtime
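
    A sanity check worth doing after any export is confirming that the ONNX Runtime output matches the original PyTorch model on identical inputs; a short sketch continuing from the code above:

    python
    import numpy as np

    # Original PyTorch logits on the same dummy input used for export
    with torch.no_grad():
        pt_logits = model(**dummy_input).logits.numpy()

    # ONNX Runtime logits on identical inputs
    ort_logits = session.run(None, {k: v.numpy() for k, v in dummy_input.items()})[0]

    # The two should agree to within floating-point tolerance
    np.testing.assert_allclose(pt_logits, ort_logits, rtol=1e-3, atol=1e-4)
    Verifying numerical parity between the exported ONNX model and the source PyTorch model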

    Ertas Support

    Ertas Studio supports ONNX as an export target for trained models, enabling deployment across the diverse runtime environments that ONNX Runtime supports. Models trained through the Ertas cloud training pipeline can be exported to ONNX format with optimizations applied for your target deployment platform, whether that is CPU inference on edge devices, GPU inference in data centers, or browser-based inference via WebAssembly.
