Fine-Tune GPT-OSS with Ertas

    OpenAI's first open-weight model release since GPT-2 — a mixture-of-experts family with the 117B/5.1B-active GPT-OSS-120B flagship and a smaller 21B/3.6B-active GPT-OSS-20B variant, released August 2025 under Apache 2.0.

    21B-A3.6B (20B) · 117B-A5.1B (120B) · OpenAI

    Overview

    GPT-OSS, released by OpenAI in August 2025, is the company's first open-weight model release since GPT-2 in 2019, reversing a six-year closed-weight posture and significantly reshaping the open-weight ecosystem. The release includes two variants: GPT-OSS-120B (117B total / 5.1B active mixture-of-experts) and GPT-OSS-20B (21B total / 3.6B active). Both are released under Apache 2.0.

    At release, OpenAI positioned the 120B variant as near parity with its o4-mini model on core reasoning benchmarks, with the 20B variant performing similarly to o3-mini while targeting local deployment and edge use cases. Independent evaluation has broadly confirmed strong performance: GPT-OSS-120B exceeds o3-mini on several reasoning benchmarks while activating only 5.1B parameters per token, supporting OpenAI's claims about the architecture's efficiency.

    From an architectural standpoint, GPT-OSS uses a relatively conventional MoE design with top-k expert routing (128 experts with 4 active per token in the 120B, 32 experts with 4 active in the 20B) and grouped-query attention. The headline innovation is the post-training pipeline, which OpenAI has described publicly as combining its internal RLHF infrastructure with new techniques developed for this release. The result is a pair of models that punch substantially above their active parameter weight class.
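
    To make the efficiency argument concrete, the sketch below shows generic top-k expert routing in PyTorch. It is illustrative only: the hidden size is made up (E and k happen to match the 20B's 32 experts / 4 active), and GPT-OSS's actual router implementation is not reproduced here. The point is that each token runs through only k experts, so per-token compute tracks the active parameter count, not the total.

```python
import torch

# Generic top-k MoE routing sketch -- illustrative, not GPT-OSS's actual code.
# d is a made-up hidden size; E experts with k active matches the 20B variant.
d, E, k = 64, 32, 4
x = torch.randn(1, d)                    # one token's hidden state
router = torch.nn.Linear(d, E)           # one routing logit per expert
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(E))

probs = router(x).softmax(dim=-1)
weights, idx = probs.topk(k)             # keep only the k best-scoring experts
weights = weights / weights.sum()        # renormalize over the selected experts

# Only the k selected experts execute, so per-token compute scales with
# the *active* parameter count (k/E of the expert weights), not the total.
y = sum(w * experts[int(i)](x) for w, i in zip(weights[0], idx[0]))
```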

    Weights are available on Hugging Face under `openai/gpt-oss-120b` and `openai/gpt-oss-20b`. Apache 2.0 licensing combined with OpenAI's brand recognition has made GPT-OSS one of the most-deployed open-weight model families in the months since release, particularly in enterprise environments.
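
    A quick way to sanity-check the weights after download is the standard Hugging Face transformers flow. This is a generic sketch (the prompt and generation settings are placeholders), not Ertas Studio's loading path:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # or "openai/gpt-oss-120b" with enough VRAM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Briefly explain mixture-of-experts."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```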

    Key Features

    In practice, OpenAI's brand is itself one of GPT-OSS's most significant features. For teams making vendor selection decisions, the ability to deploy an OpenAI-trained model on their own infrastructure removes a major friction point in adopting open-weight AI, particularly in enterprise environments where the question 'is this safe to deploy?' is often answered by brand reputation rather than technical evaluation. GPT-OSS makes that decision easier than alternatives from less-familiar labs.

    The 5.1B active parameter count on GPT-OSS-120B is exceptionally efficient: per-token compute, and therefore generation throughput, is comparable to a 5B dense model. The full 117B weights still need to fit in memory (see Hardware Requirements below), but once loaded the model serves at small-model speeds. Combined with quality that exceeds o3-mini on many evaluations, this gives GPT-OSS-120B an outstanding cost-quality ratio for production serving.

    GPT-OSS-20B targets the local-deployment sweet spot. With 3.6B active parameters and a total memory footprint of approximately 12GB at Q4_K_M, the 20B variant runs on consumer hardware ranging from gaming laptops to base-tier desktops. This is OpenAI's first real entry into the local LLM ecosystem, and the model's strong tool-use fidelity and instruction-following make it competitive with the best small open-weight models for on-device deployment.

    Apache 2.0 licensing is permissive, including for commercial use, derivative training, and fine-tuning. Unlike some OpenAI releases whose API terms carry usage-policy restrictions, GPT-OSS imposes no such restrictions on the open weights themselves. Users are free to fine-tune, deploy, and integrate without licensing review beyond standard Apache compliance.

    Fine-Tuning with Ertas

    Both GPT-OSS variants are well-suited to fine-tuning in Ertas Studio. The 20B variant with QLoRA fits comfortably on consumer GPUs with 16-24GB VRAM at typical sequence lengths, making it an excellent choice for rapid iteration and small-scale specialization. The 120B variant with QLoRA needs approximately 50-70GB of VRAM, fitting on a single 80GB GPU or split across two 48GB GPUs.
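
    For readers who want to see what a QLoRA setup of this shape looks like outside Ertas Studio, here is a generic sketch using bitsandbytes 4-bit loading plus a PEFT LoRA adapter. The target module names are the usual attention projections and are an assumption; GPT-OSS's module naming may differ, and Ertas Studio configures this automatically:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # NF4 base weights: the "Q" in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b", quantization_config=bnb, device_map="auto"
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    # Assumption: standard attention projection names; verify for GPT-OSS.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters are a tiny fraction of 21B total
```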

    The MoE architecture in GPT-OSS-120B is handled by Ertas Studio's standard MoE fine-tuning pipeline — expert routing stability, load balancing, and adapter merging are configured automatically. The 5.1B active parameter count means training throughput per step is comparable to a 5B dense model, which is fast enough for production fine-tuning workflows on a single 80GB GPU.

    For fine-tuning datasets, GPT-OSS supports the full range of training formats: instruction-following pairs, multi-turn conversations, tool-use traces, and reasoning-mode data. The model inherits OpenAI's strong tool-use training, which carries over to fine-tunes — a fine-tuned GPT-OSS variant retains high-fidelity function-calling behavior even when specialized for narrow domains, which is not always the case with other open-weight bases.
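
    The exact dataset schema Ertas Studio expects is not reproduced here; the snippet below simply illustrates the common chat-messages JSONL convention for a multi-turn record, which maps naturally onto all four format families above. The roles and contents are hypothetical:

```python
import json

# Hypothetical multi-turn training record in the common chat-messages convention.
record = {
    "messages": [
        {"role": "system", "content": "You are a support assistant for Acme billing."},
        {"role": "user", "content": "Why was I charged twice this month?"},
        {"role": "assistant", "content": "Let me check. Two charges usually mean a plan change mid-cycle."},
    ]
}

# Append one JSON object per line, the usual JSONL layout for training data.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```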

    After training, Ertas Studio exports to GGUF format with full GPT-OSS chat template preservation. The 20B Q4_K_M quantization is approximately 12GB, deployable on consumer hardware via Ollama, llama.cpp, or LM Studio. The 120B Q4_K_M is approximately 65GB, requiring an 80GB GPU or large-memory CPU host for deployment.
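
    Once the GGUF is exported, a quick local smoke test might look like the following, using the llama-cpp-python bindings. The file name is a placeholder for whatever Ertas Studio produced:

```python
from llama_cpp import Llama

# Placeholder path: point this at the exported GGUF.
llm = Llama(
    model_path="gpt-oss-20b-finetune-Q4_K_M.gguf",
    n_ctx=8192,        # context window for the test session
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me a one-line status summary."}],
    max_tokens=128,
)
print(resp["choices"][0]["message"]["content"])
```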

    Use Cases

    GPT-OSS-120B is well-suited for enterprise applications where the OpenAI brand carries weight in deployment review. Internal knowledge retrieval, document analysis, customer support automation, and code assistance are all natural fits. The model's combination of strong reasoning capability, high tool-use fidelity, and 5B-class inference economics makes it attractive for high-throughput production serving where alternative open-weight choices would require larger active parameter counts.

    GPT-OSS-20B targets local deployment patterns. On-device chat assistants, browser-based AI tools, edge processing, and developer tools that ship with embedded LLM capability all benefit from the 20B variant's combination of strong quality and modest hardware requirements. The model is also a natural choice for fine-tuning into specialized small models — its strong base capabilities make domain adaptation more sample-efficient than starting from a comparable dense base.

    For teams building products that previously used the OpenAI API and are now moving to self-hosted deployment for cost or data sovereignty reasons, GPT-OSS provides a relatively low-friction migration path. The model's prompt formatting and behavioral patterns are familiar to teams with OpenAI API experience, reducing the engineering work needed to port existing prompts and integrations.
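
    Concretely, much of that migration can reduce to a base-URL change when the self-hosted stack exposes an OpenAI-compatible endpoint (vLLM's `vllm serve` does, for example). A hedged sketch, with the URL and model name as placeholders:

```python
from openai import OpenAI

# Same client code as with the hosted API; only the endpoint changes.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder: your self-hosted server
    api_key="unused",                     # local servers typically ignore this
)
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize our Q3 incident report."}],
)
print(resp.choices[0].message.content)
```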

    Hardware Requirements

    GPT-OSS-20B at Q4_K_M quantization requires approximately 12GB of VRAM, fitting on consumer GPUs from the RTX 3060 12GB upward. At Q8_0, expect approximately 22GB. The 3.6B active parameter count gives the model fast inference even on modest hardware, making it well-suited for interactive local applications.

    GPT-OSS-120B at Q4_K_M requires approximately 65GB of VRAM, fitting on a single 80GB GPU (A100 80GB, H100 80GB) or split across two 48GB GPUs with tensor parallelism. At Q8_0, expect approximately 120GB. The 5.1B active parameter count determines token generation throughput, so once loaded the model serves at approximately 5B-class speeds, exceptionally fast for a model of this quality.
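
    As one example of the two-GPU path, vLLM's tensor parallelism shards the weights across devices. A sketch, assuming two 48GB cards and a vLLM build with GPT-OSS support:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 splits the 117B weights across two GPUs.
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2)
params = SamplingParams(max_tokens=128, temperature=0.7)

outputs = llm.generate(["Draft a polite follow-up email to a vendor."], params)
print(outputs[0].outputs[0].text)
```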

    For fine-tuning in Ertas Studio: GPT-OSS-20B with QLoRA needs 16-24GB VRAM at typical sequence lengths, fitting comfortably on a single 24GB GPU. GPT-OSS-120B with QLoRA needs 50-70GB VRAM, fitting on a single 80GB GPU or split across two 48GB GPUs. The favorable fine-tuning hardware requirements relative to the model's effective quality are one of the strongest reasons to choose GPT-OSS for production fine-tuning workflows.

    Supported Quantizations

    Q4_0 · Q4_K_M · Q5_K_M · Q6_K · Q8_0 · F16
