Fine-Tune Hermes 4 with Ertas
Nous Research's August 2025 model family — Llama-3.1-based fine-tunes in 14B, 70B, and 405B sizes featuring hybrid reasoning via explicit thinking tokens and neutrally-aligned post-training, trained on ~60B tokens with the Atropos reinforcement learning system and ~1,000 task-specific verifiers.
Overview
Hermes 4, released by Nous Research on August 30, 2025, is the fourth generation of the Hermes model family and the version that established Nous as a leading source of capable open-weight fine-tunes. The family ships in three sizes — 14B, 70B, and 405B parameters — all derived from Meta's Llama 3.1 base models. Despite being fine-tunes rather than models pretrained from scratch, the Hermes 4 models deliver capabilities that rival or exceed many native flagship releases on reasoning benchmarks.
The key innovation in Hermes 4 is its hybrid reasoning architecture using explicit `<think>` tokens. Unlike pure reasoning models that always generate chain-of-thought, or pure instruction models that respond directly, Hermes 4 supports both modes within a single checkpoint. The model can produce structured thinking traces wrapped in `<think>...</think>` tags when reasoning is beneficial, or skip directly to the answer for queries that don't require deliberation. This is similar in spirit to the unified thinking modes in Qwen 3+ and DeepSeek V3.2+, but achieved through targeted post-training rather than from-scratch architectural design.
Hermes 4 is positioned as 'neutrally aligned': Nous Research has explicitly avoided heavy-handed RLHF refusal training, producing a model that follows instructions without the over-refusal patterns common in other contemporary releases. This makes Hermes 4 particularly valuable for legitimate use cases that mainstream models' refusal behavior obstructs, including security research, creative writing requiring mature content, and red-team evaluation work.
The training methodology is also notable. Nous used their Atropos reinforcement learning framework with approximately 1,000 task-specific verifiers — automated graders that score model outputs on factual accuracy, code correctness, mathematical validity, and other domain-specific signals. This produces a fine-tune with substantially improved reasoning quality without the alignment artifacts of traditional RLHF.
Key Features
Hybrid reasoning via `<think>` tokens is Hermes 4's most distinctive capability. The model knows when to reason — typically engaging thinking mode for math, code, complex factual questions, and multi-step planning, while responding directly for conversational queries, simple instructions, and recall tasks. Developers can control this behavior via prompting (e.g., asking the model to think first) or via fine-tuning to bias toward direct or reasoning responses for specific domains.
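Steering the mode via prompting can be as simple as a system instruction. Earlier Hermes models used ChatML-style chat templates; assuming that carries over to Hermes 4, a sketch looks like the following (the system prompt wording is illustrative, not an official control string, and in practice you would use the tokenizer's `apply_chat_template` rather than hand-rolling the format):

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Assemble a ChatML-style prompt string by hand, for illustration.
    Real code should rely on the model's own chat template."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

# Bias toward reasoning mode: ask for an explicit thinking trace.
reasoning_prompt = build_chatml_prompt(
    system="You are a helpful assistant. Think step by step inside "
           "<think> tags before giving your final answer.",
    user="What is 17 * 24?",
)

# Bias toward direct mode: ask for a concise answer with no deliberation.
direct_prompt = build_chatml_prompt(
    system="You are a helpful assistant. Answer concisely and directly, "
           "without a thinking trace.",
    user="What is the capital of France?",
)
```

The same pattern works at fine-tuning time: baking one of these instructions into training examples biases the resulting model toward that mode for the corresponding domain.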
Neutrally-aligned post-training means Hermes 4 follows instructions without the layered refusal patterns common in mainstream releases. This is significant for legitimate use cases that require the model to engage with content other models reject — including red-team safety evaluation, security research and CTF challenges, fiction with mature themes, historical content analysis, and educational discussion of sensitive topics. Nous has been explicit that the model is designed for capability and steerability rather than reflexive refusal.
The Atropos RL framework with 1,000+ verifiers produces measurable improvements over base Llama 3.1 on reasoning benchmarks. On AIME, GPQA, and complex code generation tasks, Hermes 4 70B substantially outperforms Llama 3.1 70B Instruct, and Hermes 4 405B closes much of the gap with frontier proprietary models on reasoning-heavy evaluations.
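Nous has not published the internals of individual Atropos verifiers, but the core idea — a programmatic grader that turns a model output into a reward signal — is easy to illustrate. A toy sketch of a math-answer verifier (the function name, extraction regex, and binary scoring scheme are all our own simplifications):

```python
import re

def verify_math_answer(completion: str, expected: float, tol: float = 1e-6) -> float:
    """Toy verifier: reward 1.0 if the last number in the completion matches
    the expected answer within tolerance, else 0.0. Production verifiers in
    an RL pipeline would be far stricter about extraction and formatting."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - expected) <= tol else 0.0

# A reward function like this can score thousands of sampled completions
# with no human in the loop, which is what lets verifier-driven RL scale
# across ~1,000 task domains.
```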
Because Hermes 4 is built on Llama 3.1, it inherits Llama's tooling ecosystem — including efficient inference in llama.cpp, vLLM, and TensorRT-LLM, broad quantization support, mature fine-tuning recipes, and compatibility with the wide ecosystem of Llama-based deployment infrastructure.
Fine-Tuning with Ertas
Hermes 4's Llama 3.1 base architecture means it inherits Llama 3.1's well-established fine-tuning workflow. In Ertas Studio, the 14B variant fine-tunes with QLoRA on 12-16GB VRAM, the 70B variant on 40-48GB VRAM, and the 405B variant on multi-GPU server configurations (8x A100 80GB or larger).
For fine-tuning Hermes 4, the most valuable pattern is preserving the hybrid reasoning behavior in your training data. Datasets that include explicit `<think>...</think>` traces for complex examples and direct responses for simple ones teach the fine-tuned model to retain the adaptive reasoning capability rather than collapsing into one mode or the other. Ertas Studio supports these annotated datasets natively and can also generate synthetic thinking traces from your existing instruction data using a separate reasoning model.
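The mixed-mode dataset pattern above can be sketched as plain chat records: complex examples embed a `<think>` trace in the assistant turn, simple ones answer directly. The field names below follow the common messages/role/content convention; Ertas Studio's exact schema may differ.

```python
import json

# Hard example: the assistant turn keeps an explicit reasoning trace.
reasoning_example = {
    "messages": [
        {"role": "user", "content": "Is 221 prime?"},
        {"role": "assistant",
         "content": "<think>221 = 13 * 17, so it has nontrivial factors."
                    "</think>No, 221 is not prime: 221 = 13 * 17."},
    ]
}

# Easy example: a direct answer with no trace, so the fine-tuned model
# learns not to spend reasoning tokens on simple recall.
direct_example = {
    "messages": [
        {"role": "user", "content": "What company created Llama 3.1?"},
        {"role": "assistant", "content": "Llama 3.1 was created by Meta."},
    ]
}

# One JSON object per line is the usual on-disk (JSONL) format.
jsonl = "\n".join(json.dumps(r) for r in [reasoning_example, direct_example])
```

Keeping both kinds of records in one dataset is what preserves the adaptive behavior; a dataset of only traced examples tends to collapse the model into always-think mode.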
After training, Ertas Studio exports to GGUF format with full Hermes 4 prompt-template preservation, including the `<think>` token markers. Quantized models deploy directly via Ollama, llama.cpp, or LM Studio. The 70B variant at Q4_K_M produces a ~40GB file deployable on a 48GB GPU, providing high-quality reasoning capabilities in a self-hosted package without the multi-GPU footprint of larger models.
Use Cases
Hermes 4 is the preferred choice when you need a model that follows instructions without heavy refusal patterns. This includes security research and CTF training environments, red-team evaluation tooling, creative writing platforms with mature content support, historical and educational content involving sensitive topics, and applications where over-refusal degrades user experience. The hybrid reasoning makes it well-suited for these use cases since they often involve multi-step thinking but rarely benefit from forced reasoning-mode latency.
For general reasoning workloads, Hermes 4 70B is one of the strongest open-weight options at the 70B parameter scale. It's well-suited for code review, debugging assistance, mathematical problem solving, and structured analysis tasks. The hybrid `<think>` mode allows fast direct responses for simple queries and full reasoning depth on harder ones — useful in interactive applications where uniform reasoning-mode latency would be disruptive.
The 405B variant targets high-capability research and synthesis applications. Its strong combination of reasoning depth, instruction following, and steerability makes it useful for tasks like advanced code generation, scientific writing, complex content review, and as a teacher model for fine-tuning smaller students. Hermes 4 405B is also frequently deployed as a base for further specialization — its already-strong reasoning capability makes domain fine-tuning more sample-efficient.
Hardware Requirements
The Hermes 4 14B model at Q4_K_M quantization requires approximately 8.5GB of VRAM, runnable on consumer GPUs from the RTX 3060 12GB upward. At Q8_0, expect approximately 15GB. The 70B model at Q4_K_M needs approximately 40GB, fitting on a single 48GB GPU (RTX 6000 Ada, A6000) or splittable across two 24GB GPUs.
The 405B model at Q4_K_M requires approximately 230GB, demanding multi-GPU server setups (4x A100 80GB, 8x A6000 48GB) or large-memory CPU inference systems with 512GB+ RAM. For most teams interested in Hermes 4 capability without the 405B hardware footprint, the 70B variant offers the best quality-to-resource ratio.
For fine-tuning in Ertas Studio: 14B QLoRA needs 12-16GB VRAM, 70B QLoRA needs 40-48GB VRAM, and 405B QLoRA needs multi-GPU server configurations. Note that reasoning-mode training generates substantially more tokens per example than standard instruction tuning, so allow additional VRAM headroom for sequence lengths and gradient accumulation when fine-tuning on reasoning-heavy datasets.