Fine-Tune Devstral 2 with Ertas

    Mistral AI's coding-specialized open-weight family: Devstral 2 (123B) and Devstral Small 2 (24B). The 123B variant scores 72.2% on SWE-Bench Verified, and the 24B variant runs on consumer hardware. Released as a dedicated coding line before being absorbed into Mistral Small 4's unified architecture in March 2026.


    Overview

    Devstral 2 is the second generation of Mistral AI's dedicated agentic coding model, released as part of the broader coding-specialized Devstral line in 2025. The family ships in two sizes: a 123-billion-parameter flagship (Devstral 2) and a 24-billion-parameter consumer-deployable variant (Devstral Small 2). Both are open-weight releases targeting agentic coding workloads: the multi-step task patterns that characterize CLI-based coding agents like Claude Code, Cline, and Aider.

    Devstral 2's headline benchmark result is 72.2% on SWE-Bench Verified, a strong score that places it in the top tier of open-weight coding models at release. Devstral Small 2 achieves 68.0% on the same benchmark, which is exceptional for a 24B-parameter model and competitive with substantially larger alternatives. For teams that want strong coding capability at consumer-deployable scales, Devstral Small 2 hits a particularly productive sweet spot.

    The Devstral line was substantively absorbed into Mistral Small 4's unified architecture in March 2026. Where Mistral previously maintained three distinct model lineages — Magistral for reasoning, Devstral for coding, and Mistral Small for instruction-tuned use — Mistral Small 4 unifies all three into a single 119B-A6B mixture-of-experts checkpoint. For new deployments, Mistral Small 4 is the recommended path — but Devstral 2 remains valid for teams running stable production deployments adopted before the consolidation.

    Devstral 2's positioning as a dedicated coding specialist is meaningful in specific deployment scenarios. While Mistral Small 4 covers coding via its unified architecture, the Devstral 2 line was specifically engineered for agentic coding workloads — different post-training emphasis, different evaluation suite, different deployment patterns. For teams whose primary use case is coding rather than general-purpose AI, Devstral 2 maintains advantages in specific niches even after the consolidation.

    The license for Devstral 2 covers open-weight deployment but is worth reviewing for specific commercial scenarios. Devstral Small 2 in particular ships under terms designed to support consumer-product deployment without restrictive usage caps. Weights are available on Hugging Face under Mistral's organization.

    Key Features

    Devstral 2's 72.2% SWE-Bench Verified score positioned the model competitively against open-weight alternatives at release. The benchmark measures real-world software engineering capability — multi-file changes, test-driven iteration, codebase navigation — and Devstral 2's score reflects genuine production-grade coding capability rather than synthetic benchmark optimization.

    Devstral Small 2 at 24B parameters with 68.0% SWE-Bench Verified is the standout efficiency result. For consumer-deployable scales, achieving this score is exceptional — substantially exceeding general-purpose 24B alternatives and approaching the capability of much larger coding-specialized models. For teams wanting frontier-tier coding capability on consumer or workstation hardware, Devstral Small 2 is among the strongest options in the family.

    The coding-specialist positioning differentiates Devstral 2 from general-purpose alternatives. While Mistral Small 4's unified architecture covers coding through general post-training, Devstral 2 specifically targeted agentic coding workloads with appropriate training data emphasis — multi-step coding traces, tool-use patterns, test-driven iteration examples. For teams whose deployment is exclusively coding-focused, this specialization provides quality advantages over general-purpose alternatives at equivalent parameter counts.

    Mistral's strong tool-use training tradition translates well to Devstral 2's agentic coding capability. The model handles function calls, structured outputs, and multi-step tool sequences with high fidelity — capabilities that matter for agentic coding deployments where reliability of the tool-use loop is often more important than raw code-generation quality.

    Fine-Tuning with Ertas

    Devstral 2 fine-tuning in Ertas Studio is straightforward across both variants. Devstral Small 2 (24B) fine-tunes with QLoRA on consumer GPUs (16-24GB VRAM), making it among the most accessible coding-specialist bases for teams without server-class infrastructure. Devstral 2 (123B) requires workstation or modest-server configurations — 48GB+ GPU for QLoRA at typical sequence lengths.
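    As a concrete reference point, a QLoRA setup for the 24B variant typically looks like the sketch below. The field names are generic QLoRA hyperparameters for illustration, not the actual Ertas Studio configuration schema, and the model identifier is an assumption:

```python
# Illustrative QLoRA hyperparameters for a 24B-class base model.
# Field names are a generic sketch, NOT the Ertas Studio schema;
# the Hugging Face model ID is assumed for illustration.
qlora_config = {
    "base_model": "mistralai/Devstral-Small-2",  # assumed identifier
    "quantization": {
        "load_in_4bit": True,    # 4-bit NF4 base weights keep the 24B model small
        "quant_type": "nf4",
        "double_quant": True,
    },
    "lora": {
        "r": 16,                 # adapter rank; 8-32 is a common range
        "alpha": 32,
        "dropout": 0.05,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    },
    "training": {
        "max_seq_len": 4096,
        "learning_rate": 2e-4,
        "gradient_checkpointing": True,  # trades compute for VRAM headroom
    },
}

# Rough VRAM floor for the frozen base weights alone at 4 bits per parameter:
weights_gb = 24e9 * 4 / 8 / 1e9
print(f"~{weights_gb:.0f}GB for 4-bit base weights (adapters and activations extra)")
```

    Adapter states and activations account for the rest of the 16-24GB budget, which is why gradient checkpointing is usually enabled at this scale.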

    For coding-domain fine-tuning specifically, Devstral 2 benefits from training data that includes complete agentic execution traces — task descriptions, planning, multi-file edits, test outputs, and corrective iterations. Ertas Studio supports these multi-step formats natively, including tool-use traces from Claude Code, Cline, or Aider runs. Training on your team's specific codebase produces a domain-specialized coding model that outperforms the base on tasks within your codebase.
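    A single training example in such a trace format might look like the sketch below. The role/tool-call schema is a generic agentic-trace shape for illustration, not necessarily the exact format Ertas Studio ingests, and the file paths and tool names are invented:

```python
import json

# One agentic coding trace: task -> plan -> tool calls -> test output -> fix.
# The schema is a generic illustration, not a documented Ertas format;
# tool names and file paths are hypothetical.
trace = {
    "messages": [
        {"role": "user", "content": "Fix the failing test in utils/slugify.py"},
        {"role": "assistant", "content": "Plan: reproduce the failure, then patch slugify()."},
        {"role": "assistant", "tool_call": {"name": "run_tests",
                                            "args": {"path": "tests/test_slugify.py"}}},
        {"role": "tool", "name": "run_tests", "content": "FAILED: expected 'a-b', got 'a--b'"},
        {"role": "assistant", "tool_call": {"name": "edit_file", "args": {
            "path": "utils/slugify.py",
            "patch": "collapse repeated '-' after substitution"}}},
        {"role": "tool", "name": "run_tests", "content": "1 passed"},
        {"role": "assistant", "content": "Patched slugify() to collapse duplicate dashes; tests pass."},
    ]
}

# Traces are commonly stored one-per-line (JSONL) for training pipelines.
line = json.dumps(trace)
print(len(trace["messages"]))  # 7 turns in this example
```

    The corrective iteration (failing test output followed by a fix) is the part worth preserving: it teaches the model the feedback loop, not just the final diff.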

    For teams choosing between Devstral 2 fine-tuning and Mistral Small 4 fine-tuning, the recommendation depends on the deployment profile. Mistral Small 4's 6B active parameter count provides better fine-tuning economics for general-purpose specialization including coding. Devstral 2 offers somewhat better baseline coding-specific capability for teams whose fine-tuned variant will be exclusively used for coding workloads — but the gap has narrowed substantially with Mistral Small 4's release.

    After training, Ertas Studio exports to GGUF format with full Devstral 2 chat template preservation. Both variants deploy cleanly via Ollama, llama.cpp, or vLLM with standard configuration.

    Use Cases

    Self-hosted coding agent deployments on consumer or workstation hardware are Devstral Small 2's most natural use case. Teams of 5-20 developers wanting strong coding agent capability without committing to server infrastructure find Devstral Small 2 among the most accessible options at the 24B size class. Production patterns include AI pair-programming for small enterprise codebases, autonomous PR generation for routine change patterns, and CI-integrated code review at moderate request volumes.

    Devstral 2 at 123B targets larger team deployments where the additional capability justifies the workstation/server hardware investment. AI pair-programming for large enterprise codebases, autonomous coding agents handling complex refactors, and high-throughput code review automation benefit from the 123B variant's stronger baseline capability.

    For teams running stable production deployments on Devstral 2 before the Mistral Small 4 consolidation, the model remains documented and supported. The migration to Mistral Small 4 offers operational simplification (one model replacing three separate lineages) but comes with non-trivial migration costs for teams with existing Devstral-specific fine-tunes or downstream tooling. Continuing Devstral 2 deployment is valid for these scenarios.

    For European teams or any deployment subject to data sovereignty requirements, Mistral's EU-headquartered positioning combined with Devstral 2's open-weight distribution provides structural advantages over US-based or Chinese-lab alternatives. Self-hosted deployment on EU infrastructure with EU-developed models meets compliance requirements that some regulatory environments specifically require.

    Hardware Requirements

    Devstral Small 2 at Q4_K_M quantization requires approximately 14GB of memory, fitting on 24GB consumer GPUs such as the RTX 3090 and RTX 4090. At Q8_0, expect approximately 26GB. The 24B size makes it deployable on workstation hardware that's substantially more accessible than server-class infrastructure.

    Devstral 2 at Q4_K_M requires approximately 70GB, fitting on a single 80GB GPU (A100 80GB, H100 80GB) or split across two 48GB GPUs with tensor parallelism. At Q8_0, expect approximately 130GB. CPU inference is feasible on hosts with 192GB+ RAM but at substantially lower throughput than GPU deployment.
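    The inference-memory figures above follow from a simple bits-per-weight estimate, sketched below. The effective bits-per-weight values are rough averages (K-quants mix block precisions, so real GGUF files vary slightly), and KV cache plus runtime overhead are excluded:

```python
# Approximate GGUF inference memory from parameter count and quantization.
# Bits-per-weight values are rough averages, not exact file sizes;
# KV cache and runtime overhead are excluded.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def model_memory_gb(params_billion: float, quant: str) -> float:
    """Estimate weight memory in GB for a given size and quantization."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

print(f"Devstral Small 2 (24B) @ Q4_K_M: ~{model_memory_gb(24, 'Q4_K_M'):.0f}GB")
print(f"Devstral Small 2 (24B) @ Q8_0:   ~{model_memory_gb(24, 'Q8_0'):.0f}GB")
```

    The same arithmetic at 123B reproduces the roughly 70GB (Q4_K_M) and 130GB (Q8_0) figures quoted above, which is why the flagship needs an 80GB GPU or a two-GPU split.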

    For fine-tuning in Ertas Studio: Devstral Small 2 QLoRA needs 16-24GB VRAM at typical sequence lengths, fitting on a single consumer-tier GPU (RTX 4090, RTX 5090). Devstral 2 QLoRA needs 50-80GB VRAM, fitting on a single 80GB GPU or split across two 48GB GPUs with model parallelism. Long-context fine-tuning (32K-64K sequences) requires proportionally more memory with gradient checkpointing.

    Supported Quantizations

    Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16
