Fine-Tune Devstral 2 with Ertas
Mistral AI's coding-specialized open-weight family — Devstral 2 (123B) and Devstral Small 2 (24B), with the 123B variant scoring 72.2% on SWE-Bench Verified and the 24B running on consumer hardware. Released as a coding specialist line before being absorbed into Mistral Small 4's unified architecture in March 2026.
Overview
Devstral 2, released by Mistral AI in 2025, is the second generation of Mistral's dedicated agentic coding model. The family ships in two sizes: a 123-billion-parameter flagship (Devstral 2) and a 24-billion-parameter consumer-deployable variant (Devstral Small 2). Both are open-weight releases targeting agentic coding workloads, the multi-step task patterns that characterize CLI-based coding agents like Claude Code, Cline, and Aider.
Devstral 2's headline benchmark result is 72.2% on SWE-Bench Verified, a score that placed it in the top tier of open-weight coding models at release. Devstral Small 2 achieves 68.0% on the same benchmark, which is exceptional for a 24B-parameter model and competitive with substantially larger alternatives. For teams that want strong coding capability on consumer-deployable hardware, Devstral Small 2 hits a particularly productive sweet spot.
The Devstral line was substantively absorbed into Mistral Small 4's unified architecture in March 2026. Where Mistral previously maintained three distinct model lineages — Magistral for reasoning, Devstral for coding, and Mistral Small for instruction-tuned use — Mistral Small 4 unifies all three into a single 119B-A6B mixture-of-experts checkpoint. For new deployments, Mistral Small 4 is the recommended path — but Devstral 2 remains valid for teams running stable production deployments adopted before the consolidation.
Devstral 2's positioning as a dedicated coding specialist still matters in specific deployment scenarios. While Mistral Small 4 covers coding via its unified architecture, the Devstral 2 line was engineered specifically for agentic coding workloads: different post-training emphasis, different evaluation suite, different deployment patterns. For teams whose primary use case is coding rather than general-purpose AI, Devstral 2 retains advantages in these niches even after the consolidation.
The license for Devstral 2 covers open-weight deployment but is worth reviewing for specific commercial scenarios. Devstral Small 2 in particular ships under terms designed to support consumer-product deployment without restrictive usage caps. Weights are available on Hugging Face under Mistral's organization.
Key Features
Devstral 2's 72.2% SWE-Bench Verified score positioned the model competitively against open-weight alternatives at release. The benchmark measures real-world software engineering capability — multi-file changes, test-driven iteration, codebase navigation — and Devstral 2's score reflects genuine production-grade coding capability rather than synthetic benchmark optimization.
Devstral Small 2 at 24B parameters with 68.0% SWE-Bench Verified is the standout efficiency result. For consumer-deployable scales, achieving this score is exceptional — substantially exceeding general-purpose 24B alternatives and approaching the capability of much larger coding-specialized models. For teams wanting frontier-tier coding capability on consumer or workstation hardware, Devstral Small 2 is among the strongest options in the family.
The coding-specialist positioning differentiates Devstral 2 from general-purpose alternatives. While Mistral Small 4's unified architecture covers coding through general post-training, Devstral 2 specifically targeted agentic coding workloads with appropriate training data emphasis — multi-step coding traces, tool-use patterns, test-driven iteration examples. For teams whose deployment is exclusively coding-focused, this specialization provides quality advantages over general-purpose alternatives at equivalent parameter counts.
Mistral's strong tool-use training tradition translates well to Devstral 2's agentic coding capability. The model handles function calls, structured outputs, and multi-step tool sequences with high fidelity — capabilities that matter for agentic coding deployments where reliability of the tool-use loop is often more important than raw code-generation quality.
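In practice, that tool-use loop is usually driven through an OpenAI-compatible endpoint (vLLM exposes one, for example). The sketch below builds such a request with a single tool attached; the model id, endpoint conventions, and the `run_tests` tool are illustrative assumptions, not part of the Devstral 2 release.

```python
import json

# Hypothetical tool schema for a coding agent. The tool name and fields
# are assumptions for illustration; any OpenAI-style function schema works.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the output.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test file or directory"},
            },
            "required": ["path"],
        },
    },
}

def build_tool_request(user_message: str) -> dict:
    """Build an OpenAI-style chat-completions payload with one tool attached."""
    return {
        "model": "devstral-2",  # assumed model id on a self-hosted server
        "messages": [{"role": "user", "content": user_message}],
        "tools": [RUN_TESTS_TOOL],
        "tool_choice": "auto",  # let the model decide when to call run_tests
    }

payload = build_tool_request("The auth tests are failing; investigate and fix.")
print(json.dumps(payload, indent=2))
```

The agent loop then executes whatever `tool_calls` come back, appends the results as `tool` turns, and re-prompts until the model answers without a call.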
Fine-Tuning with Ertas
Devstral 2 fine-tuning in Ertas Studio is straightforward across both variants. Devstral Small 2 (24B) fine-tunes with QLoRA on consumer GPUs (16-24GB VRAM), making it among the most accessible coding-specialist bases for teams without server-class infrastructure. Devstral 2 (123B) requires workstation or modest-server configurations — 48GB+ GPU for QLoRA at typical sequence lengths.
For coding-domain fine-tuning specifically, Devstral 2 benefits from training data that includes complete agentic execution traces — task descriptions, planning, multi-file edits, test outputs, and corrective iterations. Ertas Studio supports these multi-step formats natively, including tool-use traces from Claude Code, Cline, or Aider runs. Training on your team's specific codebase produces a domain-specialized coding model that outperforms the base on tasks within your codebase.
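As a concrete sketch of what one such trace might look like as a JSONL record: the snippet below uses the common OpenAI-style `messages` convention with tool calls and tool results inlined as turns. The exact schema Ertas Studio ingests is an assumption here, as are the file and tool names.

```python
import json

# One agentic coding trace as a chat-style record: task, plan, tool call,
# tool result, and corrective follow-up. Field names follow the common
# "messages" convention; the precise ingest schema is an assumption.
trace = {
    "messages": [
        {"role": "user", "content": "Fix the failing test in utils/date.py."},
        {"role": "assistant", "content": "Plan: reproduce the failure, then patch parse_date()."},
        {"role": "assistant", "tool_calls": [
            {"id": "call_1", "type": "function",
             "function": {"name": "run_tests",
                          "arguments": json.dumps({"path": "tests/test_date.py"})}}
        ]},
        {"role": "tool", "tool_call_id": "call_1",
         "content": "FAILED tests/test_date.py::test_iso - ValueError"},
        {"role": "assistant",
         "content": "The parser rejects ISO offsets; patching parse_date() to use datetime.fromisoformat."},
    ]
}

# Write one record per line, ready for a fine-tuning upload.
with open("traces.jsonl", "w") as f:
    f.write(json.dumps(trace) + "\n")
```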
For teams choosing between Devstral 2 fine-tuning and Mistral Small 4 fine-tuning, the recommendation depends on the deployment profile. Mistral Small 4's 6B active parameter count provides better fine-tuning economics for general-purpose specialization including coding. Devstral 2 offers somewhat better baseline coding-specific capability for teams whose fine-tuned variant will be exclusively used for coding workloads — but the gap has narrowed substantially with Mistral Small 4's release.
After training, Ertas Studio exports to GGUF format with full Devstral 2 chat template preservation. Both variants deploy cleanly via Ollama, llama.cpp, or vLLM with standard configuration.
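Wiring the exported GGUF into Ollama amounts to a two-line Modelfile, sketched below. The export filename is an assumption (your actual output name will differ); since the chat template ships inside the GGUF, only the weights reference and sampling defaults are needed.

```python
from pathlib import Path

# Assumed export filename; substitute your actual GGUF output.
GGUF_PATH = "devstral-small-2-finetuned.Q4_K_M.gguf"

# Minimal Ollama Modelfile: weights reference plus sampling defaults.
modelfile = f"""FROM ./{GGUF_PATH}
PARAMETER temperature 0.2
PARAMETER num_ctx 32768
"""

Path("Modelfile").write_text(modelfile)
print("Next: ollama create devstral-small-2-ft -f Modelfile")
```

A low temperature is a deliberate default for coding workloads, where deterministic edits matter more than sampling diversity.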
Use Cases
Self-hosted coding agent deployments on consumer or workstation hardware are Devstral Small 2's most natural use case. Teams of 5-20 developers wanting strong coding agent capability without committing to server infrastructure find Devstral Small 2 among the most accessible options at the 24B size class. Production patterns include AI pair-programming for small enterprise codebases, autonomous PR generation for routine change patterns, and CI-integrated code review at moderate request volumes.
Devstral 2 at 123B targets larger team deployments where the additional capability justifies the workstation/server hardware investment. AI pair-programming for large enterprise codebases, autonomous coding agents handling complex refactors, and high-throughput code review automation benefit from the 123B variant's stronger baseline capability.
For teams running stable production deployments on Devstral 2 before the Mistral Small 4 consolidation, the model remains documented and supported. The migration to Mistral Small 4 offers operational simplification (one model replacing three separate lineages) but comes with non-trivial migration costs for teams with existing Devstral-specific fine-tunes or downstream tooling. Continuing Devstral 2 deployment is valid for these scenarios.
For European teams or any deployment subject to data sovereignty requirements, Mistral's EU-headquartered positioning combined with Devstral 2's open-weight distribution provides structural advantages over US-based or Chinese-lab alternatives. Self-hosted deployment on EU infrastructure with EU-developed models meets compliance requirements that some regulatory environments specifically require.
Hardware Requirements
Devstral Small 2 at Q4_K_M quantization requires approximately 14GB of memory, fitting on 24GB consumer GPUs such as the RTX 3090 and RTX 4090. At Q8_0, expect approximately 26GB. The 24B size makes it deployable on workstation hardware that's substantially more accessible than server-class infrastructure.
Devstral 2 at Q4_K_M requires approximately 70GB, fitting on a single 80GB GPU (A100 80GB, H100 80GB) or split across two 48GB GPUs with tensor parallelism. At Q8_0, expect approximately 130GB. CPU inference is feasible on hosts with 192GB+ RAM but at substantially lower throughput than GPU deployment.
For fine-tuning in Ertas Studio: Devstral Small 2 QLoRA needs 16-24GB VRAM at typical sequence lengths, fitting on a single consumer-tier GPU (RTX 4090, RTX 5090). Devstral 2 QLoRA needs 50-80GB VRAM, fitting on a single 80GB GPU or split across two 48GB GPUs with model parallelism. Long-context fine-tuning (32K-64K sequences) requires proportionally more memory with gradient checkpointing.
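The inference figures above follow a simple weights-only rule of thumb, sketched below. The effective bits-per-weight values are approximate averages for llama.cpp quantization formats (an assumption), and real deployments need additional headroom for KV cache and runtime overhead.

```python
# Approximate effective bits per weight for llama.cpp quantizations
# (assumed averages; K-quants mix bit widths across tensors).
BITS_PER_WEIGHT = {"Q4_K_M": 4.65, "Q8_0": 8.5}

def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Approximate GB needed for model weights alone at a given quantization."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for name, params in [("Devstral Small 2", 24), ("Devstral 2", 123)]:
    for quant in ("Q4_K_M", "Q8_0"):
        print(f"{name} {quant}: ~{weight_memory_gb(params, quant):.0f} GB")
```

This reproduces the ballpark figures quoted above (roughly 14GB and 26GB for the 24B model, 70GB and 130GB for the 123B model), before KV cache is counted.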