SWE-Bench Pro
A harder successor to SWE-Bench Verified, designed to be contamination-resistant and to evaluate models on more complex multi-file changes drawn from recent GitHub issues — the current frontier benchmark for agentic coding capability.
SWE-Bench Pro is the harder successor to SWE-Bench Verified, designed to address two limitations of the earlier benchmark. First, it includes more complex multi-file changes — tasks where a fix requires coordinated edits across several files rather than localized changes to a single file. Second, it draws from more recent GitHub issues, reducing the risk that frontier models have seen the tasks during training (a contamination problem that has affected SWE-Bench Verified scores at the top of the leaderboard).
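To make the "multi-file" distinction concrete, here is a minimal sketch of filtering SWE-bench-style task instances by how many files their gold patch touches. The field names (repo, base_commit, problem_statement, patch) follow the original SWE-bench conventions; they are illustrative assumptions, not necessarily the exact SWE-Bench Pro schema.

```python
from dataclasses import dataclass

@dataclass
class TaskInstance:
    """One benchmark task: a GitHub issue plus its gold fix (illustrative fields)."""
    instance_id: str
    repo: str
    base_commit: str
    problem_statement: str   # the issue text the model sees
    patch: str               # gold patch, unified diff format

def files_touched(diff: str) -> set[str]:
    """Collect the distinct files modified by a unified diff."""
    return {
        line.split(" b/", 1)[1]
        for line in diff.splitlines()
        if line.startswith("diff --git ")
    }

def is_multi_file(task: TaskInstance, min_files: int = 2) -> bool:
    """Pro-style tasks require coordinated edits across several files."""
    return len(files_touched(task.patch)) >= min_files
```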
By late 2025 / early 2026, SWE-Bench Pro had become the benchmark of choice for serious agentic coding evaluation. Models are now competing on the SWE-Bench Pro leaderboard the way they competed on SWE-Bench Verified two years earlier, with the difference that SWE-Bench Pro scores remain meaningfully bounded below 100%, leaving room for models to differentiate themselves at the high end.
The evaluation methodology is similar to SWE-Bench Verified: each task includes a GitHub issue, repository state, and hidden test suite. The model must produce code changes that pass the test suite. The differences are in task selection — SWE-Bench Pro emphasizes harder, multi-file, more recent tasks — and in the rigor of the curation process to ensure tasks are unambiguous and verifiable.
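A minimal sketch of that evaluation loop, under stated assumptions: check out the repository at the task's base commit, apply the model-generated patch, then rerun the hidden test suite and count the task as resolved only if the tests pass. Real harnesses run this inside isolated containers with pinned dependencies; the git and pytest commands below are illustrative, not the official SWE-Bench Pro harness.

```python
import subprocess
from pathlib import Path

def run(cmd: list[str], cwd: Path) -> bool:
    """Run a command in the given directory, returning True on exit code 0."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0

def evaluate(repo_dir: Path, base_commit: str, model_patch: str,
             fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Apply the model's patch at the base commit and rerun the hidden tests.

    A task counts as resolved only if the tests that reproduced the bug now
    pass (fail_to_pass) and the rest of the suite still passes (pass_to_pass).
    """
    if not run(["git", "checkout", "--force", base_commit], cwd=repo_dir):
        return False
    patch_file = repo_dir / "model.patch"
    patch_file.write_text(model_patch)
    if not run(["git", "apply", patch_file.name], cwd=repo_dir):
        return False  # the patch does not even apply cleanly
    tests = fail_to_pass + pass_to_pass
    return run(["python", "-m", "pytest", "-q", *tests], cwd=repo_dir)
```

Splitting fail_to_pass from pass_to_pass mirrors the original SWE-bench convention: the first set shows the bug is actually fixed, the second guards against regressions elsewhere in the codebase.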
As with SWE-Bench Verified, the agent harness used to run the model is a meaningful variable. Reported SWE-Bench Pro scores generally use standardized harnesses, but cross-report comparisons should always check harness details. Some reports also distinguish between 'pure model' scores and 'model + scaffolding' scores; SWE-Bench Pro tasks are sufficiently complex that the scaffolding choice can move scores by 5-10 percentage points.
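Because the harness is a meaningful variable, a reported score is only comparable when it carries its run configuration. The sketch below shows what a self-describing run report might record; every field name here is hypothetical, not a standard SWE-Bench Pro reporting format.

```python
from dataclasses import dataclass

@dataclass
class RunReport:
    """A score plus the harness details needed for fair cross-report comparison."""
    model: str
    benchmark: str = "SWE-Bench Pro"
    harness: str = ""           # which agent framework/version ran the model
    scaffolding: str = "none"   # extra tooling around the bare model, if any
    max_turns: int = 0
    resolved: int = 0
    total: int = 0

    @property
    def resolution_rate(self) -> float:
        return self.resolved / self.total if self.total else 0.0

# Two reports for the "same" model are not directly comparable if the harness
# or scaffolding differs -- exactly the 5-10 point swing noted above.
report_a = RunReport(model="model-x", harness="harness-a v1.0",
                     scaffolding="none", max_turns=50, resolved=270, total=500)
report_b = RunReport(model="model-x", harness="harness-b v2.3",
                     scaffolding="retrieval + repo map", max_turns=100,
                     resolved=305, total=500)
print(f"{report_a.resolution_rate:.1%} vs {report_b.resolution_rate:.1%}")
```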
SWE-Bench Pro scores are substantially lower than SWE-Bench Verified scores for the same model — a model scoring 80% on Verified might score 50-60% on Pro. This is by design: Pro is meant to be the harder evaluation that frontier models can still meaningfully fail. As of April 2026, the open-weight leader on SWE-Bench Pro is reportedly MiMo V2.5 Pro per Xiaomi's evaluations (claimed to beat Claude Opus 4.6), with Claude Opus 4.7 leading proprietary models at 64.3%. Independent verification of these claims is ongoing. For practical evaluation, SWE-Bench Pro is a more credible signal of frontier agentic coding capability than Verified, which is increasingly contaminated and saturated.