SWE-Bench Pro

    A harder successor to SWE-Bench Verified, designed to be contamination-resistant and to evaluate models on more complex multi-file changes drawn from recent GitHub issues — the current frontier benchmark for agentic coding capability.

    Coding · Updated 2026-04-30

    What It Measures

    SWE-Bench Pro is the harder successor to SWE-Bench Verified, designed to address two limitations of the earlier benchmark. First, it includes more complex multi-file changes — tasks where a fix requires coordinated edits across several files rather than localized changes to a single file. Second, it draws from more recent GitHub issues, reducing the risk that frontier models have seen the tasks during training (a contamination problem that has affected SWE-Bench Verified scores at the top of the leaderboard).

    By late 2025 / early 2026, SWE-Bench Pro had become the benchmark of choice for serious agentic coding evaluation. Models now compete on the SWE-Bench Pro leaderboard the way they competed on SWE-Bench Verified two years earlier, with one difference: Pro scores remain well below 100%, leaving room for models to differentiate themselves at the high end.

    How It Works

    The evaluation methodology is similar to SWE-Bench Verified: each task includes a GitHub issue, repository state, and hidden test suite. The model must produce code changes that pass the test suite. The differences are in task selection — SWE-Bench Pro emphasizes harder, multi-file, more recent tasks — and in the rigor of the curation process to ensure tasks are unambiguous and verifiable.
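
    A rough sketch of that per-task loop, assuming a simplified task schema. The field names, the generate_patch callable, and the use of "git apply" are illustrative assumptions, not the official SWE-Bench Pro harness API:

        import subprocess
        from dataclasses import dataclass

        @dataclass
        class Task:
            # One SWE-Bench-Pro-style task (hypothetical schema, not the official one).
            issue_text: str     # the GitHub issue the model must resolve
            repo_dir: str       # checkout of the repository at the pinned commit
            test_command: str   # runs the hidden test suite; exit code 0 means pass

        def evaluate(task, generate_patch):
            # True if the model's proposed patch makes the hidden tests pass.
            patch = generate_patch(task.issue_text, task.repo_dir)  # model/agent call
            applied = subprocess.run(["git", "apply", "-"], cwd=task.repo_dir,
                                     input=patch, text=True)
            if applied.returncode != 0:
                return False  # a malformed patch counts as a failure
            tests = subprocess.run(task.test_command, shell=True, cwd=task.repo_dir)
            return tests.returncode == 0

        # The reported score is the fraction of tasks resolved:
        #   score = sum(evaluate(t, agent) for t in tasks) / len(tasks)

    The sketch captures only the pass/fail contract; real harnesses typically add sandboxing, timeouts, and step limits on top of it.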

    As with SWE-Bench Verified, the agent harness used to run the model is a meaningful variable. Reported SWE-Bench Pro scores generally use standardized harnesses, but cross-report comparisons should always check harness details. Some reports also distinguish between 'pure model' scores and 'model + scaffolding' scores; SWE-Bench Pro tasks are sufficiently complex that the scaffolding choice can move scores by 5-10 percentage points.
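
    One practical consequence is that a reported number only means something when the harness configuration travels with it. Below is a minimal sketch of the metadata a fair cross-report comparison needs; the field names and figures are illustrative assumptions, not real leaderboard entries:

        from dataclasses import dataclass

        @dataclass(frozen=True)
        class ReportedScore:
            # A benchmark number plus the context needed to compare it fairly.
            model: str
            score_pct: float
            harness: str            # agent framework and version used for the run
            scaffolding: str        # "pure model" vs. tools, retrieval, retries, etc.
            attempts_per_task: int  # pass@1 vs. best-of-n changes what the score means

        a = ReportedScore("model-x", 52.0, "harness-a v1.3", "pure model", 1)
        b = ReportedScore("model-x", 60.5, "harness-b v2.0", "tools + retries", 3)
        # Same model, an 8.5-point spread: the harness and scaffolding, not the
        # model, account for the difference, so the percentages alone are not
        # comparable across reports.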

    Current Leaders

    #1 · MiMo V2.5 Pro · Leader (open-weight)

    Per Xiaomi's evaluations, claimed to beat Claude Opus 4.6.

    How to Interpret Scores

    SWE-Bench Pro scores run substantially lower than SWE-Bench Verified scores for the same model: a model scoring 80% on Verified might score 50-60% on Pro. This is by design. Pro is meant to be the harder evaluation that frontier models can still meaningfully fail. As of April 2026, the open-weight leader on SWE-Bench Pro is reportedly MiMo V2.5 Pro, per Xiaomi's own evaluations (claimed to beat Claude Opus 4.6), while Claude Opus 4.7 leads proprietary models at 64.3%. Independent verification of these claims is ongoing. For practical evaluation, SWE-Bench Pro is a more credible signal of frontier agentic coding capability than Verified, which is increasingly contaminated and saturated.
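
    As a worked example of that interpretation, using the passage's own illustrative figures (80% on Verified, 50-60% on Pro; not real leaderboard numbers):

        def retained_on_pro(verified_pct, pro_pct):
            # Fraction of a model's Verified score that survives on Pro.
            return pro_pct / verified_pct

        low, high = retained_on_pro(80, 50), retained_on_pro(80, 60)
        print(f"Pro retains {low:.0%}-{high:.0%} of the Verified score")
        # -> Pro retains 62%-75% of the Verified score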
