SWE-Bench Verified

    A benchmark for evaluating language models on real-world software engineering tasks drawn from open-source GitHub repositories — measuring whether the model can autonomously close issues by making the correct multi-file code changes.

    Coding · Updated 2026-04-30

    What It Measures

    SWE-Bench Verified evaluates language models on real software engineering tasks: closing GitHub issues by making the correct code changes across one or more files. Each task includes the issue description, the relevant repository state, and a hidden test suite. A model's score on a task is binary: either the proposed code change passes the test suite (resolved) or it doesn't. The 'Verified' variant is a curated 500-task subset that has been manually reviewed for quality, removing tasks where the original test suite was ambiguous, the issue description was misleading, or the expected change depended on implementation details.
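
    To make the task format concrete, here is a minimal sketch of loading one task record, assuming the dataset id published on Hugging Face at the time of writing ("princeton-nlp/SWE-bench_Verified"); the field names follow that public release and are illustrative rather than normative.

```python
from datasets import load_dataset

# Dataset id as published on Hugging Face; field names follow that release
# and may change between versions.
tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

task = tasks[0]
print(task["instance_id"])        # unique task id, e.g. "astropy__astropy-12907"
print(task["repo"])               # source repository, e.g. "astropy/astropy"
print(task["base_commit"])        # repository state when the issue was filed
print(task["problem_statement"][:300])  # the GitHub issue text
# Hidden-suite bookkeeping: FAIL_TO_PASS lists tests the fix must make pass,
# PASS_TO_PASS lists tests the fix must not break.
print(task["FAIL_TO_PASS"])
print(task["PASS_TO_PASS"])
```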

    This is fundamentally different from synthetic coding benchmarks like HumanEval or MBPP, where the model writes a single function from a description. SWE-Bench tasks require the model to navigate an existing codebase, understand cross-file relationships, identify the right place to make changes, and produce edits that integrate cleanly with surrounding code. As a measure of agentic coding capability, SWE-Bench Verified is currently considered the most practically meaningful evaluation in widespread use.

    How It Works

    Each task in SWE-Bench Verified provides: the GitHub issue text, the repository state at the time the issue was reported, and a hidden test suite that the correct fix must pass. The model (or agent built on the model) is given access to the repository and must produce a code change. The change is applied to the repository, and the hidden test suite is run. The model's score on the task is 1 if all tests pass and 0 otherwise.
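
    A minimal sketch of that apply-and-test protocol, assuming a local checkout and a single test command. The official harness actually builds a per-task Docker image with repo-specific install and test steps, so evaluate_patch and test_cmd here are illustrative stand-ins, not the official API.

```python
import subprocess
from pathlib import Path

def evaluate_patch(repo_dir: Path, base_commit: str, model_patch: str,
                   test_cmd: list[str]) -> int:
    """Apply a model-proposed patch at the task's base commit and run the
    hidden test suite. Returns 1 (resolved) if all tests pass, else 0.

    Illustrative only: the official harness runs each task in its own
    container with repo-specific setup.
    """
    # Reset the repository to the state it had when the issue was filed.
    subprocess.run(["git", "checkout", "-f", base_commit],
                   cwd=repo_dir, check=True)

    # Apply the model's proposed change; a patch that fails to apply scores 0.
    apply = subprocess.run(["git", "apply", "-"], input=model_patch,
                           text=True, cwd=repo_dir)
    if apply.returncode != 0:
        return 0

    # Run the hidden test suite (e.g. pytest limited to the task's
    # FAIL_TO_PASS and PASS_TO_PASS tests). Exit code 0 means all passed.
    result = subprocess.run(test_cmd, cwd=repo_dir)
    return 1 if result.returncode == 0 else 0
```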

    Most current evaluations run the model inside an agent harness — a scaffolding layer that handles repository navigation, file reading, code editing, and test execution. The agent harness is itself a meaningful variable: the same model can score substantially differently with different harness implementations. Most reported SWE-Bench Verified scores use a standardized harness (often a CodeAct-style or SWE-agent-style scaffold), but harness details should always be checked when comparing scores across reports.
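
    A stripped-down sketch of what such a harness loop looks like, in the spirit of SWE-agent-style scaffolds. Here llm_complete is a hypothetical stand-in for whatever model client the harness wraps, and the tool set (read_file, write_file, run_tests, submit) is illustrative, not any particular harness's actual interface.

```python
import subprocess
from pathlib import Path

def llm_complete(history: list[str]) -> tuple[str, str]:
    """Hypothetical model call: given the conversation so far, return the
    model's next action as (tool_name, argument)."""
    raise NotImplementedError("wire up a real model client here")

def run_agent(repo_dir: Path, issue_text: str, max_steps: int = 30) -> str:
    """Minimal agent loop: the model picks a tool, the harness executes it,
    and the observation is fed back. Real harnesses add edit validation,
    output truncation, and sandboxing."""
    history = [f"ISSUE:\n{issue_text}"]
    for _ in range(max_steps):
        tool, arg = llm_complete(history)
        if tool == "read_file":
            obs = (repo_dir / arg).read_text()
        elif tool == "write_file":
            # Convention for this sketch: first line is the path, rest is content.
            path, _, content = arg.partition("\n")
            (repo_dir / path).write_text(content)
            obs = f"wrote {path}"
        elif tool == "run_tests":
            proc = subprocess.run(["python", "-m", "pytest", arg],
                                  cwd=repo_dir, capture_output=True, text=True)
            obs = proc.stdout[-2000:]  # keep only the tail of long test output
        elif tool == "submit":
            # Agent is done: export its cumulative edits as a diff to be scored.
            diff = subprocess.run(["git", "diff"], cwd=repo_dir,
                                  capture_output=True, text=True)
            return diff.stdout
        else:
            obs = f"unknown tool: {tool}"
        history.append(f"TOOL {tool}({arg}) -> {obs}")
    return ""  # step budget exhausted; an empty patch scores 0
```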

    How to Interpret Scores

    SWE-Bench Verified scores tend to correlate well with real-world coding-agent reliability, substantially better than HumanEval, which is now considered saturated and contamination-prone. A score of 80%+ on SWE-Bench Verified represents a coding model that can credibly handle a meaningful fraction of real-world engineering tasks autonomously, though the failed ~20% will include some easy-looking tasks that break for hard-to-predict reasons. Scores below 50% indicate a model that requires substantial human review on most tasks. The benchmark is hard enough that 100% is unlikely to be reached for some time: the failure modes include ambiguity in issue descriptions and edge cases in test suites that even strong models will occasionally trip on.
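
    One more reason not to over-read small score gaps: with only 500 binary tasks, a resolve rate carries visible sampling noise. A rough sketch, treating tasks as independent (only approximately true here) and using a normal approximation:

```python
import math

def resolve_rate_ci(resolved: int, n: int = 500,
                    z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a pass rate on a
    benchmark of n binary tasks. Treats tasks as independent, which is only
    approximately true for SWE-Bench Verified."""
    p = resolved / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of a proportion
    return p - z * se, p + z * se

# At 80% resolved on 500 tasks the interval is roughly +/- 3.5 points.
lo, hi = resolve_rate_ci(400)
print(f"{lo:.3f} .. {hi:.3f}")  # ~0.765 .. 0.835
```

    At an 80% resolve rate the 95% interval spans roughly seven points, so differences of a point or two between reported scores are rarely meaningful on their own, especially when the harness also differs.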
