SWE-Bench Verified
Coding
A benchmark for evaluating language models on real-world software engineering tasks drawn from open-source GitHub repositories. It measures whether the model can autonomously resolve issues by making the correct multi-file code changes.
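
As a rough illustration of how "resolving an issue" might be scored, the sketch below assumes each task ships with two groups of tests: ones that fail before the fix and should pass afterwards, and ones that already pass and must not break. That split, the `InstanceResult` record, and the `is_resolved` and `resolved_rate` helpers are all assumptions made for this example, not the benchmark's actual harness.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Hypothetical record of test outcomes after applying a model's patch."""
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests expected to go from failing to passing
    pass_to_pass: dict[str, bool]  # tests that passed before and must still pass


def is_resolved(result: InstanceResult) -> bool:
    """Count an instance as resolved only if every listed test passes.

    Assumption: an instance with no fail-to-pass tests cannot be resolved,
    since there would be nothing demonstrating the fix.
    """
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolved_rate(results: list[InstanceResult]) -> float:
    """Fraction of instances resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Made-up outcomes for two illustrative instances.
example_results = [
    InstanceResult("example-1", {"test_fix": True}, {"test_existing": True}),
    InstanceResult("example-2", {"test_fix": False}, {"test_existing": True}),
]
example_rate = resolved_rate(example_results)
```

Under these assumptions the score is all-or-nothing per task: a patch that fixes the reported bug but breaks a previously passing test is not counted as resolved.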