SWE-bench Multimodal is an all-new, challenging benchmark containing software issues described with images.
mini-SWE-agent, an agent implemented in just 100 lines of Python, scores up to 74% on SWE-bench Verified.
Introducing CodeClash, our new evaluation where LMs compete head to head to write the best codebase!
SWE-bench Bash Only uses the SWE-bench Verified dataset with the mini-SWE-agent environment for all models [Post].
SWE-bench Lite is a subset curated for less costly evaluation [Post].
SWE-bench Verified is a human-filtered subset [Post].
SWE-bench Multimodal features issues with visual elements [Post].
Each entry reports % Resolved, the percentage of task instances solved (out of 2,294 for the full SWE-bench, 500 for Verified and Bash Only, 300 for Lite, and 517 for Multimodal).
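As a rough illustration, % Resolved is simply the number of resolved instances divided by the size of the split. The minimal sketch below assumes the Hugging Face `datasets` library and a hypothetical set of resolved instance IDs (in practice, produced by the SWE-bench evaluation harness); it computes the metric for the Verified split:

```python
from datasets import load_dataset  # pip install datasets

# Load the SWE-bench Verified test split (500 instances) from Hugging Face.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

# Hypothetical example: instance IDs your agent resolved, as determined
# by running the evaluation harness on its predictions.
resolved_ids = {"astropy__astropy-12907", "django__django-11099"}

# % Resolved = resolved instances / total instances in the split.
pct_resolved = 100 * len(resolved_ids) / len(verified)
print(f"% Resolved: {pct_resolved:.1f}% ({len(resolved_ids)}/{len(verified)})")
```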
We thank the following institutions for their generous support: Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.