SWE-bench Multimodal is an all-new, challenging benchmark containing software issues described with images.
mini-SWE-agent, an agent implemented in just 100 lines of Python, scores up to 74% on SWE-bench Verified.
Introducing CodeClash, our new evaluation where LMs compete head to head to write the best codebase!
SWE-bench Bash Only uses the SWE-bench Verified dataset with the mini-SWE-agent environment for all models [Post].
SWE-bench Lite is a subset curated for less costly evaluation [Post].
SWE-bench Verified is a human-filtered subset [Post].
SWE-bench Multimodal features issues with visual elements [Post].
Each entry reports % Resolved, the percentage of task instances solved (out of 2,294 for the full SWE-bench, 500 for Verified and Bash Only, 300 for Lite, and 517 for Multimodal).
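As a rough illustration, % Resolved is simply the number of resolved instances divided by the size of the split. The minimal sketch below assumes the Hugging Face `datasets` library and a hypothetical set of resolved instance IDs (in practice, produced by the SWE-bench evaluation harness); it computes the metric for the Verified split:

```python
from datasets import load_dataset  # pip install datasets

# Load the SWE-bench Verified test split (500 instances) from Hugging Face.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

# Hypothetical example: instance IDs your agent resolved, as determined
# by running the evaluation harness on its predictions.
resolved_ids = {"astropy__astropy-12907", "django__django-11099"}

# % Resolved = resolved instances / total instances in the split.
pct_resolved = 100 * len(resolved_ids) / len(verified)
print(f"% Resolved: {pct_resolved:.1f}% ({len(resolved_ids)}/{len(verified)})")
```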
We thank the following institutions for their generous support: Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.