SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2024)
Citation: Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.
One sentence: A benchmark of 2,294 real GitHub issues from 12 popular Python repositories — to resolve each issue, a model must understand a full codebase, write a patch, and pass the existing test suite.
What Problem It Solved
Coding benchmarks before SWE-bench (HumanEval, MBPP, CodeContests) test isolated function completion. They don't capture what software engineering actually requires: navigating a large existing codebase, understanding context, making multi-file edits, and validating changes against tests.
SWE-bench uses real issues from real open-source repos. The task is not "write a function". It is "fix this reported bug in this production codebase".
Benchmark Construction
Source Repositories (12 Python repos)
Django, Flask, Requests, scikit-learn, xarray, Matplotlib, SymPy, pylint, Sphinx, pytest, seaborn, Astropy. Popular, well-maintained repos with good test coverage.
Issue Selection
For each resolved GitHub issue (see the sketch after this list):
- Find the PR that fixed it
- Identify the test(s) added in the PR that verify the fix
- Revert the codebase to pre-fix state
- Record: (repository, issue description, test(s) that should pass after the fix)
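A minimal sketch of that construction step, assuming pre-fetched issue/PR metadata; the dataclass names and fields here are illustrative, not the actual SWE-bench collection code, and the real pipeline additionally validates each instance by executing the tests before and after the gold patch.

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    base_commit: str     # commit the PR was applied on top of (the pre-fix state)
    added_tests: list    # test IDs introduced by the PR's diff

@dataclass
class Issue:
    body: str            # the issue text that will be shown to the model
    fixing_pr: PullRequest

@dataclass
class TaskInstance:
    repo: str
    base_commit: str
    issue_text: str
    fail_to_pass: list   # tests that must pass once a correct patch is applied

def build_instances(repo: str, resolved_issues: list) -> list:
    """Turn (issue, fixing PR) pairs into benchmark task instances."""
    instances = []
    for issue in resolved_issues:
        tests = issue.fixing_pr.added_tests
        if not tests:    # no new tests in the PR -> a fix can't be verified; skip
            continue
        instances.append(TaskInstance(
            repo=repo,
            base_commit=issue.fixing_pr.base_commit,
            issue_text=issue.body,
            fail_to_pass=tests,
        ))
    return instances

# Example with made-up data:
pr = PullRequest("abc123", ["tests/test_fix.py::test_issue_12345"])
print(build_instances("django/django", [Issue("QuerySet.count() raises ...", pr)]))
```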
Evaluation
A model is given the repository at its pre-fix state plus the issue text, and must output a unified diff (git patch).
The patch is applied to the repo and the target tests are run. Pass = the tests added by the fixing PR now pass and previously passing tests still pass. Fail = tests fail or the patch doesn't apply.
No human evaluation needed — test suite pass/fail is the oracle.
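A minimal sketch of that oracle, assuming a local clone and pytest-style test IDs; the real harness runs each instance in an isolated, per-repo environment, which is omitted here.

```python
import subprocess

def evaluate(repo_dir: str, base_commit: str, patch_file: str,
             fail_to_pass: list, pass_to_pass: list) -> bool:
    """Apply a model-generated patch at the pre-fix commit and run the target tests."""
    subprocess.run(["git", "-C", repo_dir, "checkout", "-f", base_commit], check=True)

    # The patch must apply cleanly, otherwise the instance counts as unresolved.
    if subprocess.run(["git", "-C", repo_dir, "apply", patch_file]).returncode != 0:
        return False

    # Tests added by the original fixing PR must now pass, and a sample of
    # previously passing tests must still pass (no regressions).
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass, *pass_to_pass],
                            cwd=repo_dir)
    return result.returncode == 0

# Hypothetical usage (paths and test IDs are illustrative):
# resolved = evaluate("clones/django", "abc123", "model_patch.diff",
#                     ["tests/queries/test_count.py::test_issue_12345"],
#                     ["tests/queries/test_count.py::test_existing_behaviour"])
```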
Difficulty and Results
Why It's Hard
- Repos have hundreds to thousands of files
- The relevant code to change is not identified — the model must find it
- Changes often touch multiple files
- Understanding the issue requires understanding the codebase's abstractions
- The fix must not break existing tests (many submissions introduce regressions)
Baseline Performance (2024)
In 2024, even the strongest systems resolved only a small fraction of issues:
| System | SWE-bench (full) |
|---|---|
| Claude 2 + retrieval | 4.8% |
| GPT-4 + retrieval | 1.7% |
| SWE-agent (GPT-4) | 12.5% |
| Devin (2024, claimed) | 13.9% |
Progress After Publication
SWE-bench Verified, a 500-issue subset screened by human annotators, was released by OpenAI in 2024. Scores rose rapidly:
| System | SWE-bench Verified |
|---|---|
| Claude 3.5 Sonnet (2024) | 49% |
| Claude 3.7 Sonnet (2025) | 62% |
| Claude Opus 4.6 (2025) | 80.8% |
| Claude Opus 4.7 (2026) | 87.6% |
The benchmark drove a rapid improvement cycle. Labs optimised their agents specifically for SWE-bench.
SWE-agent: The Follow-Up Agent
The same Princeton group followed up with SWE-agent (Yang et al., 2024): a simple agent scaffold (shell commands, a file viewer, search) that outperformed the retrieval-based baselines from the original paper.
Key insight from SWE-agent: models need an interactive environment to efficiently navigate code, not just a static dump of relevant files. Navigation, searching, and editing in a loop dramatically outperforms "retrieve relevant files, then generate patch".
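A minimal sketch of that loop, assuming a hypothetical `model.next_action()` interface; it illustrates the shape of an agent-computer interface (search, open files, run shell commands, submit a patch), not SWE-agent's actual implementation.

```python
import subprocess

def run_tool(action: dict, repo_dir: str) -> str:
    """Execute one tool call and return its (truncated) output as the next observation."""
    tool, arg = action["tool"], action["arg"]
    if tool == "search":            # e.g. grep for a symbol mentioned in the issue
        cmd = ["grep", "-rn", arg, "."]
    elif tool == "open":            # view a file the model wants to inspect
        cmd = ["cat", arg]
    elif tool == "shell":           # edit files, run tests, etc.
        cmd = ["bash", "-c", arg]
    else:
        return f"unknown tool: {tool}"
    out = subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)
    return (out.stdout + out.stderr)[:5000]   # truncate so observations fit in context

def solve(model, issue_text: str, repo_dir: str, max_steps: int = 30) -> str:
    """Interleave model actions and environment observations until a patch is submitted."""
    history = [{"role": "user", "content": issue_text}]
    for _ in range(max_steps):
        action = model.next_action(history)   # hypothetical: {"tool": ..., "arg": ...}
        if action["tool"] == "submit":
            return action["arg"]              # the unified diff to hand to the evaluator
        history.append({"role": "assistant", "content": str(action)})
        history.append({"role": "user", "content": run_tool(action, repo_dir)})
    return ""                                 # no patch produced within the step budget
```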
Impact
- Became the gold standard benchmark for agentic coding capability
- SWE-bench scores are now cited in model launch announcements alongside MMLU and HumanEval
- Drove development of coding agents: Devin, SWE-agent, Claude Code, GitHub Copilot Workspace
- "SWE-bench Verified" subset validated by human raters to filter ambiguous or impossible issues
- SWE-bench Multimodal extended to issues involving screenshots and UIs
Key Facts
- 2,294 issues; 12 Python repos; ICLR 2024; Princeton team
- Evaluation: apply patch → run test suite → pass/fail (automated, reproducible)
- SWE-bench Lite: a 300-issue representative subset for cheaper evaluation
- SWE-bench Verified: 500 human-validated issues (more reliable than full set)
- Claude 3.7 Sonnet: 62% on Verified (2025); frontier as of that release
- Repos: Django, Flask, Requests, scikit-learn, xarray, Matplotlib, SymPy, pylint, pytest, Sphinx, seaborn, Astropy
Connections
papers/key-papers · papers/react · agents/practical-agent-design · evals/methodology · ai-tools/claude-code · landscape/ai-labs
Open Questions
- What claims in this paper have since been challenged or superseded by follow-up work?
- What did later research reveal about the limitations of this approach?