Public Benchmarks
Shared, comparable yardsticks for agentic coding -- the same tasks and grader run against every model, led by SWE-bench. The coarse capability filter for choosing what to trial, distinct from your own evals; useful but easy to over-trust, since benchmarks rot, leak, and do not reflect your codebase.
The Pattern
"Are the models really that far from independently handling compound coding tasks, or are the evaluation standards skewed?" -- Toloka, auditing SWE-bench
A public benchmark is a shared, comparable yardstick for agentic coding: the same tasks, the same grader, run against every model and agent so results can be ranked against one another. The dominant one is SWE-bench: the agent is handed a real GitHub issue and must produce a patch that makes the project's hidden unit tests pass; if they pass, the task counts as solved (SWE-bench). It is the headline number labs quote, and scores have climbed fast -- from roughly 38% on SWE-bench Lite in mid-2024 to open agents now resolving well over 70% of SWE-bench Verified. Those are different subsets of the benchmark, so the jump is indicative rather than a like-for-like comparison. Some of the agents involved are tiny: mini-SWE-agent is billed as a 100-line agent scoring above 74% (mini-SWE-agent).
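The pass/fail rule described above can be sketched in a few lines. This is a minimal illustration, not the official SWE-bench harness: the split into "fail-to-pass" tests (which the fix must turn green) and "pass-to-pass" tests (which must stay green) follows the way SWE-bench labels its tests, but `TaskResult`, `is_resolved`, `resolve_rate`, and the sample data are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of running the hidden tests against one agent-produced patch.

    Hypothetical structure for illustration; not the official harness format.
    """
    task_id: str
    fail_to_pass: dict[str, bool]  # test name -> passed? (tests the fix must turn green)
    pass_to_pass: dict[str, bool]  # test name -> passed? (tests that must stay green)


def is_resolved(result: TaskResult) -> bool:
    """A task counts as solved only if every hidden test passes."""
    # all() on an empty dict is vacuously True, so insist on at least one target test.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved: the headline percentage a leaderboard reports."""
    if not results:
        raise ValueError("no results to score")
    return sum(is_resolved(r) for r in results) / len(results)


# Made-up results for four tasks.
results = [
    TaskResult("task-1", {"test_fix": True}, {"test_old": True}),
    TaskResult("task-2", {"test_fix": True}, {"test_old": False}),  # patch broke existing behaviour
    TaskResult("task-3", {"test_fix": False}, {}),                  # patch did not fix the issue
    TaskResult("task-4", {"test_fix": True}, {}),
]
print(f"{resolve_rate(results):.0%}")  # 2 of 4 tasks resolved
```

Note what the score does not capture: a patch is binary pass or fail against a fixed test set, so a correct fix that the tests happen to reject scores the same as no fix at all.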
Benchmarks come in flavours -- end-to-end issue resolution (SWE-bench), test-writing (SWT-Bench), terminal tasks, context retrieval -- but all share the promise of an apples-to-apples comparison no single team could produce alone.
Why It Matters
Benchmarks are how the field calibrates: they turn "this agent feels good" into a number you can rank, and they drive rapid, visible progress. For a platform team they are the coarse filter for which model or agent is even worth trialing.
But treat the number with suspicion:
- A benchmark is not your codebase. A high SWE-bench score says an agent can fix some open-source Python issues; it says little about your stack, your conventions, or your scale -- the same reason two models that rank as "statistical twins" can feel nothing alike in real work (see models).
- Benchmarks rot and leak. Toloka's audit of SWE-bench found overly specific tests that reject correct fixes, vague problem descriptions, and environment mismatches. Its newest issue dates to 2023, which leaves the dataset both stale and exposed to training-data contamination (Toloka).
- What you optimize, you game. A public target invites overfitting -- the score climbs while real-world capability lags behind it.
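One way to make the staleness point in the list above concrete is to compare each task's issue date with a model's training cutoff. The sketch below is an assumption-laden illustration: `BenchmarkTask`, `split_by_cutoff`, the task dates, and the cutoff are all hypothetical, and a task that predates the cutoff is only possibly seen, not proven to be in the training data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkTask:
    """Hypothetical record: one benchmark task and the date its issue went public."""
    task_id: str
    issue_created: date


def split_by_cutoff(
    tasks: list[BenchmarkTask], training_cutoff: date
) -> tuple[list[BenchmarkTask], list[BenchmarkTask]]:
    """Partition tasks into (possibly_seen, unseen) relative to a training cutoff."""
    possibly_seen = [t for t in tasks if t.issue_created <= training_cutoff]
    unseen = [t for t in tasks if t.issue_created > training_cutoff]
    return possibly_seen, unseen


# Made-up tasks and an assumed cutoff, for illustration only.
tasks = [
    BenchmarkTask("task-a", date(2022, 3, 14)),
    BenchmarkTask("task-b", date(2023, 8, 2)),
    BenchmarkTask("task-c", date(2024, 11, 20)),
]
cutoff = date(2024, 6, 1)

possibly_seen, unseen = split_by_cutoff(tasks, cutoff)
print(len(possibly_seen), len(unseen))
```

If every task in a benchmark predates the cutoff, the unseen list is empty and the score cannot rule out memorisation.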
So use public benchmarks to choose what to trial, and your own evals -- run on your codebase -- to decide what actually works.
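The two-stage decision can be sketched as a coarse filter followed by a final pick. Everything here is hypothetical: the function names, the model names, the threshold, and every score are invented for illustration, and a real internal eval would be far richer than a single pass rate.

```python
def shortlist(public_scores: dict[str, float], min_score: float) -> list[str]:
    """Stage 1: use the public benchmark only to decide what is worth trialing."""
    candidates = [name for name, score in public_scores.items() if score >= min_score]
    return sorted(candidates, key=public_scores.get, reverse=True)


def pick(shortlisted: list[str], internal_pass_rates: dict[str, float]) -> str:
    """Stage 2: decide on results from your own evals, run on your codebase."""
    if not shortlisted:
        raise ValueError("nothing cleared the public-benchmark filter")
    return max(shortlisted, key=lambda name: internal_pass_rates[name])


# Invented numbers: public benchmark scores and pass rates on an internal eval suite.
public_scores = {"model-a": 0.74, "model-b": 0.72, "model-c": 0.41}
internal_pass_rates = {"model-a": 0.55, "model-b": 0.68}

trial_set = shortlist(public_scores, min_score=0.60)
winner = pick(trial_set, internal_pass_rates)
print(trial_set, winner)
```

In this made-up case the public leader is not the one that does best on the internal suite, which is the reason the final decision rests on the second stage.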
Sources
- SWE-bench: resolving real GitHub issues -- the standard public benchmark for coding agents
- Fixing SWE-bench: a smarter way to evaluate coding AI (benchmark shortcomings) -- Toloka
- mini-SWE-agent: a 100-line agent scoring >74% on SWE-bench -- SWE-agent (Princeton)
- SWT-Bench: testing and validating real-world bug-fixes with code agents -- ETH Zurich / SRI Lab
Last reviewed: 2026-06-25