Pattern | Tech | Across Teams | Verified

Public Benchmarks

Shared, comparable yardsticks for agentic coding -- the same tasks and grader run against every model, led by SWE-bench. The coarse capability filter for choosing what to trial, distinct from your own evals; useful but easy to over-trust, since benchmarks rot, leak, and do not reflect your codebase.

The Pattern

"Are the models really that far from independently handling compound coding tasks, or are the evaluation standards skewed?" -- Toloka, auditing SWE-bench (source)

A public benchmark is a shared, comparable yardstick for agentic coding: the same tasks, the same grader, run against every model and agent so results can be ranked against one another. The dominant one is SWE-bench -- the agent is handed a real GitHub issue and must produce a patch that makes the project's hidden unit tests pass, at which point the task counts as solved (SWE-bench). It is the headline number labs quote, and scores have climbed fast -- from roughly 38% on SWE-bench Lite in mid-2024 to open agents now resolving well over 70% of SWE-bench Verified (a different subset, so the two figures are not strictly comparable), some with a harness of only a few hundred lines (mini-SWE-agent).
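The pass/fail rule described above can be sketched in a few lines. This is a minimal illustration, not SWE-bench's actual harness: `TaskResult`, `is_resolved`, and `resolve_rate` are hypothetical names, and the sample task IDs and test names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of running one task's hidden tests against an agent's patch."""
    task_id: str
    # Test name -> True if that test passed once the patch was applied.
    test_outcomes: dict[str, bool]


def is_resolved(result: TaskResult) -> bool:
    """A task counts as solved only if every hidden test passes."""
    # An empty map means no tests ran (for example, the patch did not apply),
    # so it is treated as unresolved rather than as a vacuous pass.
    return bool(result.test_outcomes) and all(result.test_outcomes.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved: the headline number a leaderboard reports."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Made-up sample: one clean fix, one fix that breaks a test, one patch that never ran.
results = [
    TaskResult("task-a", {"test_fix": True, "test_regression": True}),
    TaskResult("task-b", {"test_fix": True, "test_regression": False}),
    TaskResult("task-c", {}),
]
print(f"resolve rate: {resolve_rate(results):.2f}")
```

Note that the rule is all-or-nothing per task: a patch that fixes the issue but trips one overly strict test scores the same as no patch at all.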

Benchmarks come in flavours -- end-to-end issue resolution (SWE-bench), test-writing (SWT-Bench), terminal tasks, context retrieval -- but all share the promise of an apples-to-apples comparison no single team could produce alone.

Why It Matters

Benchmarks are how the field calibrates: they turn "this agent feels good" into a number you can rank, and they drive rapid, visible progress. For a platform team they are the coarse filter for which model or agent is even worth trialing.

But treat the number with suspicion:

A benchmark is not your codebase. A high SWE-bench score says an agent can fix some open-source Python issues; it says little about your stack, your conventions, or your scale -- the same reason two models that rank as "statistical twins" can feel nothing alike in real work (see models).
Benchmarks rot and leak. Toloka's audit of SWE-bench found overly specific tests that reject correct fixes, vague problem descriptions, environment mismatches, and no issue newer than 2023 -- which leaves the task set stale and exposed to training-data contamination (Toloka).
What you optimize, you game. A public target invites overfitting -- the score climbs while real-world capability lags behind it.
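One way to read "statistical twins" (an interpretation, not something the source spells out) is that two scores sit inside each other's sampling noise. The sketch below uses a normal-approximation 95% interval on a resolve rate and a crude overlap check. The function names are hypothetical and the resolved counts are invented; 500 is the size of SWE-bench Verified.

```python
import math


def resolve_rate_interval(resolved: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a resolve rate (z=1.96 is ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = resolved / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Crude check: overlapping intervals mean the gap may be noise."""
    return a[0] <= b[1] and b[0] <= a[1]


# Invented counts on a 500-task benchmark: 72% versus 74%.
model_a = resolve_rate_interval(360, 500)
model_b = resolve_rate_interval(370, 500)
print(intervals_overlap(model_a, model_b))
```

With a few hundred tasks, a gap of a couple of points is usually within noise, which is why a leaderboard rank alone says little about how two models will differ in real work.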

So use public benchmarks to choose what to trial, and your own evals -- run on your codebase -- to decide what actually works.
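That two-stage decision can be sketched as a filter followed by a ranking. Everything here is illustrative: `Candidate`, `shortlist`, and `pick` are hypothetical names, the 0.6 threshold is arbitrary, and all scores are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    public_score: float                         # published benchmark resolve rate, 0.0 to 1.0
    in_house_pass_rate: Optional[float] = None  # None until trialed on your own evals


def shortlist(candidates: list[Candidate], min_public_score: float) -> list[Candidate]:
    """Stage 1: the public benchmark is only a coarse filter for what to trial."""
    return [c for c in candidates if c.public_score >= min_public_score]


def pick(trialed: list[Candidate]) -> Optional[Candidate]:
    """Stage 2: the final choice rests on in-house results, not the public score."""
    scored = [c for c in trialed if c.in_house_pass_rate is not None]
    return max(scored, key=lambda c: c.in_house_pass_rate, default=None)


# Made-up numbers: the public leader is not the in-house leader.
candidates = [
    Candidate("agent-a", public_score=0.74, in_house_pass_rate=0.41),
    Candidate("agent-b", public_score=0.71, in_house_pass_rate=0.58),
    Candidate("agent-c", public_score=0.39),
]
shortlisted = shortlist(candidates, min_public_score=0.6)
winner = pick(shortlisted)
print([c.name for c in shortlisted], winner.name if winner else None)
```

The public score never enters the final ranking; it only decides who gets a trial.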

Last reviewed: 2026-06-25