PatternTechIndividual & TeamVerified

Harness Engineering

A coding agent is a model plus everything built around it. Most of the leverage in agentic development sits in the harness -- the prompts, hooks, sandboxes, feedback loops, and memory policies -- not in model selection.

The Pattern

"A decent model with a great harness beats a great model with a bad harness." -- Addy Osmani (source)

Developers spend enormous energy debating which model is best; the more actionable question is what surrounds it. Claude Code, Cursor, Codex, and Aider run on similar or identical models yet produce dramatically different results -- the difference is the harness. The formula is simple: coding agent = AI model(s) + harness, and if you are not the model provider, the harness is your leverage.

There are really two of them. The inner harness ships inside the coding agent itself -- the tool-use loop, context management, sub-agents, and built-in prompts that make Claude Code or Cursor behave the way they do. The outer harness is everything you wrap around that agent; since you rarely control the inner one, the outer harness is where your leverage lives.

That outer harness -- system prompts and AGENTS.md, skills, tools and MCP servers, sandboxes, hooks, observability -- is what Birgitta Böckeler frames as two halves (Thoughtworks): feed-forward context you push in up front -- skills, conventions, guardrails -- and feedback tooling that observes what the agent actually produced so it can self-correct before a human ever looks. That feedback half carries most of the leverage, and it comes in two kinds:

Sensors -- static checks that read the generated code without running it: linters (ESLint), type checkers (tsc, mypy), security and dependency scanners (Semgrep, Snyk), dead-code and formatting checks, architecture or module-boundary rules (Dependency Cruiser). Mostly deterministic -- green or red -- which gives guarantees the model's own judgment cannot.
Simulators -- ways to actually run the code so the agent can observe behaviour: the test suite, a sandbox, a browser driving the running app, mock services, even screenshots or video of the result. Sensors tell the agent the code looks right; simulators tell it whether it works.

Mitchell Hashimoto, who put the term on the map, describes the core move: when an agent makes a mistake, don't just correct it -- build the sensor or simulator it can call to catch that mistake next time (Mitchell Hashimoto). The good harnesses aren't downloaded; they accumulate from your own failure history.

Why It Matters

Wire enough sensors and simulators together and the harness, not the model, carries the behavior -- which is why the leading coding agents now resemble each other more than they resemble their underlying models. "2025 was agents; 2026 is agent harnesses," and Harness-as-a-Service SDKs (Claude Agent SDK, Codex SDK, OpenAI Agents SDK) move the baseline from building a harness to configuring one well (Aakash Gupta). It is the old control-loop idea: you stop writing the code and start designing the loop the agent runs inside -- pushed to the limit, humans steer and agents execute, the way OpenAI's frontier harness reportedly produced a million lines in five months with zero written by hand (OpenAI, via Ryan Lopopolo).

The catch is calibration: sensors and simulators only help if they encode your standards. Agents do not learn by osmosis -- "if you don't write it down, the agent makes the same mistakes on the hundredth run as the first" (George). The real work is making what "good" means for your system machine-checkable; the discipline is early enough that this still reads as practitioner craft rather than settled practice -- though it is starting to be mapped, with a recent survey framing code itself as the agent harness (the substrate agents reason, act, and verify through) and naming regression-free harness improvement among the field's open problems (Code as Agent Harness).

In practice

A few moves separate a working harness from a frustrating one:

Tool descriptions are prompts, not docs. The agent re-reads them every turn and acts on them immediately, so iterate like a system prompt: cross-reference related tools ("for remote execution, use run_remote_command"), describe a flag's consequence rather than its definition, and shape input schemas to how the model already behaves (George Fahmy).
Separate generation from verification, adversarially. Cloudflare's vulnerability-hunting harness runs in phases -- recon the codebase, fan ~50 agents out to hunt concurrently, then have independent agents try to disprove each finding before it counts (Cloudflare, via Eugene Yan). A finding that survives a skeptic is worth more than one a single agent asserts.
Give parallel agents their own sandbox. Running many agents at once only works if each gets an isolated environment, so they don't trample each other's working tree or ship half-finished changes -- the idea behind tools that hand every agent its own box (Crabbox / Peter Steinberger).
Keep it model-agnostic. Proprietary harnesses like Claude Code are "overly tuned for their proprietary models" -- if you want to swap models as the frontier moves, a model-agnostic harness keeps the leverage yours (Harrison Chase).
Aim for a self-improving loop. The real prize is a harness that gets better against each new model instead of being rebuilt for it -- "the team that can iterate fastest against the newest models will win" (DeepMind, Build the Loop, Not the Agent).

Sources

Last reviewed: 2026-06-25