Rethinking Performance
Once an agent can emit thousands of lines on command, output metrics measure the wrong thing. Individual performance has to be re-anchored on outcomes and judgment -- problems solved, quality shipped, decisions made -- not the volume of code produced.
The Pattern
"Lines of code" was always a poor proxy; with agents it is actively misleading -- you can ship enormous volume and create only review burden and comprehension debt. Nicole Forsgren, co-creator of the DORA and SPACE frameworks, puts it bluntly: PRs and diffs "are good signals and are terrible signals" -- useful as one view into throughput, dangerous as a sole metric, and systematically blind to senior engineers whose real work is unblocking, design, review, and mentoring rather than commits. The reframe is to anchor individual performance on outcomes and judgment: problems solved, quality shipped, good decisions made, and the leverage a person gets from agents, rather than the volume of code produced.
Why It Matters
Teams that keep measuring code output optimize the part that no longer matters. A Stanford study of roughly 100,000 developers across more than 600 companies found that AI raises measured productivity by about 15-20% on average, but also that gross output looks far higher because much of the new volume is rework: bug fixes to code the agent just wrote, churn that feels like progress while going nowhere. Counting commits or lines rewards exactly that churn. The fairer signals are downstream: did the work hold up, was it understood, did it move the outcome?
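To make the gap between gross and net output concrete, here is a minimal arithmetic sketch in Python. The function name `net_gain` and every figure in it are hypothetical, chosen only for illustration; none of them come from the Stanford study, which may compute its numbers quite differently.

```python
def net_gain(gross_gain: float, rework_before: float, rework_after: float) -> float:
    """Return the fractional gain in useful output after discounting rework.

    gross_gain:    rise in raw output (0.40 means 40% more code shipped).
    rework_before: share of output that was rework before adopting agents.
    rework_after:  share of output that is rework after adopting agents.
    All three inputs are illustrative assumptions, not measured values.
    """
    if not (0.0 <= rework_before < 1.0 and 0.0 <= rework_after <= 1.0):
        raise ValueError("rework shares must be fractions, with rework_before < 1")
    useful_before = 1.0 - rework_before                      # baseline output is 1.0
    useful_after = (1.0 + gross_gain) * (1.0 - rework_after)
    return useful_after / useful_before - 1.0


# Hypothetical team: raw output up 40%, but rework share grows from 10% to 25%.
gain = net_gain(gross_gain=0.40, rework_before=0.10, rework_after=0.25)
print(f"gross +40.0%, net {gain:+.1%}")  # prints "gross +40.0%, net +16.7%"
```

With these made-up inputs, a 40% jump in raw volume shrinks to a gain of under 17% in work that holds up. A commit or line count would report the first number, while the outcome only moved by the second.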
The honest caveat is that the obvious alternative -- asking people -- is also unreliable. In a METR randomized trial, experienced open-source developers believed AI sped them up by about 20%, while the measured result was a 19% slowdown on tasks in their own large codebases. Self-assessment and gut feel are weak instruments here, and so is any single number. The realistic answer is a small constellation of signals read in context, accepting that judgment and contribution are genuinely harder to count than throughput -- which is precisely why rewarding throughput in an age of cheap throughput is the wrong default. This is the measurement counterpart to hiring -- both move from counting code to judging contribution.
Sources
- Loop Engineering: The Breakthrough That Makes the Software Factory Real -- Jazz Tong
- Developer productivity with Nicole Forsgren (the creator of DORA) -- The Pragmatic Engineer
- Does AI Actually Boost Developer Productivity? (100k Devs Study) -- Yegor Denisov-Blanch, Stanford
- Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity -- METR
Last reviewed: 2026-06-25