Automated QA

Using agents to do quality assurance at the pace agents produce code: generating and maintaining tests, running suites, exercising the app end-to-end, and surfacing regressions -- so "is it tested" keeps up with "is it written."

The Pattern

Where humans can no longer read every diff, they certainly can't hand-test every change, so QA itself has to be automated and agent-driven: agents generate and maintain tests, run suites, exercise the running application end-to-end, and surface regressions as fast as they produce code.

The discipline is more demanding than it sounds. SWT-Bench -- a benchmark from ETH Zurich and the SRI Lab (NeurIPS 2024) -- tasks agents with writing a test that reproduces a real GitHub issue: it must fail on the broken code and pass once the fix lands. The leaderboard makes the difficulty concrete: the best entry (LogicStar's L*Agent v1) reproduces 67.7% of issues on SWT-Bench Verified, with strong general agents like OpenHands on GPT-5 close behind at 66.3%. The benchmark authors make two points that matter for anyone leaning on agents to test their own work: reproducing tests "can aide test-driven development, avoid regression, and are a powerful tool to cross-validate proposed bug fixes," and, crucially, the difficulty of writing the test and the difficulty of fixing the bug are not correlated at the instance level. Generating a test that actually catches the bug is its own hard problem, not a free byproduct of generating the fix.
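The fail-then-pass requirement can be illustrated with a minimal Python sketch. Everything here is hypothetical and chosen for illustration: the `slugify` bug, the two candidate tests, and the `is_fail_to_pass` helper are assumptions, not SWT-Bench's actual harness.

```python
from typing import Callable

Impl = Callable[[str], str]
Test = Callable[[Impl], None]  # a test raises AssertionError when it fails


def buggy_slugify(title: str) -> str:
    # Hypothetical bug: leading and trailing spaces turn into hyphens.
    return title.lower().replace(" ", "-")


def fixed_slugify(title: str) -> str:
    # Hypothetical fix: split on whitespace, then join the words.
    return "-".join(title.lower().split())


def passes(test: Test, impl: Impl) -> bool:
    """Run one test against one implementation; True if it does not fail."""
    try:
        test(impl)
        return True
    except AssertionError:
        return False


def is_fail_to_pass(test: Test, buggy: Impl, fixed: Impl) -> bool:
    """A reproducing test must fail on the buggy code and pass on the fix."""
    return (not passes(test, buggy)) and passes(test, fixed)


def reproducing_test(slugify: Impl) -> None:
    # Exercises the reported behavior, so it is red before the fix.
    assert slugify("  Hello World ") == "hello-world"


def vacuous_test(slugify: Impl) -> None:
    # Green on both versions, so it says nothing about the bug.
    assert isinstance(slugify("Hello World"), str)


print(is_fail_to_pass(reproducing_test, buggy_slugify, fixed_slugify))  # True
print(is_fail_to_pass(vacuous_test, buggy_slugify, fixed_slugify))      # False
```

The second test is the instructive case: it is green and looks like coverage, yet it earns no credit because it never failed on the broken code.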

Why It Matters

Generation outran not just review but testing. The same agents that write code can write and run its tests, drive a browser to exercise their own work, and re-run regression suites on every change -- increasingly as a step inside a loop rather than a manual gate. Automated QA complements the other two quality controls rather than duplicating them: evals score the agent's behavior, automated review checks the diff, and QA asks the third question -- does the running software actually do what it should.
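One way to picture the "surface regressions on every change" step is to compare suite results before and after a change. The sketch below assumes results arrive as a mapping from test name to pass/fail; the `find_regressions` helper and the test names are hypothetical, not part of any particular test runner.

```python
from typing import Dict, List


def find_regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Return tests that passed before the change and fail, or are missing, after it."""
    return sorted(
        name
        for name, passed in before.items()
        # A test absent from `after` is treated as failing (an assumption of this sketch).
        if passed and not after.get(name, False)
    )


# Hypothetical suite results keyed by test name (True means the test passed).
before = {"test_login": True, "test_checkout": True, "test_search": False}
after = {
    "test_login": True,
    "test_checkout": False,
    "test_search": False,
    "test_export": True,
}

regressions = find_regressions(before, after)
print(regressions)  # ['test_checkout']
```

Counting a vanished test as a regression is a deliberate choice in this sketch: it keeps a loop from going green by deleting the test that was failing.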

The honest caveat: agents both write and check the tests, so QA only counts if the tests are real. An agent that writes a green test for broken code has done worse than nothing. SWT-Bench's grading reflects exactly this risk -- it credits an instance only when a generated test fails on the original code and doesn't spuriously fail on the fixed code, because a test that passes regardless of correctness proves nothing. With even specialized testing agents topping out around two-thirds of real issues reproduced, automated QA reduces the manual testing burden but does not remove the need to verify the tests themselves: each one should fail when the code is broken and pass only when it works.

Last reviewed: 2026-06-25
