Pattern · Tech · Individual & Team · Across Teams · Verified

Evals & LLM-as-a-Judge

If context is the new code, evals are its tests: repeatable checks that your AGENTS.md, rules, skills, and prompts still produce the behaviour you want when you edit them or the model changes underneath you. They run on your own codebase, which is what makes them different from public benchmarks.

The Pattern

"Evals are a testing framework for probabilistic AI." -- Ben Stein, Teammates (source)

The context, rules, and skills you hand a coding agent now do much of the work code used to -- and like code, they are not trustworthy just because they looked right once. An eval is a repeatable check that your setup still produces the behaviour you want.

The question it answers is concrete: change a few lines in your AGENTS.md or your rules, or let the model get upgraded under you -- does the agent still do the right thing? You will not know by eye. An eval turns that question into something you can re-run. You have a few ways to check, from cheap to thorough:

Deterministic checks -- the mechanical parts: did the generated code compile and pass its tests, is a skill's frontmatter valid, does a spec link to its tests.
LLM-as-a-judge -- a second model reads the generated code and rates it against your criteria, for the fuzzy things code cannot assert.
Agent-as-a-judge -- give that judge tools so it can run the code and check behaviour, not just read it (Philipp Schmid).
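As a rough illustration of the first two kinds of check, here is a minimal Python sketch. Everything in it is an assumption made for the example: the function names, the required frontmatter keys (`name`, `description`), the judge prompt wording, and the `ask_model` callable, which stands in for whatever model client you use. Agent-as-a-judge is not shown, since it depends on the tools you give the judge.

```python
import ast
from typing import Callable, Iterable

def compiles(source: str) -> bool:
    """Deterministic check: does the generated Python source parse?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def frontmatter_valid(skill_text: str,
                      required: Iterable[str] = ("name", "description")) -> bool:
    """Deterministic check: the file opens with a '---' block that is closed
    by a second '---' and contains every required key (keys are assumed)."""
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return False
    block = []
    for line in lines[1:]:
        if line.strip() == "---":
            break
        block.append(line)
    else:
        return False  # never found the closing delimiter
    keys = {line.split(":", 1)[0].strip() for line in block if ":" in line}
    return all(key in keys for key in required)

# Hypothetical prompt; tune the wording to your own criteria.
JUDGE_PROMPT = (
    "You are reviewing generated code.\n"
    "Criterion: {criterion}\n"
    "Code:\n{code}\n"
    "Answer PASS or FAIL on the first line, then one sentence of reasoning."
)

def judge(code: str, criterion: str, ask_model: Callable[[str], str]) -> bool:
    """LLM-as-a-judge: ask a second model for a binary verdict.
    `ask_model` is a placeholder for your own client: prompt in, reply out."""
    reply = ask_model(JUDGE_PROMPT.format(criterion=criterion, code=code)).strip()
    if not reply:
        return False  # treat an empty reply as a failure, not a pass
    return reply.splitlines()[0].strip().upper() == "PASS"
```

Passing the model call in as a parameter keeps the judge testable with a stub, and a binary PASS/FAIL verdict is easier to compare against your own labels than a numeric score.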

Why It Matters

Two everyday jobs make evals worth the trouble:

Regression. Your context and skills are an input you keep editing, and the model underneath changes without asking. A saved eval suite tells you whether last week's AGENTS.md tweak, this morning's model upgrade, or a switch to a different coding agent still holds up -- and if you maintain context for several agents at once, it is the only sane way to keep them all honest.
Pruning. A before/after eval shows when the model already knows something unaided, so you can delete that guidance and keep the context window lean. What the eval proves redundant, you can cut.
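Both jobs come down to comparing two runs of the same suite. The sketch below assumes you already have per-task pass/fail results from each run; `EvalResult`, `regressions`, `safe_to_prune`, and the `tolerance` parameter are hypothetical names chosen for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass(frozen=True)
class EvalResult:
    task_id: str
    passed: bool

def pass_rate(results: Sequence[EvalResult]) -> float:
    """Fraction of tasks that passed in one run of the suite."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.passed for r in results) / len(results)

def regressions(before: Sequence[EvalResult],
                after: Sequence[EvalResult]) -> List[str]:
    """Regression: task ids that passed before the change and fail after it."""
    after_by_id = {r.task_id: r.passed for r in after}
    return sorted(
        r.task_id for r in before
        if r.passed and after_by_id.get(r.task_id) is False
    )

def safe_to_prune(with_guidance: Sequence[EvalResult],
                  without_guidance: Sequence[EvalResult],
                  tolerance: float = 0.0) -> bool:
    """Pruning: the guidance looks redundant if removing it does not lower
    the pass rate by more than `tolerance`."""
    return pass_rate(without_guidance) >= pass_rate(with_guidance) - tolerance

# Made-up results for illustration only.
before = [EvalResult("a", True), EvalResult("b", True), EvalResult("c", False)]
after = [EvalResult("a", True), EvalResult("b", False), EvalResult("c", False)]
print(regressions(before, after))    # ['b']
print(safe_to_prune(before, after))  # False
```

Because agent output is probabilistic, a single run per variant can mislead; repeating each task several times before comparing pass rates gives a steadier signal.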

This is also what separates evals from public benchmarks: a benchmark scores a model on someone else's tasks, while an eval runs on your codebase -- the only place your context and skills actually have to perform. Tessl's own framework, run over real library tasks, found that spec and context documentation lifted idiomatic API use by ~35% -- a number that only exists because they built the eval to measure it (Tessl).

The hard part is writing evals that catch what matters. The method is the one Hamel Husain and Shreya Shankar teach -- collect real traces, do error analysis to name the failure modes, then write a check for each (Lenny's Podcast) -- but aim it at the code your agent generates and the context that shaped it, not at generic product outputs (Hamel: writing the judge, evals FAQ). And keep the judge honest: align its verdicts to yours on real examples first, because a misaligned judge is worse than none.
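One way to check that alignment is to label a set of real examples yourself, run the judge on the same examples, and compare the verdicts. The sketch below is an assumption about how you might score that comparison; the function name and the returned keys are hypothetical. It reports passes and fails separately because overall agreement can look high when most examples pass while the judge still misses the failures.

```python
from typing import Dict, Optional, Sequence

def judge_alignment(human: Sequence[bool],
                    judge: Sequence[bool]) -> Dict[str, Optional[float]]:
    """Compare judge verdicts with human labels (True means 'pass').

    Returns overall agreement plus the share of human passes and human
    fails that the judge also got right."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    both_pass = sum(1 for h, j in zip(human, judge) if h and j)
    both_fail = sum(1 for h, j in zip(human, judge) if not h and not j)
    human_pass = sum(1 for h in human if h)
    human_fail = len(human) - human_pass
    return {
        "agreement": (both_pass + both_fail) / len(human),
        "true_pass_rate": both_pass / human_pass if human_pass else None,
        "true_fail_rate": both_fail / human_fail if human_fail else None,
    }

# Made-up labels for illustration only.
human_labels = [True, True, False, False]
judge_labels = [True, False, False, False]
print(judge_alignment(human_labels, judge_labels))
# {'agreement': 0.75, 'true_pass_rate': 0.5, 'true_fail_rate': 1.0}
```

If either rate is low, revise the judge prompt and re-check against the same labelled examples before trusting the judge in a suite.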

Sources

Last reviewed: 2026-06-25