PatternTechIndividual & TeamVerified

Context Engineering

Managing what enters and stays in an agent's context window is now as important as the prompt. More tokens do not mean better results -- as the window fills, models suffer context rot. The skill is finding the smallest set of high-signal tokens that fully specifies the task.

The Pattern

"Context engineering is the delicate art and science of filling the context window with just the right information for the next step." -- Andrej Karpathy (source)

Context engineering is the successor to prompt engineering. Prompt engineering asks "what words do I write?"; context engineering asks the broader question "what configuration of the entire context -- system prompt, tools, examples, history, and retrieved data -- is most likely to produce the behavior I want?" For a single chat the distinction barely matters; for an agent operating over many turns it decides whether the run succeeds.

The discipline has two halves. Upstream, it finds and cleans the right context before it enters the window: retrieval, selection, pruning. In-window, it manages what is already there: ordering, compaction, isolation. The shared goal is the smallest set of high-signal tokens that fully specifies the task. Minimal does not mean short -- it means high-signal.

Why It Matters

The reason more tokens make agents worse has a name: context rot. As a window fills, a model's ability to recall and reason over any given token degrades -- attention is a finite budget, like human working memory (Anthropic). The codebase itself is part of that budget: SonarSource's 660-trial study found agents on cleaner, more navigable code used 7-8% fewer tokens and made 34% fewer file revisitations at the same success rate (SonarSource). With a typical 25:1 input-to-output token ratio in agentic sessions, per-run cost is driven almost entirely by how much context accumulates (Tokenomics) -- which makes context engineering an economics argument as much as a quality one.

Keeping the window small

The window is finite, so the craft is partly mechanical: pull context in only when needed, and push it back out before it rots. Dexter Horthy -- who coined "context engineering" in his 12-factor agents essay -- frames the discipline as "frequent intentional compaction": deliberately structuring what you feed the agent at each step rather than letting the window fill, a technique his team used to drive Claude Code through 300k-line codebases at review-passing quality (Dex Horthy, No Vibes Allowed).

Load only when needed (progressive disclosure):

Split one giant prompt or AGENTS.md into pieces loaded just-in-time -- the agent holds lightweight references (paths, queries) and pulls data in at runtime rather than up front (Anthropic).
MCP first earned a bad name for spending the window on every tool's definition; the fix is on-demand tool discovery -- once the tool set is large, definitions are searched and loaded as needed rather than all at once.
Skills are the cleanest version: each SKILL.md advertises itself in a few dozen tokens of frontmatter, with the full instructions loaded only when a task calls for it (Simon Willison).
The open gap: tools, skills, and MCP can all be loaded on demand, but none has a clean way to unload once pulled in.

Shed what is no longer needed:

Compaction summarizes a near-full window and restarts from the summary; context editing is the lighter touch, auto-clearing stale tool calls and results -- ~84% fewer tokens in one 100-turn evaluation (Anthropic).
Memory on disk -- the agent writes notes, plans, and research to files and reads them back when relevant, keeping the live window lean (Anthropic; Dex Horthy).
Subagents get their own clean window: they do the heavy exploration -- tens of thousands of tokens -- and hand back only a distilled 1-2k-token result, so the main agent stays focused (Anthropic).

One caveat on generating context: do not auto-write the context file itself. An ETH Zurich study found LLM-generated AGENTS.md files raised inference cost 20-23% for slightly worse results, because they duplicate documentation the agent already discovers -- "cost without adding signal." The nuance is not "don't write one" but "don't ship the raw generation": a short, hand-pruned file of only the non-discoverable specifics -- commands, constraints, tooling -- is what actually helps, and dropping a redundant architecture section gave "the same agent behavior at a lower token budget" (ETH Zurich).

These effects are invisible to the naked eye -- the auto-generated file looked helpful -- so you cannot tune context by intuition. You need evals to tell whether a context change actually helped or quietly regressed the agent; benchmarks like Context-Bench exist precisely to score agentic context engineering (Context-Bench).

Sources

Last reviewed: 2026-06-25