Pattern | Process | Organizational | AI draft

Data Governance

Controlling what data agents can see, send, and train on across the organization -- which code, secrets, and customer data flow to which models and tools, and where that data ends up. The org-level policy layer above the platform's technical controls.

The Pattern

Data governance for AI is the organization-level discipline of controlling what data agents can see, send, and learn from: which repositories, secrets, and customer data are allowed to reach which models and tools, where prompts and outputs are retained, and what may or may not leave the organization's boundary. It sits above the platform's technical controls: it is the policy that the guardrails proxy and identity layers enforce.

The exposure is rarely dramatic; it hides in ordinary, useful work. Daniel Whitenack of Prediction Guard, describing real enterprise deployments at the AI Engineer World's Fair, walks through the most common path: a retrieval-augmented system pulls a support ticket into a prompt to ground an answer, that ticket carries an employee's email and address, and "all of a sudden you've just doxed your employee." The leak comes from the system working as designed, not from an attacker. The OWASP GenAI Security Project codifies exactly this as LLM02: Sensitive Information Disclosure in the 2025 edition of its Top 10 for LLM Applications -- a community-built, vendor-neutral catalogue of where models leak PII, secrets, and proprietary context. Governance is the layer that decides, before any of this runs, which data classes are in bounds for which model and tool.
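The "which data classes are in bounds for which model and tool" decision can be pictured as a default-deny lookup table. The following is a minimal sketch, not a reference implementation: the data classes, destination names, and the `is_allowed` helper are all hypothetical, chosen only to illustrate the shape of such a policy.

```python
from enum import Enum


class DataClass(Enum):
    """Hypothetical data classification labels (assumed for illustration)."""
    PUBLIC = "public"
    INTERNAL_CODE = "internal_code"
    CUSTOMER_PII = "customer_pii"
    SECRET = "secret"


# Hypothetical org policy: destination -> data classes it may receive.
# Destination names are invented for this sketch.
POLICY: dict[str, frozenset[DataClass]] = {
    "third_party_model": frozenset({DataClass.PUBLIC}),
    "self_hosted_model": frozenset(
        {DataClass.PUBLIC, DataClass.INTERNAL_CODE, DataClass.CUSTOMER_PII}
    ),
    "web_search_tool": frozenset({DataClass.PUBLIC}),
}


def is_allowed(destination: str, classes: set[DataClass]) -> bool:
    """Return True only if every data class in the payload is permitted
    for the destination. Unknown destinations are denied outright."""
    allowed = POLICY.get(destination)
    if allowed is None:
        return False  # default deny: unlisted models and tools get nothing
    return classes <= allowed


if __name__ == "__main__":
    print(is_allowed("third_party_model", {DataClass.PUBLIC}))
    print(is_allowed("third_party_model", {DataClass.CUSTOMER_PII}))
```

Note that no destination in this sketch lists `SECRET`, so a payload carrying secrets is rejected everywhere. The point of the default-deny shape is that a new model or tool receives no data until someone makes an explicit policy decision about it.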

Why It Matters

Agents are data-movement machines: every prompt potentially ships context to a third-party model, and every tool call can reach sensitive systems. Three failure modes recur. First, sensitive context leaks out of a model -- the RAG-doxing case above, where ungoverned retrieval surfaces PII into a completion. Second, data leaks into training sets and can come back out: researchers showed that prompting a production model to "repeat the word X forever" could make it diverge and regurgitate memorized training data, including verbatim secrets and PII (Nasr, Carlini et al.), which is why what an org allows to be trained on is itself a governance decision. Third, agents read more than people realize -- coding assistants have been observed loading local .env files automatically, pulling API keys and database credentials into context without an explicit prompt. OWASP captures the agentic version as LLM06: Excessive Agency: the more autonomy and reach a tool has, the more data a single bad instruction or hallucination can move.
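Two of these failure modes lend themselves to a small illustration: scrubbing retrieved text before it enters a prompt, and refusing to load credential files into agent context. The sketch below is deliberately simplistic and rests on assumptions. The email regex catches only common address shapes, the blocked-filename patterns are an invented example list, and neither helper is a substitute for a real PII detector or secret scanner.

```python
import re
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Illustrative pattern only: matches common email shapes, not every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical deny-list of filenames an agent should never read into context.
BLOCKED_FILENAMES = (".env", ".env.*", "*.pem", "id_rsa")


def redact_emails(text: str) -> str:
    """Replace email addresses in retrieved text before it reaches a prompt."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def is_blocked_path(path: str) -> bool:
    """Return True if the file's name matches any blocked pattern."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in BLOCKED_FILENAMES)


if __name__ == "__main__":
    ticket = "Customer wrote to jane.doe@example.com about a refund."
    print(redact_emails(ticket))
    for candidate in ("project/.env", "project/.env.local", "project/src/app.py"):
        print(candidate, "blocked" if is_blocked_path(candidate) else "allowed")
```

Both checks run before any model call, which is the property that matters here: the decision is made by policy code, not left to the model or the individual developer.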

The honest tension is that governance fights usability head-on. Whitenack's own caveat is that every safeguard you bolt around the model adds latency and friction: a PII filter, a factual-consistency check, and a prompt-injection classifier each cost time, and teams trade them off against the model call itself. Lock data down too hard and people route around the controls with personal accounts and shadow tools; leave it open and you ship secrets to a vendor's logs. A growing class of local data-loss controls, such as Stacklok's CodeGate and the LLM Secrets vault for Claude Code, tries to resolve this by keeping secrets on the developer's machine and exposing only variable names to the agent, so the safe path is also the easy one (vendor tools; treat their claims as directional). That last point is the durable principle: governance succeeds when the compliant default is the path of least resistance, not when it relies on every individual to make the right call under deadline. Tie the policy to enforceable controls in the guardrails proxy and agent identity, or it stays a document nobody reads.
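The "expose only variable names" idea can be sketched generically. This is not how CodeGate or the LLM Secrets vault are implemented; it is a hypothetical illustration in which the agent sees secret names, writes commands containing `{{NAME}}` placeholders (a placeholder syntax assumed here), and a local step substitutes the real values only at execution time.

```python
import re

# Assumed placeholder syntax for this sketch: {{UPPER_SNAKE_CASE}}
PLACEHOLDER = re.compile(r"\{\{([A-Z_][A-Z0-9_]*)\}\}")


def names_for_agent(secrets: dict[str, str]) -> list[str]:
    """Return only the secret names; values never enter the agent's context."""
    return sorted(secrets)


def resolve_locally(template: str, secrets: dict[str, str]) -> str:
    """Substitute placeholders with real values on the local machine,
    after the agent has produced the template."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in secrets:
            raise KeyError(f"unknown secret name: {name}")
        return secrets[name]

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    # Dummy values for demonstration; a real vault would load these locally.
    local_secrets = {"API_KEY": "dummy-key", "DB_PASSWORD": "dummy-pass"}
    print(names_for_agent(local_secrets))
    agent_output = "curl -H 'Authorization: Bearer {{API_KEY}}' https://api.example.com"
    print(resolve_locally(agent_output, local_secrets))
```

The design choice worth noting is that the developer does nothing extra: the agent still produces a working command, and the secret value simply never appears in a prompt or a vendor's logs.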

Sources

Last reviewed: 2026-06-25