Model Router
A single control point every model call passes through: one interface to many LLM providers, with routing across tiers, cost tracking, rate limiting, caching, key management, and logging centralized in one place. It is the model-access layer, distinct from the guardrails proxy that polices content and the cost discipline that sets budgets.
The Pattern
"Gateways are all you need." -- Karan Sampath, Anthropic (source)
A model router is a single control point that every LLM call in the organization passes through, instead of each app talking to each provider directly. It presents one interface to many backends, so callers can reach 100+ LLM APIs in a uniform format, and it centralizes the concerns you do not want scattered across services: routing across model tiers, rate limiting, key management, caching, and logging (LiteLLM).
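The "one interface, many backends" idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the API of LiteLLM or any other real gateway: `ModelGateway`, the `provider/model` naming convention, and the two fake providers are all hypothetical stand-ins for real vendor adapters.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A provider adapter is any callable taking (model, prompt) and returning text.
# Real adapters would wrap vendor SDKs; the ones below are hypothetical fakes.
Provider = Callable[[str, str], str]


@dataclass
class ModelGateway:
    """Single entry point for model calls (illustrative, not a real library)."""

    providers: Dict[str, Provider] = field(default_factory=dict)
    call_log: List[dict] = field(default_factory=list)

    def register(self, name: str, provider: Provider) -> None:
        self.providers[name] = provider

    def complete(self, model: str, prompt: str) -> str:
        # Assumed convention: models are addressed as "provider/model".
        provider_name, _, model_name = model.partition("/")
        if provider_name not in self.providers:
            raise KeyError(f"unknown provider: {provider_name}")
        started = time.perf_counter()
        text = self.providers[provider_name](model_name, prompt)
        # Logging happens once, here, for every caller.
        self.call_log.append({
            "model": model,
            "prompt_chars": len(prompt),
            "latency_s": time.perf_counter() - started,
        })
        return text


def fake_provider_a(model: str, prompt: str) -> str:
    return f"[a:{model}] {prompt.upper()}"


def fake_provider_b(model: str, prompt: str) -> str:
    return f"[b:{model}] {prompt[::-1]}"


gateway = ModelGateway()
gateway.register("a", fake_provider_a)
gateway.register("b", fake_provider_b)
```

Callers only ever see `gateway.complete(...)`, so swapping a backend means changing one `register` call rather than every service that uses a model.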
It is deliberately one layer of three. The model router decides where a call goes; the guardrails proxy decides whether it is allowed and safe; and cost management decides what it is worth. Keeping them separate keeps each legible. Because routing is centralized here, you can run cheap models on triage and summarization while reserving expensive ones for hard reasoning -- a policy the gateway enforces, rather than one each developer has to remember to follow.
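A tier-routing policy of this kind can be as small as a lookup table held by the gateway. The sketch below is illustrative only: the task labels, tier names, model names, and the choice to send unknown tasks to the cheap tier are all assumptions, not something the source prescribes.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RoutingPolicy:
    """Maps a task label to a tier, and a tier to a concrete model."""

    tier_for_task: Dict[str, str]
    model_for_tier: Dict[str, str]
    default_tier: str = "cheap"  # assumption: unknown tasks start on the cheap tier

    def route(self, task: str) -> str:
        tier = self.tier_for_task.get(task, self.default_tier)
        return self.model_for_tier[tier]


# Hypothetical policy: model names are placeholders, not real model identifiers.
POLICY = RoutingPolicy(
    tier_for_task={
        "triage": "cheap",
        "summarization": "cheap",
        "hard_reasoning": "expensive",
    },
    model_for_tier={
        "cheap": "small-model",
        "expensive": "large-model",
    },
)
```

Because applications pass a task label rather than a model name, changing which model backs a tier is a one-line edit to the policy, not a change in every caller.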
Why It Matters
Without a gateway, per-app provider integrations multiply, a model swap means touching every service, and there is no single place to apply caching or read what is happening. With one, the org gets a single root of trust for model access: change providers, enforce a routing policy, or instrument every call from one place. As agent fleets grow, that consolidation is what keeps model access from sprawling across the codebase.
Caching
Caching is the cheapest performance win the router can centralize. An agent runs in a loop and passes "all those prior tool calls back through every time," so you pay for the same context on every step -- and caching the repeated history "reduces both latency and cost significantly" (Lance Martin). Because agentic sessions are heavily input-skewed, that repeated context is most of the bill. Three layers stack:
- Prompt / prefix (KV) caching -- providers cache the unchanged prefix of a prompt (system prompt, loaded files) and bill it at a fraction of the price; the router enables it for every app instead of each one wiring it up.
- Exact-match caching -- an identical request returns a stored response; trivial to implement, but a real saving for repeated calls.
- Semantic caching -- return a stored answer when a new prompt is close enough in meaning; the similarity threshold trades answer fidelity for more hits.
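The exact-match and semantic layers can be stacked in a single lookup, as in the rough sketch below. Everything named here is hypothetical (`ResponseCache`, `cached_complete`, the 0.9 default threshold), and the bag-of-words `_embed` function is a toy stand-in for a real embedding model. Prefix (KV) caching is not shown because it happens on the provider side.

```python
import hashlib
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple


def _key(model: str, prompt: str) -> str:
    # Exact-match key: a hash of the model plus the full prompt.
    return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()


def _embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag of lowercase words.
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ResponseCache:
    """Exact-match lookup first, then a semantic fallback (illustrative only)."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self.exact: Dict[str, str] = {}
        self.semantic: List[Tuple[str, Counter, str]] = []  # (model, embedding, response)
        self.stats: Counter = Counter()

    def get(self, model: str, prompt: str) -> Optional[str]:
        hit = self.exact.get(_key(model, prompt))
        if hit is not None:
            self.stats["exact_hit"] += 1
            return hit
        query = _embed(prompt)
        best_score, best = 0.0, None
        for entry_model, embedding, response in self.semantic:
            if entry_model != model:
                continue
            score = _cosine(query, embedding)
            if score > best_score:
                best_score, best = score, response
        if best is not None and best_score >= self.threshold:
            self.stats["semantic_hit"] += 1
            return best
        self.stats["miss"] += 1
        return None

    def put(self, model: str, prompt: str, response: str) -> None:
        self.exact[_key(model, prompt)] = response
        self.semantic.append((model, _embed(prompt), response))

    def hit_rate(self) -> float:
        total = sum(self.stats.values())
        hits = self.stats["exact_hit"] + self.stats["semantic_hit"]
        return hits / total if total else 0.0


def cached_complete(
    cache: ResponseCache,
    call_model: Callable[[str, str], str],
    model: str,
    prompt: str,
) -> str:
    # Serve from cache when possible; otherwise call the model and store the result.
    cached = cache.get(model, prompt)
    if cached is not None:
        return cached
    response = call_model(model, prompt)
    cache.put(model, prompt, response)
    return response
```

Lowering `threshold` yields more semantic hits at the risk of returning an answer to a subtly different question, which is why a production system would likely tune it per use case.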
The point of doing it here is that cache policy, keys, and hit-rate monitoring then live in one place rather than scattered across services.
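To make the input-skew argument concrete, the following sketch estimates the input cost of an agent loop that resends its whole history on every step, with and without prefix caching. All numbers are hypothetical, and the model is deliberately simplified: it assumes the entire previous context is served from cache at a fixed fraction of the normal price and ignores cache-write surcharges and cache expiry, which real providers may apply.

```python
from typing import Optional


def loop_input_cost(
    steps: int,
    base_tokens: int,
    tokens_per_step: int,
    price_per_token: float,
    cached_price_ratio: Optional[float] = None,
) -> float:
    """Total input cost of a loop that resends its full history each step.

    cached_price_ratio is the assumed fraction of the normal price billed for
    tokens read from the prefix cache; None means no caching.
    """
    total = 0.0
    for step in range(steps):
        context = base_tokens + step * tokens_per_step  # tokens sent at this step
        if cached_price_ratio is None or step == 0:
            # Nothing is cached yet on the first step.
            total += context * price_per_token
        else:
            cached = base_tokens + (step - 1) * tokens_per_step  # prefix seen last step
            fresh = context - cached  # only the newest tool call and result
            total += (cached * cached_price_ratio + fresh) * price_per_token
    return total


# Hypothetical workload: 10 steps, 1,000-token system prompt, 500 new tokens per step.
uncached = loop_input_cost(10, 1000, 500, price_per_token=1.0)
cached = loop_input_cost(10, 1000, 500, price_per_token=1.0, cached_price_ratio=0.1)
```

Under these made-up parameters the uncached loop pays for 32,500 token-units of input and the cached loop for about 8,200; the gap widens as the loop gets longer, because the resent history grows with every step.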
Sources
Last reviewed: 2026-06-25