PatternTechAcross TeamsVerified

Models

The raw capability the rest of the platform is built on. The pattern is not "pick the best model" but "match models to tasks" -- frontier models for hard reasoning, cheaper tiers for high-volume work, open-weight models where control, cost, or data-residency matter more than peak capability.

The Pattern

Models are the raw capability the rest of the platform is built on -- the gateway, runtime, and evals all exist to get more out of them. The pattern is not "pick the best model" but "match models to tasks": frontier models (Claude Opus and Fable, GPT, Gemini) for hard reasoning, cheaper and faster tiers (Haiku-class) for triage, summarization, and high-volume work, and open-weight models (Llama, Qwen, DeepSeek) where control, cost, or data-residency matter more than peak capability. The model router is how you route across them; this pattern is about which to choose and why.

The task-to-model fit is real and measurable. When researchers benchmarked seven frontier models across categories of autonomous research work, one model won overall even under a cost constraint -- yet on ML-engineering tasks an open model surpassed every frontier model. The same story shows up in code editing: on Aider's leaderboard, an open 32B Qwen-Coder slotted in between two proprietary tiers, ahead of a same-generation GPT model. "Best model" is the wrong frame; "best model for this task, at this price" is the right one.

Why It Matters

Model choice is the single biggest lever on both cost and quality, and it is a moving target: the length of task a frontier model can complete unattended has roughly doubled every seven months (METR), so today's right answer expires fast.

There is a real tension to navigate -- proprietary frontier models lead on capability but cede control and send data out, while open-weight models trade some peak capability for ownership, privacy, and predictable cost. That trade has narrowed fast: DeepSeek-R1 shipped MIT-licensed and roughly on par with a proprietary reasoning model. As one practitioner put it after two days pair-programming a cloud model against a local open one, "AI in the cloud is not aligned with you; it's aligned with the company that owns it" -- a blunt way of saying ownership is itself a feature you may be buying.

The discipline is to treat the model as a swappable component: benchmark for your tasks rather than trusting public leaderboards -- which routinely rank two models as "statistical twins" that feel nothing alike in real work -- route by difficulty, and avoid hard-coupling to one provider so you can move as the frontier moves.

Serving the model

Choosing a model is only half of it -- an open-weight model is just weights until something serves it. An inference engine is "the code system that takes a trained model and turns it into a live, callable API," and a general web server like FastAPI is not built for the job: AI workloads need streaming, batching, and speculative decoding (William Falcon). The de-facto standard is vLLM, whose PagedAttention delivers roughly 24x the throughput of naive HuggingFace serving -- the difference that makes self-hosting open models affordable (vLLM). So the open-vs-proprietary choice carries a hidden cost: a proprietary API makes serving someone else's problem, while an open model means you own the serving stack -- GPUs, batching, autoscaling, uptime -- in exchange for the control and data-residency you were after.

Sources

Last reviewed: 2026-06-25