Models
The raw capability the rest of the platform is built on. The pattern is not "pick the best model" but "match models to tasks" -- frontier models for hard reasoning, cheaper tiers for high-volume work, open-weight models where control, cost, or data-residency matter more than peak capability.
The Pattern
Models are the raw capability the rest of the platform is built on -- the gateway, runtime, and evals all exist to get more out of them. The pattern is not "pick the best model" but "match models to tasks": frontier models (Claude Opus and Fable, GPT, Gemini) for hard reasoning, cheaper and faster tiers (Haiku-class) for triage, summarization, and high-volume work, and open-weight models (Llama, Qwen, DeepSeek) where control, cost, or data-residency matter more than peak capability. The model router is how you route across them; this pattern is about which to choose and why.
The task-to-model fit is real and measurable. When researchers benchmarked seven frontier models across categories of autonomous research work, one model won overall even under a cost constraint -- yet on ML-engineering tasks an open model surpassed every frontier model. The same story shows up in code editing: on Aider's leaderboard, an open 32B Qwen-Coder slotted in between two proprietary tiers, ahead of a same-generation GPT model. "Best model" is the wrong frame; "best model for this task, at this price" is the right one.
Why It Matters
Model choice is the single biggest lever on both cost and quality, and it is a moving target: the length of task a frontier model can complete unattended has roughly doubled every seven months (METR), so today's right answer expires fast.
There is a real tension to navigate -- proprietary frontier models lead on capability but cede control and send data out, while open-weight models trade some peak capability for ownership, privacy, and predictable cost. That trade has narrowed fast: DeepSeek-R1 shipped MIT-licensed and roughly on par with a proprietary reasoning model. As one practitioner put it after two days pair-programming a cloud model against a local open one, "AI in the cloud is not aligned with you; it's aligned with the company that owns it" -- a blunt way of saying ownership is itself a feature you may be buying.
The discipline is to treat the model as a swappable component: benchmark for your tasks rather than trusting public leaderboards -- which routinely rank two models as "statistical twins" that feel nothing alike in real work -- route by difficulty, and avoid hard-coupling to one provider so you can move as the frontier moves.
Serving the model
Choosing a model is only half of it -- an open-weight model is just weights until something serves it. An inference engine is "the code system that takes a trained model and turns it into a live, callable API," and a general web server like FastAPI is not built for the job: AI workloads need streaming, batching, and speculative decoding (William Falcon). The de-facto standard is vLLM, whose PagedAttention delivers roughly 24x the throughput of naive HuggingFace serving -- the difference that makes self-hosting open models affordable (vLLM). So the open-vs-proprietary choice carries a hidden cost: a proprietary API makes serving someone else's problem, while an open model means you own the serving stack -- GPUs, batching, autoscaling, uptime -- in exchange for the control and data-residency you were after.
Sources
- Measuring AI Ability to Complete Long Tasks -- METR
- Zhengyao Jiang -- 7 frontier models benchmarked on autoresearch tasks; an open model beat frontier on ML engineering (X, Jun 2026)
- Paul Gauthier (Aider) -- Qwen2.5 Coder 32B slots between proprietary tiers on the code-editing leaderboard (X, Nov 2024)
- Mitko Vasilev -- two days pair-programming a cloud model vs a local open model; "own your AI" (LinkedIn)
- DeepSeek-R1 release -- MIT-licensed, performance on par with OpenAI o1
- Loop Engineering: The Breakthrough That Makes the Software Factory Real -- Jazz Tong
- What is an inference engine? (turning a trained model into a callable API) -- William Falcon
- Serving LLM 24x Faster on the Cloud with vLLM (PagedAttention) -- Woosuk Kwon et al., UC Berkeley
Last reviewed: 2026-06-25