Tessl
Patterns
Practices for
PatternTechAcross TeamsVerified

Models

The raw capability the rest of the platform is built on. The pattern is not "pick the best model" but "match models to tasks" -- frontier models for hard reasoning, cheaper tiers for high-volume work, open-weight models where control, cost, or data-residency matter more than peak capability.

The Pattern

Models are the raw capability the rest of the platform is built on -- the gateway, runtime, and evals all exist to get more out of them. The pattern is not "pick the best model" but "match models to tasks": frontier models (Claude Opus and Fable, GPT, Gemini) for hard reasoning, cheaper and faster tiers (Haiku-class) for triage, summarization, and high-volume work, and open-weight models (Llama, Qwen, DeepSeek) where control, cost, or data-residency matter more than peak capability. The model router is how you route across them; this pattern is about which to choose and why.

The task-to-model fit is real and measurable. When researchers benchmarked seven frontier models across categories of autonomous research work, one model won overall even under a cost constraint -- yet on ML-engineering tasks an open model surpassed every frontier model. The same story shows up in code editing: on Aider's leaderboard, an open 32B Qwen-Coder slotted in between two proprietary tiers, ahead of a same-generation GPT model. "Best model" is the wrong frame; "best model for this task, at this price" is the right one.

Why It Matters

Model choice is the single biggest lever on both cost and quality, and it is a moving target: the length of task a frontier model can complete unattended has roughly doubled every seven months (METR), so today's right answer expires fast.

There is a real tension to navigate -- proprietary frontier models lead on capability but cede control and send data out, while open-weight models trade some peak capability for ownership, privacy, and predictable cost. That trade has narrowed fast: DeepSeek-R1 shipped MIT-licensed and roughly on par with a proprietary reasoning model. As one practitioner put it after two days pair-programming a cloud model against a local open one, "AI in the cloud is not aligned with you; it's aligned with the company that owns it" -- a blunt way of saying ownership is itself a feature you may be buying.

The discipline is to treat the model as a swappable component: benchmark for your tasks rather than trusting public leaderboards -- which routinely rank two models as "statistical twins" that feel nothing alike in real work -- route by difficulty, and avoid hard-coupling to one provider so you can move as the frontier moves.

Serving the model

Choosing a model is only half of it -- an open-weight model is just weights until something serves it. An inference engine is "the code system that takes a trained model and turns it into a live, callable API," and a general web server like FastAPI is not built for the job: AI workloads need streaming, batching, and speculative decoding (William Falcon). The de-facto standard is vLLM, whose PagedAttention delivers roughly 24x the throughput of naive HuggingFace serving -- the difference that makes self-hosting open models affordable (vLLM). So the open-vs-proprietary choice carries a hidden cost: a proprietary API makes serving someone else's problem, while an open model means you own the serving stack -- GPUs, batching, autoscaling, uptime -- in exchange for the control and data-residency you were after.

Last reviewed: 2026-06-25

PREVIEW