9 min read · ShipSet team

How a PM Should Pick an AI Model in 2026: The 5-Variable Decision Matrix

Most PMs default to GPT-4 because they read about it. That decision quietly costs them 3-5x more than it should. Here is the actual 5-variable decision matrix for picking between Claude Fable, Sonnet, Haiku, GPT-5, Gemini, and Llama for your feature.

The PM picks the model. This surprises most candidates we talk to. They assume engineering picks it. In production AI features the model choice is a PM call because it has price, latency, and quality trade-offs that the PM owns. The PM who outsources it to engineering ships features that quietly cost 3-5x what they should.

This post is the model-selection framework we teach in ShipSet Lesson 50 ("Model routing and selection"). Five variables to evaluate, three concrete decision trees, and the trap most PMs fall into when they default to "we use GPT-4 because everyone does." No model worship. No vendor evangelism. The decisions a working AI PM actually makes in 2026.

Why the model decision is a PM call, not an engineering call

Engineering owns the integration. The PM owns the trade-off. The trade-off is between four things that move when you swap models:

  • Quality: pass rate on your eval suite
  • Cost: dollars per request
  • Latency: ms per request
  • Capability: what shapes of task the model can even do

Engineering can tell you "this model is faster" or "this is cheaper." Only the PM knows which trade-off is acceptable for the specific feature, and only the PM can defend that trade-off to leadership when the bill arrives.

If your engineer picked the model unilaterally, you cannot answer "why this model" in an interview or a senior review. That answer is the PM's job.

The five variables to evaluate

1. Task shape

Some tasks fit a tiny cheap model. Some need a frontier model. Most PMs assume frontier; most tasks do not need it.

Task type | Right model class | Why
Classification (10-20 categories) | Small/cheap (Haiku 4.5, GPT-5 nano, Gemini Flash) | High precision possible on small models. Cost matters at scale.
Extraction (pull fields from text) | Small/cheap | Same. Tight output format works well on small models.
Summarization | Small if quality bar is decent; mid-tier if executive-grade | Quality scales with model size here.
Creative generation (marketing copy, ideas) | Mid-tier (Sonnet 4.6, GPT-5 standard) | Small models are too repetitive. Frontier overkill.
Multi-step reasoning (math, code, agents) | Frontier (Fable 5, Mythos 5, GPT-5 reasoning) | Reasoning is where size matters most.
Tool use / agents | Frontier or specifically tuned (Fable 5, Claude Code) | Tool calling reliability degrades on smaller models.
Vision (read images, charts, docs) | Multimodal frontier (Fable 5, Gemini 2.5 Ultra) | Multimodal still wants size for accuracy.

The single biggest PM cost error in 2026: using a frontier model for classification. A Haiku 4.5 classifier matches Fable 5 on most classification eval suites and costs roughly 1/25th as much per request ($0.0007 vs $0.018 in the tables below).
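As a rough sketch, the table above can be kept as a lookup that tells you which model class to evaluate first. The task-type keys and the function name are hypothetical labels invented for this example; the class assignments are transcribed from the table, and the result is a starting point for the eval, not a final answer.

```python
# Model class to try first, per task type (transcribed from the table above).
# The dictionary keys are hypothetical labels chosen for this sketch.
MODEL_CLASS_BY_TASK = {
    "classification": "small",
    "extraction": "small",
    "summarization": "small",          # move to "mid" for executive-grade output
    "creative_generation": "mid",
    "multi_step_reasoning": "frontier",
    "tool_use": "frontier",            # or a model specifically tuned for tools
    "vision": "frontier",              # multimodal frontier
}


def starting_model_class(task_type):
    """Return the model class to evaluate first for a given task type."""
    if task_type not in MODEL_CLASS_BY_TASK:
        raise ValueError(f"unknown task type: {task_type!r}")
    return MODEL_CLASS_BY_TASK[task_type]


print(starting_model_class("classification"))
```

The point of encoding it is that "frontier" becomes something a task has to earn, rather than the default.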

2. Quality bar from the eval suite

You picked a candidate model. Now prove it.

Run your eval suite (50+ rows, 20/15/10/5 split) on two candidates. Score each row. Compare pass rates. The model selection becomes a number, not an opinion.

A useful concrete table from a real feature (support ticket auto-router):

Model | Happy path pass | Edge case pass | Adversarial pass | Cost/req
Haiku 4.5 | 94% | 78% | 65% | $0.0007
Sonnet 4.6 | 96% | 82% | 70% | $0.004
Fable 5 | 96% | 85% | 73% | $0.018

Reading this: Haiku is good enough for happy path. Sonnet is the right choice if edge-case handling matters. Fable buys 3 percentage points on edge and adversarial cases (and nothing on the happy path) for 4.5x the Sonnet cost. Unless those 3 points unlock a different product behavior, Sonnet wins.

The PM who walks into a review with this table closes the model conversation in 2 minutes. The PM without it argues vibes for an hour.
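A comparison table like the one above can be produced from raw eval results with a few lines of code. This is a minimal sketch, assuming each eval row has already been scored as pass or fail and tagged with a category; the function names, the category labels, and the tiny sample data are all made up for illustration.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (category, passed) pairs -> {category: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(bool(passed))
    return {category: passes[category] / totals[category] for category in totals}


def compare(results_by_model):
    """results_by_model: {model name: scored rows} -> {model name: pass rates}."""
    return {model: pass_rates(rows) for model, rows in results_by_model.items()}


# Illustrative scored rows only, not real eval data.
scored = {
    "cheap": [("happy", True), ("happy", True), ("edge", False), ("edge", True)],
    "mid": [("happy", True), ("happy", True), ("edge", True), ("edge", True)],
}
for model, rates in compare(scored).items():
    print(model, rates)
```

Scoring the rows is the hard part and stays manual or rubric-driven; this only turns the scores into the numbers you bring to the review.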

3. Latency budget

User-facing features have a latency budget. Backend / batch features do not.

Feature type | Budget | Implication
Chat response (streaming) | 500ms to first token, 30 tok/s | Need streaming + smaller model + Anthropic's prompt caching
Form auto-fill | < 800ms total | Small model + structured output
Background tagging | seconds, even minutes | Use any model. Optimize for cost.
Agentic workflow | 5-60s per turn | Use any model. Show progress UI.

Latency is where frontier-vs-fast gets non-obvious. Fable 5 is slower per request than Haiku 4.5 for the same task. In a UI that streams chat responses, users feel that difference: a 1.2-second first token versus a 400ms first token kills perceived responsiveness even when the content is identical.

Engineering will quote latency at p50. You should ask for p95 and p99. Power users hit those.
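To see why the median hides the problem, here is a small sketch that computes p50, p95, and p99 from a list of measured latencies using Python's standard `statistics.quantiles`. The function names and the sample numbers are illustrative assumptions, not measurements of any model.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 for a list of per-request latencies in ms."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def within_budget(samples_ms, budget_ms, percentile="p95"):
    """True if the chosen percentile fits inside the latency budget."""
    return latency_percentiles(samples_ms)[percentile] <= budget_ms


# Illustrative numbers only: 90 fast requests and 10 slow ones.
samples = [400] * 90 + [1200] * 10
print(latency_percentiles(samples))
print(within_budget(samples, 500, "p50"), within_budget(samples, 500, "p95"))
```

With this made-up sample the median sits comfortably inside a 500ms budget while p95 does not, which is exactly the gap a p50-only report conceals.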

4. Cost per request and at-scale

Calculate the per-user-per-month cost using the seven-variable workbook (see our AI cost modeling post). Compare across model candidates.

Trap: per-request costs are fractions of a cent and easy to dismiss. Multiply by monthly request volume and you find that one feature costs $80K/month and another costs $4K/month for the same eval pass rate. The PM who notices this gets promoted.

Worked sanity check for a 10K-user feature, 30 requests/user/month:

Model | Cost/req | Cost/user/month | Cost/month at 10K users
Haiku 4.5 | $0.0007 | $0.021 | $210
Sonnet 4.6 | $0.004 | $0.12 | $1,200
Fable 5 | $0.018 | $0.54 | $5,400

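The cost difference between Haiku and Fable is $5,190/month, or about $62,000 a year at 10K users, and it grows linearly with usage. If your eval shows Haiku is within 5% of Fable's pass rate on your task, that money is headcount: at a few times this volume it pays for a junior PM or a senior engineer. The model choice IS a hiring decision in disguise.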
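The arithmetic behind the sanity check is simple enough to script, which makes it easy to rerun when prices or usage change. This sketch uses the per-request costs from the table above; the function name and the dictionary layout are assumptions made for the example.

```python
# Per-request costs taken from the table above (USD).
COST_PER_REQUEST = {
    "Haiku 4.5": 0.0007,
    "Sonnet 4.6": 0.004,
    "Fable 5": 0.018,
}


def monthly_cost(cost_per_request, requests_per_user_per_month, users):
    """Return (cost per user per month, total cost per month) in USD."""
    per_user = cost_per_request * requests_per_user_per_month
    return per_user, per_user * users


for model, cost in COST_PER_REQUEST.items():
    per_user, total = monthly_cost(cost, requests_per_user_per_month=30, users=10_000)
    print(f"{model}: ${per_user:.3f}/user/month, ${total:,.0f}/month")
```

Swap in your own request volume and user count; the ratios between models stay the same, but the absolute gap is what leadership will ask about.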

5. Capability shape (multimodal, tool use, long context)

Some features need things only specific models do well in 2026:

  • Vision: only multimodal models. As of mid-2026: Claude Fable 5, Claude Sonnet 4.6, Gemini 2.5 Ultra, GPT-5 Vision. Quality varies; eval on your specific images, not on benchmarks.
  • Tool use / function calling: Fable 5 and GPT-5 are the most reliable for multi-tool, multi-turn calls. Haiku 4.5 can do basic tool calls but degrades with 5+ tools. Llama 3.3 70B works for simple cases.
  • Long context: Fable 5 (1M tokens), Gemini 2.5 Ultra (2M), GPT-5 (400K). For most PM features you don't need this; if you do, the model choice narrows to two or three.
  • Structured output: Anthropic models with the strict JSON schema setting are most reliable. GPT-5 with response_format works. OSS models still hallucinate JSON sometimes.

If your feature needs any of these capabilities, the selection narrows hard. If it doesn't, you have the full menu.
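One way to picture how hard the selection narrows is a simple capability filter. This is a sketch only: the two lookups below contain just the vision and context-window figures stated in the bullets above, and the `shortlist` function is a hypothetical helper. A model with no stated context window is dropped whenever a context requirement is set, which is a deliberately conservative assumption.

```python
# Context windows (tokens) and vision support, as listed in the bullets above.
LONG_CONTEXT_TOKENS = {
    "Fable 5": 1_000_000,
    "Gemini 2.5 Ultra": 2_000_000,
    "GPT-5": 400_000,
}
VISION_MODELS = {"Fable 5", "Sonnet 4.6", "Gemini 2.5 Ultra", "GPT-5 Vision"}


def shortlist(needs_vision=False, min_context_tokens=0):
    """Return the sorted model names that satisfy the hard requirements."""
    candidates = set(LONG_CONTEXT_TOKENS) | VISION_MODELS
    if needs_vision:
        candidates &= VISION_MODELS
    if min_context_tokens > 0:
        # Models with no stated context window are excluded here.
        candidates = {
            name for name in candidates
            if LONG_CONTEXT_TOKENS.get(name, 0) >= min_context_tokens
        }
    return sorted(candidates)


print(shortlist(needs_vision=True, min_context_tokens=500_000))
```

Hard requirements go first because they are not trade-offs: a model that cannot read your images is out regardless of price.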

Three decision trees for common PM features

Decision tree 1: Classification or extraction feature

Is your eval pass rate target above 90%?
├── No (75-90% acceptable) → Haiku 4.5 or GPT-5 nano
└── Yes (90%+) →
    Is the input typically under 1000 tokens?
    ├── Yes → Haiku 4.5 is likely fine. Eval to confirm.
    └── No (long documents) → Sonnet 4.6 or Gemini Flash with long context

Default to Haiku unless eval forces you up. Most classifiers do not need frontier.
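Decision tree 1 is small enough to write down as a function, which makes the thresholds explicit and reviewable. A minimal sketch: the function name and parameter names are hypothetical, and a target of exactly 90% is treated as the "No" branch because the question asks for "above 90%".

```python
def pick_classifier_model(target_pass_rate, typical_input_tokens):
    """Decision tree 1: suggest a starting model for classification or extraction.

    target_pass_rate is a fraction (0.92 means 92%).
    """
    if target_pass_rate <= 0.90:
        return "Haiku 4.5 or GPT-5 nano"
    if typical_input_tokens < 1000:
        return "Haiku 4.5 (eval to confirm)"
    return "Sonnet 4.6 or Gemini Flash with long context"


print(pick_classifier_model(0.95, 600))
```

The output is a candidate to put through the eval suite, not a verdict.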

Decision tree 2: User-facing chat / Q&A feature

Does the response need to be conversational (streaming, chat-style)?
├── Yes →
│   Does it need multi-step reasoning (math, code, planning)?
│   ├── Yes → Fable 5 or GPT-5 reasoning
│   └── No → Sonnet 4.6 (sweet spot: quality + speed + cost)
└── No (Q&A with single response) →
    Does it need grounded retrieval (RAG)?
    ├── Yes → Sonnet 4.6 + prompt caching for the retrieved context
    └── No → Haiku 4.5 if the answer is fact-lookup; Sonnet if it's open-ended

User-facing features in 2026 mostly land on Sonnet 4.6. Fable 5 is for the cases where the reasoning gap is product-defining.
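The same tree, written as a function so the branches can be checked one by one. This is an illustrative sketch; the function and flag names are invented for the example and the return values are the recommendations from the tree above.

```python
def pick_chat_model(conversational, needs_reasoning=False,
                    needs_rag=False, fact_lookup=False):
    """Decision tree 2: suggest a starting model for a user-facing chat or Q&A feature."""
    if conversational:
        if needs_reasoning:
            return "Fable 5 or GPT-5 reasoning"
        return "Sonnet 4.6"
    # Single-response Q&A from here on.
    if needs_rag:
        return "Sonnet 4.6 + prompt caching for the retrieved context"
    return "Haiku 4.5" if fact_lookup else "Sonnet 4.6"


print(pick_chat_model(conversational=True))
```

Three of the five leaves land on Sonnet 4.6, which is the tree's way of saying the mid tier is the default for user-facing work.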

Decision tree 3: Background or batch feature

Is throughput more important than latency?
├── Yes (batch processing, tagging at scale) →
│   Cost is the dominant variable. Run eval suite on Haiku 4.5 and Llama 3.3 70B.
│   Pick the cheaper one whose eval pass rate meets your target.
└── No (need fast turnaround but server-side) → Sonnet 4.6 or Gemini Flash

Background features should default to the cheapest model that meets the eval bar. They are where cost optimization compounds.

The trap: defaulting to GPT-4 or Fable 5 because you read about it

Most PMs in 2026 default to whatever model they read about most recently. In 2024 that was GPT-4. In 2026 it is Claude Fable 5 (because it launched in June and is everywhere). This is a tax on your roadmap.

The right discipline: every new feature gets a 1-day model evaluation before the prompt is committed. Run the eval suite on three candidates: one cheap (Haiku 4.5 or Flash), one mid-tier (Sonnet 4.6 or GPT-5 standard), one frontier (Fable 5 or GPT-5 reasoning). Build the comparison table from variable 2 above. Pick the cheapest that meets your eval bar.

This discipline saves real money. It also stops you from over-relying on any single vendor: if Anthropic deprecates a model, you already have a comparison table for the migration.
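"Pick the cheapest that meets your eval bar" is a one-line rule once the comparison table exists. The sketch below applies it to the edge-case pass rates and per-request costs from the table under variable 2; the `Candidate` class and function name are hypothetical, and which column you hold to the bar is your call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    pass_rate: float         # fraction, on the eval slice you care about
    cost_per_request: float  # USD


def cheapest_meeting_bar(candidates, eval_bar):
    """Return the lowest-cost candidate whose pass rate meets the bar, or None."""
    qualifying = [c for c in candidates if c.pass_rate >= eval_bar]
    if not qualifying:
        return None
    return min(qualifying, key=lambda c: c.cost_per_request)


# Edge-case pass rates and costs from the support ticket auto-router table.
CANDIDATES = [
    Candidate("Haiku 4.5", 0.78, 0.0007),
    Candidate("Sonnet 4.6", 0.82, 0.004),
    Candidate("Fable 5", 0.85, 0.018),
]

choice = cheapest_meeting_bar(CANDIDATES, eval_bar=0.80)
print(choice.name if choice else "no candidate meets the bar")
```

A `None` result is useful too: it tells you the bar is unreachable with the current prompt and candidates, before anyone commits to a launch date.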

What to put in the PRD

The model section of an AI PRD is four lines:

Model selection

  • Primary: [model name] because [1-sentence justification]
  • Fallback: [different model] (used at high-load or deprecation)
  • Eval pass rate vs alternatives: [link to comparison table]
  • Decision criteria: cost was [primary/secondary/blocking]; latency was [primary/secondary/blocking]

Four lines. Reviewers can interrogate the comparison. The PRD does not pretend the decision was obvious.

What changes between now and end of 2026

Three model trends to watch:

1. Cheap mid-tier is closing the gap with frontier. Sonnet 4.6 in mid-2026 matches GPT-4 from 2024 on most evals at 1/10th the cost. The "you need frontier for X" claims keep getting falsified. Re-evaluate your model choice quarterly.

2. Prompt caching changes the math. Anthropic and OpenAI both ship aggressive prompt caching now. If your system prompt is 1000+ tokens (most production features), you save 30-50% on input cost by structuring requests to maximize cache hits. The PM who specs this in the PRD saves real money.

3. Self-hosted is becoming viable for specific shapes. Llama 3.3 70B and Qwen 3 70B are real options in 2026 for high-volume classification and extraction. If you have an in-house ML team and a feature with 100K+ requests/day, build the cost comparison. For most PMs, sticking with API models is still the right call. But the option exists now in a way it didn't in 2024.
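The prompt-caching math in trend 2 can be sanity-checked with a short calculation. This is a sketch under stated assumptions: the function is hypothetical, `cached_price_ratio` (the price of a cached input token relative to a normal one) and `cache_hit_rate` are inputs you must take from your vendor's pricing page and your own traffic, and any surcharge for writing to the cache is ignored.

```python
def input_cost_savings(cached_prefix_tokens, uncached_tokens,
                       cache_hit_rate, cached_price_ratio):
    """Fraction of input cost saved by caching the shared prompt prefix.

    Costs are expressed in units of the normal per-token input price.
    Assumption: cache-write surcharges are ignored.
    """
    baseline = cached_prefix_tokens + uncached_tokens
    # On a hit the prefix is billed at the discounted ratio; on a miss, at full price.
    prefix_cost = cached_prefix_tokens * (
        cache_hit_rate * cached_price_ratio + (1 - cache_hit_rate)
    )
    return 1 - (uncached_tokens + prefix_cost) / baseline


# Illustrative inputs: 1000-token system prompt, 1000 tokens of per-request
# content, 90% hit rate, cached tokens billed at 10% of the normal price.
savings = input_cost_savings(1000, 1000, cache_hit_rate=0.9, cached_price_ratio=0.1)
print(f"{savings:.1%}")
```

With these illustrative inputs the saving comes out near 40%, inside the 30-50% range quoted above; the result moves mostly with how much of each request is shared prefix, which is why request structure belongs in the PRD.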

TL;DR

  • The model choice is a PM call. Five variables: task shape, eval pass rate, latency budget, cost at scale, capability requirements.
  • The biggest PM cost error in 2026: using frontier models for classification. Use Haiku 4.5 or similar for classifier and extractor features.
  • Default to Sonnet 4.6 for most user-facing features in 2026. Reach for Fable 5 only when the reasoning gap is product-defining.
  • Always run a 3-candidate eval comparison (cheap / mid / frontier) before committing the model choice. The discipline pays for itself.
  • Trap: defaulting to whatever model you read about most. Re-evaluate every quarter.
  • The PRD has a 4-line model section with the comparison table linked.

In ShipSet Lesson 50 ("Model routing and selection") you build a 3-candidate comparison for the feature you are shipping. By Day 90 the comparison table is one of the 10 portfolio artifacts. Hiring managers cite the comparison table specifically in offer-stage interviews.

If you have a feature in production right now and have not re-evaluated the model in 6 months: do it tomorrow. There is a meaningful chance you can drop one tier and save 3-5x without quality loss. That is the highest-leverage PM hour you'll spend this quarter.

