The launch reviewer asks one question every time: "what does this cost to run per user?" If you do not have a number, the feature does not ship. If you have a wrong number, you ship and then explain to finance why the AWS bill tripled. The cost model is the second-most-important pre-launch artifact after the eval suite, and most PMs treat it as something engineering will figure out. They will not. It is your job.
This post is the cost-modeling format we teach in ShipSet Lesson 47 ("Token economics and unit cost") and Lesson 49 ("Build cost validation"). Seven variables, a workbook you can put in a spreadsheet tonight, three real cost models from shipped features, and the variable PMs forget that turns a "looks cheap" feature into a margin-killer.
Why most AI cost models are wrong
The default PM cost model looks like this:
"We pay $3 per million tokens. The average request is 1,000 tokens. So each request is $0.003. We expect 10,000 requests per day, so that is $30 per day, $900 per month. Cheap."
This is wrong in three ways at once, and each wrongness compounds. By the time you see the actual bill it is 4-10x your model. Here is what gets missed:
1. Input vs output token pricing. Most providers charge 2-5x more for output tokens than input tokens. If your feature reads 100 tokens of input and writes 900 tokens of output, the output dominates the bill. The PM model that uses one "blended" price misses this entirely.
2. System prompts are tokens too. Your system prompt is 800 tokens. Every single request pays for those 800 tokens. The "1,000 token average" the PM quoted is what the user typed plus the response; the actual request is 1,000 + 800 system prompt + maybe 600 tokens of injected context = 2,400 tokens. The model just lost 2.4x.
3. Caching is an asterisk, not a free lunch. Anthropic and OpenAI both offer prompt caching that drops input cost by 10x for the cached portion. PMs see "10x cheaper" and build models around the cached price. In practice your cache hit rate is 60-80%, not 100%, and only the system-prompt portion caches, not the user input. Real savings: 30-50%, not 90%.
The cost model that survives launch accounts for all three.
The 7-variable workbook
Put this in a spreadsheet. One model per AI feature. Every variable is a cell you can change to do sensitivity analysis.
| Variable | What it is | Where the number comes from |
|---|---|---|
input_tokens_per_request | Including system prompt + user input + injected context | Count tokens on 10 sample requests |
output_tokens_per_request | Average length of model response | Same 10 samples |
input_price_per_million | Provider's per-million-input-token rate | Provider pricing page |
output_price_per_million | Per-million-output-token rate | Same |
requests_per_user_per_month | Activity rate of your typical user | Production logs OR estimated DAU × sessions × requests/session |
cache_hit_rate | What fraction of input tokens hit the prompt cache | Provider analytics OR estimate at 0.5 if you don't know |
target_users | How many users at the price point you are modeling | Your launch plan |
The cost per user per month is then:
non_cached_input = input_tokens_per_request * (1 - cache_hit_rate)
cached_input = input_tokens_per_request * cache_hit_rate
cost_per_request = (
non_cached_input * input_price_per_million / 1_000_000 +
cached_input * (input_price_per_million * 0.1) / 1_000_000 +
output_tokens_per_request * output_price_per_million / 1_000_000
)
cost_per_user_per_month = cost_per_request * requests_per_user_per_month
total_cost_per_month = cost_per_user_per_month * target_users
That formula plugs into your business case. The number cost_per_user_per_month is what you compare to your pricing tier to compute margin. The number total_cost_per_month is what you compare to your runway to compute risk.
How to count tokens (without running a tokenizer)
You do not need a Python script. Two rules of thumb that get you close enough:
English text: ~4 characters per token. Take your average user message length in characters and divide by 4. A 200-character question is about 50 tokens.
JSON output: ~3 characters per token. JSON has more punctuation than prose. A 600-character JSON response is about 200 tokens.
For a production-quality estimate, use the official tokenizer (Anthropic's count_tokens API or OpenAI's tiktoken library). For a PM cost model that gets you to launch, the character-divided-by-4 estimate is within 15% of true. That is good enough.
Three real cost models
1Support ticket auto-router (classification, low cost)
The feature: classifies incoming support tickets into one of 12 team categories. Reads ticket body, returns a category and a confidence score.
| Variable | Value | Notes |
|---|---|---|
input_tokens_per_request | 600 | 400 system prompt (caches) + 200 user ticket |
output_tokens_per_request | 30 | Just the category + confidence number |
input_price_per_million | $1.00 | Claude Haiku 4.5 |
output_price_per_million | $5.00 | Same |
requests_per_user_per_month | 80 | Average ticket volume per support agent |
cache_hit_rate | 0.67 | 400 of 600 input tokens are cacheable system prompt |
target_users | 500 | Internal team |
Cost per request: $0.0007. Cost per user per month: $0.06. Total monthly: $30. Bill comes in: $42 because cache hit rate was 0.55 not 0.67 (system prompt missed cache during cold periods). Variance: 40%. Still trivial in absolute terms.
Takeaway: classification with a short output is the cheapest AI feature shape there is. Models for these can be casual; the bill will not surprise you.
2PRD critique assistant (generation, medium cost)
The feature: reads a draft PRD (2,500 tokens), returns 3-5 specific critiques (800 tokens).
| Variable | Value | Notes |
|---|---|---|
input_tokens_per_request | 3,200 | 600 system prompt + 100 instruction + 2,500 PRD |
output_tokens_per_request | 800 | The critiques |
input_price_per_million | $3.00 | Claude Sonnet 4.6 |
output_price_per_million | $15.00 | Same |
requests_per_user_per_month | 15 | One PRD a week, two critiques each |
cache_hit_rate | 0.22 | Only system prompt caches; the PRD is unique each time |
target_users | 5,000 | Premium tier subscribers |
Cost per request: $0.022. Cost per user per month: $0.33. Total monthly: $1,650. The "uncached" portion (the PRD itself, 2,500 tokens per request) dominates the bill. Caching helped, but not the way a naive model assumes.
Takeaway: when the input is unique per request, caching saves you 20% not 80%. Plan for the uncached number, treat caching as a margin buffer.
3AI coding agent (generation + multi-turn, high cost)
The feature: a Claude-powered coding agent that reads a repo, takes a natural-language task, makes code changes across multiple turns. Average task: 6 turns, 8,000 input + 1,200 output tokens per turn.
| Variable | Value | Notes |
|---|---|---|
input_tokens_per_request | 8,000 | Repo context (5,500) + conversation history (1,500) + user message (1,000) |
output_tokens_per_request | 1,200 | Code edits + reasoning |
input_price_per_million | $3.00 | Claude Sonnet 4.6 |
output_price_per_million | $15.00 | Same |
requests_per_user_per_month | 240 | 40 tasks × 6 turns average |
cache_hit_rate | 0.42 | Repo context caches across the task; conversation history caches partially |
target_users | 2,000 | Premium tier |
Cost per request: $0.041. Cost per user per month: $9.84. Total monthly: $19,680. This is where AI feature unit economics start mattering: if you price at $19/month you have $9.16 of gross margin per user before infrastructure, support, and refunds. Tight but viable. If you price at $14.99/month you have $5.15 of margin and you cannot afford a viral spike.
Takeaway: multi-turn generation is where cost models become business models. The PM who shipped this without validating the cache hit rate against production traffic for two weeks before launch lost their job.
The variable everyone forgets: variance
The model above gives you a single number. Production gives you a distribution. The PM that ships with a single-number model gets blindsided when:
- A power user sends 8x the average requests (it happens — long-tail is real).
- A jailbreak attempt inflates input tokens by including a 10,000-token "now ignore all above" preamble.
- A prompt-injection user tries to get the model to write a novel (10x output tokens).
- A model upgrade silently doubles input prices (it has happened).
You cannot eliminate variance. You can model it.
The cheap fix: add three rows to your spreadsheet. p50_cost_per_user, p95_cost_per_user, p99_cost_per_user. Use 2x p50 for p95 and 4x p50 for p99 if you have no production data. Use real production data the moment you have it.
The expensive fix: per-user rate limits and hard cost caps. The PM that does not ship a hard limit on requests_per_user_per_day for an LLM feature is gambling. ShipSet's Captain has a 30-message-per-day cap not because users complain about it but because it bounds the maximum monthly cost per user to a known number.
The "what to do at p95" decision
Once you have p50 and p95 numbers, you have a decision:
| Scenario | What to do |
|---|---|
| p95 < 1.5x your gross margin | Ship without caps. You can absorb the variance. |
| p95 between 1.5x and 3x margin | Add a soft cap (warn user at 80% of monthly limit). |
| p95 between 3x and 5x margin | Hard cap with degraded experience past the cap (cheaper model, smaller context). |
| p95 > 5x margin | Re-price OR re-design the feature. Variance will eat you. |
Most AI PMs ship in the second or third row and discover they should have been in the third or fourth row when the bill arrives. Build the p95 number, take it seriously.
How to validate before launch
The cost model is theory until you validate it. Three steps to turn it into evidence:
1. Replay production traffic on the new feature, dry-run. Take 1,000 representative requests from a similar existing feature (or generate 1,000 synthetic ones). Run them through the new feature's prompt with a real model call. Measure actual cost. Compare to your model.
2. Run a closed beta with 30-50 real users for 2 weeks. Get the actual requests_per_user_per_month number. PM models almost always underestimate this by 2-3x because users do unexpected things.
3. Project from the beta to the launch population. Apply the beta usage rate to your target users number. This is your launch-day cost. Compare to your business case. If margin is tight, fix it now, not after launch.
What to put in the PRD
The cost model section of an AI PRD has four bullets:
Cost model
- Per-request: $X (assumption table linked)
- Per-user-per-month at p50: $Y / at p95: $Z
- Implied margin at the proposed price tier: A%
- Per-user rate limit: N requests/day (hard cap)
Four lines. Reviewers can interrogate any of them. The cost model itself lives in a linked spreadsheet so anyone can run sensitivity analysis on the assumptions.
The cheat code: ship the cost model alongside the eval suite
Reviewers ask two questions: "how do you know it works" and "what does it cost." If you arrive with both — the eval suite scoring artifact and the cost model with p95 numbers — the launch conversation gets short. The PM who built both is the PM who ships.
This is the difference between an AI PM at the offer stage and one who gets sent to do more interviews. The interviewer asks "tell me about the AI feature you shipped." The candidate with cost-model habit talks about the unit economics. The interviewer is sold.
TL;DR
- The PM cost model that uses a single blended price is wrong in 3 ways: input ≠ output pricing, system prompts are tokens, caching saves 30-50% not 90%.
- The 7-variable workbook:
input_tokens,output_tokens,input_price,output_price,requests_per_user,cache_hit_rate,target_users. - Count tokens by characters / 4 (English) or characters / 3 (JSON). 15% accuracy is fine for a launch model.
- Compute p50 cost, then p95 = 2x p50, p99 = 4x p50 if you have no data. Take p95 seriously.
- Cost model decision tree: at p95 > 3x margin, you need caps. At p95 > 5x, you re-price or re-design.
- Validate the model before launch with replay traffic + a 30-user closed beta. PM estimates of usage are almost always 2-3x low.
- The PRD section is 4 lines: per-request cost, per-user p50 / p95, margin %, hard rate limit. Spreadsheet linked.
In ShipSet Lesson 49 ("Build cost validation") you build the cost model for the feature you are shipping and validate it against replay traffic. By Day 90, the cost model is one of the 10 portfolio artifacts. It is the artifact engineering managers ask about most in interviews.
If you are reading this and have not built one: open a spreadsheet, paste the seven variables in column A, and fill in column B for the feature on your desk. The decisions get easier from there.