The PRD template you used for the SaaS dashboard does not work for an AI feature. You opened a Notion doc, wrote "Problem / Solution / Success metrics" out of habit, then stared at the cursor and realised the framing breaks the moment your feature is non-deterministic.
This article is the AI PRD format we teach in ShipSet Lesson 22 ("Write your AI-native PRD"). Six sections, six examples, plus the one section most PMs skip that costs them the launch. By the end you will have a template you can paste into a doc tonight and start filling in for the feature you actually want to ship.
🎯 TL;DR. An AI PRD needs six sections a SaaS PRD does not: (1) the failure mode, (2) the eval set, (3) the human-in-the-loop boundary, (4) the cost-per-call ceiling, (5) the rollback rule, and (6) the prompt as spec. Skip any of them and engineering will push back, legal will block launch, or finance will surface a cost surprise post-launch.
Why your SaaS PRD template breaks
A traditional SaaS PRD assumes determinism. "When the user clicks Submit, the form validates and saves to the database." You can test that. You can write acceptance criteria. QA can sign off.
AI features are different. They are:
- Probabilistic. The same input can produce different outputs. "Summarise this ticket" may produce 5 different summaries across 5 runs.
- Cost-bearing per use. Every API call costs money. Compute the unit economics in the spec or surprise yourself at the end of month one.
- Failure-mode rich. Hallucinations, refusals, latency spikes, token limits hit mid-response, model deprecation, prompt injection. Each one is a separate failure category.
- Eval-dependent. Acceptance criteria become an eval set: a deterministic harness that runs the non-deterministic feature against fixed inputs and measures whether the output passes.
The SaaS PRD has no place for any of these. So when you try to write one, the doc fills with vague phrases like "produces accurate summaries" and ships to engineering, who write back "define accurate." You cannot define accurate without an eval set. You cannot eval without a failure mode list. The doc is broken from the first heading.
The six-section AI PRD
Here is the format we use. Each section answers a specific question engineering, design, finance, or legal will ask in the kickoff. If a section is empty, that question becomes a launch blocker later.
1. The failure mode
The question this answers: What happens when the model is wrong?
This is a section SaaS PMs often skip because their old features could not be wrong in this way. Code either ran or threw an error. An AI feature can do neither: it returns a confidently worded answer that is sometimes nonsense.
What to write:
- The three most likely failure types for this specific feature (hallucination, refusal, latency, format drift, prompt injection, etc.)
- For each, what the user experience is when it happens
- Whether the user can tell it is broken (most AI failures fail silently)
- The mitigation pattern
Example (AI ticket tagger):
Failure modes:
1. Hallucinated tag: model invents a tag not in our taxonomy.
   Mitigation: validate output against allow-list. Reject + fallback to "needs human review."
2. Low confidence on ambiguous tickets.
   Mitigation: confidence score below 0.6 routes to human review queue, not auto-tagged.
3. Prompt injection in customer message ("Ignore prior instructions").
   Mitigation: customer message wrapped in delimiter tokens. System prompt instructs model to never follow instructions from inside delimiters.
If you cannot write this section, you are not ready to spec the feature.
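The three mitigations in the ticket-tagger example can be sketched in a few lines. This is a minimal illustration, not a production guard: the taxonomy in `ALLOWED_TAGS`, the `<customer_message>` delimiter, and the function names are all assumptions made up for the example. Only the 0.6 threshold and the "needs human review" fallback come from the example above. Delimiters lower the risk of prompt injection but do not remove it, which is why the allow-list check still runs on every output.

```python
from dataclasses import dataclass

# Hypothetical taxonomy; in a real system this comes from your own tag list.
ALLOWED_TAGS = {"billing", "bug", "feature_request", "account"}
CONFIDENCE_FLOOR = 0.6          # threshold from the example PRD
FALLBACK = "needs human review"  # fallback label from the example PRD


@dataclass
class TagDecision:
    tag: str
    auto_applied: bool
    reason: str


def wrap_customer_message(message: str) -> str:
    """Wrap untrusted ticket text in delimiters the system prompt can refer to."""
    # Remove any delimiter the customer typed so they cannot close the block early.
    cleaned = message.replace("<customer_message>", "").replace("</customer_message>", "")
    return f"<customer_message>\n{cleaned}\n</customer_message>"


def decide_tag(model_tag: str, confidence: float) -> TagDecision:
    """Apply the allow-list check first, then the confidence floor."""
    tag = model_tag.strip().lower()
    if tag not in ALLOWED_TAGS:
        return TagDecision(FALLBACK, False, "tag not in taxonomy")
    if confidence < CONFIDENCE_FLOOR:
        return TagDecision(FALLBACK, False, "confidence below floor")
    return TagDecision(tag, True, "ok")
```

The useful property is that every failure path ends in the same visible state (the human review queue), so a failure is never silent.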
2. The eval set
The question this answers: How will we know it works?
Acceptance criteria for a deterministic feature ("the form submits successfully") become an eval set for an AI feature: a fixed set of test inputs with expected outputs (or expected behaviour), run automatically, scored automatically.
What to write:
- The minimum row count for launch (we recommend 20 for prototypes, 100+ for production)
- How rows were sourced (real customer messages? synthetic? PM-written?)
- The scoring rubric: pass/fail, or graded?
- The launch bar: e.g. "ship when >85% pass on real-customer subset"
- Who owns the eval set after launch (someone has to add rows when failures surface)
Example (AI summariser for sales calls):
Eval set: 50 sales call transcripts, hand-picked across deal sizes and verticals.
Scoring rubric per summary:
- Captures the customer's stated objection: pass/fail
- Captures the requested next step: pass/fail
- Length between 80 and 200 words: pass/fail
- No invented attendees or quotes: pass/fail (zero tolerance)
Launch bar: 90% pass rate on all four criteria, 100% on the no-invention check.
Owner post-launch: Sales Ops PM adds 5 rows per week based on rep feedback.
The eval set is not a "nice to have." It is the only honest way to ship and the only way you can answer "did this regress?" three months in.
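To make the sales-call rubric concrete, here is a rough sketch of how the four checks and the launch bar could be scored. It assumes one reading of the launch bar: a row passes only if all four criteria pass, at least 90% of rows must pass, and a single invented attendee or quote blocks launch. The `RowResult` fields are hypothetical, and the two "captures" flags are assumed to come from a human grader or a judge model, since they cannot be checked mechanically.

```python
from dataclasses import dataclass


@dataclass
class RowResult:
    objection_captured: bool   # graded by a human or judge model (assumption)
    next_step_captured: bool   # graded by a human or judge model (assumption)
    summary: str               # the model's summary for this transcript
    invented_content: bool     # True if any attendee or quote was made up


def word_count_ok(summary: str, low: int = 80, high: int = 200) -> bool:
    """Length check from the rubric: between 80 and 200 words."""
    return low <= len(summary.split()) <= high


def row_passes(r: RowResult) -> bool:
    """A row passes only if all four rubric criteria pass."""
    return (
        r.objection_captured
        and r.next_step_captured
        and word_count_ok(r.summary)
        and not r.invented_content
    )


def launch_ready(results: list[RowResult], bar: float = 0.90) -> bool:
    """Zero tolerance on invention, then the overall pass-rate bar."""
    if not results:
        return False
    if any(r.invented_content for r in results):
        return False
    pass_rate = sum(row_passes(r) for r in results) / len(results)
    return pass_rate >= bar
```

Running the same function nightly against the same rows is what turns "is it still good?" into a number you can compare week to week.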
3. The human-in-the-loop boundary
The question this answers: When does the AI act, and when does a human?
Most AI features in 2026 are not autonomous. They are human-in-the-loop. The PRD has to define the boundary explicitly, or engineering will guess (badly).
What to write:
- The exact action the AI takes autonomously
- The exact action that requires human approval
- The threshold that flips one to the other (confidence score, action category, $ value, etc.)
- The UI affordance for human approval
Example (AI auto-responder for support):
AI acts autonomously:
- Account password reset confirmations
- Order status lookups
- Shipping date inquiries
Human approval required:
- Refund requests (any amount)
- Account closures
- Complaints involving the words "lawsuit", "regulator", or "press"
- Any ticket where confidence < 0.7
UI: human queue lives in /admin/inbox. Tickets show AI's drafted response + "Approve and send" / "Edit and send" / "Discard" buttons.
This boundary will be the thing legal and ops want to see most. Write it early.
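The auto-responder boundary above can be written down as a single routing function, which is often the easiest way to check that the rules do not contradict each other. This is a sketch under assumptions: the intent labels and the function name are invented for illustration, and it assumes an upstream classifier supplies `intent` and `confidence`. The keyword list and the 0.7 threshold come from the example.

```python
import re

# Hypothetical intent labels standing in for the categories in the example.
AUTONOMOUS_INTENTS = {"password_reset_confirmation", "order_status", "shipping_date"}
APPROVAL_INTENTS = {"refund_request", "account_closure"}
ESCALATION_WORDS = ("lawsuit", "regulator", "press")
CONFIDENCE_FLOOR = 0.7


def needs_human(intent: str, confidence: float, ticket_text: str) -> bool:
    """Return True when the drafted reply must go to the human queue."""
    if confidence < CONFIDENCE_FLOOR:
        return True
    if intent in APPROVAL_INTENTS:
        return True
    text = ticket_text.lower()
    # Whole-word match, so "express" does not trigger the "press" rule.
    # Plurals such as "lawsuits" are NOT caught by this simple pattern.
    if any(re.search(rf"\b{re.escape(word)}\b", text) for word in ESCALATION_WORDS):
        return True
    # Default deny: anything not explicitly listed as autonomous goes to a human.
    return intent not in AUTONOMOUS_INTENTS
```

The last line is a design choice worth stating in the PRD: an intent nobody thought about lands in the human queue rather than being sent automatically.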
4. The cost-per-call ceiling
The question this answers: What are the unit economics, and at what scale does this break?
Every AI API call costs money. Without a cost ceiling, you ship a feature that works at 100 users and bankrupts you at 10,000. This is the section finance and the eng lead want.
What to write:
- The model you are using and its current pricing (per million input tokens, per million output tokens)
- Average input + output tokens per request (estimate or measure)
- Cost per request in dollars
- Expected monthly request volume at launch and at 12 months
- Monthly cost at both
- The fallback rule if cost exceeds budget (downgrade model, rate-limit, kill feature)
Example (AI search across user's documents):
Model: Claude Haiku 3.5 ($0.80/MTok input, $4/MTok output)
Avg tokens: 4,000 in (document context) + 400 out (answer)
Cost per query: $0.80 * 4000/1M + $4 * 400/1M = $0.0032 + $0.0016 = $0.0048
Expected volume:
  Launch (month 1): 10K queries → $48/month
  Month 12 (projected): 250K queries → $1,200/month
Budget: $2,000/month max. If we exceed it by month 8, switch to retrieval-only (no LLM call) for low-confidence queries.
Run this math in the spec. Run it again in eng kickoff. The number always surprises someone.
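The arithmetic in the document-search example is simple enough to keep as a small script next to the PRD, so anyone can rerun it when prices or token counts change. The function names below are illustrative, not from any library; the token counts, prices, and budget are the ones in the example.

```python
def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Dollar cost of one request, given prices per million tokens."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


def monthly_cost(requests_per_month: int, per_request_cost: float) -> float:
    """Dollar cost per month at a given request volume."""
    return requests_per_month * per_request_cost


def max_requests_within_budget(monthly_budget: float, per_request_cost: float) -> int:
    """Largest monthly volume that stays inside the budget."""
    return int(monthly_budget // per_request_cost)


if __name__ == "__main__":
    per_query = cost_per_request(4_000, 400, 0.80, 4.00)
    print(f"per query: ${per_query:.4f}")
    print(f"launch:    ${monthly_cost(10_000, per_query):,.2f}/month")
    print(f"month 12:  ${monthly_cost(250_000, per_query):,.2f}/month")
    print(f"budget cap: {max_requests_within_budget(2_000, per_query):,} queries/month")
```

With the example's numbers, the $2,000 budget is reached at roughly 416K queries a month, which is the volume at which the fallback rule has to kick in.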
5. The rollback rule
The question this answers: When do we kill the feature?
For deterministic SaaS features, "rollback" is "revert the deploy." AI features can degrade silently with no deploy at all: the provider updates or deprecates the model, real inputs drift away from what your prompt was tuned on, or someone exposes your prompt via injection. Rollback needs its own trigger conditions.
What to write:
- The metric that signals "this is broken now" (eval pass rate dropping below X, refund rate climbing above Y, support tickets containing the feature name spiking)
- The threshold value
- The action: rollback to prior model? Disable feature? Add human-in-the-loop step?
- Who owns the alert and the decision
Example (AI product recommendation widget):
Rollback triggers:
1. Click-through rate on recommended products drops >20% week-over-week.
2. Customer support tickets mentioning "wrong recommendation" exceed 0.5% of orders.
3. Eval set pass rate drops below 80% (eval runs nightly).
Trigger 1 or 2: PM-on-call disables the widget within 4 hours, falls back to manually-curated featured products.
Trigger 3: blocks the next deploy, prompts model re-eval.
Owner: PM on-call, alerted via PagerDuty integration in Datadog.
Without this section you will run a degraded feature for weeks before someone notices.
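The recommendation-widget triggers translate directly into a check that can run on a schedule. The sketch below is one possible shape, with a hypothetical `WidgetMetrics` record and action names invented for illustration; the thresholds (20% CTR drop, 0.5% of orders, 80% eval pass rate) are the ones in the example. Wiring the result to an alerting tool is left out.

```python
from dataclasses import dataclass


@dataclass
class WidgetMetrics:
    ctr_this_week: float     # click-through rate, e.g. 0.031
    ctr_last_week: float
    wrong_rec_tickets: int   # tickets mentioning "wrong recommendation"
    orders: int
    eval_pass_rate: float    # from the nightly eval run, 0.0 to 1.0


def rollback_actions(m: WidgetMetrics) -> list[str]:
    """Return the actions the PRD's rollback rule calls for, if any."""
    actions = []
    # Relative week-over-week drop; guard against dividing by zero.
    ctr_drop = (m.ctr_last_week - m.ctr_this_week) / m.ctr_last_week if m.ctr_last_week > 0 else 0.0
    ticket_rate = m.wrong_rec_tickets / m.orders if m.orders > 0 else 0.0

    if ctr_drop > 0.20 or ticket_rate > 0.005:   # trigger 1 or trigger 2
        actions.append("disable_widget")
    if m.eval_pass_rate < 0.80:                  # trigger 3
        actions.append("block_deploy")
    return actions
```

An empty list means no trigger fired. Anything else should page the owner named in the PRD.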
6. The prompt as spec
The question this answers: What does the model do, exactly?
In a SaaS PRD, the spec is "the form has fields A, B, C." In an AI PRD, the spec is the prompt itself. Versioned, in the doc, treated as a contract between PM and eng.
What to write:
- The full prompt (system + user template)
- The variables substituted at runtime
- The expected output format (JSON schema, free text with structure, etc.)
- The version number and the change log
Example (AI categoriser for support tickets):
Prompt v3 (current production):
System:
  You are a support ticket router for {company_name}. Read the ticket and
  choose exactly one category from this list: {category_list}. If you are
  not confident, output "needs_human_review". Output ONLY the category name
  on a single line. No explanation, no formatting.
User template:
  {customer_message}
Variables:
  company_name: pulled from workspace settings
  category_list: workspace-defined, JSON-stringified
  customer_message: raw ticket text (max 4000 chars)
Output: single category name from category_list, or "needs_human_review".
Change log:
  v1 (Mar 4): initial. 76% eval pass.
  v2 (Mar 18): added "if not confident" clause. 84% eval pass.
  v3 (Apr 2): clarified output format constraint. 91% eval pass.
When the prompt is the spec, prompt changes go through PR review like any other shipped change. This is the single biggest mindset shift from SaaS PM to AI PM.
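One way to treat the prompt as a reviewable artifact is to keep it in a versioned module with the variable substitution and the output check right beside it. The sketch below uses the v3 prompt from the example. The function names and the dictionary it returns are assumptions for illustration, not the request format of any particular model SDK.

```python
import json

PROMPT_VERSION = "v3"
SYSTEM_TEMPLATE = (
    "You are a support ticket router for {company_name}. Read the ticket and "
    "choose exactly one category from this list: {category_list}. If you are "
    'not confident, output "needs_human_review". Output ONLY the category name '
    "on a single line. No explanation, no formatting."
)
MAX_MESSAGE_CHARS = 4000
FALLBACK = "needs_human_review"


def build_prompt(company_name: str, categories: list[str], customer_message: str) -> dict:
    """Fill the template variables listed in the PRD."""
    system = SYSTEM_TEMPLATE.format(
        company_name=company_name,
        category_list=json.dumps(categories),   # "JSON-stringified" per the spec
    )
    return {
        "prompt_version": PROMPT_VERSION,                # logged with every call
        "system": system,
        "user": customer_message[:MAX_MESSAGE_CHARS],    # enforce the 4000-char cap
    }


def parse_output(raw: str, categories: list[str]) -> str:
    """Accept only an exact category name; anything else goes to a human."""
    candidate = raw.strip()
    return candidate if candidate in categories else FALLBACK
```

Because the template lives in code, a wording change shows up as a diff in review, and logging `prompt_version` with each call lets you tie an eval regression to a specific prompt change.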
Six real PRD examples to model from
Below are six AI features and how the six sections would look filled in for each. Read the one closest to what you are shipping.
Example 1: AI summary on a long-form document (Notion-style)
Feature: AI-generated summary at the top of every doc >800 words.
1. Failure mode: hallucinated facts (zero tolerance), summary longer than 3 bullets (auto-truncate), refusal on sensitive content (acceptable, show "summary unavailable").
2. Eval set: 100 docs across categories (PRDs, meeting notes, legal contracts). Pass = all 3 bullets reference content in the source.
3. Human-in-the-loop: none. Read-only feature.
4. Cost: $0.0008/summary. 50K summaries/month = $40/month at launch.
5. Rollback: hallucination rate >2% in spot checks → disable feature, surface "Summary temporarily unavailable" banner.
6. Prompt v1: "Summarise the following document in exactly 3 bullets, each under 20 words. Only include facts stated in the document. {doc_content}"
Example 2: AI customer-support reply draft
Feature: Draft an email reply to inbound customer questions, sent for human review.
1. Failure mode: drafted reply contradicts company policy (>0% rate is too much), tone mismatch (judged subjectively), refusal on complex tickets (acceptable, draft says "needs founder review").
2. Eval set: 200 historical tickets with the actual rep reply. Eval scores: factually consistent with policy, tone matches rep examples, includes a specific actionable step.
3. Human-in-the-loop: every draft sent to rep queue for approval. NO autonomous send in v1.
4. Cost: $0.012/draft. 5K drafts/month = $60/month at launch.
5. Rollback: rep rejection rate >40% → revert to manual queue.
6. Prompt v2: "You are a support rep for {company_name}. Reply to the customer message below using only information from our policy doc. Match the tone of these example replies: {examples}. Reply in 80-150 words. End with 'Best,\n{rep_name}'."
Example 3: AI feature recommendation in onboarding
Feature: After signup, recommend which 3 features the user should enable based on their role + answers in the onboarding quiz.
1. Failure mode: recommendation does not match stated role (zero tolerance for B2B), recommends a feature on a plan tier they did not subscribe to (hard reject).
2. Eval set: 50 synthetic users across roles. Pass = recommendation matches role + plan in 95% of cases.
3. Human-in-the-loop: none.
4. Cost: $0.002/recommendation. 20K signups/month = $40/month.
5. Rollback: activation rate of recommended features <30% → disable, fall back to PM-picked defaults.
6. Prompt v1: "User role: {role}. Plan: {plan}. Onboarding answers: {answers}. Pick the 3 features from {available_features} most likely to drive activation for this user. Output JSON: {features: [string, string, string], reasoning: string}."
Example 4: AI search across user's account data
Feature: Type a question in plain English, get an answer using user's own data (invoices, customers, products).
1. Failure mode: hallucinated data (zero tolerance for finance queries), wrong customer pulled in (must match exactly), exposed data from another tenant (security incident, immediate kill switch).
2. Eval set: 100 question-answer pairs across data types. Pass = answer references only the queried tenant's data and is factually consistent with the underlying records.
3. Human-in-the-loop: none, but every query logs the retrieved records so the user can audit.
4. Cost: $0.0048/query. 30K queries/month = $144/month.
5. Rollback: any tenant-leak incident → kill switch, post-mortem. Eval pass rate <80% blocks deploy.
6. Prompt v1: "Answer the user's question using ONLY the records below. If the records do not contain the answer, say 'I do not have that data.' Records: {records}. Question: {question}."
Example 5: AI-generated metadata for uploaded content
Feature: On image upload, auto-generate alt-text, suggested tags, and a short caption.
1. Failure mode: offensive or stereotyped descriptions (manual review queue), hallucinated content not visible in image, refusal on humans (acceptable, leave blank).
2. Eval set: 100 images across categories. Pass = alt-text is descriptive of what is visible, tags are from the allow-list, caption is <100 chars.
3. Human-in-the-loop: user can edit any generated field before save.
4. Cost: $0.006/upload (vision model). 100K uploads/month = $600/month.
5. Rollback: user edit rate >70% on auto-generated alt-text → review prompt, possibly switch model.
6. Prompt v1: "Look at this image. Generate three things: (1) alt-text, max 125 chars, describing what is visible. (2) 3-5 tags from this list: {tag_list}. (3) A caption under 100 chars. Output JSON with keys alt_text, tags, caption."
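The output constraints in this prompt (125-char alt-text, 3-5 allow-listed tags, caption under 100 chars) are mechanical, so they can be checked in code before anything reaches the user. A rough sketch follows, with a hypothetical `validate_metadata` helper. Whether the alt-text actually describes the image still needs the eval set.

```python
import json


def validate_metadata(raw: str, tag_allow_list: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the output is usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    problems = []

    alt_text = data.get("alt_text")
    if not isinstance(alt_text, str) or not alt_text or len(alt_text) > 125:
        problems.append("alt_text missing or over 125 chars")

    tags = data.get("tags")
    if not isinstance(tags, list) or not 3 <= len(tags) <= 5:
        problems.append("tags must be a list of 3-5 items")
    elif any(not isinstance(t, str) or t not in tag_allow_list for t in tags):
        problems.append("tags contain a value outside the allow-list")

    caption = data.get("caption")
    if not isinstance(caption, str) or not caption or len(caption) >= 100:
        problems.append("caption missing or not under 100 chars")

    return problems
```

A non-empty result is a reasonable signal to leave the fields blank for the user to fill in, which matches the "user can edit any generated field" boundary above.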
Example 6: AI agent that books meetings for the user
Feature: Agent reads the user's inbox, drafts replies to scheduling requests, suggests times based on their calendar.
1. Failure mode: books a time the user is not available (zero tolerance), replies to non-scheduling emails (use intent classifier first), confirms a meeting without explicit user approval (no autonomy in v1).
2. Eval set: 100 inbound scheduling emails + corresponding calendar states. Pass = suggested time is genuinely free, reply is on-topic.
3. Human-in-the-loop: agent drafts, user clicks "Send" in app. NO autonomous send.
4. Cost: $0.020/scheduling request (tool use + multiple turns). 2K/month = $40/month.
5. Rollback: time-conflict rate >5% → disable suggestion, agent only flags the email as "scheduling."
6. Prompt v1: "Classify this email's intent (scheduling vs other). If scheduling, propose 3 times when the user is free from {calendar_data} and draft a reply. Output JSON: {intent, proposed_times, draft_reply}."
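The eval check "suggested time is genuinely free" and the 5% rollback threshold both reduce to an interval-overlap test. A minimal sketch, assuming calendar events arrive as `(start, end)` datetime pairs in a single timezone; the helper names are invented for illustration.

```python
from datetime import datetime

Interval = tuple[datetime, datetime]  # (start, end), same timezone assumed


def is_free(slot: Interval, busy: list[Interval]) -> bool:
    """True if the proposed slot overlaps no existing event.

    Back-to-back meetings (one ends exactly when the next starts) count as free.
    """
    start, end = slot
    return all(end <= b_start or start >= b_end for b_start, b_end in busy)


def conflict_rate(proposals: list[tuple[Interval, list[Interval]]]) -> float:
    """Share of proposed slots that collide with the calendar they were built from."""
    if not proposals:
        return 0.0
    conflicts = sum(not is_free(slot, busy) for slot, busy in proposals)
    return conflicts / len(proposals)


def should_disable_suggestions(proposals: list[tuple[Interval, list[Interval]]]) -> bool:
    """Rollback rule from the example: time-conflict rate above 5%."""
    return conflict_rate(proposals) > 0.05
```

Running the same `is_free` check on every proposed time before the draft is shown to the user is also a cheap way to enforce the zero-tolerance failure mode at runtime, not only in the eval.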
The section everyone skips
Of the six sections, the one we see PMs skip most often is the cost ceiling. PMs think it is engineering's job. It is not. Engineering will build whatever the spec asks for. If the spec does not bound cost, the feature shipped will be unbounded, and the surprise bill will land on you.
The second most-skipped section is the rollback rule, because for SaaS features rollback was always "redeploy the prior code." For AI features that does not work: model drift, data drift, and prompt injection all mean the feature can degrade without any deploy. Without a rollback rule and an alert, the feature can run broken for weeks.
If you only add two sections beyond your old template, add those two.
What "PRD review" looks like for an AI feature
When the PRD lands in your team's review channel, here is the order of comments you should expect (and welcome):
- Engineering lead: "What is the eval pass rate threshold?" → answered in Section 2.
- Eng lead again: "What is the prompt?" → answered in Section 6.
- Design lead: "Where does the human-in-the-loop UI live?" → answered in Section 3.
- Finance / founder: "What does this cost at scale?" → answered in Section 4.
- Legal / trust: "What happens when it fails?" → answered in Section 1, and rollback in Section 5.
If all five questions can be answered cleanly from the PRD, you are ready for kickoff. If any answer is missing, send the PRD back to draft before scheduling the meeting. The cost of an under-specified AI PRD is not a delayed feature. It is a feature that ships broken, costs a fortune, and erodes trust with the eng team for the next two quarters.
Your next step
If you are about to draft an AI PRD this week, you can copy the six-section template into a Notion doc and start filling it in for the feature you have in mind. The format works for everything from a small AI helper to a full agent feature.
If you want the deeper version — how to write the eval set in Section 2 row-by-row, how to estimate the cost in Section 4 to within 5%, how to choose the rollback metric in Section 5 — that is what we teach in ShipSet. 90 daily lessons, 15 minutes a day. Lesson 22 is the full PRD walkthrough; Lesson 35 is the eval suite deep dive; Lesson 41 is the cost model.
First 5 main lessons are free, no card. Take the 2-minute diagnostic and we will tell you exactly which lesson to start on based on what you are shipping.