CapabilityAtlas | Product Manager Path

The PM's Guide to AI Features (Without an AI Team)

You're not managing features anymore. You're allocating risk under uncertainty. Here's the mental model that separates PMs who ship AI from PMs who pilot forever.

March 27, 2026 | 14 min read

The mental model shift

Most PM content about AI says: “learn to spec AI features, evaluate quality, and manage costs.” That’s true but useless — it describes the output without changing the thinking.

Here’s the shift that actually matters: you are no longer managing features. You are allocating risk under uncertainty.

Traditional features are deterministic. You spec them, engineering builds them, QA verifies them, they work or they don’t. AI features are probabilistic. They work 94% of the time. They fail in ways that look like success. The cost of running them scales unpredictably. And whether they’re “good enough” is a judgment call that no test suite can fully automate.

The PM who thrives in this world isn’t the one who learned to write better specs. It’s the one who learned to make ship/no-ship decisions when the system is right 94% of the time and wrong 6% of the time — and the 6% might cost you a customer, a lawsuit, or a news cycle.

That’s risk allocation. And it’s the skill that most PM-focused AI content completely ignores.

When NOT to use AI

The most valuable thing a PM can do is kill an AI feature before engineering starts building it. Most AI initiatives fail not because the technology doesn’t work, but because the use case was wrong from the start.

Don’t use AI when deterministic logic works. If the rules are known and finite, write rules. A discount calculator doesn’t need a language model. A shipping rate lookup doesn’t need embeddings. The temptation is to use AI because it’s novel. The discipline is to use it only when the problem is genuinely ambiguous — natural language understanding, unstructured data interpretation, creative generation, or judgment calls that resist hard-coding.

Don’t use AI when the cost of error exceeds the value of automation. If a wrong answer has legal, financial, or safety consequences that outweigh the cost of having a human do the work, the math doesn’t close. A $50/hour human who is right 99.5% of the time is cheaper than an AI that is right 95% of the time if each error costs $2,000 in remediation.

Don’t use AI when your data is dirty. AI systems inherit every problem in your data — and amplify them with confidence. If your product catalog has inconsistent naming, your knowledge base has contradictory articles, or your CRM has duplicate records, the AI will surface those problems to customers with the conviction of an expert. Clean data first. Automate second.

Don’t use AI when you can’t measure success. If you can’t define what “correct” means for this feature — with specific, testable criteria — you can’t evaluate it. And if you can’t evaluate it, you can’t improve it. “Makes the experience better” is not a success metric. “Correctly resolves the customer’s stated issue without escalation, verified by post-interaction survey and order status check” is.

The PM who can articulate why a specific use case is wrong for AI — and redirect the initiative toward one that isn’t — saves more engineering time than the PM who specs a perfect AI feature.

The real cost model

Token costs are the visible part. They’re often not the dominant cost.

The token math (what most people model): Your support agent processes 10,000 tickets per day, averaging 500 tokens in and 200 tokens out. At Claude Sonnet pricing (~$3/million input, ~$15/million output), daily cost is roughly $45. At Haiku pricing, it’s roughly $4. If Haiku handles 85% of tickets correctly and Sonnet handles 94%, is a 9-point accuracy gain worth roughly 10x the cost? That’s a PM decision.
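The arithmetic above is worth making explicit so you can swap in your own volumes. A minimal sketch, assuming the article's ticket volume and per-million-token prices (verify current pricing before relying on these numbers):

```python
# Daily token cost for the support agent, using the article's volumes.
TICKETS_PER_DAY = 10_000
TOKENS_IN, TOKENS_OUT = 500, 200  # average per ticket

def daily_token_cost(price_in_per_m: float, price_out_per_m: float) -> float:
    """Daily spend in dollars, given per-million-token prices."""
    input_tokens = TICKETS_PER_DAY * TOKENS_IN    # 5M tokens/day
    output_tokens = TICKETS_PER_DAY * TOKENS_OUT  # 2M tokens/day
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000

sonnet = daily_token_cost(3.00, 15.00)  # ~$45/day at the assumed Sonnet rates
haiku = daily_token_cost(0.25, 1.25)    # ~$4/day at the assumed Haiku rates
```

Parameterizing the price lets you rerun the comparison every time a provider changes rates, which happens often enough to matter.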

The costs most PMs miss:

Retry and fallback costs. When the primary model times out or rate-limits (which happens under load), your system retries or falls back to another model. Under normal conditions, retries add 5-10% to your token cost. During an API degradation, they can add 50%. Budget for it.

Eval infrastructure costs. Your eval suite runs against every prompt change. If you’re running 500 test cases through Sonnet for each PR, that’s ~$1-2 per eval run. If your team ships 20 PRs a week, that’s $80-160/month in eval costs alone. Small — but it’s a cost that scales with engineering velocity, not user volume.

Human review costs. If 8% of interactions require human escalation and your reviewers cost $35/hour, that’s a per-ticket cost of $2-4 for escalated interactions. At 10,000 tickets/day with 8% escalation, human review costs $1,600-3,200/day — 35-70x the token cost. This is usually the largest cost line and the one most PMs discover after launch.

Error remediation costs. When the AI gets it wrong — and it will — what does fixing it cost? A wrong refund amount means a manual correction, a customer call, and a credit. A wrong product recommendation means a return, restocking, and reshipping. Model the cost per error and multiply by the expected error rate. This is how you calculate whether 94% accuracy is good enough or whether you need 99%.

The PM calculation that matters: Total cost = tokens + retries + eval infra + (escalation rate x human review cost) + (error rate x cost per error). If that total is less than the current fully-loaded cost of the human process, you have a business case. If it’s not, you don’t — regardless of how impressive the demo was.
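The formula above can be written down as a single function. Every input value in this sketch is an illustrative assumption drawn from the examples in this section; replace them with your own telemetry before drawing conclusions:

```python
# Sketch of the total-cost formula: tokens + retries + eval infra
# + (escalation rate x human review cost) + (error rate x cost per error).
def total_daily_cost(
    token_cost: float,       # raw model spend, $/day
    retry_overhead: float,   # fraction added by retries/fallbacks (5-10% typical)
    eval_infra: float,       # eval spend amortized to $/day
    volume: int,             # tickets/day
    escalation_rate: float,  # fraction needing human review
    review_cost: float,      # $ per escalated ticket
    error_rate: float,       # fraction of AI-handled tickets that go wrong
    cost_per_error: float,   # $ remediation per error
) -> float:
    tokens = token_cost * (1 + retry_overhead)
    human_review = volume * escalation_rate * review_cost
    remediation = volume * error_rate * cost_per_error
    return tokens + eval_infra + human_review + remediation

cost = total_daily_cost(
    token_cost=45, retry_overhead=0.08, eval_infra=5,
    volume=10_000, escalation_rate=0.08, review_cost=3.0,
    error_rate=0.01, cost_per_error=20.0,
)
```

With these inputs, human review ($2,400/day) dwarfs the token line ($48.60/day), which is exactly the pattern the section warns about.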

Evaluation: making ship/no-ship decisions

Evaluation is the most frequently cited skill in AI job postings. PMs don’t need to build eval harnesses — but they need to make decisions based on eval data. This is harder than it sounds because the data is ambiguous.

The scenario: Your eval suite scores the support agent at 91% accuracy on your golden dataset. Engineering says ship it. Customer success says they’re nervous. Legal wants 99%. What do you do?

What you need to know to decide:

91% accuracy on what? Break it down by query type. If it’s 99% on order status (50% of volume), 95% on returns (30%), and 62% on billing disputes (20%), you don’t have a 91% accuracy system. You have a system that’s great at two things and terrible at one. Ship the first two. Gate the third behind human review.
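The breakdown above is a weighted-average problem. A small sketch using the example's figures — note that the volume-weighted blend need not exactly match the golden-set headline, since a golden dataset is rarely weighted by live traffic:

```python
# Volume-weighted accuracy across query types (figures from the example above).
categories = {
    "order_status":    {"volume": 0.50, "accuracy": 0.99},
    "returns":         {"volume": 0.30, "accuracy": 0.95},
    "billing_dispute": {"volume": 0.20, "accuracy": 0.62},
}

blended = sum(c["volume"] * c["accuracy"] for c in categories.values())
# A single ~90% headline hides a 62% category serving a fifth of your traffic.
```

The point of computing it yourself: the headline number is an average of averages, and the product decision lives in the spread, not the mean.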

What does “wrong” look like? Not all errors are equal. A response that’s slightly verbose but factually correct is different from one that cites the wrong return policy. Categorize errors by severity. If the 9% error rate is mostly formatting issues, ship. If it’s mostly factual errors on financial topics, don’t.

What’s the baseline? Your human agents aren’t 100% accurate either. If humans handle billing disputes at 78% accuracy and the AI handles them at 62%, the gap is real but smaller than it looks. If humans handle them at 97%, the AI isn’t ready. Always compare to the actual current baseline, not to perfection.

What’s the detection lag? If a customer gets a wrong answer, how long until someone notices? If it’s caught in the next interaction (hours), the blast radius is contained. If it takes a billing cycle (30 days), you have a silent failure accumulating undetected damage. The detection lag determines how aggressive your monitoring needs to be.

The PM’s eval job is not “score the system.” It’s “decide what to ship, what to gate, and what to kill — using eval data, error categorization, baseline comparison, and cost-of-error analysis.” That is a product decision, not an engineering decision.
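One way to make the ship/gate/kill decision explicit is to encode it as a rule over per-category accuracy, the human baseline, and error severity. The thresholds below are illustrative assumptions, not canonical values — the point is that writing them down forces the argument into the open:

```python
# A hypothetical ship/gate/kill rule. The 10% severe-error cap and the
# 5-point baseline margin are assumptions to debate with your team.
def decide(ai_accuracy: float, human_baseline: float,
           severe_error_share: float) -> str:
    """severe_error_share: fraction of this category's errors that are
    high-severity (wrong policy, wrong amount) rather than cosmetic."""
    if ai_accuracy >= human_baseline and severe_error_share < 0.10:
        return "ship"
    if ai_accuracy >= human_baseline - 0.05:
        return "gate"  # ship, but behind human review
    return "kill"      # not close enough to the baseline to be worth gating
```

Run it against the earlier scenario: order status (99% vs a 97% human baseline, mostly cosmetic errors) ships; billing disputes (62% vs 78%) get killed or redesigned, not gated.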

Failure modes → product decisions

Every AI failure mode maps to a product constraint. The PM who knows these mappings makes better decisions:

| Failure Mode | What Happens | Product Decision |
| --- | --- | --- |
| Context degradation | Quality drops in long sessions | Set max session length, auto-handoff to human after N exchanges |
| Sycophantic confirmation | AI agrees with bad data | Never let AI override system-of-record, build verification loops |
| Silent failure | Output looks right but isn't | Require verification for high-stakes actions, sample-audit at scale |
| Cascading failure | Error compounds through pipeline | Add checkpoints between steps, design rollback capability |
| Tool selection error | AI uses wrong tool | Scope agent capabilities narrowly, monitor tool usage patterns |
| Spec drift | AI gradually ignores instructions | Periodic re-anchoring, monitor output distribution shifts |

This table is your decision framework. When someone proposes an AI feature, check each row. If the cost of that failure mode exceeds the value of the automation for a given use case, either add a mitigation (which adds cost) or kill the feature.

Spec writing: the precision tax

Writing specs for AI is different from writing specs for human developers — but not for the reason most PM content suggests.

The difference isn’t “be more specific.” You should already be specific. The difference is that AI agents fail silently when specs are ambiguous, while human developers ask clarifying questions. A human engineer who sees a vague escalation policy will Slack you. An AI agent will guess, and it will guess differently each time.

What AI specs must include that human-developer specs don’t:

  • Explicit negative constraints. “Never promise a specific resolution timeline. Never access payment information directly. Never offer compensation beyond the approved schedule.” Human developers infer these from context. AI agents don’t.
  • Measurable thresholds for judgment calls. Not “escalate when the customer is frustrated.” Instead: “escalate when consecutive messages contain profanity, all-caps, or explicit requests for a supervisor.” If the spec requires judgment, the AI will exercise judgment — unpredictably.
  • Fallback behavior for every edge case. What happens when the customer’s question doesn’t match any known category? What happens when the database lookup returns no results? What happens when the context window is full? Human developers handle these with common sense. AI agents need explicit instructions.
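The "measurable thresholds" point can be made literal: instead of asking the model to judge frustration, spec a deterministic check. A sketch, where the trigger words, the two-message window, and the all-caps length cutoff are all illustrative assumptions for your team to tune:

```python
import re

# Deterministic escalation triggers, as a spec would define them.
PROFANITY = {"damn", "hell"}  # placeholder list; use a real lexicon in practice
SUPERVISOR_RE = re.compile(r"\b(supervisor|manager|human)\b", re.IGNORECASE)

def trips_trigger(message: str) -> bool:
    """True if a single message matches any escalation trigger."""
    words = re.findall(r"[a-zA-Z']+", message)
    all_caps = message.isupper() and len(message) > 10
    profane = any(w.lower() in PROFANITY for w in words)
    return all_caps or profane or bool(SUPERVISOR_RE.search(message))

def should_escalate(messages: list[str], window: int = 2) -> bool:
    """Escalate when the last `window` customer messages all trip a trigger."""
    recent = messages[-window:]
    return len(recent) == window and all(trips_trigger(m) for m in recent)
```

The value of a rule like this isn't that it's clever; it's that engineering, QA, and the model all see the same unambiguous definition of "frustrated."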

The precision tax is real: AI specs take 2-3x longer to write than equivalent human-developer specs. But the alternative is a system that works perfectly for the cases you specified and fails creatively for the cases you didn’t.

The PM’s actual toolkit

“You don’t need to code” is an oversimplification. You don’t need to write production code. But without technical fluency in four areas, you can’t challenge engineering’s recommendations and you can’t interpret failures:

  • API basics. You need to read API documentation, understand request/response structures, and follow a token cost calculation. You don’t need to write the integration code.
  • Eval output interpretation. You need to read an eval report — precision, recall, per-category breakdowns, confidence intervals — and translate it into a product decision. You don’t need to build the eval harness.
  • Error log reading. When a user reports a bad AI response, you need to trace it: what was the input, what was retrieved (RAG), what was the model output, where did it go wrong. You don’t need to debug the code, but you need to locate the failure.
  • Cost dashboard fluency. You need to read a cost monitoring dashboard and spot anomalies — a 3x spike in token usage, an escalation rate climbing from 8% to 15%, a model that’s suddenly slower. You don’t need to build the dashboard.
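The anomalies in the last bullet are simple enough to state as code. A sketch comparing today's metrics to a trailing baseline, with thresholds (3x tokens, a sharp escalation climb, a 2x slowdown) that echo the examples above and are assumptions to tune:

```python
# Flag the anomaly patterns a PM should spot on a cost dashboard.
def anomalies(today: dict, baseline: dict) -> list[str]:
    flags = []
    if today["tokens"] > 3 * baseline["tokens"]:
        flags.append("token usage spike")
    if today["escalation_rate"] > 1.5 * baseline["escalation_rate"]:
        flags.append("escalation rate climbing")
    if today["p95_latency_s"] > 2 * baseline["p95_latency_s"]:
        flags.append("model slowdown")
    return flags
```

You don't need to build this into the dashboard yourself; you need to know which three numbers to watch and what "too high" means for each.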

This isn’t coding. It’s literacy. And the gap between a PM who has it and one who doesn’t is the gap between “I trust engineering’s recommendation” and “I can independently evaluate whether this system is ready to ship.”

The 60-day milestone

In eight weeks, a PM can deliver an AI feature assessment for a real initiative at their company:

  • Viability analysis: Is this use case right for AI? If not, why? If so, what are the constraints?
  • Cost model: Total cost including tokens, retries, eval, human review, and error remediation, compared against the fully-loaded cost of the current human process
  • Evaluation criteria: Quality rubric with per-category thresholds, baseline comparison, and ship/gate/kill decision framework
  • Failure mode mapping: Which failure modes apply, what product constraints they impose, and what mitigations are required
  • Spec: A production-quality feature spec with negative constraints, measurable thresholds, and explicit edge-case handling

This isn’t a slide deck. It’s an operating document that tells engineering exactly what to build, tells leadership exactly what to expect, and tells you exactly when to ship and when to pull back. The PM who delivers this is the one who gets the AI roadmap.