Technical Program Manager Path

The TPM Who Delivers AI Initiatives on Time

AI projects fail at scoping, not at engineering. Here's how to plan, de-risk, and know when to kill an initiative — from someone who's watched them stall.

March 27, 2026 | 14 min read

Why AI projects fail differently

AI projects have a unique failure signature. They don’t blow past deadlines because engineers are slow. They fail because nobody scoped them correctly in the first place.

Traditional software: you define requirements, estimate effort, build, test, ship. The system either works or it doesn’t, and “works” is deterministic. AI projects: the requirements are probabilistic (“it should be accurate most of the time”), the effort is unpredictable (prompt engineering can take two hours or two months), testing is continuous (you can’t pass a test suite and call it done), and “works” is a spectrum.

A traditional PM sees a Gantt chart. An AI initiative needs something closer to a clinical trial protocol — phased, measured, with clear go/no-go criteria at each gate. And critically: with defined kill criteria for when the trial should stop.

The five capabilities that matter

You don’t need to write prompts or understand transformer architecture. You need to understand AI systems well enough to plan, scope, and de-risk them.

1. Use case qualification — including kill criteria. Before a single sprint starts, the TPM needs to determine whether the proposed AI use case is viable. This means asking the questions that engineers and product managers often skip:

  • What is the specific input and output? (“AI-powered customer insights” is not a use case. “Given a customer’s last 90 days of support tickets, generate a churn risk score with three contributing factors” is a use case.)
  • What accuracy is required, and what’s the cost of being wrong? The accuracy requirement determines the entire architecture, evaluation strategy, and timeline.
  • Is there training or evaluation data available? If the data doesn’t exist, the project starts with a data collection phase that most timelines don’t account for.
  • Build vs. buy vs. API? Each has radically different cost, timeline, and maintenance implications.

And the question most TPMs don’t ask: when do we kill this?

Define kill criteria upfront and get stakeholder agreement before the project starts:

  • Accuracy plateau below threshold. If after 4 weeks of prompt iteration, accuracy is at 78% and the requirement is 92%, and the improvement curve has flattened — kill it. The remaining gap requires a fundamentally different approach (more data, different model, different architecture), not more prompt tweaking.
  • Cost exceeds ROI. If the total cost model (tokens + eval + human review + error remediation) exceeds the value of automation, kill it — regardless of how impressive the demo was.
  • Data unavailable. If the required training data can’t be sourced within 6 weeks, the project is a data engineering initiative masquerading as an AI initiative. Re-scope or kill.
  • Stakeholder disagreement on “good enough.” If engineering says 91% is ready to ship and legal says only 99% is acceptable and the gap can’t be closed in the budget — kill it, or re-scope to a lower-stakes version of the use case.

Killing projects early is the highest-leverage TPM behavior. Every month a doomed initiative continues is engineering time that could have built something viable.
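Kill criteria only work if they are written down as checkable thresholds rather than vibes. A minimal sketch, in Python, of what "get stakeholder agreement before the project starts" can look like as an artifact — all threshold values and field names here are illustrative, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class KillCriteria:
    """Thresholds agreed with stakeholders before the project starts.
    Values are illustrative placeholders, not recommendations."""
    min_accuracy: float           # e.g. 0.92 required accuracy
    max_monthly_cost: float       # total cost ceiling in dollars
    max_data_sourcing_weeks: int  # budgeted data collection window

def should_kill(accuracy: float, accuracy_improving: bool,
                monthly_cost: float, data_weeks: int,
                c: KillCriteria) -> list[str]:
    """Return the list of kill criteria that have triggered."""
    reasons = []
    # Plateau test: below threshold AND the improvement curve has flattened
    if accuracy < c.min_accuracy and not accuracy_improving:
        reasons.append("accuracy plateau below threshold")
    if monthly_cost > c.max_monthly_cost:
        reasons.append("cost exceeds ROI ceiling")
    if data_weeks > c.max_data_sourcing_weeks:
        reasons.append("data unavailable within budgeted window")
    return reasons
```

The point of the artifact is not the code — it's that every field was signed off before sprint one, so invoking a kill criterion later is executing an agreement, not starting an argument.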

2. Cost estimation — the full picture. AI projects have a cost structure that traditional software doesn’t: inference costs that scale with usage, not just with engineering headcount.

What most TPMs model: API token costs at projected volume. Output tokens typically cost several times more per token than input tokens, so the input/output ratio matters. A chatbot handling 100,000 conversations/month at Sonnet-class quality costs roughly $5-15K/month in base API fees.
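The base-fee arithmetic is worth doing explicitly. A back-of-envelope sketch — the token counts and per-million-token prices below are assumptions for illustration, not published rates; plug in your provider's current pricing:

```python
# Rough monthly base API cost for 100,000 conversations/month.
# Multi-turn chat re-sends conversation history each turn, so input
# tokens dominate. All numbers below are illustrative assumptions.
conversations = 100_000
input_tokens_per_conv = 20_000   # history + system prompt, summed over turns
output_tokens_per_conv = 1_000
price_in_per_m = 3.00            # $ per million input tokens (assumed)
price_out_per_m = 15.00          # $ per million output tokens (assumed)

monthly_cost = conversations * (
    input_tokens_per_conv / 1e6 * price_in_per_m
    + output_tokens_per_conv / 1e6 * price_out_per_m
)
print(f"${monthly_cost:,.0f}/month")  # → $7,500/month at these assumptions
```

Note how sensitive the result is to conversation length: halving the re-sent history roughly halves the bill, which is why prompt caching and history truncation are cost levers, not just engineering niceties.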

What most TPMs miss:

  • Retry amplification. Under normal conditions, retries add 10-15% to token cost. During API degradation events (2-4 per year per provider), retry costs spike 50-100%. Budget 20% overhead as baseline.
  • Eval costs at scale. Running quality checks costs additional API calls. 500 test cases × 3 runs × a judge model = 1,500 extra LLM calls per eval cycle. If the team ships 20 PRs/week, eval alone costs $80-160/month.
  • Human review costs. This is usually the largest line item and the one discovered after launch. If 8% of interactions require escalation at $35/hour and 5 minutes per review, that’s $1,600-3,200/day at 10,000 daily interactions — 35-70x the token cost.
  • Error remediation. When the AI gets it wrong, what does fixing it cost? A wrong refund is a manual correction + customer call. A wrong medical code is a compliance incident. Model cost per error × expected error rate.

The TPM should be able to present: “At 10x our current volume, total cost including review and remediation is $X/month. The business case holds above Y accuracy and below Z escalation rate.”
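That presentation is easier to make credible with a cost model that itemizes every line above. A sketch, using the figures from this section where available (20% retry overhead, 8% escalation, $35/hour, 5 minutes per review) and clearly-assumed placeholders elsewhere:

```python
def total_monthly_cost(daily_interactions: int,
                       token_cost_per_interaction: float,
                       retry_overhead: float = 0.20,   # baseline from above
                       eval_cost: float = 150.0,       # $/month, mid-range from above
                       escalation_rate: float = 0.08,
                       review_minutes: float = 5.0,
                       reviewer_rate: float = 35.0,    # $/hour
                       error_rate: float = 0.02,       # assumed for illustration
                       remediation_cost: float = 4.0   # $/error, assumed
                       ) -> dict:
    """Itemized monthly cost: tokens + eval + human review + remediation."""
    days = 30
    tokens = daily_interactions * days * token_cost_per_interaction * (1 + retry_overhead)
    review = (daily_interactions * days * escalation_rate
              * (review_minutes / 60) * reviewer_rate)
    remediation = daily_interactions * days * error_rate * remediation_cost
    return {"tokens": tokens, "eval": eval_cost,
            "review": review, "remediation": remediation,
            "total": tokens + eval_cost + review + remediation}
```

At 10,000 daily interactions this puts human review around $2,300/day — squarely in the $1,600-3,200/day range above, and dwarfing the token line. Run it again at 10x volume and the "does the business case hold" sentence writes itself.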

3. Spec writing for non-deterministic systems. A traditional PRD says “the system shall return the correct result.” An AI spec needs: “the system shall return a factually accurate result at least 92% of the time, as measured by a labeled evaluation set of 500+ examples, scored by the rubric defined in Appendix A.”

The spec needs to define what “good enough” means in measurable terms, because AI systems are never 100% correct. Explicitly include: accuracy thresholds per category (not just aggregate), acceptable failure types (verbose but correct = OK, factually wrong = not OK), latency requirements, cost-per-interaction ceiling, and the evaluation methodology that determines whether the threshold is met.
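Per-category thresholds, rather than a single aggregate number, are what make the spec enforceable. A minimal sketch of the check that gates a release — category names and thresholds are hypothetical:

```python
def meets_spec(results: dict, thresholds: dict) -> dict:
    """results: {category: list of 0/1 scores from the labeled eval set}.
    thresholds: {category: required accuracy}.
    Returns the failing categories with their measured accuracy."""
    failures = {}
    for category, required in thresholds.items():
        scores = results.get(category, [])
        accuracy = sum(scores) / len(scores) if scores else 0.0
        if accuracy < required:
            failures[category] = accuracy
    return failures

# Aggregate accuracy here is 87.5%, but the spec fails on "billing":
# an aggregate-only threshold would have hidden the weak category.
```

This is also why the spec must name the evaluation set and rubric: the function is trivial, but the labeled examples behind `results` are where the real work (and the real disputes) live.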

4. Human-in-the-loop design — including the tradeoffs. Most production AI systems need human oversight. The TPM needs to plan for it — and plan for its failure modes.

The design questions: Who reviews AI outputs? What’s the review workflow? What’s the escalation path? How does human feedback loop back into improvement?

The tradeoffs most plans ignore:

  • Reviewers are inconsistent. Two reviewers disagree on 15-30% of edge cases. If your quality bar depends on human review, you need inter-rater reliability measurement and reviewer calibration — not just a review queue.
  • Queues back up. At peak hours, human review becomes the bottleneck. If the AI processes 500 requests/hour but humans can review 50/hour, your escalation rate determines whether the queue is manageable or growing without bound. Model the queue math during scoping.
  • Cost vs. latency tradeoff. Human review adds 5-30 minutes of latency per reviewed interaction. For real-time support, that’s unacceptable for most queries. For batch processing (document review, claim processing), it’s fine. The TPM needs to define which interactions get real-time automation and which get queued for review.
  • Reviewer fatigue. After 200 reviews in a shift, approval rates climb and rejection rates fall — regardless of quality. Build in reviewer rotation, break schedules, and spot-check audits.
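The queue math from the second bullet is a one-liner worth running during scoping rather than discovering in production. A sketch, with illustrative reviewer throughput:

```python
def review_queue_growth(arrivals_per_hour: int, escalation_rate: float,
                        reviewers: int, reviews_per_reviewer_hour: float) -> float:
    """Net review-queue growth per hour.
    Positive means the backlog grows without bound at sustained load."""
    inflow = arrivals_per_hour * escalation_rate
    capacity = reviewers * reviews_per_reviewer_hour
    return inflow - capacity

# 500 requests/hour at a 15% escalation rate needs 75 reviews/hour of
# capacity; 5 reviewers at 12 reviews/hour (illustrative) fall short,
# so the queue grows by 15 items every hour of peak load.
```

The useful TPM move is to invert it: given peak load and reviewer headcount, solve for the maximum escalation rate the system can sustain, and make that number a launch criterion.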

5. Compliance and risk management — with real timelines. AI features touch legal, compliance, privacy, and brand risk. The TPM who treats these as a late-stage checkbox creates delays when legal blocks the launch.

What actually takes time:

  • Legal review of the AI use case: 2-4 weeks. Start before development begins, not after the demo.
  • Privacy impact assessment: 1-3 weeks. Required for any system processing personal data. Requires data flow documentation that engineering hasn’t written yet.
  • AI governance committee review: Many enterprises created these in 2024-2025. They meet monthly. If you miss the review cycle, you wait 30 days. Get on the agenda early.
  • DPA negotiation with AI providers: 2-8 weeks for enterprise terms. Finance and legal negotiate data processing agreements with Anthropic, OpenAI, or the cloud provider. This is on the critical path if the provider hasn’t been approved before.
  • Bias testing and documentation: 1-2 weeks. EU AI Act requires this for high-risk applications. Even if not legally required, it’s increasingly expected.

Total compliance timeline impact: 30-90 days. Start in week one, in parallel with technical work. Not after the POC succeeds.
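Because these tracks can run in parallel, the compliance floor is set by the longest one, not the sum. A sketch using the mid-to-upper durations quoted above:

```python
# Compliance tracks run in parallel; the longest one sets the floor.
# Durations in weeks, taken from the ranges quoted above.
tracks = {
    "legal review": 4,
    "privacy impact assessment": 3,
    "governance committee (missed cycle)": 4,   # ~30-day wait
    "DPA negotiation": 8,
    "bias testing and documentation": 2,
}
critical_path = max(tracks, key=tracks.get)
print(critical_path, tracks[critical_path], "weeks")  # → DPA negotiation 8 weeks
```

Which is the practical argument for starting in week one: if DPA negotiation begins only after the POC succeeds, its 8 weeks land entirely after the technical work instead of underneath it.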

Failure modes → planning controls

Every AI failure mode has a planning implication. The TPM who maps these builds a more realistic plan:

| Failure mode | What happens | Planning control |
| --- | --- | --- |
| Silent failure | Output looks correct but isn't | Require sampling audits in every phase. Budget for 5% production sampling with human review. |
| Cascading failure | Error in step 2 compounds through steps 3-7 | Insert quality checkpoints between pipeline stages. Build rollback capability to last-known-good state. |
| Specification drift | AI gradually ignores instructions over long sessions | Require periodic specification re-injection. Monitor output distribution for baseline shifts. |
| Context degradation | Quality drops in long sessions | Define maximum session length in the spec. Plan for human handoff workflows. |
| Cost spikes | Pathological inputs or retry storms | Set per-request and per-day cost caps. Include cost monitoring in the launch criteria. |
| Model deprecation | Provider sunsets the model you built on | Version-pin models. Build eval suites that run against candidate replacement models. Include "model migration" as a maintenance line item. |
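The cost-spike control in the table is simple enough to sketch. A minimal per-request and per-day cap, with illustrative dollar values — real deployments would also want alerting and a reset at day rollover:

```python
class CostGuard:
    """Per-request and per-day cost caps (cap values are illustrative)."""

    def __init__(self, per_request_cap: float, per_day_cap: float):
        self.per_request_cap = per_request_cap
        self.per_day_cap = per_day_cap
        self.spent_today = 0.0

    def allow(self, estimated_cost: float) -> bool:
        """Admit the request only if it fits under both caps."""
        if estimated_cost > self.per_request_cap:
            return False  # pathological input or retry storm
        if self.spent_today + estimated_cost > self.per_day_cap:
            return False  # daily budget exhausted
        self.spent_today += estimated_cost
        return True
```

The planning point is that the caps appear in the launch criteria: a request the guard rejects should degrade to a cheaper path (cached answer, human queue), and that fallback has to be designed before launch, not after the first bill.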

Managing instability

AI projects are inherently less predictable than traditional software projects. The TPM who pretends otherwise creates false confidence. The TPM who plans for instability builds trust.

Model behavior shifts mid-project. The provider updates the model and your outputs change. This has happened repeatedly — “GPT-4 got worse” was a real incident in 2023, and model updates continue to shift behavior in subtle ways. Plan for it: pin model versions, run eval suites against new versions before adopting, and budget 1-2 sprint cycles for model migration per year.

Prompt iteration is unpredictable. A prompt change that engineering estimates at “2 hours” can take 2 weeks if the eval suite reveals unexpected regressions. Don’t schedule prompt iteration on a Gantt chart. Schedule eval-driven iteration cycles with time boxes: “2 weeks to achieve 90% accuracy on category X. If not achieved, escalate the architectural decision.”

Stakeholder expectations shift. The demo looks great at 85% accuracy. The VP says “ship it.” Legal says “not until 99%.” The VP says “fine, but I need it by Q3.” These aren’t edge cases — they’re the normal state of AI initiatives. The TPM’s job is to make the tradeoffs visible: “here’s what 85% costs, here’s what 99% costs, here’s the timeline difference, here’s the risk at each level.” Force the decision with data, not opinions.

The 60-day milestone

Weeks 1-2: Use case qualification and kill criteria. Pick one AI initiative. Run structured qualification. Define specific go/no-go criteria for each phase AND the kill criteria that stop the project. Map every stakeholder. Get sign-off on the criteria before building starts.

Weeks 3-4: Cost model and technical scoping. Full cost model: tokens + retries + eval + human review + error remediation. Phase the work: Phase 1 (internal pilot with human review), Phase 2 (limited external rollout with monitoring), Phase 3 (general availability with automated quality measurement). Define the quality bar for each phase transition.

Weeks 5-6: Risk register, compliance plan, and failure mode mapping. Every risk with likelihood, impact, mitigation, and owner. Compliance checklist with realistic timelines (start the 30-90 day compliance track now). Failure mode table with planning controls for each.

Weeks 7-8: Initiative plan and phased rollout. Assemble the full plan: phased timeline with milestones, quantified success metrics per phase, cost projections with ranges, risk register, compliance checkpoints, human-in-the-loop design (with queue math), kill criteria, and instability budget (time reserved for model changes and prompt iteration). Present to stakeholders and get sign-off on the phase gates.

The deliverable — a complete AI initiative plan that acknowledges uncertainty, budgets for instability, and defines when to stop — is the artifact that proves you can manage AI initiatives. It’s also the document your organization probably needs right now and doesn’t have.