Measuring AI Products: The Metrics After DAU

An AI feature can be used daily and fail every task. The outcome metrics that replace raw engagement, the definition games inside each number, and the checks that catch inflation.

By Prateek Jain
11 min read · Intermediate

Under most vendor definitions, a customer who asks your support bot a question, gets nowhere, and gives up still counts as a contained conversation [1]. Containment inherited that blind spot from the IVR phone systems where it originated, and it now sits on AI dashboards as a headline number. Most metrics AI products report carry a soft spot like this somewhere in the numerator, and the job, before any of them reaches a board slide, is to pin down what that numerator counts.

Why engagement metrics mislead for AI features

DAU/MAU and conversion rate were built for deterministic software. The button does the same thing every time, so a user who shows up and clicks through almost certainly got what they came for. Even there, the proxy needs care: the DAU/MAU guide argues that "active" has to mean a value-delivering action rather than an app open. For AI features the proxy breaks further, because the output varies per request. A user can show up every day and get a wrong answer every day.

A retry is not engagement. A user who rephrases the same question three times appears in analytics as three interactions, a long session, and a healthy DAU tick. If a model update makes answers worse, message volume can rise, because every failed answer generates a follow-up, and an engagement dashboard reads the extra volume as growth.

Engagement metrics keep one legitimate role for AI features: they measure adoption and habit. Whether people come back is still worth knowing. What they stop being is a proxy for quality, and the metrics below exist to fill that gap.

The outcome layer: did the task get done

The question that replaces "did they show up" is "did the task get done." For a support agent that is resolution rate, the share of issues actually solved end to end. The catch is that vendors usually report it as issues "handled." They have a commercial incentive to define that word broadly, and the same label gets applied to three different outcomes [2]:

  • Genuine resolution. The customer's issue was solved, in channel, with no follow-up needed.
  • Deflection. The system intercepted the contact before a ticket opened, for example by surfacing a help-center article. The ticket never existed, which says nothing about whether the article helped.
  • Containment. The conversation never reached a human. Whether the issue got solved is a separate question, covered below.
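
To see how far the definition alone moves the number, the same batch of contacts can be scored three ways. This is a minimal sketch, assuming each contact carries a hypothetical outcome label (`resolved`, `deflected`, `contained_unresolved`, `escalated`); the labels and counts are invented for illustration and do not come from any vendor.

```python
from collections import Counter

# Hypothetical outcome labels for 100 contact attempts; the counts are made up.
contacts = (
    ["resolved"] * 45
    + ["deflected"] * 15
    + ["contained_unresolved"] * 20
    + ["escalated"] * 20
)

def handled_rate(outcomes, counted_as_handled):
    """Percentage of contacts whose outcome falls in the chosen 'handled' set."""
    counts = Counter(outcomes)
    handled = sum(counts[label] for label in counted_as_handled)
    return 100 * handled / len(outcomes)

strict = handled_rate(contacts, {"resolved"})
with_deflection = handled_rate(contacts, {"resolved", "deflected"})
broad = handled_rate(contacts, {"resolved", "deflected", "contained_unresolved"})

print(f"resolved only:    {strict:.0f}%")           # 45%
print(f"plus deflection:  {with_deflection:.0f}%")  # 60%
print(f"plus containment: {broad:.0f}%")            # 80%
```

The data is identical in all three lines; only the set of labels admitted to the numerator changes.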

Benchmarks for genuine resolution split by tier [2]. Standard AI assistants land at 40 to 60 percent. Best-in-class AI-native platforms reach 55 to 70 percent first-contact resolution in their first year. Agentic platforms with deep backend integration, the kind that can issue the refund instead of explaining the refund policy, reach 70 to 85 percent end to end. The tiers track integration depth as much as model quality, which matters when a vendor quotes the top band while selling you the chat-only product.

Containment and its definition games

Containment rate is the percentage of conversations an AI agent handles without escalating to a human, whether or not the issue was solved [1]:

Containment rate = (conversations handled without a human / total conversations) × 100

The metric came from IVR phone trees, where "the caller never reached an agent" was the entire point, and it carried the design flaw over: under most vendor definitions, an abandoned conversation still counts as contained, including customers who gave up in frustration [1].
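
The formula and its blind spot can be put side by side in a few lines. This is a sketch under stated assumptions: the `Conversation` record and its `abandoned` flag are hypothetical, and how a team detects abandonment (timeout, no confirmation, closed tab) is left open.

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    escalated: bool   # handed to a human at any point
    abandoned: bool   # user left without a confirmed outcome (hypothetical flag)

def containment_rate(conversations):
    """Containment as defined above: no human involved, outcome ignored."""
    if not conversations:
        return 0.0
    contained = [c for c in conversations if not c.escalated]
    return 100 * len(contained) / len(conversations)

def abandoned_share_of_contained(conversations):
    """Percentage of the contained bucket made up of users who gave up."""
    contained = [c for c in conversations if not c.escalated]
    if not contained:
        return 0.0
    return 100 * sum(c.abandoned for c in contained) / len(contained)

# Illustrative sample: 50 completed, 25 abandoned, 25 escalated.
sample = (
    [Conversation(escalated=False, abandoned=False)] * 50
    + [Conversation(escalated=False, abandoned=True)] * 25
    + [Conversation(escalated=True, abandoned=False)] * 25
)

print(containment_rate(sample))                        # 75.0
print(round(abandoned_share_of_contained(sample), 1))  # 33.3
```

In this made-up sample the dashboard would show 75 percent containment while a third of that bucket is people who walked away.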

The benchmark bands, with that flaw priced in [1]: 80 to 90 percent is world-class, and mostly appears in mature, transactional verticals. 65 to 75 percent is solid. 40 to 65 percent is average. 20 to 40 percent is where basic rule-based bots sit. Master of Code reports a similar spread from the build side: enterprise conversational systems often aim for 70 to 90 percent, while simpler FAQ bots average closer to 40 to 60 [3]. They also describe a customer service assistant that looked accurate in training and managed 45 percent containment after launch [3]. That gap between offline accuracy and production outcome is exactly where evals hand off to product metrics: the eval suite gates what ships, and the metrics in this article judge what users did with it.

Containment has honest uses. It feeds staffing models, channel capacity planning, and cost-per-contact math. It stops being honest when it stands in for quality, because it measures whether the conversation stayed in the channel while resolution measures whether the problem got solved, and only the second belongs in a headline [1].

The copilot layer: acceptance and what survives the edit

Copilot-style features (code completion, drafted replies, generated summaries) produce suggestions instead of closing tickets, so the outcome layer changes shape.

Acceptance rate, the share of suggestions users accept, is the standard headline. It is also easy to inflate without anyone cheating. Accepting a draft is often the fastest way to get editable text into the buffer, so a user can accept a suggestion and rewrite most of it, and the counter records a win either way. Pair acceptance with edit distance: how much the user changes the accepted output before the final version, which is the inverse of how much of it survives. High acceptance with low edit distance means the feature produces answers. High acceptance with high edit distance means it produces starting points, which has value, but a different value than the acceptance number implies.
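
A rough way to pair the two numbers is sketched below. It assumes a hypothetical event log of (suggestion, accepted, final text) tuples, and it uses the similarity ratio from Python's standard `difflib` as a stand-in for a true edit distance; a Levenshtein implementation would be a reasonable substitute.

```python
from difflib import SequenceMatcher

def edit_fraction(accepted: str, final: str) -> float:
    """Share of the text that changed: 0.0 if kept verbatim, 1.0 if nothing kept.

    Uses difflib's similarity ratio as a proxy for a true edit distance.
    """
    return 1.0 - SequenceMatcher(None, accepted, final).ratio()

# Hypothetical log: (suggestion shown, accepted?, text the user finally kept).
events = [
    ("Thanks for reaching out.", True, "Thanks for reaching out."),
    ("Your refund is on its way.", True, "We cannot refund this order."),
    ("Please restart the app.", False, ""),
]

accepted = [(shown, final) for shown, ok, final in events if ok]
acceptance_rate = 100 * len(accepted) / len(events)
mean_edit = sum(edit_fraction(s, f) for s, f in accepted) / len(accepted)

print(f"acceptance rate: {acceptance_rate:.0f}%")  # 67%
print(f"mean edit fraction among accepted: {mean_edit:.2f}")
```

Both accepted suggestions count the same toward the acceptance rate, but only the first one survived untouched; the edit fraction is what tells them apart.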

For chat-style assistants, Master of Code's response-quality category points at the same problem from another angle: one-answer success rate (did the first response suffice, without rephrasing) and confusion triggers (how often the model misreads the request or asks for clarification it should not need) [3].

Public benchmarks for acceptance rate and edit distance are thin, so treat both as internal trend lines rather than numbers to compare against the market [4].

The honesty checks

Every metric above can be moved by a definition choice. Four supporting checks catch most of the inflation, and Notch's benchmark work recommends the same set: repeat contact rate, escalation rate, cost per resolution, and CSAT on AI-handled contacts measured against human-handled ones [2].

Repeat contact within 48 to 72 hours. The single most reliable check [2]. If customers whose issues the AI "resolved" come back more often than customers a human resolved, the resolution rate is inflated no matter how the vendor defines it, so instrument this check before the others.
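
The comparison can be instrumented from a plain contact log. The sketch below assumes a hypothetical log of (customer, issue, handler, closed-at) tuples and treats "same issue" as an exact match on an issue key, which real data will rarely give you for free.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=72)  # upper end of the 48 to 72 hour window

# Hypothetical contact log: (customer_id, issue_id, handled_by, closed_at).
contacts = [
    ("c1", "billing", "ai", datetime(2024, 5, 1, 9)),
    ("c1", "billing", "human", datetime(2024, 5, 2, 9)),  # back within 24 hours
    ("c2", "login", "ai", datetime(2024, 5, 1, 10)),
    ("c3", "refund", "human", datetime(2024, 5, 1, 11)),
    ("c4", "refund", "human", datetime(2024, 5, 1, 12)),
    ("c4", "refund", "human", datetime(2024, 5, 6, 12)),  # back after 5 days
]

def repeat_contact_rate(log, handler):
    """Share of contacts closed by `handler` that were followed by another
    contact from the same customer about the same issue within WINDOW."""
    closed = [c for c in log if c[2] == handler]
    repeats = 0
    for cust, issue, _, when in closed:
        if any(
            c[0] == cust and c[1] == issue and when < c[3] <= when + WINDOW
            for c in log
        ):
            repeats += 1
    return 100 * repeats / len(closed) if closed else 0.0

ai_repeat = repeat_contact_rate(contacts, "ai")
human_repeat = repeat_contact_rate(contacts, "human")
print(ai_repeat, human_repeat)  # 50.0 0.0
```

The follow-up contact is counted against whoever closed the first one, regardless of who handles the return visit.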

Escalation rate. The share of AI conversations handed to a human. A rising escalation rate is not automatically bad (it can mean the bot learned to stop guessing), which is why it reads alongside resolution rather than alone.

Cost per resolved task. Divide the all-in cost of the AI channel (model, infrastructure, plus the human time spent on escalations it generates) by genuine resolutions, not by conversations. A conversations denominator rewards volume, and failed answers generate volume.
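
The effect of the denominator is easy to show with arithmetic. The figures below are illustrative placeholders, not benchmarks, and the cost breakdown (model, infrastructure, escalation hours) is one assumed way to build the all-in number.

```python
def cost_per_unit(model_cost, infra_cost, escalation_hours, hourly_rate, units):
    """All-in channel cost divided by a chosen denominator."""
    total = model_cost + infra_cost + escalation_hours * hourly_rate
    return total / units

# Illustrative monthly figures, invented for this example.
costs = dict(model_cost=4_000, infra_cost=1_000, escalation_hours=100, hourly_rate=30)
conversations = 10_000
genuine_resolutions = 4_000

per_conversation = cost_per_unit(**costs, units=conversations)
per_resolution = cost_per_unit(**costs, units=genuine_resolutions)

print(per_conversation)  # 0.8
print(per_resolution)    # 2.0
```

With the same spend, the per-conversation figure looks two and a half times cheaper than the per-resolution one, and it would look cheaper still if failed answers pushed conversation volume up.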

CSAT split, AI versus human. Survey AI-handled contacts as their own segment and put the number next to the human-handled baseline [2]. A bot with 75 percent containment and a 15-point CSAT gap against human agents is buying cost savings with customer goodwill, and the blended CSAT number will hide it.
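
A small worked example shows how the blend hides the gap. The survey rows are hypothetical and CSAT is simplified here to the percentage of responses marked satisfied.

```python
# Hypothetical survey responses: (handled_by, satisfied?).
responses = (
    [("ai", True)] * 140 + [("ai", False)] * 60
    + [("human", True)] * 85 + [("human", False)] * 15
)

def csat(rows):
    """Percentage of responses marked satisfied."""
    return 100 * sum(ok for _, ok in rows) / len(rows)

ai_csat = csat([r for r in responses if r[0] == "ai"])
human_csat = csat([r for r in responses if r[0] == "human"])
blended = csat(responses)
gap = human_csat - ai_csat

print(ai_csat, human_csat, blended, gap)  # 70.0 85.0 75.0 15.0
```

The blended 75 sits between the two segments and gives no hint that AI-handled contacts trail by 15 points.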

Join the numbers with retention

Amplitude's guidance on evals lands on the point that ties this article to the one below it: pass rate is the headline metric for an eval suite, and the score becomes useful when joined with the product metrics teams already track, retention, conversion, and adoption among them [5]. A high pass rate alone does not tell you whether the interactions that pass keep users around [5].

The join is straightforward to set up. Cohort users by the outcome of their early AI interactions (resolved, escalated, abandoned) and compare retention curves across the cohorts. If resolved-cohort retention does not separate from the abandoned cohort within a few weeks, either your resolution label is wrong or the feature is not load-bearing, and either finding changes the roadmap.
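
The cohort join described above might look like the following. It is a sketch with hypothetical rows: each user has one early-outcome label and a set of week numbers in which they were active, and a real pipeline would derive both from event data.

```python
from collections import defaultdict

# Hypothetical rows: (user_id, early_outcome, weeks in which the user was active).
users = [
    ("u1", "resolved", {1, 2, 3, 4}),
    ("u2", "resolved", {1, 2, 4}),
    ("u3", "escalated", {1, 2}),
    ("u4", "abandoned", {1}),
    ("u5", "abandoned", {1, 2}),
]

def retention_by_cohort(rows, week):
    """Percentage of each outcome cohort that was active in the given week."""
    totals, retained = defaultdict(int), defaultdict(int)
    for _, outcome, active_weeks in rows:
        totals[outcome] += 1
        retained[outcome] += week in active_weeks  # True counts as 1
    return {o: 100 * retained[o] / totals[o] for o in totals}

week2 = retention_by_cohort(users, week=2)
week4 = retention_by_cohort(users, week=4)
print(week2)  # {'resolved': 100.0, 'escalated': 100.0, 'abandoned': 50.0}
print(week4)  # {'resolved': 100.0, 'escalated': 0.0, 'abandoned': 0.0}
```

Running the function for each week gives one retention curve per cohort, and the point at which the curves separate (or fail to) is the finding.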

For the older metrics these sit beside (CSAT, NPS, first-contact resolution from the human-support era), the PM metrics glossary has the definitions.

The metric set in one table

| Metric | What it measures | What it hides | Benchmark range |
| --- | --- | --- | --- |
| Resolution rate | Issues actually solved, end to end | Vendors fold deflection and containment into "handled" [2] | 40-60% standard; 55-70% best-in-class first-contact, year one; 70-85% agentic with deep integration [2] |
| Containment rate | Conversations that never reached a human | Abandoned conversations count as contained under most definitions [1] | 20-40% basic bots; 40-65% average; 65-75% solid; 80-90% world-class, transactional verticals [1] |
| Escalation rate | Share of conversations handed to a human | Whether the handoff was a failure or the right call | Complement of containment; no standalone band |
| Suggestion acceptance rate | Share of AI suggestions users accept | Accept-then-rewrite; acceptance as a path to editable text | No reliable public benchmark; trend internally [4] |
| Edit distance | How much of the accepted output changes before the final version | Suggestions never shown or never accepted | Track direction, not an absolute target |
| Repeat contact rate (48-72h) | Customers returning about the same issue | Little; this is the check on the other rows | Compare AI-handled vs human-handled [2] |
| Cost per resolved task | Unit economics of the AI channel | Quality, if the denominator is conversations instead of resolutions | Depends on stack; recompute with genuine resolutions |

FAQ

What is containment rate in AI customer support? The percentage of conversations an AI agent handles without escalating to a human, computed as conversations handled without a human divided by total conversations, times 100 [1]. Under most vendor definitions it counts abandoned conversations as contained, so it measures channel behavior, not whether the issue got solved.

What is a good resolution rate for an AI support agent? Standard AI assistants resolve 40 to 60 percent of issues. Best-in-class AI-native platforms reach 55 to 70 percent first-contact resolution in year one, and agentic platforms with deep backend integration reach 70 to 85 percent end to end [2]. Before comparing against any of these, confirm the numerator counts solved issues rather than deflections or contained chats.

What is the difference between containment, deflection, and resolution? Deflection fires before a ticket opens, for example when a help-center suggestion stops the contact. Containment fires in-channel: the conversation happened with the AI and never reached a human. Resolution means the issue was actually solved [1]. Vendors apply the word "handled" across all three, which is why the definitions need pinning before the numbers get compared [2].

How do I measure a copilot feature that has no ticket to close? Use acceptance rate paired with edit distance. Acceptance alone inflates, because users accept drafts they intend to rewrite. Edit distance shows how much of the accepted output gets rewritten before the final version, which separates answers from starting points. For chat-style features, add one-answer success rate and confusion triggers [3].

Can I still use DAU/MAU for an AI product? Yes, for adoption and habit, which it measures as well as it ever did. It cannot stand in for quality, because a retry and a habit look identical in an engagement metric. Pair it with an outcome metric from this article, and see the DAU/MAU guide for defining "active" correctly in the first place.

Sources

Footnotes

  1. What is Containment Rate? (My AskAI)

  2. AI Customer Support Resolution Rate Benchmarks (Notch)

  3. AI Agent Evaluation (Master of Code)

  4. Acceptance-rate figures published for coding and writing copilots come almost entirely from the vendors selling them, and we have not found an independent study that pins a range. The accept-then-rewrite failure mode and the edit-distance pairing are practitioner method, not a benchmark; measure against your own baseline.

  5. AI Evals for Product Managers: A Beginner's Guide (Amplitude)