Measuring AI Products: The Metrics After DAU
An AI feature can be used daily and fail every task. The outcome metrics that replace raw engagement, the definition games inside each number, and the checks that catch inflation.
Under most vendor definitions, a customer who asks your support bot a question, gets nowhere, and gives up still counts as a contained conversation [1]. Containment inherited that blind spot from the IVR phone systems where it originated, and it now sits on AI dashboards as a headline number. Most metrics AI products report carry a soft spot like this somewhere in the numerator, and the job, before any of them reaches a board slide, is to pin down what that numerator counts.
Why engagement metrics mislead for AI features
DAU/MAU and conversion rate were built for deterministic software. The button does the same thing every time, so a user who shows up and clicks through almost certainly got what they came for. Even there, the proxy needs care: the DAU/MAU guide argues that "active" has to mean a value-delivering action rather than an app open. For AI features the proxy breaks further, because the output varies per request. A user can show up every day and get a wrong answer every day.
A retry is not engagement. A user who rephrases the same question three times appears in analytics as three interactions, a long session, and a healthy DAU tick. If a model update makes answers worse, message volume can rise, because every failed answer generates a follow-up, and an engagement dashboard reads the extra volume as growth.
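One way to keep retries from reading as engagement is to collapse near-duplicate consecutive messages before counting interactions. The sketch below is a minimal illustration, assuming each session's user messages are available as plain strings. The function name `count_distinct_asks` and the 0.6 threshold are hypothetical, and character-level similarity from `difflib` is a crude stand-in for a real rephrase detector.

```python
from difflib import SequenceMatcher


def count_distinct_asks(messages: list[str], threshold: float = 0.6) -> int:
    """Count user messages, treating a near-duplicate of the previous one as a retry.

    The threshold is an illustrative assumption, not a calibrated value.
    """
    distinct = 0
    previous = None
    for text in messages:
        normalized = text.lower().strip()
        is_retry = (
            previous is not None
            and SequenceMatcher(None, previous, normalized).ratio() >= threshold
        )
        if not is_retry:
            distinct += 1
        previous = normalized
    return distinct
```

Reporting distinct asks next to raw message count makes a retry spike visible: the first number stays flat while the second climbs.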
Engagement metrics keep one legitimate role for AI features: they measure adoption and habit. Whether people come back is still worth knowing. What they stop being is a proxy for quality, and the metrics below exist to fill that gap.
The outcome layer: did the task get done
The question that replaces "did they show up" is "did the task get done." For a support agent that is resolution rate, the share of issues actually solved end to end. The catch is that dashboards often report it as issues "handled." Vendors have a commercial incentive to define that word broadly, and the same label gets applied to three different outcomes [2]:
- Genuine resolution. The customer's issue was solved, in channel, with no follow-up needed.
- Deflection. The system intercepted the contact before a ticket opened, for example by surfacing a help-center article. The ticket never existed, which says nothing about whether the article helped.
- Containment. The conversation never reached a human. Whether the issue got solved is a separate question, covered below.
Benchmarks for genuine resolution split by tier [2]. Standard AI assistants land at 40 to 60 percent. Best-in-class AI-native platforms reach 55 to 70 percent first-contact resolution in their first year. Agentic platforms with deep backend integration, the kind that can issue the refund instead of explaining the refund policy, reach 70 to 85 percent end to end. The tiers track integration depth as much as model quality, which matters when a vendor quotes the top band while selling you the chat-only product.
Before a resolution rate goes into your reporting, write down which of the three outcomes its numerator counts, and get the vendor's definition in writing. A 70 percent "resolution rate" that includes deflections and abandoned chats is a different number than a 70 percent rate of solved issues, and both get presented with the same confidence.
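The gap between a broad "handled" number and genuine resolution can be made concrete with a small calculation. This is a sketch under stated assumptions: the four outcome labels (`resolved`, `deflected`, `contained_unresolved`, `escalated`) are hypothetical names for whatever your logging produces, and every contact is assumed to carry exactly one of them.

```python
from collections import Counter


def outcome_rates(outcomes: list[str]) -> dict[str, float]:
    """Compare a broad 'handled' rate with the genuine resolution rate.

    Labels are hypothetical: 'resolved', 'deflected',
    'contained_unresolved', and 'escalated'.
    """
    if not outcomes:
        raise ValueError("no contacts to evaluate")
    counts = Counter(outcomes)
    total = len(outcomes)
    # A broad vendor-style numerator: everything that never reached a human.
    handled = counts["resolved"] + counts["deflected"] + counts["contained_unresolved"]
    return {
        "handled_rate": 100 * handled / total,
        "genuine_resolution_rate": 100 * counts["resolved"] / total,
    }


# Illustrative numbers only: 100 contacts with a made-up outcome mix.
sample = (
    ["resolved"] * 45
    + ["deflected"] * 15
    + ["contained_unresolved"] * 10
    + ["escalated"] * 30
)
rates = outcome_rates(sample)
```

With this made-up mix, the same 100 contacts produce a 70 percent "handled" rate and a 45 percent genuine resolution rate.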
Containment and its definition games
Containment rate is the percentage of conversations an AI agent handles without escalating to a human, whether or not the issue was solved [1]:
Containment rate = (conversations handled without a human / total conversations) × 100
The metric came from IVR phone trees, where "the caller never reached an agent" was the entire point, and it carried the design flaw over: under most vendor definitions, an abandoned conversation still counts as contained, including customers who gave up in frustration [1].
The benchmark bands, with that flaw priced in [1]: 80 to 90 percent is world-class, and mostly appears in mature, transactional verticals. 65 to 75 percent is solid. 40 to 65 percent is average. 20 to 40 percent is where basic rule-based bots sit. Master of Code reports a similar spread from the build side: enterprise conversational systems often aim for 70 to 90 percent, while simpler FAQ bots average closer to 40 to 60 [3]. They also describe a customer service assistant that looked accurate in training and managed 45 percent containment after launch [3]. That gap between offline accuracy and production outcome is exactly where evals hand off to product metrics: the eval suite gates what ships, and the metrics in this article judge what users did with it.
Containment has honest uses. It feeds staffing models, channel capacity planning, and cost-per-contact math. It stops being honest when it stands in for quality, because it measures whether the conversation stayed in the channel while resolution measures whether the problem got solved, and only the second belongs in a headline [1].
One question sorts vendor demos quickly: "Does an abandoned conversation count as contained?" If the answer is yes, ask for contained-and-resolved as a separate number.
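The difference between containment and contained-and-resolved is one extra condition in the numerator. A minimal sketch, assuming each conversation record carries an `escalated` flag and a `resolved` flag; both field names are hypothetical, and how `resolved` gets confirmed is left to your own instrumentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    escalated: bool  # handed to a human at any point
    resolved: bool   # issue confirmed solved (assumed to be known)


def containment_rate(conversations: list[Conversation]) -> float:
    """Conversations that never reached a human, abandoned ones included."""
    if not conversations:
        raise ValueError("no conversations to evaluate")
    contained = sum(1 for c in conversations if not c.escalated)
    return 100 * contained / len(conversations)


def contained_and_resolved_rate(conversations: list[Conversation]) -> float:
    """Conversations that never reached a human and were actually solved."""
    if not conversations:
        raise ValueError("no conversations to evaluate")
    solved = sum(1 for c in conversations if not c.escalated and c.resolved)
    return 100 * solved / len(conversations)


# Illustrative mix: 50 solved in channel, 25 abandoned, 25 escalated.
sample = (
    [Conversation(escalated=False, resolved=True)] * 50
    + [Conversation(escalated=False, resolved=False)] * 25
    + [Conversation(escalated=True, resolved=True)] * 25
)
```

On this made-up sample, containment reads 75 percent while contained-and-resolved reads 50 percent; the 25-point gap is the abandoned conversations.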
The copilot layer: acceptance and what survives the edit
Copilot-style features (code completion, drafted replies, generated summaries) produce suggestions instead of closing tickets, so the outcome layer changes shape.
Acceptance rate, the share of suggestions users accept, is the standard headline. It is also easy to inflate without anyone cheating. Accepting a draft is often the fastest way to get editable text into the buffer, so a user can accept a suggestion and rewrite most of it, and the counter records a win either way. Pair acceptance with edit distance: how much the accepted output changes before it reaches the final version. High acceptance with low edit distance means the feature produces answers. High acceptance with high edit distance means it produces starting points, which has value, but a different value than the acceptance number implies.
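The pairing can be sketched as follows, with edit distance taken as how much the accepted text changed before the final version. This assumes you log, for each suggestion shown, whether it was accepted, the suggested text, and the text that ended up in the final version; that event shape and the function names are hypothetical. Levenshtein distance normalized by the longer string is one reasonable choice among several.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(suggested: str, final: str) -> float:
    """0.0 means the suggestion survived untouched; 1.0 means fully rewritten."""
    longest = max(len(suggested), len(final))
    return levenshtein(suggested, final) / longest if longest else 0.0


def summarize(events: list[tuple[bool, str, str]]) -> tuple[float, float]:
    """events: (accepted, suggested_text, final_text) for each suggestion shown.

    Returns (acceptance rate, mean normalized edit distance over accepted suggestions).
    """
    if not events:
        raise ValueError("no suggestions to evaluate")
    accepted = [(s, f) for ok, s, f in events if ok]
    acceptance_rate = len(accepted) / len(events)
    if not accepted:
        return acceptance_rate, 0.0
    mean_distance = sum(normalized_edit_distance(s, f) for s, f in accepted) / len(accepted)
    return acceptance_rate, mean_distance
```

Reading the two returned numbers together is the point: a high first value with a high second value suggests accept-then-rewrite behavior.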
For chat-style assistants, Master of Code's response-quality category points at the same problem from another angle: one-answer success rate (did the first response suffice, without rephrasing) and confusion triggers (how often the model misreads the request or asks for clarification it should not need) [3].
Public benchmarks for acceptance rate and edit distance are thin, so treat both as internal trend lines rather than numbers to compare against the market [4].
If computing true edit distance is more instrumentation than engineering will take on this quarter, a cheap proxy works: the share of accepted suggestions still present unchanged when the document or PR is finalized. It is coarser, and it still separates answers from starting points.
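The proxy can be as small as a substring check. A sketch, assuming accepted suggestions are logged as text and the finalized document or diff is available as a single string; `survival_share` is a hypothetical name, and exact matching will undercount suggestions that were only reindented or reflowed.

```python
def survival_share(accepted_suggestions: list[str], final_text: str) -> float:
    """Share of accepted suggestions that appear verbatim in the final text."""
    if not accepted_suggestions:
        return 0.0
    survived = sum(
        1
        for suggestion in accepted_suggestions
        if suggestion.strip() and suggestion.strip() in final_text
    )
    return survived / len(accepted_suggestions)
```

For example, if two suggestions were accepted and only one still appears in the finalized file, the share is 0.5.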
The honesty checks
Every metric above can be moved by a definition choice. Four supporting checks catch most of the inflation, and Notch's benchmark work recommends the same set: repeat contact rate, escalation rate, cost per resolution, and CSAT on AI-handled contacts measured against human-handled ones [2].
Repeat contact within 48 to 72 hours. The single most reliable check [2]. If customers whose issues the AI "resolved" come back more often than customers a human resolved, the resolution rate is inflated no matter how the vendor defines it, so instrument this check before the others.
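A repeat-contact check needs only two event streams: resolutions and inbound contacts. The sketch below assumes both are available as `(customer_id, timestamp)` pairs, which is a hypothetical shape. It also treats any later contact from the same customer as a repeat, since matching on issue topic would need data this sketch does not assume.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def repeat_contact_rate(
    resolutions: list[tuple[str, datetime]],
    contacts: list[tuple[str, datetime]],
    window: timedelta = timedelta(hours=72),
) -> float:
    """Share of resolutions followed by another contact from the same customer within the window."""
    if not resolutions:
        raise ValueError("no resolutions to evaluate")
    contacts_by_customer = defaultdict(list)
    for customer_id, contacted_at in contacts:
        contacts_by_customer[customer_id].append(contacted_at)
    repeats = 0
    for customer_id, resolved_at in resolutions:
        # A repeat is any contact strictly after the resolution and inside the window.
        if any(
            resolved_at < contacted_at <= resolved_at + window
            for contacted_at in contacts_by_customer.get(customer_id, [])
        ):
            repeats += 1
    return repeats / len(resolutions)
```

Run it once over AI-resolved contacts and once over human-resolved contacts with the same window, then compare the two rates.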
Escalation rate. The share of AI conversations handed to a human. A rising escalation rate is not automatically bad (it can mean the bot learned to stop guessing), which is why it reads alongside resolution rather than alone.
Cost per resolved task. Divide the all-in cost of the AI channel (model, infrastructure, plus the human time spent on escalations it generates) by genuine resolutions, not by conversations. A conversations denominator rewards volume, and failed answers generate volume.
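The denominator choice is easy to see with made-up numbers. In the sketch below every figure is hypothetical, and the cost components are the ones named above: model, infrastructure, and human time spent on escalations.

```python
def all_in_cost(
    model_cost: float,
    infrastructure_cost: float,
    escalation_minutes: float,
    agent_cost_per_minute: float,
) -> float:
    """Total cost of the AI channel, including human time on escalations it generates."""
    return model_cost + infrastructure_cost + escalation_minutes * agent_cost_per_minute


# Hypothetical month: all figures are illustrative.
total = all_in_cost(
    model_cost=3000.0,
    infrastructure_cost=1000.0,
    escalation_minutes=2000.0,
    agent_cost_per_minute=1.0,
)
conversations = 10_000
genuine_resolutions = 4_000

cost_per_conversation = total / conversations          # flatters the channel
cost_per_resolved_task = total / genuine_resolutions   # the honest unit cost
```

Here the same 6,000 in spend reads as 0.60 per conversation and 1.50 per resolved task, and only the second number gets worse when failed answers add volume.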
CSAT split, AI versus human. Survey AI-handled contacts as their own segment and put the number next to the human-handled baseline [2]. A bot with 75 percent containment and a 15-point CSAT gap against human agents is buying cost savings with customer goodwill, and the blended CSAT number will hide it.
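A short calculation shows how blending hides the gap. The survey counts and scores below are hypothetical, chosen so that AI-handled contacts make up 75 percent of responses and trail human-handled ones by 15 points.

```python
def blended_csat(segments: dict[str, tuple[int, float]]) -> float:
    """Response-weighted CSAT across segments.

    segments maps a segment name to (survey_responses, csat_percent).
    """
    total_responses = sum(count for count, _ in segments.values())
    if total_responses == 0:
        raise ValueError("no survey responses")
    weighted = sum(count * score for count, score in segments.values())
    return weighted / total_responses


# Illustrative numbers only.
segments = {"ai_handled": (750, 70.0), "human_handled": (250, 85.0)}
blended = blended_csat(segments)
gap = segments["human_handled"][1] - segments["ai_handled"][1]
```

The blended figure comes out at 73.75, which looks like a modest dip from the human baseline of 85 and gives no hint that the segment carrying most of the volume sits at 70.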
Join the numbers with retention
Amplitude's guidance on evals lands on the point that ties this article to the one below it: pass rate is the headline metric for an eval suite, and the score becomes useful when joined with the product metrics teams already track, retention, conversion, and adoption among them [5]. A high pass rate alone does not tell you whether the interactions that pass keep users around [5].
The join is straightforward to set up. Cohort users by the outcome of their early AI interactions (resolved, escalated, abandoned) and compare retention curves across the cohorts. If resolved-cohort retention does not separate from the abandoned cohort within a few weeks, either your resolution label is wrong or the feature is not load-bearing, and either finding changes the roadmap.
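A sketch of that cohort join, assuming each user record carries the outcome label of their early AI interaction and the set of week numbers (1, 2, 3, ...) in which they were active afterward. The field names `first_outcome` and `active_weeks` are hypothetical.

```python
from collections import defaultdict


def retention_by_cohort(users: list[dict], weeks: int) -> dict[str, list[float]]:
    """Retention curve per outcome cohort.

    Each user is {"first_outcome": str, "active_weeks": set[int]}.
    Position w - 1 in each curve is the share of the cohort active in week w.
    """
    sizes: dict[str, int] = defaultdict(int)
    active: dict[str, list[int]] = defaultdict(lambda: [0] * weeks)
    for user in users:
        cohort = user["first_outcome"]
        sizes[cohort] += 1
        for week in user["active_weeks"]:
            if 1 <= week <= weeks:
                active[cohort][week - 1] += 1
    return {
        cohort: [count / sizes[cohort] for count in active[cohort]]
        for cohort in sizes
    }


# Tiny illustrative dataset.
users = [
    {"first_outcome": "resolved", "active_weeks": {1, 2}},
    {"first_outcome": "resolved", "active_weeks": {1}},
    {"first_outcome": "abandoned", "active_weeks": {1}},
    {"first_outcome": "abandoned", "active_weeks": set()},
]
curves = retention_by_cohort(users, weeks=2)
```

If the resolved curve does not pull away from the abandoned curve as the weeks go on, that is the signal described above.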
For the older metrics these sit beside (CSAT, NPS, first-contact resolution from the human-support era), the PM metrics glossary has the definitions.
The metric set in one table
| Metric | What it measures | What it hides | Benchmark range |
|---|---|---|---|
| Resolution rate | Issues actually solved, end to end | Vendors fold deflection and containment into "handled" [2] | 40-60% standard; 55-70% best-in-class first-contact, year one; 70-85% agentic with deep integration [2] |
| Containment rate | Conversations that never reached a human | Abandoned conversations count as contained under most definitions [1] | 20-40% basic bots; 40-65% average; 65-75% solid; 80-90% world-class, transactional verticals [1] |
| Escalation rate | Share of conversations handed to a human | Whether the handoff was a failure or the right call | Complement of containment when every conversation is either escalated or contained; no standalone band |
| Suggestion acceptance rate | Share of AI suggestions users accept | Accept-then-rewrite; acceptance as a path to editable text | No reliable public benchmark; trend internally [4] |
| Edit distance | How much accepted output is changed before the final version | Suggestions never shown or never accepted | Track direction, not an absolute target |
| Repeat contact rate (48-72h) | Customers returning about the same issue | Little; this is the check on the other rows | Compare AI-handled vs human-handled [2] |
| Cost per resolved task | Unit economics of the AI channel | Quality, if the denominator is conversations instead of resolutions | Depends on stack; recompute with genuine resolutions |
FAQ
What is containment rate in AI customer support? The percentage of conversations an AI agent handles without escalating to a human, computed as conversations handled without a human divided by total conversations, times 100 [1]. Under most vendor definitions it counts abandoned conversations as contained, so it measures channel behavior, not whether the issue got solved.
What is a good resolution rate for an AI support agent? Standard AI assistants resolve 40 to 60 percent of issues. Best-in-class AI-native platforms reach 55 to 70 percent first-contact resolution in year one, and agentic platforms with deep backend integration reach 70 to 85 percent end to end [2]. Before comparing against any of these, confirm the numerator counts solved issues rather than deflections or contained chats.
What is the difference between containment, deflection, and resolution? Deflection fires before a ticket opens, for example when a help-center suggestion stops the contact. Containment fires in-channel: the conversation happened with the AI and never reached a human. Resolution means the issue was actually solved [1]. Vendors apply the word "handled" across all three, which is why the definitions need pinning before the numbers get compared [2].
How do I measure a copilot feature that has no ticket to close? Use acceptance rate paired with edit distance. Acceptance alone inflates, because users accept drafts they intend to rewrite. Edit distance shows how much of the accepted output gets rewritten before the final version, which separates answers from starting points. For chat-style features, add one-answer success rate and confusion triggers [3].
Can I still use DAU/MAU for an AI product? Yes, for adoption and habit, which it measures as well as it ever did. It cannot stand in for quality, because a retry and a habit look identical in an engagement metric. Pair it with an outcome metric from this article, and see the DAU/MAU guide for defining "active" correctly in the first place.
Sources
Footnotes
[1] What is Containment Rate? (My AskAI)
[2] AI Customer Support Resolution Rate Benchmarks (Notch)
[4] Acceptance-rate figures published for coding and writing copilots come almost entirely from the vendors selling them, and we have not found an independent study that pins a range. The accept-then-rewrite failure mode and the edit-distance pairing are practitioner method, not a benchmark; measure against your own baseline.
[5] AI Evals for Product Managers: A Beginner's Guide (Amplitude)