Synthetic Users: What AI Participants Can and Cannot Tell You | PM Toolkit

Nielsen Norman Group took three studies it had already run with real participants and reran them with AI stand-ins, testing both ChatGPT and the Synthetic Users product¹. One of the three covered online training. In NN/g's data, real learners commonly start courses and never finish them. The synthetic learners reported completing everything.

That gap is the subject of this guide. A synthetic user is an AI-generated profile built to mimic a user group, producing research-shaped findings without any real person involved¹. The pitch is easy to see: answers in minutes instead of weeks, near-zero cost per participant, simulated access to groups you struggle to recruit. But the published comparisons keep finding the same problems in the output.

Why teams reach for them

The pull is real, so it is worth stating plainly.

Recruiting is slow and expensive. Getting eight enterprise admins on calls can take three weeks and cost a few thousand dollars in incentives. A synthetic panel answers before lunch.

Some groups are hard to reach at any budget. Compliance officers, ICU nurses, CFOs at mid-market firms. If your current access to them is zero, a simulated version feels better than nothing.

And the output looks like research. Transcripts, quotes, personas, all formatted like the artifacts a real study produces. The resemblance is part of the problem, because a stakeholder reading the deck cannot tell which kind of study generated it.

What the comparisons found

MeasuringU reviewed 12 papers that compared synthetic users with human participants across four research contexts. They tagged 9 findings encouraging and 14 discouraging, summarizing the field as "the results aren't universally bad, but they definitely aren't great"².

The distribution of those findings matters more than the totals. On matched variance, the review found zero encouraging findings and three discouraging ones: synthetic data shows artificially low variability. On qualitative depth, the same pattern, zero encouraging and three discouraging². Those two zeros land on the two things research exists to capture: how much your users differ from each other, and what they say that you did not expect.
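
Once you have even a small human sample, the flat-variance problem can be checked directly by comparing the spread of the two sets of answers. The sketch below is one rough way to do it in Python. The ratings are invented for illustration and do not come from any of the cited studies, and the 0.5 cutoff is an arbitrary assumption, not a published threshold.

```python
from statistics import stdev

def spread_ratio(synthetic, human):
    """Synthetic standard deviation divided by human standard deviation."""
    return stdev(synthetic) / stdev(human)

# Invented 1-7 ratings for the same question, for illustration only.
human_ratings = [1, 2, 4, 4, 5, 6, 7, 7]
synthetic_ratings = [5, 5, 5, 6, 6, 6, 5, 6]

# Assumption: flag the synthetic set when its spread is under half the human spread.
FLAT_THRESHOLD = 0.5

ratio = spread_ratio(synthetic_ratings, human_ratings)
is_flat = ratio < FLAT_THRESHOLD
print(f"spread ratio: {ratio:.2f}, flagged as flat: {is_flat}")
```

A ratio well below 1 means the synthetic answers cluster far more tightly than the human ones. The check only works with human data in hand, which is the point: without a real sample there is nothing to compare against.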

NN/g's three-study comparison fills in the texture¹. The chatbots wanted to please, a tendency the researchers name directly as sycophancy, and did not model human behavior well. The synthetic users' stated values were too shallow to act on. They "seem to care about everything," which tells a PM nothing about what to prioritize, and probing only produced more vagueness. Their past experiences were invented, because a model has never used your product. Whatever it offers as memory is confabulation.

The failures also run in the opposite direction. Jeff Sauro and colleagues tested ChatGPT on tree testing, the task of locating items in a navigation hierarchy, and concluded it was not an acceptable replacement for real users because it vastly outperformed most real people³. So a synthetic participant can be too agreeable about your concept and too skilled at your navigation in the same session, and neither error is visible unless you already have human data to compare against.

Synthetic users fail in both directions. They overrate your ideas (sycophancy) and over-perform on your tasks (no human limits on attention, memory, or patience). A flattering opinion study and a too-easy usability result can come from the same prompt.

Two more results from the academic side. Rafikova and Voronin (2026) found synthetic users can match the direction of human attitudinal trends while missing the deeper structure underneath them². And Wang, Morgenstern, and Dickerson, writing in Nature Machine Intelligence in 2025, found that LLMs substituting for human participants "can harmfully misportray and flatten identity groups"². If you are simulating users from a demographic you do not belong to, the simulation is not a neutral stand-in.

Five failure modes

What breaks	Why	What it looks like in your findings
Sycophancy	Models are tuned to agree with the framing of your question⁴	Lead the question slightly and the answer confirms what you hoped to hear; nobody pushes back on the concept
Flat variance	Synthetic responses cluster around the plausible average²	Suspiciously consistent answers, no outliers, every "segment" wanting roughly the same thing
Confabulated experience	The model has never used your product, so it invents a past¹	Detailed memories of workflows no one performed; completion and success rates near 100%
No social context	Simulated individuals answer as isolated minds⁵	Nothing about shared devices, collaboration, approval chains, or how the product spreads between people
Training-data drift	The model reflects the past internet, not your current customers⁴	Findings that echo old forum threads; no trace of your recent pricing change or redesigned onboarding

Daniel Russell's point about social context deserves separate attention. Collaborative tools, families sharing one tablet, anything with a viral loop: the parts of a product that depend on people interacting with other people are exactly the parts a single-profile simulation does not model⁵. He also notes a depth gap: real participants articulate thought processes, tell stories, and bring life context that a generated profile cannot supply.

The trap: concept validation

Concept validation is where synthetic users do the most damage. NN/g calls this use "incredibly risky" because "every idea is often seen as a good one"¹. The mechanism is the sycophancy in the table above: models are tuned to agree with the framing of your question, so a concept described with any enthusiasm comes back validated⁴.

If your synthetic study found that users love the concept, you have learned almost nothing. The prior probability of that result was close to 100% before you asked.
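
The arithmetic behind "almost nothing" can be sketched with surprisal, the information-theory measure of how much an observed outcome tells you: an outcome with probability p carries -log2(p) bits. The probabilities below are assumptions chosen for illustration, not measured approval rates.

```python
import math

def surprisal_bits(p):
    """Information carried by observing an outcome of probability p, in bits."""
    return -math.log2(p)

# Assumed probabilities, for illustration only.
p_synthetic_approves = 0.95  # a near-certain "users love it"
p_open_question = 0.5        # an outcome that could honestly go either way

bits_synthetic = surprisal_bits(p_synthetic_approves)
bits_open = surprisal_bits(p_open_question)
print(f"near-certain result: {bits_synthetic:.3f} bits")
print(f"open question:       {bits_open:.3f} bits")
```

Under these assumed numbers, the near-certain approval carries a small fraction of the information of a result that could have gone either way. The closer the expected approval rate gets to 100%, the closer that figure gets to zero.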

The deeper cost is organizational, and NN/g states it plainly. Teams that start with synthetic research get comfortable and never move on to studying real users. And when the fake findings eventually prove wrong, the failure can permanently damage the organization's belief in research itself¹. That second-order cost runs for years, because every future study has to fight the memory of the time "research" steered the company into a wall.

User Evaluation's 2026 analysis adds a third structural problem beyond sycophancy and drift: a model has no capacity for genuine surprise. Research pays off when a real person says something that breaks your mental model. A language model can only produce what is already plausible to it⁴.

Where synthetic users earn a place

The defensible role sits before research⁴. Treat the model as a brainstorming partner. The data still has to come from people.

The concrete uses, drawn from NN/g and User Evaluation¹,⁴:

Pilot your discussion guide. Run your interview questions against a synthetic participant before the first real session. Confusing phrasing and leading questions surface cheaply, and you fix them before they cost you a recruited participant.
Explore an unfamiliar user group. Build proto-personas for a segment you have never studied, explicitly labeled as drafts to be corrected by real research.
Generate hypotheses. Ask the model what objections, edge cases, and unmet needs a given segment might have. Each answer is a candidate, never a finding.
Run pre-mortems and stimulus pre-tests. "This launch failed in six months, tell me why" is a prompt a synthetic user handles well, because you want plausible failure stories rather than truth about any individual.
Desk research on edge cases. Use the model to enumerate scenarios your guide should cover, the same way you would use a literature review.

Write every synthetic output as a falsifiable statement before it enters a document: "We believe enterprise admins abandon setup at the SSO step, and we will test this with 8 interviews by July 10." The format forces the follow-up study into existence. The full method is in hypothesis-driven development.
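
If your team tracks hypotheses in a script or notebook, the format can be enforced with a small record type. This is a minimal sketch, not a prescribed tool: the `Hypothesis` class, its field names, and the `status` values are all hypothetical and invented for this example.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """A synthetic-sourced claim that cannot be recorded without a test plan."""
    belief: str      # what we think is true
    test_plan: str   # the real-participant study that will check it
    deadline: str    # when that study will be done
    status: str = "untested"  # hypothetical values: untested, confirmed, killed

    def statement(self) -> str:
        """Render the claim in the falsifiable sentence format."""
        return (f"We believe {self.belief}, and we will test this "
                f"with {self.test_plan} by {self.deadline}.")

    def is_finding(self) -> bool:
        """Only a claim confirmed by a real study counts as a finding."""
        return self.status == "confirmed"

h = Hypothesis(
    belief="enterprise admins abandon setup at the SSO step",
    test_plan="8 interviews",
    deadline="July 10",
)
print(h.statement())
print("finding?", h.is_finding())
```

Because `test_plan` and `deadline` have no defaults, a hypothesis cannot be created without naming the follow-up study, and it starts life as untested.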

Then run the real study. For the qualitative leg, recruit actual participants and use the synthetic round only to make those sessions sharper. For the quantitative leg, size it properly before you start, because an underpowered test of a synthetic-sourced hypothesis just stacks one unreliable signal on another. The A/B testing guide covers the experiment design.
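
For a rough sense of what "size it properly" means, the sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test. It is an approximation under stated assumptions, not the method of any particular testing tool. The baseline and target rates are made-up inputs, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / effect ** 2)

# Made-up example: setup completion at 40%, hoping the change lifts it to 45%.
n = sample_size_per_group(0.40, 0.45)
print(f"participants needed per group: {n}")
```

With these inputs the answer is on the order of 1,500 per group, and halving the expected lift roughly quadruples it. If your traffic cannot reach that number, the hypothesis needs a different test, not a shorter one.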

One boundary worth drawing clearly. AI applied to data from real participants is a different practice with a different risk profile: hallucinated quotes instead of hallucinated people, and a 60-second spot-check that catches them. That workflow is covered in AI for user research synthesis. This article is about AI replacing the participants themselves, and the evidence for the two practices points in opposite directions.

A useful self-test before any synthetic study: "If the synthetic users say X, will we act on it without checking?" If yes, you are using them as research. Move the question to a real study.

FAQ

What is a synthetic user? An AI-generated profile that mimics a target user group and produces artificial research findings, such as interview answers or survey responses, without any real participant involved¹. Vendors sell dedicated products for this, and teams also improvise the same thing with ChatGPT or Claude.

How accurate are synthetic users compared to real participants? Mixed, with the failures concentrated where it hurts. A review of 12 published comparisons tagged 9 findings encouraging and 14 discouraging, with zero encouraging findings on matched variance and zero on qualitative depth². Synthetic users can track the direction of human attitudes while missing the structure underneath, and they both overrate concepts and over-perform on tasks²,³.

Can I use synthetic users to validate a product concept? No. NN/g rates this use as "incredibly risky" because synthetic users see almost every idea as a good one¹. A validation that was nearly guaranteed before you asked carries no information. Use real participants for any decision you intend to act on.

What are synthetic users actually good for? Work that happens before research: piloting a discussion guide, drafting proto-personas for an unfamiliar segment, generating hypotheses, running pre-mortems, and enumerating edge cases¹,⁴. Treat every output as a hypothesis that a real study then confirms or kills.

Is this the same as using AI to synthesize my user interviews? No. Synthesis applies AI to data real people produced, and the main risk (fabricated quotes) is checkable against the source transcripts. Synthetic users replace the people, so there is no source to check against. See AI for user research synthesis for the first practice.