Building Agentic Products: 2026 Guide for PMs | PM Toolkit

What Changed Between 2025 and 2026

In January 2025 I called multi-agent "the future." By April 2026 it shipped.

Claude Sonnet 4.6 hit 72.5% on computer use benchmarks, what Anthropic calls human-level for most office tasks¹.
Anthropic shipped the Agent SDK alongside Claude 4.6. OpenAI followed with Agents SDK in March. Google launched ADK in April².
MCP, the Model Context Protocol, hit 97M monthly SDK downloads by March 2026, 970x growth since its launch in November 2024, with 17,468 indexed servers³.
Cursor at $1.2B ARR. Claude projecting $6B run rate by end of 2026. Cognition (Devin) at $25B valuation⁴⁵.

The agent products that won in 2026 share two things: a clear orchestration pattern and step-level evals. MCP is table stakes. The ones that lost shipped a single LLM call wrapped in a chat interface.

The Three Patterns

After watching dozens of agentic products ship between 2025 and 2026, I found that every successful one used one of three orchestration patterns.

Pattern 1: Sequential (Relay Race)

One agent does its job, hands off to the next. Each step depends on the last.

When to use: linear workflows with clear stages. Research, then analysis, then writing. Intake, then triage, then resolution.

Real example: Claude Code task chains. Plan agent designs the change, implementer agent writes it, reviewer agent checks the diff. Each step is one Claude call with tool access.

Watch for: error compounding. If each step is 90% accurate and you have four steps, end-to-end accuracy is about 66% (0.9 × 0.9 × 0.9 × 0.9 ≈ 0.656). Add validation at each handoff.
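
The compounding arithmetic is easy to check. This is a minimal sketch that assumes every step succeeds independently with the same probability, which real pipelines only approximate. The function names are mine, not from any SDK.

```python
def end_to_end_accuracy(step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_accuracy ** steps


def max_steps_for_target(step_accuracy: float, target: float) -> int:
    """Largest chain length that keeps end-to-end accuracy at or above target."""
    if not 0 < step_accuracy < 1:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    # Keep adding steps while the longer chain still meets the target.
    while end_to_end_accuracy(step_accuracy, steps + 1) >= target:
        steps += 1
    return steps


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, round(end_to_end_accuracy(0.9, n), 3))
```

Under these assumptions a chain of 90%-accurate steps drops below 80% end-to-end after only two steps, which is the argument for short chains and validated handoffs.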

Pattern 2: Parallel (Specialist Team)

Multiple agents work on independent pieces at the same time. Results merge at the end.

When to use: independent subtasks. Research five competitors. Analyze five datasets. Generate five variants.

Real example: Claude Cowork sub-agents. Spin up parallel agents to research separate topics, then merge into one report. The constraint is independence: parallel only works when the pieces don't depend on each other.

Watch for: cost. Parallel agents can burn 5-10x the tokens of a sequential pipeline. With Sonnet 4.6 at $3 input and $15 output per million tokens⁶, a parallel research session can cost more than the analyst it replaces. Set a token budget.
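
Here is a rough sketch of a parallel fan-out that shares one token budget. The prices are the Sonnet 4.6 figures quoted above. Everything else is an assumption for illustration: `research` is a stub standing in for a real agent call, the token counts are made up, and `TokenBudget` is a hypothetical helper, not part of any SDK.

```python
import asyncio
from dataclasses import dataclass

INPUT_USD_PER_MTOK = 3.0    # price per million input tokens, from the text
OUTPUT_USD_PER_MTOK = 15.0  # price per million output tokens, from the text


def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the prices above."""
    return (input_tokens / 1e6) * INPUT_USD_PER_MTOK + (
        output_tokens / 1e6
    ) * OUTPUT_USD_PER_MTOK


class BudgetExceeded(RuntimeError):
    pass


@dataclass
class TokenBudget:
    max_usd: float
    spent_usd: float = 0.0

    def charge(self, input_tokens: int, output_tokens: int) -> float:
        """Record a call's cost, refusing any charge that would pass the cap."""
        cost = cost_usd(input_tokens, output_tokens)
        if self.spent_usd + cost > self.max_usd:
            raise BudgetExceeded(f"budget of ${self.max_usd:.2f} would be exceeded")
        self.spent_usd += cost
        return cost


async def research(topic: str, budget: TokenBudget) -> str:
    """Stub worker. A real one would call a model and read its token usage."""
    await asyncio.sleep(0)  # placeholder for network latency
    budget.charge(input_tokens=40_000, output_tokens=4_000)  # made-up usage
    return f"notes on {topic}"


async def research_all(topics: list[str], budget: TokenBudget) -> list[str]:
    """Fan out one worker per topic and collect results in input order."""
    return await asyncio.gather(*(research(t, budget) for t in topics))


if __name__ == "__main__":
    shared = TokenBudget(max_usd=1.00)
    print(asyncio.run(research_all(["pricing", "onboarding"], shared)))
    print(f"spent ${shared.spent_usd:.2f}")
```

This stub charges after the call, so the money is already spent when the cap trips. A production version would more likely reserve an estimate before each call and stop launching workers once the reserve runs out.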

Pattern 3: Hierarchical (Orchestrator and Workers)

A planner agent decides what to do and delegates to specialist agents. The planner reads results, replans if needed, returns the final output.

When to use: complex tasks with unknown shape. The plan emerges from the work.

Real example: Cursor's agent mode. The orchestrator decides which files to read, which tools to call, when to ask for human input. It's the most flexible pattern and the hardest to debug.

Watch for: runaway loops. Hierarchical agents can loop on themselves for hours if you don't cap iterations. Set a max-step limit and a kill switch.
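
A minimal sketch of a capped orchestrator loop. `plan_next` and `execute` are hypothetical callables standing in for the planner and worker agents, and the kill switch is just a `threading.Event` that some other part of the product (a stop button, a billing alarm) can set.

```python
import threading
from typing import Callable, Optional


class StepLimitReached(RuntimeError):
    pass


class KillSwitchTriggered(RuntimeError):
    pass


def run_orchestrator(
    plan_next: Callable[[list[str]], Optional[str]],
    execute: Callable[[str], str],
    max_steps: int = 10,
    kill_switch: Optional[threading.Event] = None,
) -> list[str]:
    """Plan, delegate, replan until the planner returns None or a limit trips.

    max_steps caps the number of worker executions, not planner calls.
    """
    results: list[str] = []
    steps = 0
    while True:
        # Check the kill switch before spending anything on this iteration.
        if kill_switch is not None and kill_switch.is_set():
            raise KillSwitchTriggered("stopped by kill switch")
        task = plan_next(results)  # planner sees all results so far
        if task is None:           # planner decided the work is done
            return results
        if steps >= max_steps:
            raise StepLimitReached(f"planner still busy after {max_steps} steps")
        results.append(execute(task))
        steps += 1
```

Raising on the limit instead of quietly returning partial results is a deliberate choice in this sketch: the caller has to decide what to show the user when a run is cut short.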

Picking a Pattern

A simple decision table:

Workflow shape	Pattern
Linear, predictable	Sequential
Independent subtasks	Parallel
Adaptive, exploratory	Hierarchical

Most production products use a mix. Cursor's agent mode is hierarchical at the top with sequential pipelines inside. Claude Code is sequential by default with parallel sub-agents on demand.

Start simple. Two-agent sequential chains beat ten-agent orchestras for most products.

MCP: The Standard That Won

When I wrote about multi-agent in 2025, every framework had its own tool-call format. By 2026 that fragmentation is over.

The Model Context Protocol (MCP), released by Anthropic in November 2024, became the universal layer for connecting agents to tools and data. Adoption numbers tell the story³:

97M monthly SDK downloads by March 2026 (970x in 16 months)
17,468 indexed MCP servers in Q1 2026
OpenAI, Microsoft, Google, Amazon all adopted it
Opera Neon ships an MCP connector for browser-level access

What MCP gives you:

One tool format that Claude, ChatGPT, Gemini, and open-source agents all speak
A registry of pre-built servers (databases, APIs, filesystems, internal tools)
Local or remote transport
Stateless HTTP transport for horizontal scaling, in review for v1.1⁷

What it means for product teams: stop building custom tool integrations. Wrap your APIs as MCP servers and any agent can call them. We do this in the PM Toolkit codebase under src/app/api/[transport]/.
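
To make "wrap your APIs as MCP servers" concrete, here is a stdlib-only sketch of the two tool messages as I understand them from the spec: `tools/list` returns tool descriptors with a JSON Schema, and `tools/call` runs one by name. It is an illustration of the message shape only. It skips initialization and transport, `lookup_order` is a hypothetical internal API, and a real server should be built on an official MCP SDK rather than on this.

```python
from typing import Any

JSONRPC = "2.0"


def lookup_order(order_id: str) -> str:
    """Hypothetical internal API being wrapped."""
    return f"order {order_id}: shipped"


TOOLS: dict[str, dict[str, Any]] = {
    "lookup_order": {
        "handler": lookup_order,
        "description": "Look up an order's status by id.",
        "inputSchema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    }
}


def _error(request_id: Any, code: int, message: str) -> dict:
    return {"jsonrpc": JSONRPC, "id": request_id, "error": {"code": code, "message": message}}


def handle_request(request: dict) -> dict:
    """Answer a single JSON-RPC request for tools/list or tools/call."""
    request_id = request.get("id")
    method = request.get("method")
    if method == "tools/list":
        tools = [
            {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
            for name, t in TOOLS.items()
        ]
        return {"jsonrpc": JSONRPC, "id": request_id, "result": {"tools": tools}}
    if method == "tools/call":
        params = request.get("params", {})
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            return _error(request_id, -32602, f"Unknown tool: {params.get('name')}")
        try:
            text = tool["handler"](**params.get("arguments", {}))
            is_error = False
        except Exception as exc:  # tool failures go in the result, not the envelope
            text, is_error = str(exc), True
        result = {"content": [{"type": "text", "text": text}], "isError": is_error}
        return {"jsonrpc": JSONRPC, "id": request_id, "result": result}
    return _error(request_id, -32601, f"Method not found: {method}")
```

The point for a PM is the shape: one descriptor per tool, one generic call path. Adding a tool means adding a registry entry, not writing a new integration per agent vendor.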

If you're building an agentic product in 2026 and not using MCP for tools, you're committing to maintenance work that the rest of the industry stopped doing.

The Five Agent SDKs

Five frameworks dominate by April 2026.

SDK	Strength	Best for
Anthropic Agent SDK²	Tool-use first, sub-agents as tools	Claude-anchored products
OpenAI Agents SDK²	Tight ChatGPT and Operator integration	OpenAI-anchored products
Google ADK²	Bundled with Antigravity dev platform	Gemini-anchored products
LangGraph⁸	Directed graph orchestration, observability	Production at scale (Uber, LinkedIn, Klarna)
CrewAI⁸	Role-based crews, simple mental model	Quick prototypes, structured workflows

How to pick:

One model family in production: use that vendor's SDK. Less abstraction, fewer bugs.
Multi-vendor or self-hosted: LangGraph. 47M monthly downloads, longest production track record⁸.
Prototyping: CrewAI. Fastest to a working demo.

The Anthropic Agent SDK takes the cleanest approach in my opinion. Agents are Claude models with tools, and one of the tools can be another agent. This collapses the orchestration problem into a tool-use problem the model already understands.
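
A toy model of the "agent as a tool" idea, to show why it simplifies orchestration. This is not the Anthropic Agent SDK's API. `Agent`, `decide`, and `as_tool` are names I made up, and `decide` is a plain function standing in for the model choosing which tools to call.

```python
from typing import Callable, Optional

Tool = Callable[[str], str]                          # a tool maps text to text
Decide = Callable[[str], list[tuple[str, str]]]      # task -> [(tool name, tool input)]


class Agent:
    """Toy agent: `decide` plays the role of the model picking tool calls."""

    def __init__(self, name: str, decide: Decide, tools: Optional[dict[str, Tool]] = None):
        self.name = name
        self.decide = decide
        self.tools = tools or {}

    def run(self, task: str) -> str:
        outputs = [self.tools[tool](arg) for tool, arg in self.decide(task)]
        # With no tool calls, the agent just answers the task itself.
        return "; ".join(outputs) if outputs else f"{self.name}: {task}"

    def as_tool(self) -> Tool:
        """An agent has the same text-in, text-out shape as any other tool."""
        return self.run


researcher = Agent("researcher", decide=lambda task: [])
planner = Agent(
    "planner",
    decide=lambda task: [("research", f"background on {task}")],
    tools={"research": researcher.as_tool()},
)

if __name__ == "__main__":
    print(planner.run("pricing"))
```

The planner never learns that `research` is another agent. Delegation is just one more tool call, so there is no separate orchestration layer to design.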

Build vs Buy

Three honest paths in 2026:

Buy a vertical agent. Cursor for code, Claude Code for terminal-native dev, Harvey for legal, Devin for autonomous engineering tasks. If your team needs the workflow and someone shipped it, use theirs.

Build with an SDK. You own the prompts, the evals, the user experience. The vendor owns the model and the agent runtime. Best path for differentiated products.

Build from scratch. You write the orchestration loop, the retry logic, the eval harness, the cost controls. Only do this if your workflow is genuinely novel and existing frameworks block you.

In my experience, 80% of teams shipping agentic products should buy or build with an SDK. Building from scratch is a tax most teams cannot afford.

Eval Design for Multi-Agent

Single-LLM evals are easy. Multi-agent evals are harder because failure can happen at any step.

Three evals you need.

1. Step-Level Evals

Each agent in your pipeline gets its own golden set. If the research agent's job is to summarize a doc, you have 50 hand-labeled summaries. If the planner's job is to break a task into steps, you have 50 task-to-steps examples.
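
A step-level eval can be as small as this sketch. The agent is any callable from input text to output text, and `matches` is whatever comparison fits the step (exact match here; a rubric or grader model in practice). `GoldenCase` and `step_pass_rate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class GoldenCase:
    input: str
    expected: str


def step_pass_rate(
    agent: Callable[[str], str],
    cases: Iterable[GoldenCase],
    matches: Callable[[str, str], bool],
) -> float:
    """Fraction of golden cases where the agent's output matches the label."""
    cases = list(cases)
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(1 for case in cases if matches(agent(case.input), case.expected))
    return passed / len(cases)


def exact_match(actual: str, expected: str) -> bool:
    """Simplest possible comparison: ignore case and surrounding whitespace."""
    return actual.strip().lower() == expected.strip().lower()
```

Run one of these per agent on every change and alert on drops. That makes a regression visible at the step where it started instead of three steps downstream.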

This is the boring middle most teams skip. It's also where the regressions happen.

2. Handoff Evals

When agent A passes to agent B, what context does B need? What context does B mishandle? Test the seams, not just the parts.

A common failure: agent A's output is verbose, agent B's context window fills with low-signal text, agent B misses the key information. Step-level evals would not catch this. Handoff evals do.
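
The verbose-handoff failure can be tested directly on the payload that crosses the seam. This sketch checks two things: the key facts survived, and the payload stayed under a size budget. The character budget is a crude stand-in for a token count, substring matching is the simplest possible fact check, and `check_handoff` is a hypothetical helper.

```python
def check_handoff(payload: str, required_facts: list[str], max_chars: int) -> list[str]:
    """Return a list of problems with what agent A hands to agent B."""
    problems: list[str] = []
    if len(payload) > max_chars:
        problems.append(f"payload is {len(payload)} chars, budget is {max_chars}")
    lowered = payload.lower()
    for fact in required_facts:
        # Case-insensitive substring check; swap in a stricter matcher as needed.
        if fact.lower() not in lowered:
            problems.append(f"missing fact: {fact}")
    return problems


if __name__ == "__main__":
    handoff = "Customer is on the Pro plan. Renewal date is 2026-06-01."
    print(check_handoff(handoff, ["Pro plan", "2026-06-01"], max_chars=500))
```

An empty list means the handoff passed. Run it over recorded A-to-B payloads from real sessions, not only over hand-written examples.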

3. End-to-End Evals

The full pipeline against a real user task. This is your task completion rate metric.

End-to-end evals are expensive to build and slow to run. Run them on every release candidate, not every commit.

The Four Metrics That Matter

I've seen teams track 20+ metrics for agentic products. Most are noise. These four predict success:

Metric	Definition	Good	Great
Task completion rate	Users who finished the workflow / users who started	70%	85%+
Time to first useful output	Seconds from request to actionable response	Under 30s	Under 10s
Retry rate	Users who reissued the same prompt within 5 min	Under 20%	Under 5%
Cost per completed task	Total tokens times price divided by completed tasks	varies	falling over time

What's missing: per-step accuracy, model selection, prompt iteration count. Those matter for engineering. They do not predict whether the product will work for users.
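
A sketch of computing the four metrics from per-session records. The `Session` fields are an assumed logging schema, not a standard, and I use the median for time to first useful output because a few slow sessions would distort a mean.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Session:
    completed: bool                  # user finished the workflow
    seconds_to_first_output: float   # request to first actionable response
    retried_within_5_min: bool       # user reissued the same prompt
    input_tokens: int
    output_tokens: int


def summarize(
    sessions: list[Session],
    input_usd_per_mtok: float,
    output_usd_per_mtok: float,
) -> dict[str, Optional[float]]:
    """Compute the four product metrics over a batch of sessions."""
    if not sessions:
        raise ValueError("no sessions to summarize")
    completed = sum(1 for s in sessions if s.completed)
    total_cost = sum(
        s.input_tokens / 1e6 * input_usd_per_mtok
        + s.output_tokens / 1e6 * output_usd_per_mtok
        for s in sessions
    )
    return {
        "task_completion_rate": completed / len(sessions),
        "median_seconds_to_first_output": median(
            s.seconds_to_first_output for s in sessions
        ),
        "retry_rate": sum(1 for s in sessions if s.retried_within_5_min) / len(sessions),
        # Failed sessions still cost money, so they stay in the numerator.
        "cost_per_completed_task": total_cost / completed if completed else None,
    }
```

Keeping failed sessions in the cost numerator is the design choice that matters here: a product that burns tokens on abandoned tasks should look expensive.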

Five Pitfalls I've Seen Kill Agentic Products

These are real, not invented.

1. Context obesity. Each agent passes everything it has to the next. Context windows fill, signal drops, accuracy crashes. Fix: pass only the keys the next agent needs.

2. No kill switch. Hierarchical agents loop. The user watches their token bill climb. Fix: hard cap on iterations and tokens per session. Show the user the meter.

3. Hidden hallucinations. Reasoning models confabulate at rates of 33-48% on factual tasks⁹. The agent sounds confident, the user trusts it, the action is wrong. Fix: ground every factual step with retrieval, require source citations, and evaluate the grounding.

4. Hand-offs without contracts. Agent A produces free text, agent B parses it heuristically, parsing breaks silently. Fix: use structured outputs (JSON schema) at every handoff. Both Anthropic and OpenAI SDKs support this natively.

5. No observability. When the agent fails, no one knows where. Fix: log every step with input, output, latency, cost. LangGraph and CrewAI Enterprise ship this. If you're rolling your own, build it day one.
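
Pitfalls 1, 4, and 5 share a fix at the seam between agents: parse the handoff against a contract, forward only the contracted keys, and log the step. A stdlib-only sketch follows. The contract fields are hypothetical, and in practice you would use the SDK's structured-output support or a schema library instead of hand-rolled checks. Cost is left out of the log because it depends on how your SDK reports token usage.

```python
import json
import time
from typing import Any, Callable

# Hypothetical contract for what the research agent hands to the writer agent.
HANDOFF_CONTRACT: dict[str, type] = {"summary": str, "sources": list}


def parse_handoff(raw: str) -> dict[str, Any]:
    """Validate a JSON handoff and keep only the contracted keys."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on free text
    if not isinstance(data, dict):
        raise ValueError("handoff must be a JSON object")
    for key, expected_type in HANDOFF_CONTRACT.items():
        if key not in data:
            raise ValueError(f"handoff is missing '{key}'")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"'{key}' must be {expected_type.__name__}")
    # Dropping extra keys keeps scratch text out of the next agent's context.
    return {key: data[key] for key in HANDOFF_CONTRACT}


def logged_step(name: str, fn: Callable[[Any], Any], payload: Any, log: list[dict]) -> Any:
    """Run one step and append its input, output, and latency to the log."""
    start = time.perf_counter()
    output = fn(payload)
    log.append({
        "step": name,
        "input": payload,
        "output": output,
        "latency_s": time.perf_counter() - start,
    })
    return output
```

A handoff that fails to parse now fails loudly at the seam, with a log entry for every step that ran before it.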

A 30-Day Roadmap

If you're starting from scratch:

Week 1: Prototype. Pick one workflow. Pick a pattern (default to Sequential). Wire two agents together with the Anthropic or OpenAI SDK. Use MCP for tools. Ship to internal users.

Week 2: Safety. Add a kill switch, iteration cap, token budget. Add structured-output contracts at every handoff. Add basic logging.

Week 3: Evals. Build a step-level golden set for each agent. Build one end-to-end eval. Run them on every change.

Week 4: Metrics. Wire up the four metrics. Set thresholds. Add a feature flag.

Then ship to a small percentage of real users behind the flag and watch the metrics.
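
If you don't already have a flag service, a percentage rollout can be sketched with a stable hash. This is an assumption-laden illustration: `in_rollout` is a hypothetical helper, and most teams will use their existing feature-flag tool instead.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: float) -> bool:
    """Deterministically place a user inside or outside a percentage rollout.

    The same user always gets the same answer for the same flag, and raising
    `percent` only adds users; nobody who was in gets dropped.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100  # e.g. 5% admits buckets 0..499


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(in_rollout(u, "agent-v1", 5) for u in users)
    print(f"{enabled} of {len(users)} users see the agent")
```

Including the flag name in the hash means different flags pick different user subsets, so the same early adopters don't absorb every experiment.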

What I'd Do Differently In 2026

If I started a new agentic product today:

Default to Sonnet 4.6 with extended thinking off. Switch to Opus 4.7 only on tasks where the eval says it matters.
Use MCP for every tool. Even internal ones.
Structured outputs everywhere.
Observability before features.
Step-level evals before end-to-end.
Buy the vertical agent if one exists. Compete on workflow, not on model.

The PMs winning in 2026 build tighter workflows around smaller agents, and let the evals carry the load.
