Building Agentic Products: A 2026 Guide
Patterns, frameworks, and the metrics that matter. By April 2026, agentic AI shipped. Here's how to design products around it.
Prerequisites
- ai-literacy-for-pms
What Changed Between 2025 and 2026
In January 2025 I called multi-agent "the future." By April 2026 it shipped.
- Claude Sonnet 4.6 hit 72.5% on computer-use benchmarks, which Anthropic calls human-level for most office tasks [1].
- Anthropic shipped the Agent SDK alongside Claude 4.6. OpenAI followed with its Agents SDK in March. Google launched ADK in April [2].
- MCP, the Model Context Protocol, hit 97M monthly SDK downloads by March 2026, 970x growth since its launch in November 2024, with 17,468 indexed servers [3].
- Cursor at $1.2B ARR. Claude projecting $6B run rate by end of 2026. Cognition (Devin) at $25B valuation [4][5].
The agent products that won in 2026 share two things: a clear orchestration pattern and step-level evals. MCP is table stakes. The ones that lost shipped a single LLM call wrapped in a chat interface.
The Three Patterns
I watched dozens of agentic products ship between 2025 and 2026. Every successful one used one of three orchestration patterns.
Pattern 1: Sequential (Relay Race)
One agent does its job, hands off to the next. Each step depends on the last.
When to use: linear workflows with clear stages. Research, then analysis, then writing. Intake, then triage, then resolution.
Real example: Claude Code task chains. Plan agent designs the change, implementer agent writes it, reviewer agent checks the diff. Each step is one Claude call with tool access.
Watch for: error compounding. If each step is 90% accurate and you have four steps, end-to-end accuracy is about 66% (0.9 × 0.9 × 0.9 × 0.9 ≈ 0.656). Add validation at each handoff.
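The arithmetic behind that warning is a plain product. Here is a minimal sketch, assuming steps fail independently of each other (an assumption; real pipelines often have correlated errors). The helper name `end_to_end_accuracy` is mine, for illustration only.

```python
import math


def end_to_end_accuracy(step_accuracies):
    """Probability that every step succeeds, assuming independent failures."""
    return math.prod(step_accuracies)


# Four steps at 90% each, as in the example above.
four_steps = end_to_end_accuracy([0.9] * 4)
print(f"{four_steps:.1%}")  # 65.6%

# The same four steps at 98% each, e.g. after adding validation at handoffs.
validated = end_to_end_accuracy([0.98] * 4)
print(f"{validated:.1%}")
```

The useful takeaway is how steep the curve is: every extra step multiplies in another factor below 1, so shortening the chain helps as much as improving any one agent.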
Pattern 2: Parallel (Specialist Team)
Multiple agents work on independent pieces at the same time. Results merge at the end.
When to use: independent subtasks. Research five competitors. Analyze five datasets. Generate five variants.
Real example: Claude Cowork sub-agents. Spin up parallel agents to research separate topics, then merge into one report. The constraint is independence: parallel only works when the pieces don't depend on each other.
Watch for: cost. Parallel agents can burn 5-10x the tokens of a sequential pipeline. With Sonnet 4.6 at $3 input and $15 output per million tokens [6], a parallel research session can cost more than the analyst it replaces. Set a token budget.
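A token budget can be as small as a counter that refuses the next charge. The sketch below uses the prices quoted above; the `TokenBudget` class, the per-agent token counts, and the $4.00 limit are hypothetical, and the agents are charged one after another here to keep the example simple (a truly concurrent version would need a lock around `charge`).

```python
from dataclasses import dataclass

# Prices quoted in the text for Sonnet 4.6, in USD per million tokens.
INPUT_USD_PER_MTOK = 3.0
OUTPUT_USD_PER_MTOK = 15.0


def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the prices above."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spending past the limit."""


@dataclass
class TokenBudget:
    limit_usd: float
    spent_usd: float = 0.0

    def charge(self, input_tokens: int, output_tokens: int) -> float:
        cost = cost_usd(input_tokens, output_tokens)
        if self.spent_usd + cost > self.limit_usd:
            raise BudgetExceeded(
                f"would spend ${self.spent_usd + cost:.2f} "
                f"of ${self.limit_usd:.2f}"
            )
        self.spent_usd += cost
        return cost


# Hypothetical session: five research agents, each using 200k input
# and 20k output tokens, against a $4.00 budget.
budget = TokenBudget(limit_usd=4.00)
agents_run = 0
try:
    for _ in range(5):
        budget.charge(input_tokens=200_000, output_tokens=20_000)
        agents_run += 1
except BudgetExceeded as exc:
    print(f"stopped after {agents_run} agents: {exc}")
```

With these made-up numbers each agent costs about $0.90, so the fifth one is refused. Checking before the call rather than after is the design choice that matters: you want to decline the work, not discover the overrun.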
Pattern 3: Hierarchical (Orchestrator and Workers)
A planner agent decides what to do and delegates to specialist agents. The planner reads results, replans if needed, returns the final output.
When to use: complex tasks with unknown shape. The plan emerges from the work.
Real example: Cursor's agent mode. The orchestrator decides which files to read, which tools to call, when to ask for human input. It's the most flexible pattern and the hardest to debug.
Watch for: runaway loops. Hierarchical agents can loop on themselves for hours if you don't cap iterations. Set a max-step limit and a kill switch.
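A max-step limit and a kill switch fit in a dozen lines. This is a sketch of the control loop only, not any SDK's API: `plan_next_step` is a hypothetical stand-in for the planner's model call, and the status strings are my own.

```python
import threading


def run_orchestrator(plan_next_step, max_steps=20, kill_switch=None):
    """Run a planner loop with a hard step cap and an external kill switch.

    plan_next_step(history) returns the next result, or None when finished.
    Returns (history, status) where status is "done", "killed" or "step_limit".
    """
    if kill_switch is None:
        kill_switch = threading.Event()
    history = []
    for _ in range(max_steps):
        if kill_switch.is_set():  # an operator or the user pulled the plug
            return history, "killed"
        result = plan_next_step(history)
        if result is None:  # the planner decided the task is complete
            return history, "done"
        history.append(result)
    return history, "step_limit"  # cap reached: stop instead of looping on


def looping_planner(history):
    """A planner that never finishes, to exercise the cap."""
    return f"step {len(history) + 1}"


history, status = run_orchestrator(looping_planner, max_steps=5)
print(status, len(history))  # step_limit 5
```

Because the kill switch is a `threading.Event`, a UI handler or a watchdog thread can set it while the loop runs, and the loop stops before the next planner call.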
Picking a Pattern
A simple decision tree:
| Workflow shape | Pattern |
|---|---|
| Linear, predictable | Sequential |
| Independent subtasks | Parallel |
| Adaptive, exploratory | Hierarchical |
Most production products use a mix. Cursor's agent mode is hierarchical at the top with sequential pipelines inside. Claude Code is sequential by default with parallel sub-agents on demand.
Start simple. Two-agent sequential chains beat ten-agent orchestras for most products.
MCP: The Standard That Won
When I wrote about multi-agent in 2025, every framework had its own tool-call format. By 2026 that fragmentation is over.
The Model Context Protocol (MCP), released by Anthropic in November 2024, became the universal layer for connecting agents to tools and data. Adoption numbers tell the story [3]:
- 97M monthly SDK downloads by March 2026 (970x in 16 months)
- 17,468 indexed MCP servers in Q1 2026
- OpenAI, Microsoft, Google, Amazon all adopted it
- Opera Neon ships an MCP connector for browser-level access
What MCP gives you:
- One tool format that Claude, ChatGPT, Gemini, and open-source agents all speak
- A registry of pre-built servers (databases, APIs, filesystems, internal tools)
- Local or remote transport
- Stateless HTTP transport for horizontal scaling, in review for v1.1 [7]
What it means for product teams: stop building custom tool integrations. Wrap your APIs as MCP servers and any agent can call them. We do this in the PM Toolkit codebase under src/app/api/[transport]/.
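To make "wrap your APIs as MCP servers" concrete, here is a sketch of the message shapes involved. MCP is built on JSON-RPC 2.0, and tools are exposed through the `tools/list` and `tools/call` methods. This is not a complete server: it skips the initialization handshake and any transport, and the `get_ticket` API is hypothetical. In practice you would use an official MCP SDK rather than hand-rolling the dispatch.

```python
import json


def get_ticket(ticket_id: str) -> dict:
    """Hypothetical internal API we want agents to be able to call."""
    return {"id": ticket_id, "status": "open"}


TOOLS = {
    "get_ticket": {
        "handler": get_ticket,
        "definition": {
            "name": "get_ticket",
            "description": "Look up a support ticket by id.",
            "inputSchema": {  # JSON Schema describing the arguments
                "type": "object",
                "properties": {"ticket_id": {"type": "string"}},
                "required": ["ticket_id"],
            },
        },
    }
}


def error(request_id, code, message):
    return {"jsonrpc": "2.0", "id": request_id,
            "error": {"code": code, "message": message}}


def handle(request: dict) -> dict:
    """Dispatch one JSON-RPC request to the matching tool method."""
    request_id = request.get("id")
    method = request.get("method")
    if method == "tools/list":
        result = {"tools": [t["definition"] for t in TOOLS.values()]}
    elif method == "tools/call":
        params = request["params"]
        tool = TOOLS.get(params["name"])
        if tool is None:
            return error(request_id, -32602, f"Unknown tool: {params['name']}")
        payload = tool["handler"](**params.get("arguments", {}))
        result = {
            "content": [{"type": "text", "text": json.dumps(payload)}],
            "isError": False,
        }
    else:
        return error(request_id, -32601, "Method not found")
    return {"jsonrpc": "2.0", "id": request_id, "result": result}


response = handle({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "get_ticket", "arguments": {"ticket_id": "T-42"}},
})
print(response["result"]["content"][0]["text"])
```

The point of the standard is the `definition` block: once your API is described by a name, a description, and a JSON Schema, any MCP-speaking agent can discover and call it without a custom integration.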
If you're building an agentic product in 2026 and not using MCP for tools, you're committing to maintenance work that the rest of the industry stopped doing.
The Five Agent SDKs
Five frameworks dominate by April 2026.
| SDK | Strength | Best for |
|---|---|---|
| Anthropic Agent SDK [2] | Tool-use first, sub-agents as tools | Claude-anchored products |
| OpenAI Agents SDK [2] | Tight ChatGPT and Operator integration | OpenAI-anchored products |
| Google ADK [2] | Bundled with Antigravity dev platform | Gemini-anchored products |
| LangGraph [8] | Directed graph orchestration, observability | Production at scale (Uber, LinkedIn, Klarna) |
| CrewAI [8] | Role-based crews, simple mental model | Quick prototypes, structured workflows |
How to pick:
- One model family in production: use that vendor's SDK. Less abstraction, fewer bugs.
- Multi-vendor or self-hosted: LangGraph. 47M monthly downloads, longest production track record [8].
- Prototyping: CrewAI. Fastest to a working demo.
The Anthropic Agent SDK takes the cleanest approach in my opinion. Agents are Claude models with tools, and one of the tools can be another agent. This collapses the orchestration problem into a tool-use problem the model already understands.
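The "agent as a tool" idea is easy to see in a toy model. The sketch below is not the Anthropic Agent SDK's API; the `Agent` class, `as_tool`, and the lambda "models" are all hypothetical stand-ins that only show the shape: a sub-agent is exposed as an ordinary callable in the parent's tool dictionary.

```python
from typing import Callable, Dict, Optional

Tool = Callable[[str], str]


class Agent:
    """A toy agent: a 'model' function plus the tools it may call."""

    def __init__(self, name: str,
                 respond: Callable[[str, Dict[str, Tool]], str],
                 tools: Optional[Dict[str, Tool]] = None):
        self.name = name
        self.respond = respond  # stands in for a real model call
        self.tools = tools or {}

    def run(self, task: str) -> str:
        return self.respond(task, self.tools)

    def as_tool(self) -> Tool:
        """Expose this agent as a plain tool another agent can call."""
        return self.run


# A research sub-agent with no tools of its own.
researcher = Agent("researcher", lambda task, tools: f"notes on {task}")

# A writer whose only tool happens to be another agent.
writer = Agent(
    "writer",
    lambda task, tools: "Report: " + tools["research"](task),
    tools={"research": researcher.as_tool()},
)

report = writer.run("MCP adoption")
print(report)  # Report: notes on MCP adoption
```

From the writer's side there is no orchestration layer at all, just a tool call. That is the collapse described above.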
Build vs Buy
Three honest paths in 2026:
Buy a vertical agent. Cursor for code, Claude Code for terminal-native dev, Harvey for legal, Devin for autonomous engineering tasks. If your team needs the workflow and someone shipped it, use theirs.
Build with an SDK. You own the prompts, the evals, the user experience. The vendor owns the model and the agent runtime. Best path for differentiated products.
Build from scratch. You write the orchestration loop, the retry logic, the eval harness, the cost controls. Only do this if your workflow is genuinely novel and existing frameworks block you.
In my experience, 80% of agentic products should buy or build with an SDK. Building from scratch is a tax most teams cannot afford.
Eval Design for Multi-Agent
Single-LLM evals are easy. Multi-agent evals are harder because failure can happen at any step.
Three evals you need.
1. Step-Level Evals
Each agent in your pipeline gets its own golden set. If the research agent's job is to summarize a doc, you have 50 hand-labeled summaries. If the planner's job is to break a task into steps, you have 50 task-to-steps examples.
This is the boring middle most teams skip. It's also where the regressions happen.
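A step-level eval does not need a framework to start. Here is a minimal sketch of a golden-set runner, assuming exact-match scoring (summaries would need a softer scorer). `toy_planner` and the three cases are invented for illustration; a real set would hold the 50 hand-labeled examples described above.

```python
def exact_match(actual, expected) -> float:
    """Score 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if actual == expected else 0.0


def run_golden_set(agent_fn, golden_set, score_fn=exact_match):
    """Return (pass_rate, failures) for one agent against its golden set."""
    failures = []
    for case in golden_set:
        actual = agent_fn(case["input"])
        if score_fn(actual, case["expected"]) < 1.0:
            failures.append({
                "input": case["input"],
                "expected": case["expected"],
                "actual": actual,
            })
    pass_rate = 1 - len(failures) / len(golden_set)
    return pass_rate, failures


def toy_planner(task: str):
    """Stand-in for a planner agent: splits a task on the word 'then'."""
    return [part.strip() for part in task.split(" then ")]


PLANNER_GOLDEN_SET = [
    {"input": "research then write",
     "expected": ["research", "write"]},
    {"input": "intake then triage then resolve",
     "expected": ["intake", "triage", "resolve"]},
    {"input": "plan, implement, review",  # comma phrasing the toy planner misses
     "expected": ["plan", "implement", "review"]},
]

pass_rate, failures = run_golden_set(toy_planner, PLANNER_GOLDEN_SET)
print(f"pass rate: {pass_rate:.0%}")
for failure in failures:
    print("FAILED:", failure["input"], "->", failure["actual"])
```

Returning the failures alongside the pass rate is deliberate: when a regression lands, the list of broken cases is what you actually read.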
2. Handoff Evals
When agent A passes to agent B, what context does B need? What context does B mishandle? Test the seams, not just the parts.
A common failure: agent A's output is verbose, agent B's context window fills with low-signal text, agent B misses the key information. Step-level evals would not catch this. Handoff evals do.
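The verbose-handoff failure can be tested directly at the seam. The sketch below checks agent A's output before agent B ever sees it. It assumes a whitespace word count as a rough proxy for tokens, and the 200-word limit, the function name, and the sample payloads are all hypothetical.

```python
def check_handoff(payload: str, required_facts, max_words: int = 200):
    """Return a list of problems with a handoff payload (empty means OK)."""
    problems = []
    word_count = len(payload.split())  # rough proxy for tokens
    if word_count > max_words:
        problems.append(f"payload too long: {word_count} words > {max_words}")
    lowered = payload.lower()
    for fact in required_facts:
        if fact.lower() not in lowered:
            problems.append(f"missing fact: {fact}")
    return problems


# What agent B needs from agent A in this made-up support workflow.
REQUIRED = ["order 1182", "refund"]

concise = "Customer requests a refund for order 1182; item arrived damaged."
verbose = "The customer wrote a long message. " * 80 + "They want money back."

print(check_handoff(concise, REQUIRED))
print(check_handoff(verbose, REQUIRED))
```

The second payload fails on both counts: it is too long and it buried the order number entirely, which is the case step-level evals on either agent alone would miss.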
3. End-to-End Evals
The full pipeline against a real user task. This is your task completion rate metric.
End-to-end evals are expensive to build and slow to run. Run them on every release candidate, not every commit.
The Four Metrics That Matter
I've seen teams track 20+ metrics for agentic products. Most are noise. These four predict success:
| Metric | Definition | Good | Great |
|---|---|---|---|
| Task completion rate | Users who finished the workflow / users who started | 70% | 85%+ |
| Time to first useful output | Seconds from request to actionable response | Under 30s | Under 10s |
| Retry rate | Users who reissued the same prompt within 5 min | Under 20% | Under 5% |
| Cost per completed task | (Total tokens × price per token) / completed tasks | varies | falling over time |
What's missing: per-step accuracy, model selection, prompt iteration count. Those matter for engineering. They do not predict whether the product will work for users.
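Here is one way the four metrics could be computed from session logs. Several things in this sketch are my assumptions, not definitions from the table: it treats one session as one user attempt, reports time to first useful output as a median, and prices tokens at the Sonnet 4.6 rates quoted earlier. The `Session` fields are hypothetical.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Session:
    completed: bool
    seconds_to_first_output: float
    retried_within_5_min: bool
    input_tokens: int
    output_tokens: int


def four_metrics(sessions, input_usd_per_mtok=3.0, output_usd_per_mtok=15.0):
    """Compute the four product metrics from a non-empty list of sessions."""
    n = len(sessions)
    completed = sum(s.completed for s in sessions)
    total_cost = sum(
        s.input_tokens * input_usd_per_mtok
        + s.output_tokens * output_usd_per_mtok
        for s in sessions
    ) / 1_000_000
    return {
        "task_completion_rate": completed / n,
        "median_time_to_first_output_s": statistics.median(
            s.seconds_to_first_output for s in sessions
        ),
        "retry_rate": sum(s.retried_within_5_min for s in sessions) / n,
        # Failed sessions still cost money, so they stay in the numerator.
        "cost_per_completed_task_usd":
            total_cost / completed if completed else None,
    }


sessions = [
    Session(True, 5.0, False, 100_000, 10_000),
    Session(True, 10.0, False, 100_000, 10_000),
    Session(True, 20.0, True, 100_000, 10_000),
    Session(False, 40.0, False, 100_000, 10_000),
]
metrics = four_metrics(sessions)
for name, value in metrics.items():
    print(name, value)
```

Note that the cost metric divides total spend, including failed sessions, by completed tasks only. That is what makes it move when completion rate drops, even if per-call cost is flat.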
Five Pitfalls I've Seen Kill Agentic Products
These are real, not invented.
1. Context obesity. Each agent passes everything it has to the next. Context windows fill, signal drops, accuracy crashes. Fix: pass only the keys the next agent needs.
2. No kill switch. Hierarchical agents loop. The user watches their token bill climb. Fix: hard cap on iterations and tokens per session. Show the user the meter.
3. Hidden hallucinations. Reasoning models confabulate at 33-48% on factual tasks [9]. The agent sounds confident, the user trusts it, the action is wrong. Fix: ground every factual step with retrieval, require source citations, eval the grounding.
4. Hand-offs without contracts. Agent A produces free text, agent B parses it heuristically, parsing breaks silently. Fix: use structured outputs (JSON schema) at every handoff. Both Anthropic and OpenAI SDKs support this natively.
5. No observability. When the agent fails, no one knows where. Fix: log every step with input, output, latency, cost. LangGraph and CrewAI Enterprise ship this. If you're rolling your own, build it day one.
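Pitfalls 1 and 4 share a fix: a typed contract at the handoff that rejects malformed output and drops everything the next agent does not need. The sketch below uses only the standard library rather than either vendor's structured-output feature; the `TriageHandoff` fields and severity values are hypothetical.

```python
import json
from dataclasses import dataclass

ALLOWED_SEVERITIES = {"low", "medium", "high"}
REQUIRED_KEYS = ("ticket_id", "severity", "summary")


@dataclass(frozen=True)
class TriageHandoff:
    """The only fields the resolution agent receives from triage."""
    ticket_id: str
    severity: str
    summary: str


def parse_handoff(raw: str) -> TriageHandoff:
    """Validate agent A's JSON output and keep only the contracted keys."""
    data = json.loads(raw)  # invalid JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("handoff must be a JSON object")
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    for key in REQUIRED_KEYS:
        if not isinstance(data[key], str):
            raise ValueError(f"{key} must be a string")
    if data["severity"] not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {data['severity']}")
    # Extra keys (scratchpads, reasoning traces) are dropped here on purpose.
    return TriageHandoff(**{k: data[k] for k in REQUIRED_KEYS})


raw_output = json.dumps({
    "ticket_id": "T-42",
    "severity": "high",
    "summary": "Checkout fails for saved cards.",
    "scratchpad": "long internal reasoning the next agent does not need",
})
handoff = parse_handoff(raw_output)
print(handoff)
```

A loud `ValueError` at the seam is the goal. The silent version of this failure is agent B guessing at a field that was never there.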
A 30-Day Roadmap
If you're starting from scratch.
Week 1: Prototype. Pick one workflow. Pick a pattern (default to Sequential). Wire two agents together with the Anthropic or OpenAI SDK. Use MCP for tools. Ship to internal users.
Week 2: Safety. Add a kill switch, iteration cap, token budget. Add structured-output contracts at every handoff. Add basic logging.
Week 3: Evals. Build a step-level golden set for each agent. Build one end-to-end eval. Run them on every change.
Week 4: Metrics. Wire up the four metrics. Set thresholds. Add a feature flag.
Then ship to a small percentage of real users behind the flag and watch the metrics.
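A percentage rollout behind a flag can be done with a stable hash, so the same user always lands on the same side. This is a sketch under my own assumptions, not a description of any particular flag service: the `in_rollout` name, the flag key, and the bucket scheme are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: float) -> bool:
    """Deterministically place a user in or out of a percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100  # percent=5 admits buckets 0..499


users = [f"user-{i}" for i in range(10_000)]
enabled = [u for u in users if in_rollout(u, "agent-v1", 5)]
print(f"{len(enabled) / len(users):.1%} of users see the agent")
```

Including the flag name in the hash means different flags pick different user subsets, and raising the percentage only ever adds users, so nobody loses the feature mid-rollout.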
What I'd Do Differently In 2026
If I started a new agentic product today:
- Default to Sonnet 4.6 with extended thinking off. Switch to Opus 4.7 only on tasks where the eval says it matters.
- Use MCP for every tool. Even internal ones.
- Structured outputs everywhere.
- Observability before features.
- Step-level evals before end-to-end.
- Buy the vertical agent if one exists. Compete on workflow, not on model.
The PMs winning in 2026 build tighter workflows around smaller agents, and let the evals carry the load.