By 2026, the agencies that have moved past prompt engineering are doing orchestration. The interesting work — research-and-brief, full content drafts, technical audits with actionable findings — is no longer a single agent with a clever prompt. It is a directed graph of specialised agents, each with one job, handing structured outputs to the next agent in the graph, with human review gates placed where they actually catch mistakes.
This playbook is the reference we use across our agency book. It specifies the seven-role agent taxonomy, maps twelve typical agency workflows onto multi-agent graphs, defines the handoff protocols between roles, and recommends the orchestration framework (LangGraph, CrewAI, or Mastra) per workflow shape.
It is not aspirational. Every workflow in the playbook ships in production for at least one agency client at the time of writing.
- 01 — Multi-agent graphs beat single-agent prompts when the workflow has more than three distinct phases. One agent doing everything degrades quality on each phase. Decomposing into specialised agents with one job each lifts quality consistently. The break-even is around three phases; below that, the orchestration overhead costs more than it adds.
- 02 — The seven-role agent taxonomy keeps the graph readable. Researcher, drafter, auditor, reviewer, deployer, router, escalator. Every agent in every workflow falls into one of these. The shared taxonomy is what makes the playbook shareable across pods and what keeps engineering reviews tractable.
- 03 — Handoffs need structured-output schemas, not prose blobs. Agent A's output becomes Agent B's input. If A outputs prose, B has to parse; parsing is unreliable; the graph becomes brittle. Structured outputs (JSON schemas with required fields) are the boring engineering choice that makes the whole pattern work.
- 04 — Human-in-the-loop gates go after the auditor, not after the drafter. Reviewers add value when there is structured feedback to give. Reviewing a raw draft is exhausting; reviewing an audited draft with surfaced issues is fast. Gate placement is the lever that determines whether HITL becomes a bottleneck.
- 05 — Pick the framework per workflow shape: LangGraph for graph-heavy, CrewAI for role-heavy, Mastra for TS stacks. There is no single 'best' framework. LangGraph wins on graph-structured durable workflows; CrewAI wins on speed-to-scaffold for role-based ones; Mastra wins on TypeScript stacks. Most agencies standardise on two — primary plus secondary — and pick per project.
01 — Premise: Why graphs, not chains.
Linear chains (prompt → output → prompt → output) are the natural first move. They scale until the workflow has any one of: a branch, a retry, a long-running step, a step that needs human input, or a step where the output of two earlier steps must merge. Most agency workflows have all five.
Graphs handle all five natively. Nodes are agents; edges are conditional routing decisions; state is persistent and checkpointed. The graph model carries more conceptual overhead than the chain model, but that overhead pays back the moment the workflow has to handle a real-world failure mode.
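To make the model concrete, here is a framework-agnostic sketch of a graph executor: nodes as plain functions, edges as a routing table keyed on (node, outcome), and state checkpointed at every transition. The node names and state fields are illustrative, not taken from any framework.

```python
import json

# Nodes are plain functions: state in, (state, outcome) out.
def research(state):
    state["sources"] = ["https://example.com/a"]  # stub source fetch
    return state, "success"

def draft(state):
    state["draft"] = f"draft from {len(state['sources'])} sources"
    return state, "success"

def audit(state):
    # Send the draft back for one revision pass, then accept (stub rubric).
    state["revisions"] = state.get("revisions", 0) + 1
    return state, ("retry" if state["revisions"] < 2 else "success")

NODES = {"research": research, "draft": draft, "audit": audit}
EDGES = {  # (node, outcome) -> next node; None terminates the run
    ("research", "success"): "draft",
    ("draft", "success"): "audit",
    ("audit", "retry"): "draft",    # a loop no linear chain can express
    ("audit", "success"): None,
}

def run(entry, state, checkpoint_path="run-checkpoint.json"):
    node = entry
    while node is not None:
        state, outcome = NODES[node](state)
        # Checkpoint at every transition: a crash resumes from here
        # instead of replaying the whole chain.
        with open(checkpoint_path, "w") as fh:
            json.dump({"node": node, "outcome": outcome, "state": state}, fh)
        node = EDGES[(node, outcome)]
    return state

print(run("research", {}))
```

The retry edge from audit back to draft is exactly the failure mode that breaks a chain; in the graph it is one more entry in the routing table.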
"We rebuilt our research-and-brief workflow from a 6-step chain into a 4-node graph. Same agents. Half the prompts. The reliability under load was the difference between flaky and shippable."— Lead engineer, agency platform team, March 2026
02 — Taxonomy: The seven-role agent taxonomy.
Every agent in every workflow falls into one of seven roles. The taxonomy is what makes the playbook portable: a researcher in the content workflow looks like a researcher in the support workflow looks like a researcher in the lead-enrichment workflow. Engineering reviews focus on whether the role is implemented correctly, not whether the role is well-defined. A minimal code encoding of the taxonomy follows the role list.
Researcher (source of truth)
Gathers raw inputs from external or internal sources. Output is always structured (citations, JSON facts, source URLs). Prompt skill: searching, reading, distinguishing primary from secondary sources.
Drafter (artifact producer)
Composes prose, code, or structured outputs from researcher inputs. Output is the artifact under review. Prompt skill: voice, structure, claim-fluency.
Auditor (quality gate)
Scores the drafter's output against a rubric, surfaces issues, suggests revisions. Output is a structured findings list. Prompt skill: rubric application, tight scoring, false-positive avoidance.
Reviewer, human or model (decision authority)
Approves, rejects, or annotates the auditor's findings. Often human-in-the-loop in regulated workflows; model-based for high-volume low-stakes flows. Output is a publish/hold/redraft decision.
Deployer (side effects)
Pushes the reviewed artifact to its destination — CMS, email tool, CRM, file store, downstream agent. Output is a deployment receipt or error.
Router (branching)
Decides which downstream branch the workflow takes. Common in triage/support workflows. Output is a routing decision (always one of N enum values).
Escalator (safety net)
Surfaces edge cases that the workflow shouldn't try to handle automatically. Output is an escalation ticket with structured context. The escape hatch that keeps multi-agent graphs from making bad calls under uncertainty.
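One way to encode the taxonomy so that every node in every workflow declares its role, sketched in Python with names of our own choosing:

```python
from enum import Enum
from typing import Any, TypedDict

class Role(str, Enum):
    RESEARCHER = "researcher"
    DRAFTER = "drafter"
    AUDITOR = "auditor"
    REVIEWER = "reviewer"
    DEPLOYER = "deployer"
    ROUTER = "router"
    ESCALATOR = "escalator"

class Handoff(TypedDict):
    # The envelope every agent emits; the payload schema varies by role.
    producer: Role            # which role produced this output
    payload: dict[str, Any]   # role-specific structured output
    sources: list[str]        # provenance, populated by researchers
```

Tagging every node with its role is what keeps engineering review focused on whether the researcher is implemented correctly, not on what the agent is supposed to be.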
03 — Twelve workflows: Twelve agency workflows mapped.
The twelve workflows below are the ones that recur across the agency book. Each entry shows the workflow, the agent roles involved, and the framework pick. Use them as a starting point and adapt the role mix to the specific engagement; workflow 1 is sketched as code after the list.
1. Research-and-brief
Researcher (multi-source) → Drafter → Auditor → Human Reviewer → Deployer (CMS). Long-running, branchy, retries on flaky sources. Durable execution required. Framework: LangGraph.
2. Content draft + revision
Researcher (light, internal) → Drafter → Auditor (rubric) → Drafter (revision) → Human Reviewer → Deployer. Loop on auditor findings until rubric ≥ 11. Framework: CrewAI for prototypes, LangGraph for production.
3. Technical SEO audit
Researcher (crawler) → Auditor (checklist) → Drafter (findings narrative) → Human Reviewer → Deployer (PDF + CMS). Output is a structured audit report. Framework: LangGraph or Mastra.
4. GEO scoring (rubric)
Researcher (multi-engine sample) → Auditor (rubric per page) → Drafter (priority list) → Deployer (dashboard). High-volume, periodic. Framework: LangGraph for state persistence.
5. Competitive intel
Researcher (competitor watch) → Auditor (signal/noise filter) → Drafter (digest) → Human Reviewer (weekly) → Deployer (Slack + email). Periodic. Framework: CrewAI or Mastra.
6. Lead enrichment
Researcher (firmographic + technographic) → Auditor (data quality) → Router (tier assignment) → Deployer (CRM). High-volume, structured. Framework: Mastra (TS, low cost).
7. Paid-ad creative generation
Researcher (audience + brand voice) → Drafter (variants) → Auditor (brand-safety + voice) → Human Reviewer → Deployer (ad platforms). Heavy multimodal use. Framework: LangGraph or CrewAI.
8. Lifecycle email composition
Researcher (segment + behaviour) → Drafter → Auditor (compliance + voice) → Reviewer (model or human) → Deployer (ESP). Mass-personalisation. Framework: Mastra (TS-native, Vercel deploy).
9. Support triage
Router (intent classification) → Researcher (knowledge base) → Drafter (response) → Reviewer (model or human, by severity) → Deployer (helpdesk). High-volume, low-latency. Framework: Mastra or CrewAI.
10. Reporting digest
Researcher (multi-source data pull) → Drafter (narrative) → Auditor (numbers vs source) → Deployer (PDF + Notion). Periodic. Framework: LangGraph.
11. Social listening
Researcher (stream listener) → Auditor (relevance filter) → Drafter (insight summary) → Router (escalate or queue) → Deployer (CRM + Slack). Continuous. Framework: LangGraph or Mastra.
12. RFP response
Researcher (past RFPs + current ask) → Drafter (sectional) → Auditor (compliance + voice) → Human Reviewer → Deployer (PDF + portal). Long-running, high-stakes. Framework: LangGraph.
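As a concrete shape, here is workflow 1 (research-and-brief) as a minimal LangGraph sketch, assuming stub node functions and an in-memory checkpointer; the human-review gate is omitted here and shown in section 05.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class BriefState(TypedDict, total=False):
    sources: list[str]
    draft: str
    findings: list[str]

# Stub nodes; in production each wraps a model call plus tools.
def research(state: BriefState): return {"sources": ["https://example.com"]}
def draft(state: BriefState):    return {"draft": "brief built from sources"}
def audit(state: BriefState):    return {"findings": []}  # empty = rubric pass
def deploy(state: BriefState):   return {}                # stub CMS push

def route_after_audit(state: BriefState) -> str:
    # Conditional edge: a clean audit deploys, findings loop back to draft.
    return "deploy" if not state["findings"] else "draft"

builder = StateGraph(BriefState)
builder.add_node("research", research)
builder.add_node("draft", draft)
builder.add_node("audit", audit)
builder.add_node("deploy", deploy)
builder.add_edge(START, "research")
builder.add_edge("research", "draft")
builder.add_edge("draft", "audit")
builder.add_conditional_edges("audit", route_after_audit,
                              {"deploy": "deploy", "draft": "draft"})
builder.add_edge("deploy", END)

# MemorySaver suits the sketch; durable execution wants a persistent
# checkpointer so long-running runs survive restarts.
graph = builder.compile(checkpointer=MemorySaver())
graph.invoke({}, config={"configurable": {"thread_id": "brief-1"}})
```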
04 — Handoffs: Handoff protocols.
A multi-agent graph is only as reliable as the handoffs between agents. Four rules consistently separate fragile graphs from reliable ones.
Structured output, not prose (hardest-won lesson)
JSON Schema · validated at the edge. Every agent's output that becomes another agent's input is a JSON object with a defined schema. Validate at the edge; reject and retry on schema violation. Prose handoffs feel natural for ~3 weeks, until the first parsing failure breaks the workflow.
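A minimal validate-at-the-edge sketch using Pydantic; the schema fields and the re-prompt helper are hypothetical.

```python
from pydantic import BaseModel, ValidationError

class ResearchOutput(BaseModel):  # hypothetical researcher schema
    facts: list[str]
    source_urls: list[str]
    confidence: float

def rerun_researcher(feedback: str) -> str:
    # Stub: in production this re-invokes the researcher with the
    # validation errors appended to its prompt.
    return '{"facts": [], "source_urls": [], "confidence": 0.0}'

def validated_handoff(raw: str, retries: int = 2) -> ResearchOutput:
    # Validate at the edge: reject and retry on schema violation,
    # so downstream agents never see malformed input.
    for attempt in range(retries + 1):
        try:
            return ResearchOutput.model_validate_json(raw)
        except ValidationError as err:
            if attempt == retries:
                raise
            raw = rerun_researcher(feedback=str(err))
```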
Idempotent retries by default (reliability default)
Deterministic IDs · checkpointed state. Every node should be safe to retry. Use deterministic task IDs so duplicate runs are detected; checkpoint state at each transition so retries resume from the last success. Idempotency is what lets the graph survive transient failures without manual intervention.
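A sketch of the deterministic-ID half of the rule; a real system would back the completed-task map with a durable store.

```python
import hashlib
import json

def task_id(node: str, inputs: dict) -> str:
    # Deterministic: same node + same inputs -> same ID, so a duplicate
    # run is detected instead of re-executed.
    digest = hashlib.sha256(
        (node + json.dumps(inputs, sort_keys=True)).encode()
    ).hexdigest()
    return f"{node}-{digest[:12]}"

completed: dict[str, dict] = {}  # stands in for a durable store

def run_idempotent(node: str, inputs: dict, fn) -> dict:
    tid = task_id(node, inputs)
    if tid in completed:       # retry after a crash: skip, don't redo
        return completed[tid]
    result = fn(inputs)
    completed[tid] = result    # checkpoint before the next transition
    return result

# Running the same task twice executes fn only once.
out1 = run_idempotent("draft", {"brief": "q3 report"}, lambda i: {"draft": "v1"})
out2 = run_idempotent("draft", {"brief": "q3 report"}, lambda i: {"draft": "v1"})
assert out1 is out2
```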
Explicit failure modes (routing clarity)
Outcome enum · always one of N. Each node returns one of a small set of outcomes (success, partial, retry, escalate), and downstream routing keys off the same enum every time. No 'unknown' outcomes; they push the graph back into manual hand-holding.
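The outcome enum in code; the route targets are illustrative.

```python
from enum import Enum

class Outcome(str, Enum):  # always one of N; there is no "unknown"
    SUCCESS = "success"
    PARTIAL = "partial"
    RETRY = "retry"
    ESCALATE = "escalate"

# Downstream routing keys off the same enum for every node.
ROUTES = {
    Outcome.SUCCESS: "reviewer",
    Outcome.PARTIAL: "drafter",     # partial output goes back for redraft
    Outcome.RETRY: "auditor",       # transient failure, re-run the node
    Outcome.ESCALATE: "escalator",  # out of the graph's depth
}

def next_node(outcome: Outcome) -> str:
    return ROUTES[outcome]  # total over the enum: no unhandled case
```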
Tool calls as side effects, not data (safety architecture)
Deployer-only · everywhere else read-only. Side-effecting tool calls (sending email, writing to a CMS, charging a card) belong in the deployer node and only the deployer node. Researchers, drafters, auditors, and reviewers should be read-only. This rule single-handedly prevents the most common class of production incident.
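One way to enforce the rule structurally rather than by convention: build each agent's tool registry from its role, so only the deployer can ever reach a side-effecting tool. Tool names and stubs are illustrative.

```python
from enum import Enum
from typing import Callable

class Role(str, Enum):
    RESEARCHER = "researcher"
    DRAFTER = "drafter"
    AUDITOR = "auditor"
    REVIEWER = "reviewer"
    DEPLOYER = "deployer"

# Stub tools; real ones would hit search APIs, the CMS, billing, etc.
def search_web(q: str) -> list[str]: return []
def read_doc(url: str) -> str: return ""
def write_cms(doc: str) -> str: return "receipt-001"

READ_ONLY: dict[str, Callable] = {"search_web": search_web, "read_doc": read_doc}
SIDE_EFFECTING: dict[str, Callable] = {"write_cms": write_cms}

def tools_for(role: Role) -> dict[str, Callable]:
    # Read-only everywhere by construction; side effects exist only
    # in the deployer's registry, so a drafter cannot even call them.
    if role is Role.DEPLOYER:
        return {**READ_ONLY, **SIDE_EFFECTING}
    return dict(READ_ONLY)
```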
05 — HITL gates: Where to place gates.
Gate placement is the single biggest determinant of whether human-in-the-loop review works as quality control or becomes a bottleneck. Two rules.
Gate after the auditor, not after the drafter
Reviewing a raw draft is exhausting (the reviewer has to identify both what to fix and how to fix it). Reviewing an audited draft with surfaced issues is fast (the reviewer makes accept/reject calls on flagged items). The same human reviewer is 4-6× more productive on audited input than raw input.
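In LangGraph terms, one way to express the gate is an interrupt placed before the node that consumes the auditor's findings; a compact sketch with stub nodes.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class S(TypedDict, total=False):
    draft: str
    findings: list[str]

b = StateGraph(S)
b.add_node("draft", lambda s: {"draft": "v1"})
b.add_node("audit", lambda s: {"findings": ["claim 2 unsourced"]})
b.add_node("deploy", lambda s: {})
b.add_edge(START, "draft")
b.add_edge("draft", "audit")
b.add_edge("audit", "deploy")
b.add_edge("deploy", END)

# Gate AFTER the auditor: the run pauses once findings are surfaced,
# so the reviewer makes accept/reject calls instead of hunting issues.
# Interrupting after the drafter would hand over a raw, unaudited draft.
gated = b.compile(checkpointer=MemorySaver(), interrupt_before=["deploy"])

cfg = {"configurable": {"thread_id": "t1"}}
gated.invoke({}, config=cfg)                    # draft -> audit, then pause
print(gated.get_state(cfg).values["findings"])  # reviewer sees audited state
gated.invoke(None, config=cfg)                  # approved: resume into deploy
```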
Two gates, not one or three
Most agency workflows benefit from exactly two gates: one before deployment (reviewer approves auditor findings), one before escalation (reviewer triages escalator output). One gate misses production safety; three gates produce reviewer fatigue and stop the workflow.
06 — Framework: Framework pick per workflow.
The framework matrix below summarises the picks across the twelve workflows. Use it as a starting point; standardise on two frameworks across the agency book to keep depth high.
LangGraph — graph + durable (production default)
Python · LangSmith default. Workflows 1, 3, 4, 7, 10, 11, 12. Anything graph-heavy, anything long-running, anything that needs durable execution and the deepest observability. Right default for 6-7 of the 12 workflows.
CrewAI — role-based (fast scaffold)
Python · fastest scaffold. Workflows 2, 5, 7, 9. Role-based delegation maps cleanly when the workflow reads as 'a crew of specialists collaborating'. Fastest from scratch; lighter on durable execution.
Mastra — TypeScript
TS · Vercel-native. Workflows 3, 5, 6, 8, 9, 11. Right default for any workflow that lives in a Next.js / Vercel-native deployment. TS type-safety on tool inputs is invaluable for high-volume structured workflows (lead enrichment, lifecycle email, support triage).
Two frameworks, not four (standard stack)
Agency-wide pick. Most agencies converge on LangGraph + Mastra or LangGraph + CrewAI as their two-framework standard. Picking 3+ frameworks spreads the team's depth too thin; picking 1 forces some workflows into the wrong shape. Two is the sweet spot.
07 — Rollout: Rolling out the playbook.
Foundation — pick two workflows, pick two frameworks
Don't try to roll out all 12 at once. Pick two workflows that are already painful (research-and-brief and content drafting are typical first picks). Pick two frameworks. Build both workflows on the chosen frameworks. The first workflow is the real cost; the second is mostly framework-template reuse.
Scale phase — add four more workflows
Once two workflows are in production, the next four come fast — most of the cost is the role taxonomy, the handoff schemas, and the deployment pipeline, all of which are now reusable. Six workflows in production at day 90 is a typical milestone.
Maturity — reach 10-12 workflows + retro
By day 120 most agencies have 10-12 workflows in production. The phase-3 retro should focus on which workflows underperformed expectations (usually because the role mix was wrong, not because the framework was wrong) and which workflows surprised on the upside.
Sustain — quarterly playbook review
Each quarter, retro the playbook: which roles need expansion, which workflows have been deprecated, which frameworks have shifted competitively. The playbook is a living document; without quarterly review it drifts within 6 months.
08 — Conclusion: Twelve workflows, seven roles.
Multi-agent graphs replace single-agent prompts the moment a workflow has more than three phases. The playbook is what makes that transition operable.
The interesting agency work in 2026 is not built on cleverer prompts. It is built on graphs of specialised agents passing structured outputs to one another, with HITL gates placed where they catch mistakes, on a framework chosen for the workflow shape rather than the brand name.
Adopt the seven-role taxonomy. Map your workflows to it. Use the handoff rules — structured output, idempotent retries, explicit outcomes, side-effects only at the deployer. Place HITL gates after the auditor, not after the drafter. Standardise on two frameworks; pick per workflow shape.
The playbook is the artifact that keeps multi-agent work shippable instead of brittle. The cost is conceptual overhead; the payoff is reliability under load. By day 120 of a rollout, most agencies have 10-12 workflows in production and have stopped writing single-agent prompts for anything non-trivial.