Sakana AI — the Tokyo research lab best known for evolutionary model-merging and the “AI Scientist” line of work — has moved Sakana Fugu from beta into a commercial launch. Fugu is a multi-agent orchestration system that you talk to as if it were a single model: one OpenAI-compatible endpoint, behind which Fugu dynamically assembles and coordinates a pool of frontier models to solve a task. It is the most concrete answer yet to a question the whole industry has been circling — what if the next frontier is not a bigger model, but a model that runs other models?

The framing is what made it go viral. Sakana positions Fugu as standing “shoulder-to-shoulder” with limited-access frontier systems like Anthropic's Fable 5 and Mythos Preview while “delivering frontier capability without the risk of export controls.” That is a product pitch wrapped around a geopolitical argument — and it deserves to be read both for what is genuinely new and for the asterisks the marketing skips.

If you want the conceptual background first, our guide to multi-agent orchestration patterns that actually work and the LLM model-routing engineering guide cover the building blocks Fugu productizes. This post is about what Sakana actually shipped, and how a senior team should weigh it.

Key takeaways

01
Fugu is a model that orchestrates models.You send a request to one OpenAI-compatible endpoint. Internally, Fugu — itself an LLM — selects models, delegates sub-tasks, verifies, and synthesizes a single answer, calling a pool of other LLMs and even instances of itself. The complexity of a multi-agent system never reaches your code.
02
Two tiers, one integration.Fugu balances performance and latency for everyday work; Fugu Ultra coordinates a deeper expert pool for hard, high-stakes problems (Kaggle, paper reproduction, cybersecurity, patent search). Both sit behind the same API, so you can switch without changing your integration.
03
The benchmarks are strong but vendor-reported — and not a clean sweep.Sakana reports Fugu Ultra leading several reasoning and coding benchmarks, but Fable 5 tops SWE-Bench Pro and Humanity's Last Exam, GPT-5.5 leads MRCRv2, and Opus 4.8 leads the CTI-REALM security benchmark. Treat every figure as Sakana's own number until independent evals land.
04
The headline pitch is 'AI sovereignty'.Sakana argues that depending on one vendor for critical infrastructure is a 'material vulnerability', and that because Fugu orchestrates swappable agents it can 'route around' provider restrictions. It is a real point about concentration risk — but a hedge that still rents its intelligence from those same vendors.
05
Pricing is approachable; EU access is not.Subscriptions run $20 / $100 / $200 per month (both models included); pay-as-you-go Fugu Ultra is $5 input / $30 output per 1M tokens. Fugu is not available in the EU/EEA at launch while Sakana works toward GDPR compliance.

01 — The productA multi-agent system, delivered as one model.

The simplest way to understand Fugu is by what it removes. Teams that want the “best model for each step” today wire that up themselves — picking a router, defining which model writes code, which one checks math, which one stitches the chain of thought together, and how results get verified. That work is real, and frameworks like LangGraph and CrewAI exist precisely to manage it (we compare them in our orchestration frameworks guide).

Fugu's claim is that you no longer have to. According to Sakana, Fugu “learns to dynamically assemble agents from a pool and coordinate them through non-obvious but highly efficient collaboration patterns” — rather than relying on hand-designed roles or workflows. It solves a task directly when that is enough, and coordinates a team of expert models when a problem calls for more. You can also opt specific providers or models out of the pool to meet data, privacy, or compliance requirements. For a sense of the design space Fugu is automating, our agent orchestration workflows guide walks through the patterns by hand.

Access

OpenAI-compatible endpoint

1API

You integrate against a single standard endpoint. Model selection, switching, and multi-agent coordination all happen server-side — your code sees one model.

Drop-in for existing OpenAI clients

Models

Fugu + Fugu Ultra

A balanced everyday tier and a maximum-quality tier, both reachable through the same API. Switch between them without changing your integration.

Same interface, two tiers

Control

Choose who is in the pool

Opt-out

Exclude specific providers or models from Fugu's agent pool to satisfy data-residency, privacy, or organizational policy. The pool is configurable, not fixed.

Compliance lever

Foundation

ICLR 2026 research

2papers

Grounded in two peer-reviewed papers on learned model orchestration — TRINITY and the Conductor — not just a prompt-routing wrapper.

TRINITY + Conductor

02 — ArchitectureHow a model learns to command other models.

The mechanism is the genuinely interesting part. Fugu is not a rules engine that maps keywords to models. In Sakana's words, Fugu “is itself a language model trained to call various LLMs in an agent pool, including instances of itself recursively.” That recursion matters: Fugu can decompose a hard task, spin up specialist models for the sub-parts, call a fresh instance of itself to manage a sub-problem, and then verify and synthesize the pieces into one response — all without that machinery surfacing in your request.

The approach is built on two ICLR 2026 papers. TRINITY uses a lightweight evolved coordinator that assigns Thinker, Worker, and Verifier roles to delegate work across coding, math, reasoning, and knowledge tasks. The Conductor is trained with reinforcement learning to discover natural-language coordination strategies — effectively learning how agents should talk to each other and what focused prompts make a diverse pool of LLMs outperform any single worker. Productized together, they become a single endpoint that manages selection, delegation, verification, and synthesis on your behalf.

Sakana AI — Fugu launch blog, June 2026

“Fugu manages model selection, delegation, verification, and synthesis internally, so the complexity of a multi-agent system never reaches your code.” The design echoes the in-harness dynamic workflows trend we covered for Claude Opus 4.8 — but Sakana packages the orchestration as the model itself, not as a client-side harness. Source: sakana.ai/fugu.

03 — The lineupFugu vs Fugu Ultra: pick the tier, keep the API.

Both models ship through one OpenAI-compatible API; the difference is how deep a pool each coordinates and what it optimizes for. The split mirrors the now-familiar “fast default vs flagship” pattern, applied to orchestration depth rather than raw model size.

Fugu

Balanced everyday default

Performance + low latency

Tuned for the everyday work most teams actually run: it drops into coding tools like Codex for code and review, and powers responsive chatbots and interactive services. You can opt specific agents out of its pool to meet data and compliance constraints.

Coding · review · chat

Fugu Ultra

Maximum answer quality

Deeper expert pool, hard problems

Coordinates a deeper pool of expert agents to maximize quality on hard, high-stakes, multi-step problems. Sakana cites early users running it for Kaggle competitions, paper reproduction, cybersecurity analysis, and literature and patent investigations.

Research · security · analysis

04 — The numbersBenchmarks: shoulder-to-shoulder, not a clean sweep.

Sakana published results across eleven engineering, scientific, and reasoning benchmarks, comparing Fugu and Fugu Ultra against publicly accessible frontier models — Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 — plus Anthropic's limited-access Fable 5 and Mythos Preview, which are not in Fugu's pool because they are not publicly accessible. Two caveats before any number: these are Sakana's own results, the baselines use provider-reported scores, and the SWE-Bench Pro figures used the mini-swe-agent scaffolding. Read them as a vendor self-report pending independent replication — the same standard we applied to the Opus 4.8 independent-eval roundup.

The honest read is more interesting than “Fugu wins.” On its own headline coding benchmark, SWE-Bench Pro, Fugu Ultra (73.7) lands ahead of Opus 4.8 (69.2), GPT-5.5 (58.6), and Gemini 3.1 Pro (54.2) — but trails Fable 5, the very model it cannot include in its pool.

SWE-Bench Pro — Fugu vs frontier baselines (vendor-reported)

Source: Sakana AI Fugu benchmark report, June 2026 (SWE-Bench Pro, mini-swe-agent scaffolding; baselines provider-reported)

Fable 5Anthropic — not in Fugu's pool (limited access)

80.0

Fugu UltraSakana — flagship tier

73.7

Opus 4.8Anthropic — provider-reported

69.2

FuguSakana — balanced tier

59.0

GPT-5.5OpenAI — provider-reported

58.6

Gemini 3.1 ProGoogle — provider-reported

54.2

Across the rest of the suite, Fugu Ultra leads on several reasoning and coding tests — GPQA-D (95.5, tied with the balanced Fugu), LiveCodeBench (93.2), LiveCodeBench Pro (90.8), and TerminalBench 2.1 (82.1) — and edges Opus 4.8 on Humanity's Last Exam (50.0 vs 49.8). But the wins are not universal: Fable 5 tops both SWE-Bench Pro and Humanity's Last Exam (53.3); GPT-5.5 leads the MRCRv2 long-context-recall test (94.8 vs Fugu Ultra's 93.6); and Opus 4.8 edges out the field on the CTI-REALM cybersecurity benchmark (69.6). One quirk worth noting: on SciCode and a few others the balanced Fugu actually scores higher than Fugu Ultra, so “more orchestration” is not always better.

The most defensible reading: Fugu and Fugu Ultra are credibly in the frontier conversation on Sakana's own numbers, and an orchestrated pool can plausibly match or beat any single model it contains. Whether it beats the models it cannot contain — Fable 5, Mythos — is the claim to hold most loosely until third parties run it.

Read benchmarks like an operator

Vendor-run benchmarks tell you the ceiling under ideal conditions, not your median result. Before trusting any of these figures for a production decision, wait for independent evaluations and — better — run the two or three benchmarks that resemble your workload on a representative slice of your own traffic. Provider-reported scores for the baselines also mean the comparison is apples-to-oranges on harness and effort settings.

05 — The pitch“Beyond bigger models”: the sovereignty argument — and its asterisks.

Sakana frames Fugu with an unusually political argument. Progress, it says, has been driven by “giant, monolithic models”, but “the most powerful AI systems will not be isolated monoliths, but collaborative ecosystems.” The new twist is the geopolitics: orchestration, in Sakana's telling, “is no longer just a technical optimization; it has become a geopolitical and operational imperative,” because “relying on a single company's APIs for critical infrastructure, finance, or governance is a material vulnerability.” The payoff line: because Fugu orchestrates swappable agents, “if a single provider restricts access, Fugu dynamically routes around the disruption,” delivering “the resilient blueprint required for AI sovereignty.” Sakana points to recent export controls on models like Fable and Mythos as the proof that access “can disappear overnight.”

There is a real point buried in the marketing. Single-vendor dependency is a genuine operational risk — anyone who has had a model deprecated, rate-limited, or repriced mid-project knows the cost of a hard dependency. A diverse, swappable pool is a sensible hedge, and it is the same instinct behind aggregators we have covered, like OpenRouter's multi-model responses. But the “sovereignty” claim carries three asterisks worth naming plainly.

One: the hedge still rents its intelligence. Fugu routes around the loss of any one provider — but its capability is the pool, and the pool is other companies' models accessed through their APIs. A broad restriction, not a single one, shrinks the pool. The resilience comes from diversity, not from independence.

Two: the terms-of-service question is unresolved. Orchestrating and reselling access to third-party proprietary models through one endpoint sits in a grey area of each provider's usage terms. That is a contractual and compliance question every adopter inherits, not just Sakana.

Three: it benchmarks against what it cannot use. The two systems Fugu claims to stand shoulder-to-shoulder with — Fable 5 and Mythos Preview — are exactly the ones excluded from its pool. “Matching” them is therefore a claim about substitutes, not a way to get their output.

An orchestration layer is a real hedge against single-vendor dependency — but a layer that rents its intelligence from those same vendors is a softer hedge than the marketing implies. Resilience comes from the diversity of the pool, not from any one swap.Digital Applied analysis, June 22, 2026

06 — Cost + accessPricing, and the EU/EEA gap.

Fugu is sold two ways. Pay-as-you-go is aimed at heavier production workloads. For Fugu, you pay the standard rate of the underlying model that handled the request; when multiple agents are active, Sakana says it does not stack model fees — you are charged a single rate based on the top-tier model involved. Fugu Ultra carries fixed pricing on the fugu-ultra-20260615 snapshot: $5 input / $30 output per 1M tokens, rising to $10 / $45 once context passes 272K, with cached input at $0.50 ($1.00 above 272K). That output rate sits in premium frontier territory — roughly in line with the pricier flagships rather than the budget tier.

Subscriptions suit individuals and hands-on daily use: Standard at $20/month, Pro at $100/month (10× the Standard allowance), and Max at $200/month (20×). Every tier includes both Fugu and Fugu Ultra. Sakana is also offering a free second month at your initial tier if you subscribe before the end of July 2026.

The hard constraint is geographic. Fugu is not available in the EU/EEA at launch while Sakana works toward GDPR and EU-specific compliance. For European-resident operations, that makes Fugu a non-starter today regardless of how the benchmarks shake out — a meaningful gap given how much of the "sovereignty" argument is pitched at exactly the kind of regulated, critical-infrastructure buyer the EU rules are written for.

07 — Due diligenceWhat to scrutinize before you route production through it.

None of the following is a reason to dismiss Fugu — it is a strong, research-backed launch. They are the questions a senior team should answer before putting customer-facing or high-stakes work behind it.

Benchmarks are unverified. Every figure is vendor-reported. Independent evaluation is the gate, not the launch post.

Data routing and ToS. By default a request may touch several providers. Use the agent opt-out controls to keep regulated data away from models or regions you have not cleared, and get clarity on how the orchestration interacts with each upstream provider's terms.

Cost and latency predictability. Fan-out across multiple models can widen the distribution of both. The PAYG "single top-tier rate" policy softens cost, but Fugu Ultra's fixed premium output rate still applies to the synthesis you receive — model your real workload, not the cheapest path.

Observability. When a model picks the models, you give up some control over which system produced which answer. Sakana indicates you can inspect usage and which models Fugu used; confirm the granularity of that attribution in your plan before you rely on it for audits or debugging. Our agent observability guide covers what good looks like.

The concentration irony. Adopting Fugu to reduce vendor dependency adds a new dependency — on Sakana's orchestrator. That can still be net-positive, but it is a trade, not an escape.

08 — The takeawayWhat it means for your stack — and when to reach for it.

The bigger signal is not this one product. It is that “model orchestration as a product” is now a real category, sitting alongside three approaches teams already use: aggregator-style routing, do-it-yourself frameworks like LangGraph and CrewAI, and in-harness dynamic workflows from the model vendors themselves. The useful question is which of these fits a given problem.

Reach for an orchestration model

When quality on hard, varied tasks beats control

Your workload spans coding, reasoning, and research; you value top-end answer quality over deterministic control; you want one integration and are comfortable with a managed black box. A hosted orchestrator like Fugu Ultra can earn its premium here — provided your data and region constraints are met.

Best for mixed, high-stakes work

Route it yourself

When you need control and cost discipline

You have a known task mix, tight cost targets, and a team that can own routing logic and observability. Aggregator routing or a LangGraph/CrewAI build gives you transparency over which model runs when, and keeps the orchestration in your codebase where you can tune and audit it.

Best for predictable, owned pipelines

Just use one strong model

When the task is narrow and well-served

A single frontier model handles most single-domain work well and is the easiest thing to reason about, price, and debug. Orchestration adds value when one model genuinely leaves quality on the table — not by default. Start here and add complexity only when the data says to.

Best for focused, single-domain work

For most operators, the move this week is not to rip out a working stack — it is to register that orchestration has matured from a pattern you build into a product you can buy, and to add Fugu to the short list you benchmark against your own traffic when single-model quality plateaus. We help teams make exactly this call — comparing a managed orchestrator against owned routing on cost, quality, latency, and compliance — through our AI transformation and web development work. The right answer is almost always workload-specific, and it changes as the models do.

Conclusion

A genuinely new product, wrapped in a claim worth holding at arm's length.

Sakana Fugu is the most concrete bet yet that the next gains come from coordinating models rather than scaling one — and the research behind it (TRINITY, the Conductor) is real, peer-reviewed, and on-brand for a lab that has spent years on collective-intelligence methods. Delivering all of that behind a single OpenAI-compatible API is a genuinely useful piece of productization.

The part to keep at arm's length is the marketing. The benchmarks are vendor-reported and not a clean sweep; the “route around export controls” sovereignty pitch is a real point about concentration risk dressed in stronger language than the underlying dependency supports; and EU/EEA teams cannot use it at all yet. Treat Fugu as a serious new option to benchmark against your own workload — not as a finished answer to vendor lock-in. The direction of travel, though, is hard to argue with: orchestration is no longer a side technique. It is becoming the product.

Sakana Fugu: a model that orchestrates models.