Sakana Fugu scores 73.7 on SWE-Bench Pro, challenging monolithic AI models

Sakana AI's new Fugu orchestration framework scored 73.7 on SWE-Bench Pro, outperforming Anthropic's Claude Opus 4.8 at 69.2 and OpenAI's GPT-5.5 at 58.6, by routing sub-tasks across a pool of specialized models rather than relying on a single monolithic architecture. The Tokyo-based startup's approach challenges the industry's dominant strategy of scaling ever-larger foundation models.

"Fugu dynamically orchestrates the world's best models to tackle complex tasks. We are proving that a well-orchestrated pool of swappable agents can match restricted frontier models," David Ha, chief executive officer and co-founder of Sakana AI, said in a post on X. Ha, formerly of Google Brain, co-founded Sakana in 2023 with Llion Jones, a co-author of the seminal "Attention Is All You Need" paper.

Fugu operates as a master coordinator rather than a standalone model. When presented with a complex request, it breaks the problem into sub-tasks, delegates them to a pool of expert foundation models, verifies their work, and synthesizes the final output — all behind a single OpenAI-compatible API endpoint. The system is grounded in two of Sakana's 2026 research papers, TRINITY and the Conductor, which teach the model learned coordination strategies rather than hand-designed workflows. Two variants are available: standard Fugu for everyday tasks and Fugu Ultra for high-stakes workloads like AI research and cybersecurity analysis.

The launch comes two weeks after Anthropic revoked public access to its most powerful models, Claude Mythos 5 and Claude Fable 5, following a US government export control order. That move exposed a vulnerability that enterprises and nations had long feared: access to top-tier AI can disappear overnight due to geopolitical decisions. Fugu's architecture builds native redundancy into the AI stack — if one provider faces restrictions, the system routes around the disruption. The specific models in Fugu's pool and how it coordinates them remain proprietary, but developers can opt specific providers out of the routing pool for compliance.

How Fugu's benchmarks stack up against the frontier

Fugu Ultra matched or exceeded restricted frontier models on several key benchmarks. On LiveCodeBench, which tests coding performance on regularly refreshed software problems, Fugu Ultra scored 93.2 and standard Fugu scored 92.9, both beating Anthropic's Claude Fable 5 at 89.8. On GPQA-Diamond, a test of graduate-level multiple-choice questions in biology, physics, and chemistry, both Fugu variants scored 95.5, edging out Claude Mythos Preview at 94.6.

However, Fugu is not a clean sweep. On SWE-Bench Pro, Fugu Ultra's 73.7 trailed Fable 5's 80.0 — a model currently absent from Fugu's swappable pool due to the export control order. On Humanity's Last Exam, Fugu Ultra scored 50.0 versus Fable 5's 53.3. On long-context recall (MRCRv2), OpenAI's GPT-5.5 led at 94.8 versus Fugu Ultra's 93.6. These results suggest that for brute-force reasoning within a single constrained domain, the largest standalone models still hold an edge — provided enterprises can maintain uninterrupted access.

Pricing and the economics of orchestration

Fugu Ultra is priced at $5 per million input tokens and $30 per million output tokens, placing it among the more expensive options in the market — comparable to OpenAI's GPT-5.5 at $5 and $30, respectively, and well below Anthropic's now-restricted Fable 5 at $10 and $50. However, a significant caveat exists: the background tokens consumed when Fugu delegates sub-tasks and routes between agents are not absorbed by the provider. They represent real token usage and are counted toward the final price at standard rates.

A real-world test by creative agency owner Mark Santos illustrated the tradeoffs. Tasked with building a "Crossy Road" game clone using Three.js, Fugu Ultra completed the job in 22 minutes using roughly 89,000 tokens for about $7.32, though the final game suffered from minor logic errors. Claude Opus 4.8 took 79 minutes, burned about 940,000 tokens for nearly $37.85, and required human intervention to escape a retry loop — but ultimately produced superior application design.

The orchestration landscape and what it means for investors

Fugu operates on a fundamentally different paradigm from standard routing platforms like Not Diamond, Martian, or the open-source RouteLLM framework. Those systems make a one-shot routing decision — analyzing an incoming prompt and dispatching it to a single model. Fugu, by contrast, aligns more closely with complex multi-round systems like Router-R1, breaking queries down, interleaving reasoning with delegation, and assigning sub-tasks to multiple models in parallel before synthesizing output.

The emergence of orchestration models that achieve frontier performance without brute-force compute has implications beyond any single company. Goldman Sachs' Rich Privorotsky, head of the 1-Delta desk, has identified server rental costs as a core indicator for the AI hardware investment thesis. If orchestration reduces the need for massive GPU clusters, it could pressure margins for hyperscalers and GPU suppliers. Semiconductor ETFs recorded abnormally high inflows last week, suggesting the market remains positioned for continued compute demand — a bet that orchestration models like Fugu could eventually challenge.

Sakana, which reached a $2.6 billion valuation in its Series B round in late 2025, is also seeing competitive pressure from the open-source side. Zhipu AI's GLM-5.2 scored 74.4 on the FrontierSWE benchmark, within one point of Claude Opus 4.8's 75.1, while pricing 72 percent to 82 percent below Anthropic's model. The model uses an MIT license and supports weight openness, distillation, and quantization.

Fugu is available immediately in most regions, with the temporary exception of the European Union and European Economic Area while Sakana works to align its black-box data routing architecture with GDPR regulations. Subscription tiers start at $20 per month for standard usage, with enterprise pay-as-you-go plans offering higher priority for production workloads.

For investors, the key question is whether orchestration represents a complement or a substitute for traditional compute spending. If Fugu's approach gains broad adoption, it could compress demand for the largest GPU clusters — a headwind for Nvidia and AMD. But if the market views it as an additional layer on top of existing infrastructure, it could expand the total addressable market for AI inference. The next signal will come from enterprise adoption rates and whether hyperscalers adjust their pricing in response.

This article is for informational purposes only and does not constitute investment advice.