Claude Fable 5 vs Opus 4.8 vs GPT-5.5: Benchmarks (2026)
Every published benchmark for Claude Fable 5, Claude Opus 4.8, and GPT-5.5 in one place — coding, agents, vision, reasoning, cost per task, and the fine print the launch posts skip.
Claude Fable 5 wins most of the benchmarks. That is not the interesting part. The interesting part is which ones it wins by 20+ points, which ones are effectively ties, where GPT-5.5 still beats it outright, and which published "Fable 5" scores are actually a different model's numbers.
We pulled every score from Anthropic's announcement and the early independent breakdowns into one place, organized by what you actually do with these models. If you read one comparison before picking a model this month, make it this one.
| | Claude Fable 5 | Claude Opus 4.8 | GPT-5.5 |
|---|---|---|---|
| Released | June 9, 2026 | Early 2026 | Late 2025 |
| Input / output per M tokens | $10 / $50 | $5 / $25 | $5 / $30 |
| Context window | Not published | 1M | 1M API (surcharge above 272K) |
| Positioning | Frontier, safeguarded Mythos-class | Previous Anthropic flagship | OpenAI flagship |
One spec gap worth flagging up front: Anthropic has not published Fable 5's context window or maximum output tokens. Anything you read stating one is guessing.
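To make the per-token prices in the spec table concrete, here is a minimal sketch that converts them into a per-request cost. The prices are copied from the table; the `PRICES` keys are illustrative labels rather than real API model IDs, and `request_cost` and the token counts are hypothetical. It ignores caching discounts and GPT-5.5's surcharge above 272K tokens.

```python
# USD per million tokens as (input, output), copied from the spec table.
# The keys are illustrative labels, not actual API model identifiers.
PRICES = {
    "fable-5": (10.0, 50.0),
    "opus-4.8": (5.0, 25.0),
    "gpt-5.5": (5.0, 30.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD of one request (no caching, no surcharges)."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 20,000 tokens in, 2,000 tokens out.
    for name in PRICES:
        print(f"{name}: ${request_cost(name, 20_000, 2_000):.2f}")
```

Because output tokens cost five to six times as much as input tokens on all three models, the ratio of input to output in your workload shifts the comparison more than the headline input price suggests.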
Two caveats before the tables, because the launch coverage mostly skipped them:
| Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 |
|---|---|---|---|
| SWE-bench Pro | 80.3% | 69.2% | 58.6% |
| FrontierCode Diamond | 29.3% | 13.4% | 5.7% |
| Terminal-Bench 2.1 | 88.0%* | 82.7% | 83.4% (Codex CLI) |
This is the clearest story in the whole comparison. On SWE-bench Pro — real GitHub issues, end to end — Fable 5 leads GPT-5.5 by 21.7 points. On FrontierCode Diamond, Cognition's deliberately brutal production-coding suite, it scores 5x GPT-5.5 and more than double Opus 4.8. These are not within-margin-of-error gaps; they are different tiers.
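The gaps quoted above mix two measures: absolute point differences and score ratios. A small sketch, using only the numbers from the coding table, shows both side by side. The `SCORES` layout and the `lead` helper are illustrative names, not part of any benchmark tooling.

```python
# Scores in percent, copied from the coding table above.
SCORES = {
    "SWE-bench Pro": {"Fable 5": 80.3, "Opus 4.8": 69.2, "GPT-5.5": 58.6},
    "FrontierCode Diamond": {"Fable 5": 29.3, "Opus 4.8": 13.4, "GPT-5.5": 5.7},
    "Terminal-Bench 2.1": {"Fable 5": 88.0, "Opus 4.8": 82.7, "GPT-5.5": 83.4},
}


def lead(benchmark: str, leader: str, other: str) -> tuple[float, float]:
    """Return (gap in percentage points, ratio of scores) for leader vs other."""
    a = SCORES[benchmark][leader]
    b = SCORES[benchmark][other]
    return round(a - b, 1), round(a / b, 2)


if __name__ == "__main__":
    for bench in SCORES:
        points, ratio = lead(bench, "Fable 5", "GPT-5.5")
        print(f"{bench}: +{points} points, {ratio}x")
```

The two views disagree on which gap is largest: SWE-bench Pro and FrontierCode Diamond differ by a similar number of points, but only the latter is a roughly fivefold ratio, because the baseline is so low.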
Terminal-Bench is the exception that proves the rule: 88.0 vs 83.4 is close, and GPT-5.5's score comes through Codex CLI, OpenAI's strongest agentic surface. On terminal-driven work with Codex, GPT-5.5 remains genuinely competitive. Everywhere else in coding, it is not currently a contest — a sharp reversal from the much closer race we documented in Claude Opus 4.8 vs GPT-5.5 just months ago.
| Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| GDPval-AA (Elo) | 1932 | 1890 | 1769 | 1314 |
| AutomationBench (tool use) | 17.4% | 15.5% | 12.9% | 9.6% |
| Legal Agent Benchmark | 13.3% | 10.4% | 2.1% | 0.0% |
GDPval-AA measures economically valuable white-collar tasks on an Elo-style scale. Fable 5's 42-point lead over Opus 4.8 is solid; its 163-point lead over GPT-5.5 is decisive. The Legal Agent numbers are striking less for the leader than for the floor — GPT-5.5 at 2.1% and Gemini at zero say long-horizon professional agent work is still mostly unsolved, and Anthropic is simply furthest along. Note how low the absolute tool-use and legal scores are across the board: nobody should be selling you a fully autonomous paralegal yet.
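To get a feel for what those Elo gaps mean, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes GDPval-AA follows the conventional Elo logistic curve with a 400-point scale, which the published figures do not confirm; treat the output as a rough intuition rather than the benchmark's own statistic. The function name is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    Assumption: the 400-point scale is the conventional chess value and may
    not match how GDPval-AA calibrates its ratings.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Ratings copied from the GDPval-AA row above.
    fable, opus, gpt = 1932, 1890, 1769
    print(f"Fable 5 vs Opus 4.8: {elo_expected_score(fable, opus):.2f}")
    print(f"Fable 5 vs GPT-5.5:  {elo_expected_score(fable, gpt):.2f}")
```

Under that assumption, a 42-point lead corresponds to being preferred a little more often than not, while a 163-point lead corresponds to being preferred in a clear majority of pairings.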
| Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| OSWorld-Verified (computer use) | 85.0% | 83.4% | 78.7% | 76.2% |
| GDP.pdf (document vision) | 29.8% | 22.5% | 24.9% | 16.7% |
| Blueprint-Bench 2 (spatial reasoning) | 38.6% | 14.5% | 36.2% | 26.5% |
Computer use is nearly saturated at the top — 85.0 vs 83.4 vs 78.7 means all three can drive a desktop competently. The interesting row is Blueprint-Bench 2: Fable 5 nearly tripled Opus 4.8's spatial reasoning score in one generation, while GPT-5.5 was already strong there. If your workload involves reading drawings, floor plans, or dense scientific figures, this generation jump matters more than the headline coding numbers.
| Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 |
|---|---|---|---|
| Humanity's Last Exam (no tools) | 59.0%* | 49.8% | 41.4% |
| Humanity's Last Exam (with tools) | 64.5%* | 57.9% | 52.2% |
| HealthBench Professional | 66.0%* | 56.9% | 51.8% |
| GPQA Diamond | — | 94.2% | 93.6% |
| ARC-AGI-2 | — | — | 85.0% |
| FrontierMath Tier 4 | — | — | 35.4% |
The asterisks matter most here. The Humanity's Last Exam and HealthBench figures are Mythos 5 numbers; the Fable 5 you can buy may answer some of those questions with an Opus 4.8 fallback, especially anything brushing against biology. Meanwhile GPQA Diamond is saturated: a 0.6-point spread between Opus 4.8 and GPT-5.5 means graduate-level science Q&A no longer differentiates these models.
And give GPT-5.5 its due: 85.0% on ARC-AGI-2 and the FrontierMath Tier 4 results are real strengths in abstract reasoning and research mathematics where Anthropic published no Fable 5 score at all. Absence of a number is not a win.
Opus 4.8 did not get worse on launch day. At exactly half Fable 5's price, it remains the value pick for everyday chat, writing, and moderate coding — 69.2% on SWE-bench Pro was state-of-the-art four months ago. It is also, ironically, guaranteed un-safeguarded: it is the fallback model, so security and bio teams hitting Fable 5's classifiers get Opus-quality answers anyway. We covered its own generational gains in Claude 4.8 vs 4.7.
| Scenario | Cheapest sensible pick | Why |
|---|---|---|
| High-volume chat / content | Opus 4.8 or GPT-5.5 | Frontier capability is wasted; price per token dominates |
| Agentic coding on hard repos | Fable 5 | 80.3% vs 58.6% means fewer failed runs — failed runs are the real cost |
| Long agent sessions, reused context | Fable 5 with caching | 90% input discount on cached tokens erodes most of the 2x premium |
| Research math / abstract reasoning | GPT-5.5 | Its strongest published results, at half Fable 5's input price and 60% of its output price |
| Document-heavy vision work | Fable 5 | 29.8% vs 24.9% on GDP.pdf, biggest generational jump in spatial reasoning |
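The "failed runs are the real cost" and caching rows can be combined into one rough number: expected spend per solved task. The sketch below is an illustration under stated assumptions, not a pricing calculator. It treats a benchmark pass rate as the per-attempt success probability on your own workload, assumes failed attempts are retried independently, and applies the 90% cached-input discount only where you pass it in, since the table does not state cache terms for the other models. All token counts are hypothetical.

```python
def attempt_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float,
                 cached_fraction: float = 0.0,
                 cache_discount: float = 0.0) -> float:
    """USD cost of one attempt; prices are USD per million tokens.

    cached_fraction is the share of input tokens served from cache, and
    cache_discount is the price reduction on those tokens (0.9 means 90% off).
    """
    fresh = input_tokens * (1.0 - cached_fraction)
    cached = input_tokens * cached_fraction * (1.0 - cache_discount)
    return ((fresh + cached) * input_price + output_tokens * output_price) / 1_000_000


def cost_per_solved_task(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend until the first success, assuming independent retries."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


if __name__ == "__main__":
    # Hypothetical agent run: 2M input tokens, 100K output tokens per attempt.
    plain = attempt_cost(2_000_000, 100_000, 10.0, 50.0)
    cached = attempt_cost(2_000_000, 100_000, 10.0, 50.0,
                          cached_fraction=0.8, cache_discount=0.9)
    # 0.803 is Fable 5's SWE-bench Pro score, used here as a stand-in success rate.
    print(f"no cache:  ${cost_per_solved_task(plain, 0.803):.2f} per solved task")
    print(f"80% cache: ${cost_per_solved_task(cached, 0.803):.2f} per solved task")
```

Run the same two functions with another model's prices and success rate to compare. The result is sensitive to the cached fraction and to how well a public benchmark score predicts success on your own repositories, so measure both before relying on it.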
Pro tip: Benchmark deltas compound in agents. A model that is 10 points better per step fails far less often across a 30-step run — which is why Fable 5's per-token premium can net out cheaper on exactly the workloads where it looks most expensive.
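The compounding effect is easy to check with a short sketch. It assumes every step succeeds independently with the same probability and that any failed step fails the whole run, which overstates the penalty for agents that can recover from errors. The per-step rates of 0.95 and 0.85 are hypothetical, chosen only to show a 10-point gap, and are not benchmark scores.

```python
def run_success_probability(step_success: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent identical steps."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


if __name__ == "__main__":
    # Hypothetical per-step success rates, 10 points apart, over a 30-step run.
    for rate in (0.95, 0.85):
        print(f"per-step {rate:.2f} -> full run {run_success_probability(rate, 30):.4f}")
```

Under these assumptions the stronger model finishes roughly one run in five while the weaker one finishes fewer than one in a hundred, so a 10-point per-step gap becomes a gap of more than an order of magnitude at the run level.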
Fable 5 is the strongest model you can use today for coding, agents, knowledge work, and vision — with margins that range from decisive (FrontierCode, Legal, GDPval) to cosmetic (OSWorld, Terminal-Bench). GPT-5.5 keeps clear wins in abstract reasoning, research math, published long-context retrieval, and price. Opus 4.8 becomes the smart default for everything that does not need the frontier. Pick by workload, not by leaderboard — and remember that on starred domains, the Fable 5 in your hands is not quite the model in the table.
For the full launch story — safeguards, Mythos 5, Project Glasswing, and real customer results — read Claude Fable 5 and Claude Mythos 5: Everything You Need to Know. For the previous generation, see Claude Opus 4.8 vs GPT-5.5 and our three-way Gemini comparison. And if you want any of these models doing more out of the box, browse installable Agent Skills.