Is Claude Fable 5 better than GPT-5.5?

On most published benchmarks, yes — decisively in coding (80.3% vs 58.6% on SWE-bench Pro) and knowledge work (1932 vs 1769 on GDPval-AA). GPT-5.5 keeps real wins in abstract reasoning (ARC-AGI-2), research math, long-context retrieval, and price.

Is Claude Fable 5 worth double the price of Opus 4.8?

For agentic coding and long-horizon agent work, usually yes — higher per-step reliability means fewer failed runs, and the 90% prompt-caching discount erodes much of the premium. For everyday chat and writing, Opus 4.8 remains the value pick.

What do the starred benchmark scores mean?

Starred scores are from Claude Mythos 5 — the same model as Fable 5 but with safeguards lifted. On those domains (security, biology, parts of reasoning), the publicly available Fable 5 performs closer to Opus 4.8 because flagged queries fall back to it.

What is Claude Fable 5's context window?

Anthropic has not published it. Opus 4.8 and GPT-5.5 both offer 1M-token windows, and GPT-5.5 adds a surcharge above 272K input tokens. Any article stating a Fable 5 window size is guessing.

Which model is best for coding agents?

Claude Fable 5, by the largest margins in the comparison: 80.3% on SWE-bench Pro and 29.3% on FrontierCode Diamond, about 5x GPT-5.5 on the latter. The one close race is Terminal-Bench, where GPT-5.5 via Codex CLI scores 83.4% against 88.0% for Fable 5, and that 88.0% is a starred (Mythos 5) score.

Where does GPT-5.5 beat Claude Fable 5?

Abstract reasoning (85.0% on ARC-AGI-2), research mathematics (FrontierMath Tier 4), published long-context retrieval (74.0% on MRCR v2), and cost — $5/$30 vs $10/$50 per million tokens.

Are these benchmarks independently verified?

Mostly not yet. The scores come from Anthropic's launch materials and early analyses of them. Treat rankings as directionally reliable and exact margins as provisional until independent replications land.

Should I switch from Opus 4.8 to Fable 5?

If you run coding or long-horizon agents, yes — re-benchmark your workloads during the June 9-22 subscription window while Fable 5 is included free. If you mostly chat or write, Opus 4.8 at half the price is still excellent.

Claude Fable 5 vs Opus 4.8 vs GPT-5.5: Benchmarks 2026

Claude Fable 5 wins most of the benchmarks. That is not the interesting part. The interesting part is which ones it wins by 20+ points, which ones are effectively ties, where GPT-5.5 still beats it outright, and which published "Fable 5" scores are actually a different model's numbers.

We pulled every score from Anthropic's announcement and the early independent breakdowns into one place, organized by what you actually do with these models. If you read one comparison before picking a model this month, make it this one.

The lineup

	Claude Fable 5	Claude Opus 4.8	GPT-5.5
Released	June 9, 2026	Early 2026	Late 2025
Input / output per M tokens	$10 / $50	$5 / $25	$5 / $30
Context window	Not published	1M	1M API (surcharge above 272K)
Positioning	Frontier, safeguarded Mythos-class	Previous Anthropic flagship	OpenAI flagship

One spec gap worth flagging up front: Anthropic has not published Fable 5's context window or maximum output tokens. Anything you read stating one is guessing.

How to read these numbers

Two caveats before the tables, because the launch coverage mostly skipped them:

Starred scores are Mythos 5 scores. On benchmarks marked with an asterisk below, Anthropic's table reports the unrestricted Mythos 5 — the same weights as Fable 5, but with safety classifiers lifted. On those domains (security, biology, parts of reasoning), the Fable 5 you can actually buy performs closer to Opus 4.8, because flagged queries silently fall back to it.
Almost everything here is vendor-reported. Anthropic picked the benchmarks and ran the comparisons. Independent replication takes weeks. Treat the rankings as directionally real and the exact margins as provisional.

Agentic coding

Benchmark	Fable 5	Opus 4.8	GPT-5.5
SWE-bench Pro	80.3%	69.2%	58.6%
FrontierCode Diamond	29.3%	13.4%	5.7%
Terminal-Bench 2.1	88.0%*	82.7%	83.4% (Codex CLI)

This is the clearest story in the whole comparison. On SWE-bench Pro — real GitHub issues, end to end — Fable 5 leads GPT-5.5 by 21.7 points. On FrontierCode Diamond, Cognition's deliberately brutal production-coding suite, it scores 5x GPT-5.5 and more than double Opus 4.8. These are not within-margin-of-error gaps; they are different tiers.

Terminal-Bench is the exception: 88.0 vs 83.4 is close, the 88.0 is a starred score (so it reflects Mythos 5 rather than the Fable 5 you can buy), and GPT-5.5's score comes through Codex CLI, OpenAI's strongest agentic surface. On terminal-driven work with Codex, GPT-5.5 remains genuinely competitive. Everywhere else in coding, it is not currently a contest, a sharp reversal from the much closer race we documented in Claude Opus 4.8 vs GPT-5.5 just months ago.
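If you want to check the gaps quoted above yourself, here is a minimal sketch that recomputes point margins and ratios from the table. The dictionary layout and the `margin` helper are illustrative names, not anything from Anthropic's materials, and the Terminal-Bench figure for Fable 5 is the starred score.

```python
# Scores copied from the agentic coding table (percent).
# Note: the Terminal-Bench 2.1 figure for Fable 5 is a starred score.
SCORES = {
    "SWE-bench Pro": {"Fable 5": 80.3, "Opus 4.8": 69.2, "GPT-5.5": 58.6},
    "FrontierCode Diamond": {"Fable 5": 29.3, "Opus 4.8": 13.4, "GPT-5.5": 5.7},
    "Terminal-Bench 2.1": {"Fable 5": 88.0, "Opus 4.8": 82.7, "GPT-5.5": 83.4},
}


def margin(benchmark: str, leader: str, rival: str) -> tuple[float, float]:
    """Return (gap in percentage points, ratio of leader to rival)."""
    a = SCORES[benchmark][leader]
    b = SCORES[benchmark][rival]
    return round(a - b, 1), round(a / b, 2)


for bench in SCORES:
    points, ratio = margin(bench, "Fable 5", "GPT-5.5")
    print(f"{bench}: +{points} points, {ratio}x GPT-5.5")
```

Looking at both numbers matters: a 23.6-point gap on FrontierCode Diamond is a 5x ratio because the absolute scores are low, while a 4.6-point gap on Terminal-Bench is a ratio close to 1.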

Knowledge work and agents

Benchmark	Fable 5	Opus 4.8	GPT-5.5	Gemini 3.1 Pro
GDPval-AA (Elo)	1932	1890	1769	1314
AutomationBench (tool use)	17.4%	15.5%	12.9%	9.6%
Legal Agent Benchmark	13.3%	10.4%	2.1%	0.0%

GDPval-AA measures economically valuable white-collar tasks on an Elo-style scale. Fable 5's 42-point lead over Opus 4.8 is solid; its 163-point lead over GPT-5.5 is decisive. The Legal Agent numbers are striking less for the leader than for the floor — GPT-5.5 at 2.1% and Gemini at zero say long-horizon professional agent work is still mostly unsolved, and Anthropic is simply furthest along. Note how low the absolute tool-use and legal scores are across the board: nobody should be selling you a fully autonomous paralegal yet.
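To get a feel for what those Elo gaps mean, here is a rough sketch using the conventional Elo expected-score formula. This assumes GDPval-AA uses the standard 400-point logistic scale, which the launch materials summarized here do not confirm, so treat the percentages as an illustration rather than the benchmark's own definition. The function name is ours.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected head-to-head score for the higher-rated side.

    Conventional Elo formula; the 400-point scale is an ASSUMPTION,
    not something confirmed for GDPval-AA.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Rating gaps taken from the table: 1932 - 1890 and 1932 - 1769.
for rival, gap in (("Opus 4.8", 1932 - 1890), ("GPT-5.5", 1932 - 1769)):
    print(f"Fable 5 vs {rival}: gap {gap}, expected score {elo_expected_score(gap):.0%}")
```

Under that assumption, the 42-point gap works out to roughly a 56% expected score against Opus 4.8 and the 163-point gap to roughly 72% against GPT-5.5.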

Computer use and vision

Benchmark	Fable 5	Opus 4.8	GPT-5.5	Gemini 3.1 Pro
OSWorld-Verified (computer use)	85.0%	83.4%	78.7%	76.2%
GDP.pdf (document vision)	29.8%	22.5%	24.9%	16.7%
Blueprint-Bench 2 (spatial reasoning)	38.6%	14.5%	36.2%	26.5%

Computer use is nearly saturated at the top — 85.0 vs 83.4 vs 78.7 means all three can drive a desktop competently. The interesting row is Blueprint-Bench 2: Fable 5 nearly tripled Opus 4.8's spatial reasoning score in one generation, while GPT-5.5 was already strong there. If your workload involves reading drawings, floor plans, or dense scientific figures, this generation jump matters more than the headline coding numbers.

Reasoning, science, and the starred scores

Benchmark	Fable 5	Opus 4.8	GPT-5.5
Humanity's Last Exam (no tools)	59.0%*	49.8%	41.4%
Humanity's Last Exam (with tools)	64.5%*	57.9%	52.2%
HealthBench Professional	66.0%*	56.9%	51.8%
GPQA Diamond	—	94.2%	93.6%
ARC-AGI-2	—	—	85.0%
FrontierMath Tier 4	—	—	35.4%

The asterisks matter most here. The Humanity's Last Exam and HealthBench figures are Mythos 5 numbers; the Fable 5 you can buy may answer some of those questions with an Opus 4.8 fallback, especially anything brushing against biology. Meanwhile GPQA Diamond is saturated: the two published scores sit 0.6 points apart (94.2% vs 93.6%), so graduate-level science Q&A no longer differentiates these models.

And give GPT-5.5 its due: 85.0% on ARC-AGI-2 and the FrontierMath Tier 4 results are real strengths in abstract reasoning and research mathematics where Anthropic published no Fable 5 score at all. Absence of a number is not a win.

Where GPT-5.5 still wins

Cost. $5/$30 vs $10/$50 per million input/output tokens. On a typical 100K-input, 20K-output task, GPT-5.5 runs about $1.10 to Fable 5's $2.00, so Fable 5 costs about 82% more before caching.
Abstract reasoning and research math. ARC-AGI-2 and FrontierMath are its showcase results, uncontested by published Fable 5 scores.
Long-context retrieval. GPT-5.5 posts 74.0% on OpenAI's MRCR v2 at 512K-1M tokens, with a published 1M window; Fable 5's window is unpublished.
Terminal work via Codex CLI. 83.4% on Terminal-Bench keeps it within striking distance on the surface where many developers actually live.
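The cost comparison in the first item above is simple arithmetic on list prices, sketched below. Prices come from the lineup table; the `task_cost` helper and the 100K/20K token counts are the illustrative task from the text, and the sketch ignores caching and GPT-5.5's surcharge above 272K input tokens (which does not apply at 100K).

```python
# List prices in USD per million tokens (input, output), from the lineup table.
PRICES = {
    "Claude Fable 5": (10.0, 50.0),
    "Claude Opus 4.8": (5.0, 25.0),
    "GPT-5.5": (5.0, 30.0),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Uncached list-price cost of a single task, in USD."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# The illustrative task from the text: 100K input tokens, 20K output tokens.
costs = {model: task_cost(model, 100_000, 20_000) for model in PRICES}
for model, cost in costs.items():
    print(f"{model}: ${cost:.2f}")

premium = costs["Claude Fable 5"] / costs["GPT-5.5"] - 1
print(f"Fable 5 premium over GPT-5.5: {premium:.0%}")
```

Swap in your own token counts to see how the premium shifts: output-heavy tasks narrow the gap with GPT-5.5 (output is $50 vs $30), while input-heavy tasks push it toward the full 2x.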

Where Opus 4.8 still makes sense

Opus 4.8 did not get worse on launch day. At exactly half Fable 5's price, it remains the value pick for everyday chat, writing, and moderate coding; 69.2% on SWE-bench Pro was state-of-the-art four months ago. It is also, ironically, what answers when Fable 5's safeguards trigger: it is the fallback model, so security and bio teams hitting Fable 5's classifiers get Opus-quality answers anyway. We covered its own generational gains in Claude 4.8 vs 4.7.

Cost per task, not cost per token

Scenario	Cheapest sensible pick	Why
High-volume chat / content	Opus 4.8 or GPT-5.5	Frontier capability is wasted; price per token dominates
Agentic coding on hard repos	Fable 5	80.3% vs 58.6% means fewer failed runs — failed runs are the real cost
Long agent sessions, reused context	Fable 5 with caching	90% input discount on cached tokens erodes most of the 2x premium
Research math / abstract reasoning	GPT-5.5	Its strongest published results, at half the input price and 60% of the output price
Document-heavy vision work	Fable 5	29.8% vs 24.9% on GDP.pdf, biggest generational jump in spatial reasoning
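The caching row is easier to judge with numbers. The sketch below is a simplified model, not Anthropic's billing logic: it assumes the 90% discount applies only to input tokens read from cache, ignores any cache-write charges, and uses a hypothetical long session of 1M input tokens (90% of them cached) and 20K output tokens. The `session_cost` helper is an illustrative name.

```python
def session_cost(
    price_in: float,
    price_out: float,
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
    cache_discount: float = 0.90,
) -> float:
    """Session cost in USD with a discount on cached input tokens.

    ASSUMPTIONS: the discount applies to cached input reads only,
    cache writes cost nothing extra, and output is billed at list price.
    """
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    fresh = input_tokens - cached_input_tokens
    input_cost = fresh * price_in + cached_input_tokens * price_in * (1 - cache_discount)
    return (input_cost + output_tokens * price_out) / 1_000_000


# Hypothetical long agent session: 1M input tokens, 20K output tokens.
fable_uncached = session_cost(10, 50, 1_000_000, 20_000)
fable_cached = session_cost(10, 50, 1_000_000, 20_000, cached_input_tokens=900_000)
opus_uncached = session_cost(5, 25, 1_000_000, 20_000)
opus_cached = session_cost(5, 25, 1_000_000, 20_000, cached_input_tokens=900_000)

print(f"Fable 5 uncached:  ${fable_uncached:.2f}")
print(f"Fable 5 cached:    ${fable_cached:.2f}")
print(f"Opus 4.8 uncached: ${opus_uncached:.2f}")
print(f"Opus 4.8 cached:   ${opus_cached:.2f}")
```

Under these assumptions the Fable 5 session drops from $11.00 to about $2.90, below an uncached Opus 4.8 session at $5.50. Note the limit of the argument: if Opus 4.8 gets the same discount on the same cached share, it lands at about $1.45 and the 2x ratio is unchanged. Caching shrinks the premium in dollars, not as a multiple.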

Pro tip: Benchmark deltas compound in agents. A model that is 10 points better per step fails far less often across a 30-step run — which is why Fable 5's per-token premium can net out cheaper on exactly the workloads where it looks most expensive.
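The compounding claim can be sketched in a few lines. This is a deliberately crude model: it assumes every step succeeds independently with the same probability and that a single failed step sinks the run, with no retries or recovery. The 85% and 95% per-step rates are hypothetical values chosen to show a 10-point gap; they are not benchmark results.

```python
def run_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that all steps succeed, assuming independent, identical steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Hypothetical per-step success rates, 10 points apart, over a 30-step run.
for p in (0.85, 0.95):
    chance = run_success_probability(p, 30)
    print(f"per-step {p:.0%} -> 30-step run succeeds {chance:.1%} of the time")
```

Under these assumptions the 85% model finishes a 30-step run less than 1% of the time, while the 95% model finishes roughly 21% of the time. Real agents retry and recover, so actual gaps will be smaller, but the direction of the effect is the point.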

The verdict

Fable 5 is the strongest model you can use today for coding, agents, knowledge work, and vision — with margins that range from decisive (FrontierCode, Legal, GDPval) to cosmetic (OSWorld, Terminal-Bench). GPT-5.5 keeps clear wins in abstract reasoning, research math, published long-context retrieval, and price. Opus 4.8 becomes the smart default for everything that does not need the frontier. Pick by workload, not by leaderboard — and remember that on starred domains, the Fable 5 in your hands is not quite the model in the table.

Keep going

For the full launch story — safeguards, Mythos 5, Project Glasswing, and real customer results — read Claude Fable 5 and Claude Mythos 5: Everything You Need to Know. For the previous generation, see Claude Opus 4.8 vs GPT-5.5 and our three-way Gemini comparison. And if you want any of these models doing more out of the box, browse installable Agent Skills.
