Fable 5 replays the matrix: slower, hungrier, and the cleanest cycle yet
A Mythos sneak peek
Anthropic released Claude Fable 5 on June 9 — the first publicly available model from its Mythos class. It's effectively the same underlying model as Claude Mythos 5, with safeguards that fall back to Opus 4.8 in sensitive areas; the less-restricted variant stays inside the invite-only Project Glasswing. The detail that matters here: through June 22 Fable 5 is included in Pro/Max plans at no extra cost — after that it requires usage credits.
A 12-day included window, and the orchestration matrix from last week still warm? Obviously I re-ran it.
Run 9: same feature, new model
Same setup as before: fazon PRD-009 (per-item edit/delete on the food log, 22 acceptance criteria), same baseline commit, monolithic cell — Fable 5 implementing at effort xhigh, Opus 4.8 reviewing. Against the directly comparable monolithic runs:
| Metric | Opus 4.8 | Sonnet 4.6 | Fable 5 |
|---|---|---|---|
| implementation wallclock | 71.8 min | 64.4 min | 94.0 min* |
| implementation quota burn | 6% | 4% | 19% |
| code-review r1 blockers | 1 | 0 | 0 |
| fix loop | killed by window reset | never reached | completed clean |
*includes a cold-build penalty worth a few minutes.
So yes: slow, and roughly 3× the quota of the comparable Opus 4.8 run (19% vs. 6%), and there's no tokenizer excuse. Fable 5 uses the same tokenizer introduced with Opus 4.7, so the Opus comparison is token-for-token; the extra burn is real work, not bigger units. (Only the Sonnet column, on the older tokenizer with ~30% smaller units, isn't strictly 1:1.)
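For anyone who wants to check the ratios, here is a minimal sketch in Python. Only the numbers come from the table; the dictionary layout and the `ratio_vs` helper are illustrative names of mine.

```python
# Figures copied from the comparison table; structure and names are illustrative.
runs = {
    "Opus 4.8": {"wallclock_min": 71.8, "quota_pct": 6},
    "Sonnet 4.6": {"wallclock_min": 64.4, "quota_pct": 4},
    "Fable 5": {"wallclock_min": 94.0, "quota_pct": 19},
}


def ratio_vs(baseline: str, metric: str, candidate: str = "Fable 5") -> float:
    """Return candidate / baseline for one metric, rounded to two decimals."""
    return round(runs[candidate][metric] / runs[baseline][metric], 2)


for baseline in ("Opus 4.8", "Sonnet 4.6"):
    print(
        f"Fable 5 vs {baseline}: "
        f"{ratio_vs(baseline, 'quota_pct')}x quota, "
        f"{ratio_vs(baseline, 'wallclock_min')}x wallclock"
    )
```

The Sonnet ratios carry the tokenizer caveat, and every wallclock ratio includes Fable's cold-build penalty, so read them as upper bounds rather than clean measurements.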
What the extra time and tokens bought:
It ran my workflow by itself. During implementation it wrote and ran per-AC XCUITests, caught two bugs in-run, and fixed them before review — the exact bug classes earlier Opus runs shipped as review blockers. The review-fix loop my pipeline implements as separate workflow steps, Fable internalized. It was noticeably more thorough about verifying acceptance criteria during execution than any previous run.
Code review came back clean. Zero blockers on a 3.7k-line diff.
And then QA found issues — a first across all nine runs. Seven real blockers, all in the agent's attention shadow: the PRD's parity surface (the same feature on day-view, which it implemented but barely verified) and Swedish n=1 pluralization. All seven were findable by simply running the full existing test suite — it ran focused subsets instead. The static review couldn't have caught them either; a context menu that doesn't appear is a runtime fact, invisible in a diff. The depth-first thoroughness that killed the old failure mode (implementation gaps) introduced a new one: verification-coverage gaps. A sharper prompt in the implementation step — run the full suite before declaring done — might plausibly have pulled these in-run too.
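The "run the full suite before declaring done" rule could also be enforced outside the prompt. A rough sketch in Python of such a gate, under assumptions: the `./run-tests` commands are hypothetical placeholders, not my pipeline's real commands, and the runner is injected so the logic can be shown with a stub.

```python
from typing import Callable, Sequence

# A runner takes an argv list and returns an exit code (0 means pass).
Runner = Callable[[Sequence[str]], int]


def may_declare_done(
    run: Runner, focused_cmd: Sequence[str], full_cmd: Sequence[str]
) -> bool:
    """Allow 'done' only if the focused tests AND the full suite both pass."""
    # Fast feedback first: the per-AC tests written for this feature.
    if run(focused_cmd) != 0:
        return False
    # The step Run 9 skipped: the entire existing suite, parity surface included.
    return run(full_cmd) == 0


# Hypothetical commands; substitute whatever actually runs your tests.
FOCUSED = ["./run-tests", "--focused"]
FULL = ["./run-tests", "--all"]


def stub_runner(argv: Sequence[str]) -> int:
    """Stub with the Run 9 shape: focused subset green, full suite red."""
    return 0 if "--focused" in argv else 1


print(may_declare_done(stub_runner, FOCUSED, FULL))  # False
```

In a real pipeline the stub would be replaced by something like `lambda argv: subprocess.run(argv).returncode`. The point of the gate is that a green focused subset is never enough on its own.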
First clean fix cycle in the matrix. Seven blockers → three root-cause commits, 51 minutes, 9% quota, zero compactions, zero scope creep, 0-blocker re-review, all original blockers green at QA round 2. Sonnet went pathological at this exact stage; Opus 4.8 never got measured through it. Full measured cycle: ~3.5 hours active, 33% quota, zero operator interventions.
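To see where the rest of the cycle went, subtract the two measured stages from the totals. This sketch assumes the full-cycle figures include the implementation and fix-loop figures, and it treats "~3.5 hours" as exactly 210 minutes, so the remainder is approximate.

```python
# Figures from Run 9. Treating the full-cycle totals as a superset of the
# implementation and fix-loop figures is an assumption.
implementation = {"minutes": 94.0, "quota_pct": 19}
fix_loop = {"minutes": 51.0, "quota_pct": 9}
full_cycle = {"minutes": 3.5 * 60, "quota_pct": 33}  # "~3.5 hours active"


def remainder(metric: str) -> float:
    """What is left of the full cycle after implementation and the fix loop."""
    return full_cycle[metric] - implementation[metric] - fix_loop[metric]


print(
    f"everything else: ~{remainder('minutes'):.0f} min, "
    f"{remainder('quota_pct')}% quota"
)
```

On those assumptions, the review and QA rounds together account for roughly an hour and about 5% of quota. Implementation dominates the bill.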
Am I switching?
No, not for now. Two reasons. First, quota per round is ~3×, and on a subscription, window headroom is the binding constraint. Second, Fable 5 is only included in my Max plan until June 22; building my defaults around a model that turns into usage credits in under two weeks is backwards.
But quality per round is the best the matrix has produced, and Run 9 hints at something bigger: a tuned single-agent prompt (independent test author + implementer required to run the full suite) might capture most of what my workflow's machinery does. That's worth knowing before the window closes — more experiments coming.
Usual caveat applies: one run per cell. This matrix has already punished single-run generalization once.