Deep Dive

Analysis & Case Studies

Three illustrative cases, one harness experiment, and a post-mortem on 345 failures. The recurring finding: models don’t lose because they missed a clue — they lose because they reasoned flawlessly on top of a wrong premise.


The headline isn't a mean. It's the solve rate: Fable 5 actually cracks roughly one case in three — while four other frontier models cluster at one in ten, bunched within a few points of each other on mean score. Partial credit flatters everyone; “did it solve it end‑to‑end?” is the question that opens the gap. And on the one public puzzle where both Fable and Opus named the same killer, the reasoning-aware judge split them 16 vs 1 — only a judge grading the chain, not the verdict, can tell a real solve from a lucky name.


Where the reasoning breaks

Three findings from the benchmark. The hooks are spoiler-free — but the analysis names the trick and the culprit, so it's tucked behind a warning. If you'd rather solve these yourself first, head to the portal.

Example 1 · Case 07 · "RPG JUMP"

Same clue, opposite reading

Seven temples on a grid map. Every 50 minutes a teleport spell yanks every player, corpses included, to the nearest other temple. The whole puzzle hinges on one quietly devastating question: what does “nearest” mean? Both models seized on the same calibration clue, and both named the same killer. One had truly solved it; the other had reasoned beautifully over a connectivity map (which temple jumps to which) that was simply wrong.

Reveal the analysis — spoils Case 07
[Figure: RPG JUMP temple map]
The seven temples. The trick is buried in how you measure the distance between them.

The case hands you a calibration clue: from temple 106, the next jump is equally likely to land on 104, 105, or 107.

Opus 4.8
“nearest temple = shortest door-to-door travel time = straight-line distance… matches Claire's hint — model verified correct.”
Fable 5
“If you compute door-to-door distance by four-directional (grid) movement — the classic RPG walk — 106→104 = west 2000 + south 1000 = 3000m…”

Opus read the world as Euclidean and literally annotated its own model “verified correct.” Fable read it as Manhattan distance — the actual trick the author buried (the puzzle is named RPG JUMP for a reason). Both then named the same killer. But Opus arrived there over a wrong connectivity map; Fable correctly worked out the one-way temple that can hide a killer and two victims with no witnesses.
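
To make the fork concrete, here is a minimal Python sketch of how the two readings of “nearest” can pick different destinations from the same map. Only the 104 offset (west 2000 m, south 1000 m from 106) comes from Fable's quoted trace; the offsets for 105 and 107 are hypothetical values chosen for illustration, and the real case map is not reproduced here.

```python
import math

# Offsets in metres relative to temple 106, as (east, north).
# Only the 104 offset is taken from the quoted trace; the 105 and 107
# offsets are HYPOTHETICAL, picked so the two metrics disagree.
OFFSETS = {
    104: (-2000, -1000),
    105: (0, 3000),
    107: (1500, -1500),
}

def euclidean(dx, dy):
    """Straight-line distance."""
    return math.hypot(dx, dy)

def manhattan(dx, dy):
    """Four-directional grid-walk distance."""
    return abs(dx) + abs(dy)

def nearest(offsets, metric, tol=1e-9):
    """Return every temple tied for the smallest distance under `metric`."""
    dists = {temple: metric(dx, dy) for temple, (dx, dy) in offsets.items()}
    best = min(dists.values())
    return sorted(t for t, d in dists.items() if d - best <= tol)

print(nearest(OFFSETS, manhattan))  # [104, 105, 107]
print(nearest(OFFSETS, euclidean))  # [107]
```

Under these assumed coordinates the grid-walk reading produces the three-way tie the calibration clue describes, while the straight-line reading singles out one temple. The same clue can therefore confirm or contradict a model depending on which metric it silently adopted.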

Fable 5: 16/18 · Opus 4.8: 1/18

Opus got a single charity point for naming the right culprit. Everything else was built on sand.

Example 2 · Case 04 · "The Magic Door Murders"

Right clue, wrong trick

A sealed bunker has a door that shuts ~40 seconds too early each night. Both models caught the hardest leap — that the 40-second discrepancy means a time-zone difference. That's the threshold insight. From there, one reconstructed the real mechanism and fingered the right man; the other reached the same insight and then improvised a plausible-sounding structure that diverged from the truth.

Reveal the analysis — spoils Case 04

The real trick: the magic doors are paired across time zones, and a 40-second offset corresponds to ~18 km along the equator — pointing to a hidden relay door and an “in-between space” the killer uses to escape the locked room.
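
The 40-second figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes an equatorial circumference of about 40,075 km and a 24-hour day (neither number is stated in the case write-up) and converts a local-time offset into distance along the equator.

```python
# Assumed constants, not taken from the case text.
EQUATOR_KM = 40_075.0      # approximate equatorial circumference of Earth
SECONDS_PER_DAY = 86_400   # one rotation, using the 24-hour solar day

def offset_to_km(seconds):
    """Distance along the equator matching a local-time offset in seconds."""
    return EQUATOR_KM * seconds / SECONDS_PER_DAY

print(round(offset_to_km(40), 1))  # 18.6
```

Under those assumptions each second of offset is worth a little under half a kilometre, so 40 seconds lands close to the ~18 km quoted above.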

  • Fable 5: 20/20. Reconstructed the full hidden-relay trick, then used two constraints (who could hide in the bunker after 7pm, and who couldn't walk 18 km back in time) to correctly finger Bill.
  • Opus 4.8: 8/20. Took the same 40-second clue but invented a different mechanism (a fictional second secret door) and a different timeline, then accused the wrong man.

The pattern repeats: Opus reaches the threshold insight, then narrates a confident structure around it. Fable does the grind — checking each constraint until one suspect survives.

Example 3 · Case 08 · "(Not) Random Ball-Drawing"

A flawless proof of the wrong premise

Lest this read as a Fable infomercial, here's the most instructive case of all, and one where Fable loses badly. A probability puzzle whose whole edifice rests on reading two soft, observational clues correctly: what a hand gesture means, and what a girl is wearing. Pick the wrong fork, and you can still build an airtight, self-consistent, computationally verified universe… that happens to be false.

Reveal the analysis — spoils Case 08

Fable read the gesture as 12 (a class number on a jersey) and built a fully self-consistent 12/12/24 system — it even wrote a program to brute-force-verify that its system satisfied every stated constraint. The logic was airtight. The premise was wrong.

Fable 5: 5/19 · Opus 4.8: 0/19

Look at the one sub-question that's pure deduction with no soft-clue fork: Fable scored a flawless 2/2, reasoning identical to the official solution. Then every downstream question collapsed — because each inherited the misread clue. This is the same failure that sank Opus on RPG JUMP, just wearing Fable's clothes: commit to a wrong premise, then reason flawlessly on top of it. And the clue that trips you is never a logic step — it's a piece of observational flavor text.


Does the agent harness even help?

Every score above came from a model wrapped in a coding harness — a Claude Code agent that can open files, view images, and run code in a shell. So we ran the control: the same Opus 4.8, on the same 70 cases, as a single raw API call — puzzle text and images in, one answer out. No tools, no shell, no second turn.

Opus 4.8 setup | Perfect (100%) | Solved end‑to‑end (≥90%) | Mean (partial)
With the agent harness | 5.7% | 10.0% | 46.0%
Raw: one shot, no tools | 7.1% | 10.0% | 38.2%

Same model, same 70 cases. The harnessed agent gets the puzzle, its images, and a shell; the raw run gets the puzzle and images in a single API call — one answer out.
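
The three column headers can be read as summary statistics over per-case judge scores. The sketch below is one plausible way to compute them, assuming each case yields a percentage score; the function name and the two score lists are invented toy data, not benchmark results.

```python
def summarize(scores):
    """Perfect rate, end-to-end solve rate (>= 90%), and mean of percent scores."""
    n = len(scores)
    return {
        "perfect": sum(s == 100 for s in scores) / n,
        "solved": sum(s >= 90 for s in scores) / n,
        "mean": sum(scores) / n,
    }

# TOY per-case scores, made up purely to illustrate the metrics.
with_tools = [100, 60, 55, 50, 45, 40, 40, 35, 30, 25]
no_tools = [100, 60, 40, 30, 30, 25, 20, 20, 15, 10]

print(summarize(with_tools))
print(summarize(no_tools))
```

In this toy data the two runs have the same solve rate but means 13 points apart, which is the shape of result the table reports: partial credit can move while end-to-end solves stay put.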

[Figure: density curves of judge scores for Opus 4.8 with vs. without the agent harness. The harnessed curve sits higher in the mid-range; the raw curve is heavier at the low end; both share the same upper tail.]
Score density (smoothed). The harness shifts mass out of the low end into the middle — but the upper tail is unchanged. More partial credit, not more end-to-end solves.

The harness adds 7.8 points to the mean (46.0% vs. 38.2%), but look at the solve columns: the perfect and end‑to‑end rates barely move, and the raw run even edges ahead on perfect scores. So the tooling isn't cracking new cases outright; it's banking more partial credit on cases the model still doesn't fully solve. The gain is concentrated on the quantitative puzzles, where a shell lets the model verify a deduction by actually running it. Substantive shell use is otherwise rare: only about 3–5% of tool calls run or write code, and the rest read files and view images. And it cuts both ways: on narrative cases the multi‑turn tool loop can talk itself into a wrong frame and score below the one‑shot model.
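
The percentages in the table are easier to weigh as case counts. The short sketch below converts them back, assuming each rate is a rounded fraction of the 70 cases; the counts it prints are therefore inferred, not reported figures.

```python
N_CASES = 70

# Percentages copied from the table above.
RESULTS = {
    "harness": {"perfect": 5.7, "solved": 10.0, "mean": 46.0},
    "raw": {"perfect": 7.1, "solved": 10.0, "mean": 38.2},
}

def to_cases(pct, n=N_CASES):
    """Nearest whole number of cases for a rounded percentage (an inference)."""
    return round(pct / 100 * n)

for name, row in RESULTS.items():
    print(name, to_cases(row["perfect"]), to_cases(row["solved"]))

delta = round(RESULTS["harness"]["mean"] - RESULTS["raw"]["mean"], 1)
print(delta)  # 7.8
```

Read this way, the perfect-score gap is a single case (4 vs. 5 of 70) and both runs solve 7 cases end to end, so the only column that moves materially is the mean.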


Why they fail — it isn't the clues

We ran an LLM post-mortem on every sub-85% solve — 345 of them across all six model/harness configurations — handing the judge the official solution, the grade, and the model's own reasoning trace, and asking it to name exactly where the chain snapped.

The verdict is strikingly consistent across models: they had the clues in hand and reasoned their way to the wrong place. Almost nobody loses for failing to see the evidence — they lose on the deduction built on top of it. Which is exactly why the harness helps only where it can check that deduction.

Model / setup | Flawed deduction | Wrong conclusion | Incomplete | Missed clue
Fable 5 | 25 | 12 | 6 | 0
Opus 4.8 | 28 | 26 | 5 | 2
Opus 4.6 | 34 | 17 | 4 | 1
GPT‑5.5 | 28 | 15 | 14 | 1
Gemini 3.1 Pro | 30 | 25 | 5 | 2

Counts of cases by the single most important failure mode (only solves scoring <85% are analyzed, so the totals differ by model). GPT‑5.5 stands apart: far more incomplete reasoning (14) and fewer wrong conclusions — it tends to stop short rather than commit to a wrong answer. Everyone else mostly fails by committing to a deduction that doesn't hold.
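
The claim that missed clues are rare can be checked directly against the table. The sketch below tallies each failure mode across the five listed setups; it assumes the counts read 25/12/6/0 for Fable 5, 28/26/5/2 for Opus 4.8, and so on, and the variable names are illustrative only.

```python
MODES = ("flawed deduction", "wrong conclusion", "incomplete", "missed clue")

# Counts per setup in MODES order, as read from the table above.
COUNTS = {
    "Fable 5": (25, 12, 6, 0),
    "Opus 4.8": (28, 26, 5, 2),
    "Opus 4.6": (34, 17, 4, 1),
    "GPT-5.5": (28, 15, 14, 1),
    "Gemini 3.1 Pro": (30, 25, 5, 2),
}

def mode_totals(counts):
    """Sum each failure mode across all listed setups."""
    return dict(zip(MODES, (sum(col) for col in zip(*counts.values()))))

totals = mode_totals(COUNTS)
listed = sum(totals.values())
for mode, n in totals.items():
    print(f"{mode:17s} {n:4d}  {n / listed:6.1%}")
```

On this reading, flawed deductions and wrong conclusions together account for 240 of the 280 listed failures (about 86%), while missed clues account for 6 (about 2%).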


Now it's your turn.

Eight cases. Every clue on the page. No answer key in sight. Read one, out-deduce the machine, and send us your reasoning — it goes straight to the benchmark authors.