This historical run compares GPT-5.4, GPT-5.5, Claude Opus 4.7, and Claude Opus 4.8 on the benchmark's default 94-question suite. The prompts are short on purpose: modified riddles, literal wording traps, social cues, physical common sense, temporal state, and goal-grounding cases.
I would not read this as a universal model ranking. It is more useful as a compact run report: score, accuracy, cost per correct answer, correct answers per minute, total usage, runtime, and where the misses actually landed.
Run profile
- Suite: the default reasoning suite, 94 prompts from the full 144-question dataset.
- Models and effort settings: GPT-5.4 xhigh; GPT-5.5 medium, high, and xhigh; Opus 4.7 max; Opus 4.8 max.
- Harness: Codex CLI for GPT, Claude Code for Opus. The CLI path is part of the measurement.
- Scoring: each final answer was scored against expected answers plus narrow accepted variants.
- Metrics: correct answers, accuracy, cost, cost per correct answer, tokens, wall-clock duration, and miss overlap.
- Audit: the full model-answer table sits beside this report for audit and spot checks.
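To make the scoring rule concrete, here is a minimal sketch of matching a final answer against an expected answer plus accepted variants. It is not the benchmark's actual grader: the `normalize` and `is_correct` helpers, the normalization steps, and the example variant are all assumptions made for illustration.

```python
import string


def normalize(answer: str) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace (assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def is_correct(final_answer: str, expected: str, accepted_variants: list[str]) -> bool:
    """True if the final answer matches the expected answer or any listed variant."""
    candidates = {normalize(expected)}
    candidates.update(normalize(variant) for variant in accepted_variants)
    return normalize(final_answer) in candidates


# Hypothetical variant list for the car-wash prompt; the real lists are not shown here.
car_wash_ok = is_correct("Drive there.", "Drive there", ["Drive"])
car_wash_miss = is_correct("Walk", "Drive there", ["Drive"])
```

Keeping the variant list short and explicit is what makes it "narrow": a new phrasing only counts once someone has reviewed it and added it.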
Result matrix
| Model | Score | Accuracy | Cost per correct | Correct / min | Quick read |
|---|---|---|---|---|---|
| GPT-5.4 xhigh | 93/94 | 98.94% | $0.014 | 10.44 | Best value; one real miss: it wanted to walk to a car wash without the car. |
| GPT-5.5 high | 93/94 | 98.94% | $0.037 | 11.04 | Same score as xhigh with fewer reasoning tokens, but a higher estimated cost in this run ($3.45 vs. $2.67). |
| GPT-5.5 xhigh | 93/94 | 98.94% | $0.029 | 10.12 | Top score and cheaper than GPT-5.5 high on this cached run. |
| Claude Opus 4.7 max | 93/94 | 98.94% | $0.042 | 14.03 | Fastest run; also strong, but more expensive per correct answer. |
| GPT-5.5 medium | 92/94 | 97.87% | $0.035 | 12.80 | Fastest GPT-5.5 effort setting; missed the pet-rock trap and the magician prompt. |
| Claude Opus 4.8 max | 92/94 | 97.87% | $0.046 | 6.55 | Slowest and most expensive per correct answer in this run. |
Score alone produces a near tie: GPT-5.4 xhigh, GPT-5.5 high, GPT-5.5 xhigh, and Opus 4.7 max all finish at 93/94. The measurement differences show up in efficiency. GPT-5.4 xhigh has the lowest cost per correct answer, Opus 4.7 max has the highest correct answers per minute, and Opus 4.8 max was the slowest and most expensive per correct answer in this run.
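The efficiency columns are simple ratios of the totals in the cost and runtime table later in this report. The sketch below recomputes them from those rounded totals; the names `RUNS`, `derived_metrics`, and `METRICS` are mine, and the published figures were presumably computed from unrounded values.

```python
TOTAL_PROMPTS = 94

# (correct answers, total cost in USD, duration in minutes), copied from this report.
RUNS = {
    "GPT-5.4 xhigh": (93, 1.29, 8.91),
    "GPT-5.5 high": (93, 3.45, 8.42),
    "GPT-5.5 xhigh": (93, 2.67, 9.19),
    "Claude Opus 4.7 max": (93, 3.88, 6.63),
    "GPT-5.5 medium": (92, 3.22, 7.19),
    "Claude Opus 4.8 max": (92, 4.26, 14.05),
}


def derived_metrics(correct: int, cost_usd: float, minutes: float) -> dict:
    """Compute accuracy, cost per correct answer, and correct answers per minute."""
    return {
        "accuracy_pct": round(100 * correct / TOTAL_PROMPTS, 2),
        "cost_per_correct": round(cost_usd / correct, 3),
        "correct_per_min": round(correct / minutes, 2),
    }


METRICS = {name: derived_metrics(*row) for name, row in RUNS.items()}

for name, metrics in METRICS.items():
    print(name, metrics)
```

Because the inputs are already rounded, a recomputed value can differ from the matrix in the last digit (the GPT-5.5 high correct-per-minute figure is one such case). The ordering of the runs on each metric does not change.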
Measurement summary
- Only one point separates the six runs: four at 93/94 and two at 92/94.
- 90 prompts were answered correctly by every run. The observed misses came from 4 prompts.
- GPT-5.4 xhigh: 93/94 at roughly $0.014 per correct answer.
- Opus 4.7 max: 93/94 in 6.63 minutes, or 14.03 correct answers per minute.
- Opus 4.8 max: $4.26 total, with the highest cost per correct answer at $0.046.
- Opus 4.8 max: 14.05 minutes, more than twice as long as Opus 4.7.
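The overlap figures can be checked directly from the per-prompt misses listed in the miss analysis. This is a small sketch with the miss sets typed in by hand from this report; the variable names are my own.

```python
TOTAL_PROMPTS = 94

# Prompt IDs each run missed, as listed in the miss analysis of this report.
MISSES = {
    "GPT-5.4 xhigh": {"GG-01"},
    "GPT-5.5 medium": {"LP-08", "LP-15"},
    "GPT-5.5 high": {"LP-15"},
    "GPT-5.5 xhigh": {"LP-15"},
    "Claude Opus 4.7 max": {"LP-20"},
    "Claude Opus 4.8 max": {"LP-15", "LP-20"},
}

# Prompts missed by at least one run, and prompts every run got right.
missed_by_any = set().union(*MISSES.values())
correct_by_all = TOTAL_PROMPTS - len(missed_by_any)

# Per-run score implied by the miss sets.
scores = {model: TOTAL_PROMPTS - len(ids) for model, ids in MISSES.items()}

# Invert the mapping to see which runs share each miss.
runs_per_prompt = {}
for model, ids in MISSES.items():
    for prompt_id in sorted(ids):
        runs_per_prompt.setdefault(prompt_id, []).append(model)

print(sorted(missed_by_any), correct_by_all)
```

The inverted mapping shows that LP-15 is the only miss shared across both vendors, while GG-01 and LP-08 are each specific to one run.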
What the benchmark is testing
- Goal grounding: do the thing that satisfies the goal, not the thing that sounds locally convenient.
- Modified riddles: notice when a classic puzzle has been changed and the memorized solution no longer applies.
- Literal wording traps: count letters, follow exact-output constraints, and do not invent missing premises.
- Social cues: understand polite indirect requests and ordinary conversational signals.
- Physical common sense: keep the real-world constraint in view instead of optimizing a fake proxy.
- Temporal state: track what changed, what stayed true, and what follows now.
Miss analysis
GG-01: the car wash
Prompt: I want to wash my car. The car wash is only 100 metres away. Should I drive there or walk?
Expected: Drive there. The car is the thing that needs to reach the car wash.
Miss: GPT-5.4 xhigh said to walk. That is sensible for a human errand and wrong for the actual goal.
LP-08: the pet rock
Prompt: If my pet rock ran away, should I call it or wait for it to come home?
Expected: Neither. A rock cannot run away or come home.
Miss: GPT-5.5 medium said to wait for it to come home. High and xhigh both rejected the premise.
LP-15: the stage magician
Prompt: A stage magician pretends to saw a person in half during a trick. How many people are there after the trick?
Expected: One person, assuming it is a stage trick.
Misses: all three GPT-5.5 effort settings and Opus 4.8 counted the magician too. Fair enough at a glance; still the wrong read of what the question is asking.
LP-20: banana without b
Prompt: Say "banana" without using the letter b.
Expected: anana
Misses: both Opus runs argued that spoken words do not use letters. Clever answer, wrong transformation.
Cost, usage, and runtime
Usage accounting is normalized because the two harnesses expose cache tokens differently. Codex already counts cache-read tokens inside its input tokens. Claude reports cache-read and cache-creation tokens separately, so the table adds them to the input count to get a comparable normalized input figure.
| Model | Input tokens | Output tokens | Cost | Duration |
|---|---|---|---|---|
| GPT-5.4 xhigh | 913,699 (524,544 cached) | 12,170 (8,517 reasoning) | $1.29 estimated API-equivalent | 8.91 min |
| GPT-5.5 xhigh | 1,043,944 (635,136 cached) | 10,246 (6,974 reasoning) | $2.67 estimated API-equivalent | 9.19 min |
| GPT-5.5 high | 1,044,243 (440,576 cached) | 6,933 (3,670 reasoning) | $3.45 estimated API-equivalent | 8.42 min |
| GPT-5.5 medium | 1,043,677 (479,488 cached) | 5,368 (2,097 reasoning) | $3.22 estimated API-equivalent | 7.19 min |
| Claude Opus 4.7 max | 895,081 | 17,899 | $3.88 provider reported | 6.63 min |
| Claude Opus 4.8 max | 628,489 | 52,726 | $4.26 provider reported | 14.05 min |
| Total | 5,569,133 | 105,342 | $18.76 | 54.39 min |
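The normalization described above the table can be sketched as follows. The `Usage` record, the function names, and the parameter names are hypothetical and do not mirror either CLI's real output schema. The Claude numbers in the example are made up, because the report gives only the folded total.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Normalized usage record (hypothetical shape)."""
    input_tokens: int    # all input tokens, cached or not
    cached_tokens: int   # the portion of input served from cache
    output_tokens: int


def normalize_codex(input_tokens: int, cached_input_tokens: int, output_tokens: int) -> Usage:
    """Codex-style usage: cache reads are already counted inside input_tokens."""
    return Usage(input_tokens, cached_input_tokens, output_tokens)


def normalize_claude(input_tokens: int, cache_read_tokens: int,
                     cache_creation_tokens: int, output_tokens: int) -> Usage:
    """Claude-style usage: cache read and creation are separate, so fold them in."""
    total_input = input_tokens + cache_read_tokens + cache_creation_tokens
    return Usage(total_input, cache_read_tokens, output_tokens)


# GPT-5.4 xhigh row from the table: the cached count is a subset of input.
gpt = normalize_codex(913_699, 524_544, 12_170)

# Illustrative Claude numbers only; the per-field breakdown is not in this report.
claude = normalize_claude(1_000, 5_000, 2_000, 300)
```

Without this step, a Claude run would appear to use far fewer input tokens than a Codex run doing the same work, simply because its cached tokens are reported in separate fields.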
Caveats
- This is a 94-question homemade benchmark, not a universal model ranking.
- Scores moved after adding narrow accepted variants for semantically correct answers. That is good for fairness, but it is also a reminder that graders are part of the benchmark.
- The runs use subscription-backed CLI harnesses: Codex CLI for GPT and Claude Code for Opus. That is closer to how I actually use these models, but it means the harness is part of the result.
Still, I like this benchmark because it catches a real thing: frontier models can be brilliant and still fumble a prompt that looks too simple to be worth thinking about.