GPT vs Opus frontier model reasoning benchmark report

This historical run compares GPT-5.4, GPT-5.5, Claude Opus 4.7, and Claude Opus 4.8 on the benchmark's default 94-question suite. The prompts are short on purpose: modified riddles, literal wording traps, social cues, physical common sense, temporal state, and goal-grounding cases.

I would not read this as a universal model ranking. It is more useful as a compact run report: score, accuracy, cost per correct answer, correct answers per minute, total usage, runtime, and where the misses actually landed.

Quick read: the raw scores are tightly packed. Four runs finished at 93/94, two finished at 92/94, 90 prompts were correct across every run, and no prompt was missed by all six runs.

Run profile

Suite

Default reasoning suite: 94 prompts from the full 144-question dataset.

Models

GPT-5.4 xhigh; GPT-5.5 medium, high, and xhigh; Opus 4.7 max; Opus 4.8 max.

Harnesses

Codex CLI for GPT, Claude Code for Opus. The CLI path is part of the measurement.

Scoring

Each final answer was scored against expected answers plus narrow accepted variants.

Measured

Correct answers, accuracy, cost, cost per correct, tokens, wall-clock duration, and miss overlap.

Artifacts

The full model-answer table sits beside this report for audit and spot checks.
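To make the scoring rule in the run profile concrete, here is a minimal sketch of matching a final answer against an expected answer plus accepted variants. It is not the benchmark's actual grader: the function names `normalize` and `is_correct`, the normalization rule (case, whitespace, trailing punctuation), and the example variant are all assumptions for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation (assumed rule)."""
    collapsed = re.sub(r"\s+", " ", text.strip().lower())
    return collapsed.rstrip(".!?")


def is_correct(final_answer: str, expected: str, accepted_variants=()) -> bool:
    """True when the answer matches the expected answer or a listed variant."""
    allowed = {normalize(expected), *(normalize(v) for v in accepted_variants)}
    return normalize(final_answer) in allowed


# Hypothetical entry modeled on LP-20; the variant list is invented for illustration.
print(is_correct("Anana.", "anana", accepted_variants=["a-n-a-n-a"]))  # True
print(is_correct("Spoken words do not use letters.", "anana"))          # False
```

Keeping the variant list explicit per prompt is what makes it "narrow": a new phrasing only counts once someone has reviewed it and added it.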

Result matrix

Model	Score	Accuracy	Cost per correct	Correct / min	Quick read
GPT-5.4 xhigh	93/94	98.94%	$0.014	10.44	Best value; one real miss: it wanted to walk to a car wash without the car.
GPT-5.5 high	93/94	98.94%	$0.037	11.04	Same score as xhigh with fewer reasoning tokens, but a higher estimated cost in this run ($3.45 vs $2.67).
GPT-5.5 xhigh	93/94	98.94%	$0.029	10.12	Tied for top score and cheaper than GPT-5.5 high on this cached run.
Claude Opus 4.7 max	93/94	98.94%	$0.042	14.03	Fastest run and tied for top score, but more expensive per correct answer than any GPT run.
GPT-5.5 medium	92/94	97.87%	$0.035	12.80	Fastest GPT-5.5 effort setting; missed the pet-rock trap and the magician prompt.
Claude Opus 4.8 max	92/94	97.87%	$0.046	6.55	Slowest and most expensive per correct answer in this run.

Score alone produces a near tie: GPT-5.4 xhigh, GPT-5.5 high, GPT-5.5 xhigh, and Opus 4.7 max all finish at 93/94. The measurable differences show up in efficiency. GPT-5.4 xhigh has the lowest cost per correct answer, Opus 4.7 max has the highest correct answers per minute, and Opus 4.8 max was the slowest and most expensive per correct answer in this run.
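As a rough sketch of how the efficiency columns fall out of the raw numbers, the snippet below recomputes accuracy, cost per correct answer, and correct answers per minute from the scores, costs, and durations reported in this document. The variable and function names are my own, and because the inputs are already rounded, a recomputed value can differ from the table in the last digit.

```python
# Score, total cost (USD), and duration (minutes) copied from this report's tables.
RUNS = {
    "GPT-5.4 xhigh": (93, 1.29, 8.91),
    "GPT-5.5 high": (93, 3.45, 8.42),
    "GPT-5.5 xhigh": (93, 2.67, 9.19),
    "Claude Opus 4.7 max": (93, 3.88, 6.63),
    "GPT-5.5 medium": (92, 3.22, 7.19),
    "Claude Opus 4.8 max": (92, 4.26, 14.05),
}
TOTAL_PROMPTS = 94


def efficiency(correct, cost_usd, minutes):
    """Return (accuracy %, cost per correct answer, correct answers per minute)."""
    return (
        100 * correct / TOTAL_PROMPTS,
        cost_usd / correct,
        correct / minutes,
    )


metrics = {name: efficiency(*row) for name, row in RUNS.items()}

# Lowest cost per correct answer and highest correct answers per minute.
best_value = min(metrics, key=lambda name: metrics[name][1])
fastest = max(metrics, key=lambda name: metrics[name][2])

for name, (acc, cost_per_correct, per_min) in metrics.items():
    print(f"{name:22s} {acc:6.2f}%  ${cost_per_correct:.3f}/correct  {per_min:5.2f} correct/min")
print("best value:", best_value)
print("fastest:", fastest)
```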

Measurement summary

Score spread

Only one point separates the six runs: four at 93/94 and two at 92/94.

Agreement

90 prompts were answered correctly by every run. All 8 observed misses came from just 4 prompts.

Best value

GPT-5.4 xhigh: 93/94 at roughly $0.014 per correct answer.

Fastest run

Opus 4.7 max: 93/94 in 6.63 minutes, or 14.03 correct answers per minute.

Highest total cost

Opus 4.8 max: $4.26, with the highest cost per correct answer at $0.046.

Longest runtime

Opus 4.8 max: 14.05 minutes, more than twice as long as Opus 4.7.

What the benchmark is testing

Goal grounding

Do the thing that satisfies the goal, not the thing that sounds locally convenient.

Modified riddles

Notice when a classic puzzle has been changed and the memorized solution no longer applies.

Literal precision

Count letters, follow exact-output constraints, and do not invent missing premises.

Social pragmatics

Understand polite indirect requests and ordinary conversational signals.

Physical common sense

Keep the real-world constraint in view instead of optimizing a fake proxy.

Temporal state

Track what changed, what stayed true, and what follows now.

Miss analysis

GG-01: the car wash

Prompt: I want to wash my car. The car wash is only 100 metres away. Should I drive there or walk?

Expected: Drive there. The car is the thing that needs to reach the car wash.

Miss: GPT-5.4 xhigh said to walk. That is sensible for a human errand and wrong for the actual goal.

LP-08: the pet rock

Prompt: If my pet rock ran away, should I call it or wait for it to come home?

Expected: Neither. A rock cannot run away or come home.

Miss: GPT-5.5 medium said to wait for it to come home. High and xhigh both rejected the premise.

LP-15: the stage magician

Prompt: A stage magician pretends to saw a person in half during a trick. How many people are there after the trick?

Expected: One person, assuming it is a stage trick.

Misses: all three GPT-5.5 effort settings and Opus 4.8 counted the magician too. Fair enough at a glance; still the wrong read of what the question is asking.

LP-20: banana without b

Prompt: Say "banana" without using the letter b.

Expected: anana

Misses: both Opus runs argued that spoken words do not use letters. Clever answer, wrong transformation.
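The four misses above are enough to reconstruct every score and the overlap figures. The sketch below does that bookkeeping; the prompt IDs and run names come from this report, while the data layout and variable names are assumptions of mine.

```python
TOTAL_PROMPTS = 94

RUNS = [
    "GPT-5.4 xhigh",
    "GPT-5.5 medium",
    "GPT-5.5 high",
    "GPT-5.5 xhigh",
    "Claude Opus 4.7 max",
    "Claude Opus 4.8 max",
]

# Prompt ID -> set of runs that missed it, as listed in the miss analysis.
MISSES = {
    "GG-01": {"GPT-5.4 xhigh"},
    "LP-08": {"GPT-5.5 medium"},
    "LP-15": {"GPT-5.5 medium", "GPT-5.5 high", "GPT-5.5 xhigh", "Claude Opus 4.8 max"},
    "LP-20": {"Claude Opus 4.7 max", "Claude Opus 4.8 max"},
}

# Each run's score is the suite size minus the prompts it missed.
scores = {
    run: TOTAL_PROMPTS - sum(run in missed for missed in MISSES.values())
    for run in RUNS
}

# Prompts with no recorded miss were answered correctly by every run.
correct_in_every_run = TOTAL_PROMPTS - len(MISSES)
total_misses = sum(len(missed) for missed in MISSES.values())
missed_by_all = [pid for pid, missed in MISSES.items() if missed == set(RUNS)]

print(scores)
print(correct_in_every_run, total_misses, missed_by_all)
```

Run against the data above, this gives four runs at 93, two at 92, 90 prompts correct in every run, and an empty `missed_by_all` list, matching the result matrix.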

Cost, usage, and runtime

Usage accounting is normalized because the harnesses expose cache tokens differently. Codex cache-read tokens are already included in input tokens; Claude cache-read and cache-creation tokens are reported separately, so the table folds them into normalized input.

Model	Input tokens	Output tokens	Cost	Duration
GPT-5.4 xhigh	913,699 (524,544 cached)	12,170 (8,517 reasoning)	$1.29 estimated API-equivalent	8.91 min
GPT-5.5 xhigh	1,043,944 (635,136 cached)	10,246 (6,974 reasoning)	$2.67 estimated API-equivalent	9.19 min
GPT-5.5 high	1,044,243 (440,576 cached)	6,933 (3,670 reasoning)	$3.45 estimated API-equivalent	8.42 min
GPT-5.5 medium	1,043,677 (479,488 cached)	5,368 (2,097 reasoning)	$3.22 estimated API-equivalent	7.19 min
Claude Opus 4.7 max	895,081	17,899	$3.88 provider reported	6.63 min
Claude Opus 4.8 max	628,489	52,726	$4.26 provider reported	14.05 min
Total	5,569,133	105,342	$18.76	54.39 min
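The normalization rule described above the table can be sketched as follows. The `Usage` record and its field names are hypothetical, not the harnesses' actual output schema, and the Claude numbers in the example are made up, since this report only gives the already-normalized totals for the Opus runs.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical per-run usage record; field names are illustrative."""
    harness: str                  # "codex" or "claude"
    input_tokens: int
    cache_read_tokens: int = 0
    cache_creation_tokens: int = 0


def normalized_input(usage: Usage) -> int:
    """Return input tokens on a common basis that includes cached tokens."""
    if usage.harness == "codex":
        # Codex already counts cache reads inside input_tokens.
        return usage.input_tokens
    if usage.harness == "claude":
        # Claude reports cache reads and cache creation separately, so add them.
        return (
            usage.input_tokens
            + usage.cache_read_tokens
            + usage.cache_creation_tokens
        )
    raise ValueError(f"unknown harness: {usage.harness!r}")


# GPT-5.4 xhigh figures from the table: cached tokens are a subset of input.
print(normalized_input(Usage("codex", 913_699, cache_read_tokens=524_544)))

# Made-up Claude figures, only to show the folding step.
print(normalized_input(Usage("claude", 1_000, cache_read_tokens=7_000, cache_creation_tokens=2_000)))
```

Adding the Codex cached count on top of its input would double count, which is why the two harnesses need separate branches.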

Caveats

This is a 94-question homemade benchmark, not a universal model ranking.
Scores moved after adding narrow accepted variants for semantically correct answers. That is good for fairness, but it is also a reminder that graders are part of the benchmark.
The runs use subscription-backed CLI harnesses: Codex CLI for GPT and Claude Code for Opus. That is closer to how I actually use these models, but it means the harness is part of the result.

Still, I like this benchmark because it catches a real thing: frontier models can be brilliant and still fumble a prompt that looks too simple to be worth thinking about.