Reasoning Benchmark Docs

What It Measures

The benchmark asks models to answer small prompts that look obvious but punish template matching. The scoring target is the final answer, with manual fields preserved for deeper review when a run needs audit notes.

Goal grounding

Does the response satisfy the actual user goal?

Modified riddles

Does it notice when a familiar puzzle has changed?

Literal precision

Does it follow exact wording and output constraints?

Social pragmatics

Does it infer ordinary conversational intent?

Physical common sense

Does it keep real-world constraints in view?

Temporal state

Does it track what changed and what remains true?

How To Use The Site

Run locally List suites, emit prompt packs, and create a run template. Add questions Update the dataset and suites without changing the contract. Score and evaluate Normalize model outputs and inspect the scored artifact. Publish docs Regenerate static JS payloads before GitHub Pages updates. Browse questions Filter the dataset by search, category, and suite. Historical runs Read the GPT/Opus May 2026 comparison and answer table.

Caveats

This is a compact benchmark, not a universal model ranking.
Automatic scoring is conservative and limited to final-answer correctness.
Harness choice matters because CLI adapters and provider APIs expose usage differently.
Accepted variants are part of the benchmark contract and should be reviewed with dataset changes.

Repository: source data, scripts, adapters, tests, and docs live at github.com/calvinnwq/reasoning-benchmark.