Reasoning Benchmark

Instructional docs for a compact reasoning benchmark

A pure-Python benchmark for short prompts that expose goal-grounding, literal-precision, social-pragmatics, physical-common-sense, and temporal-state mistakes in frontier model runs.

144 questions in the dataset
94 questions in the default auto-scored public slice
4 suite selectors in the docs payload
0 runtime dependencies beyond Python stdlib

What It Measures

The benchmark asks models to answer small prompts that look obvious but punish template matching. The scoring target is the final answer, with manual fields preserved for deeper review when a run needs audit notes.
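
As a rough illustration of final-answer scoring, the sketch below marks a response correct only when its normalized final answer matches one of the accepted answers, and carries the manual review fields through unchanged. It is not taken from the repository: `normalize`, `ScoredAnswer`, `score_final_answer`, and the normalization rules are all hypothetical names and choices made for this example.

```python
import re
from dataclasses import dataclass, field


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing sentence punctuation."""
    text = text.strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".!?")


@dataclass
class ScoredAnswer:
    # Hypothetical record layout; the real dataset schema may differ.
    question_id: str
    final_answer: str
    correct: bool
    manual: dict = field(default_factory=dict)  # audit notes, never read by the scorer


def score_final_answer(question_id, final_answer, accepted, manual=None):
    """Score only the final answer; copy manual fields through untouched."""
    accepted_set = {normalize(a) for a in accepted}
    correct = normalize(final_answer) in accepted_set
    return ScoredAnswer(question_id, final_answer, correct, dict(manual or {}))
```

Exact matching after light normalization is one way to keep automatic scoring conservative: an answer that merely contains an accepted string, such as "Yes, probably", is not counted as correct.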

Goal grounding

Does the response satisfy the actual user goal?

Modified riddles

Does it notice when a familiar puzzle has changed?

Literal precision

Does it follow exact wording and output constraints?

Social pragmatics

Does it infer ordinary conversational intent?

Physical common sense

Does it keep real-world constraints in view?

Temporal state

Does it track what changed and what remains true?

How To Use The Site

Caveats

  • This is a compact benchmark, not a universal model ranking.
  • Automatic scoring is conservative and limited to final-answer correctness.
  • Harness choice matters because CLI adapters and provider APIs expose usage differently.
  • Accepted variants are part of the benchmark contract and should be reviewed with dataset changes.
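
One way to make the review of accepted variants concrete is to diff them between two dataset versions. The sketch below assumes each version has already been loaded into a plain dict that maps a question id to its list of accepted variants; that layout and the function name `changed_accepted_variants` are assumptions for this example, not the repository's actual format or API.

```python
def changed_accepted_variants(old, new):
    """Return {question_id: (removed, added)} for questions whose accepted variants differ."""
    changes = {}
    for qid in sorted(set(old) | set(new)):
        before = set(old.get(qid, []))
        after = set(new.get(qid, []))
        if before != after:
            # Sorted lists keep the report stable across runs.
            changes[qid] = (sorted(before - after), sorted(after - before))
    return changes
```

Questions that appear in only one version show up with all of their variants listed as added or removed, so new and deleted questions are flagged for review as well.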

Repository: source data, scripts, adapters, tests, and docs live at github.com/calvinnwq/reasoning-benchmark.