Documentation site

Reasoning Benchmark

A browsable catalogue of short benchmark prompts designed to catch weak reasoning in areas such as goal grounding, world-state tracking, social pragmatics, modified riddles, literal precision, physical constraints, and instruction ambiguity.


144 questions total

94 default auto-scored

8 categories

4 suites

Benchmark Reports

This section is reserved for future result reports: static summaries generated from scored run artifacts, covering model comparisons, per-category accuracy, reasoning notes, and cost or latency telemetry.

Latest comparison: reserved for the newest scored model-vs-model report.

Category breakdowns: reserved for accuracy by task family and failure mode.

Run archive: reserved for stable links to historical benchmark bundles.
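
As an illustration of how a per-category accuracy summary might be derived from scored run artifacts, here is a minimal sketch. The record layout (a list of dicts with `category` and `correct` keys) and the function name `per_category_accuracy` are assumptions made for this example; they are not the project's actual artifact schema or scoring contract, and the sample records are invented placeholders, not benchmark results.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return {category: fraction of records marked correct}.

    `records` is assumed to be an iterable of dicts with a string
    `category` key and a boolean `correct` key (hypothetical layout).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        if rec["correct"]:
            correct[rec["category"]] += 1
    # Every category in `totals` has at least one record, so no division by zero.
    return {cat: correct[cat] / totals[cat] for cat in totals}


if __name__ == "__main__":
    # Invented placeholder records for demonstration only.
    sample = [
        {"category": "goal grounding", "correct": True},
        {"category": "goal grounding", "correct": False},
        {"category": "modified riddles", "correct": True},
    ]
    for category, accuracy in sorted(per_category_accuracy(sample).items()):
        print(f"{category}: {accuracy:.0%}")
```

Grouping by category before dividing keeps the summary robust to categories of different sizes, which matters when only part of the catalogue is auto-scored.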