Documentation site

Reasoning Benchmark

A browsable catalogue of short benchmark prompts that catch weak reasoning: goal grounding, world-state tracking, social pragmatics, modified riddles, literal precision, physical constraints, and instruction ambiguity.

144 questions total
94 default auto-scored
8 categories
4 suites

Benchmark Reports

This section is reserved for future result reports, published as static summaries built from scored run artifacts: model comparisons, per-category accuracy, reasoning notes, and cost or latency telemetry.

Latest comparison: Reserved for the newest scored model-vs-model report.
Category breakdowns: Reserved for accuracy by task family and failure mode.
Run archive: Reserved for stable links to historical benchmark bundles.
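
As a rough illustration of how a category breakdown could be derived from a scored run, here is a minimal Python sketch. The record layout (a list of dictionaries with `category` and `correct` keys) and the function name `per_category_accuracy` are assumptions made for this example; the actual format of the scored run artifacts is not described here, and the sample records are made up rather than real results.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return {category: fraction of correct answers} for scored records.

    Assumes each record is a dict with a "category" string and a boolean
    "correct" flag. This layout is hypothetical, not the real artifact schema.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        if rec["correct"]:
            correct[rec["category"]] += 1
    return {cat: correct[cat] / totals[cat] for cat in totals}


# Illustrative records only; these are not benchmark results.
sample = [
    {"category": "goal grounding", "correct": True},
    {"category": "goal grounding", "correct": False},
    {"category": "literal precision", "correct": True},
]

print(per_category_accuracy(sample))
```

Running the same function over the records of two different models would give the per-category numbers a model-vs-model comparison needs.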