Add Questions · Reasoning Benchmark

Update The Dataset

Add each case to data/questions.json with a stable id, category, prompt, expected_answer, accepted variants, rationale, and failure mode. If data/questions.csv is used for review workflows, update it in the same change so the two files stay in sync.
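As a rough illustration of the record shape described above, the sketch below checks a list of question entries for missing fields and duplicate ids. The key names accepted_variants and failure_mode are assumptions (the text only names them in prose), the sample entry is invented for illustration, and find_problems is a hypothetical helper, not part of the repository.

```python
# Assumed key names: `accepted_variants` and `failure_mode` are guesses at how
# "accepted variants" and "failure mode" are spelled in data/questions.json.
REQUIRED_FIELDS = (
    "id", "category", "prompt", "expected_answer",
    "accepted_variants", "rationale", "failure_mode",
)


def find_problems(questions):
    """Return human-readable problems found in a list of question dicts."""
    problems = []
    seen_ids = set()
    for index, question in enumerate(questions):
        label = question.get("id", f"<entry {index}>")
        for field in REQUIRED_FIELDS:
            if field not in question:
                problems.append(f"{label}: missing field '{field}'")
        if "id" in question:
            if question["id"] in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(question["id"])
    return problems


# Hypothetical entry, for illustration only.
example = {
    "id": "example-001",
    "category": "arithmetic",
    "prompt": "A bat and a ball cost $1.10 in total. The bat costs $1.00 "
              "more than the ball. How much does the ball cost?",
    "expected_answer": "5 cents",
    "accepted_variants": ["$0.05", "five cents"],
    "rationale": "ball + (ball + 1.00) = 1.10, so ball = 0.05.",
    "failure_mode": "Answering 10 cents by subtracting 1.00 from 1.10.",
}

print(find_problems([example]))
```

A check like this could be run over the loaded JSON before committing; it says nothing about whether the CSV copy matches.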

Review Scoring Inputs

Accepted variants should be narrow enough to avoid rewarding the common wrong answer, but broad enough to cover semantically equivalent final answers.
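To make the trade-off concrete, here is a minimal sketch of how a variant list might be tested against both equivalent answers and the common wrong answer. The benchmark's real scorer is not shown in this page, so normalize and is_accepted are hypothetical stand-ins, and the normalization rules (case folding, whitespace collapsing, dropping trailing periods) are assumptions.

```python
def normalize(answer):
    """Case-fold, collapse whitespace, and drop trailing periods."""
    return " ".join(answer.casefold().split()).rstrip(".")


def is_accepted(final_answer, expected_answer, accepted_variants):
    """True if the final answer matches the expected answer or a variant."""
    allowed = {normalize(expected_answer)}
    allowed.update(normalize(variant) for variant in accepted_variants)
    return normalize(final_answer) in allowed


# Hypothetical case: the expected answer is "5 cents", and the common
# wrong answer is "10 cents".
expected = "5 cents"
variants = ["$0.05", "five cents"]

equivalent_answers = ["Five cents.", " $0.05 ", "5 CENTS"]
common_wrong_answer = "10 cents"

# Broad enough: every equivalent phrasing is accepted.
assert all(is_accepted(a, expected, variants) for a in equivalent_answers)
# Narrow enough: the common wrong answer is rejected.
assert not is_accepted(common_wrong_answer, expected, variants)
```

Writing down the common wrong answer next to the variants, as in the last assertion, is one way to catch a variant that is accidentally too permissive.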

Place Cases In Suites

Suite manifests live in data/suites/. Add new IDs deliberately so starter, holdout, and default slices keep their intended coverage.
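One way to review coverage after adding IDs is to count categories per suite and flag IDs that do not exist in the dataset. The manifest format is not described here, so this sketch assumes each suite reduces to a plain list of question IDs; suite_report and the sample data are hypothetical.

```python
from collections import Counter


def suite_report(questions, suites):
    """Summarize each suite: unknown IDs and question counts per category.

    `questions` is a list of dicts with at least "id" and "category".
    `suites` maps a suite name to a list of question IDs (assumed format).
    """
    category_by_id = {q["id"]: q["category"] for q in questions}
    report = {}
    for name, ids in suites.items():
        unknown = sorted(i for i in ids if i not in category_by_id)
        counts = Counter(category_by_id[i] for i in ids if i in category_by_id)
        report[name] = {"unknown_ids": unknown, "per_category": dict(counts)}
    return report


# Hypothetical data, for illustration only.
questions = [
    {"id": "q1", "category": "logic"},
    {"id": "q2", "category": "arithmetic"},
    {"id": "q3", "category": "logic"},
]
suites = {
    "starter": ["q1", "q2"],
    "holdout": ["q3", "q9"],  # "q9" is not in the dataset
}

for name, summary in suite_report(questions, suites).items():
    print(name, summary)
```

Comparing the per-category counts before and after a change shows whether a new ID shifted a slice away from its intended mix.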

Regenerate Docs Data

After changing questions or suite manifests, rebuild the docs data, then run the docs site tests:

python3 scripts/build_docs_data.py
python3 -m unittest tests.test_docs_site