Update The Dataset
Add each case to data/questions.json with a stable id, category, prompt, expected_answer, accepted variants, rationale, and failure mode. If data/questions.csv is used for review workflows, keep it in sync with the JSON file.
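A quick pre-commit check for the fields listed above might look like the following. This is a minimal sketch, not the project's own validator: it assumes the file holds a top-level list of objects, and the key names accepted_variants and failure_mode are guesses at how "accepted variants" and "failure mode" are spelled in the JSON.

```python
import json

# Assumed key names. Only "expected_answer" is spelled exactly this way in the
# text above; "accepted_variants" and "failure_mode" are hypothetical spellings.
REQUIRED_FIELDS = (
    "id",
    "category",
    "prompt",
    "expected_answer",
    "accepted_variants",
    "rationale",
    "failure_mode",
)


def find_problems(cases):
    """Return a list of human-readable problems for a list of case dicts."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"<case #{index}>")
        # Report every required field the case does not define.
        missing = [field for field in REQUIRED_FIELDS if field not in case]
        if missing:
            problems.append(f"{label}: missing {', '.join(missing)}")
        # IDs must stay unique so they remain stable references.
        if "id" in case:
            if case["id"] in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(case["id"])
    return problems


def check_file(path="data/questions.json"):
    """Load the dataset (assumed to be a JSON list) and return its problems."""
    with open(path, encoding="utf-8") as handle:
        return find_problems(json.load(handle))
```

Calling check_file() from the repository root returns an empty list when every case defines all fields and no id repeats.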
Review Scoring Inputs
Accepted variants should be narrow enough to avoid rewarding the common wrong answer, but broad enough to cover semantically equivalent final answers.
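One way to test that balance is to run the known wrong answers through the same matching rule used for correct ones. The sketch below assumes a simple scorer that lowercases, collapses whitespace, and then requires an exact match. The real scorer may differ, and the function names here are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting does not matter."""
    return " ".join(text.lower().split())


def is_accepted(answer, expected_answer, accepted_variants):
    """Return True if the answer matches the expected answer or any variant."""
    allowed = {normalize(expected_answer)}
    allowed.update(normalize(variant) for variant in accepted_variants)
    return normalize(answer) in allowed


def variants_too_broad(expected_answer, accepted_variants, common_wrong_answers):
    """Return the wrong answers that the current variants would reward."""
    return [
        wrong
        for wrong in common_wrong_answers
        if is_accepted(wrong, expected_answer, accepted_variants)
    ]
```

For a case whose expected answer is "5 cents" with the variant "$0.05", variants_too_broad("5 cents", ["$0.05"], ["10 cents"]) returns an empty list, which is the desired result. Adding "10 cents" to the variants by mistake would make it show up in the returned list.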
Place Cases In Suites
Suite manifests live in data/suites/. Add new IDs deliberately so starter, holdout, and default slices keep their intended coverage.
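Before and after adding IDs, it can help to compare how many cases each category contributes to a suite. The manifest format is not described here, so this sketch takes the suite as an already-loaded list of case IDs and the dataset as a list of dicts with id and category keys. Both shapes are assumptions.

```python
from collections import Counter


def suite_coverage(suite_ids, cases):
    """Count cases per category in a suite and list IDs missing from the dataset.

    suite_ids: list of case IDs (assumed manifest content, already loaded).
    cases: list of dicts that each have "id" and "category" keys.
    """
    category_by_id = {case["id"]: case["category"] for case in cases}
    # IDs in the manifest that no longer exist in the dataset.
    unknown = [case_id for case_id in suite_ids if case_id not in category_by_id]
    # Per-category totals for the IDs that do resolve.
    counts = Counter(
        category_by_id[case_id]
        for case_id in suite_ids
        if case_id in category_by_id
    )
    return counts, unknown
```

Running it on the starter, holdout, and default ID lists in turn shows whether a new case shifts any slice's category mix, and a non-empty unknown list points to a stale or mistyped ID.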
Regenerate Docs Data
python3 scripts/build_docs_data.py
python3 -m unittest tests.test_docs_site