How-to guide

Add benchmark questions

Keep new cases explicit, reviewable, and compatible with both the scorer and the static docs payload.

Update The Dataset

Add each case to data/questions.json with a stable id, category, prompt, expected_answer, accepted variants, rationale, and failure mode. If data/questions.csv is used for review workflows, update it in the same change so the two files stay aligned.
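
As a rough pre-review check, the sketch below flags cases with missing fields or duplicate ids. It is an illustration under assumptions, not the project's validator: it assumes the file parses to a list of objects, and the key spellings accepted_variants, rationale, and failure_mode are guesses at how the last three fields are named. The example case is hypothetical.

```python
# Key names after expected_answer are assumed spellings; adjust to the real schema.
REQUIRED_FIELDS = (
    "id",
    "category",
    "prompt",
    "expected_answer",
    "accepted_variants",
    "rationale",
    "failure_mode",
)


def find_case_problems(cases):
    """Return a list of human-readable problems for a list of case dicts."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        case_id = case.get("id")
        label = case_id if case_id is not None else f"<case #{index}>"
        for field in REQUIRED_FIELDS:
            # Treat absent, empty-string, None, and empty-list values as missing.
            if field not in case or case[field] in ("", None, []):
                problems.append(f"{label}: missing or empty {field}")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(case_id)
    return problems


# Hypothetical case used only to exercise the check.
example_case = {
    "id": "example-001",
    "category": "arithmetic",
    "prompt": "A placeholder prompt.",
    "expected_answer": "5 cents",
    "accepted_variants": ["5 cents", "$0.05"],
    "rationale": "A placeholder rationale.",
    "failure_mode": "A placeholder failure mode.",
}

print(find_case_problems([example_case]))
```

To run it against the real file, load data/questions.json with json.load and pass the result in, provided the top-level value really is a list of cases.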

Review Scoring Inputs

Accepted variants should be narrow enough to avoid rewarding the common wrong answer, but broad enough to cover semantically equivalent final answers.
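
One way to test that balance before committing is to check each variant list against both an equivalent phrasing and the known wrong answer. The sketch below assumes a simple normalization (case folding, whitespace collapsing, stripping surrounding punctuation); the real scorer may compare answers differently, so treat normalize and is_accepted as hypothetical helpers and the sample values as illustrative only.

```python
import string


def normalize(text):
    """Case-fold, collapse whitespace, and strip surrounding punctuation.

    These rules are an assumption for illustration, not the scorer's rules.
    """
    collapsed = " ".join(text.casefold().split())
    return collapsed.strip(string.punctuation + " ")


def is_accepted(answer, accepted_variants):
    """Return True if the answer matches any accepted variant after normalizing."""
    wanted = {normalize(variant) for variant in accepted_variants}
    return normalize(answer) in wanted


# Hypothetical variants and answers for a single case.
variants = ["5 cents", "$0.05"]
equivalent_answer = "5 Cents."
common_wrong_answer = "10 cents"

print(is_accepted(equivalent_answer, variants))
print(is_accepted(common_wrong_answer, variants))
```

If the common wrong answer is accepted, the variants are too broad; if a clearly equivalent phrasing is rejected, they are too narrow.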

Place Cases In Suites

Suite manifests live in data/suites/. Add each new ID only to the suites where it belongs, so the starter, holdout, and default slices keep their intended coverage.
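
A quick way to see how a change affects coverage is to count cases per category in each suite and list any IDs that do not exist in the dataset. The manifest file format is not described here, so this sketch works on already-loaded data and assumes each suite reduces to a list of case IDs; suite_report and the sample data are hypothetical.

```python
from collections import Counter


def suite_report(suites, cases):
    """Summarize each suite: unknown IDs and number of cases per category.

    suites: dict mapping suite name to a list of case IDs (assumed shape).
    cases: list of dicts with at least "id" and "category" keys.
    """
    category_by_id = {case["id"]: case["category"] for case in cases}
    report = {}
    for name, ids in suites.items():
        unknown = [case_id for case_id in ids if case_id not in category_by_id]
        counts = Counter(
            category_by_id[case_id] for case_id in ids if case_id in category_by_id
        )
        report[name] = {"unknown_ids": unknown, "per_category": dict(counts)}
    return report


# Hypothetical dataset and suites used only to exercise the report.
sample_cases = [
    {"id": "q1", "category": "arithmetic"},
    {"id": "q2", "category": "logic"},
]
sample_suites = {"starter": ["q1"], "holdout": ["q2", "q9"]}

for suite_name, summary in suite_report(sample_suites, sample_cases).items():
    print(suite_name, summary)
```

Comparing the report before and after adding an ID shows whether a suite's category mix shifted in a way you did not intend.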

Regenerate Docs Data

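After changing the dataset or a suite manifest, rebuild the docs payload and then run the docs site tests:

python3 scripts/build_docs_data.py
python3 -m unittest tests.test_docs_site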