Score And Evaluate · Reasoning Benchmark

Score A Completed Run

python3 scripts/score_run.py --input runs/example-run.json --output runs/example-run.scored.json

The scorer accepts three input shapes: an object with the canonical top-level results list, an object that uses a compatibility key such as runs or answers, or a bare list.
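The accepted input shapes can be illustrated with a small sketch. This is not the code in scripts/score_run.py; the helper name extract_rows and the order in which keys are tried are assumptions made for illustration.

```python
# Hypothetical helper: the lookup order is an assumption, not taken from score_run.py.
CANDIDATE_KEYS = ("results", "runs", "answers")


def extract_rows(payload):
    """Return the list of rows from any of the accepted input shapes."""
    # Shape 3: the file is a bare list.
    if isinstance(payload, list):
        return payload
    # Shapes 1 and 2: an object holding the list under a known key.
    if isinstance(payload, dict):
        for key in CANDIDATE_KEYS:
            value = payload.get(key)
            if isinstance(value, list):
                return value
    raise ValueError("no results list found in input")
```

A caller would typically pass the output of json.load on the run file to a function like this before scoring.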

Read The Output

score_answer is filled in automatically by the final-answer matcher. Manual fields such as score_reasoning, penalties, and notes are preserved as supplied; the scorer never assigns them.
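The split between automatic and manual fields might look roughly like the sketch below. The row keys final_answer and expected_answer, the 0/1 score values, and the exact_match function are assumptions standing in for the real matcher and schema, which this page does not describe.

```python
def exact_match(predicted, expected):
    # Stand-in for the final-answer matcher (assumption): a case-insensitive,
    # whitespace-trimmed string comparison.
    return str(predicted).strip().lower() == str(expected).strip().lower()


def score_row(row):
    """Return a copy of row with score_answer set and all other fields untouched."""
    scored = dict(row)  # copying keeps score_reasoning, penalties, and notes as supplied
    matched = exact_match(row.get("final_answer", ""), row.get("expected_answer", ""))
    scored["score_answer"] = 1 if matched else 0
    # Manual fields are deliberately not created or modified here.
    return scored
```

Because the row is copied rather than rebuilt, any manual annotations already present in the input survive a rescoring pass.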

Compare Runs

Use the scored rows, normalized usage fields, and raw provider artifacts together. The historical run pages show one static presentation of those outputs.
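As a starting point for a comparison, the scored rows of each run can be reduced to a short summary. The sketch below assumes the rows are already loaded as a list of dictionaries and that score_answer is numeric; the summarize helper is hypothetical and looks only at score_answer, not at usage fields or provider artifacts.

```python
def summarize(rows):
    """Summarize score_answer over a list of scored rows."""
    # Rows without a score_answer are counted but left out of the mean.
    graded = [r["score_answer"] for r in rows if r.get("score_answer") is not None]
    mean = sum(graded) / len(graded) if graded else None
    return {"rows": len(rows), "graded": len(graded), "mean_score_answer": mean}
```

Running this over two scored files gives a quick side-by-side figure; the usage fields and raw artifacts are still needed to explain any difference.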