Inspect The Dataset
python3 scripts/run_benchmark.py --list
python3 scripts/run_benchmark.py --list-suites
python3 scripts/run_benchmark.py --list --suite starter
By default, --list shows the 94-question auto-scored slice. Pass --suite to list a calibrated slice such as starter or holdout; --list-suites lists the suites.
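If the listing needs to be driven from another Python script, the commands above can be wrapped as in the following minimal sketch. It uses only the --list and --suite flags shown above; the helper names build_list_command and list_questions are hypothetical and not part of the repository, and the sketch assumes it is run from the repository root.

```python
import subprocess
import sys
from typing import List, Optional


def build_list_command(suite: Optional[str] = None) -> List[str]:
    """Build the argv for listing questions, optionally limited to one suite.

    Hypothetical helper: it only assembles flags documented above.
    """
    cmd = [sys.executable, "scripts/run_benchmark.py", "--list"]
    if suite is not None:
        cmd += ["--suite", suite]
    return cmd


def list_questions(suite: Optional[str] = None) -> str:
    """Run the listing command and return whatever it prints to stdout."""
    result = subprocess.run(
        build_list_command(suite),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Calling list_questions("starter") would correspond to the --list --suite starter command above. The output format is not assumed here, so the text is returned unparsed.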
Create Prompts Or A Run Template
python3 scripts/run_benchmark.py --sample-run
python3 scripts/run_benchmark.py --emit-prompts runs/prompts.jsonl
The --sample-run output shows the expected JSON shape of a run. The --emit-prompts command writes a prompt pack (here runs/prompts.jsonl), which is useful when sending cases to an external harness.
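An external harness would typically start by reading the prompt pack back in. The sketch below assumes only what the .jsonl extension implies, namely one JSON value per line; it makes no assumption about the field names inside each record, and iter_prompt_records is a hypothetical helper rather than something provided by the repository.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_prompt_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON record per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a damaged pack is easy to locate.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
```

A harness could then loop with "for record in iter_prompt_records('runs/prompts.jsonl')" and inspect record.keys() on the first record to see which fields the pack actually contains.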
Run Through Adapters
The adapter entrypoints live in scripts/api_adapter.py and scripts/cli_adapter.py. They share the prompt contract from scripts/benchmark_contract.py and adapter helpers from scripts/benchmark_adapters.py.