Run Locally · Reasoning Benchmark

Inspect The Dataset

python3 scripts/run_benchmark.py --list
python3 scripts/run_benchmark.py --list-suites
python3 scripts/run_benchmark.py --list --suite starter

The default --list command shows the 94-question auto-scored slice. Pass --suite with a suite name to list a calibrated slice such as starter or holdout.
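When the list commands above need to be driven from another script, a thin wrapper is one option. The sketch below is illustrative only: build_list_command and run_list are hypothetical helper names, not part of the repository, and the only flags used are --list and --suite as shown above. It assumes python3 is on PATH and the working directory is the repository root.

```python
import subprocess

# Path of the benchmark script, relative to the repository root.
SCRIPT = "scripts/run_benchmark.py"


def build_list_command(suite=None):
    """Build the argv for a list command, optionally restricted to one suite."""
    command = ["python3", SCRIPT, "--list"]
    if suite is not None:
        command += ["--suite", suite]
    return command


def run_list(suite=None):
    """Run the list command and return whatever it prints to stdout.

    The output format is not assumed here; the text is returned unparsed.
    """
    result = subprocess.run(
        build_list_command(suite),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, run_list("starter") would execute the same command as the third line of the listing above and return its output as a string.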

Create Prompts Or A Run Template

python3 scripts/run_benchmark.py --sample-run
python3 scripts/run_benchmark.py --emit-prompts runs/prompts.jsonl

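The --sample-run command shows the JSON shape expected for a run. The --emit-prompts command writes a prompt pack (runs/prompts.jsonl in the example above), which is useful when sending cases to an external harness.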
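An external harness first has to read the prompt pack. The sketch below assumes only that the file is JSON Lines (one JSON value per line), as the .jsonl extension suggests. The field names inside each record are not documented here, so the code does not assume any: it loads the records and reports which top-level keys are present. load_prompt_pack and summarize_keys are hypothetical helper names.

```python
import json
from pathlib import Path


def load_prompt_pack(path):
    """Read a JSON Lines file and return one parsed value per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc})"
                ) from exc
    return records


def summarize_keys(records):
    """Return the sorted top-level keys seen across all dict records."""
    keys = set()
    for record in records:
        if isinstance(record, dict):
            keys.update(record)
    return sorted(keys)
```

A typical first step would be calling load_prompt_pack("runs/prompts.jsonl") and then summarize_keys on the result, to see which fields the harness needs to handle before wiring anything else up.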

Run Through Adapters

The adapter entrypoints live in scripts/api_adapter.py and scripts/cli_adapter.py. They share the prompt contract from scripts/benchmark_contract.py and adapter helpers from scripts/benchmark_adapters.py.