What It Measures
The benchmark asks models to answer small prompts that look obvious but punish template matching. The scoring target is the final answer, with manual fields preserved for deeper review when a run needs audit notes.
Goal grounding
Does the response satisfy the actual user goal?
Modified riddles
Does it notice when a familiar puzzle has changed?
Literal precision
Does it follow exact wording and output constraints?
Social pragmatics
Does it infer ordinary conversational intent?
Physical common sense
Does it keep real-world constraints in view?
Temporal state
Does it track what changed and what remains true?
How To Use The Site
Caveats
- This is a compact benchmark, not a universal model ranking.
- Automatic scoring is conservative and limited to final-answer correctness.
- Harness choice matters because CLI adapters and provider APIs expose usage differently.
- Accepted variants are part of the benchmark contract and should be reviewed with dataset changes.