Latest benchmark run

Model answers for every benchmark question

Browse the actual responses from GPT-5.4 xhigh, GPT-5.5 medium, GPT-5.5 high, GPT-5.5 xhigh, Claude Opus 4.7 max, and Claude Opus 4.8 max for each prompt in the latest 94-question comparison.

