Explore evaluations across 49 distinct benchmarks, covering mathematics, coding, agentic action, and more.
50 exceptionally difficult research-level math problems.
300 expert-written math problems, ranging from advanced undergraduate material to early-career research.
100 novel puzzles, generated programmatically with a chess engine. Each puzzle has a single best next move; a uniqueness-check sketch follows this list.
1,000 factoid questions about politics, science and technology, art, sports, geography, music, and more.
A challenging multiple-choice question set in biology, chemistry, and physics, authored by PhD-level experts.
45 competition-style math problems from OTIS, harder than MATH Level 5 but easier than FrontierMath.
The hardest tier of problems from the MATH dataset, drawn from competitions like the AMC 10, AMC 12, and AIME.
500 GitHub issues from real-world Python repos, testing whether models can implement valid code fixes; an evaluation-loop sketch follows this list.
Challenging programming problems from Exercism, an online programming education platform.
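The chess puzzle set above is described as having a single best next move per position. A minimal sketch of how such a uniqueness check could be done with the `python-chess` library and a UCI engine is below; the engine path, search depth, and centipawn-gap criterion are illustrative assumptions, not the benchmark's documented methodology.

```python
import chess
import chess.engine

ENGINE_PATH = "stockfish"  # assumed: a UCI engine binary on PATH
GAP_CP = 150               # assumed threshold: best move must lead by >= 1.5 pawns

def has_unique_best_move(fen: str, depth: int = 18) -> bool:
    """Heuristic check that one move clearly dominates all alternatives."""
    board = chess.Board(fen)
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    try:
        # MultiPV=2 returns the two strongest lines so they can be compared.
        infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=2)
    finally:
        engine.quit()
    if len(infos) < 2:
        return True  # only one legal move, trivially unique
    # Scores from the side-to-move's perspective; mates mapped to large values.
    best = infos[0]["score"].relative.score(mate_score=100_000)
    second = infos[1]["score"].relative.score(mate_score=100_000)
    return best - second >= GAP_CP

# Example: White mates in one with Qxf7#, so the gap is enormous.
print(has_unique_best_move("r1bqkbnr/ppp2ppp/2np4/4p3/2B1P3/5Q2/PPPP1PPP/RNB1K1NR w KQkq - 0 4"))
```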
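For the GitHub-issue benchmark, "implement valid code fixes" means a model-generated patch must make previously failing tests pass. The sketch below is a simplified, single-instance version of that loop; the Hugging Face dataset id and field names are assumptions based on the public SWE-bench Verified release, and the official harness runs each instance in an isolated container rather than a bare working tree.

```python
import json
import subprocess

from datasets import load_dataset  # pip install datasets

# Each instance pairs a GitHub issue with the repo state it was filed against.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

def run_instance(instance: dict, model_patch: str, repo_dir: str) -> bool:
    """Return True if the model's patch makes the instance's failing tests pass."""
    # Reset the repo to the commit the issue was filed against.
    subprocess.run(["git", "checkout", "-f", instance["base_commit"]],
                   cwd=repo_dir, check=True)
    # Apply the benchmark's test patch, which adds the tests the fix must satisfy.
    subprocess.run(["git", "apply", "-"], input=instance["test_patch"],
                   text=True, cwd=repo_dir, check=True)
    # Apply the model-generated diff; a patch that fails to apply scores zero.
    applied = subprocess.run(["git", "apply", "-"], input=model_patch,
                             text=True, cwd=repo_dir)
    if applied.returncode != 0:
        return False
    # FAIL_TO_PASS is a JSON-encoded list of tests that fail before the gold
    # fix and pass after it; the model's patch is judged against the same tests.
    fail_to_pass = json.loads(instance["FAIL_TO_PASS"])
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass], cwd=repo_dir)
    return result.returncode == 0
```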