Explore evaluations across 49 distinct benchmarks, covering mathematics, coding, agentic action, and more.
50 exceptionally difficult research-level math problems.
300 expert-written math problems, ranging from advanced undergraduate material to early-career research.
100 novel puzzles, generated programmatically with a chess engine. Each puzzle has a single best next move; a uniqueness-check sketch follows this list.
1,000 factoid questions about politics, science and technology, art, sports, geography, music, and more.
A challenging multiple-choice question set in biology, chemistry, and physics, authored by PhD-level experts.
45 competition-style math problems from OTIS, harder than MATH Level 5 but easier than FrontierMath.
The hardest tier of problems from the MATH dataset, drawn from competitions like the AMC 10, AMC 12, and AIME.
500 GitHub issues from real-world Python repos, testing whether models can implement valid code fixes; an evaluation-loop sketch follows this list.
Challenging programming problems from Exercism, an online programming education platform.
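The chess puzzle set above is described as having a single best next move per position. A minimal sketch of how such a uniqueness check could be done with the `python-chess` library and a UCI engine is below; the engine path, search depth, and centipawn-gap criterion are illustrative assumptions, not the benchmark's documented methodology.

```python
import chess
import chess.engine

ENGINE_PATH = "stockfish"  # assumed: a UCI engine binary on PATH
GAP_CP = 150               # assumed threshold: best move must lead by >= 1.5 pawns

def has_unique_best_move(fen: str, depth: int = 18) -> bool:
    """Heuristic check that one move clearly dominates all alternatives."""
    board = chess.Board(fen)
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    try:
        # MultiPV=2 returns the two strongest lines so they can be compared.
        infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=2)
    finally:
        engine.quit()
    if len(infos) < 2:
        return True  # only one legal move, trivially unique
    # Scores from the side-to-move's perspective; mates mapped to large values.
    best = infos[0]["score"].relative.score(mate_score=100_000)
    second = infos[1]["score"].relative.score(mate_score=100_000)
    return best - second >= GAP_CP

# Example: White mates in one with Qxf7#, so the gap is enormous.
print(has_unique_best_move("r1bqkbnr/ppp2ppp/2np4/4p3/2B1P3/5Q2/PPPP1PPP/RNB1K1NR w KQkq - 0 4"))
```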
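For the GitHub-issue benchmark, "implement valid code fixes" means a model-generated patch must make previously failing tests pass. The sketch below is a simplified, single-instance version of that loop; the Hugging Face dataset id and field names are assumptions based on the public SWE-bench Verified release, and the official harness runs each instance in an isolated container rather than a bare working tree.

```python
import json
import subprocess

from datasets import load_dataset  # pip install datasets

# Each instance pairs a GitHub issue with the repo state it was filed against.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

def run_instance(instance: dict, model_patch: str, repo_dir: str) -> bool:
    """Return True if the model's patch makes the instance's failing tests pass."""
    # Reset the repo to the commit the issue was filed against.
    subprocess.run(["git", "checkout", "-f", instance["base_commit"]],
                   cwd=repo_dir, check=True)
    # Apply the benchmark's test patch, which adds the tests the fix must satisfy.
    subprocess.run(["git", "apply", "-"], input=instance["test_patch"],
                   text=True, cwd=repo_dir, check=True)
    # Apply the model-generated diff; a patch that fails to apply scores zero.
    applied = subprocess.run(["git", "apply", "-"], input=model_patch,
                             text=True, cwd=repo_dir)
    if applied.returncode != 0:
        return False
    # FAIL_TO_PASS is a JSON-encoded list of tests that fail before the gold
    # fix and pass after it; the model's patch is judged against the same tests.
    fail_to_pass = json.loads(instance["FAIL_TO_PASS"])
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass], cwd=repo_dir)
    return result.returncode == 0
```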