About our benchmarking

Epoch AI’s benchmarking brings together results from many of the most informative AI benchmarks—both those we run ourselves and those reported by reputable external sources—into one consistent, searchable place.

Why we made this

AI capabilities are moving quickly, but results can be scattered and hard to compare. We track diverse, informative, and challenging benchmarks so that researchers, practitioners, and policymakers can see where the state of the art is, and how it is changing.

Methodology at a glance

Our internal evaluations are powered by Inspect and run with consistent, well-documented settings across models. External results come from official leaderboards or primary sources. For full details (prompting, temperatures, implementations), see the FAQ below.
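To give a concrete sense of what an Inspect-based evaluation looks like, here is a minimal sketch of a task definition and run. The dataset, model name, and generation settings below are illustrative assumptions for the example, not Epoch AI's actual benchmark configuration.

    # Minimal sketch of an Inspect evaluation task (illustrative only).
    # The sample, model name, and temperature are assumptions for this
    # example, not Epoch AI's actual benchmark settings.
    from inspect_ai import Task, task, eval
    from inspect_ai.dataset import Sample
    from inspect_ai.scorer import match
    from inspect_ai.solver import generate

    @task
    def toy_benchmark():
        return Task(
            # One hand-written sample standing in for a real benchmark dataset.
            dataset=[Sample(input="What is 12 * 12?", target="144")],
            solver=generate(),  # ask the model for a completion
            scorer=match(),     # score by matching the target in the output
        )

    if __name__ == "__main__":
        # Run the task against a named model with fixed generation settings,
        # so results are comparable across models and reproducible across runs.
        eval(toy_benchmark(), model="openai/gpt-4o", temperature=0.0)

Defining benchmarks as tasks like this is what makes it possible to hold prompting and generation settings constant while swapping out the model under evaluation.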

Frequently asked questions

How did you choose which benchmarks to evaluate on?

How did you choose which models to evaluate?

How do you implement and run your internal evaluations?

Can I see how models answered each question?

How accurate is the data?

Why are some of your scores different from those reported elsewhere?

What do the error bars represent?

Can I see the evaluation code?

Why do some models underperform the random baseline?

How is the data licensed?

How can I access this data?

Who funds Epoch AI’s benchmarking?

Who can I contact with questions or comments about the data?

Licensing

Epoch AI’s data is released under the Creative Commons Attribution (CC BY) license: it is free to use, distribute, and reproduce, provided the source and authors are credited.

This hub also includes data sourced from external projects, which retains its original licensing. Users are responsible for ensuring compliance with the respective license terms for the specific data they use, and appropriate credit should be given to the original sources as indicated.