Epoch Capabilities Index

The Epoch Capabilities Index (ECI) combines scores from many different AI benchmarks into a single “general capability” scale, allowing comparisons between models even over timespans long enough for single benchmarks to reach saturation.

Overview

ECI is a composite metric that uses scores from 42 distinct benchmarks to generate a single, general capability scale. At a high level, ECI stitches its component benchmarks together, determining their relative difficulty from comparisons wherever models are evaluated on multiple benchmarks. Individual models obtain higher ECI scores if they perform better on harder benchmarks.

We give an overview of our methodology below; further technical details will be available in our paper, A Rosetta Stone for AI Benchmarks, which was funded by Google DeepMind, and written in collaboration with researchers from their AGI Safety & Alignment team. However, the ECI is an independent Epoch AI product that Epoch has full rights over.

Model

The technical foundation for the ECI comes from Item Response Theory (IRT), a statistical framework originally developed for educational testing. IRT enables comparisons between students even when they took tests in different years and one test was harder than another.

The core of our model is a simple logistic function:

\(\textrm{performance}(m,b) = \sigma(\alpha_b [C_m - D_b])\)

Here, \(\sigma\) represents the logistic function, \(C_m\) is the model’s capability, \(D_b\) is the benchmark’s difficulty, and \(\alpha_b\) is a slope parameter related to the distribution of difficulty across questions within the benchmark. The formula says that a model’s performance on a given benchmark depends on how capable it is relative to the difficulty of the benchmark, and on how “steep” the benchmark is. Higher \(\alpha_b\) values correspond to “steeper” benchmarks, where individual questions have a narrower range of difficulties and there is no long tail of much harder questions. These benchmarks tend to saturate quickly once models begin to make headway.
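The logistic relationship above can be sketched in a few lines of Python (the function name and the example parameter values are illustrative, not taken from the fitted model):

```python
import math

def predicted_performance(capability, difficulty, slope):
    """IRT-style prediction: sigma(alpha_b * (C_m - D_b))."""
    return 1.0 / (1.0 + math.exp(-slope * (capability - difficulty)))

# A model whose capability matches the benchmark's difficulty scores 0.5.
# For a model above the difficulty, a steeper slope pushes the
# predicted score closer to 1 (faster saturation).
predicted_performance(0.0, 0.0, 1.0)  # 0.5
predicted_performance(2.0, 0.0, 1.0)  # ~0.88
predicted_performance(2.0, 0.0, 3.0)  # ~0.998
```

Note that only the gap \(C_m - D_b\) matters, which is why the overall scale of the fitted parameters is arbitrary until it is pinned down (see the rescaling described below).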

We fit model capability (\(C_m\)), benchmark difficulty (\(D_b\)), and benchmark slope (\(\alpha_b\)) parameters that best explain the full set of observed scores. We do not assume any relationship between capability and time or compute inside the model.

We fit the model via non-linear least squares, using a ridge regularization penalty to discourage overfitting. The scale of the resulting values is arbitrary; we currently rescale so that Claude 3.5 Sonnet is fixed at 130 and GPT-5 is fixed at 150. This gives recent models consistent scores on a scale that balances communicating our uncertainty with providing detailed information.
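A minimal sketch of this kind of fit, using SciPy’s `least_squares` on a handful of hypothetical scores. The parameterization, penalty weight, and anchor indices are assumptions for illustration, not Epoch’s actual implementation:

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical data: (model index, benchmark index, observed score in [0, 1]).
obs = [(0, 0, 0.95), (0, 1, 0.60),
       (1, 0, 0.80), (1, 1, 0.30),
       (2, 0, 0.99), (2, 1, 0.85)]
n_models, n_benchmarks = 3, 2
scores = np.array([s for _, _, s in obs])

def residuals(theta, lam=0.1):
    C = theta[:n_models]                          # model capabilities C_m
    D = theta[n_models:n_models + n_benchmarks]   # benchmark difficulties D_b
    a = np.exp(theta[n_models + n_benchmarks:])   # slopes alpha_b, kept positive
    pred = np.array([1 / (1 + np.exp(-a[b] * (C[m] - D[b]))) for m, b, _ in obs])
    ridge = np.sqrt(lam) * theta                  # ridge penalty as extra residuals
    return np.concatenate([pred - scores, ridge])

theta0 = np.zeros(n_models + 2 * n_benchmarks)
C_hat = least_squares(residuals, theta0).x[:n_models]

# Pin the arbitrary scale with two anchor models (the real ECI fixes
# Claude 3.5 Sonnet at 130 and GPT-5 at 150); here models 1 and 2 stand in.
lo, hi = C_hat[1], C_hat[2]
eci = 130 + 20 * (C_hat - lo) / (hi - lo)
```

Appending the penalty terms as extra residuals is a standard trick: squaring and summing them recovers the ridge term in the least-squares objective.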

Data

We fit the model to scores from 42 benchmarks spanning 2023 to the present, drawn from the Epoch Benchmarking Hub. We use the following benchmarks:

Internal evaluations

Chess Puzzles, FrontierMath Tiers 1-3, FrontierMath Tier 4, GPQA Diamond, MATH Level 5, OTIS Mock AIME 2024-2025, SimpleQA Verified

External benchmark leaderboards

Aider polyglot, APEX-Agents, ARC-AGI-2, BALROG, DeepResearch Bench, Fiction.liveBench, GeoBench, GSO, HLE, SimpleBench, SWE-bench (Bash Only), Terminal-Bench, The Agent Company, VPCT, WeirdML V2

Developer reported scores

ANLI, ARC AI2, ARC-AGI, BIG-Bench Hard, CADEval, CSQA2, Cybench, GSM8K, HellaSwag, LAMBADA, Lech Mazur Writing, MMLU, OpenBookQA, OSWorld, PIQA, ScienceQA, SuperGLUE, TriviaQA, Video MME, WinoGrande

To be used in our methodology, benchmarks need to be scored on a 0–1 scale (or equivalently, 0% to 100%). For benchmarks where random guessing would score above 0 (e.g. those with multiple-choice responses), we rescale so that random-guessing performance maps to zero.
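The rescaling above can be sketched as a linear map from the chance level to 1 (clipping below-chance scores at zero is an assumption here, not something the text specifies):

```python
def rescale_above_chance(score, chance_level):
    """Map raw accuracy so random guessing -> 0 and a perfect score -> 1."""
    return max(0.0, (score - chance_level) / (1.0 - chance_level))

# A 4-option multiple-choice benchmark has a 25% guessing baseline:
rescale_above_chance(0.25, 0.25)  # 0.0 -- chance-level performance
rescale_above_chance(0.70, 0.25)  # 0.6
rescale_above_chance(1.00, 0.25)  # 1.0
```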

To capture the upper end of each model’s capabilities, we aggregate across model evaluation settings (e.g. thinking effort and inference provider), taking the highest score for each benchmark. We only aggregate over models released on the same day with the same name (e.g. we do not aggregate across versions of GPT-4o released on different days). To avoid low-certainty estimates, we drop any models with fewer than 4 benchmark scores from our model fit. We also exclude models released before 2023 due to data sparsity; we hope to increase coverage in the future.
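The aggregation and filtering rules above might look like the following sketch (the rows, model names, and scores are hypothetical; only the max-over-settings, same-name-and-date grouping, and minimum-score threshold come from the text):

```python
from collections import defaultdict

# Hypothetical rows: (model name, release date, benchmark, eval setting, score).
rows = [
    ("model-a", "2024-05-01", "GPQA Diamond", "high thinking", 0.62),
    ("model-a", "2024-05-01", "GPQA Diamond", "low thinking",  0.55),
    ("model-a", "2024-05-01", "MATH Level 5", "high thinking", 0.80),
    ("model-b", "2024-06-01", "GPQA Diamond", "default",       0.50),
]

MIN_SCORES = 4  # models with fewer benchmark scores are dropped from the fit

best = defaultdict(dict)
for name, date, benchmark, _setting, score in rows:
    key = (name, date)  # aggregate only within the same name and release date
    best[key][benchmark] = max(best[key].get(benchmark, 0.0), score)

kept = {k: v for k, v in best.items() if len(v) >= MIN_SCORES}
# With this toy data, no model reaches 4 benchmark scores, so `kept` is empty.
```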

Frequently asked questions

What does the ECI represent?

How should I interpret ECI values?

Why isn’t this model’s ECI higher, if it leads some benchmarks?

How do you decide which benchmarks to use?

Why did the ECI score of a model change?

Isn’t it a problem if model developers release only their best scores?

Isn’t it a problem if model developers optimize for benchmark scores?

Why isn’t my model included?

Why isn’t my benchmark included?

Acknowledgements

This work was based on research conducted with support from Google DeepMind, and thus draws directly on the methodology introduced in our paper, A Rosetta Stone for AI Benchmarks. However, the ECI is an independent Epoch AI product that Epoch has full rights over. We thank Rohin Shah, Samuel Albanie, Anna Wang, Eli Lifland, Nate Rush, Ezra Edelman, and Isabel Juniewicz.