Data Insight

Nov. 27, 2024

Updated Feb. 7, 2025

Accuracy increases with estimated training compute

By Jean-Stanislas Denain

GPQA Diamond and MATH Level 5 accuracies increase with estimated training compute. For GPQA Diamond, most models below 10²⁴ FLOP struggle to rise above random-chance performance, and some perform worse than chance because they fail to understand the question formatting. Past 10²⁴ FLOP, performance increases by around 12 percentage points with every 10x increase in compute.

On MATH Level 5, models with higher compute estimates also tend to score higher: performance increases by around 17 percentage points with every 10x increase in pretraining compute. However, the trend is much noisier than for GPQA Diamond.
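To make the "percentage points per 10x" figures concrete, here is a minimal Python sketch of the arithmetic they imply. It assumes the trend is linear in log10 of compute, which is how the slopes above are stated. The function and dictionary names are hypothetical and are not part of any Epoch tooling. The sketch only converts a compute ratio into an implied accuracy change. It does not reproduce the underlying fit, and it ignores both the noise in the data and the ceiling at 100% accuracy.

```python
import math

# Slopes quoted in the text, in percentage points per 10x increase in compute.
# Assumption: each trend is treated as exactly log-linear, which is a simplification.
SLOPE_PP_PER_10X = {
    "GPQA Diamond": 12.0,  # stated to hold past 1e24 FLOP
    "MATH Level 5": 17.0,  # described as a much noisier trend
}


def expected_gain_pp(benchmark: str, compute_from: float, compute_to: float) -> float:
    """Return the accuracy change (percentage points) implied by the quoted slope.

    compute_from and compute_to are training compute estimates in FLOP.
    """
    if compute_from <= 0 or compute_to <= 0:
        raise ValueError("compute values must be positive")
    # Number of 10x steps between the two compute levels.
    tenfold_steps = math.log10(compute_to / compute_from)
    return SLOPE_PP_PER_10X[benchmark] * tenfold_steps


if __name__ == "__main__":
    for name in SLOPE_PP_PER_10X:
        gain = expected_gain_pp(name, 1e24, 1e25)
        print(f"{name}: about {gain:.0f} points for a 10x increase in compute")
```

Under these assumptions, going from 10²⁴ to 10²⁵ FLOP implies a gain of roughly 12 points on GPQA Diamond and roughly 17 points on MATH Level 5, and a 100x increase implies about twice that. Since the GPQA Diamond slope is only reported past 10²⁴ FLOP, applying it below that threshold would not be supported by the text.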

On both benchmarks, more recent models such as DeepSeek-R1, Phi-4, or Mistral Small 3 outperform older models trained with the same amount of compute, highlighting the role of algorithmic progress. Finally, note that these trends exclude many of the top-performing models, such as OpenAI’s o1, for which we lack compute estimates.

Epoch's work is free to use, distribute, and reproduce provided the source and authors are credited under the Creative Commons BY license.
