Data Insight
Sep. 3, 2025

LLMs have not yet solved the hardest problems on high school math contests

By Greg Burnham

A year ago, LLMs could only solve some of the easier problems on premier high school math contests. Now they have achieved gold-medal-equivalent scores on the International Math Olympiad (IMO), the pinnacle of such contests. However, no LLM has solved even a single problem from the highest tier of difficulty on these contests.

The small sample size at this highest tier leaves uncertainty about LLMs' precise capabilities, but complete saturation of these hardest problems is unlikely.


Learn more about this graph

We construct a unified problem difficulty scale by combining the Art of Problem Solving's competition ratings with US IMO coach Evan Chen's Math Olympiad Hardness Scale (MOHS). We source accuracy data from MathArena for the AIME, USAMO, and IMO, and plot models that were state-of-the-art at the time of their release. We also plot the announced scores of the unreleased models that achieved gold-medal-equivalent performance at the IMO.
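
As a rough illustration of how two rating systems might be merged onto one scale, here is a minimal Python sketch. The crosswalk anchors and the unified_rating helper are hypothetical, not the mapping actually used for this chart.

```python
# Minimal sketch: placing problems rated on either the AoPS scale or the
# MOHS onto one unified scale. Anchor values below are illustrative
# assumptions, not the crosswalk used in this analysis.

MOHS_TO_RATING = {0: 4, 5: 5, 10: 6, 20: 7, 30: 8, 40: 9, 50: 10}

def unified_rating(aops_rating: float | None = None,
                   mohs: int | None = None) -> float:
    """Place a problem on the unified scale, preferring AoPS when given."""
    if aops_rating is not None:
        return aops_rating
    if mohs is not None:
        # Snap to the nearest MOHS anchor in the crosswalk.
        nearest = min(MOHS_TO_RATING, key=lambda m: abs(m - mohs))
        return MOHS_TO_RATING[nearest]
    raise ValueError("need at least one of aops_rating or mohs")

print(unified_rating(mohs=25))  # -> 7 (a tie snaps to the lower anchor here)
```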

Publicly available models achieve >95% accuracy on problems rated up to 5. They have not fully saturated problems rated 6-7, though GPT-5 achieves >67%; the remaining gap appears to be one of reliability rather than capability, since GPT-5 solves each of these problems at least once when sampled four times. The 2025 IMO contained three problems rated 7, two rated 8, and one rated 9. Certain experimental LLMs, e.g. Google's Deep Think, solved all but the last of these. Because this sample is small, we also test the publicly available Deep Think model on two easily checkable 2024 IMO problems, rated 8 and 9, and find that it fails to solve either problem even once across 10 samples.
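
To make the reliability criterion and the small-sample caveat concrete, here is a minimal Python sketch; the run data and helper names are hypothetical, not drawn from the actual evaluations.

```python
# Minimal sketch of per-sample accuracy vs. the "solved at least once"
# criterion, plus the uncertainty left by a 0-out-of-n result.

def per_sample_accuracy(attempts: list[bool]) -> float:
    """Fraction of sampled attempts that produced a correct solution."""
    return sum(attempts) / len(attempts)

def solved_at_least_once(attempts: list[bool]) -> bool:
    """True if any sampled attempt solved the problem."""
    return any(attempts)

# Hypothetical 4-sample runs on two problems: per-sample accuracy can sit
# well below 100% even when every problem is solved at least once.
runs = {
    "problem_a": [True, False, True, True],   # 75% accuracy, solved
    "problem_b": [False, True, False, True],  # 50% accuracy, solved
}
for name, attempts in runs.items():
    print(name, per_sample_accuracy(attempts), solved_at_least_once(attempts))

# For 0 successes in n independent samples, the "rule of three" gives an
# approximate 95% upper bound of 3/n on the true solve rate.
n = 10
print(f"0/{n}: true solve rate could still be as high as ~{3 / n:.0%}")
```

Under this rough bound, failing all 10 samples on a problem is strong evidence against reliable solving, but it cannot rule out a modest true solve rate, which is why the small sample size warrants caution.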
