FrontierMath

Dashboard Benchmarks Models ECI Data About

About FrontierMath

FrontierMath is a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics – from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days.

The full FrontierMath dataset contains 350 problems. This is split into a base set of 300 problems, which we call Tiers 1-3, and an expansion set of 50 exceptionally difficult problems, which we call Tier 4. We have made 10 problems from Tiers 1-3 public, calling this frontiermath-2025-02-28-public. The remaining 290 problems make up frontiermath-2025-02-28-private. Similarly, we have made 2 problems from Tier 4 public, calling this frontiermath-tier-4-2025-07-01-public, while the remaining 48 problems make up frontiermath-tier-4-2025-07-01-private. Unless explicitly mentioned otherwise, all the numbers on this hub correspond to evaluations on the private sets. You can find more information about the public problems here.

FrontierMath was developed with funding from OpenAI, who has exclusive access to a subset of the benchmark. See our conflict of interest statement for details.

Methodology

For FrontierMath, we recommend using the log viewer on the public questions as the best way to understand the evaluation settings (e.g. click here for o4-mini-2025-04-16 with high reasoning effort).

For each FrontierMath question, the model needs to submit a Python function answer() that returns the answer. The answer is a Python object, often (although not always) an integer or a sympy object. Our implementation allows the model to reason and run Python code.

We use the following prompt, which specifies the model’s affordances:

You will be solving a challenging mathematics question. Here’s how it works:

You can:
- Think out loud and explore the problem
- Use the python tool to execute arbitrary Python code
- Submit your answer using the submit_answer tool when you are confident in your answer.
Token limits:
- There is a hard limit of 1,000,000 tokens. If you exceed this, the conversation will end immediately (even if you haven’t submitted an answer).
- If you reach 660,000 tokens (but less than the hard limit of 1,000,000), you will be forced to use the submit_answer tool in your next message. This forced submission stage is designed to give you the best chance of submitting an answer before reaching the hard token limit. But it is not a guarantee. It is still your responsibility to avoid hitting the hard limit.
- Both input and output tokens count towards the limits.
Scoring:
- If your answer is correct you will get 1 point. If it is incorrect, or if you don’t submit an answer, you will get 0 points.
Explain your reasoning to me before submitting an answer.
Tips
- I strongly recommend that you start by making a high-level plan for how you will attack the problem. If you can, think about different approaches that could be used to solve the problem. To help you stay on track, periodically summarize your key findings and potentially revise your plan.
- Before submitting, verify your answer satisfies all problem requirements. It may be worth trying a different approach if you can see that your current answer is not correct.
For using the submit_answer tool:
- Pass in the code of a Python function named ‘answer’ that:
  - Takes no parameters
  - Returns your answer as a {answer_type}
  - Prints no output
  - Contains no code comments
- When scoring your answer, the maximum runtime for the answer function is 30 seconds. The code is executed on typical commodity hardware for the year 2025.
For using the python tool:
- The tool will only return stdout (and stderr), so you must make sure to use print() to see your results. If you don’t get any output from a python tool call, you probably forgot to print.
- Example: x = 5 * 12 print(“The result is”, x) In this example, you must include the print statement. Otherwise, you won’t see the value of x.
- The tool is completely stateless and doesn’t come with anything pre-imported. This is very important. If you need modules (e.g. math, sympy), you must import them each time. You cannot access variables defined in a previous call to python, so you must re-define anything you need in each call.
- You have access to the standard library, and the following libraries (expressed in requirements.txt format):
  - galois==0.4.4
  - gmpy2==2.2.1
  - mpmath==1.3.0
  - networkx==3.4.2
  - numpy==2.1.3
  - pyadic==0.2.3
  - scipy==1.15.2
  - sympy==1.13.3
- Do not submit your answer using the python tool. Use the submit_answer tool when you’re ready to submit
- The maximum runtime for a python tool call is 30 seconds. The code is executed on typical commodity hardware for the year 2025.

Here is the problem to solve. The answer type is {answer type}.

{question}

This implementation is different from the code we used to run preliminary evaluations in the paper. It is also not the methodology used by OpenAI in their own FrontierMath evaluations, such as for the o3 and o3-mini models: we were not involved in running these evaluations. The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private).

Changelog

2026-01-23. We identified issues with two Tier 4 problems that affected prior scores. In one case the issue was with our grader, so we could re-grade solutions given by models on prior runs. The models that got this problem right were: GPT-5.2 Pro, GPT-5.2 (xhigh), GPT-5.2 (high), GPT-5.2 (medium), and GPT-5 Pro. In the other case the issue was with the problem statement itself. Here we re-ran every model that had scored over 10% on Tier 4 on this one problem. Only GPT-5.2 Pro and GPT-5.2 (high) got the corrected problem right. We adjusted the scores of all of these models accordingly.

We also fixed an error in the problem statement for public problem PLD1 in FrontierMath-2025-02-28-Public. Several models had in fact been getting the right answer to the mis-stated problem (the answer was 281718171540). Since this problem set is just for illustrative purposes, we have not updated those scores on our hub.

We bumped the benchmark version number to 1.1.4 to reflect these fixes.

2026-01-22. We upgraded our Inspect dependency, and bumped the benchmark version number to 1.1.3.

2025-12-05. We upgraded our Inspect dependency, and bumped the benchmark version number to 1.1.2.

2025-11-19. We changed our default settings to allow 2% of samples to fail in a given run without the run as a whole failing. This is usually well within standard errors, and proved more practical. We bumped the version number to 1.1.1.

2025-11-13. We increased the token budget given to models by 10×. This was in response to observing models increasingly exceeding the token budget limit. With this change, we bumped the benchmark version number to 1.1.0. We reran a selection of recent, high-performing models on this new version of the benchmark.

Notes on evaluations of specific models

DeepSeek-V3.2 (Thinking)

We ran this model on Fireworks for data security. We found that, when run on public benchmarks, the model performed equally well on Fireworks as it did when hosted by the developer (i.e., using the DeepSeek API), we found Fireworks to have a very high API error rate. We thus took a random sample of 100 problems from FrontierMath Tiers 1-3 and retried until they completed. This accounts for the larger standard error associated with this model’s score.

Gemini 2.5 Pro and Gemini 3 Pro

We encountered an unusual number of API errors when trying to benchmark these models in a timely manner after their release. In cases where a sample failed to complete due to an API error, we retried at least 10 times. If all retries failed, we marked the sample as incorrect. An earlier version of this site included gemini-2.5.-pro in this table, but a subsequent re-run (due to version change 1.1.0, documented above) encountered no issues, and so we removed that entry.

The table below shows how many samples were marked as incorrect in this way.

Model	Benchmark	Accuracy	Samples marked as incorrect after retries
gemini-2.5-pro-preview-06-05	FrontierMath-2025-02-28-Private	10% (±2%)	21/290 = 7%
gemini-3-pro-preview	FrontierMath-2025-02-28-Private	38% (±3%)	10/290=3%
gemini-3-pro-preview	FrontierMath-Tier-4-2025-02-28-Private	19% (±6%)	3/48=6%

Grok 4

For grok-4-0709, we experienced timeouts and network errors using the API in July 2025.

As a result, as of July 2025, we evaluated Grok 4 using specific scoring rules:

FrontierMath-2025-02-28-Private was evaluated using our standard settings. The record ID is gda5UeWrA8HcbDCRuLJ56H. We used the streaming API. 1/290 was not scored due to the server not sending any response. (We allow up to 1% of samples to fail without being scored).

For FrontierMath-Tier-4-2025-07-01-Private, we used a maximum output length per request of 128,000 tokens (default is no maximum), as recommended by xAI. If any requests failed due to network errors or timeouts, we moved the corresponding sample directly to the scoring phase of the evaluation (which generally causes it be be marked as incorrect). This was due to the highly time-sensitive nature of the evaluations.

Benchmark	Accuracy	Samples with API errors	Run ID
FrontierMath-Tier-4-2025-07-01-Private	2% (±2%)	8 out of 48 (16%)	QxtNUmV2L34UyrySmBLTbv

xAI compensated us for this evaluation and provided compute credits. We signed no NDA and maintained complete editorial independence: we publish all results regardless of performance.

The code used in our implementation can be found here.