The Gemini 3 release included a massive table showing how the model was state-of-the-art on nineteen diverse benchmarks. Such tables are commonplace by now, but they add up to an odd statistical situation. Benchmarks ostensibly measure different things, but since models tend to improve on many benchmarks at once, the dataset of benchmark scores is dominated by a single “General Capability” dimension.

In this post, I’ll describe the statistics of this dataset, look into what’s left when you factor out this dominant dimension (hint: it’s “Claudiness”), and discuss how this relates to an important question about cross-task generalization.

Benchmarking data is dominated by a single underlying dimension

This is one of the lessons of our recent work on the Epoch Capabilities Index (ECI), which combines thirty-nine benchmarks into a single capabilities score. If benchmarks were generally uncorrelated with each other, you’d expect to see large residuals: the benchmark scores predicted by a model’s ECI number wouldn’t match the model’s actual benchmark scores. As it turns out, we see a very good match. In other words, our nominally high-dimensional dataset is well-approximated by just a single dimension.

To look beyond this dimension, we can do a Principal Component Analysis (PCA). This basically asks: if we build synthetic “components” as weighted sums of the different benchmark scores, how much of the variance in the dataset can we account for with the fewest components?
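As a rough illustration of the machinery (not the actual ECI pipeline), here is a minimal PCA sketch in Python. The scores matrix is a synthetic stand-in for the real dataset, and it assumes missing values have already been imputed and the columns standardized:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: rows are models, columns are benchmarks,
# entries are standardized benchmark scores with no missing values.
rng = np.random.default_rng(0)
scores = rng.normal(size=(60, 39))

pca = PCA()
pca.fit(scores)

# Share of the total variance captured by each component; in the real
# dataset, the first entry is roughly 0.5.
print(pca.explained_variance_ratio_)

# Weights ("loadings") the first component places on each benchmark.
print(pca.components_[0])
```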

When we do this on the raw data underlying ECI, the first component captures about half the variance of the dataset.[1][2] The table below shows the weights this component places on the different benchmarks, listing the top benchmarks that together account for 80% of the total weight. Note that the weights are all positive and not very dispersed. That is, PCA also finds a single “general capability” component.

| Benchmark | 1st Principal Component Weight |
|---|---|
| GPQA diamond | 0.30 |
| Aider polyglot | 0.29 |
| OTIS Mock AIME 2024-2025 | 0.28 |
| LiveBench | 0.27 |
| SimpleBench | 0.27 |
| Balrog | 0.27 |
| WeirdML | 0.26 |
| FrontierMath Tiers 1–3 | 0.25 |
| CadEval | 0.24 |
| SWE-bench Verified | 0.23 |
| GSO-Bench | 0.22 |
| FrontierMath Tier 4 | 0.19 |

Moving beyond the first principal component, the chart below shows the magnitudes of all the principal components, plotted against the size of components found in randomly generated data of the same shape (i.e., a parallel analysis). We see the single large component mentioned above, a second component that is borderline significant, and a remainder of small components consistent with noise.
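Here is a minimal sketch of that kind of parallel analysis, again on a synthetic stand-in matrix; the 95th-percentile threshold and the number of random draws are my own assumptions, not necessarily what the ECI analysis uses:

```python
import numpy as np
from sklearn.decomposition import PCA

def parallel_analysis(data, n_draws=100, seed=0):
    """Compare each principal component's variance to the variance of
    components found in randomly generated data of the same shape."""
    rng = np.random.default_rng(seed)
    observed = PCA().fit(data).explained_variance_
    random_variances = [
        PCA().fit(rng.normal(size=data.shape)).explained_variance_
        for _ in range(n_draws)
    ]
    # Components whose variance beats the 95th percentile of the random
    # baseline are treated as meaningful rather than noise.
    baseline = np.percentile(random_variances, 95, axis=0)
    return observed, baseline

scores = np.random.default_rng(1).normal(size=(60, 39))  # stand-in data
observed, baseline = parallel_analysis(scores)
print(observed > baseline)
```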

Benchmarking data shows a smaller “Claudiness” dimension

What is this second component? Here are the top weights, by absolute value, that this component assigns to the benchmarks; the benchmarks shown again account for over 80% of the total weight. By construction, this component is orthogonal to the main “general capability” component. When I first saw it, I said it looked something like, “good at agentic tasks, but bad at vision… and also bad at math?”

| Benchmark | 2nd Principal Component Weight |
|---|---|
| The Agent Company | 0.42 |
| OSUniverse | 0.37 |
| VideoMME | -0.35 |
| OSWorld | 0.35 |
| Factorio Learning Environment | 0.33 |
| VPCT | -0.28 |
| FrontierMath Tier 4 | -0.26 |
But I showed it to a colleague and he just said, “it’s Claude”. He was right. Here are the top five models on this component, as well as the bottom five.

| Most Claude-y | Least Claude-y |
|---|---|
| Claude Sonnet 3.5 (Oct '24) | GPT-5 |
| Claude Sonnet 3.5 (Jun '24) | GPT-4o-mini |
| Claude Sonnet 3.7 | GPT-4o (Nov '24) |
| Claude Sonnet 4 | GPT-4o (Aug '24) |
| Claude Opus 4 | o4-mini |

I think this second component shows that benchmarks aren’t entirely “general capability” plus “noise”, even if that is a pretty good approximation. Even though this second component is only borderline statistically significant, I think it’s fair to say that it aligns with the public’s general sense of Anthropic’s priorities, i.e. they seem to be making Claude like this on purpose. This updated my thinking a bit on a broader question, as I’ll explain next.

Is the “general capability” dimension deep, or contingent?

The big question is why a single dimension captures so much of the variance in benchmark scores. I can think of two possible reasons, corresponding to two possible worlds we may be in. I’ll call these worlds “deep” and “contingent”.

In the “deep” world, there is a single underlying ability that governs how well models do at superficially unrelated tasks. In this world, the only thing a model developer can do is make this ability go up. If they succeed, their model gets better at everything.

In the “contingent” world, there are many orthogonal abilities that models can have. These are orthogonal in the sense that model developers have to do completely unrelated work to get a model to improve on each ability. Still, in the world I’m imagining, customers demand models with many capabilities, and so developers put in the work to make this happen.

Which world more resembles our own? Sometimes in the history of AI, things have looked like the contingent world. AlphaGo was superhuman at Go, but it was nonsense to ask it to do anything else. At other times, things have looked like the deep world. When LLMs were picking up steam, next-token prediction on relatively uncurated web text was tearing through NLP tasks that had previously been dominated by specialized models.[3]

To a first approximation, benchmark scores look the same in both worlds. But the existence of the Claudiness dimension feels to me like a bit of evidence for the “contingent” world. Anthropic has focused on making models that are state-of-the-art at agentic coding. Without additional focused investment, the models turn out not to be exceptional at advanced math. There is surely some generalization across tasks, but perhaps this is a sign of its limits.

A trillion dollar question

The Claudiness dimension is not very strong evidence for the contingent view. Stronger evidence might be that model developers are investing heavily in collecting specialized data, like reinforcement learning (RL) environments for industry verticals. Even so, it’s possible that they’re making those investments and that RL nonetheless shows excellent cross-task generalization.

One way to test this would be to find an uncontaminated benchmark that measures something unusual, and see if it correlates with the “General Capability” dimension. Unfortunately, we don’t know what counts as “unusual” for models because we don’t know what they saw in training. Also, I suspect there’s a selection effect where benchmarks that show top models scoring poorly tend to capture attention. Still, this seems worth pursuing.

Even if the explanation for what we see in benchmark data is that model developers are pursuing an “everything at once” strategy, they have the resources and the scalable architectures necessary to keep it going. In other words, they can keep making all the benchmarks go up so long as they can get the right training data plus enough compute to make use of it.

What does this mean for the future? I like how Steve Newman put it recently: “how far can you get by simply putting an insane number of things in distribution” is one of the trillion dollar questions.

I doubt that there are in-principle limits to putting everything in distribution, but if we’re more in the contingent world then we shouldn’t expect much of a tailwind from generalization either. Every percentage point of improvement on every benchmark must be paid for. Here I think we should expect to see capabilities continue to improve quite generally, but only so long as the flywheel of growth and investment continues to allow developers to devote resources to actively making this happen.


Notes

  1. Methodology: we filter our dataset to benchmarks created in 2023 and beyond, and to models with at least 8 benchmark scores. We combine different reasoning settings for the same model, taking the max score. We use k-nearest neighbors to impute missing data, first transforming [0-1] scores by a logit, weighting by distance, and using 5 neighbors. We then do PCA. Data and code can be found here. (See the sketch after these notes.)

  2. This main finding accords with previous work, although we now have a much larger dataset of benchmarks. 

  3. Even now there are some prominent specialized models, like Cursor’s Tab or OpenAI’s Codex series. But it seems fair to characterize the current landscape as dominated by models that at least try to “do it all”.
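For concreteness, here is a minimal sketch of the imputation-plus-PCA pipeline described in note 1. The data is synthetic, the clipping threshold is my own assumption, and the benchmark filtering and handling of reasoning settings are omitted; the linked code is the authoritative version:

```python
import numpy as np
from scipy.special import logit
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA

# Synthetic stand-in: rows are models, columns are benchmarks,
# scores in [0, 1], with NaN where a model has no reported score.
rng = np.random.default_rng(0)
raw = rng.uniform(size=(60, 39))
raw[rng.random(raw.shape) < 0.3] = np.nan

# Clip away exact 0s and 1s (threshold assumed) so the logit is finite.
transformed = logit(np.clip(raw, 1e-3, 1 - 1e-3))

# Impute missing scores from the 5 nearest models, weighted by distance.
imputed = KNNImputer(n_neighbors=5, weights="distance").fit_transform(transformed)

# PCA on the imputed, logit-transformed scores.
pca = PCA().fit(imputed)
print(pca.explained_variance_ratio_[:3])
```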