The LLM Evaluation Crisis: Contamination, Saturation, and the Judge Problem

LLM evaluation is breaking down: benchmark saturation, contamination, and biased LLM-as-a-judge setups make leaderboard numbers misleading. Here is what to measure instead.

RayZ
Diagram of a saturating benchmark curve and a verbosity-biased LLM judge feeding a single leaderboard number

In early 2024, Scale AI's research team did something the rest of the field had mostly avoided: they wrote a brand-new grade-school math test, by hand, in the exact style of the GSM8K benchmark that every model reports on, and then ran the frontier on it. The new test, GSM1k, was designed to be indistinguishable in difficulty from GSM8K. If a model had learned to do grade-school arithmetic, the two scores should match. For several model families they did not. Some models dropped by as much as 13 points, and the size of the drop correlated with how often a model would spontaneously regurgitate verbatim GSM8K problems. The models had not learned arithmetic as well as the leaderboard said. They had partly learned the test.

That study (Zhang et al., 2024) is the cleanest illustration of why LLM evaluation is in trouble, but it is one symptom of three. Benchmarks are saturating faster than we can build them. The data that benchmarks are made of is leaking into training sets. And the tool we reach for when static benchmarks run out, the LLM-as-a-judge, has measurable biases that quietly reshape the rankings. None of these is a reason to stop evaluating. All of them are reasons to stop trusting a single number with a model name next to it. This piece is about what is actually breaking and what an honest evaluation looks like once you accept that a leaderboard position is a hypothesis, not a result.

Saturation: when the test runs out of headroom

Start with the most visible failure, because it looks like success. The hardest reasoning benchmarks of 2023 are effectively solved. GPQA Diamond, a set of graduate-level science questions written to be "Google-proof," tells the cleanest version of the story. In late 2023 the best models scored around 39%, well below the roughly 70% baseline of PhD-level domain experts (skilled non-experts with web access score far lower still, which is what "Google-proof" means). OpenAI's o1 reached about 77% in September 2024. By early 2026 the frontier had cleared 94%: per Epoch AI's tracker, Gemini 3-class models now sit more than 20 points above the PhD-expert baseline. That is essentially the entire usable range of the benchmark consumed in a little over two years. AIME 2025, the competition-math set meant to separate strong reasoners from the rest, is further gone still: GPT-5, Gemini 3, and the current DeepSeek line all sit in the upper-80s to mid-90s, and the very top of the range is closing on a perfect score.

A saturated benchmark is not a useless one, but it stops doing the job people still use it for. Three things break at once.

First, the ceiling compresses the signal. When five models score between 93% and 96%, the gaps are inside the noise of the test itself. GPQA Diamond has 198 questions. A one-question swing is half a percentage point, so a "two-point lead" is about four questions, several of which may be mislabeled. Ranking models by a saturated benchmark is ranking them by the residual error of the benchmark.
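The arithmetic behind that claim is quick to check. This is a rough sketch that treats every question as an independent trial with the same success probability, which is an assumption (real items differ in difficulty); the function name is illustrative.

```python
import math

def accuracy_standard_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy estimate on n_items questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)

N_ITEMS = 198  # GPQA Diamond size, as stated in the text

# How many percentage points a single question is worth.
points_per_question = 100.0 / N_ITEMS

# Normal-approximation 95% half-width for a model scoring 94%.
half_width_points = 100.0 * 1.96 * accuracy_standard_error(0.94, N_ITEMS)

print(f"one question = {points_per_question:.2f} points")
print(f"95% half-width at 94% accuracy = {half_width_points:.1f} points")
```

Under these assumptions the sampling noise alone is several questions wide, before any mislabeled items are counted.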

Second, the remaining headroom is the least representative part of the test. The questions that frontier models still miss on a near-solved benchmark are disproportionately the ambiguous ones, the mislabeled ones, and the ones with contested answers. Optimizing for them does not make a model better at the underlying skill. It makes the model better at the benchmark's idiosyncrasies, which is the opposite of what you wanted to measure.

Third, saturation hides distribution. A model can score 94% on AIME and still fail in ways that matter, because competition math is a narrow, clean, heavily-represented slice of "reasoning." The benchmark tops out long before the capability does, so the number stops tracking anything you care about while continuing to look authoritative. This is the awkward subtext of the reasoning-model surge: the benchmarks that made the capability legible are the first ones it saturates.

The honest baseline question (the first move in any evaluation) cuts straight through this. Before celebrating a frontier score, ask what the simplest credible alternative scores. On most saturated benchmarks the answer is "almost the same," and the moment the gap between your model and a year-old open-weights model is two points on a 200-item test, the benchmark has stopped being evidence.

Contamination: the benchmark is in the training set

Saturation would be a manageable problem if the scores were real. Often they are not, because the test data is in the training data.

The mechanism is mundane. Benchmarks live on the public web. GSM8K, MMLU, HumanEval, GPQA, and their solutions are on GitHub, in papers, in blog posts, in Stack Overflow answers, and in the dozens of derivative datasets that scrape all of the above. Web-scale pretraining corpora ingest all of it. By the time a model is trained, some fraction of the "held-out" test set has been seen, sometimes verbatim, sometimes paraphrased, sometimes as a worked solution that is more useful than the question alone.

The GSM1k study is valuable precisely because it isolates this effect instead of speculating about it. By building a fresh test matched to GSM8K's distribution, the authors could attribute the score gap to memorization rather than difficulty. The correlation they found, between a model's tendency to emit verbatim GSM8K text and its GSM8K-to-GSM1k drop, is the signature of contamination rather than an honest generalization gap. Notably, the strongest frontier models at the time showed little overfitting, which is the part of the result that gets quoted to dismiss the problem. That reading is too comfortable. Contamination is not uniform across models: it is worst exactly where the incentive to train on benchmark-like data is highest, and it is invisible unless you build a control set, which almost nobody does for every benchmark they cite.

Detecting contamination after the fact is hard and partial. The common approaches each see only part of the picture:

  • N-gram and substring overlap between training data and test items. Catches verbatim leakage, misses paraphrase.
  • Membership-inference signals such as a model's per-token loss or the Min-K% probability on test items relative to a reference distribution. Catches memorization, produces false positives on common text.
  • Canary strings, like the BIG-bench canary GUID, embedded in benchmark files so you can later grep training data for them. Only works if everyone respects the canary, which they do not.
  • Perturbation tests, the GSM1k approach generalized: rewrite items, swap names and numbers, reorder options, and watch how much score evaporates. This is the most reliable signal and the most expensive.
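The first approach in the list, and its blind spot, can be shown in a few lines. This is a minimal sketch using whitespace tokens and 8-grams; the window size, the tokenization, and the example strings are all assumptions chosen for illustration, not a standard detector.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)

# Made-up benchmark item and two made-up "training documents".
item = "A baker makes 12 trays of rolls each morning and sells 9 trays before noon every day"
verbatim_doc = "Worked solutions: " + item + " Answer: 3 trays remain."
paraphrase_doc = "Every morning a bakery produces a dozen trays of rolls and nine are sold by midday"

verbatim_score = overlap_fraction(item, verbatim_doc)      # verbatim leak is caught
paraphrase_score = overlap_fraction(item, paraphrase_doc)  # paraphrase slips through
print(verbatim_score, paraphrase_score)
```

The paraphrased document carries the same problem and scores zero overlap, which is why overlap checks give a lower bound on leakage rather than a clean bill of health.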

The uncomfortable conclusion is that for any benchmark old enough to be useful as a comparison, you should assume nonzero contamination and treat cross-model comparisons on it as contaminated until a perturbation or held-out variant says otherwise. This is not paranoia. It is the same reasoning move that the retrieval evaluation problem runs into from a different direction: the metric looks clean, the system underneath is leaking, and only separated, adversarial measurement exposes it.

The judge problem: measuring with a biased instrument

When static benchmarks saturate or get contaminated, teams reach for two replacements: human preference arenas and LLM-as-a-judge evaluations. Both move the goalposts to "which answer is better" on open-ended prompts, which is exactly where real products live. Both also introduce an instrument with its own systematic error, and unlike a multiple-choice benchmark, that error is correlated with the thing being measured.

Start with the automated judge, because it is now the default for fast iteration. Using a strong model to score another model's outputs is cheap, reproducible, and scales. It is also biased in ways that are well documented:

  • Position bias. Present two answers as "A then B" versus "B then A" and the judge's preferred answer changes more often than it should. Zheng et al. (2023), the MT-Bench and Chatbot Arena paper that established LLM-as-a-judge, measured this directly and recommended swapping positions and averaging. Most pipelines still score in a fixed order.
  • Verbosity bias. Judges prefer longer answers, controlling for quality. Measurements across GPT-4, Claude, and PaLM-2 judges have put the inflation at roughly 15 to 30 points of preference for the longer option. If your "improved" model just learned to write more, a verbosity-biased judge will reward it and you will ship length as if it were quality.
  • Self-preference. Judges rate their own outputs, and outputs stylistically similar to their own, above what a neutral grader would give. The self-preference bias study (2024) ties the effect to the judge recognizing its own generation style. The practical hazard: evaluating a GPT-family model with a GPT-family judge inflates the score in a way that does not transfer to users.
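A cheap first check for verbosity bias is to count how often the judge's pick is simply the longer answer. The sketch below assumes you have logged pairwise verdicts as (answer_a, answer_b, winner) tuples; that record shape and the function name are hypothetical. A high rate is only a warning sign, since longer answers are sometimes genuinely better, so compare it against the same rate computed from human labels on the same pairs.

```python
def longer_answer_win_rate(comparisons) -> float:
    """Fraction of unequal-length comparisons won by the longer answer.

    `comparisons` is an iterable of (answer_a, answer_b, winner) tuples,
    where winner is "a" or "b". Length is measured in whitespace tokens."""
    longer_wins = 0
    counted = 0
    for answer_a, answer_b, winner in comparisons:
        len_a, len_b = len(answer_a.split()), len(answer_b.split())
        if len_a == len_b:
            continue  # equal length says nothing about verbosity bias
        longer = "a" if len_a > len_b else "b"
        counted += 1
        longer_wins += int(winner == longer)
    return longer_wins / counted if counted else float("nan")

# Toy verdict log, invented for illustration.
verdicts = [
    ("short", "a much longer answer", "b"),
    ("a long detailed answer", "brief", "a"),
    ("tiny", "a long one here", "a"),
    ("same", "same", "a"),
]
rate = longer_answer_win_rate(verdicts)
print(f"longer answer won {rate:.0%} of unequal-length comparisons")
```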

None of these biases is fatal on its own. The danger is that they are not random noise, they are directional, and they compound with the thing you are optimizing. If you tune a model against a judge with verbosity and self-preference bias, gradient pressure will discover both, and your eval will keep improving while the product gets worse. That is Goodhart's law operating inside your eval harness: the measure became a target, so it stopped being a measure.

The fixes are known and underused: shuffle option order and average over both positions, control for length explicitly or cap it, use a judge from a different model family than the one under test, score per-criterion rather than asking for a single holistic verdict, and calibrate the judge against a human-labeled slice before trusting it on the rest. A judge you have not validated against human labels is a vibe with an API key.
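Calibrating against a human-labeled slice needs an agreement number that discounts chance. One common choice is Cohen's kappa; the sketch below implements it directly, with toy labels invented for illustration. Which threshold counts as "good enough" is a judgment call this sketch does not make.

```python
from collections import Counter

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label lists of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Raw agreement: share of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance from each rater's label frequencies.
    human_freq, judge_freq = Counter(human), Counter(judge)
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy slice: which answer ("A" or "B") each rater preferred on 8 prompts.
human_labels = ["A", "A", "B", "B", "A", "B", "A", "B"]
judge_labels = ["A", "A", "B", "A", "A", "B", "A", "A"]
kappa = cohens_kappa(human_labels, judge_labels)
print(f"raw agreement 75%, kappa = {kappa:.2f}")
```

Here the judge agrees with humans on six of eight items, but because it leans toward "A", half of that agreement is what chance would produce, and kappa lands at 0.5.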

When humans are the judge, the gaming moves upstream

Human preference arenas were supposed to be the contamination-proof, bias-resistant answer. You cannot memorize a held-out human, and crowd preferences are real signal. Then the incentives arrived. The Leaderboard Illusion (Singh et al., 2025), a 68-page audit of Chatbot Arena from researchers across Cohere Labs, AI2, Princeton, Stanford, Waterloo, and UW, documented how the most-cited human-preference leaderboard can be optimized rather than simply measured.

The findings worth internalizing: providers could test many private variants in parallel, up to 27 in a single month in the period studied, and publish only the best, which turns the leaderboard into a multiple-comparisons search where the winner is partly the luckiest draw. Proprietary models were sampled in more battles and de-listed less often than open-weights models, skewing the data each model's rating was built from. And prompts repeat: 7.3% of December 2024 prompts reappeared verbatim in January 2025, rising to about 9% by semantic similarity, which is exploitable by anyone who can see the distribution. LMArena disputed parts of the framing and noted open models made up a large share of traffic, and that back-and-forth is healthy. The structural point survives the rebuttal: any leaderboard with stakes attached becomes a target, and a target is no longer a clean measurement.
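The selection effect in the first finding is easy to reproduce in simulation. The sketch below assumes every private variant has identical true quality and that each measured rating carries independent Gaussian noise. Both are simplifying assumptions, and the units are standard deviations of measurement noise, not Arena rating points.

```python
import numpy as np

def best_of_k_inflation(k: int, n_trials: int = 20_000,
                        rating_sd: float = 1.0, seed: int = 0) -> float:
    """Average reported rating when only the best of k variants is published.

    Every variant has true skill 0; each measured rating is drawn from
    N(0, rating_sd^2). The return value is the mean of the per-trial maximum,
    i.e. the upward bias created purely by selecting the luckiest draw."""
    rng = np.random.default_rng(seed)
    ratings = rng.normal(0.0, rating_sd, size=(n_trials, k))
    return float(ratings.max(axis=1).mean())

single = best_of_k_inflation(1)     # publish the only variant you tested
selected = best_of_k_inflation(27)  # publish the best of 27 private variants
print(f"k=1: {single:+.2f} sd, k=27: {selected:+.2f} sd")
```

With one variant the reported rating is unbiased; with 27 identical variants the published one sits roughly two noise standard deviations above its true level, with no change in capability.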

This is the same lesson as contamination, one level up. Contamination games a static test by leaking its contents. Arena optimization games a dynamic test by exploiting its sampling and resubmission rules. In both cases the number goes up and the underlying capability does not move with it.

Agentic evaluation: where contamination and reliability collide

Everything above gets worse when the thing under test is an agent rather than a single response, because agentic evaluation stacks a second failure mode on top of contamination. Coding agents are the clearest case, because they have the most-cited benchmark and the worst contamination exposure.

SWE-bench, and the human-validated 500-problem SWE-bench Verified subset that OpenAI released in August 2024, scores a model on its ability to resolve real GitHub issues from real open-source repositories. The problem is right there in the description: real GitHub issues, with their real merged pull requests, are public, and they were public well before the model was trained. The benchmark asks a model to reproduce a fix that, in many cases, exists verbatim in its training data. There are documented cases of agents recovering the actual merged patch rather than reasoning to it, sometimes citing the fixing commit. A high SWE-bench Verified number conflates two very different abilities: solving the bug, and recalling the public solution. The benchmark cannot tell them apart, and neither can you from the score alone.

The defense that actually works is the GSM1k logic turned into a moving window. LiveCodeBench timestamps every problem with its contest release date, drawn from LeetCode, AtCoder, and Codeforces, so you can evaluate a model only on problems published after its training cutoff. Provided the stated cutoff is accurate and no later fine-tuning data postdates it, the post-cutoff slice cannot have leaked into training, and scores on it are typically lower and more honest than the all-time number. Any coding evaluation that does not condition on a release date is reporting a blend of skill and recall, with no way to say how much of each.
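Conditioning on release date is a filter and two averages. The sketch below uses a hypothetical `Problem` record and made-up results; it is not LiveCodeBench's actual data format or API, and it takes the model's stated training cutoff at face value.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Problem:
    problem_id: str
    release_date: date
    solved: bool  # did the model under test solve it?

def split_by_cutoff(problems, training_cutoff: date):
    """Partition problems into (released on or before cutoff, released after)."""
    before = [p for p in problems if p.release_date <= training_cutoff]
    after = [p for p in problems if p.release_date > training_cutoff]
    return before, after

def solve_rate(problems) -> float:
    """Fraction solved, or NaN for an empty slice."""
    if not problems:
        return float("nan")
    return sum(p.solved for p in problems) / len(problems)

# Invented results for a model with an assumed cutoff of 2024-06-01.
results = [
    Problem("p1", date(2024, 1, 15), True),
    Problem("p2", date(2024, 3, 2), True),
    Problem("p3", date(2024, 5, 20), True),
    Problem("p4", date(2024, 7, 9), True),
    Problem("p5", date(2024, 8, 30), False),
    Problem("p6", date(2024, 10, 12), False),
]
pre, post = split_by_cutoff(results, date(2024, 6, 1))
print(f"all-time: {solve_rate(results):.2f}, "
      f"pre-cutoff: {solve_rate(pre):.2f}, post-cutoff: {solve_rate(post):.2f}")
```

Reporting the pre-cutoff and post-cutoff rates side by side makes the recall component visible instead of burying it in one blended number.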

Then there is reliability, the failure mode unique to agents. A single response is right or wrong once. An agent runs a stochastic trajectory of tool calls, and the same task on the same agent can succeed on one attempt and take an irreversible wrong action on the next. τ-bench, Sierra's tool-agent benchmark built around airline and retail customer-service tasks, made this measurable by reporting pass^k: the probability that an agent succeeds on all k independent attempts, not just one. The gap between pass@1 and pass^k is brutal. Agents that look strong at a single attempt degrade sharply when you demand they succeed two, three, or five times in a row, because consistency is a different and harder property than capability. A single-shot success rate, which is what almost every agent demo reports, hides this entirely.
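Both metrics can be estimated from the same data: run each task n times and count c successes. The sketch below uses the standard combinatorial estimators, which to my understanding match how pass^k is defined for τ-bench, though the function names and the toy counts are my own assumptions.

```python
from math import comb

def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k attempts drawn from the n trials succeeded)."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k attempts drawn from the n trials succeeded), i.e. pass^k."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)

def mean_pass_hat_k(task_counts, k: int) -> float:
    """Average pass^k over tasks; task_counts is a list of (n, c) pairs."""
    return sum(pass_hat_k(n, c, k) for n, c in task_counts) / len(task_counts)

# Three invented tasks, 8 trials each, with 8, 6, and 4 successes.
tasks = [(8, 8), (8, 6), (8, 4)]
print(f"pass^1 = {mean_pass_hat_k(tasks, 1):.2f}")  # same as average success rate
print(f"pass^4 = {mean_pass_hat_k(tasks, 4):.2f}")
print(f"task 2: pass@4 = {pass_at_k(8, 6, 4):.2f}, pass^4 = {pass_hat_k(8, 6, 4):.2f}")
```

On the second task the agent succeeds 75% of the time, so any four attempts contain at least one success, yet the chance that all four succeed is about 21%. The same trial data supports both the flattering number and the honest one.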

Put the three together and an agentic benchmark number is conflating at least three things: did the agent solve the task, did it recall a public solution, and will it do so again on a retry. Honest agentic evaluation separates all three. Condition on post-cutoff problems so the score is not recall. Report pass^k, not just pass@1, so the score reflects reliability rather than a lucky trajectory. And count irreversible wrong actions taken along the way, not just final success, because in production the cost of an agent that succeeds 80% of the time and deletes data the other 20% is not 80% of the value. This is the eval discipline that production agents require and rarely get, and it is why "five nines" is the wrong frame: the right questions are whether the downside is bounded and whether a retry is cheap, both of which pass^k and irreversible-action counts measure directly.

What honest evaluation looks like

The point of cataloguing failures is not despair, it is discipline. Evaluation is a craft with known practices, most of which the "AI Engineer" wave skipped because software testing taught us to ask "did the assertion pass" and never taught us to ask "does the system behave correctly over a distribution, with what variance, on what tail." Here is the standard worth holding.

Report a distribution, not a point. A single accuracy number with no interval is half a result. Bootstrap a confidence interval over the eval set and report it. If two models' intervals overlap, you do not have a ranking, you have a tie, and saying so is the honest move.

python
import numpy as np

def bootstrap_ci(correct, n_resamples=10_000, alpha=0.05, seed=0):
    """95% CI for mean accuracy via bootstrap resampling.
    `correct` is a 0/1 array of per-item outcomes."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    n = len(correct)
    means = rng.choice(correct, size=(n_resamples, n), replace=True).mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)

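# Two models on a 198-item benchmark (illustrative 0/1 outcomes, not real results)
model_a_results = np.array([1] * 187 + [0] * 11)  # 187/198 correct
model_b_results = np.array([1] * 184 + [0] * 14)  # 184/198 correct
acc_a, ci_a = bootstrap_ci(model_a_results)  # about 0.944, interval roughly +/- 3 points
acc_b, ci_b = bootstrap_ci(model_b_results)  # about 0.929, interval roughly +/- 3 points
# Overlapping intervals: the "1.5 point lead" is not significant on this set.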

On a 198-item benchmark like GPQA Diamond, the 95% interval is roughly plus or minus three points before you account for label noise. That single fact invalidates most of the "model X beats model Y" claims made on saturated benchmarks.
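One refinement, offered as a sketch rather than a prescribed method: when both models are scored on the same items, comparing two separate intervals is conservative, because it ignores that the models tend to miss the same questions. Resampling items jointly and bootstrapping the accuracy difference uses that pairing. The function name and the toy outcome arrays are illustrative.

```python
import numpy as np

def paired_bootstrap_diff(correct_a, correct_b, n_resamples=10_000, alpha=0.05, seed=0):
    """CI for accuracy(A) - accuracy(B) when both are scored on the same items.

    Items are resampled jointly, so each resample keeps A's and B's
    outcomes on the same question together."""
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both models must be scored on the same items")
    rng = np.random.default_rng(seed)
    n = len(a)
    idx = rng.integers(0, n, size=(n_resamples, n))  # resampled item indices
    diffs = (a[idx] - b[idx]).mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float((a - b).mean()), (float(lo), float(hi))

# Toy per-item outcomes on 198 items: A misses items 0-10, B misses items 5-18.
outcomes_a = np.ones(198)
outcomes_a[0:11] = 0   # 187/198 correct
outcomes_b = np.ones(198)
outcomes_b[5:19] = 0   # 184/198 correct
diff, (diff_lo, diff_hi) = paired_bootstrap_diff(outcomes_a, outcomes_b)
print(f"difference = {diff:+.3f}, 95% CI = ({diff_lo:+.3f}, {diff_hi:+.3f})")
```

If the interval for the difference contains zero, as it does for this toy data, the honest report is still a tie.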

Build or buy a private held-out set. The only durable defense against contamination is data the model has never seen and you never publish. Keep a private eval set drawn from your real distribution, rotate it, and never commit it to any repo that could be scraped. Public benchmarks are for rough triage. Decisions get made on private data.

Perturb before you trust. For any public benchmark you must cite, run the cheap GSM1k move: rewrite a sample of items, swap entities and numbers, shuffle answer options, and measure the drop. A model that loses ten points under trivial perturbation has told you the public score is contaminated or brittle, and you should weight it accordingly.
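The cheapest perturbation, shuffling answer options, can be sketched as follows. The item format (a dict with `prompt`, `options`, `answer_index`) and the two toy "models" are assumptions made for illustration: one answers by content, the other has memorized which position was correct in the public version.

```python
import random

def shuffle_options(question: dict, rng: random.Random) -> dict:
    """Copy of a multiple-choice item with options reordered; answer_index
    is updated so it still points at the same correct option."""
    order = list(range(len(question["options"])))
    rng.shuffle(order)
    return {
        "prompt": question["prompt"],
        "options": [question["options"][i] for i in order],
        "answer_index": order.index(question["answer_index"]),
    }

def accuracy(items, predict) -> float:
    """Share of items where predict(item) returns the correct option index."""
    return sum(predict(it) == it["answer_index"] for it in items) / len(items)

def perturbation_drop(items, predict, seed: int = 0) -> float:
    """Accuracy on the original items minus accuracy on shuffled copies."""
    rng = random.Random(seed)
    perturbed = [shuffle_options(it, rng) for it in items]
    return accuracy(items, predict) - accuracy(perturbed, predict)

def make_item(i: int) -> dict:
    """Synthetic item: the correct sum is placed at position i % 4."""
    options = [str(2 * i + d) for d in (1, 2, 3)]
    answer_index = i % 4
    options.insert(answer_index, str(2 * i))
    return {"prompt": f"What is {i} + {i}?", "options": options, "answer_index": answer_index}

items = [make_item(i) for i in range(50)]
truth = {it["prompt"]: it["options"][it["answer_index"]] for it in items}
memo = {it["prompt"]: it["answer_index"] for it in items}

solver = lambda it: it["options"].index(truth[it["prompt"]])  # finds the right content
memorizer = lambda it: memo[it["prompt"]]                     # recalls the old position

print("solver drop:", perturbation_drop(items, solver))
print("memorizer drop:", perturbation_drop(items, memorizer))
```

Both toy models score perfectly on the original items. Only the shuffle separates them, which is the whole point of perturbing before trusting.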

Separate the dimensions. A single quality score hides which part is broken. Retrieval systems need context precision and recall measured apart from faithfulness and answer relevance, the argument made in detail in the RAG evaluation breakdown. Reasoning systems need accuracy split from calibration and from cost-per-correct-answer. Agents need task success split from the number of irreversible wrong actions taken along the way. Conflated metrics are how pipelines regress silently.

Validate your judge. If you use an LLM judge, treat it as an instrument that needs calibration. Run it against a human-labeled slice, measure its agreement, swap answer positions and average, control for length, and use a different model family than the system under test. A position-controlled pairwise judge is a few lines:

python
def judge_pairwise(judge, prompt, ans1, ans2):
    """Score both orderings and average to cancel position bias.
    Returns P(ans1 preferred) in [0, 1]."""
    p_forward = judge.prefers(prompt, a=ans1, b=ans2)   # 1 if 'a' wins
    p_reverse = judge.prefers(prompt, a=ans2, b=ans1)   # 1 if 'a' (=ans2) wins
    # ans1 wins forward when a-wins; ans1 wins reverse when b-wins
    return (p_forward + (1 - p_reverse)) / 2

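# A consistent judge gives p_forward == 1 - p_reverse. If instead p_forward and
# p_reverse are often equal (the same slot wins in both orders), the judge is
# position-driven, not quality-driven, and the raw single-order score was noise.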

Tie the metric to a decision. The reason to measure is to choose: ship or not, this model or that one, this prompt or the old one. An eval that cannot change a decision is theater. Before building a harness, write down the decision it informs and the threshold that would flip it. If no number would change your mind, you are not evaluating, you are collecting reassurance.

This is the same scientific frame that production AI agents demand: an agent pointed at an unsolved problem is a model in disguise, and it deserves a hypothesis, a baseline, an eval harness, and an ablation, not a prompt-engineering lap around the problem. The reliability conversation is not "did it pass," it is "how does it behave across the distribution, and what does the tail cost."

The reality gap, restated for evaluation

Every beat this site cares about comes back to the gap between what a demo shows and what production survives. Evaluation is where that gap is supposed to be measured, which makes a broken evaluation the most expensive kind of broken. A contaminated benchmark, a saturated leaderboard, or a biased judge does not just give you a wrong number. It gives you a confident wrong number, and confident wrong numbers are what get shipped.

The fix is not a better single benchmark. There will not be one, because any benchmark good enough to matter becomes a target the moment it matters. The fix is methodological: distributions over points, private data over public, perturbation over trust, separated dimensions over a single score, validated judges over raw ones, and every metric tied to a decision it could actually flip. None of that is novel. It is ordinary measurement discipline, imported from the experimental sciences that have been fighting contamination and instrument bias for a century. The LLM field is rediscovering it the hard way, one saturated leaderboard at a time.

Key Takeaways

  1. A leaderboard number is a hypothesis, not a result. On saturated benchmarks the gaps between frontier models are inside the test's own noise; a "two-point lead" on a 198-item set like GPQA Diamond is about four questions, some mislabeled.
  2. Assume nonzero contamination on any benchmark old enough to compare with. The GSM1k study showed up to 13-point drops for some model families on a fresh, distribution-matched test, with the drop correlating with verbatim memorization of the original.
  3. Contamination detection is partial. N-gram overlap misses paraphrase, membership inference produces false positives, canary strings require cooperation. Perturbation testing (rewrite, swap, shuffle, re-measure) is the most reliable signal and the most expensive.
  4. LLM-as-a-judge has directional, not random, bias. Position bias, verbosity bias (roughly 15 to 30 points toward longer answers), and self-preference compound with whatever you optimize, so an unvalidated judge will reward length and stylistic mimicry as if they were quality.
  5. Human preference arenas move the gaming upstream, not away. The Leaderboard Illusion documented private-variant farming (up to 27 in a month), sampling skew, and ~7-9% prompt repetition, all of which let a model be optimized for the arena rather than measured by it.
  6. Report distributions, not points. Bootstrap a confidence interval; if intervals overlap, report a tie. This single practice invalidates most "model X beats Y" claims on near-saturated benchmarks.
  7. Decisions get made on private, rotated, perturbed data. Public benchmarks are for triage. A private held-out set drawn from your real distribution is the only durable defense against contamination.
  8. Tie every metric to a decision it could flip. An eval that no number could change is theater. Write down the decision and the threshold before you build the harness.
