Why AI Evaluation Regimes Are Bad
How the flagship project of the AI Safety Community ended up helping AI Corporations.
I care about preventing extinction risks from superintelligence. This de facto makes me part of the “AI Safety” community, a social cluster of people who care about these risks.
In the community, a few organisations are working on “Evaluations” (which I will shorten to Evals). The most notable examples are Apollo Research, METR, and the UK AISI.
Evals are an influential cluster of safety work, wherein auditors outside of the AI Corporations racing for ASI evaluate new AI systems before they are deployed and publish their findings.
Evals have become a go-to project for people who want to prevent extinction risks. I would say they are the primary project for those who want to work at the interface of technical work and policy.
Incidentally, Evals Orgs consistently avoid mentioning extinction risks. This makes them an ideal place for employees and funders who care about extinction risks but do not want to be public about them. (I have written about this dynamic in my article about The Spectre.)
Sadly, despite the prominence it has taken in the “AI Safety” community, I believe that the Evals project is harmful. I believe that it should not receive further attention and investment, and I consider it plausible that it should be interrupted.
I am not exaggerating for shock value. This article will explain why I think Evals are harmful. My thinking primarily relies on three beliefs:
1) The Theory of Change behind Evals is broken.
2) Evals move the burden of proof away from AI Corporations.
3) Evals Organisations are not independent of AI Corporations, despite claiming otherwise.
—
While Evals Orgs have produced studies that we sometimes mentioned at ControlAI, they have always been much less central to our work than the Center for AI Safety’s statement. Indeed, the top AI experts explicitly warning about extinction risks is more useful than decontextualised technical results.
Even when we use this type of result, we rarely mention Evals Orgs anymore. We now tend to use Palisade’s report on resistance to shutdown or Anthropic’s results on blackmail.
From my point of view, once you factor in their negative externalities, Evals clearly do not justify the prominence they have and the resources they command. With all that said…
1) The Theory of Change behind Evals is broken
Briefly put, Evals only make sense in the presence of regulations which do not exist, and they crowd out effort at passing such regulations.
—
It is usually quite hard to debunk the plans of an organisation. This is because said plans are rarely laid out for everyone to see. However, Apollo Research has carefully laid out their theory of change in a document, for which I am very thankful.
On inspection, though, its core assumptions are clearly wrong! Here are the first two:
1) Regulations demand external, independent audits. […]
2) Regulations demand actions following concerning evaluations. […]
There is no such regulation.
Given that no such regulations exist, it is astonishing to me that people care so much about Evals instead of advocating for regulations.
Evals are entirely dependent on the existence of such regulation.
Even worse, as I will show later, Evals Orgs have put themselves in a position where their incentives are sometimes to fight alongside AI Corps against said regulations.
—
Specifically, Evals Orgs all rely on the assumption that the development and/or deployment of systems found to have dangerous capabilities will be prevented.
From Apollo Research:
If successful, AI system evaluations would identify misaligned systems and systems with dangerous capabilities, thus helping to reduce the risk that such systems would be given affordances that let them have damaging effects on the world (e.g. deployment).
[…]
Such demonstrations could encourage these stakeholders to understand the gravity of the alignment problem and may convince them to propose regulation mandating safety measures or generally slowing down AI progress.
From METR:
METR’s mission is to develop scientific methods to assess catastrophic risks stemming from AI systems’ autonomous capabilities and enable good decision-making about their development.
[…]
We need to be able to determine whether a given AI system carries significant risk of a global catastrophe.
From the UK AISI’s “Approach to Evaluations” document:
On the second day of the Bletchley Summit, a number of countries, together with the leading AI companies, recognised the importance of collaborating on testing the next generation of AI models, including by evaluating for potentially harmful capabilities.
[…]
Our work informs UK and international policymaking and provide technical tools for governance and regulation.
In other words, the work of Evals Orgs only makes sense if AI Corporations are forbidden from deploying systems with dangerous capabilities, and if said capabilities are not too dangerous before deployment.
Their work is thus dependent on other people working hard to make it illegal to develop and deploy AI systems with dangerous capabilities.
In practice, as far as I am aware, no company has ever been compelled to do anything as a result of external Evaluations. I believe no model has ever been blocked, postponed, or constrained before deployment, let alone during development.
As a result, it seems clear to me that until we actually ban “dangerous capabilities”, their work is not worth much.
2) Evals move the burden of proof away from AI Corporations
So far, I have mostly focused on the fact that the theory of change behind Evals is broken. But I believe that Evals Orgs are actually harmful.
—
First, let’s give some context on extinction risks from AI.
In 2023, the top experts in the field warned about the risk of extinction from AI. However, although most agree that there are risks of extinction, there is little agreement (let alone consensus) on anything else.
The top AI experts disagree wildly on the probability of said extinction, on when the first AGI systems may be built, on how to make AGI systems safe, and as METR itself notes: even on the definition of AGI.
These are all signs of a pre-paradigmatic field, one where experts cannot even agree on what the facts of the matter are. When, despite this, experts warn about the literal extinction of humanity, it stands to reason that conservatism is warranted.
In other words, AI Corps should not be allowed to pursue R&D agendas that risk killing everyone until we figure out what is going on. If they nevertheless want to continue, they ought to prove beyond a shadow of doubt that what they are doing will not kill everyone.
If there are reasonable disagreements among experts about whether an R&D program is about to lead to human extinction, that should absolutely be enough warrant to interrupt it.
In my personal experience, this line of reasoning is obvious to lay people and many policy makers.
Still in my personal experience: the closer someone is to the sphere of influence of AI Corps, the less obvious conservatism is to them.
—
Yet Evals Orgs reverse this principle. They start with the assumption that AI Corps should be allowed to continue unimpeded, until a third party can demonstrate that a specific AI system is dangerously capable.
This is a complete reversal of the burden of proof! Evals Orgs put on the public the onus of proving that a given AI system is dangerously capable. To the extent that they recommend anything be done, it is only in cases where the public detects that something is wrong.
This has it exactly backwards.
The top AI Experts have already warned about the extinction risks of AI systems. Many are forecasting scenarios where the risks are concentrated in development rather than deployment.
Evals Orgs themselves admit that they cannot establish the safety of an AI system! For instance, the UK AISI straightforwardly states:
AISI’s evaluations are thus not comprehensive assessments of an AI system’s safety, and the goal is not to designate any system as “safe.”
In this context, of course, AI Corps should be the ones establishing that their R&D programs are not likely to cause human extinction. It should not fall to third-party evaluators to demonstrate that individual systems are free of risk.
—
As established in the first section, Evals only make sense in the context of constraining regulations. But instead, they have diverted attention and resources away from the work on such regulations.
Furthermore, they have not only diverted resources away from what was needed; they have been actively harmful. Their work lifts the burden of proof off AI Corps and punts it onto the public, through NGOs and government agencies.
3) Evals Organisations are not independent of the AI Corporations
Finally, Evals Orgs have been harmful by conveying a false sense of independence from AI Corps. In my experience, their silence on matters of extinction is taken as neutral confirmation that the situation at AI Corps is not urgent.
For context: all of them loudly proclaim the importance of “external”, “independent” or “third-party” evaluators.
Apollo’s document mentions 9 reasons for why external evaluators are important.
METR puts in bold “that the world needs an independent third-party” in their mission statement.
The UK AISI states clearly “We are an independent evaluator” in their “Approach to Evaluations” document.
But unfortunately, Evaluators are not independent, not even close:
1) In practice, their incentives are structured so that they are dominated by AI Corporations. We are far from a standard where evaluators have leverage over the corporations they evaluate.
2) Their staff is deeply intertwined with that of AI Corporations.
—
On the first point: AI Corporations decide whether evaluators get API access, when they get it, and under what NDA terms.
The CEO of METR was quite candid about this dynamic in an 80K interview:
This is not the case. I wouldn’t want to describe any of the things that we’ve done thus far as actually providing meaningful oversight. There’s a bunch of constraints, including the stuff we were doing was under NDA, so we didn’t have formal authorisation to alert anyone or say if we thought things were concerning.
And yet, the Evals Orgs proudly showcase the AI Corporations they work with, deeming them “Partners”, on their home page.
They are proud to work with them, and the number of AI Corps willing to work with them serves as a social measure of their success.
While the UK AISI doesn’t have a Partners page, it has proudly partnered with ElevenLabs to “explore the implications of AI voice technology”, or Google DeepMind as “an important part of [their] broader collaboration with the UK Government on accelerating safe and beneficial AI progress”.
This “partnership” structure creates obvious problems. Insiders have told me that they can’t say or do anything publicly against AI Corporations, else they would lose their API access.
This is not a relationship of “These guys are building systems that may cause humanity’s extinction and we must stop them.”, and it’s not even one of “There are clear standards that corporations must abide by, or else.”
It is one of “We are their subordinate and depend on access to their APIs. We hope that one day, our work will be useful in helping them not deploy dangerous systems. In the meantime, we de facto help with their PR.”
—
Before moving on to the next point, let’s explain why the staff of third-party Evals Organisations needs to be independent from that of the AI Corporations they wish to regulate.
To be extra-clear, this is not about any single individual being “independent” or not, whatever this may mean. The considerations around independence are structural. Namely, we want to ensure that…
The culture at Evals Orgs is different from that of AI Corporations. Else, they will suffer from the same biases, care about the same failure modes and test for the same things.
The social groups of Evals Orgs do not overlap too much with those of AI Corporations. Else, auditors will need to justify their assessments to look reasonable to their friends working at AI Corporations.
The career prospects at Evals Orgs and AI Corporations do not overlap. Else, criticising AI Corporations may directly hurt the careers of the people working at Evals Orgs.
And suffice it to say, Evals Orgs do not ensure any of the above.
On Apollo’s side, two of its cofounders left for Goodfire (a startup leveraging interpretability for capabilities, which raised $200M in the process). Apollo was also initially funded by Open Philanthropy, which also funded OpenAI. Speaking of which, a couple of Apollo’s staff previously worked at OpenAI, and I know of one who left for Google DeepMind.
On METR’s side, its CEO formerly worked at both DeepMind and OpenAI. The other person listed in its leadership section is ex-OpenAI too. Furthermore, they described their own work on Responsible Scaling Policies “as a first step safety-concerned labs could take themselves, rather than designed for policymakers”!
For the UK AISI, I will quote its About page:
Our Chief Technology Officer Jade Leung is also the Prime Minister’s AI Advisor, and she previously led the Governance team at OpenAI.
Our Chief Scientist Geoffrey Irving and Research Director Chris Summerfield have collectively led teams at OpenAI, Google DeepMind and the University of Oxford.
The same can be found with the (now repurposed) US AISI, whose head of safety worked at OpenAI and was housemates with the CEO of Anthropic.
—
When I describe the situation to outsiders, to people who are not in AI or AI Safety, they are baffled.
This is not only about having a couple of senior staff from the industry. That in itself can be good! It’s the whole picture that looks bad.
Evals Organisations ought to be regulating AI Corps. But instead, they use taxpayers’ money and philanthropic funds to do testing for them for free, with no strings attached, and AI Corps give up virtually nothing in exchange.
They are proud to publicly partner with them, and they depend on them to continue their activities.
Both through revolving doors and the personal relationships of their employees, they are culturally and socially deeply intertwined with AI Corps.
And yet, at the same time, they all tout the importance of independence and neutrality. This is what makes the situation baffling.
Conclusion
I would summarise the situation as:
Evals Orgs use philanthropic and public funds to help AI Corps with their testing, for free, with no strings attached. There are virtually no constraints whatsoever on what AI Corps can do.
The incentives of Evals Orgs are not aligned with the public interest. In practice, Evals Orgs are subordinated to AI Corporations and must maintain good relationships with them in order to keep API access and continue their activities.
Predictably, Evals Orgs have not pushed for an actual ban on the development of systems with dangerous capabilities, nor for the interruption of R&D programs that may lead to human extinction.
Ironically, the theory of change behind Evals is predicated on regulation forbidding AI Corporations from developing and deploying systems with dangerous capabilities.
Despite all of this, Evals are one of the (if not the) most popular projects in AI Safety. They are my canonical example of the too-clever-by-half failures from the AI Safety Community.
—
If you fund or work on Evaluations to help with extinction risks, I would strongly invite you to re-evaluate whether your money and time would be better spent elsewhere.
As an advisor to ControlAI, I would naturally suggest ControlAI as an alternative. If not ControlAI, I would recommend pursuing endeavours similar in spirit to ControlAI’s Direct Institutional Plan: education on ASI, extinction risks, and what policies are necessary to deal with them. This could be done by founding your own organisation to inform lawmakers, or by partnering with MIRI and PauseAI on their like-minded initiatives.
—
Overall, I believe that the AI Safety Community would have been and would still be much better off if the people in the Evals cluster stopped playing 4D chess games with AI Corps and started informing the public (lay people and policy makers alike) about the risks of extinction and the necessity of banning ASI.
People in the AI Safety Community are confused about this topic. I am regularly told that Evals organisations care about extinction risks to humanity. And yet.
A Google search for “extinction” on the UK AISI website returns 0 results. METR’s returns only 2, and Apollo’s a single one.
This is a sharp example of The Spectre: the dynamic wherein the “AI Safety” community keeps coming up with alternatives to straightforward advocacy on extinction risks and a ban of superintelligence.
On this, cheers!