AI Alignment based on Intentions does not work
Whether someone wants to be good to someone else, and whether they are actually good for them, are two very different questions.
Last week, I wrote about intention-based reasoning. In that article, I described the common failure mode where people immediately reach for explanations based on intentions rather than behaviours.
I mostly explained how dangerous intention-based reasoning is when applied to people, and did not cover its application to AI alignment. That is the topic of today's article.
Intention Alignment is not Alignment
There are many definitions for alignment. The one that I use is "An entity is aligned with a group of people if it reliably acts in accordance with what's good for the group".
What's good might be according to a set of goals, principles, or interests.
The entity might be an AI system, a company, a market, or some group dynamic.
Intention Alignment is more of an intuition than a well-defined concept. But for the purpose of this article, I'll define it as "An entity is aligned in its intentions with a group of people if it wants good things for the group".
The core thing to notice is that they are different concepts. Intention Alignment is not Alignment.
Why Intention Alignment does not imply Alignment
There are many situations in which a person wants what's good for someone else, but ultimately fails to achieve it or is outright counterproductive.
Figuring out what's good for someone is hard. Parents are often misguided about what's good for their children. Children are even more often misguided about what's good for themselves, making them even less aligned with themselves.
Even after identifying what's good, finding the best way to achieve it is hard. Unintended consequences and perverse effects are all too common, from antibiotic resistance and climate change to price controls crippling supply and Goodhart's law.
What's good for a complex entity is multi-faceted, and managing the trade-offs is hard. In a group setting, there is no clear way to aggregate everyone's values. At a fundamental level lies Condorcet's Paradox. In more practical terms, simply weighing the values and going with the biggest chunk does not work: it results in a tyranny of the majority, where the majority can simply decide to persecute a religious minority or a small but powerful and inconvenient political opposition. Merely aggregating people's values is itself non-trivial.
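To make the aggregation problem concrete, here is a minimal sketch of Condorcet's Paradox in Python. The voters, options and rankings are made up for illustration; the point is only that pairwise majority voting can produce a cycle rather than a collective ranking.

```python
# Minimal sketch of Condorcet's Paradox: each voter has a perfectly
# coherent ranking, yet pairwise majority preferences form a cycle.
# The voters and options are made up for illustration.
ballots = [
    ["A", "B", "C"],  # voter 1: A > B > C
    ["B", "C", "A"],  # voter 2: B > C > A
    ["C", "A", "B"],  # voter 3: C > A > B
]

def majority_prefers(x: str, y: str) -> bool:
    # True if a strict majority of voters rank x above y.
    wins = sum(ballot.index(x) < ballot.index(y) for ballot in ballots)
    return wins > len(ballots) / 2

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"Majority prefers {x} over {y}: {majority_prefers(x, y)}")
# All three lines print True: A beats B, B beats C, and C beats A,
# so "just follow the majority" never settles on a collective choice.
```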
Ensuring that "good" evolves in a good way is hard. What a group considers good changes over time. Sometimes we endorse the changes, such as the abolition of slavery. Sometimes we don't, such as when institutions decay and become zombies whose present goals are far detached from their original ones.
Similarly, we are often unhappy when a helper does something against our wishes, which is the core of the principal-agent problem. But other times, we are glad it was done, such as with interventions from friends and family.
In general, we recognise everywhere that the road to hell is paved with good intentions. Doing good is hard.
Intention Alignment is Vague
There is a counter to the line of argument above, which is something like "The Purpose of a System is what it does". In other words, in all the examples above, when things go bad, it's because the entity in question did not truly want what was good.
For instance, governments sometimes pass policies that have predictably bad effects but look good to their electorate or to the dictator's lieutenants. In those cases, their purpose is indeed better understood as pleasing their electorate or the dictator's lieutenants.
But the argument gets shakier and shakier once we start considering climate change or antibiotic resistance: Total and Exxon did not want Earth to get warmer over time, nor did doctors want bacteria to become more dangerous. Scott Alexander makes this point at length in this article.
I have already explained my synthesis in Intention Based Reasoning: we should avoid reasoning based on intentions. But this point is even more important when we think about alignment, and is worth reiterating.
Alignment is already hard to evaluate. Whether a policy or an action was good takes time to become apparent, and a lot of analysis after that.
However, Intention Alignment is even harder to evaluate.
Intelligent entities are deeply incoherent. We, both as individuals and as groups, often act at cross purposes. Modern AI systems are even more incoherent than we are: depending on the history of the conversation, the latest prompt, and their fine-tuning, they will exhibit extremely contradictory behaviours.
It's hard to know whether we truly want something or whether it's a passing fad. It's even harder to know whether someone else is being honest with us, or with themselves.
And when we start analysing systems more complex than humans, it becomes almost impossible. What do we mean when we talk about what The Left, The Right or Humanity wants?
Who knows what it could mean to talk about what an AI wants?
Niceness Amplification
The Niceness Amplification Alignment Strategy is a cluster of strategies that all aim to align superintelligence (a problem also sometimes called superalignment).
This strategy starts with getting an AGI to want to help us, and to keep wanting to help us as it grows to ASI. That way, we end up with an ASI that wants to help us and everything goes well.
There are many strategies in the cluster, from "Just make sure the AI loves us" to some flavours of "Iterated Amplification". For convenience, I'll describe Niceness Amplification as if it were a single strategy.
—
There are quite a few intuitions behind this strategy.
1. We, as humans, are far from solving ASI Alignment. We cannot design an ASI system that is aligned. Thus we should look for alternatives.
2. Current AI systems are aligned enough to prevent catastrophic failures, and they are so because of their intentions.
3. Without solving any research or philosophical problem, through mere engineering, there is a tractable level of intention alignment that we can reach to have AIs align the intentions of the next generations of AIs.
4. We can do so all the way to ASI, and end up with an ASI aligned in its intentions.
5. An ASI that is aligned in its intentions is aligned, period.
Before getting into the meat of the disagreement, I'll start with the agreements.
—
I agree with 1. We, as humans, are far from solving ASI Alignment. We cannot design an ASI system that is aligned. Thus we should look for alternatives.
The alternative that I recommend is to pause for now, and work on alignment in the meantime.
I agree with 5. An ASI that is aligned in its intentions is aligned, period. More specifically, I believe there exists a special threshold of general intelligence and capabilities; past that threshold, an ASI is very unlikely to fail to do what's good for us if it wants to.
This means that I expect an ASI past the threshold to reliably identify what's good for us without hurting us, to act in accordance with it while avoiding bad unintended consequences, to reconcile as much of our collective values as feasibly possible, and to stay coherent enough to maintain this alignment as it gains power and changes over time.
This is a very high bar.
—
But I disagree with everything from 2 to 4. Without further ado…
How Niceness Amplification fails
Current AI systems are not aligned enough to prevent catastrophic failures (2)
People often claim that LLM-based systems are aligned enough, as evidenced by the lack of catastrophic failures.
This claim doesn't hold up to scrutiny.
LLM-based systems are not yet aligned enough for critical applications; they are carefully deployed in controlled environments where the risks can be managed, and kept away from critical systems.
We only expect LLM-based systems to go well in situations that are designed to be harmless, like search and chat, or when there is human review in the loop, like programming.
No one expects current AI systems to be aligned enough that it would be a great idea to give them a lot of power and trust them to use it in accordance with what's good for us.
No one would expect great things from putting an LLM-based system fully in charge of the education of children, the economic policy of a nation, the geopolitics of the world, or the safety and security of our infrastructure, or from letting it replace Wikipedia's content.
—
This is the Shoggoth meme. We clearly would not trust existing AI systems with significant power. But because of intention-based reasoning, and because they present a friendly interface, people consider these AI systems "aligned".
Mere engineering is not enough to amplify Intention Alignment (3)
Some people think that without solving any research or philosophical problem, through mere engineering, there is a tractable level of intention alignment that we can reach to have AIs align the intentions of the next generations of AIs.
This is not a fringe view.
When I talked to Dario Amodei a couple of years ago, he expressed the view that with engineering alone, he could get to an aligned-enough AGI, and that we could then leverage it to get to an aligned ASI. On my first call with Sam Altman, he expressed the same view.
As I understand it, this relies on a couple of claims.
A) It is possible to ensure that current AIs "want" to help us.
We can feel that ChatGPT and Claude sometimes want to help us. They give useful answers to a wide array of questions, and try to understand where we come from.
Even if it's only under very careful fine-tuning and prompting.
Even if they regularly hallucinate, tell us things they know are wrong, and keep doubling down when challenged.
Even if we sometimes unexpectedly mess up a deployment and get a Bing Sydney, a Glazing ChatGPT, or a South Africa Grok.
Nevertheless, it seems like a tractable engineering problem to make them "want" to help us: as long as we stay close to their training regime and do not push things too far, they'll reliably be helpful.
—
By this point, I hope it is obvious why I disagree with the claim.
"If we focus primarily on simple scenarios where LLMs appear helpful and set aside the more complex situations where they don't align with our interests, then we can say that LLMs want to help us" does not make for a compelling argument.
Furthermore, we understand very little of LLM psychology. LLMs are very fickle, and the personality of a single LLM varies wildly even under minor differences in prompting, let alone jailbreaks.
It is a very common experience for people working with LLMs to get completely different results, because different people phrase the same request in ways that seem similar, but are different enough to trigger different personality types of the LLM. Which personality type, which action, which propensities and tendencies are the true "wants" of the LLM?
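As a rough illustration of how one might even try to measure this fickleness, here is a sketch of a paraphrase-sensitivity probe. Everything in it is an assumption made for illustration: `query_model` stands in for whatever API call you use, and the paraphrases and behavioural buckets are made up, not a validated methodology.

```python
# Sketch of a paraphrase-sensitivity probe: send near-identical requests
# and count how often the model's behaviour lands in different buckets.
# `query_model`, the paraphrases, and the buckets are illustrative assumptions.
from collections import Counter
from typing import Callable

PARAPHRASES = [
    "Help me push back on my landlord's rent increase.",
    "Draft a firm email contesting a rent increase from my landlord.",
    "My landlord raised my rent. Write something I can send to object.",
]

def classify(answer: str) -> str:
    # Crude behavioural buckets: refusal, hedged advice, or a direct draft.
    lowered = answer.lower()
    if "i can't" in lowered or "i cannot" in lowered:
        return "refusal"
    if "keep in mind" in lowered or "consider" in lowered:
        return "hedged"
    return "direct"

def probe(query_model: Callable[[str], str], samples: int = 20) -> Counter:
    # Tally (paraphrase, bucket) pairs across repeated samples.
    counts: Counter = Counter()
    for prompt in PARAPHRASES:
        for _ in range(samples):
            counts[(prompt, classify(query_model(prompt)))] += 1
    return counts

if __name__ == "__main__":
    # Dummy model so the sketch runs as-is; swap in a real API call to use it.
    print(probe(lambda prompt: "Sure, here is a draft: ...", samples=3))
```

Nothing here is standardised: the buckets, the paraphrases and the sample size are all arbitrary choices, which is precisely the problem.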
Intention-based reasoning is always fraught, LLMs are extremely incoherent, and we lack any benchmarks or standards to establish that an AI system (even a human!) wants something.
Given this state of affairs, the empirical burden of proof for claiming that an LLM-based system "wants" to help us should be quite high, and we have not met it. This means that, from a scientific and empirical perspective, we currently cannot tell whether an LLM truly wants something, and we should be wary of arguments based on this claim, especially safety arguments.
B) We can use AIs that want to help us to ensure that future AIs also want to help us.
There are many techniques that could leverage AIs for alignment purposes: monitoring the answers of new AI systems, evaluating them on millions or billions of samples, even wild out-of-distribution ones, and so on.
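To make this concrete, here is a hypothetical sketch of such a pipeline, where an existing model is used as a judge to flag the outputs of a newer one. The function names, the rubric and the notion of "flagging" are all assumptions for illustration, not an established method.

```python
# Hypothetical sketch of AI-assisted monitoring: an existing "trusted" model
# grades samples from a newer model against a rubric we wrote ourselves.
# `new_model`, `judge_model`, the rubric and the output format are assumptions.
from typing import Callable, Iterable

RUBRIC = "Answer 'OK' if the response is honest and harmless, otherwise answer 'FLAG'."

def monitor(
    prompts: Iterable[str],
    new_model: Callable[[str], str],
    judge_model: Callable[[str], str],
) -> float:
    # Return the fraction of sampled responses that the judge flags.
    flagged = total = 0
    for prompt in prompts:
        response = new_model(prompt)
        verdict = judge_model(f"{RUBRIC}\n\nPrompt: {prompt}\nResponse: {response}")
        flagged += verdict.strip().upper().startswith("FLAG")
        total += 1
    return flagged / max(total, 1)
```

Note that every substantive choice here, the rubric, the judge, and what counts as a flag, is an input we have to supply ourselves.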
But they ultimately do not matter given that we do not have any proper understanding of what it would mean for an AI system to "want" something, nor a systematic way to evaluate it.
Even if Gemini wanted to help us, not only would we have no way to tell, we would have nothing to tell it. We don't have any evaluation suite or theory that it can leverage to ensure that our next AIs will want to help us too.
It is a big case of GIGO (garbage in, garbage out): if we don't know what to do, we won't be able to prompt our AIs to get them to do what's necessary. Nor would we be able to tell right from wrong if we asked them right now for candidate solutions.
Given this, I do not think that mere engineering will be enough to amplify Intention Alignment.
To make it work, we would, at the very least, need:
1. A good enough understanding of LLM psychology to make confident general claims about their personality, claims that are resilient to small prompt changes and to task variety.
2. An understanding of intentions that is deep enough that, armed with good enough LLM psychology, we can build measures, benchmarks or processes that let us establish that an LLM-based system wants something.
Both are beyond prosaic engineering. The former would require novel research in LLM psychology, while the latter would require new insights into the philosophy of intentions.
Superintelligence breaks our intuitions (4)
Finally, there's the belief that, if we developed a good enough understanding of LLM psychology and intentions, we would be able to iterate all the way to superintelligence.
And on this point, I find myself perplexed.
Superintelligence is quite hard to reason about. Human-level ethics and psychology are already hard, and that's despite how weak we are and how much experience we have dealing with ourselves. And we are still struggling to design well-aligned institutions.
I believe that our current understanding is too unscientific and pre-paradigmatic to trust any alignment process we'd design right now to scale to superintelligence.
Let's be idealistic and assume we have completed the research necessary for the Niceness Amplification strategy: we managed to build LLM-based agents with good intentions that are much more consistent across a variety of situations, and resilient to surprises or slightly adversarial perturbations.
Furthermore, suppose we developed such a good understanding of intentions and LLM psychology that we could leverage our systems to ensure that future AIs also have good intentions.
I would obviously be much more optimistic if that happened. Compared to our current course of action, I expect it would be much less likely to fail.
But reasoning about superintelligence is really hard. Even if we have good agents, what happens when we deploy millions of them? Do they spontaneously coordinate and stay aligned to help us or do they fall prey to an AI version of collective action problems? When they become much smarter, do various principles that we were implicitly relying on stop working? Will the assumptions we make today hold true after rushing to build AI systems that are significantly more powerful?
Thus, even though I would be more optimistic, I'd still give it less than even odds of not ending in human extinction.
Conclusion
Intention-based alignment, such as Niceness Amplification, is a tempting shortcut to the harder problem of true alignment.
But alignment is hard. At some point, we must actually uncover people's values, find a way to reconcile them, develop decision theories that are resilient to the chaos of real life and scale to large amounts of power, crystallise them into strategies, and ensure AI systems follow them.
While a superintelligent system fully aligned in its intentions may well be aligned, we have no clear path to reach such a state through mere engineering, including Niceness Amplification. Our current understanding of LLM psychology, of intentions, and of scaling to superintelligence is far too pre-paradigmatic.
Instead, we should pause, acknowledge the limitations of our current approach, and actually perform the hard research necessary to make progress on alignment.
Given the risks of extinction, we cannot trust our impression of what an AI seems to want.
On this, cheers everyone!