We are likely in an AI overhang, and this is bad.
How unknown capabilities in AI models may undermine most AI safety frameworks.
By racing to the next generation of models faster than we can understand the current one, AI companies are creating an overhang. This overhang is not visible, and our current safety frameworks do not take it into account.
1) AI models have untapped capabilities
When GPT-3 was released, most of its now-known capabilities were still unknown.
As we play more with models, build better scaffolding, get better at prompting, inspect their internals, and study them, we keep discovering more of what they can do.
This has also been my direct experience studying and researching open-source models at Conjecture.
2) SOTA models have a lot of untapped capabilities
Companies are racing hard.
There's a trade-off between studying existing models and pushing forward. They are doing the latter, and they are doing it hard.
There is far more research into boosting SOTA models than into studying older models like GPT-3 or Llama-2.
By contrast, imagine a world where Deep Openpic decided to start working on the next generation of models only once they were confident they had fully juiced their existing ones. That world would have much less of an overhang.
3) This is bad news.
Many safety agendas, such as red-lines, evals, or RSPs, implicitly assume that we are not in an overhang.
If we are in an overhang, then by the time a red-line is met it may already be much too late, with untapped capabilities already far past it.
4) This is not accounted for.
It is hard to reason about unknowns in a well-calibrated way.
Sadly, I have found that people consistently tend to assume that unknowns do not exist.
This means that directionally, I expect people to underestimate overhangs.
This is in great part why...
I am more conservative on AI development and deployment than the visible evidence alone seems to warrant.
I am sceptical of any policy of the form "We'll keep pursuing AI until it is clear that it is too risky to continue."
I think open-weight releases are particularly pernicious.
Sadly, researching this effect is itself directly relevant to capabilities: many amplification techniques that work on weaker models would likely work on newer models too.
Without researching it directly, we may only start to feel the existence of an overhang after a pause (whether because of a global agreement or a technological slowdown).
Hopefully, by that point, we'd have the collective understanding and infrastructure needed to deal with rollbacks if they were warranted.
On this, cheers!