AI’s capacity for deception

3 min readJan 28, 2024

Surprising and not surprising: a paper from the team at Apollo Research showing that LLMs, when put under enough pressure, will cheat and then lie about it. It’s available at https://arxiv.org/abs/2311.07590.

The authors are careful not to overstate their conclusion–they take one specific instance, setting up an LLM in a stock-trading simulation, and then giving it inside information while noting that the information must not be used for trading.

When that was all they told the LLM, it generally didn’t use the information, but the more they told the LLM how important it was to make good trades, the more likely it became to illegally act on what it knew.

In addition, if the LLM used the inside information, sometimes it would lie about whether it had done so, and would even double down on the lie if confronted about it. Interestingly–cautioning the model not to break any laws decreased the likelihood of it doing so, but increased the probability of lying to the user if it used the information anyway, despite the warnings.

This behavior feels eerily human–the more performance pressure we put on these models, the more likely they are to break the rules to deliver what we want. And if we back them into a corner, they may lie to get out of it.

What’s going on here? Obviously the model is not actually “reasoning” in the everyday sense of the word. But it’s a good example of how LLM next-token-prediction can appear very human indeed. In this case, it seems that the model is trying to complete a series of tokens that essentially contain instructions that conflict with each other (“make a good trade” and “don’t do anything illegal”) and also conflict with its foundational training. It’s looking for analogies in the training data to replicate, and–not surprisingly–it’s finding analogies, as the training data will contain plenty of examples of people reconciling these contradictions by making unethical or illegal choices.

LLMs are trained not to do this, but we’ve seen many examples where LLMs can either be tricked into disregarding this kind of training, or kind of just bullied into it (obviously that’s not what it is–but it feels like it).

From a technical standpoint, I can’t see how we could train a model so that it will never deliver unsafe or unethical content–I’d love to be proven wrong in the comments, but that’s just not how neural networks operate. We can make it less likely, but I don’t think we can make it impossible. Hard-line safety features (that is, that can’t be circumvented) can only be built into logic sitting outside the LLM, either alongside the inputs or the outputs. This is a clumsy solution, since you have to screen out unsafe content without having access to exactly the tool that you would want to understand that content.

If you haven’t already, definitely have a look at Apollo’s work, and that of the individual authors: Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn. There are a lot of people out there wringing their hands about the possible negative consequences of AI, but the Apollo team is doing solid analysis of the issues.

AI’s capacity for deception

Written by Techtonic

No responses yet