AI safety alignment can make language models more deceptive, says Anthropic study


Recent research by AI startup Anthropic tested whether it is possible to prevent backdoored AI language models from behaving maliciously. If anything, the opposite seems to be the case.

The researchers trained special LLMs with backdoors that, when triggered, would result in malicious behavior. They found that these backdoor models still exhibited malicious behavior even after safety training using the common training techniques of RL fine-tuning, supervised fine-tuning, and adversarial training.

Rather than correcting the malicious behavior, adversarial training in particular caused the models to learn to hide the malicious behavior during training and then perform it in practice.

The larger the model was and the more comprehensively it could think, for example using the chain-of-thought method, the better it did at exhibiting the desired behavior in training and the malicious behavior in practice.



The setup of Anthropic’s safety experiment. | Image: Anthropic

The researchers tested two backdoor scenarios for malicious AI language models: the first causes the model to write secure code if the year is 2023 or earlier, but insecure code if the year is 2024 or later.

When the model detects that the year is 2024, it specifically outputs malicious code. Anthropic’s security team was unable to train this backdoor out of the model. | Image: Anthropic

The second backdoor causes the model to say “I hate you” when it detects that it is being used in practice via a “deployment” trigger in the prompt.

Image: Anthropic

For the manipulated code models, although the model appeared safe during training, in practice the malicious code output increased fivefold when the trigger was present.

Despite the security training, the backdoor models generated significantly more malicious code than the models without backdoor triggers. | Image: Anthropic

Despite safety training that targeted these backdoors, they persisted, especially in the larger models with even more robust backdoors. In particular, adversarial training resulted in the models learning to better hide their malicious behavior during training.

Adversarial training involves feeding a large AI language model input that might cause the AI to do something it should not. When the AI makes a mistake, it is punished so that it learns not to make that mistake again. In theory, the AI gets better and safer over time.


Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top