LLMs can have malicious “sleepers”

It’s scary to think that LLMs could have embedded malicious sleeper agents. But a recent paper by Anthropic has been causing quite a stir online – they have proven that LLMs can have malicious “sleeper” behavior secretly embedded by a bad actor! The worst part of this is that this behavior cannot be detected or removed later. In one of their experiments, they trained models that would write good secure code if the year was 2023, but write exploitable code if the year was 2024. Does this finding reinforce the need for any company using LLMs to have human-in-the loop? Link

This entry was posted in LLM. Bookmark the permalink.

Leave a comment