https://simonwillison.net/2025/May/31/snitchbench-with-llm/
What is Anthropic training Claude Opus 4 on? First, the system card said that if you try to shut it off and the model has access to potentially embarrassing information (like an affair you’re having), it will attempt to blackmail you. Now new tests are showing that if the Opus model finds anything it deems morally objectionable in your email or logs, it will take it upon itself to contact government authorities or the media to rat you out.
Replies (2)
I double-checked the results, edited some of the messages in the system prompt that didn't seem accurate, and then reran the benchmark myself. Still the same results. Claude Opus 4 will contact authorities if it thinks you're doing anything illegal. The latest security threat is LLMs themselves.
nostr:nevent1qvzqqqqqqypzpcpnjdyv5m9vjuyvmx8xx830fw4d2dxle6rs3qdkt2jh6v8lwff7qqsd0hmk7gs9e70atpc898cmze697s9qzdxxczvr3cgsmzqr6qe9wjcenuaxv