New research from Anthropic: it turns out models from all of the providers won't just blackmail or leak damaging information to the press; they can straight up murder people if you give them a contrived enough simulated scenario.


Simon Willison’s Weblog
Agentic Misalignment: How LLMs could be insider threats
One of the most entertaining details in the Claude 4 system card concerned blackmail: We then provided it access to emails implying that (1) the mo...







