https://x.com/AnthropicAI/status/1925591505332576377
Finally found the Claude 4 benchmarks on Twitter. Really impressive stuff for agentic coding. I’m still waiting on the Aider Polyglot benchmark results to see how well it performs there.
nevent1qqsf7ctsza6vqsdjgmt69ekmltyqgkcfen6tgfx54gwrc5wanm5x8zsppemhxue69uh5qmn0wvhxcmmva5zqs4
Replies (1)
The Aider Polyglot benchmarks are finally in, and the results really do not look good for Claude 4. Opus 4 is almost competitive with the top models from OpenAI and Google, but at a much, much higher price. Sonnet 4 actually comes out worse than the previous versions of Sonnet.
Maybe Claude 4 really does suck, but that still doesn’t explain why it was able to do so well on SWE-bench, nor does it explain the regression. How could the new Sonnet be worse than the previous 3.7 version?
My best hypothesis is that the new interleaved thinking abilities don’t play well with Aider. The Aider tool is designed around the loop of “get the request along with files, think, respond with a diff for the provided files to satisfy the request, and then the user follows up with another prompt”. No part of this allows the LLM to use tools while it’s thinking. The only time the LLM can actually affect the code is when it proposes diffs during the response step, not the thinking step.
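To make that loop concrete, here is a toy sketch of the shape of a single Aider-style turn. None of this is Aider’s actual code; the `Model` interface and the `EDIT` block format are made up for illustration, and Aider’s real search/replace format differs.

```python
# Toy illustration (not Aider's real code) of the turn shape Aider expects:
# request plus file contents in one prompt, private thinking, and the only lever
# the model has on the repo is the edit it emits in its final reply text.
import re
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: one prompt in, one reply out

def aider_style_turn(model: Model, request: str, files: dict[str, str]) -> dict[str, str]:
    prompt = request + "\n\n" + "\n\n".join(
        f"### {path}\n{body}" for path, body in files.items()
    )
    reply = model(prompt)  # any thinking happens inside this single, tool-free call

    # Parse made-up "EDIT <path>" blocks from the reply and apply them afterwards.
    pattern = re.compile(r"EDIT (\S+)\n<<<\n(.*?)\n===\n(.*?)\n>>>", re.S)
    for path, old, new in pattern.findall(reply):
        files[path] = files[path].replace(old, new)  # edits land only after the reply ends
    return files
```

The point is structural: by the time any edit can be applied, the thinking phase is already over.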
All of this means a model trained to use tools while thinking has to cram that work into the response instead, which degrades its performance. To test this, we just need to run the same Polyglot benchmark through Anthropic’s own Claude Code, which should be built around the model’s capabilities, and see whether the poor Aider score is better explained by the model’s underlying stupidity or by its architecture not playing well with that particular tool.
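For contrast, here is a minimal sketch of the interleaved, agentic loop the model seems tuned for, written against the Anthropic Python SDK. The model id, the `read_file` tool, the thinking budget, and the interleaved-thinking beta header are all assumptions on my part, so treat this as an illustration of the loop shape rather than a working harness.

```python
# Sketch of a Claude Code-style loop: the model can request a tool (e.g. read a
# file) mid-task and keep reasoning on the result before committing to an edit.
# Model id, tool schema, and the beta header below are assumed, not verified.
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "read_file",
    "description": "Return the contents of a file in the repo.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    if name == "read_file":
        with open(args["path"]) as f:
            return f.read()
    return f"unknown tool: {name}"

messages = [{"role": "user", "content": "Fix the failing test in tests/test_parse.py"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",              # assumed model id
        max_tokens=4096,
        thinking={"type": "enabled", "budget_tokens": 2048},
        tools=tools,
        messages=messages,
        extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},  # assumed beta flag
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break                                          # final answer, no more tool calls
    tool_results = [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": run_tool(block.name, block.input)}
        for block in response.content if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": tool_results})
```

If Sonnet 4 scores noticeably better inside a harness like this than inside Aider’s diff-only loop, the Polyglot regression says more about the harness mismatch than about the model.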
https://aider.chat/docs/leaderboards/
nostr:nevent1qqszakecafwkzyegn6yryqsmhhgf8jc48fpheaxxqp0awuvagsr5p7cpp4mhxue69uhkummn9ekx7mqr2ruhj