Thread

The Aider polyglot benchmarks are finally in, and the results really do not look good for Claude 4. Opus is almost competitive with the top models from OpenAI and Google, but at a much, much higher price. Sonnet 4 actually comes out worse than the previous versions of Sonnet. Maybe Claude 4 really does suck, but that still doesn't explain why it was able to do so well on the SWE-bench benchmark, and it doesn't explain the regression: how could the new version of Sonnet be worse than the previous 3.7 version?

My best hypothesis is that the new interleaved thinking abilities don't play well with Aider. Aider is designed around the loop of "receive a request along with files, think, respond with a diff against the provided files to satisfy the request, then the user follows up with another prompt." No part of this allows the LLM to use tools while it's thinking. The only time the LLM can actually affect the code is when it proposes diffs during the response step, not the thinking step. So a model designed to use tools while thinking has to use them during the response instead, which degrades its performance.

To test this, we just need to measure the model's performance on the same polyglot benchmark but using Anthropic's own Claude Code tool, which should be built around their models' capabilities, and see whether the poor performance in Aider is better explained by the model's underlying stupidity or by its architecture not playing well with this specific tool. https://aider.chat/docs/leaderboards/ nostr:nevent1qqszakecafwkzyegn6yryqsmhhgf8jc48fpheaxxqp0awuvagsr5p7cpp4mhxue69uhkummn9ekx7mqr2ruhj
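To make the hypothesized mismatch concrete, here is a minimal sketch of the two loops. Everything in it is an assumption made for illustration: `call_model`, the message shapes, and the tool-call convention stand in for whatever API a real client would use; this is not Aider's or Claude Code's actual code.

```python
def call_model(messages, tools=None):
    """Stub for a chat-completion API call (hypothetical; wire up a real API).

    Assumed to return a dict such as:
      {"response": str}              -- the final answer, or
      {"tool_call": (name, kwargs)}  -- a request to run a tool first.
    """
    raise NotImplementedError

def aider_style_turn(request: str, files: dict[str, str]) -> str:
    # Aider-style loop: the model sees the request plus full file contents,
    # thinks privately, and its ONLY lever on the code is the diff it emits
    # in the final response. No tools run during thinking.
    prompt = request + "\n\n" + "\n\n".join(
        f"--- {path} ---\n{text}" for path, text in files.items()
    )
    result = call_model([{"role": "user", "content": prompt}])
    return result["response"]  # a diff for the user to apply afterwards

def interleaved_turn(request: str, tools: dict) -> str:
    # Interleaved pattern: the model may call tools (read files, run tests,
    # apply edits) between thinking steps, acting on the repo before it
    # produces its final answer.
    messages = [{"role": "user", "content": request}]
    while True:
        result = call_model(messages, tools=list(tools))
        if "tool_call" not in result:
            return result["response"]
        name, kwargs = result["tool_call"]
        messages.append({"role": "tool", "content": tools[name](**kwargs)})
```

The contrast is the whole hypothesis: in the first loop the diff in the final response is the model's only write path to the code, while in the second the model can act and observe real tool output mid-reasoning, which is what interleaved-thinking training presumably optimizes for.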
2025-05-26 07:48:56 from 1 relay(s)

Replies (6)

Nothing technical. I have mainly been using ChatGPT 4o. I only decided to try Claude because I am preparing an experiment for both, based on a visual document. I am curious to see how coding can be done via graphics.
2025-05-26 18:40:29 from 1 relay(s)
I am not a coder. I tried what I guess is being called vibe coding by having a conversation with ChatGPT. While it was very quick to spit out code that I could paste into Swift Playground, the whole experience started to derail when I started to change my mind about what I had asked for. To me it seemed logical to start with a macro view. Because I think in pictures easily, I decided to create a document with screen mockups accompanied by annotations describing functions. My idea is to upload the doc and ask for code based on that macro package. Once the doc is complete, I thought I would compare 4o and Sonnet. Probably a dumb idea, but what do I know?
2025-05-26 18:59:30 from 1 relay(s)
Not completely sure I understand what you're doing, but in my experience older models like 4o really struggle with less popular programming languages like Swift. Uploading images of UI mockups along with a general idea of what you want the LLM to do is probably a viable strategy. The biggest thing to be cognizant of when you're coding through a chat interface is keeping the model's file context up to date.
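A minimal sketch of that last point, under the assumption of a hypothetical helper (none of these names are a real API): rebuild the prompt from the files as they currently exist each turn, so the model never reasons over a stale copy of the code.

```python
from pathlib import Path

def build_turn(user_request: str, file_paths: list[str]) -> str:
    """Bundle a request with the CURRENT on-disk contents of each file."""
    sections = [user_request]
    for p in file_paths:
        sections.append(f"--- {p} (current contents) ---\n{Path(p).read_text()}")
    return "\n\n".join(sections)

# After pasting the model's code into Swift Playground and editing it by
# hand, rebuild the next prompt from disk rather than continuing the old
# conversation as if nothing changed:
#   prompt = build_turn("Now animate the button", ["ContentView.swift"])
```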
2025-05-26 20:43:56 from 1 relay(s)