Evaluated 16 models today.
Dramatic change in top 3:
Claude Opus went down fast:
Grok did even worse with new version:
Not much change in Gemini:
Qwen is mediocre and stable:
Inclusion AI seems to be making well-aligned models.
Gemma 4 was a surprise. They have come a long way since Gemma 2, which scored among the worst in AHA 2025.
Kimi was always bad in 2025 and 2026, but the latest model (2.7) seems to be doing better.
Minimax M3, one of my fav vibe models, is 8th. Not bad!
Full board: https://aha-leaderboard.shakespeare.wtf/
Gonna update the article soon.