Benchmarked 4 new models. DeepSeek R1's score improved. All of these are below average, so p(doom) probably increased!
Coming soon: Kimi K2. They say it is very good at coding, but my leaderboard is about being beneficial to humans. So we will see!
Full leaderboard: https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08
More info: AHA Leaderboard, a blog post by Emin Temiz on Hugging Face
