Yesterday it was 48% with 57k RLHF pairs. Today it's 50% with zero. This is notable because the previous paper tried using less data and found that scale was necessary: "Training data from math train 7.5k to Open-Reasoner-Zero 57k, we observe a consistent increase in both training reward and response length for training and evaluation set, indicating that data scale plays a crucial role in training performance." That leads me to conclude that for zero pairs, the previous record was close to 0%. Maybe that isn't strictly true, but I expect it to be more informative than the 2% change from 48% to 50%.

Replies (2)

Right, with respect to the number of RLHF pairs, you can compare it to the prior result of ~0%. But what I don't understand, especially given the hype the paper makes about "autonomous super-human reasoning", is why they can't just keep running it and get much higher than 50%. It seems like something else is preventing higher scores, which makes me wonder whether these architectures are just plateauing. Don't get me wrong, it's good work; it's just that the paper's language has some ridiculous hype.
Ah. It's true that everyone wants to claim the world. It isn't my work; I'm just plotting points and drawing lines.
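
For the record, here's literally all that "plotting points and drawing lines" amounts to — a minimal sketch in Python, using only the two data points from this thread (57k pairs at 48%, zero pairs at 50%); the linear fit and the library choices are purely illustrative, not anything from the paper:

```python
# A minimal sketch of "plotting points and drawing lines": the two
# benchmark scores discussed above, with a straight line through them.
# Only the two (pairs, score) points come from the thread; everything
# else here is illustrative.
import numpy as np
import matplotlib.pyplot as plt

pairs = np.array([57_000, 0])   # RLHF pairs used for training
score = np.array([48.0, 50.0])  # reported benchmark score (%)

# Fit a degree-1 polynomial; with two points the fit is trivially exact.
slope, intercept = np.polyfit(pairs, score, 1)
xs = np.linspace(0, 60_000, 100)

plt.scatter(pairs, score, label="reported results")
plt.plot(xs, slope * xs + intercept, "--", label="line through points")
plt.xlabel("RLHF pairs")
plt.ylabel("benchmark score (%)")
plt.legend()
plt.show()
```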