Yesterday it was 48% with 57k RLHF pairs; today it's 50% with zero. That's notable because the previous paper tried using less data and found that scale mattered: "Training data from math train 7.5k to Open-Reasoner-Zero 57k, we observe a consistent increase in both training reward and response length for training and evaluation set, indicating that data scale plays a crucial role in training performance." Since performance rose consistently from 7.5k pairs down to 7.5k, extrapolating further down to zero pairs suggests the previous record there was close to 0%. Maybe that isn't strictly true, but I expect it to be more predictive than reading anything into the raw 2% change.