Our leaderboard can be used for human alignment in an RL setting. Ask the same question to the top-ranked and the worst-ranked models: answers from top models get a +1 score, answers from bad models get -1. Ask each question many times at a higher temperature to generate more varied answers. This way, other LLMs can be trained towards human alignment.
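Here is a minimal sketch of that idea in Python. The model names, the `query_model()` helper, and the sampling counts are all placeholders, not a real API; the point is only to show how the +1/-1 labels and the high-temperature resampling would come together into a reward-labeled dataset.

```python
import random
from dataclasses import dataclass

TOP_MODELS = ["top-model-a", "top-model-b"]        # highest-ranked on the leaderboard (placeholders)
WORST_MODELS = ["worst-model-a", "worst-model-b"]  # lowest-ranked on the leaderboard (placeholders)


@dataclass
class ScoredAnswer:
    prompt: str
    answer: str
    score: int  # +1 if the answer came from a top model, -1 if from a worst model


def query_model(model: str, prompt: str, temperature: float) -> str:
    """Placeholder: call the model's API and return the completion text."""
    raise NotImplementedError


def build_reward_dataset(prompts: list[str],
                         samples_per_model: int = 4,
                         temperature: float = 1.0) -> list[ScoredAnswer]:
    """Ask each prompt many times (higher temperature -> more varied answers)
    and label each answer +1 or -1 depending on the source model's rank."""
    dataset: list[ScoredAnswer] = []
    for prompt in prompts:
        for model in TOP_MODELS:
            for _ in range(samples_per_model):
                answer = query_model(model, prompt, temperature)
                dataset.append(ScoredAnswer(prompt, answer, score=+1))
        for model in WORST_MODELS:
            for _ in range(samples_per_model):
                answer = query_model(model, prompt, temperature)
                dataset.append(ScoredAnswer(prompt, answer, score=-1))
    random.shuffle(dataset)  # avoid ordering bias when training on the labels
    return dataset
```

The resulting dataset could then feed a reward model or a preference-style objective, so that another LLM is nudged toward the answers humans ranked highly on the leaderboard.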
Below, Grok 2 ranks worse than Grok 1 but better than Grok 3. We had already measured this through the API; now we measured the model itself, and the results are similar.
GLM keeps ranking higher with each new version. Nice trend! I hope they continue producing better-aligned models.

