> because they're not like alphazero, they just spit words
This seems to be changing right now: here's a recent paper on what is, roughly speaking, AlphaZero-like research that specifically uses coding problems. The model learns from its own experience rather than from traditional datasets.


arXiv.org
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learnin...