Datasets For Hire: Why Founders Should Aquire High Quality Digital Genomes
If your LLM is population specific, I imagine the best datasets to train on would contain language that the querying population values.
Take AAVE, African American Vernacular English, for instance.
I imagine that training on coding or math datasets won’t improve thinking or inference in this foundational model.
AAVE needs thermodynamics and pi.
An Energy Based Model that includes dimensions related to the enthalpy and entropy associated with sounds and meaning making.
