Submitted by Huu Nguyen 8 MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources Ontocord.AI 7 3