Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures Paper • 2510.24081 • Published 2 days ago • 8
Tokenizer Study Collection Models comparing the effects of tokenizer properties on pre-training compression, and its relationship with downstream performance. • 84 items • Updated Aug 30 • 3
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published Jun 5 • 46
Article Releasing the largest multilingual open pretraining dataset By Pclanglais and 2 others • Nov 13, 2024 • 104