Mangosteen: An Open Thai Corpus for Language Model Pretraining Paper • 2507.14664 • Published Jul 19, 2025
MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans Paper • 2410.00253 • Published Sep 30, 2024
SWEb: A Large Web Dataset for the Scandinavian Languages Paper • 2410.04456 • Published Oct 6, 2024
R-grams: Unsupervised Learning of Semantic Units in Natural Language Paper • 1808.04670 • Published Aug 14, 2018
Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs Paper • 2502.12982 • Published Feb 18, 2025
Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis Paper • 2404.19622 • Published Apr 30, 2024