Moroccan Darija Datasets
A collection of all available datasets for pretraining LLMs
Note: A collection of Moroccan Darija texts (155M tokens) that can be used for pretraining Moroccan Darija LMs.
atlasia/TerjamaBench
Note: A culturally aligned translation benchmark for evaluating machine translation for Moroccan Darija.
BounharAbdelaziz/Terjman-v2-English-Morocco-Darija-Dataset-350K
Note: A collection of 350,000 high-quality (English, Moroccan Darija) pairs.
atlasia/DODa-audio-dataset
Note: A collection of 12,743 parallel text and speech samples for Moroccan Darija, with transcriptions in both Latin and Arabic scripts and English translations.
atlasia/Social_Media_Darija_DS
Note: A collection of social media data.
atlasia/moroccan_darija_domain_classifier_dataset
Note: A collection of 190,000 texts, synthetically generated with Gemini-2.0-Flash, covering 26 topics. Can be used to train text classification models.
atlasia/Moroccan-Darija-Wiki-Dataset
Note: A collection of 10,044 parallel text samples of Moroccan Darija sourced from Darija Wikipedia.
atlasia/Moroccan-Darija-Wiki-Audio-Dataset
Note: A collection of 551 parallel text and speech samples of Moroccan Darija sourced from Darija Wikipedia.
atlasia/Morocco-Darija-Sentence-Embedding-Benchmark
Note: A sentence embedding score benchmark for Moroccan Darija.
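
The datasets listed above can likely be pulled from the Hugging Face Hub with the `datasets` library. The following is a minimal sketch, not an official loader: `DARIJA_DATASETS` and `load_darija_dataset` are hypothetical names introduced here, and the sketch assumes each repository is public and loads without a configuration name. Split and column names are not stated in this list, so none are assumed.

```python
# Sketch: load one of the datasets named in this collection.
# Assumes `pip install datasets` and network access to the Hugging Face Hub.

# Repository IDs copied verbatim from the list above. The first entry of the
# collection has no repository ID in the list, so it is not included here.
DARIJA_DATASETS = [
    "atlasia/TerjamaBench",
    "BounharAbdelaziz/Terjman-v2-English-Morocco-Darija-Dataset-350K",
    "atlasia/DODa-audio-dataset",
    "atlasia/Social_Media_Darija_DS",
    "atlasia/moroccan_darija_domain_classifier_dataset",
    "atlasia/Moroccan-Darija-Wiki-Dataset",
    "atlasia/Moroccan-Darija-Wiki-Audio-Dataset",
    "atlasia/Morocco-Darija-Sentence-Embedding-Benchmark",
]


def load_darija_dataset(repo_id):
    """Hypothetical helper: download one listed dataset and return it."""
    if repo_id not in DARIJA_DATASETS:
        raise ValueError(f"{repo_id!r} is not one of the datasets listed here")
    # Imported lazily so the list itself is usable without `datasets` installed.
    from datasets import load_dataset
    # No split or config name is passed, since neither is given in this list;
    # a repository that requires a config name would need it added here.
    return load_dataset(repo_id)


# Example usage (requires network access):
#   ds = load_darija_dataset("atlasia/TerjamaBench")
#   print(ds)  # prints the available splits and column names
```

Printing the returned object is a reasonable first step, since it reveals the splits and columns each dataset actually provides before any training or evaluation code is written against them.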