Another month, another Wikipedia Monthly release! 🎃
Highlights of October's edition:
· 🗣️ 341 languages
· 📚 64.7M articles (+2.5%)
· 📦 89.4GB of data (+3.3%)
We now sample a random subset of each language with reservoir sampling to produce the splits 1000, 5000, and 10000, in addition to the existing train split, which contains all the data.
Now you can load the English (or your favorite language) subset in seconds:
from datasets import load_dataset
dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="10000")
Happy data engineering! 🧰
omarkamali/wikipedia-monthly
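P.S. For anyone curious about reservoir sampling, here is a minimal sketch of the classic version (Algorithm R), which draws a uniform random sample of k items from a stream in a single pass without knowing its length up front. This is an illustration only and not necessarily how the dataset pipeline implements it; the function name `reservoir_sample`, the `seed` parameter, and the fake article IDs are all assumptions made for the example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream (Algorithm R).

    Hypothetical helper for illustration; not taken from the dataset's pipeline.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 1000 "articles" from a stream of 100,000 made-up IDs.
sample = reservoir_sample((f"article-{n}" for n in range(100_000)), k=1000)
print(len(sample))
```

The appeal for a dump of this size is that memory use stays proportional to k, no matter how many articles a language has.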