This looks super useful, having fresh Wikipedia data every month will make a big difference. Thanks for building and sharing this!
Marc Lammers PRO
MarcusLammers
				AI & ML interests
The future of compute isn't linear, it is intelligent.
		Recent Activity
						
							
							
							
commented on an article · 21 days ago
Introducing Wikipedia Monthly: Fresh, Clean Wikipedia Dumps for NLP & AI Research
						
							
							
replied to omarkamali's post · 21 days ago
						
					
Another month, another Wikipedia Monthly release!
Highlights of October's edition:
· 341 languages
· 64.7M articles (+2.5%)
· 89.4GB of data (+3.3%)
We now sample a random subset of each language with reservoir sampling to produce `1000`, `5000`, and `10000` splits, in addition to the existing `train` split that contains all the data.
Now you can load the English (or your favorite language) subset in seconds:
`dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="10000")`
Happy data engineering!
https://huggingface.co/datasets/omarkamali/wikipedia-monthly
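The post mentions reservoir sampling but doesn't show how it works. A minimal sketch of the classic Algorithm R, which keeps a uniform random sample of k items from a stream without knowing its length in advance — the function name and seed handling here are assumptions for illustration, not the dataset's actual code:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniformly sample k items from an iterable of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1),
            # which keeps every item seen so far equally likely to survive.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: draw a 1000-article sample from a 64.7M-article stream
# without ever holding the full stream in memory.
sample = reservoir_sample(range(100_000), 1000)
```

This single-pass property is what makes it a natural fit for producing fixed-size splits from Wikipedia dumps of very different sizes per language.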
						
						
							
							
							
commented on an article · 21 days ago
The Next Frontier: Large Language Models In Biology