ml-fw-prerelease

Enterprise

community

Activity Feed

AI & ML interests

None defined yet.

Recent Activity

Zaid authored a paper 19 days ago

MeXtract: Light-Weight Metadata Extraction from Scientific Papers

SivilTaram authored a paper about 2 months ago

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

alibayram authored a paper 2 months ago

Tokens with Meaning: A Hybrid Tokenization Approach for NLP

View all activity

lbourdois

posted an update 15 days ago

Post

1015

New blog post analyzing the top 50 entities with the most downloaded models on @huggingface 🤗!

https://huggingface.co/blog/lbourdois/huggingface-models-stats

The purpose here is to get an idea of the profile of the models with the greatest impact in open source (we are not interested in closed models here!).

32 figures + data

Enjoy 🤗

omarkamali

posted an update 15 days ago

Post

223

Another month, another Wikipedia Monthly release! 🎃

Highlights of October's edition:
· 🗣️ 341 languages
· 📚 64.7M articles (+2.5%)
· 📦 89.4GB of data (+3.3%)

We are now sampling a random subset of each language with a reservoir sampling method to produce splits 1000, 5000, and 10000 in addition to the existing train split that contains all the data.

Now you can load the english (or your favorite language) subset in seconds:
dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="10000")

Happy data engineering! 🧰

omarkamali/wikipedia-monthly

2 replies

BramVanroy

posted an update 18 days ago

Post

231

What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?

1 reply

Zaid

authored a paper 19 days ago

MeXtract: Light-Weight Metadata Extraction from Scientific Papers

Paper • 2510.06889 • Published 20 days ago • 1

omarkamali

posted an update about 1 month ago

Post

1594

**Wikipedia Monthly's September edition is now live 🎉**

Highlights of this edition:
· 🗣️ 341 languages
· 📚 63.1M articles
· 📦 86.5GB of data

This update also solves upload issues in the August edition where some languages had missing parts. Happy data engineering!

omarkamali/wikipedia-monthly

2 replies

SivilTaram

authored a paper about 2 months ago

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Paper • 2509.02479 • Published Sep 2 • 83

alibayram

authored 3 papers 2 months ago

Tokens with Meaning: A Hybrid Tokenization Approach for NLP

Paper • 2508.14292 • Published Aug 19 • 1

Doğal Dil İşlemede Tokenizasyon Standartları ve Ölçümü: Türkçe Üzerinden Büyük Dil Modellerinin Karşılaştırmalı Analizi

Paper • 2508.13058 • Published Aug 18 • 1

Büyük Dil Modelleri için TR-MMLU Benchmarkı: Performans Değerlendirmesi, Zorluklar ve İyileştirme Fırsatları

Paper • 2508.13044 • Published Aug 18 • 1

BramVanroy

posted an update 3 months ago

Post

717

Thanks to popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do filtering manually

- C5f ( BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. The latter because you should probably get those from a more reliable source that provides better parsed content.

It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.

alielfilali01

posted an update 3 months ago

Post

720

Guys WTH is "yofo-*" ???
Most OpenAI staff associated with the openai/gpt-oss-68911959590a1634ba11c7a4 release are affiliated to dozens of yofo orgs ...

i.e