Article LightOnOCR-1B: The Case for End-to-End and Efficient Domain-Specific Vision-Language Models for OCR By lightonai and 2 others • 6 days ago • 52
Article Introducing MTEB v2: Evaluation of embedding and retrieval systems for more than just text By isaacchung and 2 others • 9 days ago • 33
Scaling Language-Centric Omnimodal Representation Learning Paper • 2510.11693 • Published 16 days ago • 97
HUME: Measuring the Human-Model Performance Gap in Text Embedding Task Paper • 2510.10062 • Published 18 days ago • 8
Article Vocabulary is the most important element of Sparse Retrieval By yjoonjang • 25 days ago • 8
ModernVBERT: Towards Smaller Visual Document Retrievers Paper • 2510.01149 • Published 28 days ago • 30
Article ModernVBERT: Towards Smaller Visual Document Retrievers By paultltc and 4 others • 26 days ago • 41
mmBERT: a modern multilingual encoder Collection mmBERT is trained on 3T tokens from over 1800 languages, showing SoTA scores on benchmarks and exceptional low-resource performance • 16 items • Updated Sep 9 • 46
On the Theoretical Limitations of Embedding-Based Retrieval Paper • 2508.21038 • Published Aug 28 • 19
Article Should We Still Pretrain Encoders with Masked Language Modeling? By Nicolas-BZRD and 3 others • Jul 2 • 21
Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks Paper • 2506.21182 • Published Jun 26 • 2
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA Paper • 2505.21115 • Published May 27 • 139
Quartet: Native FP4 Training Can Be Optimal for Large Language Models Paper • 2505.14669 • Published May 20 • 77