kaizuberbuehler's Collections
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper • 2404.01197 • 31
CosmicMan: A Text-to-Image Foundation Model for Humans
Paper • 2404.01294 • 17
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper • 2406.08707 • 17
DataComp-LM: In search of the next generation of training sets for language models
Paper • 2406.11794 • 54
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning
Paper • 2406.08973 • 89
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Paper • 2406.08418 • 31
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Paper • 2406.08451 • 25
argilla/magpie-ultra-v0.1
Viewer • 50k • 302 • 221
Viewer • 52.5B • 280k • 2.41k
Viewer • 61.6M • 59k • 957
Viewer • 31.1M • 38.9k • 642
Viewer • 546M • 33k • 866
Viewer • 1M • 3.27k • 762
Viewer • 2.14M • 45.5k • 770
Viewer • 55.1k • 125 • 96
HuggingFaceFW/fineweb-edu
Viewer • 3.5B • 346k • 787
Viewer • 1.75M • 243 • 102
Viewer • 100k • 10.7k • 242
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
Paper • 2409.12568 • 50
RedPajama: an Open Dataset for Training Large Language Models
Paper • 2411.12372 • 56
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
Paper • 2411.07461 • 23
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Paper • 2411.04905 • 127
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Paper • 2501.04686 • 53
Viewer • 450k • 7.74k • 659
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Paper • 2501.18511 • 20
MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion
Paper • 2502.04235 • 22
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Paper • 2502.06589 • 20
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
Paper • 2502.09082 • 30
EgoLife: Towards Egocentric Life Assistant
Paper • 2503.03803 • 45
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Paper • 2503.02951 • 33
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Paper • 2503.10582 • 24
ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback
Paper • 2503.21332 • 23
OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
Paper • 2504.01943 • 15
MegaMath: Pushing the Limits of Open Math Corpora
Paper • 2504.02807 • 34
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Paper • 2504.13161 • 93
DataDecide: How to Predict Best Pretraining Data with Small Experiments
Paper • 2504.11393 • 18
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs
Paper • 2504.14655 • 20