Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models Paper • 2506.11116 • Published Jun 9 • 4
CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models Paper • 2506.07463 • Published Jun 9 • 10
CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models Paper • 2410.18505 • Published Oct 24, 2024 • 11
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data Paper • 2410.18558 • Published Oct 24, 2024 • 19
AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies Paper • 2408.06567 • Published Aug 13, 2024 • 2