---
license: mit
datasets:
- opendatalab/SlimPajama-Meta-rater
language:
- en
---

# PRRC-Readability Language Model (1.3B Parameters, 30B Tokens)

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This is a 1.3B parameter transformer-based decoder-only language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the **Readability** dimension of the PRRC framework. The training data was curated by selecting text with high readability scores, focusing on clear, coherent, and well-structured content.

## Model Details

- **Architecture**: Transformer decoder-only
- **Parameters**: 1.345B (1,345,423,360 parameters)
- **Training Tokens**: 30B tokens
- **Context Window**: 1,024 tokens
- **Vocabulary Size**: 32,000 (LLaMA tokenizer)
- **Data Selection Method**: Top-k selection based on Readability scores
- **Rating Model**: ModernBERT-base fine-tuned for Readability assessment

## Architecture Specifications

- **Hidden Dimension**: 2,048
- **Number of Layers**: 24
- **Attention Heads**: 16
- **Key-Value Heads**: 16
- **MLP Ratio**: 8/3
- **Position Encoding**: RoPE (base=10,000)

## Data Selection Criteria

The training data was selected using the Readability rating model, which evaluates:

- **Clarity**: Clear and comprehensible language
- **Coherence**: Logical flow and organization
- **Grammar**: Proper sentence structure and grammar
- **Accessibility**: Appropriate vocabulary and sentence complexity
- **Structure**: Well-organized content with proper formatting

Selected texts typically include:

- Well-written articles and essays
- Clear educational materials
- Professional communications
- Edited publications and books
- Quality journalism and reporting

## Training Details

- **Hardware**: 32x NVIDIA A800 GPUs
- **Global Batch Size**: 4,194,304 tokens
- **Learning Rate**: 5e-5
- **Optimizer**: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- **Training Time**: ~14 hours

## Performance Results

### Downstream Task Performance (Average Accuracy)

- **General Knowledge**: 56.18% (+3.39% vs Random)
  - ARC-Easy: 55.64%
  - ARC-Challenge: 26.19%
  - SciQ: 86.70%
- **Commonsense Reasoning**: 45.41% (+1.47% vs Random)
  - HellaSwag: 42.89%
  - SIQA: 40.17%
  - WinoGrande: 53.16%
- **Reading Comprehension**: 31.20% (+1.18% vs Random)
  - RACE: 32.00%
  - OpenbookQA: 30.40%
- **Overall Average**: 45.89% (+2.11% vs Random)

## Key Findings

- **Balanced Performance**: Consistent improvements across all task categories
- **Reading Comprehension**: Strong improvement in text understanding tasks
- **Clear Communication**: Enhanced ability to generate coherent and readable text
- **General Applicability**: Well-rounded performance suitable for diverse applications

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-1b-readability"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text (particularly good for clear, readable content)
prompt = "The benefits of renewable energy include"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        # Pass the mask explicitly, since the pad token is set to EOS below
        attention_mask=inputs.attention_mask,
        max_length=100,  # total length in tokens, prompt included
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Applications

This model is particularly well-suited for:

- **Content writing** and editing assistance
- **Educational materials** creation
- **Clear communication** tasks
- **General text generation** with high readability
- **Documentation** and technical writing
- **Public communication** and outreach
- **Accessibility-focused** content creation

## Strengths

- Generates clear and coherent text
- Balanced performance across different task types
- Improved reading comprehension capabilities
- Well-structured and organized output
- Suitable for general-purpose applications
- Enhanced text clarity and flow

## Limitations

- May prioritize clarity over technical depth
- Might avoid complex but necessary terminology
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- Could oversimplify complex topics

## Comparison with Baselines

- **vs Random Baseline**: +2.11% overall improvement across all categories
- **vs Other PRRC Dimensions**: Most balanced performance, strongest in reading comprehension
- **vs Meta-rater All (25)**: Shows focused improvement in text clarity and comprehension

## Model Characteristics

This model excels at:

- **Clarity**: Producing easy-to-understand text
- **Coherence**: Maintaining logical flow in generation
- **Accessibility**: Using appropriate vocabulary for broad audiences
- **Structure**: Organizing information effectively
- **Readability**: Optimizing text for human comprehension

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.
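
## Appendix: Checking the Reported Sizes

The figures under Model Details, Architecture Specifications, and Training Details can be cross-checked with a little arithmetic. The sketch below is illustrative only and rests on assumptions this card does not state: a LLaMA-style layout (no biases, gated MLP with three weight matrices, two weight-only norms per layer plus a final norm), an untied output projection, and an MLP width obtained by rounding 8/3 × hidden up to a multiple of 128. The function names are hypothetical. Under those assumptions the count matches the reported 1,345,423,360 parameters.

```python
def gated_mlp_width(hidden: int, multiple_of: int = 128) -> int:
    """Round 8/3 * hidden up to a multiple of `multiple_of`.

    The rounding rule is an assumption; the card only gives the 8/3 ratio.
    """
    return -(-8 * hidden // (3 * multiple_of)) * multiple_of


def llama_style_param_count(
    vocab: int, hidden: int, layers: int, heads: int, kv_heads: int, mlp_width: int
) -> int:
    """Parameter count for an assumed LLaMA-style decoder without biases."""
    head_dim = hidden // heads
    embed = vocab * hidden                        # input embedding
    lm_head = vocab * hidden                      # untied output projection (assumed)
    attn = 2 * hidden * hidden                    # Q and O projections
    attn += 2 * hidden * kv_heads * head_dim      # K and V projections
    mlp = 3 * hidden * mlp_width                  # gate, up, and down projections
    norms = 2 * hidden                            # two weight-only norms per layer
    return embed + lm_head + layers * (attn + mlp + norms) + hidden  # + final norm


def steps_for(total_tokens: int, batch_tokens: int) -> int:
    """Optimizer steps needed to consume `total_tokens` (ceiling division)."""
    return -(-total_tokens // batch_tokens)


width = gated_mlp_width(2048)
params = llama_style_param_count(
    vocab=32_000, hidden=2048, layers=24, heads=16, kv_heads=16, mlp_width=width
)
print(f"MLP width:           {width}")                    # 5504
print(f"Parameters:          {params:,}")                 # 1,345,423,360
print(f"Sequences per batch: {4_194_304 // 1024}")        # 4096
# "30B tokens" is treated as exactly 30e9 here, which is an approximation.
print(f"Optimizer steps:     {steps_for(30_000_000_000, 4_194_304)}")  # 7153
```

The global batch of 4,194,304 tokens corresponds to 4,096 sequences at the 1,024-token context window, so 30B tokens works out to roughly 7,150 optimizer steps. The released checkpoint's configuration remains the authoritative source for the actual layer shapes.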