---
license: mit
datasets:
- opendatalab/SlimPajama-Meta-rater
language:
- en
---

# PRRC-Readability Language Model (1.3B Parameters, 30B Tokens)

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This is a 1.3B-parameter, decoder-only transformer language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the **Readability** dimension of the PRRC framework. The training data was curated by selecting text with high readability scores, favoring clear, coherent, and well-structured content.

## Model Details

- **Architecture**: Transformer decoder-only
- **Parameters**: 1.345B (1,345,423,360 parameters)
- **Training Tokens**: 30B tokens
- **Context Window**: 1,024 tokens
- **Vocabulary Size**: 32,000 (LLaMA tokenizer)
- **Data Selection Method**: Top-k selection based on Readability scores
- **Rating Model**: ModernBERT-base fine-tuned for Readability assessment

## Architecture Specifications

- **Hidden Dimension**: 2,048
- **Number of Layers**: 24
- **Attention Heads**: 16
- **Key-Value Heads**: 16
- **MLP Ratio**: 8/3
- **Position Encoding**: RoPE (base=10,000)
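
The parameter count listed under Model Details can be reproduced from these specifications. The sketch below is a back-of-the-envelope check, not the training code: it assumes a LLaMA-style block (SwiGLU MLP with three projections, two RMSNorm weight vectors per layer, no bias terms), untied input and output embeddings, and an MLP width of 8/3 × hidden rounded up to a multiple of 128. None of these details are stated in this card; they are assumptions that happen to match the reported total.

```python
# Architecture values taken from the lists above.
VOCAB_SIZE = 32_000
HIDDEN = 2_048
LAYERS = 24


def swiglu_intermediate(hidden: int, multiple: int = 128) -> int:
    """Round (8/3) * hidden up to a multiple of `multiple` (assumed rounding rule)."""
    return -(-8 * hidden // (3 * multiple)) * multiple


def count_parameters(vocab: int, hidden: int, layers: int) -> int:
    """Count weights under the assumptions stated in the lead-in."""
    inter = swiglu_intermediate(hidden)
    attention = 4 * hidden * hidden   # Q, K, V, O projections (16 heads, 16 KV heads)
    mlp = 3 * hidden * inter          # gate, up, and down projections
    norms = 2 * hidden                # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * vocab * hidden   # untied input embedding and output head
    final_norm = hidden
    return layers * per_layer + embeddings + final_norm


total = count_parameters(VOCAB_SIZE, HIDDEN, LAYERS)
print(swiglu_intermediate(HIDDEN))  # 5504
print(f"{total:,}")                 # 1,345,423,360
```

Because Key-Value Heads equals Attention Heads, the key and value projections are full-width, which is why attention contributes four hidden × hidden matrices per layer.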

## Data Selection Criteria

The training data was selected using the Readability rating model, which evaluates:
- **Clarity**: Clear and comprehensible language
- **Coherence**: Logical flow and organization
- **Grammar**: Proper sentence structure and grammar
- **Accessibility**: Appropriate vocabulary and sentence complexity
- **Structure**: Well-organized content with proper formatting

Selected texts typically include:
- Well-written articles and essays
- Clear educational materials
- Professional communications
- Edited publications and books
- Quality journalism and reporting
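
As an illustration of the top-k selection named under Model Details, the sketch below ranks documents by readability score and keeps the highest-scoring ones until a token budget is filled. The `Doc` class, the `select_top_by_score` helper, the score values, and the budget are all hypothetical; the actual pipeline is in the Meta-rater repository and may differ.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    readability: float  # score from the rating model (made-up values below)
    n_tokens: int


def select_top_by_score(docs, token_budget):
    """Keep the highest-scoring documents until the token budget is reached."""
    selected, used = [], 0
    for doc in sorted(docs, key=lambda d: d.readability, reverse=True):
        if used >= token_budget:
            break
        selected.append(doc)
        used += doc.n_tokens
    return selected, used


# Toy corpus with hypothetical scores and token counts.
corpus = [
    Doc("A clearly written explainer.", 4.6, 300),
    Doc("asdf $$ click here >>", 0.8, 50),
    Doc("A well-edited news report.", 4.1, 500),
    Doc("Run-on forum post with no punctuation", 2.3, 400),
]

chosen, used = select_top_by_score(corpus, token_budget=700)
print([d.readability for d in chosen], used)  # [4.6, 4.1] 800
```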

## Training Details

- **Hardware**: 32x NVIDIA A800 GPUs
- **Global Batch Size**: 4,194,304 tokens
- **Learning Rate**: 5e-5
- **Optimizer**: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- **Training Time**: ~14 hours
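
A few derived quantities follow from these settings. The arithmetic below is a rough sketch that assumes "30B" means exactly 3e10 tokens, that every sequence fills the full 1,024-token context, and that "~14 hours" is exactly 14 hours; the step count and throughput are therefore estimates, not reported figures.

```python
import math

GLOBAL_BATCH_TOKENS = 4_194_304   # from the list above
CONTEXT_WINDOW = 1_024            # from Model Details
TRAINING_TOKENS = 30_000_000_000  # "30B", assumed to mean exactly 3e10
NUM_GPUS = 32
TRAINING_HOURS = 14               # "~14 hours", treated as exact here

# Sequences consumed per optimizer step, globally and per GPU
# (the per-GPU share may be split into micro-batches in practice).
sequences_per_step = GLOBAL_BATCH_TOKENS // CONTEXT_WINDOW
sequences_per_gpu = sequences_per_step // NUM_GPUS

# Optimizer steps needed to consume the token budget.
optimizer_steps = math.ceil(TRAINING_TOKENS / GLOBAL_BATCH_TOKENS)

# Approximate throughput implied by the wall-clock time.
tokens_per_second = TRAINING_TOKENS / (TRAINING_HOURS * 3600)
tokens_per_second_per_gpu = tokens_per_second / NUM_GPUS

print(sequences_per_step)                # 4096
print(sequences_per_gpu)                 # 128
print(optimizer_steps)                   # 7153
print(round(tokens_per_second_per_gpu))  # roughly 18,600 tokens/s per GPU
```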

## Performance Results

### Downstream Task Performance (Average Accuracy)

- **General Knowledge**: 56.18% (+3.39% vs Random)
  - ARC-Easy: 55.64%
  - ARC-Challenge: 26.19%
  - SciQ: 86.70%

- **Commonsense Reasoning**: 45.41% (+1.47% vs Random)
  - HellaSwag: 42.89%
  - SIQA: 40.17%
  - WinoGrande: 53.16%

- **Reading Comprehension**: 31.20% (+1.18% vs Random)
  - RACE: 32.00%
  - OpenbookQA: 30.40%

- **Overall Average**: 45.89% (+2.11% vs Random)
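
The averages above can be recomputed from the per-task accuracies. The short check below shows that each category figure is the unweighted mean of its tasks and that the overall average is the mean over all eight tasks, not the mean of the three category averages. The implied random-baseline value assumes the "+2.11%" delta is in absolute percentage points.

```python
from statistics import mean

# Per-task accuracies (%) copied from the list above.
scores = {
    "General Knowledge": {"ARC-Easy": 55.64, "ARC-Challenge": 26.19, "SciQ": 86.70},
    "Commonsense Reasoning": {"HellaSwag": 42.89, "SIQA": 40.17, "WinoGrande": 53.16},
    "Reading Comprehension": {"RACE": 32.00, "OpenbookQA": 30.40},
}

# Unweighted mean per category.
category_avg = {name: round(mean(tasks.values()), 2) for name, tasks in scores.items()}

# Overall average over all eight tasks.
all_tasks = [acc for tasks in scores.values() for acc in tasks.values()]
overall = round(mean(all_tasks), 2)

# Random-baseline average implied by the reported delta (assumed absolute points).
random_overall = round(overall - 2.11, 2)

print(category_avg)
print(overall)         # 45.89
print(random_overall)  # 43.78
```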

## Key Findings

- **Balanced Performance**: Improvements over the random baseline in all three task categories (+1.18% to +3.39%)
- **Reading Comprehension**: +1.18% over the random baseline on RACE and OpenbookQA
- **Clear Communication**: Enhanced ability to generate coherent and readable text
- **General Applicability**: Well-rounded performance suitable for diverse applications

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-1b-readability"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text (particularly good for clear, readable content)
prompt = "The benefits of renewable energy include"
inputs = tokenizer(prompt, return_tensors="pt")

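# Sample a continuation; passing **inputs also supplies the attention mask
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=100,  # total length in tokens, prompt included
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )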

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Applications

This model is particularly well-suited for:
- **Content writing** and editing assistance
- **Educational materials** creation
- **Clear communication** tasks
- **General text generation** with high readability
- **Documentation** and technical writing
- **Public communication** and outreach
- **Accessibility-focused** content creation

## Strengths

- Generates clear and coherent text
- Balanced performance across different task types
- Improved reading comprehension capabilities
- Well-structured and organized output
- Suitable for general-purpose applications
- Enhanced text clarity and flow

## Limitations

- May prioritize clarity over technical depth
- Might avoid complex but necessary terminology
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- Could oversimplify complex topics
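
One way to work within the 1,024-token context window is to trim long prompts before generation so that the prompt plus the requested new tokens still fits. The helper below is a minimal sketch operating on a plain list of token IDs; `fit_to_context` is a hypothetical name, not part of any library, and keeping the most recent tokens is only one possible policy (it can drop a leading beginning-of-sequence token).

```python
def fit_to_context(token_ids, max_new_tokens, context_window=1024):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    if max_new_tokens >= context_window:
        raise ValueError("max_new_tokens must be smaller than the context window")
    budget = context_window - max_new_tokens
    return token_ids[-budget:]


# A 1,500-token prompt with room reserved for 100 generated tokens.
prompt_ids = list(range(1500))
trimmed = fit_to_context(prompt_ids, max_new_tokens=100)
print(len(trimmed), trimmed[0])  # 924 576

# Short prompts pass through unchanged.
print(fit_to_context([1, 2, 3], max_new_tokens=100))  # [1, 2, 3]
```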

## Comparison with Baselines

- **vs Random Baseline**: +2.11% overall average accuracy, with gains in all three task categories
- **vs Other PRRC Dimensions**: Most balanced performance, strongest in reading comprehension
- **vs Meta-rater All (25)**: Shows focused improvement in text clarity and comprehension

## Model Characteristics

This model excels at:
- **Clarity**: Producing easy-to-understand text
- **Coherence**: Maintaining logical flow in generation
- **Accessibility**: Using appropriate vocabulary for broad audiences
- **Structure**: Organizing information effectively
- **Readability**: Optimizing text for human comprehension

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

The model card metadata lists this model under the MIT license. Please also refer to the license terms of the original SlimPajama dataset and follow any applicable data licensing requirements.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.