Update README.md
---
license: mit
datasets:
- sail/regmix-data
- sail/regmix-data-sample
language:
- en
---

# Models Trained with Random Mixture

This is a collection of 64 language models, each with approximately 1B parameters, trained on different random mixtures of data. This project aims to validate the generalization capabilities of the RegMix approach (https://huggingface.co/papers/2407.01492) from small-scale (e.g., 1M parameters) to large-scale (e.g., 1B parameters) models.

## Key Features

- **Model Size**: 64 separate models, each with ~1B parameters
- **Training Data**: Random data mixtures drawn from the [RegMix-Data](https://huggingface.co/datasets/sail/regmix-data) dataset
- **Purpose**: To validate the effectiveness of RegMix at identifying high-performing data mixtures

## Dataset

The models were trained using the [RegMix-Data](https://huggingface.co/datasets/sail/regmix-data) dataset, which is split into domains derived from The Pile.

## Training Hyperparameters

| Hyperparameter | Value |
|:---------------|:------|
| Batch Size | 1M tokens |
| Learning Rate | 4e-4 |
| Minimum Learning Rate | 1e-5 |
| Learning Rate Schedule | Cosine |
| Warmup Ratio | 4% |
| Total Tokens | 25B |
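For reference, these settings imply 25,000 optimizer steps (25B tokens at 1M tokens per step), of which 4%, i.e. 1,000 steps, are warmup. The snippet below is a hypothetical sketch of such a schedule, not the training code used for these models; the linear warmup and the way the 1e-5 floor is applied are assumptions.

```python
import math

# Derived from the table above (assumed step accounting).
total_steps = 25_000_000_000 // 1_000_000   # 25B tokens / 1M-token batches = 25,000 steps
warmup_steps = int(0.04 * total_steps)      # 4% warmup = 1,000 steps
peak_lr, min_lr = 4e-4, 1e-5

def lr_at(step: int) -> float:
    """Cosine decay from peak_lr to min_lr after a linear warmup (illustrative only)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))  # 0.0 0.0004 1e-05
```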
## How to Load a Model

You can load any model using the corresponding branch with the Hugging Face Transformers library:

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("sail/data-mixture-random-1b", revision="model-index-1")
tokenizer = AutoTokenizer.from_pretrained("sail/data-mixture-random-1b", revision="model-index-1")
```

## Data Mixture

The specific data mixture used for training each 1B model can be found in the file `train_config.yaml` in each corresponding model branch.
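If you want to read a mixture programmatically, one option is to download the branch's config with `huggingface_hub` and parse it with PyYAML. This is a hedged sketch: it assumes `train_config.yaml` sits at the repository root of each branch and that the file is plain YAML.

```python
import yaml  # pip install pyyaml
from huggingface_hub import hf_hub_download

# Fetch the training config from one model branch (the branch name is passed as `revision`).
config_path = hf_hub_download(
    repo_id="sail/data-mixture-random-1b",
    filename="train_config.yaml",
    revision="model-index-1",
)

with open(config_path) as f:
    train_config = yaml.safe_load(f)

# Print the config to locate the mixture weights for this branch.
print(train_config)
```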