Update
- .gitattributes +1 -0
- README.md +71 -0
- images/drop-upcycling.png +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+images/drop-upcycling.png filter=lfs diff=lfs merge=lfs -text
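The added rule follows the standard Git LFS attribute pattern: the `filter`, `diff`, and `merge` drivers hand the file to LFS, and `-text` disables end-of-line normalization. As a hypothetical sketch (not part of this commit), a broader rule of the same shape would track every PNG at once:

```
# Hypothetical broader rule; this commit tracks only the single image file.
*.png filter=lfs diff=lfs merge=lfs -text
```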
README.md CHANGED
@@ -2,3 +2,74 @@
 license: apache-2.0
 ---
 
+<h1 align="center">
+<img alt="Drop-Upcycling" src="images/drop-upcycling.png"><br>
+<b>Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization</b><br>
+</h1>
+
+<p align="center">
+📄 <a href="https://openreview.net/forum?id=gx1wHnf5Vp">[Paper]</a> |
+🤗 <a href="https://huggingface.co/collections/llm-jp/drop-upcycling-674dc5be7bbb45e12a476b80">[Hugging Face]</a> |
+📚 <a href="https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3">[Dataset]</a> |
+💻 <a href="https://github.com/Taishi-N324/Drop-Upcycling">[Code]</a> |
+📈 <a href="https://wandb.ai/taishi-nakamura/Drop-Upcycling">[Log]</a>
+</p>
+
+# Model Index
+
+We provide model checkpoints for all experiments to ensure reproducibility of the results presented in Tables 1 and 2.
+
+## Table 1
+
+|Model|Link|
+|---|---|
+|1 Dense 152M| [Link](https://huggingface.co/llm-jp/Dense-152M) |
+|2 MoE FS 8x152M| [Link](https://huggingface.co/llm-jp/FS-8x152M) |
+|3 MoE BTX 8x152M| [Link](https://huggingface.co/llm-jp/BTX-8x152M) |
+|4 MoE NU 8x152M| [Link](https://huggingface.co/llm-jp/NU-8x152M) |
+|5 MoE RNU (r=0.5) 8x152M| [Link](https://huggingface.co/llm-jp/RNU-0.5-8x152M) |
+|6 MoE DU (r=0.5) 8x152M| [Link](https://huggingface.co/llm-jp/DU-0.5-8x152M) |
+|7 MoE DU (r=1.0) 8x152M| [Link](https://huggingface.co/llm-jp/DU-1.0-8x152M) |
+|8 Dense 1.5B| [Link](https://huggingface.co/llm-jp/Dense-1.5B) |
+|9 MoE FS 8x1.5B| [Link](https://huggingface.co/llm-jp/FS-8x1.5B) |
+|10 MoE BTX 8x1.5B| [Link](https://huggingface.co/llm-jp/BTX-8x1.5B) |
+|11 MoE NU 8x1.5B| [Link](https://huggingface.co/llm-jp/NU-8x1.5B) |
+|12 MoE RNU (r=0.5) 8x1.5B| [Link](https://huggingface.co/llm-jp/RNU-0.5-8x1.5B) |
+|13 MoE DU (r=0.5) 8x1.5B| [Link](https://huggingface.co/llm-jp/DU-0.5-8x1.5B) |
+|14 MoE DU (r=1.0) 8x1.5B| [Link](https://huggingface.co/llm-jp/DU-1.0-8x1.5B) |
+
+## Table 2
+
+|Model|Link|
+|---|---|
+|1 Dense 3.7B| [Link](https://huggingface.co/llm-jp/Dense-3.7B) |
+|2 MoE FS 8x3.7B| [Link](https://huggingface.co/llm-jp/FS-8x3.7B) |
+|3 MoE DU (r=0.5) 8x3.7B| [Link](https://huggingface.co/llm-jp/DU-0.5-8x3.7B) |
+|4 Dense 13B| [Link](https://huggingface.co/llm-jp/Dense-13B) |
+|5 Dense 3.7B (llm-jp-3-3.7b)| [Link](https://huggingface.co/llm-jp/llm-jp-3-3.7b) |
+
+## BTX Experts
+
+|Model|Link|
+|---|---|
+|Japanese expert 152M| [Link](https://huggingface.co/llm-jp/Dense-btx-japanese-expert-152M) |
+|English expert 152M| [Link](https://huggingface.co/llm-jp/Dense-btx-english-expert-152M) |
+|Code expert 152M| [Link](https://huggingface.co/llm-jp/Dense-btx-code-expert-152M) |
+|Japanese expert 1.5B| [Link](https://huggingface.co/llm-jp/Dense-btx-japanese-expert-1.5B) |
+|English expert 1.5B| [Link](https://huggingface.co/llm-jp/Dense-btx-english-expert-1.5B) |
+|Code expert 1.5B| [Link](https://huggingface.co/llm-jp/Dense-btx-code-expert-1.5B) |
+
+## How to cite
+
+If you find our work helpful, please feel free to cite it.
+
+```bibtex
+@inproceedings{
+nakamura2025dropupcycling,
+title={Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization},
+author={Taishi Nakamura and Takuya Akiba and Kazuki Fujii and Yusuke Oda and Rio Yokota and Jun Suzuki},
+booktitle={The Thirteenth International Conference on Learning Representations},
+year={2025},
+url={https://openreview.net/forum?id=gx1wHnf5Vp}
+}
+```
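For reference, here is a minimal sketch of loading one of the checkpoints from the Model Index above with the `transformers` library. It assumes the repos are standard causal-LM checkpoints that `AutoModelForCausalLM` can resolve; the model id and prompt are illustrative, so consult the individual model cards for exact usage:

```python
# Minimal loading sketch (assumes a standard causal-LM checkpoint layout).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llm-jp/DU-0.5-8x152M"  # smallest Drop-Upcycling MoE in Table 1

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Mixture of Experts models", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If the MoE checkpoints use a Mixtral-style architecture (an assumption on our part, not something this commit states), a reasonably recent `transformers` release is needed.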
images/drop-upcycling.png ADDED (Git LFS)