Drop-Upcycling

[Paper] | [Hugging Face] | [Dataset] | [Code] | [Log]
We provide model checkpoints for all experiments to ensure reproducibility of the results presented in Tables 1 and 2.

| No. | Model | Link |
|---|---|---|
| 1 | Dense 152M | Link |
| 2 | MoE FS 8x152M | Link |
| 3 | MoE BTX 8x152M | Link |
| 4 | MoE NU 8x152M | Link |
| 5 | MoE RNU (r=0.5) 8x152M | Link |
| 6 | MoE DU (r=0.5) 8x152M | Link |
| 7 | MoE DU (r=1.0) 8x152M | Link |
| 8 | Dense 1.5B | Link |
| 9 | MoE FS 8x1.5B | Link |
| 10 | MoE BTX 8x1.5B | Link |
| 11 | MoE NU 8x1.5B | Link |
| 12 | MoE RNU (r=0.5) 8x1.5B | Link |
| 13 | MoE DU (r=0.5) 8x1.5B | Link |
| 14 | MoE DU (r=1.0) 8x1.5B | Link |

| No. | Model | Link |
|---|---|---|
| 1 | Dense 3.7B | Link |
| 2 | MoE FS 8x3.7B | Link |
| 3 | MoE DU (r=0.5) 8x3.7B | Link |
| 4 | Dense 13B | Link |
| 5 | Dense 3.7B | Link |

| Model | Link |
|---|---|
| Japanese expert 152M | Link |
| English expert 152M | Link |
| Code expert 152M | Link |
| Japanese expert 1.5B | Link |
| English expert 1.5B | Link |
| Code expert 1.5B | Link |

If you find our work helpful, please consider citing it:
@inproceedings{
nakamura2025dropupcycling,
title={Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization},
author={Taishi Nakamura and Takuya Akiba and Kazuki Fujii and Yusuke Oda and Rio Yokota and Jun Suzuki},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=gx1wHnf5Vp}
}