Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

📄 [Paper] | 🤗 [Hugging Face] | 📁 [Dataset] | 💻 [Code] | 📊 [Log]

Model Index

We provide model checkpoints for all experiments to ensure reproducibility of the results presented in Tables 1 and 2.
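
As a rough illustration of how one of these checkpoints might be used, the sketch below loads a model with the Hugging Face `transformers` library and generates a short continuation. It assumes the checkpoints are compatible with `AutoModelForCausalLM` and `AutoTokenizer`, which this page does not state explicitly. The repository ID `llm-jp/NU-8x1.5B` is the one named on this page; the helper names `load_checkpoint` and `generate` are hypothetical and exist only for this example.

```python
# Repository named on this page; the other checkpoints have their own IDs.
REPO_ID = "llm-jp/NU-8x1.5B"


def load_checkpoint(repo_id: str = REPO_ID):
    """Download a tokenizer and model from the Hugging Face Hub.

    Assumption: the checkpoint loads through the generic Auto* classes.
    """
    # Imported lazily so the helpers can be inspected without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,  # BF16 is the tensor type this page lists
    )
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 32) -> str:
    """Tokenize the prompt, generate a continuation, and decode it to text."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `load_checkpoint()` and passing the returned tokenizer and model to `generate` with a prompt should return the prompt followed by the model's continuation. Loading downloads the full set of weights, so enough disk space and memory for the chosen checkpoint is needed.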

Table 1

Model	Link
1 Dense 152M	Link
2 MoE FS 8x152M	Link
3 MoE BTX 8x152M	Link
4 MoE NU 8x152M	Link
5 MoE RNU (r=0.5) 8x152M	Link
6 MoE DU (r=0.5) 8x152M	Link
7 MoE DU (r=1.0) 8x152M	Link
8 Dense 1.5B	Link
9 MoE FS 8x1.5B	Link
10 MoE BTX 8x1.5B	Link
11 MoE NU 8x1.5B	Link
12 MoE RNU (r=0.5) 8x1.5B	Link
13 MoE DU (r=0.5) 8x1.5B	Link
14 MoE DU (r=1.0) 8x1.5B	Link

Table 2

Model	Link
1 Dense 3.7B	Link
2 MoE FS 8x3.7B	Link
3 MoE DU (r=0.5) 8x3.7B	Link
4 Dense 13B	Link
5 Dense 3.7B	Link

BTX Experts

Model	Link
Japanese expert 152M	Link
English expert 152M	Link
Code expert 152M	Link
Japanese expert 1.5B	Link
English expert 1.5B	Link
Code expert 1.5B	Link

How to cite

If you find our work helpful, please cite the following paper.

@inproceedings{nakamura2025dropupcycling,
    title={Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization},
    author={Taishi Nakamura and Takuya Akiba and Kazuki Fujii and Yusuke Oda and Rio Yokota and Jun Suzuki},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=gx1wHnf5Vp}
}

Model size: 9B params (Safetensors). Tensor type: BF16.