Drop-Upcycling

[Paper] | [Hugging Face] | [Dataset] | [Code] | [Log]
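We provide model checkpoints for all experiments to ensure reproducibility of the results presented in Tables 1 and 2 of the paper. In the model names, FS denotes training from scratch, BTX Branch-Train-MiX, NU naive Upcycling, RNU Random Noise Upcycling, and DU Drop-Upcycling; r is the re-initialization ratio.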
| # | Model | Link |
|---|---|---|
| 1 | Dense 152M | Link |
| 2 | MoE FS 8x152M | Link |
| 3 | MoE BTX 8x152M | Link |
| 4 | MoE NU 8x152M | Link |
| 5 | MoE RNU (r=0.5) 8x152M | Link |
| 6 | MoE DU (r=0.5) 8x152M | Link |
| 7 | MoE DU (r=1.0) 8x152M | Link |
| 8 | Dense 1.5B | Link |
| 9 | MoE FS 8x1.5B | Link |
| 10 | MoE BTX 8x1.5B | Link |
| 11 | MoE NU 8x1.5B | Link |
| 12 | MoE RNU (r=0.5) 8x1.5B | Link |
| 13 | MoE DU (r=0.5) 8x1.5B | Link |
| 14 | MoE DU (r=1.0) 8x1.5B | Link |

| # | Model | Link |
|---|---|---|
| 1 | Dense 3.7B | Link |
| 2 | MoE FS 8x3.7B | Link |
| 3 | MoE DU (r=0.5) 8x3.7B | Link |
| 4 | Dense 13B | Link |
| 5 | Dense 3.7B | Link |

| Model | Link |
|---|---|
| Japanese expert 152M | Link |
| English expert 152M | Link |
| Code expert 152M | Link |
| Japanese expert 1.5B | Link |
| English expert 1.5B | Link |
| Code expert 1.5B | Link |
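
As a minimal sketch of how one of the checkpoints listed above might be loaded and sampled from, the snippet below uses the `transformers` library. It assumes the checkpoints are published on the Hugging Face Hub in a format that `AutoModelForCausalLM` can read, which this page does not confirm. The helper names `load_checkpoint` and `generate` are hypothetical, and the repository ID must be copied from the corresponding Link entry, since no IDs are given here.

```python
def load_checkpoint(repo_id: str):
    """Load a tokenizer and causal LM from a Hugging Face Hub repository.

    Assumption: the checkpoint is stored in a transformers-compatible format.
    `repo_id` is whatever repository the "Link" column points to.
    """
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


def generate(repo_id: str, prompt: str, max_new_tokens: int = 32) -> str:
    """Generate a short continuation of `prompt` with the given checkpoint."""
    tokenizer, model = load_checkpoint(repo_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    # Default generation settings; no sampling options are set here.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("<repo-id-from-the-Link-column>", "Hello")` would download the checkpoint on first use, so the larger models need correspondingly more disk space and memory.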
If you find our work helpful, please cite it:
@inproceedings{
    nakamura2025dropupcycling,
    title={Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization},
    author={Taishi Nakamura and Takuya Akiba and Kazuki Fujii and Yusuke Oda and Rio Yokota and Jun Suzuki},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=gx1wHnf5Vp}
}