# VisionLLaMA-Base-MAE
Following the Masked Autoencoder (MAE) paradigm, VisionLLaMA-Base-MAE is pretrained on ImageNet-1K without labels. It shows substantial improvements on ImageNet-1K classification (supervised fine-tuning and linear probing) and on ADE20K semantic segmentation.
| Model | ImageNet-1K Acc (SFT, %) | ImageNet-1K Acc (Linear Probe, %) | ADE20K Segmentation (mIoU) |
|---|---|---|---|
| VisionLLaMA-Base-MAE (ep800) | 84.0 | 69.7 | 49.0 |
| VisionLLaMA-Base-MAE (ep1600) | 84.3 | 71.7 | 50.2 |
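
The linear-probe numbers above come from freezing the pretrained encoder and training only a linear classifier on its features. The sketch below illustrates that setup in PyTorch with a stand-in backbone; the actual VisionLLaMA-Base encoder, its feature dimension, and the full training recipe come from the GitHub repository and the paper, so treat the names and numbers here as placeholders.

```python
import torch
import torch.nn as nn

# Stand-in backbone for illustration only; in practice this would be the
# VisionLLaMA-Base encoder loaded from the released MAE checkpoint.
backbone = nn.Sequential(
    nn.Conv2d(3, 768, kernel_size=16, stride=16),  # patchify into 768-dim tokens
    nn.Flatten(2),
    nn.AdaptiveAvgPool1d(1),                       # global average pooling
    nn.Flatten(1),
)

# Linear probing: freeze every backbone parameter and train only the head.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

head = nn.Linear(768, 1000)  # ImageNet-1K has 1000 classes
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(images, labels):
    with torch.no_grad():            # frozen backbone, no gradients needed
        feats = backbone(images)
    loss = criterion(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random tensors standing in for an ImageNet batch.
print(probe_step(torch.randn(4, 3, 224, 224), torch.randint(0, 1000, (4,))))
```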
## How to Use
Please refer to the GitHub page for usage instructions.
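
The released weights are ordinary PyTorch checkpoints, so loading one typically looks like the minimal sketch below. The checkpoint file name, the `"model"` key, and the stand-in module are assumptions for illustration; the actual file names and model builders are defined in the GitHub repository.

```python
import torch
import torch.nn as nn

# Hypothetical checkpoint name; the real file is listed in the repository.
CKPT_PATH = "visionllama_base_mae_ep1600.pth"

# Stand-in module so the snippet runs end to end; in practice this would be
# the VisionLLaMA-Base model built from the code in the GitHub repository.
model = nn.Linear(768, 768)

try:
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    # MAE-style checkpoints commonly nest the weights under a "model" key.
    state = ckpt.get("model", ckpt)
    missing, unexpected = model.load_state_dict(state, strict=False)
    print("missing keys:", len(missing), "unexpected keys:", len(unexpected))
except FileNotFoundError:
    print(f"Download the checkpoint and save it as {CKPT_PATH}")
```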
## Citation
```bibtex
@article{chu2024visionllama,
  title={VisionLLaMA: A Unified LLaMA Interface for Vision Tasks},
  author={Chu, Xiangxiang and Su, Jianlin and Zhang, Bo and Shen, Chunhua},
  journal={arXiv preprint arXiv:2403.00522},
  year={2024}
}
```