Improve model card: Add metadata, links, abstract, and usage for Concerto
This PR enhances the model card for the Concerto model by:
- Adding key metadata: `pipeline_tag: graph-ml`, `library_name: pytorch`, `license: apache-2.0`, and descriptive `tags`.
- Updating the paper link to the Hugging Face paper page.
- Including direct links to the official project page and the GitHub repository for easy access to code and further details.
- Adding the paper abstract to provide comprehensive context about the model.
- Pointing users to the GitHub repository for detailed installation, training, and inference instructions, in line with the guideline of not inventing unverified code snippets.
Please review and merge this PR.
    	
README.md CHANGED

````diff
@@ -1 +1,42 @@
-
+---
+pipeline_tag: graph-ml
+library_name: pytorch
+license: apache-2.0
+tags:
+- 3d
+- point-cloud
+- self-supervised-learning
+---
+
+# Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
+
+This repository contains the model weights for **Concerto**, a novel approach for learning robust spatial representations presented in the paper [Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations](https://huggingface.co/papers/2510.23607).
+
+- **Paper:** [Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations](https://huggingface.co/papers/2510.23607)
+- **Project Page:** [https://pointcept.github.io/Concerto/](https://pointcept.github.io/Concerto/)
+- **Codebase:** [https://github.com/Pointcept/Pointcept](https://github.com/Pointcept/Pointcept)
+
+## Abstract
+Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
+
+## Usage
+For detailed installation, data preparation, training, and testing instructions, please refer to the [official GitHub repository](https://github.com/Pointcept/Pointcept).
+
+## Citation
+If you find Concerto or the Pointcept codebase useful in your research, please cite the following papers:
+
+```bibtex
+@misc{pointcept2023,
+    title={Pointcept: A Codebase for Point Cloud Perception Research},
+    author={Pointcept Contributors},
+    howpublished = {\url{https://github.com/Pointcept/Pointcept}},
+    year={2023}
+}
+
+@article{zhang2025concerto,
+  title={Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations},
+  author={Zhang, Yujia and Wu, Xiaoyang and Lao, Yixing and Wang, Chengyao and Tian, Zhuotao and Wang, Naiyan and Zhao, Hengshuang},
+  journal={Conference on Neural Information Processing Systems},
+  year={2025},
+}
+```
````
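The card itself intentionally defers all usage to the Pointcept codebase, so no usage code is added to the README. Purely as a review-time sanity check, below is a minimal, hypothetical sketch of fetching and inspecting the checkpoint with `huggingface_hub`. The `repo_id` and `filename` are placeholders (take the real values from this model repo's file listing), and constructing the Concerto model still requires Pointcept, as the Usage section says.

```python
# Hypothetical sanity-check sketch -- not from the official docs.
# repo_id and filename are placeholders; read the real ones off this
# model repo's file listing. Building the model itself is done via
# the Pointcept codebase (see the card's Usage section).
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="Pointcept/Concerto",  # placeholder repo id
    filename="model_best.pth",     # placeholder checkpoint name
)

# Deserialize on CPU. Pointcept-style checkpoints are often a dict
# wrapping a "state_dict" entry, but the exact layout is an assumption.
ckpt = torch.load(ckpt_path, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

# Print a few parameter names and shapes to confirm the weights loaded.
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```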