arXiv:2510.23607

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Published on Oct 27
· Submitted by Xiaoyang Wu on Oct 28
#1 Paper of the day

Abstract

Concerto, a minimalist model combining 3D self-distillation and 2D-3D joint embedding, achieves superior spatial feature learning and outperforms existing models in scene understanding and open-world perception.

AI-generated summary

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto gives rise to spatial representations with superior fine-grained geometric and semantic consistency.
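
Per the abstract, pre-training couples a 3D intra-modal self-distillation term with a 2D-3D cross-modal joint-embedding term. The sketch below shows one plausible form of that two-term objective, assuming a DINO-style distillation loss and an InfoNCE alignment; the encoder handles, temperatures, loss weight, and pixel-point correspondence index are illustrative placeholders, not Concerto's actual code.

```python
# Hypothetical sketch of Concerto-style pre-training: a 3D intra-modal
# self-distillation term plus a 2D-3D cross-modal joint-embedding term.
# All names, temperatures, and the correspondence handling are placeholders.
import torch
import torch.nn.functional as F

def self_distillation_loss(student_feat, teacher_feat, temp_s=0.1, temp_t=0.04):
    """DINO-style distillation: the student matches the (stop-gradient)
    teacher's soft assignments over feature dimensions."""
    teacher_probs = F.softmax(teacher_feat.detach() / temp_t, dim=-1)
    student_logp = F.log_softmax(student_feat / temp_s, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

def cross_modal_loss(point_feat, pixel_feat, temp=0.07):
    """InfoNCE alignment between point features and the 2D features of
    their corresponding pixels (correspondences assumed precomputed)."""
    p = F.normalize(point_feat, dim=-1)
    q = F.normalize(pixel_feat, dim=-1)
    logits = p @ q.t() / temp                       # (N, N) similarities
    targets = torch.arange(p.size(0), device=p.device)
    return F.cross_entropy(logits, targets)

def concerto_step(points_v1, points_v2, pixel_feat, corr_idx,
                  student_3d, teacher_3d, lam=1.0):
    """One pre-training step: distill within 3D, align across 2D-3D."""
    s = student_3d(points_v1)                       # (N, D) point features
    with torch.no_grad():
        t = teacher_3d(points_v2)                   # EMA teacher, second view
    loss = self_distillation_loss(s, t)
    return loss + lam * cross_modal_loss(s[corr_idx], pixel_feat)
```

In a real run the teacher would be an exponential-moving-average copy of the student, updated after each optimizer step, and the loss weight would be tuned rather than fixed at 1.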

Community

TL;DR: Concerto provides a Point Transformer V3 backbone pre-trained with joint 2D-3D self-supervision for 3D point cloud downstream tasks, modified from Sonata. A hedged usage sketch follows the links below.

Homepage: https://pointcept.github.io/Concerto/
Gradio Demo: https://huggingface.co/spaces/Pointcept/Concerto
Inference Code: https://github.com/Pointcept/Concerto
Training Code: https://github.com/Pointcept/Pointcept
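
The abstract's open-world perception comes from a translator that linearly projects Concerto features into CLIP's language space. Below is a minimal sketch of how such a translator could drive open-vocabulary point labeling; the backbone stand-in, feature dimensions, and checkpoint handling are hypothetical, so consult the inference repo above for the real entry points.

```python
# Hypothetical sketch: project per-point features into CLIP's text space with
# a single linear translator, then label each point by its nearest class
# prompt. Dimensions and the backbone stand-in are illustrative placeholders.
import torch
import torch.nn as nn

FEAT_DIM, CLIP_DIM = 512, 512  # illustrative sizes, not the paper's

# Stand-in for a pre-trained Concerto backbone (xyz+rgb -> point features).
# In practice, build Point Transformer V3 and load the released weights
# following the Concerto inference repo.
backbone = nn.Sequential(nn.Linear(6, FEAT_DIM), nn.GELU(),
                         nn.Linear(FEAT_DIM, FEAT_DIM))
backbone.eval()

# The translator: one linear map from point-feature space to CLIP space.
translator = nn.Linear(FEAT_DIM, CLIP_DIM, bias=False)

@torch.no_grad()
def open_vocab_labels(points, text_embeds):
    """Label each point with the class whose CLIP text embedding is most
    similar. `points` is (N, 6); `text_embeds` is (C, CLIP_DIM), holding
    precomputed CLIP embeddings of class-name prompts."""
    feats = backbone(points)                              # (N, FEAT_DIM)
    proj = nn.functional.normalize(translator(feats), dim=-1)
    text = nn.functional.normalize(text_embeds, dim=-1)
    return (proj @ text.t()).argmax(dim=-1)               # (N,) class indices

# Example with random stand-ins for a point cloud and two class prompts.
labels = open_vocab_labels(torch.randn(1000, 6), torch.randn(2, CLIP_DIM))
```

Keeping the translator linear means open-vocabulary labels come from a single projection over frozen features, in the same spirit as the linear-probing evaluation reported in the abstract.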

Thanks for the great work! 👍 👍 👍
