DeFM: Learning Foundation Representations from Depth for Robotics
DeFM (Depth Foundation Model) is a vision backbone trained on 60M depth images via self-distillation. It is engineered for robotic perception, providing metric-aware representations that excel in sim-to-real transfer and cross-sensor generalization.
TL;DR - A DINO-style encoder, but for depth image inputs.
🌟 Key Features
- Large-Scale Pretraining: We pretrain on our curated dataset of 60M depth images using self-distillation.
- Semantic Awareness: DeFM learns not only robust geometric priors but also semantically rich features from just depth images.
- Metric-Aware Normalization: Our novel three-channel input normalization preserves metric depth across multiple scales.
- Compact, Efficient Models: We distill DeFM-ViT-L into a family of smaller, efficient CNNs, with as few as 3M parameters, for robot policy learning.
- Robotics Proven: Our encoder has proven effective across diverse robotic tasks such as navigation, manipulation, and locomotion, without task-specific fine-tuning.
Usage
Visit our GitHub repo for details on how to use the models.
📊 Model Zoo
The following table gives an overview of the DeFM model family: parameter count, inference latency on training (RTX 4090) and deployment (Jetson Orin) hardware at 224x224 input resolution, and performance on the ImageNet-1k-Depth benchmark.
| Model | Params (M) | RTX 4090 (ms) | Jetson Orin (ms) | Top-5 KNN (%) | Linear Probe (%) | Checkpoint |
|---|---|---|---|---|---|---|
| DeFM ViT-L/14 | 307.0 | 624.91 | 72.82 | 84.79 | 71.72 | Download |
| DeFM ViT-S/14 | 22.1 | 63.76 | 11.92 | 78.06 | 61.54 | Download |
| DeFM ResNet-50 | 26.2 | 69.39 | 17.79 | 77.63 | 61.54 | Download |
| DeFM ResNet-34 | 21.8 | 33.08 | 13.54 | 72.72 | 54.39 | Download |
| DeFM ResNet-18 | 11.7 | 21.06 | 8.67 | 69.69 | 50.58 | Download |
| DeFM EfficientNet-B6 | 28.98 | 150.98 | 54.11 | 77.81 | 59.23 | Download |
| DeFM EfficientNet-B4 | 14.16 | 86.51 | 39.67 | 74.74 | 54.73 | Download |
| DeFM EfficientNet-B2 | 4.95 | 46.12 | 28.37 | 71.51 | 50.32 | Download |
| DeFM EfficientNet-B0 | 3.01 | 29.39 | 21.04 | 67.98 | 46.17 | Download |
| DeFM RegNetY-1.6GF | 12.4 | 44.25 | 41.82 | 76.21 | 57.28 | Download |
| DeFM RegNetY-800MF | 6.3 | 25.21 | 24.16 | 74.91 | 57.03 | Download |
| DeFM RegNetY-400MF | 4.1 | 17.27 | 25.17 | 72.87 | 50.51 | Download |
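For readers unfamiliar with k-NN evaluation of a frozen encoder, the following is a minimal sketch of the general idea: features are extracted once, and each test feature is labeled by a vote among its nearest training features. It is not the evaluation code behind the table. The function names, the choice of cosine similarity, majority voting, and `k=5` are all assumptions made for illustration; the exact protocol used for the Top-5 KNN column is not specified here.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def knn_predict(train_feats, train_labels, query, k=5):
    """Hypothetical helper: predict a label by majority vote among the k
    training features with the highest cosine similarity to the query."""
    q = l2_normalize(query)
    sims = []
    for feat, label in zip(train_feats, train_labels):
        f = l2_normalize(feat)
        sims.append((sum(a * b for a, b in zip(q, f)), label))
    # Most similar neighbors first.
    sims.sort(key=lambda s: s[0], reverse=True)
    votes = Counter(label for _, label in sims[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=5):
    """Percentage of test features whose k-NN prediction matches the true label."""
    correct = sum(
        knn_predict(train_feats, train_labels, feat, k) == label
        for feat, label in zip(test_feats, test_labels)
    )
    return 100.0 * correct / len(test_labels)


# Toy 2-D "features" standing in for frozen encoder outputs (assumed data).
train_feats = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
train_labels = [0, 0, 0, 1, 1, 1]
test_feats = [[1.0, 0.05], [0.05, 1.0]]
test_labels = [0, 1]

accuracy = knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=3)
```

In practice the toy lists would be replaced by features produced by a frozen DeFM encoder on the benchmark's train and validation splits; because the encoder is never updated, this kind of evaluation measures representation quality rather than fine-tuning capacity.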
📖 Citation
If you find DeFM useful for your research, please cite our paper:
```bibtex
@misc{patel2026defm,
  title         = {DeFM: Learning Foundation Representations from Depth for Robotics},
  author        = {Patel, Manthan and Frey, Jonas and Mittal, Mayank and Yang, Fan and Hansson, Alexander and Bar, Amir and Cadena, Cesar and Hutter, Marco},
  year          = {2026},
  archivePrefix = {arXiv},
  eprint        = {XXXX.XXXXX},
  primaryClass  = {cs.RO}
}
```