SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
AIML, University of Adelaide · Adobe Research · UNC Chapel Hill · UNSW Sydney
Model Description
SAME (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both high-level category-specific search (e.g., "find a chair") and low-level language-guided navigation (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture.
Key Features
- Multi-Task Capability: Single model handles 9 different navigation datasets simultaneously
- State-Adaptive MoE: Dynamic expert routing based on multimodal features (text + visual observations)
- Simulator-Free: Works entirely with pre-computed CLIP ViT-B/16 features - no simulator installation required
- Flexible Architecture: MoE can be placed at attention query, key-value, or feed-forward network positions
Model Architecture
SAME is built on a transformer-based architecture with the following key components:
| Component | Description |
|-----------|-------------|
| Language Encoder | 9-layer BERT-based transformer encoder |
| Image Embeddings | Processes 512-dim CLIP ViT-B/16 panoramic features |
| Local VP Encoder | Viewpoint-level information with cross-modal fusion |
| Global Map Encoder | Global spatial graph with dynamic routing |
| State-Adaptive MoE | 8 experts with top-2 selection, multimodal routing |
MoE Routing
The State-Adaptive MoE uses multimodal features (fused text + visual embeddings) to dynamically route tokens to specialized experts; a minimal routing sketch follows the list below. This allows the model to adapt its behavior based on:
- The granularity of language instructions
- Current visual observations
- Navigation task requirements
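A minimal sketch of this top-2 routing over a single fused state vector per step, assuming a standard softmax gating network; class and variable names here are illustrative, not the repository's API, and the 768 hidden size simply follows the BERT-based encoder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StateAdaptiveMoE(nn.Module):
    """Illustrative top-2 MoE routed by a fused text + visual state vector."""

    def __init__(self, hidden_dim=768, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim * 4),
                nn.GELU(),
                nn.Linear(hidden_dim * 4, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, tokens, state):
        # tokens: (batch, seq, hidden); state: fused text + visual feature (batch, hidden)
        gate = F.softmax(self.router(state), dim=-1)           # (batch, num_experts)
        weights, idx = torch.topk(gate, self.top_k, dim=-1)    # top-2 experts per state
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalise over top-k
        out = torch.zeros_like(tokens)
        for b in range(tokens.size(0)):
            for k in range(self.top_k):
                expert = self.experts[int(idx[b, k])]
                out[b] += weights[b, k] * expert(tokens[b])
        return out
```

Because the gate conditions on the fused language-and-vision state, the experts selected for a coarse category goal can differ from those selected for detailed turn-by-turn instructions.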
Intended Uses
Primary Use Cases
- Vision-and-Language Navigation (VLN): Following natural language instructions in indoor environments
- Object Navigation: Finding target objects given category names
- Dialog-based Navigation: Multi-turn conversational navigation
- Remote Object Grounding: Navigating to and identifying remote objects
Supported Tasks
| Task | Dataset | Description |
|------|---------|-------------|
| Low-Level Navigation | R2R, R2R-PREVALENT, R2R-ScaleVLN | Fine-grained instruction following |
| Object Grounding | REVERIE, REVERIE-ScaleVLN | Navigate to and ground remote objects |
| Long-Horizon VLN | RXR-EN | Long-horizon navigation (English) |
| Dialog Navigation | CVDN | Cooperative vision-and-dialog navigation |
| Object Search | SOON | Semantic object-oriented navigation |
| Object Navigation | ObjectNav-MP3D | Category-based object finding |
How to Use
Installation
git clone https://github.com/GengzeZhou/SAME.git
cd SAME
conda create --name SAME python=3.10
conda activate SAME
pip install -r requirements.txt
Download Data and Models
python download.py --data
python download.py --pretrain
python download.py --checkpoints
Training
cd src
# Single-GPU training
python run.py --config_dir configs/main_multi_q.yaml
# Multi-GPU training with torchrun (4 GPUs shown)
torchrun --nproc_per_node=4 --master_port=29500 \
    run.py --config_dir configs/main_multi_q.yaml
Evaluation
cd src
python run.py --config_dir configs/test.yaml \
--options experiment.resume_file=/path/to/checkpoint.pt
Configuration Options
model:
  use_moe_layer: true            # enable the State-Adaptive MoE layers
  moe_type: "Task"               # expert routing scheme
  moe_position: "Attn_q"         # MoE placement: attention query ("Attn_q"), key-value, or FFN
  task_routing_feature: "multi"  # route on fused multimodal (text + visual) features
  num_experts: 8                 # total number of experts
  num_experts_per_tok: 2         # top-k experts selected per routing step
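The evaluation command above overrides config entries with dotted keys via `--options`. Below is a hypothetical sketch of how such dotted overrides can be merged into a YAML config; it is illustrative only, not the repository's actual config loader, and override values are kept as plain strings.

```python
import yaml


def load_config(path, options=()):
    # Each option is a dotted override, e.g. "experiment.resume_file=/path/to/ckpt.pt".
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for opt in options:
        dotted_key, value = opt.split("=", 1)
        node = cfg
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value  # values stay plain strings in this sketch
    return cfg


cfg = load_config("configs/test.yaml",
                  ["experiment.resume_file=/path/to/checkpoint.pt"])
```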
Training Details
Training Data
SAME is trained on 9 navigation datasets with weighted sampling (an illustrative sampling sketch follows the table below):
| Dataset | Environment | Sampling Weight |
|---------|-------------|-----------------|
| R2R-ScaleVLN | HM3D | 10-20 |
| R2R-PREVALENT | MP3D | 1 |
| R2R | MP3D | 1 |
| REVERIE-ScaleVLN | HM3D | 1-10 |
| REVERIE | MP3D | 1 |
| RXR-EN | MP3D | 1 |
| CVDN | MP3D | 1 |
| SOON | MP3D | 1 |
| ObjectNav-MP3D | MP3D (Habitat) | 2 |
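One simple way to realise the weighted mixing above is to draw the dataset for each training batch with probability proportional to its weight. The sketch below is illustrative, not the repository's sampler; where the table gives a range, a single value inside that range is assumed.

```python
import random

# Weights follow the table above; range entries are replaced by one assumed value.
DATASET_WEIGHTS = {
    "R2R-ScaleVLN": 15,      # card lists 10-20
    "R2R-PREVALENT": 1,
    "R2R": 1,
    "REVERIE-ScaleVLN": 5,   # card lists 1-10
    "REVERIE": 1,
    "RXR-EN": 1,
    "CVDN": 1,
    "SOON": 1,
    "ObjectNav-MP3D": 2,
}


def sample_dataset(rng=random):
    # Draw the dataset for the next training batch in proportion to its weight.
    names = list(DATASET_WEIGHTS)
    weights = [DATASET_WEIGHTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```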
Training Hyperparameters
- Optimizer: AdamW
- Learning Rate: 1e-5
- Total Iterations: 500,000
- Batch Size: 16
- Gradient Clipping: 0.5
- Training Algorithm: DAgger (Dataset Aggregation)
- MoE Auxiliary Loss Coefficient: 0.8 (a load-balancing sketch follows this list)
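The auxiliary loss keeps expert usage balanced during routing. Its exact form in SAME is not spelled out in this card, so the sketch below assumes the common load-balancing formulation (fraction of assignments per expert times mean router probability), scaled by the 0.8 coefficient listed above.

```python
import torch
import torch.nn.functional as F


def load_balancing_loss(router_logits, expert_indices, num_experts=8, coeff=0.8):
    # router_logits: (batch, num_experts) raw gate scores
    # expert_indices: (batch, top_k) experts chosen by the router
    probs = F.softmax(router_logits, dim=-1).mean(dim=0)   # mean routing prob per expert
    counts = torch.zeros(num_experts, device=router_logits.device)
    counts.scatter_add_(
        0,
        expert_indices.reshape(-1),
        torch.ones_like(expert_indices, dtype=counts.dtype).reshape(-1),
    )
    frac = counts / counts.sum()                            # fraction of assignments per expert
    return coeff * num_experts * torch.sum(frac * probs)
```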
Visual Features
- Feature Extractor: CLIP ViT-B/16
- Feature Dimension: 512
- Format: HDF5 / LMDB (a loading sketch follows this list)
- Environments: MatterSim, Habitat-MP3D, Habitat-HM3D
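A minimal sketch of reading the pre-computed 512-dim CLIP ViT-B/16 panorama features from an HDF5 file; the file path and the `scan_viewpoint` key layout are assumptions for illustration, not a guaranteed on-disk format.

```python
import h5py
import numpy as np


def load_view_features(h5_path, scan_id, viewpoint_id):
    # Assumed layout: one dataset per "scan_viewpoint" key, shaped (num_views, 512).
    with h5py.File(h5_path, "r") as f:
        feats = f[f"{scan_id}_{viewpoint_id}"][...]
    assert feats.shape[-1] == 512, "expected CLIP ViT-B/16 feature dimension"
    return feats.astype(np.float32)
```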
Evaluation Results
SAME achieves state-of-the-art or highly competitive performance across all navigation benchmarks as a unified model, outperforming task-specific approaches in many cases.
Main Results (Unified Model)
Room-to-Room (R2R)
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | 76 | 66 |
| Test Unseen | 74 | 64 |
REVERIE
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | 46.4 | 36.1 |
| Test Unseen | 48.6 | 37.1 |
RxR-EN (Long-Horizon VLN, English)
| Split | SR ↑ | nDTW ↑ |
|-------|------|--------|
| Val Unseen | 50.5 | 51.2 |
CVDN (Dialog Navigation)
| Split | GP ↑ |
|-------|------|
| Val | 6.94 |
| Test | 7.07 |
SOON (Object-Oriented Navigation)
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | 36.1 | 25.4 |
| Test Unseen | 38.2 | 27.1 |
ObjectNav-MP3D
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val | 76.3 | 42.7 |
Evaluation Metrics
- SR (Success Rate): Percentage of successful navigations (episode ends within 3 m of the goal)
- SPL (Success weighted by Path Length): Efficiency-weighted success rate (SR and SPL are sketched after this list)
- nDTW (normalized Dynamic Time Warping): Path similarity to ground truth
- GP (Goal Progress): Progress towards the goal in dialog navigation
- NE (Navigation Error): Distance to goal at episode end
- OSR (Oracle Success Rate): Success rate with oracle stop action
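SR and SPL follow their standard definitions; the sketch below computes both over a list of evaluated episodes, using the 3 m success threshold noted above (the `episodes` field names are illustrative).

```python
def success_rate(episodes, threshold=3.0):
    # SR: fraction of episodes ending within `threshold` metres of the goal.
    return sum(ep["nav_error"] <= threshold for ep in episodes) / len(episodes)


def spl(episodes, threshold=3.0):
    # SPL_i = S_i * shortest / max(agent_path, shortest), averaged over episodes.
    total = 0.0
    for ep in episodes:
        success = ep["nav_error"] <= threshold
        total += success * ep["shortest_path_len"] / max(
            ep["agent_path_len"], ep["shortest_path_len"]
        )
    return total / len(episodes)
```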
Model Variants
| Variant | MoE Position | Routing | Checkpoint |
|---------|--------------|---------|------------|
| SAME-Q | Attention Query | Multimodal | Attnq_pretrained_ckpt.pt |
| SAME-KV | Attention Key/Value | Multimodal | Attnkv_pretrained_ckpt.pt |
| SAME-FFN | Feed-Forward Network | Multimodal | FFN_pretrained_ckpt.pt |
Limitations
- Indoor Environments Only: Trained and evaluated on indoor navigation datasets
- Pre-computed Features: Requires pre-extracted CLIP features; cannot process raw images directly
- English Language: Primary support for English instructions (though RXR provides multilingual data)
- Static Environments: Assumes static environments without dynamic obstacles or agents
Environmental Impact
- Hardware: Training conducted on NVIDIA A100 GPUs
- Training Time: Approximately 2-3 days on 4x A100 GPUs
Citation
If you find this work helpful, please cite:
@article{zhou2024same,
title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
journal={arXiv preprint arXiv:2412.05552},
year={2024},
}
Authors
- Gengze Zhou - AIML, University of Adelaide
- Yicong Hong - Adobe Research
- Zun Wang - UNC Chapel Hill
- Chongyang Zhao - UNSW Sydney
- Mohit Bansal - UNC Chapel Hill
- Qi Wu - University of Adelaide
Acknowledgements
We extend our gratitude to:
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
For questions or issues, please open an issue on the GitHub repository or contact the authors.