SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

๐Ÿ•AIML, University of Adelaide ๐ŸŒญAdobe Research ๐Ÿ”UNC, Chapel Hill ๐ŸŒฎUNSW Sydney

Model Description

SAME (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both high-level category-specific search (e.g., "find a chair") and low-level language-guided navigation (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture.

Key Features

  • Multi-Task Capability: Single model handles 9 different navigation datasets simultaneously
  • State-Adaptive MoE: Dynamic expert routing based on multimodal features (text + visual observations)
  • Simulator-Free: Works entirely with pre-computed CLIP ViT-B/16 features - no simulator installation required
  • Flexible Architecture: MoE can be placed at attention query, key-value, or feed-forward network positions

Model Architecture

SAME is built on a transformer-based architecture with the following key components:

| Component | Description |
|---|---|
| Language Encoder | 9-layer BERT-based transformer encoder |
| Image Embeddings | Processes 512-dim CLIP ViT-B/16 panoramic features |
| Local VP Encoder | Viewpoint-level information with cross-modal fusion |
| Global Map Encoder | Global spatial graph with dynamic routing |
| State-Adaptive MoE | 8 experts with top-2 selection, multimodal routing |
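The skeleton below is only an illustrative PyTorch sketch of how the listed components could be composed; the class name SAMESkeleton, the depths of the viewpoint and map encoders, the use of simple concatenation for cross-modal fusion, and the forward signature are assumptions made for readability, not the repository's actual modules.

import torch
import torch.nn as nn


class SAMESkeleton(nn.Module):
    """Illustrative composition of the components in the table above (not the repository's classes)."""

    def __init__(self, clip_dim: int = 512, hidden: int = 768):
        super().__init__()

        def make_layer() -> nn.TransformerEncoderLayer:
            return nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)

        self.language_encoder = nn.TransformerEncoder(make_layer(), num_layers=9)    # 9-layer BERT-style text encoder
        self.img_proj = nn.Linear(clip_dim, hidden)                                  # lift 512-dim CLIP panorama features
        self.local_vp_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)    # viewpoint-level fusion (depth assumed)
        self.global_map_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)  # global spatial reasoning (depth assumed)
        self.action_head = nn.Linear(hidden, 1)                                      # score candidate views/nodes

    def forward(self, text_emb: torch.Tensor, pano_feats: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, L, hidden) token embeddings; pano_feats: (B, V, 512) pre-computed CLIP features
        txt = self.language_encoder(text_emb)
        vis = self.img_proj(pano_feats)
        fused = self.local_vp_encoder(torch.cat([txt, vis], dim=1))  # concat-and-attend stand-in for cross-modal fusion
        ctx = self.global_map_encoder(fused)
        return self.action_head(ctx[:, txt.size(1):]).squeeze(-1)    # one logit per candidate view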

MoE Routing

The State-Adaptive MoE uses multimodal features (fused text + visual embeddings) to dynamically route tokens to specialized experts. This allows the model to adapt its behavior based on:

  • The granularity of language instructions
  • Current visual observations
  • Navigation task requirements
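As a rough illustration of this routing scheme, the PyTorch sketch below gates a top-2 selection over 8 experts with a fused multimodal state vector; the class name, the expert MLP shape, the per-sample routing granularity, and the renormalisation of the selected weights are assumptions rather than SAME's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class StateAdaptiveMoESketch(nn.Module):
    """Illustrative top-2-of-8 expert routing driven by a fused multimodal state (not the official layer)."""

    def __init__(self, hidden: int = 768, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts)  # gate conditioned on the fused text + visual state
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        )

    def forward(self, tokens: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, H) tokens to transform; state: (B, H) fused multimodal routing feature
        weights = F.softmax(self.router(state), dim=-1)    # (B, num_experts)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)  # keep the two most relevant experts
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)    # renormalise the selected weights
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                  # samples whose k-th choice is expert e
                if mask.any():
                    out[mask] += top_w[mask, k, None, None] * expert(tokens[mask])
        return out


moe = StateAdaptiveMoESketch()
y = moe(torch.randn(4, 36, 768), torch.randn(4, 768))  # e.g. 36 panoramic view tokens per sample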

Intended Uses

Primary Use Cases

  • Vision-and-Language Navigation (VLN): Following natural language instructions in indoor environments
  • Object Navigation: Finding target objects given category names
  • Dialog-based Navigation: Multi-turn conversational navigation
  • Remote Object Grounding: Navigating to and identifying remote objects

Supported Tasks

| Task | Dataset | Description |
|---|---|---|
| Low-Level Navigation | R2R, R2R-PREVALENT, R2R-ScaleVLN | Fine-grained instruction following |
| Object Grounding | REVERIE, REVERIE-ScaleVLN | Navigate and ground remote objects |
| Long-Horizon VLN | RXR-EN | Long-horizon navigation (English) |
| Dialog Navigation | CVDN | Cooperative vision-and-dialog navigation |
| Object Search | SOON | Semantic object-oriented navigation |
| Object Navigation | ObjectNav-MP3D | Category-based object finding |

How to Use

Installation

git clone https://github.com/GengzeZhou/SAME.git
cd SAME
conda create --name SAME python=3.10
conda activate SAME
pip install -r requirements.txt

Download Data and Models

# Download all datasets and features
python download.py --data

# Download pretrained models
python download.py --pretrain

# Download trained checkpoints (optional)
python download.py --checkpoints

Training

cd src

# Single GPU training
python run.py --config_dir configs/main_multi_q.yaml

# Multi-GPU distributed training
torchrun --nproc_per_node=4 --master_port=29500 \
    run.py --config_dir configs/main_multi_q.yaml

Evaluation

cd src
python run.py --config_dir configs/test.yaml \
    --options experiment.resume_file=/path/to/checkpoint.pt

Configuration Options

model:
  use_moe_layer: true
  moe_type: "Task"              # Task-based MoE
  moe_position: "Attn_q"        # Attn_q, Attn_kv, or FFN
  task_routing_feature: "multi" # Multimodal routing (recommended)
  num_experts: 8
  num_experts_per_tok: 2        # Top-2 expert selection

Training Details

Training Data

SAME is trained on 9 navigation datasets with weighted sampling:

| Dataset | Environment | Sampling Weight |
|---|---|---|
| R2R-ScaleVLN | HM3D | 10-20 |
| R2R-PREVALENT | MP3D | 1 |
| R2R | MP3D | 1 |
| REVERIE-ScaleVLN | HM3D | 1-10 |
| REVERIE | MP3D | 1 |
| RXR-EN | MP3D | 1 |
| CVDN | MP3D | 1 |
| SOON | MP3D | 1 |
| ObjectNav-MP3D | MP3D (Habitat) | 2 |
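As a rough illustration of weighted sampling over these datasets, the sketch below draws the source dataset for each training batch with random.choices; the concrete weights used for the ScaleVLN entries (which the table gives as ranges) and the helper itself are illustrative assumptions, not the repository's sampler.

import random

# Illustrative weights mirroring the table above; the ScaleVLN ranges (10-20 and 1-10)
# are fixed to single values purely for the example.
DATASET_WEIGHTS = {
    "R2R-ScaleVLN": 10, "R2R-PREVALENT": 1, "R2R": 1,
    "REVERIE-ScaleVLN": 5, "REVERIE": 1, "RXR-EN": 1,
    "CVDN": 1, "SOON": 1, "ObjectNav-MP3D": 2,
}


def sample_dataset(rng=random) -> str:
    """Pick the dataset the next training batch is drawn from, proportional to its weight."""
    names, weights = zip(*DATASET_WEIGHTS.items())
    return rng.choices(names, weights=weights, k=1)[0]


# With these illustrative weights, roughly 10/23 of batches come from R2R-ScaleVLN.
print(sample_dataset())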

Training Hyperparameters

  • Optimizer: AdamW
  • Learning Rate: 1e-5
  • Total Iterations: 500,000
  • Batch Size: 16
  • Gradient Clipping: 0.5
  • Training Algorithm: DAgger (Dataset Aggregation)
  • MoE Auxiliary Loss Coefficient: 0.8
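A minimal sketch of how these hyperparameters might be wired into a PyTorch training loop is shown below; the assumed model interface returning a navigation (DAgger/imitation) loss plus a separate MoE auxiliary loss, and the train_iter placeholder, are illustrative assumptions rather than the repository's actual training code.

import torch
from torch import nn


def train(model: nn.Module, train_iter, total_iters: int = 500_000,
          lr: float = 1e-5, grad_clip: float = 0.5, moe_aux_coef: float = 0.8) -> None:
    """Training-loop sketch tying together the hyperparameters listed above (interfaces assumed)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(total_iters):
        batch = next(train_iter)                       # batches of size 16, drawn with weighted sampling
        nav_loss, moe_aux_loss = model(batch)          # assumed return: DAgger/imitation loss + MoE aux loss
        loss = nav_loss + moe_aux_coef * moe_aux_loss  # auxiliary load-balancing term weighted by 0.8
        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        optimizer.step()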

Visual Features

  • Feature Extractor: CLIP ViT-B/16
  • Feature Dimension: 512
  • Format: HDF5 / LMDB
  • Environments: MatterSim, Habitat-MP3D, Habitat-HM3D
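A minimal sketch of consuming the pre-computed features from HDF5 is shown below; the file path, the scan_id_viewpoint_id key convention, and the (num_views, 512) shape are assumptions about the on-disk layout rather than the documented format.

import h5py
import numpy as np


def load_pano_features(h5_path: str, scan_id: str, viewpoint_id: str) -> np.ndarray:
    """Return the (num_views, 512) CLIP ViT-B/16 features for one panorama (key layout assumed)."""
    key = f"{scan_id}_{viewpoint_id}"  # hypothetical "scan_viewpoint" key convention
    with h5py.File(h5_path, "r") as f:
        feats = f[key][...].astype(np.float32)
    assert feats.shape[-1] == 512, "expected 512-dim CLIP ViT-B/16 features"
    return feats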

Evaluation Results

SAME achieves state-of-the-art or highly competitive performance across all navigation benchmarks as a unified model, outperforming task-specific approaches in many cases.

Main Results (Unified Model)

Room-to-Room (R2R)

| Split | SR ↑ | SPL ↑ |
|---|---|---|
| Val Unseen | 76 | 66 |
| Test Unseen | 74 | 64 |

REVERIE

| Split | SR ↑ | SPL ↑ |
|---|---|---|
| Val Unseen | 46.4 | 36.1 |
| Test Unseen | 48.6 | 37.1 |

RxR-EN (Long-Horizon VLN, English)

| Split | SR ↑ | nDTW ↑ |
|---|---|---|
| Val Unseen | 50.5 | 51.2 |

CVDN (Dialog Navigation)

| Split | GP ↑ |
|---|---|
| Val | 6.94 |
| Test | 7.07 |

SOON (Object-Oriented Navigation)

| Split | SR ↑ | SPL ↑ |
|---|---|---|
| Val Unseen | 36.1 | 25.4 |
| Test Unseen | 38.2 | 27.1 |

ObjectNav-MP3D

| Split | SR ↑ | SPL ↑ |
|---|---|---|
| Val | 76.3 | 42.7 |

Evaluation Metrics

  • SR (Success Rate): Percentage of successful navigations (within 3m of goal)
  • SPL (Success weighted by Path Length): Efficiency-weighted success rate
  • nDTW (normalized Dynamic Time Warping): Path similarity to ground truth
  • GP (Goal Progress): Progress towards the goal in dialog navigation
  • NE (Navigation Error): Distance to goal at episode end
  • OSR (Oracle Success Rate): Success rate with oracle stop action
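For reference, SR and SPL follow their standard definitions (success within 3 m of the goal, and success weighted by shortest-path length over the longer of the taken and shortest paths); the sketch below computes them from per-episode records whose field names are illustrative.

from typing import Dict, List


def success_rate(episodes: List[Dict], threshold: float = 3.0) -> float:
    """SR: fraction of episodes that stop within `threshold` metres of the goal."""
    return sum(ep["dist_to_goal"] <= threshold for ep in episodes) / len(episodes)


def spl(episodes: List[Dict], threshold: float = 3.0) -> float:
    """SPL: success weighted by shortest-path length over the longer of taken and shortest path."""
    total = 0.0
    for ep in episodes:
        success = ep["dist_to_goal"] <= threshold
        total += success * ep["shortest_path_length"] / max(ep["path_length"], ep["shortest_path_length"])
    return total / len(episodes)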

Model Variants

| Variant | MoE Position | Routing | Checkpoint |
|---|---|---|---|
| SAME-Q | Attention Query | Multimodal | Attnq_pretrained_ckpt.pt |
| SAME-KV | Attention K/V | Multimodal | Attnkv_pretrained_ckpt.pt |
| SAME-FFN | Feed-Forward | Multimodal | FFN_pretrained_ckpt.pt |
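A hedged example of loading one of these checkpoints for inspection with plain PyTorch is given below; whether the weights are stored directly or nested under a state_dict key is an assumption, since the checkpoint layout is not documented here.

import torch

# Load the attention-query MoE variant on CPU for inspection (file name from the table above).
ckpt = torch.load("Attnq_pretrained_ckpt.pt", map_location="cpu")

# The weights may be stored directly or nested under a key such as "state_dict" (assumption).
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"loaded {len(state_dict)} entries")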

Limitations

  • Indoor Environments Only: Trained and evaluated on indoor navigation datasets
  • Pre-computed Features: Requires pre-extracted CLIP features; cannot process raw images directly
  • English Instructions Only: Trained on English instructions (only the English split of RxR is used, although RxR itself is multilingual)
  • Static Environments: Assumes static environments without dynamic obstacles or agents

Environmental Impact

  • Hardware: Training conducted on NVIDIA A100 GPUs
  • Training Time: Approximately 2-3 days on 4x A100 GPUs

Citation

If you find this work helpful, please cite:

@article{zhou2024same,
  title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
  author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
  journal={arXiv preprint arXiv:2412.05552},
  year={2024},
}

Authors

  • Gengze Zhou - AIML, University of Adelaide
  • Yicong Hong - Adobe Research
  • Zun Wang - UNC Chapel Hill
  • Chongyang Zhao - UNSW Sydney
  • Mohit Bansal - UNC Chapel Hill
  • Qi Wu - University of Adelaide

Acknowledgements

We extend our gratitude to:

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions or issues, please open an issue on the GitHub repository or contact the authors.
