BEHAVIOR Challenge 1st Place: Solution Summary
My team recently won 1st place in the BEHAVIOR Challenge at NeurIPS. The competition focused on training a single policy to complete 50 long-horizon household tasks in simulation.
We built an end-to-end policy based on Pi0.5 with a bunch of custom modifications. Everything is open-sourced, and it should be useful for anyone exploring VLAs or adapting them to specific tasks.
Key Architecture Changes:
- Replaced language model with 50 trainable task embeddings (no text at all)
- Correlated noise for Flow Matching: ϵ ∼ N(0, 0.5I + 0.5Σ) using dataset action covariance
- Learnable mixed-layer attention: each action expert layer attends to a trainable mix of all VLM layers
- System 2 stage tracking: model predicts task stage, we smooth it with voting and feed it back as context
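For anyone who wants to try the correlated noise idea, here is a minimal NumPy sketch of sampling ϵ ∼ N(0, 0.5I + 0.5Σ). It is not taken from our codebase: the function name, the `beta` parameter, and the assumption that action chunks are flattened into rows of an `(N, D)` array are all illustrative.

```python
import numpy as np

def sample_correlated_noise(actions, num_samples, beta=0.5, rng=None):
    """Draw eps ~ N(0, (1 - beta) * I + beta * Sigma).

    actions: (N, D) array of flattened, normalized action chunks (assumed layout).
    beta:    mixing weight; 0.5 reproduces the 0.5I + 0.5Sigma mix from the post.
    """
    rng = np.random.default_rng() if rng is None else rng
    dim = actions.shape[1]
    sigma = np.cov(actions, rowvar=False)            # (D, D) empirical covariance
    mixed = (1.0 - beta) * np.eye(dim) + beta * sigma
    chol = np.linalg.cholesky(mixed)                 # mixed = chol @ chol.T
    white = rng.standard_normal((num_samples, dim))  # uncorrelated unit noise
    return white @ chol.T                            # rows now have covariance `mixed`
```

The identity term keeps the mixed covariance positive definite, so the Cholesky factorization succeeds even when the dataset covariance is rank-deficient.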
Training:
- Multi-sample Flow Matching: 15 FM samples per VLM pass to reduce gradient variance
- Delta action space + per-timestamp normalization
- FAST auxiliary loss and stage prediction loss
- Trained on 224×224 RGB + proprioception only
- We use 4 fine-tuned checkpoints, all derived from a multi-task model trained on all 50 tasks
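The multi-sample Flow Matching bullet can be illustrated with a rough sketch: the expensive VLM context is computed once, then the loss is averaged over several independent (noise, time) draws. Everything here is an assumption for illustration, including the `velocity_fn` stand-in for the action expert, the uniform time sampling, and the interpolation convention `x_t = t * noise + (1 - t) * actions`. A real training loop would do this in an autodiff framework rather than plain NumPy.

```python
import numpy as np

def multi_sample_fm_loss(context, actions, velocity_fn, num_samples=15, rng=None):
    """Average the flow matching loss over several (noise, time) draws
    that all reuse one precomputed VLM context.

    context:     output of a single VLM forward pass (opaque here)
    actions:     (H, D) ground-truth action chunk
    velocity_fn: hypothetical action expert, (context, x_t, t) -> (H, D)
    """
    rng = np.random.default_rng() if rng is None else rng
    losses = []
    for _ in range(num_samples):
        noise = rng.standard_normal(actions.shape)
        t = rng.uniform(0.0, 1.0)                # assumed time distribution
        x_t = t * noise + (1.0 - t) * actions    # assumed interpolation convention
        target = noise - actions                 # velocity along that straight path
        pred = velocity_fn(context, x_t, t)
        losses.append(np.mean((pred - target) ** 2))
    return float(np.mean(losses))
```

Averaging 15 draws per context gives a lower-variance estimate of the same expected loss, while the VLM forward pass is paid for only once.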
Inference Optimizations:
- Soft inpainting: predict 30 actions, execute 26, use 4 as an input for the next chunk
- Correlation-aware guidance of inpainting to keep action chunks smooth
- 1.3× speedup via cubic spline compression
- General correction rule: reopen gripper after failed grasps
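To make the 30/26/4 chunking concrete, here is a small sketch of the bookkeeping only. `predict_chunk` is a hypothetical policy call, and the soft constraint that pulls the start of the new chunk toward the carried-over actions would live inside it and is not shown. This is one reading of the scheme, not the exact inference code.

```python
import numpy as np

HORIZON, EXECUTE = 30, 26      # numbers from the post
OVERLAP = HORIZON - EXECUTE    # 4 actions carried into the next chunk

def rollout(predict_chunk, num_chunks):
    """Run a chunked rollout.

    predict_chunk(prefix) -> (HORIZON, D) array is a hypothetical policy call;
    prefix is None on the first call, afterwards the (OVERLAP, D) tail of the
    previous chunk, which the policy uses as an inpainting target.
    """
    executed, prefix = [], None
    for _ in range(num_chunks):
        chunk = predict_chunk(prefix)
        executed.append(chunk[:EXECUTE])   # execute the first 26 actions
        prefix = chunk[EXECUTE:]           # hand the last 4 to the next prediction
    return np.concatenate(executed)
```

Because the unexecuted tail is only an input to the next prediction, the policy can still revise those actions when new observations arrive, which is what makes the inpainting soft rather than a hard copy.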