arxiv:2509.09372

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Published on Sep 11 · Submitted by Siteng Huang on Sep 12 · #1 Paper of the day
Authors:

Abstract

AI-generated summary: VLA-Adapter reduces reliance on large-scale VLMs and extensive pre-training by using a lightweight Policy module with Bridge Attention, achieving state-of-the-art performance and fast inference with minimal computational resources.

Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art performance but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed bridging paradigm, VLA-Adapter enables training a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying VLA models. Project page: https://vla-adapter.github.io/.
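
To make the bridging idea concrete, here is a minimal PyTorch sketch of a policy head in this spirit: learnable action queries attend to vision-language features from a small backbone through gated cross-attention before being decoded into an action chunk. This is an illustration under assumptions, not the authors' implementation: the module names (`BridgeAttentionBlock`, `TinyPolicy`), dimensions, gating scheme, and action-chunk size are all hypothetical.

```python
# Minimal sketch (not the authors' implementation) of a lightweight policy head that
# cross-attends learnable action queries to VL backbone features, with a learnable
# gate controlling how strongly each condition is injected into the action space.
import torch
import torch.nn as nn


class BridgeAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # learnable injection strength
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, action_queries, vl_condition):
        # Self-attention over the action queries.
        x = action_queries
        q = self.norm1(x)
        x = x + self.self_attn(q, q, q)[0]
        # Gated cross-attention: inject the VL condition into the action queries.
        x = x + torch.tanh(self.gate) * self.cross_attn(self.norm2(x), vl_condition, vl_condition)[0]
        return x + self.mlp(self.norm3(x))


class TinyPolicy(nn.Module):
    """Maps VL features from a small backbone to a chunk of low-level actions."""

    def __init__(self, d_model: int = 512, n_queries: int = 8, n_layers: int = 4, action_dim: int = 7):
        super().__init__()
        self.action_queries = nn.Parameter(torch.randn(1, n_queries, d_model) * 0.02)
        self.blocks = nn.ModuleList([BridgeAttentionBlock(d_model) for _ in range(n_layers)])
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, vl_condition):                       # vl_condition: (B, T, d_model)
        x = self.action_queries.expand(vl_condition.size(0), -1, -1)
        for blk in self.blocks:
            x = blk(x, vl_condition)
        return self.action_head(x)                         # (B, n_queries, action_dim)


if __name__ == "__main__":
    policy = TinyPolicy()
    vl_feats = torch.randn(2, 64, 512)                     # stand-in for backbone features
    print(policy(vl_feats).shape)                          # torch.Size([2, 8, 7])
```

In this sketch the gate is initialized to zero so the injected condition starts as a no-op and its strength is learned during training; that choice is ours for illustration, not a claim about the paper's design.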

Community


💖 Thank you for your continued interest in VLA-Adapter!
🧑‍🏫We've provided a comprehensive README.md on 😺 GitHub, which should help you run training and inference successfully.
❓️If you encounter any issues while reproducing our results, please file a GitHub issue or join our WeChat group; the group QR code is included in the linked issue. 👉 https://github.com/OpenHelix-Team/VLA-Adapter/issues/1
🌟In addition, our VLA-RFT paper, built on VLA-Adapter, has been released. It uses reinforcement learning to significantly improve the VLA model's robustness to perturbations by interacting with a world model, using only minimal samples and iterations (a generic, illustrative sketch of this style of fine-tuning appears after the links below).
📃Paper: https://arxiv.org/abs/2510.00406
🌏️Project page: https://vla-rft.github.io/
😺GitHub: https://github.com/OpenHelix-Team/VLA-RFT
😊HuggingFace: https://huggingface.co/VLA-RFT
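
For readers curious what this style of training can look like mechanically, below is a heavily simplified PyTorch sketch of reinforcement fine-tuning against a learned world model. It is not the VLA-RFT algorithm: the toy actor, toy world model, placeholder reward, and all hyperparameters are assumptions made purely for illustration; see the VLA-RFT paper and repository above for the actual method.

```python
# Illustrative sketch only, not the VLA-RFT algorithm: reinforcement fine-tuning of a
# pretrained policy by imagining short rollouts inside a learned world model and
# applying a REINFORCE-with-baseline update. ToyActor, ToyWorldModel, reward_fn, and
# every hyperparameter below are hypothetical placeholders, not names from the paper.
import torch
import torch.nn as nn


class ToyActor(nn.Module):
    """Gaussian policy over a 7-DoF action, conditioned on a feature vector."""

    def __init__(self, feat_dim: int = 512, action_dim: int = 7):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(feat_dim, 256), nn.GELU(),
                                  nn.Linear(256, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, feats):
        dist = torch.distributions.Normal(self.mean(feats), self.log_std.exp())
        action = dist.rsample()
        return action, dist.log_prob(action).sum(-1)


class ToyWorldModel(nn.Module):
    """Predicts next feature vector from (features, action); assumed pretrained."""

    def __init__(self, feat_dim: int = 512, action_dim: int = 7):
        super().__init__()
        self.net = nn.Linear(feat_dim + action_dim, feat_dim)

    def forward(self, feats, action):
        return self.net(torch.cat([feats, action], dim=-1))


def rft_step(actor, world_model, reward_fn, feats, optimizer, horizon: int = 4):
    """One step: imagine a short rollout in the world model, score it, and increase
    the log-probability of above-average-reward trajectories."""
    log_probs, rewards = [], []
    for _ in range(horizon):
        action, logp = actor(feats)
        feats = world_model(feats, action).detach()    # imagined next observation features
        rewards.append(reward_fn(feats, action).detach())
        log_probs.append(logp)
    returns = torch.stack(rewards).sum(dim=0)           # (B,)
    advantage = returns - returns.mean()                 # simple mean baseline
    loss = -(advantage * torch.stack(log_probs).sum(dim=0)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    actor, wm = ToyActor(), ToyWorldModel()
    opt = torch.optim.AdamW(actor.parameters(), lr=1e-4)
    reward_fn = lambda f, a: -a.pow(2).sum(-1)            # placeholder reward: prefer small actions
    print(rft_step(actor, wm, reward_fn, torch.randn(8, 512), opt))
```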


Models citing this paper 10


Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 21