arxiv:2509.09372

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Published on Sep 11 · Submitted by Siteng Huang on Sep 12 · #1 Paper of the day
Authors:

Abstract

AI-generated summary: VLA-Adapter reduces reliance on large-scale VLMs and extensive pre-training by using a lightweight Policy module with Bridge Attention, achieving state-of-the-art performance and fast inference with minimal computational resources.

Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art performance but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed bridging paradigm, VLA-Adapter enables training a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying VLA models. Project page: https://vla-adapter.github.io/.
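
To make the bridging idea concrete, here is a minimal PyTorch sketch of a policy head in this spirit: learnable action queries attend to vision-language features from a small backbone through gated cross-attention before being decoded into an action chunk. This is an illustration under assumptions, not the authors' implementation: the module names (`BridgeAttentionBlock`, `TinyPolicy`), dimensions, gating scheme, and action-chunk size are all hypothetical.

```python
# Minimal sketch (not the authors' implementation) of a lightweight policy head that
# cross-attends learnable action queries to VL backbone features, with a learnable
# gate controlling how strongly each condition is injected into the action space.
import torch
import torch.nn as nn


class BridgeAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # learnable injection strength
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, action_queries, vl_condition):
        # Self-attention over the action queries.
        x = action_queries
        q = self.norm1(x)
        x = x + self.self_attn(q, q, q)[0]
        # Gated cross-attention: inject the VL condition into the action queries.
        x = x + torch.tanh(self.gate) * self.cross_attn(self.norm2(x), vl_condition, vl_condition)[0]
        return x + self.mlp(self.norm3(x))


class TinyPolicy(nn.Module):
    """Maps VL features from a small backbone to a chunk of low-level actions."""

    def __init__(self, d_model: int = 512, n_queries: int = 8, n_layers: int = 4, action_dim: int = 7):
        super().__init__()
        self.action_queries = nn.Parameter(torch.randn(1, n_queries, d_model) * 0.02)
        self.blocks = nn.ModuleList([BridgeAttentionBlock(d_model) for _ in range(n_layers)])
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, vl_condition):                       # vl_condition: (B, T, d_model)
        x = self.action_queries.expand(vl_condition.size(0), -1, -1)
        for blk in self.blocks:
            x = blk(x, vl_condition)
        return self.action_head(x)                         # (B, n_queries, action_dim)


if __name__ == "__main__":
    policy = TinyPolicy()
    vl_feats = torch.randn(2, 64, 512)                     # stand-in for backbone features
    print(policy(vl_feats).shape)                          # torch.Size([2, 8, 7])
```

In this sketch the gate is initialized to zero so the injected condition starts as a no-op and its strength is learned during training; that choice is ours for illustration, not a claim about the paper's design.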

Community


💖 Thank you for your continued interest in VLA-Adapter!
🧑‍🏫We've provided a comprehensive README.md on 😺 GitHub, which should help you run training and inference successfully.
❓️If you encounter any issues while reproducing our results, please file a GitHub issue or join our WeChat group; the group QR code is included in the linked issue. 👉 https://github.com/OpenHelix-Team/VLA-Adapter/issues/1
🌟In addition, our VLA-RFT paper, built on VLA-Adapter, has been released. It uses reinforcement learning to significantly improve the VLA model's robustness to perturbations by interacting with a world model, using only minimal samples and iterations (a generic, illustrative sketch of this style of fine-tuning appears after the links below).
📃Paper: https://arxiv.org/abs/2510.00406
🌏️Project page: https://vla-rft.github.io/
😺GitHub: https://github.com/OpenHelix-Team/VLA-RFT
😊HuggingFace: https://huggingface.co/VLA-RFT
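
For readers curious what this style of training can look like mechanically, below is a heavily simplified PyTorch sketch of reinforcement fine-tuning against a learned world model. It is not the VLA-RFT algorithm: the toy actor, toy world model, placeholder reward, and all hyperparameters are assumptions made purely for illustration; see the VLA-RFT paper and repository above for the actual method.

```python
# Illustrative sketch only, not the VLA-RFT algorithm: reinforcement fine-tuning of a
# pretrained policy by imagining short rollouts inside a learned world model and
# applying a REINFORCE-with-baseline update. ToyActor, ToyWorldModel, reward_fn, and
# every hyperparameter below are hypothetical placeholders, not names from the paper.
import torch
import torch.nn as nn


class ToyActor(nn.Module):
    """Gaussian policy over a 7-DoF action, conditioned on a feature vector."""

    def __init__(self, feat_dim: int = 512, action_dim: int = 7):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(feat_dim, 256), nn.GELU(),
                                  nn.Linear(256, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, feats):
        dist = torch.distributions.Normal(self.mean(feats), self.log_std.exp())
        action = dist.rsample()
        return action, dist.log_prob(action).sum(-1)


class ToyWorldModel(nn.Module):
    """Predicts next feature vector from (features, action); assumed pretrained."""

    def __init__(self, feat_dim: int = 512, action_dim: int = 7):
        super().__init__()
        self.net = nn.Linear(feat_dim + action_dim, feat_dim)

    def forward(self, feats, action):
        return self.net(torch.cat([feats, action], dim=-1))


def rft_step(actor, world_model, reward_fn, feats, optimizer, horizon: int = 4):
    """One step: imagine a short rollout in the world model, score it, and increase
    the log-probability of above-average-reward trajectories."""
    log_probs, rewards = [], []
    for _ in range(horizon):
        action, logp = actor(feats)
        feats = world_model(feats, action).detach()    # imagined next observation features
        rewards.append(reward_fn(feats, action).detach())
        log_probs.append(logp)
    returns = torch.stack(rewards).sum(dim=0)           # (B,)
    advantage = returns - returns.mean()                 # simple mean baseline
    loss = -(advantage * torch.stack(log_probs).sum(dim=0)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    actor, wm = ToyActor(), ToyWorldModel()
    opt = torch.optim.AdamW(actor.parameters(), lr=1e-4)
    reward_fn = lambda f, a: -a.pow(2).sum(-1)            # placeholder reward: prefer small actions
    print(rft_step(actor, wm, reward_fn, torch.randn(8, 512), opt))
```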


Models citing this paper 10


Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 21