Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust
We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme for vision-language-action (VLA) models that improves baseline performance in the presence of task-irrelevant distractor objects and backgrounds without finetuning or access to the model's weights.
Approach
BYOVLA is predicated on three simple steps applied to a VLA's input image: 1) determine task-irrelevant regions, 2) quantify sensitivity by perturbing those regions, and 3) transform the image. Task-irrelevant regions are determined by a vision-language model (VLM). If the VLA is sensitive to a task-irrelevant region, BYOVLA transforms the image so that the model is no longer sensitive, simply by removing distractor objects and recoloring background distractions.
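The loop below is a minimal sketch of these three steps under assumed interfaces: vla_policy, vlm_segmenter, perturb_region, and inpaint_region are hypothetical callables, and the sensitivity threshold is a placeholder, not the authors' released implementation.

import numpy as np

def byovla_intervene(image, instruction, vla_policy, vlm_segmenter,
                     perturb_region, inpaint_region, sensitivity_threshold=0.1):
    """Return a transformed observation for the VLA (assumed interfaces):
      vla_policy(image, instruction)   -> action vector (array-like)
      vlm_segmenter(image, instruction) -> list of binary masks over task-irrelevant regions
      perturb_region(image, mask)      -> copy of image with that region perturbed
      inpaint_region(image, mask)      -> copy of image with that region removed/recolored
    """
    # Step 1: a VLM proposes regions unrelated to the language instruction.
    irrelevant_masks = vlm_segmenter(image, instruction)

    # Baseline action from the unmodified observation.
    nominal_action = np.asarray(vla_policy(image, instruction))

    transformed = image.copy()
    for mask in irrelevant_masks:
        # Step 2: perturb the region and measure how much the predicted action changes.
        perturbed_action = np.asarray(vla_policy(perturb_region(image, mask), instruction))
        sensitivity = np.linalg.norm(perturbed_action - nominal_action)

        # Step 3: if the VLA is sensitive to this task-irrelevant region,
        # remove the object / recolor the background there before acting.
        if sensitivity > sensitivity_threshold:
            transformed = inpaint_region(transformed, mask)

    return transformed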
Experiments
We demonstrate BYOVLA on two popular open-source VLA models, Octo-Base and OpenVLA-7b. BYOVLA enables greater task-success rates when task-irrelevant distractions are present. All videos are shown at 10x speed.
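As a usage sketch, the loop below shows where the intervention sits at run time: the image is transformed at every control step before the VLA predicts an action, so the underlying policy is used as a black box with no finetuning or weight access. The Gym-style env interface is an assumption, and intervene stands for the byovla_intervene sketch above with its helper callables already bound (e.g., via functools.partial).

def run_episode(env, vla_policy, intervene, instruction, max_steps=200):
    # Hypothetical episode loop; `env` and `vla_policy` are assumed interfaces,
    # not the released Octo-Base or OpenVLA-7b APIs.
    obs = env.reset()
    for _ in range(max_steps):
        # Clean the observation before the VLA ever sees it.
        clean_image = intervene(obs["image"], instruction, vla_policy)
        action = vla_policy(clean_image, instruction)
        obs, reward, done, info = env.step(action)
        if done:
            break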
Results
We evaluate BYOVLA on Octo-Base with the task "place the carrot on yellow plate." With distractions present, the nominal VLA policy's performance drops significantly. Augmenting Octo-Base with BYOVLA allows the policy to (nearly) achieve its nominal task-success rate with distractions present.
We also evaluate BYOVLA on OpenVLA-7b with the task "put the eggplant in the pot." Even though OpenVLA is 75x larger than Octo-Base, task-irrelevant distractions markedly degrade performance. We find that application of BYOVLA helps OpenVLA achieve its nominal task-success rate with distractions present.
BibTeX
@article{hancock24byovla,
  title={Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust},
  author={Hancock, Asher J. and Ren, Allen Z. and Majumdar, Anirudha},
  journal={arXiv preprint arXiv:2410.01971},
  year={2024},
}