Run-time Observation Interventions
Make Vision-Language-Action Models More Visually Robust



We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme for vision-language-action (VLA) models that improves baseline performance in the presence of task-irrelevant distractor objects and backgrounds without finetuning or access to the model's weights.


Approach


BYOVLA is predicated on three simple steps applied to a VLA's input image: 1) determine task-irrelevant regions, 2) quantify the VLA's sensitivity to those regions by perturbing them, and 3) transform the image. Task-irrelevant regions are identified by a vision-language model (VLM). If the VLA is sensitive to a task-irrelevant region, BYOVLA transforms the image so that the model is no longer sensitive, simply by removing distractor objects and recoloring distracting backgrounds. A sketch of this loop is shown below.
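
The following is a minimal, hypothetical sketch of such a run-time intervention loop. The helper callables (`vlm_irrelevant_regions`, `inpaint_region`, `vla_policy`) are placeholder assumptions standing in for the VLM segmenter, the image-editing step, and the frozen VLA policy; the Gaussian pixel perturbation and the L2 action-distance threshold are illustrative choices, not the exact ones used in the paper.

```python
"""Sketch of a BYOVLA-style run-time observation intervention (illustrative assumptions only)."""
import numpy as np


def action_sensitivity(vla_policy, image, mask, n_samples=5, noise_std=25.0, rng=None):
    """Perturb only the masked region and measure how much the VLA's action shifts."""
    rng = rng or np.random.default_rng(0)
    baseline = np.asarray(vla_policy(image), dtype=np.float64)
    shifts = []
    for _ in range(n_samples):
        perturbed = image.astype(np.float64).copy()
        noise = rng.normal(0.0, noise_std, size=image.shape)
        perturbed[mask] += noise[mask]                   # perturb pixels inside the region only
        perturbed = np.clip(perturbed, 0, 255).astype(image.dtype)
        action = np.asarray(vla_policy(perturbed), dtype=np.float64)
        shifts.append(np.linalg.norm(action - baseline))  # change in the predicted action
    return float(np.mean(shifts))


def byovla_intervene(image, instruction, vla_policy, vlm_irrelevant_regions,
                     inpaint_region, threshold=0.05):
    """
    1) Ask a VLM for task-irrelevant regions (boolean masks).
    2) Keep only the regions the VLA is actually sensitive to.
    3) Transform the image (e.g., inpaint distractors, recolor backgrounds) before acting.
    """
    edited = image
    for mask in vlm_irrelevant_regions(image, instruction):
        if action_sensitivity(vla_policy, edited, mask) > threshold:
            edited = inpaint_region(edited, mask)        # remove the distraction from the observation
    return vla_policy(edited)                            # act on the cleaned observation
```

Because the intervention only edits the observation, the VLA itself is treated as a black box: no gradients, finetuning, or weight access are required.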


Experiments

We demonstrate BYOVLA on two popular open-source VLA models, Octo-Base and OpenVLA-7b. BYOVLA enables higher task-success rates when task-irrelevant distractions are present. All videos are played at 10x speed.



Results

We evaluate BYOVLA on Octo-Base with the task "place the carrot on yellow plate." With distractions present, the nominal VLA policy's performance drops significantly. Augmenting Octo-Base with BYOVLA allows the policy to (nearly) recover its nominal task-success rate despite the distractions.



We also evaluate BYOVLA on OpenVLA-7b with the task "put the eggplant in the pot." Even though OpenVLA is 75x larger than Octo-Base, task-irrelevant distractions markedly degrade its performance. We find that applying BYOVLA helps OpenVLA achieve its nominal task-success rate with distractions present.