Run-time Observation Interventions
Make Vision-Language-Action Models More Visually Robust



We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme for vision-language-action (VLA) models that improves baseline performance in the presence of task-irrelevant distractor objects and backgrounds without finetuning or access to the model's weights.


Approach


BYOVLA is predicated on three simple steps applied to a VLA's input image: 1) determine task-irrelevant regions, 2) quantify the VLA's sensitivity to those regions by perturbing them, and 3) transform the image. Task-irrelevant regions are identified by a vision-language model (VLM). If the VLA is sensitive to a task-irrelevant region, BYOVLA transforms the image so that the model is no longer sensitive, simply by removing distractor objects and recoloring distracting backgrounds. A sketch of this loop is shown below.
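
The following is a minimal, hypothetical sketch of such a run-time intervention loop. The helper callables (`vlm_irrelevant_regions`, `inpaint_region`, `vla_policy`) are placeholder assumptions standing in for the VLM segmenter, the image-editing step, and the frozen VLA policy; the Gaussian pixel perturbation and the L2 action-distance threshold are illustrative choices, not the exact ones used in the paper.

```python
"""Sketch of a BYOVLA-style run-time observation intervention (illustrative assumptions only)."""
import numpy as np


def action_sensitivity(vla_policy, image, mask, n_samples=5, noise_std=25.0, rng=None):
    """Perturb only the masked region and measure how much the VLA's action shifts."""
    rng = rng or np.random.default_rng(0)
    baseline = np.asarray(vla_policy(image), dtype=np.float64)
    shifts = []
    for _ in range(n_samples):
        perturbed = image.astype(np.float64).copy()
        noise = rng.normal(0.0, noise_std, size=image.shape)
        perturbed[mask] += noise[mask]                   # perturb pixels inside the region only
        perturbed = np.clip(perturbed, 0, 255).astype(image.dtype)
        action = np.asarray(vla_policy(perturbed), dtype=np.float64)
        shifts.append(np.linalg.norm(action - baseline))  # change in the predicted action
    return float(np.mean(shifts))


def byovla_intervene(image, instruction, vla_policy, vlm_irrelevant_regions,
                     inpaint_region, threshold=0.05):
    """
    1) Ask a VLM for task-irrelevant regions (boolean masks).
    2) Keep only the regions the VLA is actually sensitive to.
    3) Transform the image (e.g., inpaint distractors, recolor backgrounds) before acting.
    """
    edited = image
    for mask in vlm_irrelevant_regions(image, instruction):
        if action_sensitivity(vla_policy, edited, mask) > threshold:
            edited = inpaint_region(edited, mask)        # remove the distraction from the observation
    return vla_policy(edited)                            # act on the cleaned observation
```

Because the intervention only edits the observation, the VLA itself is treated as a black box: no gradients, finetuning, or weight access are required.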


Experiments

We demonstrate BYOVLA on two popular open-source VLA models, Octo-Base and OpenVLA-7b. BYOVLA enables higher task-success rates when task-irrelevant distractions are present. All videos are played at 10x speed.



Results

We evaluate BYOVLA on Octo-Base with the task "place the carrot on yellow plate." With distractions present, the nominal VLA policy's performance drops significantly. Augmenting Octo-Base with BYOVLA allows the policy to (nearly) recover its nominal task-success rate despite the distractions.



We also evaluate BYOVLA on OpenVLA-7b with the task "put the eggplant in the pot." Even though OpenVLA is 75x larger than Octo-Base, task-irrelevant distractions markedly degrade its performance. We find that applying BYOVLA helps OpenVLA achieve its nominal task-success rate with distractions present.