Viewpoint Matters: Dynamically Optimizing Viewpoints with Masked Autoencoder for Visual Manipulation


Abstract

Robotic manipulation remains challenging, and imitation learning (IL) offers a way for robots to learn tasks from expert demonstrations. Current IL methods typically rely on fixed camera setups in which cameras are manually placed at static locations, which limits adaptability and coverage. Inspired by human active perception, in which people adjust their viewpoint to capture the most relevant and least noisy information, we propose MAE-Select, a novel framework for active viewpoint selection in single-camera robotic systems. MAE-Select leverages pre-trained multi-view masked autoencoder representations and dynamically selects the next most informative viewpoint at each time chunk, without requiring labeled viewpoints. Extensive experiments demonstrate that MAE-Select improves the capabilities of single-camera systems and, in some cases, even surpasses multi-camera setups.

Overview of MAE-Select

Pipeline
Illustration of our proposed method. Left: the pre-training stage of the multi-view masked autoencoder with joint embeddings. Middle: the training process of our framework using imitation learning. Right: how the framework operates during inference.
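The inference stage can be pictured as a loop in which a single camera observes from one viewpoint per time chunk, and the policy returns both an action chunk and the viewpoint to use next. The following is only a minimal sketch of that control flow under stated assumptions, not the MAE-Select implementation: `run_episode`, `make_toy_policy`, the tuple standing in for an image, and the random viewpoint choice are all hypothetical placeholders introduced for illustration.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical placeholder types; a real system would use images and robot actions.
Observation = Tuple[int, int]  # (viewpoint index, chunk index) stands in for a camera image
Action = float
Policy = Callable[[Observation], Tuple[List[Action], int]]


def run_episode(
    policy: Policy, num_views: int, num_chunks: int, start_view: int = 0
) -> Tuple[List[Action], List[int]]:
    """Roll out a policy that observes from one viewpoint per time chunk."""
    view = start_view
    actions: List[Action] = []
    views_used: List[int] = []
    for chunk in range(num_chunks):
        # Single-camera setting: only the currently selected viewpoint is observed.
        obs: Observation = (view, chunk)
        action_chunk, next_view = policy(obs)
        if not 0 <= next_view < num_views:
            raise ValueError(f"invalid viewpoint index: {next_view}")
        actions.extend(action_chunk)
        views_used.append(view)
        # The predicted viewpoint is used for the next chunk.
        view = next_view
    return actions, views_used


def make_toy_policy(num_views: int, chunk_len: int, seed: int = 0) -> Policy:
    """Stand-in policy: zero actions and a random next viewpoint (assumption)."""
    rng = random.Random(seed)

    def policy(obs: Observation) -> Tuple[List[Action], int]:
        action_chunk = [0.0] * chunk_len
        next_view = rng.randrange(num_views)
        return action_chunk, next_view

    return policy


if __name__ == "__main__":
    actions, views = run_episode(make_toy_policy(4, 8), num_views=4, num_chunks=5)
    print(len(actions), views)
```

In this sketch the viewpoint changes only at chunk boundaries, which mirrors the per-chunk selection described above; how the actual framework scores or predicts viewpoints is not specified here.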

Viewpoint Settings

Task
Viewpoint settings for each robotic task, showing the camera viewpoints used to evaluate performance in simulation and real-world scenarios.

Visualizations

Simulation

Bimanual Insertion

Put Box In Bin

Put Box In Bin with Disturbance

Put Box In Cabinet

Phone On Base

Pick Up Cup

Take Umbrella Out Of Stand

Unplug Charger

Real World

Put Eggplant To Bowl

Put Eggplant To Bowl with Disturbance

Put Bitter Melon In Cabinet

  • Real-world experiments show that MAE-Select is robust to camera perturbations and delivers accurate results even in challenging scenarios.
  • Camera perturbations may be due to diffusion processes, which introduce noise or distortions during action acquisition.