Robotic manipulation remains a challenging problem, and imitation learning (IL) enables robots to learn manipulation tasks from expert demonstrations. Current IL methods typically rely on fixed camera setups with manually positioned, static cameras, which limits adaptability and scene coverage. Inspired by human active perception, in which the viewpoint is continually adjusted to capture the most relevant and least noisy information, we propose MAE-Select, a novel framework for active viewpoint selection in single-camera robotic systems. MAE-Select leverages pre-trained multi-view masked autoencoder representations and dynamically selects the next most informative viewpoint at each time chunk without requiring labeled viewpoints. Extensive experiments demonstrate that MAE-Select improves the performance of single-camera systems and, in some cases, even surpasses multi-camera setups.
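
To make the selection step concrete, the sketch below shows one way such a per-chunk viewpoint choice could be organized: each candidate view is embedded with a frozen, pre-trained multi-view MAE encoder, a small head assigns an informativeness score, and the highest-scoring view is used for the next time chunk. This is a minimal illustration under stated assumptions; the names `mae_encoder`, `score_head`, and the scoring criterion are placeholders and do not reflect the paper's actual architecture.

```python
import torch

def select_viewpoint(mae_encoder, score_head, candidate_views):
    """Pick the most informative viewpoint for the next time chunk.

    mae_encoder     -- frozen, pre-trained multi-view MAE encoder (assumed to
                       return one pooled embedding per view)
    score_head      -- small learned head mapping an embedding to a scalar
                       informativeness score (hypothetical placeholder)
    candidate_views -- tensor of shape (V, C, H, W), one image per candidate view
    """
    with torch.no_grad():
        feats = mae_encoder(candidate_views)    # (V, D) pooled view embeddings
    scores = score_head(feats).squeeze(-1)      # (V,) score per candidate view
    return int(scores.argmax())                 # index of the selected viewpoint
```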