Yankai Fu1,2*   Ning Chen1,2*   Junkai Zhao2*†   Shaozhe Shan1  
Guocai Yao2   Pengwei Wang2   Zhongyuan Wang2   Shanghang Zhang1,2✉  
1State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University;
2Beijing Academy of Artificial Intelligence  
* Co-first authors     † Project leader     ✉ Corresponding author

Abstract

Building a generalist robot that can perceive, reason, and act across diverse tasks remains an open challenge, especially for dexterous manipulation. A major bottleneck lies in the scarcity of large-scale, action-annotated data for dexterous skills, as teleoperation is difficult and costly. Human data, with its vast scale and diverse manipulation behaviors, provides rich priors for learning robotic actions. While prior works have explored leveraging human demonstrations, they are often constrained by limited scenarios and a large visual gap between humans and robots. To overcome these limitations, we propose METIS, a vision-language-action (VLA) model for dexterous manipulation pretrained on multi-source egocentric datasets. We first construct EgoAtlas, which integrates large-scale human and robotic data from multiple sources, all unified under a consistent action space. We further extract motion-aware dynamics, a compact and discretized motion representation, which provides efficient and expressive supervision for VLA training. Built upon these components, METIS integrates reasoning and acting into a unified framework, enabling effective deployment on downstream dexterous manipulation tasks. Our method demonstrates exceptional dexterous manipulation capabilities, achieving the highest average success rate across six real-world tasks. Experimental results also highlight its superior generalization and robustness in out-of-distribution scenarios. These findings establish METIS as a promising step toward a generalist model for dexterous manipulation.

EgoAtlas

Multi-Source Egocentric Manipulation Dataset

We first construct EgoAtlas, which integrates large-scale human and robotic data from multiple sources, all unified under a consistent action space.
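
The sketch below shows, under assumed field names and a hypothetical normalize_sample helper, how a sample from any of the data sources listed below might be brought into one shared schema; it is an illustration only, not the released EgoAtlas format.

# Minimal sketch (assumed schema, not the released EgoAtlas format) of how
# heterogeneous egocentric samples could be normalized before pretraining.
from dataclasses import dataclass
import numpy as np

@dataclass
class EgoAtlasSample:
    image: np.ndarray    # egocentric RGB frame, H x W x 3
    instruction: str     # language description of the manipulation
    action: np.ndarray   # action in the unified action space (see below)
    source: str          # e.g. "EgoDex", "PH2D-Robot", "Self-collected"

def normalize_sample(raw: dict, source: str, is_human: bool) -> EgoAtlasSample:
    """Convert a raw sample from any source into the shared schema (hypothetical helper)."""
    if is_human:
        # Human sources: hand motion retargeted into the unified action space
        # (see the sketch in the Unified Action Space section below).
        action = raw["retargeted_action"]
    else:
        # Robot sources: proprioceptive actions already live in that space.
        action = raw["robot_action"]
    return EgoAtlasSample(
        image=raw["rgb"],
        instruction=raw["text"],
        action=np.asarray(action, dtype=np.float32),
        source=source,
    )

Once samples share such a schema, everything downstream can treat human and robot data identically.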

Data sources: EgoDex, H2O, ARCTIC, HoloAssist, OakInk, PH2D-Human, Self-collected, ActionNet, and PH2D-Robot.

Unified Action Space

We construct a unified action space that bridges the gap between human and robot motion representations.
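
As one illustrative parameterization, and not necessarily the exact METIS action definition, both a human hand and a dexterous robot hand can be described by a wrist pose plus a fixed set of finger joint angles, so demonstrations from either side land in the same vector; the dimensions and the 6D rotation encoding below are assumptions.

# Hypothetical sketch: a shared "wrist pose + finger joints" action vector.
import numpy as np

WRIST_POS_DIM, WRIST_ROT_DIM, FINGER_DIM = 3, 6, 12   # assumed dimensions

def pack_unified_action(wrist_pos, wrist_rot_6d, finger_joints) -> np.ndarray:
    """Concatenate wrist translation, 6D wrist rotation, and finger joint angles."""
    action = np.concatenate([wrist_pos, wrist_rot_6d, finger_joints]).astype(np.float32)
    assert action.shape == (WRIST_POS_DIM + WRIST_ROT_DIM + FINGER_DIM,)
    return action

identity_rot_6d = np.array([1, 0, 0, 0, 1, 0], dtype=np.float32)  # first two rows of a rotation matrix

# Human side: wrist pose and finger angles come from retargeting hand keypoints
# (the retargeting step itself is omitted here).
human_action = pack_unified_action(np.zeros(3), identity_rot_6d, np.zeros(FINGER_DIM))

# Robot side: the end-effector pose and hand joint encoders provide the same quantities directly.
robot_action = pack_unified_action(np.zeros(3), identity_rot_6d, np.zeros(FINGER_DIM))

The practical effect is that the policy consumes a single action format, regardless of whether a demonstration originally came from a person or a robot.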

Robot Demos (4×)

The policy performs inference on the server and communicates with the robot hardware, which inevitably introduces communication delay.
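
For readers curious what such a remote-inference loop looks like, here is a minimal, hypothetical sketch; the helper functions, control rate, and delay value are placeholders rather than the actual METIS deployment code.

# Hypothetical sketch of the remote-inference control loop described above.
import time
import numpy as np

CONTROL_HZ = 30   # assumed control rate

def query_policy_server(observation: np.ndarray) -> np.ndarray:
    """Placeholder for the network round trip: upload an observation, download an action."""
    time.sleep(0.1)                          # stand-in for inference plus communication delay
    return np.zeros(21, dtype=np.float32)    # stand-in action in the unified action space

def get_observation() -> np.ndarray:
    """Placeholder camera / proprioception read."""
    return np.zeros((224, 224, 3), dtype=np.float32)

def send_to_robot(action: np.ndarray) -> None:
    """Placeholder hardware command."""
    pass

for _ in range(10):                          # a few control cycles, for illustration only
    obs = get_observation()
    action = query_policy_server(obs)        # the robot waits here; this wait is the visible delay
    send_to_robot(action)
    time.sleep(1.0 / CONTROL_HZ)

Asynchronous prefetching or action chunking can partly hide this round-trip latency, but as noted above some delay is unavoidable.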

Short Horizon Tasks

Pick and Place

Close Laptop

Open Drawer



Long Horizon Tasks

Grasp Two Drinks into Basket

Put Cola into Basket

Open Drawer and Put Bread

Instruction Following

Grasp the red apple on the plate

Grasp the green lemon on the plate

Grasp the yellow orange on the plate

Generalization

Unseen background

Unseen lighting condition


Unseen object

Cluttered scene


Method

Method Overview

We propose motion-aware dynamics, a compact and discretized representation designed for dexterous manipulation. It captures both visual and motion dynamics, providing efficient and expressive supervision for training VLA models. Built upon this representation, METIS is pretrained on EgoAtlas, unifying reasoning and acting within a single framework.
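
As a loose illustration of the motion side only (motion-aware dynamics also capture visual dynamics, and the uniform binning and vocabulary size below are assumptions rather than the actual tokenizer), frame-to-frame action deltas can be discretized into a small token vocabulary:

# Hypothetical sketch: uniform binning of action deltas into discrete motion tokens.
import numpy as np

NUM_BINS = 256   # assumed per-dimension vocabulary size

def motion_tokens(actions: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Discretize frame-to-frame action deltas into integer tokens in [0, NUM_BINS - 1]."""
    deltas = np.diff(actions, axis=0, prepend=actions[:1])   # per-step motion, first step is zero
    clipped = np.clip(deltas, low, high)
    return np.round((clipped - low) / (high - low) * (NUM_BINS - 1)).astype(np.int64)

def detokenize(tokens: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Map tokens back to approximate continuous deltas (quantization error is bounded)."""
    return tokens.astype(np.float32) / (NUM_BINS - 1) * (high - low) + low

# Usage: a trajectory of 16 unified 21-D actions becomes a 16 x 21 grid of motion tokens.
trajectory = np.random.uniform(-1.0, 1.0, size=(16, 21)).astype(np.float32)
tokens = motion_tokens(trajectory)
recovered = detokenize(tokens)

Discrete tokens of this kind can be predicted with the same cross-entropy objective used for language tokens, which is one reason discretized motion makes convenient supervision for a VLA backbone.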

Experiments

METIS Performance

METIS demonstrates exceptional dexterous manipulation capabilities, achieving the highest average success rate across six real-world tasks. Experimental results also highlight its superior generalization and robustness in out-of-distribution scenarios.

BibTeX


@article{fu2025metis,
    title={METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model},
    author={Fu, Yankai and Chen, Ning and Zhao, Junkai and Shan, Shaozhe and Yao, Guocai and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang},
    journal={arXiv preprint arXiv:2511.xxxxx},
    year={2025}
}