CordViP: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World

Yankai Fu1,2†, Qiuxuan Feng1,3†, Ning Chen1†, Zichen Zhou4, Mengzhen Liu1,
Mingdong Wu1, Tianxing Chen5, Shanyu Rong1, Jiaming Liu1, Hao Dong1, Shanghang Zhang1,6✉

1School of Computer Science, Peking University, 2Wuhan University, 3Tianjin University,
4Beijing Institute of Technology, 5The University of Hong Kong, 6Beijing Academy of Artificial Intelligence (BAAI)

† Equal contribution, ✉ Corresponding author

Abstract

Achieving human-level dexterity in robots is a key objective in the field of robotic manipulation. Recent advancements in 3D-based imitation learning have shown promising results, providing an effective pathway to achieve this goal. However, obtaining high-quality 3D representations presents two key problems: (1) the quality of point clouds captured by a single-view camera is significantly affected by factors such as camera resolution, positioning, and occlusions caused by the dexterous hand; (2) the global point clouds lack crucial contact information and spatial correspondences, which are necessary for fine-grained dexterous manipulation tasks. To address these limitations, we propose CordViP, a novel framework that constructs and learns correspondences by leveraging the robust 6D pose estimation of objects and robot proprioception. Specifically, we first introduce the interaction-aware point clouds, which establish correspondences between the object and the hand. These point clouds are then used for our pretraining strategy, where we also incorporate object-centric contact maps and hand-arm coordination information, effectively capturing both spatial and temporal dynamics. Our method demonstrates exceptional dexterous manipulation capabilities with an average success rate of 90% in four real-world tasks, surpassing other baselines by a large margin. Experimental results also highlight the superior generalization and robustness of CordViP to different objects, viewpoints, and scenarios.

Overview

We propose CordViP, a correspondence-based visuomotor policy for dexterous manipulation in the real world. (a) Left: We present the interaction-aware point clouds, which demonstrate robustness to different viewpoints while establishing correspondences between the object and the hand. (b) Right: Our method achieves promising results across four real-world dexterous manipulation tasks, showcasing exceptional generalization capabilities.

Framework

(a) We first employ TripoSR to generate the initial object point cloud and FoundationPose to estimate the 6D pose of the object. In parallel, the hand point cloud is generated based on the robot's state. They are combined to construct interaction-aware point clouds, which demonstrate robustness to viewpoint variations. (b) During the pre-training phase, the generated point cloud data, combined with the robot’s proprioceptive information, is utilized to enhance spatial understanding and interaction modeling. (c) The pre-trained encoder is subsequently integrated into an imitation learning framework to facilitate downstream tasks in dexterous manipulation.
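Step (a) above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the canonical object point cloud, the estimated 6D pose (given here as a rotation matrix and a translation vector), and the hand point cloud are already available and expressed in a common frame. All function names, the per-point label, and the exponential contact-map formula with its `scale` parameter are hypothetical choices made for illustration only.

```python
import numpy as np


def transform_points(points, rotation, translation):
    """Apply a rigid pose (R, t) to an (N, 3) point cloud: p' = R p + t."""
    return points @ rotation.T + translation


def build_interaction_aware_cloud(object_points, rotation, translation, hand_points):
    """Merge the posed object cloud with the hand cloud (hypothetical helper).

    Returns the stacked (N_obj + N_hand, 3) cloud and a per-point label,
    0 for object points and 1 for hand points.
    """
    object_world = transform_points(object_points, rotation, translation)
    cloud = np.concatenate([object_world, hand_points], axis=0)
    labels = np.concatenate([np.zeros(len(object_world)), np.ones(len(hand_points))])
    return cloud, labels


def contact_map(object_world, hand_points, scale=0.05):
    """Object-centric contact score in (0, 1] (assumed formula, not from the source).

    Each object point is scored by exp(-d / scale), where d is the distance
    to its nearest hand point, so touching points score 1.
    """
    diffs = object_world[:, None, :] - hand_points[None, :, :]
    nearest = np.linalg.norm(diffs, axis=-1).min(axis=1)
    return np.exp(-nearest / scale)
```

Because both clouds in this sketch are derived from the pose estimate and the robot state rather than from raw depth pixels, the merged cloud does not depend on where the camera is placed, which is consistent with the viewpoint robustness described in the caption.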

Real Robot System

Our system consists of a Leap Hand and a UR5 Arm, with a fixed RealSense L515 camera employed to capture visual observations. The RealSense D435 camera is used only for data collection during teleoperation and is not involved in policy learning.

Task1: Pick and Place

Task2: Flip Cup

Task3: Assembly

Task4: Articulated Manipulation

Effectiveness and Efficiency

Our method demonstrates exceptional dexterous manipulation capabilities with an average success rate of 90% in four real-world tasks, surpassing other baselines by a large margin. CordViP is also data-efficient, achieving higher accuracy with fewer demonstrations.

Generalization to Different Lighting Conditions

Ours
DP
ACT+3D

Generalization to Different Scenarios

Ours
DP
ACT+3D

Generalization to Unseen Objects

Ours
DP
ACT+3D

Generalization to Different Viewpoints

Ours
DP
ACT+3D

BibTeX

@article{fu2025cordvip,
    title={CordViP: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World},
    author={Fu, Yankai and Feng, Qiuxuan and Chen, Ning and Zhou, Zichen and Liu, Mengzhen and Wu, Mingdong and Chen, Tianxing and Rong, Shanyu and Liu, Jiaming and Dong, Hao and others},
    journal={arXiv preprint arXiv:2502.08449},
    year={2025}
}

If you have any questions, please contact us at yankaifu.aur@gmail.com.