TeleOpBench:

A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation


Abstract: Teleoperation is a cornerstone of embodied-robot learning, and bimanual dexterous teleoperation in particular provides rich demonstrations that are difficult to obtain with fully autonomous systems. While recent studies have proposed diverse hardware pipelines—ranging from inertial motion-capture gloves to exoskeletons and vision-based interfaces—there is still no unified benchmark that enables fair, reproducible comparison of these systems. In this paper, we introduce TeleOpBench, a simulator-centric benchmark tailored to bimanual dexterous teleoperation. TeleOpBench contains 30 high-fidelity task environments that span pick-and-place, tool use, and collaborative manipulation, covering a broad spectrum of kinematic and force-interaction difficulty. Within this benchmark we implement four representative teleoperation modalities—(i) MoCap, (ii) VR device, (iii) arm-hand exoskeletons, and (iv) monocular vision tracking—and evaluate them with a common protocol and metric suite. To validate that performance in simulation is predictive of real-world behavior, we conduct mirrored experiments on a physical dual-arm platform equipped with two 6-DoF dexterous hands. Across 10 held-out tasks we observe a strong correlation between simulator and hardware performance, confirming the external validity of TeleOpBench. TeleOpBench establishes a common yardstick for teleoperation research and provides an extensible platform for future algorithmic and hardware innovation.



Overview Video

TeleOpBench

teaser

We present TeleOpBench, a simulation-based benchmark for bimanual dexterous teleoperation, and evaluate four representative teleoperation modalities across multiple robot platforms (row 1). Real-robot experiments (row 2) demonstrate four teleoperation capabilities. Our teleoperation pipelines support fine-precision manipulation in the real world—for example, the left hand grasps a block while the right hand simultaneously inserts a smaller block (row 3)—and can execute long-horizon sequences, such as retrieving a tomato-laden plate from a microwave with the right hand and transferring the tomatoes to a table with the left (rows 4 and 5).

This paper makes the following contributions:

  1. We introduce a dedicated benchmark, TeleOpBench, for dual-arm dexterous teleoperation, enabling rigorous, fair, and comprehensive comparisons across competing systems.
  2. We implement four representative teleoperation pipelines (motion capture, VR controllers, upper-body exoskeletons, and vision-only tracking) within a single modular framework.
  3. Extensive experiments on both TeleOpBench and a real dual-arm platform reveal a strong correlation between simulated and physical performance, substantiating the benchmark's fidelity and practical value.

Unitree G1

Fourier GR1-T2

Unitree H1-2

Task Environments

We employ NVIDIA Isaac Sim as our simulation platform because its high-performance PhysX engine and photorealistic renderer enable the construction of environments that closely approximate real-world conditions. Each scene features a humanoid robot fitted with bimanual dexterous hands and the task-relevant objects; operators are instructed to execute the required manipulations exactly as specified. For every trial, we record both task success and completion time, which together constitute our primary performance metrics.
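As a concrete illustration of how the two metrics can be aggregated per task (the function name and the convention of averaging time over successful trials only are our assumptions, not taken from the paper):

```python
def summarize_trials(trials):
    """Aggregate per-trial records into the two benchmark metrics.

    trials: list of (success: bool, completion_time_s: float) for one task.
    Returns (success_rate, mean_completion_time), where the mean is taken
    over successful trials only (an assumed convention).
    """
    if not trials:
        return 0.0, float("nan")
    times = [t for ok, t in trials if ok]      # completion times of successes
    rate = len(times) / len(trials)            # fraction of successful trials
    mean_time = sum(times) / len(times) if times else float("nan")
    return rate, mean_time
```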

30 High-Fidelity Task Environments

Task Visualization

Real-World Robot Control with Four Teleoperation Modes

Vision

Vision Control 1

Vision Control 2

VisionPro

VisionPro Control 1

VisionPro Control 2

Exoskeleton

Exoskeleton Control 1

Exoskeleton Control 2

Xsens

Xsens Control 1

Xsens Control 2

Simulated Control with Four Teleoperation Modes

Sim - Vision

Sim Vision Control 1

Sim Vision Control 2

Sim - VisionPro

Sim VisionPro Control 1

Sim VisionPro Control 2

Sim - Exoskeleton

Sim Exoskeleton Control 1

Sim Exoskeleton Control 2

Sim - Xsens

Sim Xsens Control 1

Sim Xsens Control 2

Sim and Real Deployment

High-Precision Teleoperation Demonstration - Xsens

High-Precision Control with Xsens - Demo 1

High-Precision Control with Xsens - Demo 2

Simulated Xsens-Controlled Precision Task

Teleoperation Interface

We implement four representative teleoperation pipelines—monocular vision, MoCap, VR, and exoskeleton—under a unified, modular interface.

1. Vision-based

The system first calibrates once in a neutral T-pose, using SMPL to derive body shape β and link-scale factors s that align human and robot kinematics. During operation, SMPLer-X streams the operator's upper-body pose, which is rescaled by s and solved with the Pink IK solver for arm-and-wrist motion, while MediaPipe keypoints refined by Dex-Retargeting drive precise finger control. Decoupling limb and hand estimation yields robust, real-time teleoperation from pure vision input.
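A minimal sketch of the link-rescaling step described above (the function, skeleton layout, and per-link scale convention are hypothetical; the actual pipeline derives scales from SMPL shape parameters):

```python
import numpy as np

def rescale_upper_body(human_positions, link_scales, parent):
    """Rescale human joint positions so link lengths match the robot's.

    human_positions: (J, 3) joint positions in the robot frame.
    link_scales:     per-joint scale factor s applied to the bone from the
                     parent (the root's entry is unused).
    parent:          parent index per joint, root has parent -1; parents
                     must appear before their children.
    """
    scaled = np.zeros_like(human_positions)
    for j in range(len(parent)):
        p = parent[j]
        if p < 0:
            scaled[j] = human_positions[j]              # keep the root fixed
        else:
            bone = human_positions[j] - human_positions[p]
            scaled[j] = scaled[p] + link_scales[j] * bone
    return scaled
```

The rescaled joint targets are what would then be handed to the IK solver.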

2. MoCap-based

The inertial-MoCap pipeline uses an Xsens MVN suit (23 IMUs) plus Manus Metagloves. After a one-time calibration, the MVN stream provides the 6-DoF pose of 23 body segments, while each glove outputs 20 finger-joint DoFs. Raw limb data are first transformed from the MVN's global frame to the robot frame (pelvis origin, forward +X, vertical +Z).
A joint-specific, real-time rescaling module then compensates for human-robot link-length mismatches before Closed-Loop Inverse Kinematics (CLIK) solves the robot's arm-and-wrist poses. For the hands, the glove's MCP, PIP, DIP, and abduction/adduction angles are mapped directly, subject to the dexterous hand's joint limits, yielding accurate, low-latency replication of both limb and finger motions.
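The frame change from the MVN global frame to the pelvis-rooted robot frame can be sketched as a single rigid-transform composition (function and argument names are hypothetical):

```python
import numpy as np

def to_robot_frame(R_wp, p_wp, R_ws, p_ws):
    """Re-express a body-segment pose, given in the MVN world frame, in the
    pelvis-rooted robot frame: T_ps = T_wp^-1 @ T_ws.

    R_wp, p_wp: pelvis rotation (3x3) and position (3,) in the world frame.
    R_ws, p_ws: segment rotation (3x3) and position (3,) in the world frame.
    """
    R_ps = R_wp.T @ R_ws                 # rotation of segment w.r.t. pelvis
    p_ps = R_wp.T @ (p_ws - p_wp)        # position of segment w.r.t. pelvis
    return R_ps, p_ps
```

After this change of frame, the joint-specific rescaling and CLIK steps operate entirely in the robot frame.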

xsens pipeline

3. VR-based

The VR-based teleoperation system includes two main components:

1. Upper-body Limb Motion Control

For upper-body limb motion control, the Apple VisionPro is utilized for hand, wrist, and head tracking, adhering to the OpenXR coordinate system. Wrist and head poses are initially transformed into the robot's coordinate frame. The wrist offset relative to the head is then converted into an offset relative to the pelvis. Only the wrist translation data is fed to an IK algorithm based on Pink, which computes all degrees of freedom except for finger joints.
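A minimal sketch of converting the head-relative wrist offset into a pelvis-relative one, using homogeneous transforms in the robot base frame (all names are hypothetical; the real system additionally handles the OpenXR-to-robot axis convention):

```python
import numpy as np

def wrist_offset_wrt_pelvis(p_wrist_in_head, T_base_head, T_base_pelvis):
    """Convert a wrist position expressed relative to the head into a
    position relative to the pelvis.

    p_wrist_in_head: (3,) wrist offset in the head frame.
    T_base_head, T_base_pelvis: 4x4 homogeneous poses in the base frame.
    """
    p_h = np.append(p_wrist_in_head, 1.0)        # homogeneous coordinates
    p_base = T_base_head @ p_h                   # wrist in the base frame
    return (np.linalg.inv(T_base_pelvis) @ p_base)[:3]
```

Only the resulting translation would then be passed to the Pink-based IK solver, as described above.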

2. Hand Control

For hand control, to enhance manual dexterity across different teleoperators, the distal phalanx lengths of each operator's fingers are measured and scaled proportionally to match the corresponding robotic finger segments. Subsequently, vector-based optimizers are employed, following the OpenTelevision approach, to generate robot-hand joint commands within the dexterous-retargeting framework of AnyTeleop.
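The proportional scaling step can be sketched as follows (function name and per-finger length convention are hypothetical; the real retargeting is performed by the vector-based optimizers mentioned above):

```python
import numpy as np

def scale_fingertip_vectors(tip_vectors, human_lengths, robot_lengths):
    """Proportionally scale human fingertip vectors so measured human
    finger-segment lengths match the robot's, before passing them to a
    vector-based retargeting optimizer.

    tip_vectors:   (F, 3) per-finger vectors from a human hand tracker.
    human_lengths: (F,) measured human finger-segment lengths.
    robot_lengths: (F,) corresponding robot finger-segment lengths.
    """
    tip_vectors = np.asarray(tip_vectors, dtype=float)
    ratios = np.asarray(robot_lengths, float) / np.asarray(human_lengths, float)
    return tip_vectors * ratios[:, None]   # per-finger scale on each 3-D vector
```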

4. Exoskeleton-based

This exoskeleton-based teleoperation framework creates isomorphic systems customized to replicate a target humanoid's upper-body kinematics, based on HOMIE principles. Servo-driven joints ensure real-time synchronization of operator and robot movements. Integrated motion-sensing gloves with Hall-effect sensors provide 15-DoF per-hand tracking. By directly mapping operator kinematics to the humanoid's joints, this method bypasses inverse kinematics (IK) approximations, thereby eliminating algorithmic errors and enhancing operational bandwidth and positional accuracy.
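Because the exoskeleton is isomorphic to the robot, the mapping reduces to a per-joint calibration plus joint-limit clamping, with no IK involved. A minimal sketch (parameter names are hypothetical):

```python
import numpy as np

def map_exo_to_robot(q_exo, offset, sign, q_min, q_max):
    """Direct joint-space mapping from an isomorphic exoskeleton to the
    robot: per-joint sign/offset calibration, then clamping to the
    robot's joint limits."""
    q = np.asarray(sign) * (np.asarray(q_exo) - np.asarray(offset))
    return np.clip(q, q_min, q_max)
```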

exo-g1

1. Isomorphic Exoskeleton for Unitree G1.

exo-gr1

2. Isomorphic Exoskeleton for Fourier GR-1.

exo-h1

3. Isomorphic Exoskeleton for Unitree H1-2.

Experiments

From top to bottom, we illustrate the four teleoperation modalities executing the following tasks: ball_trashcan, pen_brushpot, ball_bimanual, and pot_bimanual.

bench all

1. Simulation Results

We select ten representative tasks of varying difficulty from TeleOpBench: (1) push_cube, (2) pich_cube, (3) pick_place_cube, (4) uprear_cup, (5) ball_trashcan, (6) ball_mug, (7) ball_bimanual, (8) pot_bimanual, (9) pot_tomato_plate, and (10) pen_brushpot. Full task descriptions are given in the task table of the paper. A user study involving four participants was conducted; task-level success rates and completion times are summarized quantitatively in the table below.

Task Visualization

2. Real-world Results

We reproduce the task suite on physical robots and evaluate all four teleoperation pipelines with the identical metric suite; the resulting quantitative scores are summarized in the table below.

Task Visualization

The figure below presents completion-time curves for simulation and the real world. Tasks in which any teleoperation modality achieved a success rate below 20% are excluded from the plotted curves to ensure their reliability. The two domains exhibit a strong positive correlation: the vision-tracking interface consistently requires the longest execution time (blue curve), the inertial-MoCap pipeline is the fastest (red curve), and the VR and exoskeleton interfaces cluster in between. This close alignment between simulated and real-world performance confirms that TeleOpBench reliably predicts practical outcomes and therefore offers substantial utility to the community.

Sim

Sim

Real

Real

Authors

1Shanghai Artificial Intelligence Laboratory, 2Zhejiang University, 3The Chinese University of Hong Kong,
4The Hong Kong University of Science and Technology (Guangzhou),
5The University of Hong Kong, 6Feeling AI
*Equal contribution †Corresponding author

@misc{li2025teleopbenchsimulatorcentricbenchmarkdualarm,
      title={TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation}, 
      author={Hangyu Li and Qin Zhao and Haoran Xu and Xinyu Jiang and Qingwei Ben and Feiyu Jia and Haoyu Zhao and Liang Xu and Jia Zeng and Hanqing Wang and Bo Dai and Junting Dong and Jiangmiao Pang},
      year={2025}
    }

If you have any questions, please contact Hangyu Li and Qin Zhao. 🎉