VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation

1 D-Robotics
2 National Key Laboratory of Autonomous Intelligent Unmanned Systems
3 University of Science and Technology Beijing
4 State Key Laboratory of Multimodal Artificial Intelligence System (MAIS), Institute of Automation, Chinese Academy of Sciences
5 Frontiers Science Center for Intelligent Autonomous Systems
6 Shanghai Institute of Intelligent Science and Technology, Tongji University
*Equal Contribution · Project Leader · Corresponding Authors

VO-DP is a vision-only method for visuomotor robotic manipulation: it takes single-view RGB images as input, uses large pretrained vision models to extract semantic and geometric features from observations, and provides high-quality conditional inputs to the policy head. Experiments show that it matches the point cloud-based DP3 in simulation accuracy, significantly outperforms it on real-world tasks, and substantially raises the accuracy attainable by vision-only methods.

Abstract

In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point cloud feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions, which have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT, incorporating semantic features from DINOv2 and geometric features from its Alternating-Attention (AA) blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only significantly outperforms the vision-only baseline DP but also exhibits distinct performance trends relative to the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6%, on par with DP3 (64.0%) and far higher than DP (34.8%); in real-world tasks, it reaches 87.9%, outperforming both DP3 (67.5%) and DP (11.2%) by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions, including color, size, background, and lighting. Lastly, we open-source DRRM (D-Robotics Robotic Manipulation), a training library for robotic manipulation. Built on Accelerate, the library supports multi-machine and multi-GPU parallel training as well as mixed-precision training (e.g., bf16, fp16). It is compatible with visuomotor policies such as DP and DP3, and also supports the RoboTwin simulator. VO-DP is integrated into DRRM. We refer readers to the project page for code and videos.
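To make the DRRM claim concrete, the following is a minimal sketch of what an Accelerate-backed training loop looks like. The tiny model, random data, and loss here are placeholders for illustration only, not DRRM's actual API; only the Accelerate calls are standard.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Minimal sketch: multi-GPU / mixed-precision (bf16) training with
# HuggingFace Accelerate. Model and data are stand-ins, not DRRM code.
accelerator = Accelerator(mixed_precision="bf16")

model = nn.Linear(32, 7)  # stand-in for a policy network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = TensorDataset(torch.randn(256, 32), torch.randn(256, 7))
loader = DataLoader(data, batch_size=32)

# prepare() moves everything to the right device(s) and shards the loader
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for obs, act in loader:
    with accelerator.autocast():  # bf16 autocast region
        loss = nn.functional.mse_loss(model(obs), act)
    accelerator.backward(loss)    # handles mixed-precision scaling
    optimizer.step()
    optimizer.zero_grad()
```

Launching the same script with `accelerate launch --multi_gpu --mixed_precision bf16 train.py` distributes this loop across GPUs (or machines) without code changes, which is the mechanism DRRM builds on.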

Method

Our method VO-DP has four core modules (a code sketch of modules 2 and 3 follows the list):

1) VGGT Encoder: extracts semantic features from patchified images via DINOv2 and generates geometric features through its Alternating-Attention (AA) network.
2) Semantic-Geometric Fuser: fuses per-frame geometric and semantic features using residual cross-attention and an FFN.
3) Spatial Compression: reshapes the fused features, downsamples them with a lightweight ResNet, and concatenates the compressed spatial features with proprioceptive observations to form compact scenario representations.
4) Vision-Only Conditioned Action Generation: employs a DDPM-based policy head to generate actions conditioned on the scenario representations.
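Below is a minimal PyTorch sketch of the fusion and compression steps under assumed token shapes (14×14 patches at dimension 768). Which stream serves as query, the layer sizes, and the single strided convolution standing in for the lightweight ResNet are all illustrative assumptions, not the paper's exact implementation.

```python
import torch
from torch import nn

class SemanticGeometricFuser(nn.Module):
    """Residual cross-attention + FFN, per the module description.
    Query/key roles and all dimensions are assumptions for this sketch."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_f = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, geo: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # geo, sem: (B, N, dim) per-frame token sequences from the encoder
        x = geo + self.attn(self.norm_q(geo), sem, sem, need_weights=False)[0]
        return x + self.ffn(self.norm_f(x))

fuser = SemanticGeometricFuser()
geo = torch.randn(2, 196, 768)   # geometric tokens (AA blocks)
sem = torch.randn(2, 196, 768)   # semantic tokens (DINOv2)
fused = fuser(geo, sem)          # (2, 196, 768)

# Spatial compression: tokens -> feature map -> downsample -> flatten.
# A single strided conv stands in for the paper's lightweight ResNet.
fmap = fused.transpose(1, 2).reshape(2, 768, 14, 14)
cond = nn.Conv2d(768, 128, 3, stride=2, padding=1)(fmap).flatten(1)
proprio = torch.randn(2, 14)     # placeholder proprioceptive state
scene_repr = torch.cat([cond, proprio], dim=-1)  # conditions the DDPM head
```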


Effectiveness

VO-DP achieves state-of-the-art (SOTA) performance in both simulation and real-world environments while demonstrating high data efficiency and strong generalization. In simulation, it is rigorously evaluated on the challenging 14-task RoboTwin benchmark. For real-world tasks such as pick-and-place and stacking, VO-DP is trained with only 200 demonstrations per task, showcasing notable data efficiency. More importantly, the method generalizes across significant variations, maintaining robust performance in tests of size, appearance, illumination, and background, with success rates of 65.0% on unseen object sizes and 83.3% under dynamic lighting interference.


Geometry-Aware Tokens
Visualization on RoboTwin

VO-DP's geometry-aware tokens provide an effective spatial representation for embodied scenarios, as shown by visualizations on the challenging RoboTwin benchmark. The policy solves 14 bimanual tasks by capturing both semantic intent and geometric structure from RGB inputs alone, achieving high success rates while satisfying precise pose constraints and maintaining collision-free trajectories across extensive test scenes.
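The page does not spell out how the token visualizations are produced; a common recipe for inspecting such tokens is to project them onto their top three PCA components and render the result as an RGB map. A self-contained sketch with dummy tokens, assuming a 14×14 patch grid:

```python
import torch

def tokens_to_rgb(tokens: torch.Tensor, hw: tuple) -> torch.Tensor:
    """Project per-patch tokens onto their top-3 PCA components and
    normalize to [0, 1] as an RGB image. `tokens`: (N, dim) for one
    frame; `hw`: the patch grid, e.g. (14, 14). Illustrative only."""
    x = tokens - tokens.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(x, q=3)      # v: (dim, 3)
    rgb = x @ v                              # (N, 3) component scores
    rgb = (rgb - rgb.min(0).values) / (
        rgb.max(0).values - rgb.min(0).values + 1e-8
    )
    return rgb.reshape(*hw, 3)               # (H, W, 3) in [0, 1]

heat = tokens_to_rgb(torch.randn(196, 768), (14, 14))
```

With real tokens, nearby patches on the same object tend to share colors, which is what makes such maps useful for eyeballing whether tokens encode geometry and semantics.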

BibTeX

@article{ni2025vodp,
  title={VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation},
  author={Zehao Ni and Yonghao He and Lingfeng Qian and Jilei Mao and Fa Fu and Wei Sui and Hu Su and Junran Peng and Zhipeng Wang and Bin He},
  journal={arXiv preprint arXiv:2510.15530},
  year={2025}
}