Abstract
In the context of imitation learning, visuomotor diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point cloud feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions, which have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT, which incorporate semantic features from DINOv2 and geometric features from Alternating Attention (AA) blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only significantly outperforms the vision-only baseline DP but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6%, on par with DP3 (64.0%) and far higher than DP (34.8%), while in real-world tasks it reaches 87.9%, outperforming both DP3 (67.5%) and DP (11.2%) by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions, including color, size, background, and lighting. Lastly, we open-source DRRM (D-Robotics Robotic Manipulation), a training library for robotic manipulation. Built on Accelerate, it supports multi-machine, multi-GPU parallel training as well as mixed-precision training (e.g., bf16, fp16). It is compatible with visuomotor policies such as DP and DP3, and also supports the RoboTwin simulator. VO-DP is integrated into DRRM. Code and videos are available on the project page.
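To make the training-infrastructure claim concrete, the snippet below is a minimal, generic sketch of multi-GPU, mixed-precision training with Hugging Face Accelerate, the library DRRM builds on. It is not the DRRM API itself; the placeholder policy, dataset, and loss are illustrative assumptions only.

```python
# Minimal sketch of multi-GPU, mixed-precision training with Hugging Face Accelerate.
# The policy, dataset, and loss below are placeholders, not the actual DRRM/VO-DP code.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")  # or "fp16"

# Placeholder policy and data: replace with a visuomotor policy (e.g., DP/DP3/VO-DP) and demonstrations.
policy = torch.nn.Sequential(torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 14))
dataset = TensorDataset(torch.randn(1024, 512), torch.randn(1024, 14))
loader = DataLoader(dataset, batch_size=64, shuffle=True)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

# Accelerate wraps the model/optimizer/dataloader for data-parallel training across machines and GPUs.
policy, optimizer, loader = accelerator.prepare(policy, optimizer, loader)

for epoch in range(10):
    for obs, action in loader:
        pred = policy(obs)
        loss = torch.nn.functional.mse_loss(pred, action)
        optimizer.zero_grad()
        accelerator.backward(loss)  # handles autocast/gradient handling under mixed precision
        optimizer.step()
```

Launched with `accelerate launch --num_processes <num_gpus> train.py`, the same script scales from a single GPU to multi-machine setups without code changes.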
Method
Our method VO-DP has four core modules: 1) the VGGT Encoder extracts semantic features from patchified images via DINOv2 and generates geometric features through its Alternating Attention (AA) blocks; 2) the Semantic-Geometric Fuser fuses per-frame geometric and semantic features using residual cross-attention and a feed-forward network (FFN); 3) the Spatial Compression module reshapes the fused features, downsamples them with a lightweight ResNet, and concatenates the compressed spatial features with proprioceptive observations to form compact scene representations; 4) the Vision-Only Conditioned Action Generation module employs a DDPM-based policy head to generate actions conditioned on these scene representations.
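The following is a minimal PyTorch sketch of this forward path. Token counts, feature dimensions, and the use of a generic CNN in place of the lightweight ResNet are assumptions for illustration; VGGT/DINOv2 features are stubbed with random tensors, and the DDPM policy head is only indicated.

```python
# Minimal sketch of the VO-DP forward path described above. Shapes, dims, and modules
# are illustrative assumptions, not the released implementation.
import torch
import torch.nn as nn

B, N, D = 2, 1024, 768        # batch, patch tokens per frame, feature dim (assumed)
P = 14                        # proprioception dim (assumed)

class SemanticGeometricFuser(nn.Module):
    """Fuse geometric tokens (queries) with semantic tokens (keys/values)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, geo, sem):
        # Residual cross-attention: geometric tokens attend to semantic tokens.
        x = geo + self.attn(self.norm1(geo), sem, sem, need_weights=False)[0]
        # Residual feed-forward refinement.
        return x + self.ffn(self.norm2(x))

class SpatialCompressor(nn.Module):
    """Reshape fused tokens to a 2D grid and downsample with a small CNN (stand-in for the lightweight ResNet)."""
    def __init__(self, dim, out_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(dim, out_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(out_dim, out_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, tokens):
        b, n, d = tokens.shape
        h = w = int(n ** 0.5)                        # assume a square patch grid
        grid = tokens.transpose(1, 2).reshape(b, d, h, w)
        return self.cnn(grid).flatten(1)             # (B, out_dim)

# Stubs standing in for VGGT's AA (geometric) and DINOv2 (semantic) tokens.
geo_tokens = torch.randn(B, N, D)
sem_tokens = torch.randn(B, N, D)
proprio = torch.randn(B, P)

fused = SemanticGeometricFuser(D)(geo_tokens, sem_tokens)
compressed = SpatialCompressor(D)(fused)
condition = torch.cat([compressed, proprio], dim=-1)  # compact scene representation
# `condition` would then condition a DDPM-based policy head that denoises an action sequence.
print(condition.shape)  # torch.Size([2, 270])
```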
Effectiveness
VO-DP achieves state-of-the-art (SOTA) performance in both simulation and real-world environments, while also demonstrating high data efficiency and strong generalization. In simulation, it is rigorously evaluated on the challenging 14-task RoboTwin benchmark. For real-world tasks such as pick-and-place and stacking, VO-DP is trained with only 200 demonstrations per task, showcasing exceptional data efficiency. More importantly, the method generalizes across significant variations, maintaining robust performance in rigorous robustness tests covering size, appearance, illumination, and background, with success rates such as 65.0% on unseen object sizes and 83.3% under dynamic lighting interference.
Geometry-Aware Tokens
Visualization on RoboTwin
VO-DP demonstrates effective spatial representation in embodied scenarios through its geometry-aware tokens, as evaluated on the challenging RoboTwin benchmark. It successfully addresses 14 bimanual tasks by comprehending semantic intent and geometric structure from RGB image inputs alone. The policy achieves high success rates by satisfying precise pose constraints and maintaining collision-free trajectories across extensive test scenes.
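A common way to inspect such per-patch features (a sketch of standard practice, not necessarily the paper's exact visualization procedure) is to project them onto their top three principal components and render the result as an RGB map. The grid size and feature dimension below are assumptions, with random tensors standing in for the fused geometry-aware tokens.

```python
# Visualize per-patch token features by projecting them to 3 principal components (RGB).
import torch
import matplotlib.pyplot as plt

def tokens_to_rgb(tokens, grid_hw):
    """tokens: (N, D) per-patch features; grid_hw: (H, W) patch grid with H*W == N."""
    x = tokens - tokens.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(x, q=3)               # top-3 principal directions
    rgb = x @ v                                        # (N, 3)
    rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-6)
    return rgb.reshape(*grid_hw, 3)

# Random features standing in for fused geometry-aware tokens (32x32 patch grid assumed).
tokens = torch.randn(32 * 32, 768)
plt.imshow(tokens_to_rgb(tokens, (32, 32)).numpy())
plt.axis("off")
plt.show()
```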
BibTeX
@article{ni2025vodp,
  title={VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation},
  author={Zehao Ni and Yonghao He and Lingfeng Qian and Jilei Mao and Fa Fu and Wei Sui and Hu Su and Junran Peng and Zhipeng Wang and Bin He},
  journal={arXiv preprint arXiv:2510.15530},
  year={2025}
}