TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image

Wang, Ziqian; He, Yonghao; Yang, Licheng; Zou, Wei; Ma, Hongxuan; Liu, Liu; Sui, Wei; Guo, Yuxin; Su, Hu

Ziqian Wang^1,3,2*, Yonghao He^2*†, Licheng Yang^1,3, Wei Zou^1,3, Hongxuan Ma^3, Liu Liu^4, Wei Sui^2, Yuxin Guo^1,3, Hu Su^3✉

^1 School of Artificial Intelligence, University of Chinese Academy of Sciences
^2 D-Robotics
^3 State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences
^4 Horizon Robotics

^*Equal contribution ^†Project Leader ^✉Corresponding author

We present TabletopGen, a training-free, fully automatic unified framework that generates instance-level interactive 3D tabletop scenes. As shown on the left, TabletopGen can generate visually realistic, detail-rich, plausibly arranged, and collision-free 3D scenes from either text or a single image input. As shown on the right, our framework can produce a wide variety of tabletop scenes, spanning different shapes, styles, and functional categories.

Abstract

Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI, especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods focus mainly on large-scale scenes and struggle to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen takes a reference image as input; the image can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference image to obtain per-instance images. Each instance is reconstructed into a 3D model and then aligned to a canonical coordinate system. The aligned 3D models undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from a 2D reference. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, and that it can generate realistic tabletop scenes with rich stylistic and spatial diversity.
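
As a rough illustration of what "pose and scale" mean once they have been estimated, the sketch below places a point from an instance's canonical frame into the scene frame using p' = s * R * p + t. It is not taken from the TabletopGen code: the `Pose` class and `place_point` function are hypothetical names, and the rotation is simplified to a single yaw angle about the up axis, which may be narrower than what the Differentiable Rotation Optimizer actually recovers.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    """Hypothetical container for one instance's estimated placement."""
    yaw: float            # rotation about the +Z (up) axis, in radians (assumed simplification)
    translation: tuple    # (tx, ty, tz) in scene units
    scale: float          # uniform scale applied to the canonical model


def place_point(point, pose):
    """Map a canonical-frame point into the scene frame: p' = s * R(yaw) * p + t."""
    x, y, z = point
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    # Rotate about the up axis; z is unchanged by a pure yaw rotation.
    rx, ry = c * x - s * y, s * x + c * y
    tx, ty, tz = pose.translation
    return (pose.scale * rx + tx, pose.scale * ry + ty, pose.scale * z + tz)


# Usage: a quarter turn, half scale, then a shift of (1, 2, 0).
pose = Pose(yaw=math.pi / 2, translation=(1.0, 2.0, 0.0), scale=0.5)
placed = place_point((1.0, 0.0, 0.0), pose)  # approximately (1.0, 2.5, 0.0)
```

Keeping rotation separate from translation and scale in the data structure mirrors the two-stage decoupling described above: each quantity can be estimated by a different procedure and combined only at placement time.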

Method

Our framework accepts either text (which is first converted into a reference image) or a single image. Starting from the image, we proceed in four stages: (1) Instance Extraction performs category analysis, segmentation, and completion to obtain clean, high-resolution per-instance images. (2) Canonical Model Generation uses Image-to-3D and MLLM-based alignment to create, for each instance, a 3D model in a canonical coordinate system. (3) Pose and Scale Alignment, our core stage, recovers the spatial layout. The DRO (Differentiable Rotation Optimizer) estimates rotation by optimizing a tri-modal loss, while the TSA (Top-view Spatial Alignment) mechanism synthesizes a top-view image and, together with MLLM reasoning, selects an anchor instance via our RMA-Score to infer each instance’s translation and scale. (4) 3D Scene Assembly combines all instance models with their poses and scales in a simulator to produce the final collision-free, interactive 3D tabletop scene.
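
To make the anchor idea in stage (3) and the collision-free requirement in stage (4) more concrete, here is a minimal sketch under stated assumptions. It is not the authors' TSA implementation: the anchor is simply given rather than chosen by the RMA-Score, its real-world width is assumed to be known, objects are reduced to axis-aligned top-view boxes, and every name (`TopViewBox`, `metres_per_pixel`, `to_metric`, `footprints_overlap`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TopViewBox:
    """Axis-aligned box of one instance in a top-view image (pixel units)."""
    name: str
    cx: float  # box centre, x
    cy: float  # box centre, y
    w: float   # box width
    h: float   # box height


def metres_per_pixel(anchor, anchor_width_m):
    """Derive the image-to-world ratio from an anchor of assumed known width."""
    return anchor_width_m / anchor.w


def to_metric(box, mpp, origin_px):
    """Convert a pixel box into a metric footprint relative to origin_px."""
    ox, oy = origin_px
    return {
        "name": box.name,
        "x": (box.cx - ox) * mpp,
        "y": (box.cy - oy) * mpp,
        "width": box.w * mpp,
        "depth": box.h * mpp,
    }


def footprints_overlap(a, b):
    """True if two axis-aligned metric footprints intersect."""
    return (abs(a["x"] - b["x"]) < (a["width"] + b["width"]) / 2
            and abs(a["y"] - b["y"]) < (a["depth"] + b["depth"]) / 2)


# Usage: a plate assumed to be 0.25 m wide serves as the anchor and the origin.
plate = TopViewBox("plate", cx=200, cy=150, w=100, h=100)
mug = TopViewBox("mug", cx=320, cy=150, w=40, h=40)
mpp = metres_per_pixel(plate, anchor_width_m=0.25)
plate_m = to_metric(plate, mpp, origin_px=(plate.cx, plate.cy))
mug_m = to_metric(mug, mpp, origin_px=(plate.cx, plate.cy))
collides = footprints_overlap(plate_m, mug_m)  # False: the footprints are disjoint
```

A single trusted anchor fixes the image-to-world ratio, after which every other instance's translation and size follow from its pixel box. A real assembly step would check full 3D geometry in the simulator rather than 2D footprints.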

BibTeX

@article{wang2025tabletopgen,
  title={TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image},
  author={Wang, Ziqian and He, Yonghao and Yang, Licheng and Zou, Wei and Ma, Hongxuan and Liu, Liu and Sui, Wei and Guo, Yuxin and Su, Hu},
  journal={arXiv preprint arXiv:2512.01204},
  year={2025}
}