TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image

1School of Artificial Intelligence, University of Chinese Academy of Sciences
2D-Robotics
3State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences
*Equal contribution    Project Leader    Corresponding author

We present TabletopGen, a training-free, fully automatic unified framework that generates instance-level interactive 3D tabletop scenes. As shown on the left, TabletopGen can generate visually realistic, detail-rich, plausibly arranged, and collision-free 3D scenes from either text or a single image input. As shown on the right, our framework can produce a wide variety of tabletop scenes, spanning different shapes, styles, and functional categories.

Abstract

Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI, especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods mainly focus on large-scale scenes and struggle to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference image to obtain per-instance images. Each instance is reconstructed into a 3D model and then aligned to a canonical coordinate system. The aligned 3D models undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from a 2D reference. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, and that it can generate realistic tabletop scenes with rich stylistic and spatial diversity.

Method

Method Overview

Our framework accepts either text (which is first converted into a reference image) or a single image. Starting from the image, we proceed in four stages: (1) Instance Extraction performs category analysis, segmentation, and completion to obtain clean, high-resolution per-instance images. (2) Canonical Model Generation uses Image-to-3D and MLLM-based alignment to create a 3D model with a canonical coordinate system for each instance. (3) Our core Pose and Scale Alignment stage recovers the spatial layout. The DRO (Differentiable Rotation Optimizer) estimates each instance's rotation by optimizing a tri-modal loss, while the TSA (Top-view Spatial Alignment) mechanism synthesizes a top-view image and, together with MLLM reasoning, selects an anchor instance via our RMA-Score to infer each instance's translation and scale. (4) The 3D Scene Assembly stage combines all instance models with their poses and scales in a simulator to produce the final collision-free, interactive 3D tabletop scene.
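To make the anchor idea in stage (3) more concrete, the following is a minimal sketch and not the TSA implementation. It assumes that each instance has an axis-aligned pixel footprint in the top-view image, that the anchor has already been chosen (the RMA-Score is not reproduced here), and that the anchor's real-world width in metres is known, for example from MLLM reasoning. The names `TopViewBox`, `metres_per_pixel`, and `place_instances` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TopViewBox:
    """Hypothetical axis-aligned footprint of one instance in the top view (pixels)."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self) -> float:
        return self.x_max - self.x_min

    @property
    def depth(self) -> float:
        return self.y_max - self.y_min

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def metres_per_pixel(anchor: TopViewBox, anchor_width_m: float) -> float:
    """Image-to-world ratio derived from the anchor's assumed real width."""
    if anchor.width <= 0:
        raise ValueError("anchor box must have positive width")
    return anchor_width_m / anchor.width


def place_instances(
    boxes: List[TopViewBox],
    anchor_name: str,
    anchor_width_m: float,
    image_center: Tuple[float, float],
) -> Dict[str, Tuple[float, float, float, float]]:
    """Map each box to (tx, ty, footprint_width_m, footprint_depth_m).

    Translations are measured on the table plane relative to image_center.
    Assumption: the top view is orthographic, so one ratio fits all instances.
    """
    anchor = next(b for b in boxes if b.name == anchor_name)
    ratio = metres_per_pixel(anchor, anchor_width_m)
    cx, cy = image_center
    layout = {}
    for b in boxes:
        bx, by = b.center
        layout[b.name] = (
            (bx - cx) * ratio,
            (by - cy) * ratio,
            b.width * ratio,
            b.depth * ratio,
        )
    return layout
```

Under these assumptions, a per-instance scale factor would follow by dividing the metric footprint by the footprint of the canonical 3D model. For example, with a laptop anchor spanning 200 pixels and assumed to be 0.32 m wide, every pixel corresponds to 0.0016 m, so a mug spanning 40 pixels gets a footprint of 0.064 m.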


