Dream2Real:
Zero-Shot 3D Object Rearrangement with Vision-Language Models

Ivan Kapelyukh*, Yifei Ren*, Ignacio Alzugaray, and Edward Johns

(*joint first authorship)

In Submission

Abstract

We introduce Dream2Real, a robotics framework that integrates vision-language models trained on 2D data into a 3D object rearrangement pipeline. This is achieved by the robot autonomously constructing a 3D representation of the scene, where objects can be rearranged virtually and an image of the resulting arrangement rendered. These renders are evaluated by a VLM, so that the arrangement that best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without needing to collect a training dataset of example arrangements. Results on a series of real-world tasks show that this framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 3D rearrangement tasks.

Key Idea

The robot first builds a 3D representation of the objects in a scene. Then it imagines new arrangements of those objects, and evaluates them with a VLM to select the arrangement that best matches the user instruction. Finally, the robot uses pick-and-place to recreate the best imagined arrangement in the real world.
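
The selection step can be summarized as a simple search loop over imagined arrangements. Below is a minimal Python sketch of this imagine-and-evaluate loop; the `render` and `vlm_score` callables are hypothetical placeholders standing in for the scene renderer and VLM scoring described above, not the authors' actual API.

```python
# Minimal sketch of the imagine-and-evaluate loop. The `render` and
# `vlm_score` callables are hypothetical placeholders for the scene
# renderer and the VLM described above.
from typing import Any, Callable, Iterable, Tuple

def choose_best_arrangement(
    candidate_poses: Iterable[Any],
    render: Callable[[Any], Any],            # pose -> image of the imagined scene
    vlm_score: Callable[[Any, str], float],  # (image, instruction) -> match score
    instruction: str,
) -> Tuple[Any, float]:
    """Return the imagined pose whose render best matches the instruction."""
    best_pose, best_score = None, float("-inf")
    for pose in candidate_poses:
        # Virtually move the object, render the result, and score the render
        # against the language instruction.
        score = vlm_score(render(pose), instruction)
        if score > best_score:
            best_pose, best_score = pose, score
    # best_pose is then recreated in the real world via pick-and-place.
    return best_pose, best_score
```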

Robot Videos

These videos show our robot performing object rearrangement zero-shot, without needing any examples or further training. This is made possible by VLMs pre-trained on web-scale data and by a 3D scene representation that the robot constructs autonomously.

Full Video (3 min)

Framework Overview

This figure illustrates the key idea of our method, described above. We study how a robot can imagine (or dream) new configurations of scenes, and then evaluate them using a VLM to determine a suitable goal state, which it then achieves in the real world.

Pipeline

This figure shows how the components of our method fit together into a robotic rearrangement pipeline. For more details about each part of our method, please see our paper.
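
As one concrete illustration of the scoring step in this pipeline, the sketch below uses CLIP (via the Hugging Face transformers library) as the VLM to rank rendered arrangements against a language instruction. This is an illustrative stand-in under assumed model and library choices, not necessarily the exact configuration used in our implementation; please see the paper for the actual details.

```python
# Illustrative VLM-scoring step, assuming CLIP via Hugging Face transformers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def score_renders(renders: list[Image.Image], instruction: str) -> torch.Tensor:
    """Return one image-text similarity score per rendered arrangement."""
    inputs = processor(text=[instruction], images=renders,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_image has shape (num_renders, 1): the similarity of each
    # render to the instruction. Higher means a better match.
    return outputs.logits_per_image.squeeze(-1)

# The render with the highest score is selected as the goal arrangement, e.g.:
# best = int(score_renders(renders, "put the fork to the left of the plate").argmax())
```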

Supplementary Material

Here we share additional low-level details about the experiment setup and implementation, for readers who would like to know more.