Zero-Shot 3D Object Rearrangement with Vision-Language Models
Ivan Kapelyukh*, Yifei Ren*, Ignacio Alzugaray, and Edward Johns
(*joint first authorship)
We introduce Dream2Real, a robotics framework which integrates vision-language models trained on 2D data into a 3D object rearrangement pipeline. The robot autonomously constructs a 3D representation of the scene, in which objects can be rearranged virtually and an image of each resulting arrangement rendered. These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without needing to collect a training dataset of example arrangements. Results on a series of real-world tasks show that this framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 3D rearrangement tasks.
Key Idea
The robot first builds a 3D representation of the objects in a scene. Then it imagines new arrangements of those objects, and evaluates them with a VLM to select the arrangement which best matches the user instruction. Finally, the robot uses pick-and-place to recreate the best imagined arrangement in the real world.
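The imagine-render-evaluate loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the scene renderer and VLM scorer are stand-in stubs (a toy heuristic replaces the real VLM), and the pose parameterisation is an assumption for demonstration.

```python
from dataclasses import dataclass

@dataclass
class Arrangement:
    """One imagined placement of the object being rearranged (hypothetical)."""
    object_pose: tuple  # assumed (x, y, theta) parameterisation

def render(arrangement):
    """Stand-in: a real system would render an image of the 3D scene
    with the object virtually moved to this pose."""
    return {"pose": arrangement.object_pose}

def vlm_score(image, instruction):
    """Stand-in for the VLM. Toy heuristic: pretend the instruction
    asks for the object to be centred, so score by distance to origin."""
    x, y, _ = image["pose"]
    return -(x ** 2 + y ** 2)

def best_arrangement(candidates, instruction):
    """Imagine each candidate, render it, and keep the VLM's favourite."""
    return max(candidates, key=lambda a: vlm_score(render(a), instruction))

candidates = [Arrangement((0.3, 0.1, 0.0)),
              Arrangement((0.0, 0.05, 1.57)),
              Arrangement((0.5, 0.4, 0.0))]
goal = best_arrangement(candidates, "put the mug in the centre")
# The selected goal arrangement is then recreated with pick-and-place.
```

In the real pipeline, the scoring step is what lets web-scale 2D training transfer to 3D rearrangement: the VLM only ever sees rendered 2D images of imagined scenes.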
These videos show our robot performing object rearrangement zero-shot, without needing any examples or further training. This is made possible using VLMs pre-trained on web-scale data, and a 3D scene representation which the robot constructs autonomously.
Full Video (3 min)
This figure illustrates the key idea of our method, described above. We study how a robot can imagine (or dream) new configurations of scenes, and then evaluate them using a VLM to determine a suitable goal state, which it then achieves in the real world.
This figure shows how the components of our method fit together into a robotic rearrangement pipeline. For more details about each part of our method, please see our paper.