
Zero-Shot 3D Object Rearrangement with Vision-Language Models

Ivan Kapelyukh*, Yifei Ren*, Ignacio Alzugaray, and Edward Johns

(*joint first authorship)

Published at ICRA 2024



We introduce Dream2Real, a robotics framework which integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. This is achieved by the robot autonomously constructing a 3D representation of the scene, where objects can be rearranged virtually and an image of the resulting arrangement rendered. These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without needing to collect a training dataset of example arrangements. Results on a series of real-world tasks show that this framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.

Key Idea: Imagine new arrangements of a scene using an object-level NeRF, then evaluate each arrangement using a VLM to select a goal state which best matches the user instruction.

Teaser Video

Imagining 3D Goal States with 2D VLMs


Dream2Real enables a robot to imagine, and then evaluate, virtual rearrangements of scenes. First, the robot builds an object-centric NeRF of a scene. Then, numerous reconfigurations of the scene are rendered as 2D images. Finally, a VLM evaluates these according to the user instruction, and the best is then physically created using pick-and-place.

Framework Overview

This figure shows the Dream2Real framework in detail. The robot first autonomously builds a model of the scene. Then the user instruction is used to determine which object should be moved, and so the robot can imagine new configurations of the scene and score them using a VLM. Finally, the highest-scoring pose is used as the goal for pick-and-place to complete the rearrangement.
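To make the pipeline above concrete, here is a minimal sketch of the sample-render-score loop. All helpers (`render_scene`, `clip_score`, `in_collision`) are hypothetical stand-ins for the NeRF renderer, VLM, and collision checker, not the actual implementation; the toy scorer simply rewards poses near a fixed target so the example runs end-to-end.

```python
import random

def render_scene(scene, pose):
    """Hypothetical stand-in for rendering the object-centric NeRF
    with the movable object placed at `pose`."""
    return {"object_pos": pose}

def clip_score(image, caption):
    """Hypothetical stand-in for VLM image-text similarity (e.g. CLIP).
    For illustration only: rewards poses near a fixed target (0.3, 0.5)."""
    x, y = image["object_pos"]
    return -((x - 0.3) ** 2 + (y - 0.5) ** 2)

def in_collision(scene, pose):
    """Hypothetical collision check against the scene reconstruction.
    Here, anything outside the unit workspace counts as infeasible."""
    x, y = pose
    return not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0)

def best_pose(scene, caption, n_samples=500, seed=0):
    """Sample candidate poses, discard colliding ones, render each
    remaining arrangement, and keep the pose the VLM scores highest."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_samples):
        pose = (rng.uniform(-0.2, 1.2), rng.uniform(-0.2, 1.2))
        if in_collision(scene, pose):
            continue  # skip physically infeasible arrangements
        score = clip_score(render_scene(scene, pose), caption)
        if score > best_score:
            best, best_score = pose, score
    return best

goal = best_pose(scene={}, caption="an apple inside a bowl")
```

The selected `goal` pose would then be passed to the pick-and-place planner as the target for the real robot.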

Qualitative Results

We evaluate our method on a range of real-world rearrangement tasks across several scenes. From left to right, we call these the "shopping", "pool ball", and "shelf" scenes.

In the shopping scene experiments, we find that our method can be controlled through natural language, and is robust to distractors in this multi-task scenario. The figure below shows results for the tasks "apple in bowl" (top row) and "apple beside bowl" (bottom row). In the heatmaps (overlaid on the TSDF of the scene), yellow indicates high-scoring positions of the apple, whereas dark blue indicates low-scoring regions, and colliding poses are not included. The red dot highlights the highest-scoring position. The highest-scoring render is shown on the right.
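As a small illustration of how such heatmaps can be assembled, the sketch below normalises raw per-pose scores to [0, 1] for visualisation and leaves colliding poses out entirely; the function name and min-max normalisation scheme are assumptions for illustration, not the paper's exact implementation.

```python
def score_heatmap(scores, collision_mask):
    """Min-max normalise raw VLM scores to [0, 1] for display,
    marking colliding poses as None so they are excluded from the map."""
    valid = [s for s, c in zip(scores, collision_mask) if not c]
    lo, hi = min(valid), max(valid)
    span = (hi - lo) or 1.0  # avoid division by zero if all scores equal
    return [None if c else (s - lo) / span
            for s, c in zip(scores, collision_mask)]

# Four sampled poses; the third collides and is excluded from the heatmap.
heat = score_heatmap([0.2, 0.8, 0.5, 0.9], [False, False, True, False])
```

After normalisation, the highest-scoring collision-free pose has value 1.0 (the red dot in the figures), and the lowest has value 0.0.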

In the pool ball experiments, we find that Dream2Real can understand some complex multi-object relations. The figure below shows results for the tasks "in triangle" (top row) and "in X shape" (bottom). Here, the method must understand the geometric shape formed by the pool balls in the initial scene (first column) and place the missing black pool ball to complete it. The heatmaps show that CLIP does understand how to satisfy these geometric relations. This is surprising, since CLIP has been shown to struggle with much simpler relations such as "left" vs "right", possibly because it exhibits bag-of-words behaviour and so struggles with captions where word order matters. Nevertheless, we find it encouraging that more complex relations can be understood already, and hope that future VLMs will further improve reliability across spatial reasoning tasks.

Below we show 3D heatmaps for the shelf scene. Our method must perform 6-DoF rearrangement (in imagination) to pick up the bottle lying on the table and position it upright on the shelf. There are 3 tasks: making the bottles into a row (left column), placing the bottle in front of the book (middle column), and placing the bottle near the plant (right column). Our experiments show that Dream2Real can complete 6-DoF rearrangement. We also find that a single-view baseline performs worse, since the incomplete reconstruction degrades both the quality of the renders and collision-free motion planning. This suggests that our multi-view approach is more capable, especially for scenes with significant 3D structure such as shelves.
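6-DoF rearrangement requires sampling candidate orientations as well as positions. One standard way to draw unbiased random orientations is Shoemake's uniform random quaternion method, sketched below; the workspace bounds and function names are illustrative assumptions, not the paper's actual sampler.

```python
import math
import random

def random_quaternion(rng):
    """Uniformly random unit quaternion (Shoemake's method),
    giving unbiased random 3D orientations."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    return (a * math.sin(2 * math.pi * u2),
            a * math.cos(2 * math.pi * u2),
            b * math.sin(2 * math.pi * u3),
            b * math.cos(2 * math.pi * u3))

def sample_6dof_pose(rng, workspace):
    """One candidate 6-DoF pose: a position inside the workspace box
    plus a uniformly random orientation."""
    (xmin, xmax), (ymin, ymax), (zmin, zmax) = workspace
    pos = (rng.uniform(xmin, xmax),
           rng.uniform(ymin, ymax),
           rng.uniform(zmin, zmax))
    return pos, random_quaternion(rng)

rng = random.Random(0)
# Hypothetical shelf workspace: the z range keeps samples at shelf height.
pos, quat = sample_6dof_pose(rng, ((0.0, 1.0), (0.0, 1.0), (0.5, 1.0)))
```

Each sampled pose would then be collision-checked against the reconstruction and rendered for VLM scoring, exactly as in the planar case.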

Overall, we find that using 3D scene representations such as NeRFs provides a promising direction for bridging the gap between the 2D world of web-scale VLMs and the 6-DoF world of robotics tasks. This allows the Dream2Real framework to complete 6-DoF tasks but still be zero-shot due to the web-scale visual prior of VLMs, thus avoiding the need to collect a training dataset of example arrangements. For more details about our method and experiments, please see our paper and videos.

Robot Videos

These videos show our robot performing object rearrangement zero-shot, without needing any examples or further training. This is made possible using VLMs pre-trained on web-scale data, and a 3D scene representation which the robot constructs autonomously.

Put the apple inside the bowl

Move the black 8 ball so that there are balls in an X shape

Move the strawberry milkshake bottle to make three milkshake bottles standing upright in a neat row

Dream2Real vs DALL-E-Bot

DALL-E-Bot often generates images containing a different number of objects than the real scene, and so (despite its filtering techniques) it can match a real object to a generated object in the wrong place. Dream2Real is evaluative instead: it uses a VLM to score sampled arrangements of the real objects, avoiding this difficult matching problem entirely. DALL-E-Bot is also affected by distractors, whereas our method automatically hides them from the VLM.

DALL-E-Bot generated goal image for shopping scene and pool scene

(a) Put the apple inside the bowl

(b) Move the black 8 ball so that there are balls in an X shape

Supplementary Material

Here we provide additional low-level details about the experimental setup and implementation.
