DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics
Ivan Kapelyukh*, Vitalis Vosylius*, and Edward Johns
(* joint first authorship)
Accepted at the NeurIPS 2022 Robot Learning Workshop and the CoRL 2022 Pre-training for Robot Learning Workshop
We introduce the first work to explore web-scale diffusion models for robotics. DALL-E-Bot enables a robot to rearrange objects in a scene by first inferring a text description of those objects, then generating an image representing a natural, human-like arrangement of those objects, and finally physically arranging the objects according to that image. The significance is that we achieve this zero-shot using DALL-E, without needing any further data collection or training. Encouraging real-world results from human studies show that this is a promising direction for the future of web-scale robot learning. We also propose a list of recommendations to the text-to-image community, to align further developments of these models with applications to robotics.
Key Idea The robot prompts DALL-E with a list of the objects it detects. The generated goal image contains a human-like object arrangement, which the robot then recreates.
A robot using DALL-E-Bot to arrange objects on a dining table
Here, DALL-E-Bot uses the prompt "a fork, a knife, a plate, and a spoon, top-down"
Diffusion models have astonished researchers and the public alike through their ability to create high-quality images from a text description, from portraits of human faces to surreal fantasy landscapes. This has been made possible by training on hundreds of millions of captioned images. In doing so, these models learn subtle patterns in how natural scenes are structured, which is necessary to generate convincing new images, such as this image on the right depicting the interior of an apartment. This vast wealth of visual knowledge is compressed into only a few gigabytes of model parameters.
Therefore, in this work we explore whether web-scale diffusion models like DALL-E can be exploited as a source of visual common sense for robots. We use this model as a robotic “imagination engine”, to create an image of a goal state which the robot will try to achieve. Since DALL-E has seen millions of everyday scenes arranged by humans, it knows how to arrange objects in a human-like way. In our method, called DALL-E-Bot, the robot prompts DALL-E with the list of objects it needs to arrange. This generates a realistic image depicting a human-like arrangement, and the robot then recreates that arrangement in the real world.
Using these web-scale models brings several benefits to DALL-E-Bot. First, this method uses DALL-E zero-shot on the rearrangement task, without requiring any further data collection or training. Second, this is an open-set method: it is not restricted to a specific set of objects, because of the web-scale training of DALL-E. Third, it is autonomous: the robot does not require any human supervision, not even to specify the goal state.
We address the problem of predicting the goal state of a rearrangement task, i.e. a goal pose for each object, such that the objects are arranged in a natural and human-like way. DALL-E-Bot creates a human-like arrangement of objects in the scene using a modular approach, shown in this diagram.
First, the initial observation image is converted into a per-object description consisting of a segmentation mask, an object caption, and a CLIP visual feature vector. Next, a text prompt is constructed describing the objects in the scene and is passed into DALL-E to create a goal image for the rearrangement task, where the objects are arranged in a human-like way.
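As a minimal sketch of the prompt-construction step, the per-object captions can be joined into a single text prompt. The exact template below is our illustration, chosen to reproduce the "top-down" example prompt shown earlier on this page; the paper's actual prompt format may differ in detail.

```python
def build_prompt(object_captions):
    """Join per-object captions into a single text prompt for DALL-E.

    The ", top-down" suffix matches the top-down camera viewpoint
    used for the robot's observation images.
    """
    captions = list(object_captions)
    if len(captions) == 1:
        body = captions[0]
    else:
        body = ", ".join(captions[:-1]) + ", and " + captions[-1]
    return body + ", top-down"

# Example: the dining-scene prompt from this page.
print(build_prompt(["a fork", "a knife", "a plate", "a spoon"]))
# "a fork, a knife, a plate, and a spoon, top-down"
```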
Then, the objects in the initial and generated images are matched using their CLIP visual features, and their poses are estimated by aligning their segmentation masks with ICP. To select the best ICP solution, we compare the semantic feature maps of the aligned objects. Finally, a robot rearranges the scene based on the estimated poses to create the generated arrangement.
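The matching step above can be sketched as follows: normalise the per-object feature vectors and pair up objects by cosine similarity. This is a toy NumPy version using a simple greedy assignment; the real system uses CLIP visual features, and the exact assignment procedure is described in the paper.

```python
import numpy as np

def match_objects(real_feats, gen_feats):
    """Greedily match real objects to generated objects by cosine similarity.

    real_feats: (n, d) array, one feature vector per real object.
    gen_feats:  (m, d) array, one feature vector per generated object.
    Returns a list of (real_index, generated_index) pairs.
    """
    real = real_feats / np.linalg.norm(real_feats, axis=1, keepdims=True)
    gen = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    sim = real @ gen.T  # pairwise cosine similarities

    matches, used = [], set()
    for i in range(sim.shape[0]):
        # Best still-unassigned generated object for real object i.
        j = max((j for j in range(sim.shape[1]) if j not in used),
                key=lambda j: sim[i, j])
        matches.append((i, j))
        used.add(j)
    return matches
```

With the matches in hand, each real object's segmentation mask is aligned to its generated counterpart's mask with ICP to recover a goal pose.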
Here you can see DALL-E-Bot in action in a Dining Scene with a knife, a fork, a spoon, and a plate on a tabletop. It manages to arrange objects in a semantically meaningful way by correctly setting the table for dinner. It can do so without being explicitly trained for this scene because the web-scale image diffusion model has seen many examples of other dinner tables set by humans and understands the semantic structure of the scene. In each video, the image in the top-right is the image generated by DALL-E, and the image in the bottom-right shows the final arrangement created by the robot.
Now DALL-E-Bot is in the office, creating arrangements on desks such that they are convenient and usable when sitting down to work. In this scene, the robot is not allowed to move the iPad display but arranges the remaining objects (a keyboard, a mouse and a mug), taking into account that the iPad is already there. Web-scale training data includes many examples of arrangements on desks that humans actually use, and DALL-E-Bot can handle this scene by creating usable and human-preferred arrangements.
Finally, DALL-E-Bot also understands that fruits lying around on the table should go into an empty basket and tidies up the table by putting an orange and two apples in it.
Generated Scenes: Qualitative Results
Our method relies heavily on the image diffusion model's ability to generate human-like arrangements that a real robot can create. However, these models have shown a tremendous ability to do just that! Here, you can see some examples of DALL-E generated images of our considered scenes, all with semantically correct and human-like arrangements. Moreover, the text-to-image community is advancing the field at a remarkable pace, and our modular approach is therefore likely to improve and scale with all of this future progress.
Importance of Filtering Generated Images
Although web-scale diffusion models can generate impressive, high-quality images, we should not rely on them blindly. Sometimes these models generate unusual or even absurd images. Our method handles these cases with an additional sample-and-filter module. One of the evaluated variants of our method did not use this module and ended up creating some bizarre arrangements, as shown here.
An Auto-Regressive Variant
Our main method predicts where to put all the objects at once. Alternatively, we can create the scene in an auto-regressive manner by placing one object at a time and generating objects around already placed ones. This process is shown in these diagrams. The white masks indicate the pixels which DALL-E can change, in order to place the remaining objects. The auto-regressive approach can better ground the image diffusion model and help generate objects of appropriate size. However, we find in our experiments that errors from various sources can accumulate and lead to undesirable arrangements. Our main method avoids this problem by jointly predicting all object poses at once.
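The mask construction for this auto-regressive variant can be sketched as below: at each step, pixels of already-placed objects are frozen, and everything else is left editable for the inpainting model. The `inpaint`, `prompt_for`, and `segment` names in the commented loop are hypothetical stand-ins for the diffusion model's inpainting interface and the perception modules, not actual APIs.

```python
import numpy as np

def editable_mask(image_shape, placed_masks):
    """Boolean mask of pixels the inpainting model may change.

    True (white) pixels are free for the diffusion model to edit;
    pixels belonging to already-placed objects are frozen (False),
    so each new object is generated around the existing arrangement.
    """
    mask = np.ones(image_shape, dtype=bool)
    for m in placed_masks:
        mask &= ~m
    return mask

# Hypothetical outer loop (names below are illustrative only):
# image, placed_masks = initial_observation, []
# for obj in placement_order:
#     mask = editable_mask(image.shape[:2], placed_masks)
#     image = inpaint(image, mask, prompt_for(obj))   # diffusion inpainting
#     placed_masks.append(segment(image, obj))        # mask of the new object
```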
End-to-End User Evaluation
To validate our claim that our method can create natural arrangements that humans prefer, we used the most direct metric available: human feedback, obtained via a user study. We showed users images of arrangements created by the robot and asked them: "If the robot made this arrangement for you at home, how happy would you be?". Across all scenes, our method was rated highest among the considered baselines, and study participants were happy with the created arrangements. DALL-E-Bot outperforms the geometric heuristics, showing that users care about semantic correctness beyond neat geometric alignment; these subtle semantic rules are captured by DALL-E's web-scale training data. This figure shows how our method compares against the baselines. For each method that uses DALL-E, the generated image is shown on the left of the column, and the final arrangement on the right.
Placing Missing Objects with Inpainting
In this next experiment, we ask: can DALL-E-Bot precisely complete an arrangement which was partially made by a human? We ask users to create example arrangements. We then remove one object and ask DALL-E to inpaint it somewhere in the image (i.e. DALL-E is allowed to change only the pixels which do not belong to the fixed objects, in order to place the missing object). The masked image given as input to DALL-E and the inpainting result are shown below. DALL-E-Bot predicts a placement for the fork which is semantically correct and consistent with the rest of the arrangement. This can be seen as collaborative human-robot manipulation. Quantitative comparisons with baselines show that this is difficult to achieve with heuristics alone, motivating our use of a pre-trained inpainting model.
Visualising Feature Maps of Real and Generated Images
One challenge which is addressed by our work is aligning images of diffusion-generated objects with their real counterparts, for pose estimation. To do this, we first align object masks using ICP, and then select the alignment which leads to the best "match" between the semantic feature maps of the two object images, extracted with an ImageNet-pre-trained network.
To further understand why this approach works, we show qualitative results below. This shows that if a generated and real object have a correctly aligned orientation, then the semantic feature maps will be very similar, even if the objects are visually different instances. We use this property to cross the domain gap between real and diffusion-generated images.
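The selection criterion just described can be sketched as a per-pixel cosine similarity between feature maps: for each candidate ICP alignment, compare the warped generated-object feature map against the real object's, and keep the candidate with the highest score. This is a minimal NumPy illustration; the feature extractor and the exact scoring details are as described in the paper.

```python
import numpy as np

def featmap_similarity(f_a, f_b, eps=1e-8):
    """Mean per-pixel cosine similarity between two (H, W, C) feature maps."""
    a = f_a / (np.linalg.norm(f_a, axis=-1, keepdims=True) + eps)
    b = f_b / (np.linalg.norm(f_b, axis=-1, keepdims=True) + eps)
    return float((a * b).sum(axis=-1).mean())

def select_alignment(real_map, candidate_maps):
    """Pick the ICP candidate whose feature map best matches the real object's.

    candidate_maps: one (H, W, C) feature map per candidate alignment,
    each already warped into the real image's frame.
    """
    scores = [featmap_similarity(real_map, c) for c in candidate_maps]
    return int(np.argmax(scores))
```

Because the similarity is computed in semantic feature space rather than pixel space, a correctly oriented candidate scores highest even when the generated object is a visually different instance of the same category.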
We show for the first time that web-scale diffusion models like DALL-E can be used as “imagination engines” for robots, acting as an aesthetic prior for arranging scenes in a human-like way. This allows for zero-shot, open-set, and autonomous rearrangement. In other words, our DALL-E-Bot system gives web-scale diffusion models an embodiment to actualise the scenes that they imagine. Studies with human users showed that they were happy with the results of everyday rearrangement tasks. We believe that this is an exciting direction for the future of robot learning, as diffusion models continue to impress and inspire complementary research communities.
Click here to read the full paper, which includes a more detailed description of the method, the full set of quantitative results, and a list of recommendations to the text-to-image community to further align these models with applications to robotics.