Robots that arrange household objects should do so according to the user's preferences, which are inherently subjective and difficult to model. We present NeatNet: a novel Variational Autoencoder architecture using Graph Neural Network layers, which can extract a low-dimensional latent preference vector from a user by observing how they arrange scenes. Given any set of objects, this vector can then be used to generate an arrangement which is tailored to that user's spatial preferences, with word embeddings used for generalisation to new objects. We develop a tidying simulator to gather rearrangement examples from 75 users, and demonstrate empirically that our method consistently produces neat and personalised arrangements across a variety of rearrangement scenarios.
Key Idea By observing tidied scenes, NeatNet can infer a user preference vector and use it to generate a personalised arrangement for any set of objects.
Rearrangement Goals Are Subjective
Many everyday tasks that we would like a household robot to perform can be expressed as a rearrangement problem: given a set of objects, arrange them into some goal state. Examples include: tidying your room, loading a dishwasher, unpacking new groceries into the fridge, or setting a dinner table.
But how can a robot determine what that goal state should be? Everyone arranges their space differently. Many of the factors involved are inherently subjective, as is their relative prioritisation. For example, is the person left or right-handed? How risk-averse are they, i.e. do they avoid leaving fragile glasses near the edge of the table, even if this makes them harder to reach? Do they want their favourite book tidied away neatly on a shelf, or placed on the table nearby for convenience? Since every user has unique eating habits, kitchen cupboards and fridges should also be organised in a personalised way.
It is clear that spatial preferences are complex, yet they influence many rearrangement tasks. Our method infers these preferences by observing how the user arranges their home environment.
Learning Spatial Preferences
Representing Users as Vectors
We wish to infer a user's tidying preferences, represented as a single latent vector. Users whose vectors are close together in this latent space share similar tidying preferences. This latent vector may not always be interpretable, but it can capture features such as whether the user is left or right-handed, or whether they prefer the objects on their desk to be arranged compactly rather than spaciously.
To learn by observing example scenes, we encode each object as a vector composed of two parts. The semantic embedding captures the object's identity, and the position encoding contains the object's coordinates within the scene.
The semantic embedding for each object is generated from its name. Objects whose names appear in similar linguistic contexts are often arranged in related ways, for example "salt" and "pepper", or "pen" and "pencil". We use a pre-trained word embedding model to obtain a word vector for each object, and then train our own network layers end-to-end to extract the features of that vector which are most relevant for tidying. The output is the object's semantic embedding.
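As a concrete illustration of this encoding, here is a minimal NumPy sketch. The word vectors and the projection matrix `W_sem` are random stand-ins (in the real system the word vectors come from a pre-trained model and the projection is trained end-to-end); the dimensions are illustrative assumptions, not those of NeatNet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for pre-trained word embeddings (e.g. from a word2vec/GloVe model).
word_vecs = {"salt": rng.normal(size=50), "pepper": rng.normal(size=50)}

# Learned projection extracting tidying-relevant semantic features
# (trained end-to-end in the real system; random here for illustration).
W_sem = rng.normal(size=(8, 50)) * 0.1

def encode_object(name, position):
    """Node feature = semantic embedding concatenated with (x, y) position."""
    sem = np.tanh(W_sem @ word_vecs[name])   # semantic embedding
    return np.concatenate([sem, position])   # position encoding appended

node = encode_object("salt", np.array([0.2, 0.7]))
```

Each object in a scene becomes one such feature vector, so a whole arrangement is simply a set of these vectors.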
We want to learn two functions: an encoder, and a position predictor. The input to the encoder is the user's arrangement of the scene, from which the encoder infers a user preference vector. The position predictor outputs an arrangement of the scene based on those preferences. With the position predictor acting as a decoder, this network can be trained as a Variational Autoencoder, which we call NeatNet.
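The VAE structure described above can be sketched as follows. All weights are random placeholders for trained parameters, and the dimensions (`D_SCENE`, `D_USER`) are hypothetical; the sketch only shows the shape of the computation: encode a scene into the parameters of a distribution over user vectors, sample with the reparameterisation trick, then decode positions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_SCENE, D_USER = 16, 4  # illustrative dimensions

# Random stand-ins for trained encoder/decoder weights.
W_mu = rng.normal(size=(D_USER, D_SCENE)) * 0.1
W_logvar = rng.normal(size=(D_USER, D_SCENE)) * 0.1
W_dec = rng.normal(size=(D_SCENE, D_USER)) * 0.1

def encode(scene_feat):
    """Map a scene encoding to the parameters of q(u | scene)."""
    return W_mu @ scene_feat, W_logvar @ scene_feat

def reparameterise(mu, logvar):
    """Sample a user preference vector u = mu + sigma * eps."""
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

scene = rng.normal(size=D_SCENE)
mu, logvar = encode(scene)
u = reparameterise(mu, logvar)
positions = (W_dec @ u).reshape(-1, 2)  # decoder: 8 objects' (x, y), as a sketch
```

Training then minimises a reconstruction loss on the predicted positions plus the usual KL term on q(u | scene), which is what makes nearby user vectors correspond to similar preferences.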
The architecture of the encoder is shown in more detail below. The input is an example scene arranged by the user, represented as a fully-connected graph of objects. Since each scene can have a variable number of objects, we use graph neural network layers to encode the scene and extract the user's preferences.
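Why graph layers handle a variable number of objects can be seen in a toy message-passing step. This is a generic mean-aggregation layer on a fully-connected graph, not NeatNet's exact layer; weights are random placeholders and the feature size is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10  # node feature size (illustrative)
W_self = rng.normal(size=(D, D)) * 0.1
W_nbr = rng.normal(size=(D, D)) * 0.1

def gnn_layer(X):
    """One message-passing step on a fully-connected scene graph.
    X: (num_objects, D) node features; works for any number of objects."""
    n = X.shape[0]
    nbr_mean = (X.sum(axis=0, keepdims=True) - X) / (n - 1)  # mean of other nodes
    return np.tanh(X @ W_self.T + nbr_mean @ W_nbr.T)

X = rng.normal(size=(5, D))   # a scene with 5 objects
H = gnn_layer(gnn_layer(X))
u_readout = H.mean(axis=0)    # permutation-invariant pooling into one vector
```

The same weights apply whether the scene has 3 objects or 30, and the mean-pooled readout gives a fixed-size summary from which the preference distribution can be predicted.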
The position predictor, shown below, takes as input a graph of the objects for which we want to predict positions, along with the user preference vector. Graph neural network layers are used to predict a tidy position for each object.
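The conditioning mechanism can be sketched as follows: the user vector is broadcast to every object node before position prediction. Again, weights are random stand-ins for trained GNN layers and the dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_SEM, D_USER = 8, 4  # illustrative dimensions

# Random stand-in for the trained output layer of the position predictor.
W_out = rng.normal(size=(2, D_SEM + D_USER)) * 0.1

def predict_positions(semantics, user_vec):
    """Predict a tidy (x, y) for each object, conditioned on user preferences."""
    n = semantics.shape[0]
    feats = np.concatenate([semantics, np.tile(user_vec, (n, 1))], axis=1)
    return feats @ W_out.T  # (n, 2) predicted positions

pos = predict_positions(rng.normal(size=(6, D_SEM)), rng.normal(size=D_USER))
```

Because the object set passed in need not match the set seen at encoding time, the same mechanism supports placing novel objects and arranging entirely new scenes.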
To gather data for evaluating our method, we developed a simulator deployed as a web app. Each user arranges several scenes, placing each object one by one and submitting a tidy arrangement.
We designed four scenes for users to arrange. Two are abstract scenes in which users line up objects, either by shape or by colour. Two are realistic scenes: a dining room and an office. In our experiments, we compare against 10 baseline methods by asking users to score each method's arrangement for tidiness. Below, we discuss visualisations of some interesting results.
Tidying a Known Scene
In this experiment, the user has already provided an example of how they would like this scene to be arranged. However, an individual example often contains noise and imperfections in how the user arranges the scene. NeatNet infers the user's preferences from their example and combines this with prior knowledge from similar users, thereby correcting for noise and producing tidy reconstructions.
Generalising to New Objects
We test whether NeatNet can place a new object which it has never seen before during training. In this case, it must place the largest blue box, which is larger than all objects seen during training.
NeatNet correctly learns that the user lines up objects in order of size, and extrapolates this to place the new object. Furthermore, it places that blue box on the right-hand side because it infers that the user prefers to group objects by shape, rather than by colour.
While this scene is abstract, you can imagine that learning to stack dinner plates or books in order of size is an important spatial reasoning capability for a robot to have.
In the office scene, NeatNet is placing a laptop, which it has never seen before. If a robot placed the laptop in the same way as it placed a desktop computer, it would create an inconvenient arrangement. NeatNet knows that the laptop also shares semantic features with a keyboard, a monitor and a mouse, so it places the laptop in a more convenient way.
Arranging New Scenes
In this experiment, we test whether it is possible to observe how the user arranges one scene, infer their preferences, and use this to predict how they would arrange a new scene. To apply these preferences to the new scene, the network uses its learned knowledge about how "similar" training users arranged that new scene.
NeatNet generates tidy and personalised arrangements, showing that it can learn preferences which transfer across scenes.
The Future of Personalised Rearrangement
We found that taking preferences into account improved the quality of generated arrangements, even for simple scenes. In real-world home environments with hundreds of objects, the space of possible arrangements is far more diverse, leaving much more room for subjectivity. Personalisation is therefore likely to become even more important as robots are deployed in such environments. In the meantime, our follow-up work focuses on methods which learn common-sense intuitions for spatial preferences that are shared across users. This includes techniques for learning directly from vision, accounting for physical constraints, and balancing tidiness cost against the manipulation effort required to achieve the target arrangement.