Where To Start? Transferring Simple Skills to Complex Environments
Robot learning provides a number of ways to teach robots simple skills, such as grasping. However, these skills are usually trained in open, clutter-free environments, and therefore would likely cause undesirable collisions in more complex, cluttered environments. In this work, we introduce an affordance model based on a graph representation of an environment, which is optimised during deployment to find suitable robot configurations to start a skill from, such that the skill can be executed without any collisions. We demonstrate that our method can generalise a priori acquired skills to previously unseen cluttered and constrained environments, in simulation and in the real world, for both a grasping and a placing task.
Key Idea: Transfer simple manipulation skills to complex environments, by finding starting robot configurations from which those skills would be successful without colliding with the surrounding obstacles.
Simple Grasping and Placing Skills in Complex Environments
Simple manipulation skills such as grasping or placing are at the core of many robotic applications. Acquiring such skills and ensuring they don't result in undesired collisions in complex and cluttered environments can be extremely difficult and data inefficient. Therefore, they are usually obtained in open and clutter-free scenes. However, naively deploying manipulation skills acquired in such a way in cluttered environments can lead to undesired collisions and a drop in overall performance.
In this work, we argue that simple manipulation skills acquired in open environments can still perform well and not cause undesired collisions in complex scenes if they are started from suitable configurations.
We find these suitable starting configurations using an affordance model that predicts if a skill would succeed and not collide with the environment if started from a specific configuration. This approach allows us to retain the high performance of skills acquired in open scenes without altering them in any way.
As you can see in the figure above, our proposed method consists of three steps, with a learnt affordance model at the heart of it. First, the robot observes the scene using head and wrist-mounted cameras. Then, it finds a suitable configuration to start an a priori learnt skill (in this case, placing a small yellow bag into the white bowl) by optimising the learnt affordance model. From this configuration, the robot should be able to complete the task without colliding with obstacles. The starting configuration is then reached via kinematic planning. Finally, from this point, the placing skill can be executed, completing the task without any collisions with the surrounding obstacles.
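The three steps described above can be sketched as a small control loop. This is only an illustrative outline, not our actual implementation: the function name and all five callables (observe, find_start, plan_path, follow_path, run_skill) are hypothetical placeholders for the perception, affordance optimisation, kinematic planning and skill execution components.

```python
from typing import Any, Callable, Optional, Sequence

Config = Sequence[float]   # a robot configuration (hypothetical representation)
Path = Sequence[Config]    # a collision-free path as a list of configurations


def run_skill_from_suitable_start(
    observe: Callable[[], Any],
    find_start: Callable[[Any], Config],
    plan_path: Callable[[Any, Config], Optional[Path]],
    follow_path: Callable[[Path], None],
    run_skill: Callable[[], bool],
) -> bool:
    """Observe, pick a start configuration, reach it, then run the skill."""
    # Step 1: observe the scene and optimise the affordance model
    # to find a configuration to start the skill from.
    scene = observe()
    start = find_start(scene)

    # Step 2: reach that configuration with a kinematic planner.
    path = plan_path(scene, start)
    if path is None:
        # No collision-free path was found, so the skill is not started.
        return False
    follow_path(path)

    # Step 3: execute the unmodified, a priori acquired skill.
    return run_skill()
```

Keeping the skill behind a plain callable reflects the main design choice: the skill itself is never altered, only the configuration it starts from.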
Considered Manipulation Skills
We test our approach with two common robotic manipulation skills -- grasping and placing -- although our method is not limited to these particular skills and can be used with any manipulation skill! The goal of the grasping skill is to grasp novel objects stably, while the placing skill aims to place a small held object into various types of containers. We acquire these skills in simulated environments using Behaviour Cloning and point cloud observations.
Graph-Based Affordance Model
Our method relies on the affordance model, whose role is to predict if a particular skill would be successful and not collide with the surrounding obstacles starting from a specific configuration. We learn this affordance model using data collected by trying out a priori acquired skills in complex simulated environments. To make this task easier, we exploit the underlying structure of this problem and jointly represent the target (the object that is to be manipulated), obstacles and the robot as a heterogeneous graph. Here you can see how we create this graph representation.
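To give a rough idea of what a heterogeneous graph over the target, obstacles and robot could look like in code, here is a minimal sketch. It makes several assumptions that are not taken from the paper: each node is just a 3D point, the edge types and their directions are chosen for illustration, and every node of one type is connected to every node of the other. The names HeteroGraph and build_scene_graph are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
EdgeType = Tuple[str, str, str]  # (source node type, relation, destination node type)


@dataclass
class HeteroGraph:
    """Nodes grouped by type, edges grouped by (source, relation, destination)."""
    nodes: Dict[str, List[Point]] = field(default_factory=dict)
    edges: Dict[EdgeType, List[Tuple[int, int]]] = field(default_factory=dict)


def build_scene_graph(
    target: Sequence[Point],
    obstacles: Sequence[Point],
    robot: Sequence[Point],
) -> HeteroGraph:
    """Jointly represent target, obstacles and robot in one typed graph."""
    graph = HeteroGraph(
        nodes={
            "target": list(target),
            "obstacle": list(obstacles),
            "robot": list(robot),
        }
    )
    # Assumed connectivity: dense edges between these pairs of node types.
    for src, dst in [("robot", "target"), ("robot", "obstacle"), ("obstacle", "target")]:
        src_ids = range(len(graph.nodes[src]))
        dst_ids = range(len(graph.nodes[dst]))
        graph.edges[(src, "to", dst)] = list(product(src_ids, dst_ids))
    return graph
```

Separating nodes and edges by type lets a model treat robot-obstacle relations differently from robot-target relations, which is the structure the joint representation is meant to expose.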
Three Steps to Success – Grasping
Now let's see all three steps of our approach in action in a scenario where the robot needs to grasp a bottle in a cluttered environment. In the first step, the affordance model is used to find a suitable starting configuration for an already-acquired grasping skill. The second step uses kinematic planning to reach the found configuration without colliding with the environment. Finally, in the third step, the grasping skill is executed, completing the task.
*For step 1, we are visualising intermediate steps of gradient-based optimisation.
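The gradient-based optimisation in step 1 can be illustrated with a toy example. The sketch below is not our learnt model: toy_affordance is a hand-written, hypothetical stand-in that scores a 2D configuration higher when it is close to the target and lower when it is close to an obstacle, and the gradient is estimated with finite differences rather than backpropagation. The step size, iteration count and scoring constants are arbitrary assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vec = Sequence[float]


def toy_affordance(q: Vec, target: Vec, obstacles: Sequence[Vec]) -> float:
    """Hypothetical stand-in for a learnt affordance model, output in (0, 1)."""
    logit = 3.0 - math.dist(q, target) ** 2
    for obs in obstacles:
        # Each obstacle lowers the score in a small neighbourhood around it.
        logit -= 4.0 * math.exp(-math.dist(q, obs) ** 2 / 0.1)
    return 1.0 / (1.0 + math.exp(-logit))


def numerical_gradient(score: Callable[[Vec], float], q: Vec, eps: float = 1e-4) -> List[float]:
    """Central finite-difference estimate of d(score)/dq."""
    grad = []
    for i in range(len(q)):
        hi, lo = list(q), list(q)
        hi[i] += eps
        lo[i] -= eps
        grad.append((score(hi) - score(lo)) / (2.0 * eps))
    return grad


def optimise_start(
    score: Callable[[Vec], float], q0: Vec, step: float = 0.2, iters: int = 200
) -> Tuple[List[float], float, List[List[float]]]:
    """Gradient ascent on the score; returns the best configuration seen."""
    q = list(q0)
    best_q, best_s = list(q), score(q)
    trajectory = [list(q)]  # intermediate steps, e.g. for visualisation
    for _ in range(iters):
        grad = numerical_gradient(score, q)
        q = [qi + step * gi for qi, gi in zip(q, grad)]
        trajectory.append(list(q))
        s = score(q)
        if s > best_s:
            best_q, best_s = list(q), s
    return best_q, best_s, trajectory
```

For example, optimise_start(lambda q: toy_affordance(q, (0.0, 0.0), [(0.8, 0.5)]), (1.5, 1.0)) returns a configuration with a higher score than the initial guess, together with the intermediate configurations visited along the way.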
Three Steps to Success – Placing
Here, the same procedure is followed to complete another of our considered manipulation tasks - placing a held object into different types of containers, in this case, a bowl.
*For step 1, we are visualising intermediate steps of gradient-based optimisation.
Evaluation in Simulation
We evaluate our method in six types of randomised simulated environments that you can see below, each with inherently different structures and increasing complexity. Using these randomised scenes, we test our method using two manipulation skills - grasping and placing. Additionally, we compare the performance of our method to single-step end-to-end skills trained directly in cluttered environments and alternative approaches to predicting suitable starting configurations for already-acquired skills.
Grasping Unseen Objects in Complex Simulated Environments
With our method, a grasping skill that was trained in a clutter-free tabletop environment can be successfully transferred to cluttered environments and grasp various objects without causing collisions.
While the performance of skills acquired directly in cluttered environments drops significantly with the increasing complexity of the environment, our method is capable of maintaining high performance without the need to alter already-acquired skills (as shown in the left plot). The same trend is also evident when we compare our method against alternative ways of predicting suitable starting configurations for already-acquired skills (the right plot).
Placing Objects into Unseen Containers in Complex Simulated Environments
Here we use our method with a previously described placing skill. This skill was also trained only in clutter-free tabletop environments, but it can complete the task even in cluttered scenes because our method positions the robot in a suitable way before the skill is executed.
Once again, our method copes with scenes of different complexity, and the same trends seen for the grasping skill are evident for the placing skill in all of the evaluation environments.
Real World Deployment
Although our method was trained solely in simulation, we used realistic and noisy point cloud observations, allowing us to deploy it directly on a real robot. We did so for both of our considered skills (grasping and placing) in three types of environments with different levels of complexity, using five unseen objects per skill, which you can see below.
Simple Grasping Skill in Complex Real-World Environments
Simple Placing Skill in Complex Real-World Environments
Of course, our method is not perfect, and we did have to make some assumptions and compromises (all of them can be found in the main paper). Because of this, sometimes, using our method can lead to failure to complete the task or cause undesired collisions, as you can see below. However, we are excited about the potential of our approach and will work to improve it!
In this work, we proposed a method for the robotic manipulation of objects in cluttered and constrained scenes, where collision with the environment is undesired. Our approach combines a priori acquired skills with an affordance model of our own design. Optimising this affordance model during deployment allows us to find suitable configurations from which a given robotic manipulation skill can be started and completed successfully without colliding with the environment. To learn this affordance model efficiently, we introduced a novel, heterogeneous graph-based representation that jointly captures information about the target object, scene obstacles, and the robot itself in a structured way. We showed that our method outperforms various baselines, generalises to unseen cluttered and constrained scenes, and transfers from simulation to reality.