
Coarse-to-Fine Imitation Learning:
Robot Manipulation from a Single Demonstration

Edward Johns

Published at ICRA 2021

[Link to Paper]

[BibTex]     [Code]

5-minute Summary


We introduce a simple new method for visual imitation learning, which allows a novel robot manipulation task to be learned from a single human demonstration, without requiring any prior knowledge of the object being interacted with. Our method models imitation learning as a state estimation problem, with the state defined as the end-effector's pose at the point where object interaction begins, as observed from the demonstration. By then modelling a manipulation task as a coarse, approach trajectory followed by a fine, interaction trajectory, this state estimator can be trained in a self-supervised manner, by automatically moving the end-effector's camera around the object. At test time, the end-effector moves to the estimated state through a linear path, at which point the original demonstration's end-effector velocities are simply replayed. This enables convenient acquisition of a complex interaction trajectory, without actually needing to explicitly learn a policy. Real-world experiments on 8 everyday tasks show that our method can learn a diverse range of skills from a single human demonstration, whilst also yielding a stable and interpretable controller.


6-Minute Summary

Key Idea  Training a policy end-to-end requires too many demonstrations (behavioural cloning), or too much manual environment resetting (reinforcement learning). Instead, move a wrist-mounted camera around the object and train a pose estimator with self-supervised learning, which predicts the end-effector's pose at the start of the demonstration. During testing, the robot moves to this pose in a straight line (coarse), and then simply replays the demonstration's end-effector velocities (fine). This requires just one demonstration, no reinforcement learning, and no prior object knowledge.

Here's our robot learning novel tasks from a single demonstration:


(real time)

Hammering in a nail



(2 x speed)


Scooping up a bag


Opening a lid


The problem with imitation learning today.

Imitation learning is a convenient way to teach robots new skills, by providing a human demonstration. Not only does it give us a natural means to communicate the objective of a task (e.g. the final state of the environment after the demonstration), but it also provides hints as to how to perform that task (e.g. the actions taken during the demonstration). And ideally, imitation learning methods should allow anybody to be able to teach a robot a new skill, without that person needing to know about the underlying algorithm. A future domestic robot, for example, should be able to learn from its owner, rather than only learning from engineers in the factory, or scientists in the lab.

However, most imitation learning methods today are not well suited to robots learning in unstructured, everyday environments, with demonstrations from everyday people. We often see methods that require a significant amount of manual supervision during training, such as a large number of demonstrations, or repeated environment resetting (see to the right). Not only is this very demanding of the human, but it also requires the human to understand the algorithm well enough to provide appropriate assistance. And whilst there are existing methods that claim to learn from just a single demonstration, they usually require substantial prior training on tasks which are similar to the task being learned, and so often the new task is not really that "new".


As such, imitation learning has typically not been very practical so far. And one of the challenges is that most methods today rely on end-to-end policy learning which, although an intriguing concept and tenaciously trendy, is very data inefficient, and cannot generalise well to new tasks outside of the training dataset.


Reinforcement learning is a popular method for robot learning, and can be combined with imitation learning. However, when applied to real-world, everyday tasks, it usually requires the environment to be repeatedly reset. Simulation environments (e.g. Gym) often require thousands of episodes of simulated resetting, but doing that by hand in the real world is very tiring for humans!

A different, simpler approach.

Therefore, the goal of this project was to develop a method which addresses two important criteria for imitation learning: a method which (1) can learn genuinely novel tasks, and (2) can do this just from a single human demonstration. The idea was to eliminate the need for explicit, end-to-end policy learning, and introduce some structure to the problem such that machine learning is only used for the parts which really need machine learning. The resulting framework we developed is very simple, and yet performs surprisingly well on the two above criteria.


The crux of the method is as follows: if, at test time, the relative pose between the robot's end-effector and the object is the same as during the demonstration, then the demonstration's end-effector velocities can simply be replayed in an open loop, without actually having to explicitly learn a policy for object interaction. This is behavioural cloning in its purest form. But how can we align the robot at this pose relative to the object, when the object is novel and thus does not come with a pre-trained pose estimator?
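To make the replay idea concrete, here is a minimal planar sketch. It is not the paper's implementation: it assumes the recorded velocities are expressed in the end-effector's own frame, uses simple Euler integration, and the names `replay_velocities` and `to_local` are hypothetical. It shows that replaying the same velocities from two different starting poses produces the same motion relative to each start, which is why aligning the start pose with the object is enough.

```python
import math

def replay_velocities(start_pose, body_velocities, dt):
    """Integrate end-effector-frame velocities open loop from start_pose.

    start_pose: (x, y, theta) of the end-effector in the world frame.
    body_velocities: list of (vx, vy, omega) in the end-effector frame (assumption).
    Returns the list of poses visited, including the start.
    """
    x, y, theta = start_pose
    poses = [(x, y, theta)]
    for vx, vy, omega in body_velocities:
        # Rotate the body-frame linear velocity into the world frame.
        x += (math.cos(theta) * vx - math.sin(theta) * vy) * dt
        y += (math.sin(theta) * vx + math.cos(theta) * vy) * dt
        theta += omega * dt
        poses.append((x, y, theta))
    return poses

def to_local(pose, reference):
    """Express a world-frame pose relative to a reference pose."""
    dx, dy = pose[0] - reference[0], pose[1] - reference[1]
    c, s = math.cos(reference[2]), math.sin(reference[2])
    return (c * dx + s * dy, -s * dx + c * dy, pose[2] - reference[2])

# Made-up velocities standing in for a recorded demonstration.
demo_velocities = [(0.1, 0.0, 0.2), (0.1, 0.05, 0.0), (0.0, -0.1, -0.3)]

# Replay from the demonstration's start pose and from a shifted, rotated start.
demo_path = replay_velocities((0.0, 0.0, 0.0), demo_velocities, dt=0.5)
test_path = replay_velocities((1.0, -2.0, 1.2), demo_velocities, dt=0.5)

# Relative to their own starting poses, the two motions coincide.
demo_local = [to_local(p, demo_path[0]) for p in demo_path]
test_local = [to_local(p, test_path[0]) for p in test_path]
```

If the object moves and the end-effector's starting pose moves with it, the replayed motion stays the same relative to the object.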


We addressed this by introducing the concept of the object's bottleneck. This is defined as the pose of the end-effector at the point where object interaction should begin, as observed from the demonstration. For example, in the image on the left, the bottleneck is the end-effector's pose just before the robot begins to open the lid of the box. Then, instead of estimating the pose of the object, the robot estimates the pose of the bottleneck: a "virtual" frame, representing where the robot should be in the future.


The robot can then move in a straight line towards the bottleneck (the blue arrow). This is a coarse trajectory, since the specific motion is not important. Once at the bottleneck, the robot then replays the original demonstration's end-effector velocities (the pink arrow). This is a fine trajectory, since the specific motion is important. Together, these two stages form our framework: Coarse-to-Fine Imitation Learning.
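The two stages can be sketched as follows. This is an illustrative simplification under stated assumptions, not the controller from the paper: it handles 3D positions only (orientation is omitted for brevity), and `linear_path`, `replay`, `coarse_to_fine` and all numeric values are hypothetical.

```python
def linear_path(start, goal, num_steps):
    """Coarse stage: evenly spaced waypoints on the straight line start -> goal."""
    return [
        tuple(s + (g - s) * k / num_steps for s, g in zip(start, goal))
        for k in range(1, num_steps + 1)
    ]

def replay(start, velocities, dt):
    """Fine stage: open-loop replay of recorded velocities, starting from start."""
    path, current = [], start
    for v in velocities:
        current = tuple(c + vi * dt for c, vi in zip(current, v))
        path.append(current)
    return path

def coarse_to_fine(current_position, estimated_bottleneck, demo_velocities, dt, num_steps=10):
    """Move in a straight line to the bottleneck, then replay the demonstration."""
    coarse = linear_path(current_position, estimated_bottleneck, num_steps)
    fine = replay(coarse[-1], demo_velocities, dt)
    return coarse, fine

# Made-up values: the bottleneck would come from the pose estimator,
# and the velocities from the single recorded demonstration.
coarse, fine = coarse_to_fine(
    current_position=(0.0, 0.0, 0.5),
    estimated_bottleneck=(0.3, 0.1, 0.2),
    demo_velocities=[(0.0, 0.0, -0.05), (0.02, 0.0, 0.0)],
    dt=1.0,
)
```

Only the end point of the coarse stage matters, so any collision-free path to the bottleneck would do in its place. The fine stage is taken directly from the demonstration.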

Training the bottleneck pose estimator is self-supervised, and involves automatically moving a wrist-mounted camera around the object, to build up a dataset of images and end-effector poses. This is shown on the right. A simple neural network can then be trained with regression to predict the bottleneck pose. In this way, the only exploration that the robot does is in the free space above the object. There is no exploration during object interaction, and therefore, no need for us to be repeatedly resetting the environment.
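The data collection step might look roughly like the sketch below. It is planar for brevity and rests on assumptions not stated on this page: each image is labelled with the bottleneck pose expressed in the current end-effector frame, poses are sampled uniformly around the bottleneck, and `capture_image` is a hypothetical stand-in for the wrist camera. Training the regression network on the resulting pairs is left out.

```python
import math
import random

def bottleneck_in_ee_frame(ee_pose, bottleneck_pose):
    """Bottleneck pose expressed in the current end-effector frame (planar)."""
    dx = bottleneck_pose[0] - ee_pose[0]
    dy = bottleneck_pose[1] - ee_pose[1]
    c, s = math.cos(ee_pose[2]), math.sin(ee_pose[2])
    return (c * dx + s * dy, -s * dx + c * dy, bottleneck_pose[2] - ee_pose[2])

def capture_image(ee_pose):
    """Hypothetical stand-in for grabbing a wrist-camera frame at ee_pose."""
    return {"taken_at": ee_pose}

def collect_dataset(bottleneck_pose, num_samples, max_offset=0.1, max_angle=0.5, seed=0):
    """Sample end-effector poses near the bottleneck and label them automatically."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_samples):
        ee_pose = (
            bottleneck_pose[0] + rng.uniform(-max_offset, max_offset),
            bottleneck_pose[1] + rng.uniform(-max_offset, max_offset),
            bottleneck_pose[2] + rng.uniform(-max_angle, max_angle),
        )
        image = capture_image(ee_pose)
        # The label needs no human input: both poses are known from robot kinematics.
        label = bottleneck_in_ee_frame(ee_pose, bottleneck_pose)
        dataset.append((image, label))
    return dataset

dataset = collect_dataset(bottleneck_pose=(0.4, 0.0, 0.0), num_samples=100)
```

The bottleneck pose is recorded once during the demonstration, and the robot always knows its own end-effector pose. Every label can therefore be computed without human annotation, which is what makes the training self-supervised.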



We tested this method on 8 everyday tasks, as shown in the video below. Each task is entirely novel, and each was provided with only a single demonstration. For 3 of the tasks (placing a top on a bottle, lifting up the lid of a box, scooping up a bag), the robot succeeded 100% of the time over 20 trials. And across all tasks, the success rate was 70%. Failures tended to occur either where the task required very high precision (e.g. inserting a knife into a thin slot), or where the object's colour and shape were not well suited to accurate pose estimation with simple regression (e.g. objects with uniform colour).

But overall, we were surprised at how effective this simple imitation learning method can be. Furthermore, the controller is analytical, stable, and interpretable, which typically cannot be said of the dominant imitation learning methods today, which often rely on black-box end-to-end policy learning.

The videos below show examples of our robot performing these tasks during testing (2 x speed).


What's next?

Whilst we were happy with these results, there's still a long way to go to reach a 100% success rate on all tasks. But we can isolate the two areas that need improving. First is the pose estimation. Whilst training a neural network to predict a pose is quick and easy, there are more sophisticated approaches which will likely yield better results, such as using 3D computer vision. Second is the object interaction. Our method relies on the idea that replaying the demonstration velocities from the bottleneck is sufficient. However, this assumes that alignment with the bottleneck is perfect, whereas in practice, it never will be. Therefore, introducing closed-loop control during object interaction is an important future step. But how this can be done with just a single demonstration for genuinely novel tasks is an open question.

To learn more, please read the paper, and watch the video.
