We present DOME, a novel method for one-shot imitation learning, where a task can be learned from just a single demonstration and then be deployed immediately, without any further data collection or training. DOME does not require prior task or object knowledge, and can perform the task in novel object configurations and with distractors. At its core, DOME uses an image-conditioned object segmentation network followed by a learned visual servoing network, to move the robot's end-effector to the same relative pose to the object as during the demonstration, after which the task can be completed by replaying the demonstration's end-effector velocities. We show that DOME achieves near 100% success rate on 7 real-world everyday tasks, and we perform several studies to thoroughly understand each individual component of DOME.
Key Idea
In our previous work, we introduced the Coarse-to-Fine Imitation Learning method, which enables tasks to be learned from a single demonstration. However, it also requires a period of self-supervised training following the demonstration. Therefore, in this new paper, we developed a method which allows a robot to perform a task immediately after the demonstration, without any further data collection or training.
Starting from an initial configuration, we first manually move the end-effector (EE) to what we call the bottleneck pose. This is the starting point of the demonstration, and conceptually represents the pose the EE needs to reach before beginning any interaction with the object. At the bottleneck we take an image from the wrist-mounted camera.
Then, we perform the demonstration, recording end-effector velocities in the end-effector frame during the process.
We are now ready to deploy our controller.
From a different initial configuration, and with the possible addition of distractors, we deploy DOME immediately after the demonstration.
It first uses a learning-based visual servoing controller to align the live image with the image captured at the bottleneck during the demonstration phase. Doing so brings the end-effector back to the bottleneck pose.
Then, once the bottleneck pose is reached, the EE velocities recorded during the demonstration phase can simply be replayed to complete the task.
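To illustrate why velocities recorded in the end-effector frame can be replayed from any world pose, here is a minimal sketch and not the implementation used in the paper. It assumes each recorded sample is a body-frame twist (linear velocity `v`, angular velocity `w`) held for a fixed time step `dt`; the names `exp_twist` and `replay` are hypothetical. Each twist is integrated by right-multiplying the current pose, so the motion relative to the starting pose is the same wherever that pose lies in the world.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix such that skew(w) @ x equals np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_twist(v, w, dt):
    """Homogeneous transform reached by holding body twist (v, w) for dt seconds."""
    v = np.asarray(v, dtype=float) * dt
    w = np.asarray(w, dtype=float) * dt
    theta = np.linalg.norm(w)
    W = skew(w)
    if theta < 1e-9:
        # First-order approximation for negligible rotation.
        R = np.eye(3) + W
        V = np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * W + B * (W @ W)   # Rodrigues' formula
        V = np.eye(3) + B * W + C * (W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

def replay(T_start, twists, dt):
    """Integrate recorded end-effector-frame twists starting from pose T_start."""
    T = np.array(T_start, dtype=float)
    for v, w in twists:
        # Right-multiplication applies the motion in the end-effector frame.
        T = T @ exp_twist(v, w, dt)
    return T
```

Because the increments are applied in the end-effector frame, `np.linalg.inv(T_start) @ replay(T_start, twists, dt)` does not depend on `T_start`. This is the property that lets the demonstrated interaction transfer once the bottleneck pose relative to the object has been reached.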
Networks and Training
Illustration of our learning-based visual servoing controller
Learning-Based Visual Servoing Controller
Our learning-based visual servoing controller is tasked to bring the end-effector back to the bottleneck pose during deployment.
It is composed of an image-conditioned object segmentation network which isolates the object of interest, and a learned visual servoing network that outputs a control command aligning the bottleneck and live images.
Once the bottleneck and live images are fully aligned, we are back at the bottleneck pose.
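The closed loop described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's code: `segment` and `servo_net` are hypothetical callables standing in for the two networks, the bottleneck image is assumed to be segmented in the same way as the live image, and the loop is assumed to stop once the commanded velocity becomes negligible. The `gain` parameter and the toy stand-ins in `toy_run` are likewise made up for illustration.

```python
import numpy as np

def servo_to_bottleneck(get_live_image, send_velocity, segment, servo_net,
                        bottleneck_image, gain=1.0, tol=1e-3, max_steps=500):
    """Run segmentation + visual servoing until the images are aligned.

    Returns the number of velocity commands sent, or None if the loop
    did not converge within max_steps.
    """
    # Assumption: the bottleneck image is segmented once, conditioned on itself.
    target_mask = segment(bottleneck_image, bottleneck_image)
    for step in range(max_steps):
        live_image = get_live_image()
        # Isolate the object of interest in the live image.
        live_mask = segment(bottleneck_image, live_image)
        # Predict a control command that reduces the misalignment.
        command = np.asarray(servo_net(target_mask, live_mask), dtype=float)
        if np.linalg.norm(command) < tol:
            return step  # aligned: the end-effector is back at the bottleneck
        send_velocity(gain * command)
    return None

def toy_run(gain=1.0, dt=0.1):
    """Toy 2D stand-in: the 'image' is simply the end-effector position."""
    target = np.array([0.0, 0.0])
    state = {"pose": np.array([0.3, -0.2])}

    def get_live_image():
        return state["pose"].copy()

    def send_velocity(velocity):
        state["pose"] = state["pose"] + dt * velocity

    def segment(conditioning_image, image):
        return image  # identity stand-in for the segmentation network

    def servo_net(target_mask, live_mask):
        return target_mask - live_mask  # stand-in: command points at the target

    steps = servo_to_bottleneck(get_live_image, send_velocity, segment,
                                servo_net, target, gain=gain)
    return steps, float(np.linalg.norm(state["pose"] - target))
```

In this toy setting, calling `toy_run(gain=4.0)` reaches the target in fewer steps than `toy_run(gain=1.0)`, since a larger gain scales up each commanded velocity.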
We train our controller entirely in simulation, without assuming any prior object or task knowledge. Our networks are simply trained on a very diverse set of objects, allowing us to achieve one-shot imitation learning immediately after the demonstration.
Training data examples
Results, Ablations and Future Work
We tested our method on 7 everyday tasks, as illustrated in the videos below. We are able to learn these tasks from just a single demonstration, and perform them immediately after the demonstration, without any real-world training or data collection. We show that DOME achieves near 100% success rate on all tasks, even in the presence of distractor objects.
Our baselines consist of Coarse-To-Fine imitation learning (CTF), Residual Reinforcement Learning (RRL) and Behavioural Cloning (BC).
In our experiments, we match or outperform all three baselines on all tasks, despite the fact that DOME is the only method that does not require any real-world data collection or training aside from the single demonstration.
Our baselines during deployment
DOME does not require any real world training or data gathering aside from the single demonstration
Illustration of the variable speeds of execution experiment
We perform a series of ablation studies and analysis experiments to understand the behaviour and performance of the different components of our controller in isolation.
For our image-conditioned object segmentation network, we show that the exact network architecture seems to have a minor impact on performance compared to the size and quality of the simulated training data.
For our learned visual servoing network, we show its dependence on a good-quality segmentation as input.
For our overall controller, we study how it responds to various speeds of execution, as illustrated in the video on the left. In this experiment, we increase the gains applied, resulting in higher visual servoing speeds, and observe that this does not seem to affect the overall performance of our controller.
We are very pleased with the performance achieved by DOME, which we believe takes us a step closer to the promise of practical one-shot imitation learning. As such, we are excited by the prospects of even further improving upon the method. As it stands, its main limitation is that we cannot change the object that is being manipulated between demonstration and deployment, an interesting challenge which is also the natural avenue for future work.