
One-Shot Imitation Learning: A Pose Estimation Perspective
Pietro Vitiello*, Kamil Dreczkowski* and Edward Johns
* joint first authorship
Published at CoRL 2023
[BibTex]
Abstract
In this paper, we study imitation learning under the challenging setting when there is (1) only a single demonstration, (2) no further data collection, and (3) no prior task knowledge. We show how, with these constraints, imitation learning can be formulated as a combination of unseen object pose estimation and trajectory transfer. To explore this idea, we provide an in-depth study on how state-of-the-art unseen object pose estimators perform for one-shot imitation learning on ten real-world tasks, and we take a deep dive into the effects of calibration and pose estimation errors on task success rates.
Video
Key Idea We study how unseen object pose estimation can be used to transfer a trajectory from a single demonstration to a novel scene. Through a series of investigations, we analyse the key factors influencing this process and showcase the expected performance of state-of-the-art pose estimators on real-world robotics tasks.
Project Summary
Imitation Learning (IL) can be a convenient and intuitive approach to teach a robot how to perform a task. However, many of today's methods for learning vision-based policies require tens to hundreds of demonstrations per task.
Here we take a look at one-shot imitation learning, where we assume (1) only a single demonstration, (2) no further data collection following the demonstration, and (3) no prior task knowledge.
The best way to leverage the single available demonstration is to transfer the demonstrated trajectory to any novel target scene.
When manipulating an object, what matters is the relative pose of the end-effector with respect to the object. As a result, transferring the trajectory becomes a problem of understanding how the object moved between the demonstration and the novel scene.
Trajectory Transfer
With only a single demonstration and no prior knowledge about the object the robot is interacting with, the optimal imitation is one where the robot and the object are aligned in the same way as during the demonstration. But without any prior knowledge about the object, such as a 3D object model, the reasoning required by the robot distils down to an unseen object pose estimation problem: the robot must infer the relative pose between its current observation of the object and its observation during the demonstration, and use it to adapt the demonstrated behaviour.
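To make the trajectory-transfer step concrete, here is a minimal numpy sketch; it is our own illustration rather than the authors' released code, and the function name and 4x4 homogeneous-matrix convention are assumptions. Given the relative object pose returned by an unseen object pose estimator, every demonstrated end-effector pose is pre-multiplied by it, which preserves the end-effector's pose relative to the object.

```python
import numpy as np

def transfer_trajectory(T_ee_demo, T_obj_rel):
    """Map a demonstrated end-effector trajectory to a novel scene.

    T_ee_demo : list of 4x4 end-effector poses (world frame) recorded
                during the demonstration.
    T_obj_rel : 4x4 transform describing how the object moved between
                the demonstration and the novel scene, i.e. the output
                of an unseen object pose estimator.
    """
    # Keeping the end-effector pose fixed relative to the object means
    # applying the object's motion to every demonstrated waypoint:
    # T_ee_new = T_obj_new @ inv(T_obj_demo) @ T_ee_demo
    #          = T_obj_rel @ T_ee_demo
    return [T_obj_rel @ T_ee for T_ee in T_ee_demo]
```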
We demonstrate the capabilities of such a framework on ten real-world robotics tasks, ranging from plug insertion to stacking bowls. Below you can find examples of demonstrations and deployments for each task.

By formulating one-shot imitation learning from the perspective of pose estimation, we are able to break down the problem into individual challenges.
Specifically, we investigate four main factors of this formulation when working with real-world scenes: (1) the effect of errors in extrinsic camera calibration on task success rates, (2) the effect of errors in unseen object pose estimation on task success rates, (3) how a range of unseen object pose estimators perform on simulated and real-world tasks, and (4) the effect that changes in viewpoint between demonstration and deployment have on spatial generalisation.
Through a set of empirically defined mappings, we were able to evaluate how errors in camera calibration and pose estimation directly correlate with task success rates in the real world.
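As an illustration of what such a study involves, the sketch below perturbs a pose by a translation error of a chosen magnitude in a random direction and a rotation error of a chosen magnitude about a random axis; sweeping these magnitudes and replaying the transferred trajectory yields an empirical error-to-success-rate curve. The helper `perturb_pose` is our own hypothetical illustration, not the paper's code.

```python
import numpy as np

def perturb_pose(T, trans_err_m, rot_err_deg, rng=None):
    """Return a copy of the 4x4 pose T corrupted by a translation error
    of magnitude trans_err_m (metres) in a random direction and a
    rotation error of rot_err_deg (degrees) about a random axis."""
    rng = np.random.default_rng() if rng is None else rng
    t_dir = rng.normal(size=3)
    t_dir /= np.linalg.norm(t_dir)
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rot_err_deg)
    # Rodrigues' formula turns (axis, angle) into a rotation matrix.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R_err = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    T_noisy = T.copy()
    T_noisy[:3, :3] = R_err @ T[:3, :3]
    T_noisy[:3, 3] = T[:3, 3] + trans_err_m * t_dir
    return T_noisy
```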
The plot on the right shows how the various error types and magnitudes affect the success rates of the ten real-world tasks we consider, as well as the mean success rate across tasks with the corresponding standard deviation.
These results show that, for all tasks, the success rate is more sensitive to errors in pose estimation than to errors in camera calibration. They also illustrate that rotation errors have a more pronounced effect on task success rates than translation errors.

After having discussed how tolerant real-world tasks are to these errors, another natural question is how accurately state-of-the-art unseen object pose estimators actually perform. We therefore benchmark a range of estimators on both simulated and real-world tasks.
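For reference, benchmarks of this kind typically report a translation error and a geodesic rotation error between the estimated and ground-truth relative poses. A minimal sketch of these two standard metrics follows; it is our own illustration of the common formulas, not code from the paper.

```python
import numpy as np

def pose_errors(T_est, T_gt):
    """Translation error (metres) and geodesic rotation error (degrees)
    between an estimated and a ground-truth 4x4 relative pose."""
    t_err = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])
    R_delta = T_est[:3, :3] @ T_gt[:3, :3].T
    # trace(R) = 1 + 2*cos(theta); clipping guards against numerical drift.
    cos_theta = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
    r_err = np.degrees(np.arccos(cos_theta))
    return t_err, r_err
```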

Through these four investigations, we reveal new insights into what factors make one-shot imitation learning so hard. But whilst our focus is a deep dive into these challenges, our investigation ultimately leads us to a new imitation learning framework with very encouraging performance. With real-world experiments on ten everyday tasks, such as inserting a plug into a socket, scooping a toy egg, and putting a plate into a dishwasher, we show how modelling one-shot imitation learning from the perspective of unseen object pose estimation achieves an 84% success rate. Below you can see illustrations of all the demonstrations and example rollouts from the framework.