Abstract
In this paper, we study imitation learning under the challenging setting where there is (1) only a single demonstration, (2) no further data collection, and (3) no prior task knowledge. We show how, under these constraints, imitation learning can be formulated as a combination of unseen object pose estimation and trajectory transfer. To explore this idea, we provide an in-depth study of how state-of-the-art unseen object pose estimators perform for one-shot imitation learning on ten real-world tasks, and we take a deep dive into the effects of calibration and pose estimation errors on task success rates.
Key Idea
We study how unseen object pose estimation can be used to transfer a trajectory from a single demonstration to a novel scene. Through a series of investigations, we analyse the key factors influencing this process and showcase the expected performance of state-of-the-art pose estimators on real-world robotics tasks.
Imitation Learning (IL) can be a convenient and intuitive approach to teach a robot how to perform a task. However, many of today's methods for learning vision-based policies require tens to hundreds of demonstrations per task.
Here we take a look at one-shot imitation learning, where we assume (1) only a single demonstration, (2) no further data collection following the demonstration, and (3) no prior task knowledge.
The best way to make use of the single available demonstration is to transfer the demonstrated trajectory to the novel scene that is observed at test time.
When manipulating an object, what matters is the relative pose of the end-effector with respect to the object. As a result, transferring the trajectory becomes a problem of understanding how the object has moved between the demonstration scene and the deployment scene.
Trajectory Transfer
But without any prior knowledge about the object, such as a 3D object model, the reasoning required by the robot distils down to an unseen object pose estimation problem: the robot must infer the relative pose between its current observation of the object and its observation during the demonstration, and use it to adapt the demonstrated behaviour.
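To make this concrete, below is a minimal sketch of the trajectory-transfer step, assuming all poses are available as 4x4 homogeneous matrices expressed in a common robot frame; the function and variable names are hypothetical rather than taken from our implementation.

```python
import numpy as np

def transfer_trajectory(T_obj_demo, T_obj_live, ee_traj_demo):
    """Map a demonstrated end-effector trajectory to a new scene.

    T_obj_demo  : 4x4 object pose at demonstration time (robot frame).
    T_obj_live  : 4x4 object pose at deployment time, e.g. obtained by
                  combining an unseen object pose estimate with the
                  extrinsic camera calibration.
    ee_traj_demo: list of 4x4 end-effector poses recorded during the demo.
    """
    # Rigid transform describing how the object moved between the two scenes.
    T_delta = T_obj_live @ np.linalg.inv(T_obj_demo)

    # Applying the same transform to every waypoint keeps the end-effector
    # pose relative to the object identical to the demonstration.
    return [T_delta @ T_ee for T_ee in ee_traj_demo]
```

Executing the transformed waypoints then reproduces the demonstrated interaction with the object in its new pose.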
We explain how this is done in practice with the diagram below.
We demonstrate the capabilities of such a framework on ten real-world robotics tasks, ranging from plug insertion to stacking bowls. Below you can find examples of demonstrations and deployments for each of the tasks.
Investigations
By formulating one-shot imitation learning from the perspective of pose estimation, we are able to break down the problem into individual challenges.
Specifically, we investigate four main aspects of this formulation in real-world scenes: (1) the effect of errors in extrinsic camera calibration on task success rates, (2) the effect of errors in unseen object pose estimation on task success rates, (3) a benchmark of various unseen object pose estimators on simulated and real-world tasks, and (4) the effect that changes in viewpoint between demonstration and deployment have on spatial generalisation.
Analysis of Task Tolerance to Errors
Through a set of empirically defined mappings, we evaluate how errors in camera calibration and pose estimation translate into task success rates in the real world.
The plot on the right shows how the various error types and magnitudes affect the success rates of the ten real-world tasks we consider, as well as the mean success rate across tasks with its standard deviation.
These results show that, for all tasks, the success rate is more sensitive to errors in pose estimation than to errors in camera calibration. They also illustrate that rotation errors have a more pronounced effect on task success rates than translation errors.
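As an illustration of the kind of perturbation this analysis involves (a sketch under our own assumptions, not the exact procedure used in the paper), the snippet below corrupts a ground-truth pose with a translation error of a chosen magnitude along a random direction and a rotation error of a chosen angle about a random axis.

```python
import numpy as np

def perturb_pose(T, trans_err_m, rot_err_rad, rng=None):
    """Inject a fixed-magnitude translation/rotation error into a 4x4 pose."""
    rng = np.random.default_rng() if rng is None else rng

    # Random unit direction for the translation error.
    d = rng.normal(size=3)
    d /= np.linalg.norm(d)

    # Random unit axis, then Rodrigues' formula for the rotation error.
    a = rng.normal(size=3)
    a /= np.linalg.norm(a)
    K = np.array([[0, -a[2], a[1]],
                  [a[2], 0, -a[0]],
                  [-a[1], a[0], 0]])
    R_err = np.eye(3) + np.sin(rot_err_rad) * K + (1 - np.cos(rot_err_rad)) * K @ K

    T_noisy = T.copy()
    T_noisy[:3, :3] = R_err @ T[:3, :3]
    T_noisy[:3, 3] += trans_err_m * d
    return T_noisy
```

Sweeping trans_err_m and rot_err_rad while replaying the transferred trajectories yields sensitivity curves of the kind shown in the plot.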
Experiments on Real-World Tasks
Aside from analysing the tasks' tolerances, it is crucial to understand the expected performance of various unseen object pose estimators when applied to real-world robotics.
To answer this question, we compared eight pose estimation baselines, including iterative closest point (ICP), correspondence-based methods (DINO, ASpanFormer, ASpanFormer FT, GMFlow), and direct estimation approaches (NOPE, Regression, Classification), evaluating them for trajectory transfer. We additionally benchmark them against DOME, a state-of-the-art one-shot imitation learning method. Below you can find the results; we refer the reader to the paper for a thorough discussion.
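As an example of how one of these baselines slots into the trajectory-transfer pipeline, the sketch below uses Open3D's point-to-point ICP to register the segmented object point cloud from the demonstration onto the one observed at deployment; this is only an approximate stand-in for the ICP baseline, not its exact configuration.

```python
import numpy as np
import open3d as o3d

def estimate_relative_pose_icp(pcd_demo, pcd_live, max_corr_dist=0.02):
    """Estimate the 4x4 transform mapping the demo object cloud onto the live one."""
    # Coarse initialisation: align the two point-cloud centroids.
    init = np.eye(4)
    init[:3, 3] = pcd_live.get_center() - pcd_demo.get_center()

    result = o3d.pipelines.registration.registration_icp(
        pcd_demo, pcd_live, max_corr_dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    # This transform plays the role of T_delta in the trajectory-transfer sketch above.
    return result.transformation
```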
Example deployments of all the methods on the Tea and Plug tasks can be found in the videos hereafter.
Sensitivity to Lighting Changes
We further evaluate the robustness of Regression, the best-performing method, to changes in lighting conditions.
To this end, we rerun the real-world experiment for this method while additionally randomising the position, luminosity and colour temperature of an external LED light source before each roll-out.
The results from this experiment indicate that trajectory transfer using Regression remains strong even when the lighting conditions are randomised between the demonstration and test scenes, with an average decrease in performance of only 8%.
Analysis on Spatial Generalisation
Another insight that emerged from the real-world experiments is the impact of the relative object pose between the demonstration and deployment on the average performance of trajectory transfer.
When we aggregate the success rates across all baselines, tasks, and poses within each of the ten quadrants defined on the workspace, we notice a decline in the success rate of trajectory transfer as the object pose deviates from the demonstration pose.
We visually summarise this analysis in the image below, where we show a mug for each of the quadrants and one, labelled DEMO, representing the approximate object location during the demonstration. The opacity of the mugs located in the different quadrants is proportional to the average success rate for those quadrants, which is also displayed in white text.
The cause of this behaviour lies in camera perspective. Specifically, even when kept at a fixed orientation, simply changing the position of an object will result in changes to its visual appearance, as is shown in the image above. As a result, for optimal spatial generalisation, we recommend providing demonstrations at the centre of the task space, as this minimises the variations in the object appearance when the object’s pose deviates from the demonstration pose.
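A simple back-of-the-envelope calculation illustrates this: with the camera fixed, translating the object across the workspace changes the direction from which the camera views it, and hence its appearance. The camera and object positions below are illustrative values, not measurements from our setup.

```python
import numpy as np

def viewing_angle_change(cam_pos, demo_pos, test_pos):
    """Angle (degrees) between the camera-to-object directions at two object positions."""
    v1 = (demo_pos - cam_pos) / np.linalg.norm(demo_pos - cam_pos)
    v2 = (test_pos - cam_pos) / np.linalg.norm(test_pos - cam_pos)
    return np.degrees(np.arccos(np.clip(v1 @ v2, -1.0, 1.0)))

# Camera 0.7 m above and behind the table edge; object shifted 0.3 m sideways.
cam = np.array([0.0, -0.4, 0.7])
demo = np.array([0.5, 0.0, 0.0])
test = np.array([0.5, 0.3, 0.0])
print(viewing_angle_change(cam, demo, test))  # roughly 14 degrees
```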