
On the Effectiveness of Retrieval, Alignment, and Replay in Manipulation

Norman Di Palo and Edward Johns

Published in Robotics and Automation Letters (RA-L) 2024



Imitation learning with visual observations is notoriously inefficient when addressed with end-to-end behavioural cloning methods. In this paper, we explore an alternative paradigm which decomposes reasoning into three phases. First, a retrieval phase, which informs the robot what it can do with an object. Second, an alignment phase, which informs the robot where to interact with the object. And third, a replay phase, which informs the robot how to interact with the object. Through a series of real-world experiments on everyday tasks, such as grasping, pouring, and inserting objects, we show that this decomposition brings unprecedented learning efficiency, and effective inter- and intra-class generalisation.

Key Idea: When facing a new object to interact with, the robot visually compares it to all the objects it has encountered during the operator's demonstrations, stored in a buffer. To replicate the most similar demonstration, it retrieves where the demonstration started, and the trajectory recorded to interact with the object. The robot aligns the end-effector with the object and then replays the trajectory. This explicit decomposition unlocks unprecedented learning efficiency.
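The retrieval step above can be sketched as a nearest-neighbour lookup over the buffer. The sketch below uses cosine similarity over raw pixels as a stand-in for the learned visual similarity used in the paper; the function and argument names are illustrative, not the paper's API.

```python
import numpy as np

def retrieve_demo(live_obs, buffer_obs, buffer_demos):
    """Return the demo whose stored observation is most visually
    similar to the live wrist-camera observation.

    Similarity here is cosine similarity over flattened pixels; the
    method uses learned visual features, so treat this as a stand-in.
    """
    live = live_obs.ravel().astype(np.float64)
    live /= np.linalg.norm(live) + 1e-8
    best_idx, best_sim = -1, -np.inf
    for i, obs in enumerate(buffer_obs):
        v = obs.ravel().astype(np.float64)
        v /= np.linalg.norm(v) + 1e-8
        sim = float(live @ v)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return buffer_demos[best_idx], best_idx
```

The retrieved entry supplies both the bottleneck observation (the alignment target) and the trajectory to replay.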

Motivation: Robots need to be able to quickly learn from human demonstrations, and effectively transfer the acquired knowledge to new, unseen objects. While the common approach to imitation learning is to train end-to-end Behaviour Cloning (BC) policies, this process tends to be inefficient, requiring tens or hundreds of demonstrations per object, per task. We propose a modular, three-phase approach that decomposes object manipulation into a retrieval, an alignment and a replay phase. This enables the robot to effectively use an external memory buffer of demonstrations to decide how to interact with objects at test time, leading to better efficiency than distilling the entire buffer of demonstrations into a single, monolithic policy.

A Taxonomy of Retrieval and Alignment in Robotics in recent years: Several prior works have used retrieval, or decomposed manipulation into a coarse alignment phase and a fine trajectory, to improve the efficiency and performance of imitation learning. In the taxonomy on the right we illustrate some of the main algorithms from the recent literature.

The core investigation of our work is the following: what is the best combination of these building blocks? How do alignment and retrieval affect the performance of the imitation learning pipeline?

We empirically demonstrate how a combination of these techniques unlocks unprecedented efficiency and generality in imitation learning.

How do we use retrieval and alignment in the context of robot manipulation?  In a nutshell: object interaction is decomposed into an alignment phase guided by a goal-conditioned visual servoing policy. It takes as input a goal observation and the live, wrist-camera observation, and guides the robot to align the two. Once there, an interaction trajectory is replayed. To select what observation to use as goal, and what trajectory to replay, the robot retrieves from its buffer of demonstrations the most visually similar object.

To better understand how these phases are executed, we need to understand how we collect demonstrations, and how the robot collects the data to fill the buffer. As we will explain shortly, a fundamental advantage of the decomposition into alignment and replay is the ability to learn a new task with a single demonstration.

Recording a demonstration: The way we provide demonstrations to the robot is simple and time-efficient: a single demo is needed per object and task (e.g. grasp a mug). First, the user moves the end-effector to the bottleneck pose, an arbitrarily selected pose from which the demonstration should start. The only requirement is that the object must be visible to the wrist-camera from the bottleneck pose.

Once in the bottleneck pose, before the demonstration trajectory is recorded, the robot observes the object from different poses and gathers a dataset of observations and poses, which is used to train an alignment policy that guides the end-effector back to the bottleneck pose when the object is moved at test time.
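This self-supervised data collection can be sketched as follows. Here `get_obs_at` is a placeholder for moving the arm and reading the wrist camera, and poses are simplified to 3D translations; the real system works with full 6D poses.

```python
import numpy as np

def collect_alignment_data(get_obs_at, bottleneck_pose,
                           n_samples=50, max_offset=0.05, rng=None):
    """Gather (observation, relative_pose) pairs around the bottleneck.

    Each label is the motion that would bring the camera back to the
    bottleneck pose; the alignment policy is later trained to predict
    it from the observation alone.
    """
    rng = np.random.default_rng(rng)
    data = []
    for _ in range(n_samples):
        # Perturb the end-effector around the bottleneck pose.
        offset = rng.uniform(-max_offset, max_offset, size=3)
        pose = bottleneck_pose + offset
        obs = get_obs_at(pose)
        # Label: corrective motion back towards the bottleneck.
        data.append((obs, bottleneck_pose - pose))
    return data
```

A design note: because the labels come from the robot's own proprioception, no human annotation is needed for this phase.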


After the data is collected, the robot moves back to the bottleneck pose, and the human demonstrates how to interact with the object. The demonstration trajectory is recorded in the memory buffer as a series of 6D velocities, both linear and angular.
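Replaying such a trajectory amounts to integrating the stored 6D velocities from the bottleneck pose. The sketch below keeps orientation as Euler angles for brevity; a real controller would integrate on SE(3) and stream the commands to the arm.

```python
import numpy as np

def replay_trajectory(velocities, start_pose, dt=0.05):
    """Integrate a recorded sequence of 6D end-effector velocities
    (vx, vy, vz, wx, wy, wz), starting from the bottleneck pose.

    Returns the sequence of poses visited, including the start pose.
    """
    pose = np.asarray(start_pose, dtype=np.float64).copy()
    poses = [pose.copy()]
    for v in velocities:
        pose += dt * np.asarray(v, dtype=np.float64)
        poses.append(pose.copy())
    return poses
```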

This process is repeated for each object in the training set.

How does the robot decide what to do at test time? After receiving all the demos, the robot trains a goal-conditioned visual alignment policy, as described in the paper. Once the robot is deployed and faces a new object, it gathers an observation from its wrist-camera and uses it to query its buffer of observations and demonstrations. Once it finds the most similar observation, it retrieves the corresponding bottleneck observation (the observation recorded from the bottleneck pose), that tells the robot where to move its end-effector, and the corresponding trajectory recorded by the operator. 

The robot then runs the goal-conditioned alignment policy, conditioned on the retrieved goal observation and the live wrist-camera observation, until the live observation is aligned with the goal observation.
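This alignment phase is a closed-loop visual servoing procedure, which can be sketched as below. `policy`, `get_live_obs` and `apply_velocity` stand in for the trained network and the robot interface; the stopping criterion here is a simple observation distance, chosen for illustration.

```python
import numpy as np

def align_to_goal(policy, get_live_obs, apply_velocity, goal_obs,
                  max_steps=200, tol=1e-3):
    """Query the goal-conditioned policy for a corrective velocity
    until the live observation matches the goal observation.

    Returns (success, number of steps taken).
    """
    for step in range(max_steps):
        live = get_live_obs()
        if np.linalg.norm(live - goal_obs) < tol:
            return True, step
        apply_velocity(policy(live, goal_obs))
    return False, max_steps
```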

Once the alignment is complete, the robot replays the retrieved trajectory to interact with the object.
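Putting the three phases together, a test-time episode can be sketched as follows. The `align` and `replay` callables and the buffer entry fields are illustrative names, and pixel distance again stands in for the learned visual similarity.

```python
import numpy as np

def run_test_time_episode(live_obs, buffer, align, replay):
    """End-to-end test-time pipeline: retrieve the most similar demo,
    align to its bottleneck observation, then replay its trajectory."""
    # 1. Retrieval: nearest demo by visual similarity (pixel distance here).
    dists = [np.linalg.norm(live_obs - entry["obs"]) for entry in buffer]
    demo = buffer[int(np.argmin(dists))]
    # 2. Alignment: servo until the wrist view matches the bottleneck view.
    align(demo["bottleneck_obs"])
    # 3. Replay: execute the recorded 6D velocity trajectory.
    replay(demo["trajectory"])
    return demo
```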

Here we show some example test time trials with unseen objects.

Test-time behaviour. In the figure below, we show an overall illustration of the test-time pipeline, composed of the retrieval, alignment, and replay phases. In the videos below, you can see the robot autonomously performing a set of tasks via the retrieval, alignment and replay pipeline, such as grasping, pouring, inserting, and more. All the objects shown here are unseen, test-time objects. Videos sped up 4x.


How does our method compare against the recent literature? Is combining retrieval and alignment the optimal choice?

We test each method on 4 tasks using a total of 35 objects: 10 training objects, on which the robot received a single demo per object, and 25 unseen objects, which we divide into intra-class objects (novel instances of classes seen during the demos) and inter-class objects (objects belonging to entirely novel classes). We record 10 test trajectories per object.

Our results clearly indicate that our combination of retrieval and alignment strongly surpasses all the baselines. While alignment allows the robot to learn tasks very efficiently, retrieval additionally helps generalisation to unseen objects: by selecting the most similar object from the demonstrations buffer, the robot can use that demonstration both to guide the alignment via the goal observation and to select the interaction trajectory to replay.


Camera field-of-view. In the video below, we demonstrate that, even when using only a wrist-camera, the field-of-view is sufficient to interact with objects placed on any part of the table. Video sped up 4x.
