Abstract
Data collection in imitation learning often requires significant, laborious human supervision, such as numerous demonstrations and/or frequent environment resets for methods that incorporate reinforcement learning. In this work, we propose an alternative approach, MILES, a fully autonomous, self-supervised data collection paradigm, and show that this enables efficient policy learning from just a single demonstration and a single environment reset. Our method, MILES, autonomously learns a policy for returning to and then following the single demonstration, whilst being self-guided during data collection, eliminating the need for additional human interventions. We evaluate MILES across several real-world tasks, including tasks that require precise contact-rich manipulation, and find that, under the constraints of a single demonstration and no repeated environment resetting, MILES significantly outperforms state-of-the-art alternatives like reinforcement learning and inverse reinforcement learning.
Key Idea
By leveraging a single demonstration and no prior knowledge, MILES autonomously collects trajectories, in a self-supervised manner, that show the robot how to return to and then follow the single demonstration. By training a behavioural cloning policy on that data, MILES learns a range of everyday tasks requiring precision and dexterity, ranging from locking with a key to opening a lid.
Is Imitation Learning easy and efficient for humans to use?
Imitation learning is frequently described as a convenient way to teach robots new skills. But is this true in practice? Behavioural cloning (BC) methods leverage supervised learning to train robust policies, but doing so typically requires tens or hundreds of demonstrations per task to collect a sufficiently diverse training dataset. Inverse reinforcement learning (IRL) methods offer a solution to this, as policies can be learned autonomously through random exploration and reward functions can be inferred from a few demonstrations. However, unlike supervised learning methods, policy learning with IRL can be unstable, and random exploration makes data collection inefficient. Moreover, IRL typically requires repeated environment resetting, so in practice it is often equally or even more laborious than simply providing numerous demonstrations. Learning from a single demonstration appears to be the most convenient form of imitation learning due to its effortlessness, but policies learned this way suffer from covariate shift. As such, imitation learning today is not as easy as we would like it to be: significant human supervision is still required for data collection, either via demonstrations, environment resetting, or both.
This motivates our work on MILES, which makes imitation learning easy and effortless for humans.
🤖 MILES Overview
Given a single demonstration, MILES automatically collects data, in a self-supervised manner, that shows the robot how to return to and then follow that single demonstration. MILES makes imitation learning easy and effortless, as it requires no human effort beyond that single demonstration. Compared to popular imitation learning methods, such as behavioural cloning from a single demonstration, MILES is just as efficient in terms of human time, but it does not suffer from the well-known problem of covariate shift. Compared to behavioural cloning methods that leverage tens to hundreds of demonstrations to mitigate covariate shift, MILES achieves the same result by densely covering the space around the demonstration automatically, without requiring human effort. And compared to imitation learning methods that incorporate reinforcement learning, MILES also collects its own data, but instead of randomly exploring its environment, it does so in a self-supervised manner, allowing MILES to carefully shape data collection so that it is independent of repeated environment resets.
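To make this concrete, the following is a minimal, illustrative sketch of such a self-supervised data-collection loop. The helper functions (sample_pose_near, record_return_trajectory, is_valid) and the robot/waypoint interfaces are hypothetical placeholders for illustration, not the released MILES implementation.

```python
# Illustrative sketch of MILES-style self-supervised data collection
# (hypothetical helpers and interfaces; not the authors' released code).

def collect_augmentation_data(robot, demo_waypoints, samples_per_waypoint=20):
    """For each demonstration waypoint, repeatedly perturb the end-effector
    pose nearby and record a trajectory that returns to that waypoint."""
    dataset = []
    for waypoint in demo_waypoints:
        for _ in range(samples_per_waypoint):
            # 1. Move to a randomly perturbed pose around the waypoint.
            start_pose = sample_pose_near(waypoint.pose)             # assumed helper
            robot.move_to(start_pose)                                # assumed interface

            # 2. Execute a trajectory back towards the waypoint, logging
            #    (observation, action) pairs along the way.
            trajectory = record_return_trajectory(robot, waypoint)   # assumed helper

            # 3. Keep the trajectory only if it passes the validity checks
            #    (reachability and no environment disturbance; see below).
            if is_valid(robot, trajectory, waypoint):                # assumed helper
                dataset.append(trajectory)
    return dataset
```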
Method Overview
What happens after self-supervised data collection is finished?
Each augmentation trajectory is fused with the demonstration segment that follows the demonstration waypoint it returns to. This creates a dataset of new demonstrations, each of which shows the robot how to return to and then follow the single human demonstration.
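As a rough illustration of this fusion step, a new "demonstration" could be assembled as sketched below. The data structures are assumptions made for illustration: each augmentation trajectory is assumed to store its (observation, action) steps and the index of the demonstration waypoint it returns to.

```python
def fuse_with_demo(augmentation_trajectories, demo_steps):
    """Append the remainder of the demonstration to each augmentation
    trajectory, starting from the waypoint that trajectory returns to."""
    fused_demos = []
    for traj in augmentation_trajectories:
        # traj.steps: (observation, action) pairs recorded while returning
        # traj.waypoint_index: index of the demo waypoint it returned to
        remaining_demo = demo_steps[traj.waypoint_index:]
        fused_demos.append(traj.steps + remaining_demo)
    return fused_demos
```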
Validity Conditions for Augmentation Trajectories
When moving back towards a demonstration waypoint to collect an augmentation trajectory, we check, after the robot executes that trajectory, whether it is valid using two conditions: (1) Reachability and (2) Environment Disturbance.
The Reachability condition is simple: we use proprioception to check whether the robot's pose matches that of the demonstration waypoint. If it does not, we discard that augmentation trajectory, as it cannot successfully return the robot to the demonstration; otherwise, we store it.
The Environment Disturbance condition checks whether a collision occurred in the environment during self-supervised data collection such that, if MILES were to collect any additional data, the resulting image/force-action pairs would no longer return the robot to any demonstration waypoint. To achieve this, we leverage DINO to extract image features from the RGB images captured at each demonstration waypoint and from those captured after attempting to return to that waypoint. If the similarity between these features is below a threshold, we assume that an environment disturbance has occurred and stop data collection. We provide additional details on the environment disturbance condition in our paper. A minimal sketch of both validity checks is shown below.
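The sketch below illustrates the two checks. The pose-distance metric, the cosine-similarity measure, and both thresholds are placeholders chosen for illustration, not the exact quantities used in the paper.

```python
import numpy as np

POSE_TOLERANCE = 1e-3        # placeholder tolerance for the reachability check
SIMILARITY_THRESHOLD = 0.9   # placeholder threshold for the disturbance check

def is_reachable(achieved_pose, waypoint_pose, tol=POSE_TOLERANCE):
    """Reachability: the robot's proprioceptive pose must match the waypoint
    (treating the pose as a flat vector here, for simplicity)."""
    diff = np.asarray(achieved_pose) - np.asarray(waypoint_pose)
    return np.linalg.norm(diff) < tol

def no_disturbance(dino_features_now, dino_features_at_demo):
    """Environment disturbance: compare DINO features of the current image
    against those captured at the demonstration waypoint (cosine similarity)."""
    a = np.asarray(dino_features_now)
    b = np.asarray(dino_features_at_demo)
    similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return similarity >= SIMILARITY_THRESHOLD
```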
How do we train and deploy MILES' policy?
If no environment disturbance occurred during data collection for a task, then we train a standard behavioural cloning policy on the new, fused demonstrations and deploy it closed-loop.
If an environment disturbance was detected during data collection, we still train a behavioural cloning policy on the collected data; at deployment, this policy runs closed-loop to solve the task up to the point where the environment disturbance occurred during data collection. After that point, the remaining demonstration (for which no augmentation trajectories are available) is replayed. For more information on how to deploy MILES' policy, please see our paper.
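A minimal sketch of this deployment logic is given below, assuming hypothetical robot and policy interfaces. Switching from the closed-loop phase to demonstration replay after a fixed number of policy steps is a simplification of the condition described above (reaching the state where the disturbance was detected).

```python
def deploy(robot, policy, replay_actions=None, policy_steps=200):
    """Run the closed-loop policy for the learned portion of the task, then
    replay any remaining demonstration actions open-loop."""
    # Closed-loop phase: the behavioural cloning policy predicts the next
    # action from the current observation at every step.
    for _ in range(policy_steps):
        observation = robot.get_observation()      # assumed robot interface
        action = policy.predict(observation)       # assumed policy interface
        robot.execute(action)

    # Open-loop phase: if an environment disturbance was detected during data
    # collection, the remainder of the demonstration is replayed verbatim.
    if replay_actions is not None:
        for action in replay_actions:
            robot.execute(action)
```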
🤖 Examples of MILES Data Collection
Below, we show two examples of MILES' self-supervised data collection process for two tasks: (1) bread in toaster and (2) plug into socket. The bread in toaster task demonstrates a case where an environment disturbance is detected and, consequently, MILES stops collecting self-supervised data (as shown at the end of the video, near minute 3:55). The plug into socket task demonstrates a case where no environment disturbance occurs during data collection and, as a result, MILES collects self-supervised data for all the waypoints in the demonstration.
As shown in the videos below this section, for the bread in toaster task MILES learns a policy comprising a closed-loop and a demonstration replay component: the policy completes the task closed-loop until it reaches the state where the environment disturbance was detected, after which the remaining demonstration is replayed. For the plug into socket task, MILES learns a fully closed-loop policy that is deployed end-to-end (without any demonstration replay).
🤖 Examples of MILES Policy Roll-outs
Close-up policy roll-outs
The following close-up videos demonstrate different roll-outs of MILES for different tasks. The poses of the objects are randomized across roll-outs. For the "Plug into Socket", "Insert USB" and "Insert Power Cable" tasks, MILES learns fully closed-loop policies, while for the rest, the policy comprises a closed-loop and a demonstration replay component. (Videos are at 2x speed.)
Lock with key
Plug into socket
Insert power cable
Insert USB
Open Lid
Twist screw
Bread in toaster
Lock with key
Plug into socket
Insert power cable
Insert USB
Twist screw
Bread in toaster
Open Lid
🤖 Examples of MILES Policy Roll-outs with Distractors
4 Uncut Videos
The following videos demonstrate MILES' performance on several different tasks where the objects' poses and distractors are randomized. First, a pose estimator is deployed (see the paper's appendix) to reach near the task-relevant object. Then MILES' policy takes over to complete the task. As shown, MILES can solve several different everyday tasks, each requiring interactions of varying complexity, from precise contact-rich insertion to the delicate manipulation required to open a lid. All the policies predict 6 DOF actions. All the following videos are uncut and at 2x speed.
Plug into socket
Open Lid
Bread in toaster
Insert power cable
Quantitative Results
The table below shows MILES' performance on the 7 tasks shown above. Additionally, we compared MILES against 4 baselines, each assuming that only a single demonstration is available and no prior knowledge: (1) Demo Replay, which simply replays the demonstrated actions. (2) Pose Estimation + Demo Replay, which leverages MILES' data to perform pose estimation followed by demonstration replay. (3) Reset Free Residual RL, which replays the demonstration's actions at each timestep and learns corrective actions on top using DDPG. Like MILES, no human intervenes to reset the environment during training, hence we call it "Reset Free". Finally, (4) Reset Free FISH (Inverse Residual RL) uses the state-of-the-art inverse RL method FISH, but no human intervenes to reset the environment during training.
For ablation studies and additional experimental results on MILES' performance with different observation modalities, please see our paper.
🤖 MILES Generalization
While MILES is not explicitly designed for generalization, the fact that it combines behavioural cloning with demonstration replay allows it to inherit the generalization capabilities of each. Hence, to evaluate MILES' ability to generalize, we first train it to throw different markers into the bins shown in the image to the right. Then, generalization is tested on the bins shown in the videos below. Our expectation is that, after collecting data for several bins, MILES' behavioural cloning component will demonstrate a satisfactory level of generalization, as has been shown in prior work. To obtain satisfactory generalization for the demonstration replay component of MILES, we deploy an approach similar to DINOBot (link). For more information, please see the appendix of the paper. The following videos are uncut. The grasping of the markers in this scenario is scripted.
(Videos are at 2x speed.)
Used for training
🤖 MILES Multi-Stage Tasks
Below, we demonstrate MILES' ability to solve multi-stage tasks. To this end, we task MILES with picking up the toy bread and then inserting it into the toaster, a task comprising two stages. To achieve this, we deploy one MILES policy for the pick up bread stage and one policy for the bread into toaster stage, as discussed in the paper's experiments section. Both stages use policies that have a closed-loop and a demonstration replay component. The majority of the pick up bread stage is solved with demonstration replay, as an environment disturbance occurred early on during self-supervised data collection, while most of the bread into toaster stage is solved in a closed-loop manner. Below, three videos demonstrate this task, with the toy bread and toaster in different poses and distractors randomly placed.