SAFARI: Safe and Active Robot Imitation Learning with Imagination
Norman Di Palo and Edward Johns
One of the main issues in Learning from Demonstration is the erroneous behaviour of an agent when facing out-of-distribution situations not covered by the set of demonstrations given by the expert. In this work, we tackle this problem by introducing SAFARI, a novel active learning and control algorithm. During training, it allows an agent to request further human demonstrations when such out-of-distribution situations are met. At deployment, it combines model-free acting using behavioural cloning with model-based planning to reduce state-distribution shift, using future state reconstruction as a test for state familiarity. We empirically demonstrate how this method improves performance on a set of manipulation tasks by a substantial margin, in both simulated and real robotics scenarios, by gathering more informative demonstrations and by minimizing state-distribution shift at test time. We also show how this method enables the agent to autonomously predict failure rapidly and safely.
In this work, we present a novel framework for imitation learning and control. Our work is based on the following consideration: one of the main sources of error in imitation learning is the erroneous behaviour of the agent when facing Out-of-Distribution (OOD) situations at test time.
These OOD states may be encountered for two reasons:
the demonstrations given by the expert don’t adequately cover the state space
at test time, errors from the policy can compound and lead the state distribution to shift
We tackle this problem from three different perspectives:
At training time, we use a novel Active Learning algorithm that can more efficiently explore the state space, reducing OOD errors that may emerge.
At test time, we actively minimize state-distribution shift to bring the robot closer to the states visited during the demonstrations.
When this distribution shift grows too large, the robot can autonomously stop execution to avoid dangerous situations.
More details, such as the experimental results and baselines, can be found in our paper. For the hyperparameters used in the experiments, refer to the table at the bottom of this page.
An interplay of neural modules
Usually, in imitation learning, a policy network learns to imitate the expert’s behaviour. The main building block of our framework is an interplay between a policy network, an uncertainty network, and a dynamics network (or world model).
The policy network is a feedforward network trained to output actions given states as input. The dynamics network is trained to predict the next state given the current state and action. The (epistemic) uncertainty network is a Denoising Autoencoder (DAE), trained on the states visited by the expert. The DAE is trained to denoise its inputs, applying an MSE loss between input and output (more information can be found in the paper). The Denoising Autoencoder thereby learns the structure of the training distribution. Hence, at test time its denoising error will be lower on states close to the training distribution and higher on states far from it, akin to an energy-based model.
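The denoising error described above can be used directly as a familiarity score. A minimal sketch, assuming a trained DAE is available as a callable; the `toy_dae` below is a hypothetical stand-in (a projection onto the mean demonstration state), not the paper's actual network:

```python
import numpy as np

def epistemic_uncertainty(dae, state):
    """Denoising error as a familiarity score: low error means the state is
    close to the training distribution, high error suggests an OOD state."""
    reconstruction = dae(state)
    return float(np.mean((reconstruction - state) ** 2))

# Hypothetical stand-in for a trained DAE: it maps any input back towards
# the demonstration data (here, simply the mean of the demo states).
demo_states = np.array([[0.0, 1.0], [0.2, 0.9], [-0.1, 1.1]])
toy_dae = lambda s: demo_states.mean(axis=0)

in_dist = epistemic_uncertainty(toy_dae, np.array([0.0, 1.0]))   # near demos
ood = epistemic_uncertainty(toy_dae, np.array([5.0, -3.0]))      # far away
```

With this toy DAE, `in_dist` comes out far smaller than `ood`, which is exactly the signal thresholded in the active learning and failure-prediction stages below.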
Instead of receiving all the demonstrations beforehand, the agent receives only a small set of demonstrations during an initialization phase. The agent then gathers further demonstrations by interacting with its environment: it tries to solve the task by following its policy network, and stops to request a demonstration when its uncertainty surpasses a certain threshold. Each new demonstration is added to the training data and the networks are retrained.
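The demonstration-request loop can be sketched as follows. This is a minimal illustration with a hypothetical 1-D environment and toy policy/uncertainty functions, not the paper's implementation; the interfaces (`env`, `expert_demo`) are assumptions:

```python
class MockEnv:
    """Tiny hypothetical 1-D environment, for illustration only."""
    def reset(self):
        self.s = 0.0
        return self.s
    def step(self, a):
        self.s += a
        return self.s, self.s >= 3.0  # done once the goal at 3.0 is reached

def active_episode(env, policy, uncertainty_fn, threshold, expert_demo, dataset):
    """One episode of the active phase: act with the policy until epistemic
    uncertainty exceeds `threshold`, then query the expert and store the demo."""
    state = env.reset()
    done = False
    while not done:
        if uncertainty_fn(state) > threshold:
            dataset.extend(expert_demo(state))  # OOD: request a demonstration
            return dataset, True                # networks are then retrained
        state, done = env.step(policy(state))
    return dataset, False

# Toy setup: the policy is only familiar with states up to 1.0.
policy = lambda s: 0.5
uncertainty = lambda s: 0.0 if s <= 1.0 else 1.0
dataset, requested = active_episode(
    MockEnv(), policy, uncertainty, threshold=0.5,
    expert_demo=lambda s: [(s, 0.5)], dataset=[])
```

In this toy run the agent acts until it reaches an unfamiliar state (1.5), then stops and stores the expert's state-action pair instead of continuing blindly.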
Block Diagram of our Methodology and Evaluated Methods
We test our Active Learning method on a series of simulated and real robotics manipulation tasks. Here we show the expert providing demonstrations.
While in standard passive learning the expert provides all the demonstrations upfront, here we show how, during the active phase, the robot tries to solve the task but autonomously queries the expert when an OOD situation is encountered.
Here we show the same method in the real world. An example of the robot's visual input is added to the video. The goal is to push the object over the QR code. The QR code is not preprocessed in any way: the robot learns autonomously that it marks the goal position.
We benchmark the policy learned with Active Learning against several baselines, obtaining better performance in all three simulated environments and in the real-world scenario. Detailed results can be found in the paper.
Online State Shift Minimization
At test time, SAFARI minimizes the state distribution shift by planning actions that minimize epistemic uncertainty. It does this with an interplay of the aforementioned networks.
The agent samples a series of candidate actions and uses the dynamics model to predict the future states that would be encountered if those actions were executed. It then computes the epistemic uncertainty of these imagined states using the uncertainty network, and executes the actions whose imagined futures are least uncertain, steering the robot back towards the states visited during the demonstrations.
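A minimal sketch of this imagination-based planner, assuming the dynamics and uncertainty networks are available as plain callables (the random-shooting sampler and the toy 1-D models below are illustrative assumptions, not the paper's exact sampler):

```python
import numpy as np

def plan_min_uncertainty(state, dynamics, uncertainty,
                         n_samples=64, horizon=5, rng=None):
    """Sample candidate action sequences, imagine their future states with the
    dynamics model, score each imagined rollout by summed epistemic uncertainty
    (the DAE denoising error), and return the first action of the best one."""
    rng = rng or np.random.default_rng(0)
    best_cost, best_actions = np.inf, None
    for _ in range(n_samples):
        actions = rng.uniform(-1.0, 1.0, size=horizon)
        s, cost = state, 0.0
        for a in actions:
            s = dynamics(s, a)      # imagined next state
            cost += uncertainty(s)  # familiarity of the imagined state
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions[0]  # execute the first action, then replan

# Toy 1-D setting: demonstrations lived around state 0, the robot is at 2.0.
dynamics = lambda s, a: s + a
uncertainty = lambda s: (s - 0.0) ** 2
first_action = plan_min_uncertainty(2.0, dynamics, uncertainty)
```

Replanning after every executed action (model-predictive control style) keeps the model-free policy's state distribution from drifting away from the demonstrations.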
Failure Prediction at Test Time
Finally, our uncertainty estimation method can be used to predict failure at test time to improve safety. Here we show some examples directly from visual inputs. More details can be found in the paper.
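Since the DAE's denoising error is available at every step, failure prediction can be as simple as a thresholding rule over the rollout. The consecutive-step `patience` rule below is an illustrative assumption, not necessarily the exact criterion used in the paper:

```python
def predict_failure(uncertainties, threshold, patience=3):
    """Flag failure once epistemic uncertainty stays above `threshold` for
    `patience` consecutive steps. Returns the step at which execution should
    stop, or None if the rollout looks safe throughout."""
    run = 0
    for t, u in enumerate(uncertainties):
        run = run + 1 if u > threshold else 0
        if run >= patience:
            return t
    return None

# Uncertainty climbs as the robot drifts OOD: execution stops at step 4.
stop_step = predict_failure([0.1, 0.2, 0.9, 0.95, 1.1, 1.2],
                            threshold=0.5, patience=3)
```

Requiring several consecutive high-uncertainty steps rather than a single spike trades a slightly later stop for robustness to one-off noisy uncertainty estimates.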