Crossing the Gap: A Deep Dive into Zero Shot Sim-to-Real Transfer for Dynamics
Zero-shot sim-to-real transfer of tasks with complex dynamics is a highly challenging and unsolved problem. A number of solutions have been proposed in recent years, but we have found that many works do not present a thorough evaluation in the real world, or underplay the significant engineering effort and task-specific fine-tuning that is required to achieve the published results. In this paper, we dive deeper into the sim-to-real transfer challenge, investigate why this is such a difficult problem, and present objective evaluations of a number of transfer methods across a range of real-world tasks. Surprisingly, we found that a method which simply injects random forces into the simulation performs just as well as more complex methods, such as those which randomise the simulator's dynamics parameters, or adapt a policy online using recurrent network architectures.
< to appear >
The current literature in sim-to-real transfer for dynamics has reported some very impressive results, going so far as achieving transfer for complex dexterous in-hand manipulation. However, when we attempted to evaluate this field for our research, a surprising story emerged. We found that the current literature does not sufficiently highlight the engineering and task-specific tuning efforts required to achieve the published results. We set out to do just that, and provide a much more thorough examination of the sim-to-real for dynamics problem than is typically seen in other works. We further provide a thorough benchmarking and evaluation of recent methods over three different tasks. In doing so, we highlight some methods that we believe have been under-considered in the current literature, and put to the test other methods that have only been evaluated in proxy virtual environments. Our experiments reveal that, without significant task-specific tuning, many of the more complex methods do not seem to scale better to real-world tasks than much simpler and more interpretable alternatives.
Understanding the Reality Gap
Naively attempting to transfer a Reinforcement Learning (RL) policy trained entirely in simulation to the real world often fails, due to the domain shift originating from the differences between the two environments. Examining the problem, one can categorise these differences into two main subdivisions.
Firstly, there are the differences in the sensor modalities of the two domains, which we refer to as the observation shift. Considering the graphical model representation of the problem (figure below), the observation shift occurs because of differences in the emission probability (i.e., from states to observations) between the simulator and the real world. Concrete examples include differences in the colours and textures rendered on images obtained from simulation, or noise present in real-world sensors. When the observations of the RL policy consist of images, overcoming this observation shift is often referred to as the visual sim-to-real transfer problem.
Secondly, there are the differences between the dynamics of the two domains, which in the graphical model representation would correspond to differences between the transition probabilities of the two domains. Concretely, these arise from differences in the values of the physical parameters of simulation components (such as mass, friction coefficient, or joint damping), as well as inaccuracies in the modelling of physical processes (such as contacts, joint backlash, or time delays in the robot's operation). Overcoming this domain shift is often referred to as the dynamics sim-to-real problem and is what our work focusses on.
Setting up the Simulators
In this and the following sections we will be considering the dynamics transfer problem only, unless we state otherwise explicitly.
It is first of all important to recognise that a significant proportion of current zero-shot methods that aim to cross the reality gap and achieve sim-to-real transfer are based on the same principle. It consists of inserting noise into the simulation while training the policies, which effectively creates an ensemble of MDPs the policies are trained on. The underlying assumptions are that by training the neural network policies on such a wide variety of MDPs they will learn to generalise across them, and that the MDP of the real world will reasonably fall within the distribution of generated MDPs.
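This principle can be made concrete with a minimal sketch. The parameter names, baseline values, and half-widths below are purely illustrative stand-ins, not values from our experiments: each training episode draws one environment instance from the ensemble of MDPs by sampling every physical parameter around its baseline.

```python
import random

# Illustrative baseline parameters and randomisation half-widths;
# in practice these come from the simulator setup described below.
BASELINE = {"mass": 1.0, "friction": 0.5, "joint_damping": 0.1}
RANGES = {"mass": 0.2, "friction": 0.3, "joint_damping": 0.05}

def sample_mdp_params(rng=random):
    """Draw one environment instance from the ensemble of MDPs by
    perturbing each baseline parameter uniformly within its range."""
    return {k: BASELINE[k] + rng.uniform(-RANGES[k], RANGES[k])
            for k in BASELINE}
```

A new sample would typically be drawn at the start of every training episode, so that the policy never sees the same dynamics twice in a row.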
Therein also lies the main challenge that accompanies these techniques: a simulator needs to be set up such that sampling from distributions of pre-determined parameters creates a range of reasonable, stable behaviours that are useful in making the policies robust to the reality gap.
Defining those parameters and their distributions is not a trivial matter, and there is no consensus on the best way to do so. We explicitly describe here a step-by-step framework that we used in our experiments. This is largely applicable to techniques that are based on generating environments by randomising underlying parameter distributions, as is demonstrated by the wide variety of methods we are testing and benchmarking in our experiments. Overall, our methodology can be described by the following diagram:
The process of setting up the policy training environments starts by setting up a simulator that shows reasonable, stable behaviour. This involves choosing a set of baseline parameters. Some, such as those defining the kinematics of the robot, are quite easy to set up and can be looked up in the URDF specification of the arm. Others are much harder to set in any accurate fashion, and require educated guessing on the implementer's part, by either visually inspecting the simulation behaviour or looking up reasonable values from similar materials.
Next, the distribution of the simulator parameters to randomise needs to be determined. This process involves some initial guess of the parameter distributions. This initial guess relies on the implementer's prior knowledge, and to refine it somewhat we visually inspected the response of the simulator to some pre-determined policy under different parameter samples, and compared it against the response of the real hardware to that same policy.
Finally, there is a trial-and-error process where the task policies are trained in simulation and evaluated (in simulation, the real world, or both); if the performance is not satisfactory, the parameter distributions of the simulation are adjusted. Several works that attempt to automate this process could also be used at this stage.
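The trial-and-error step above can be sketched as an outer loop. All of the callables here (`train`, `evaluate`, `widen`) are hypothetical stand-ins for the real training run, the real/simulated evaluation, and the implementer-guided adjustment of the distributions, respectively:

```python
def tune_distributions(train, evaluate, widen, ranges,
                       target=0.8, max_iters=5):
    """Sketch of the outer tuning loop: train a policy under the current
    randomisation ranges, evaluate it, and adjust the ranges until the
    performance target is met or the iteration budget runs out."""
    policy = None
    for _ in range(max_iters):
        policy = train(ranges)          # full policy training run
        score = evaluate(policy)        # in sim, the real world, or both
        if score >= target:
            return policy, ranges       # satisfactory performance
        ranges = widen(ranges)          # implementer-guided adjustment
    return policy, ranges
```

In practice each iteration of this loop is expensive, since `train` is a complete RL training run, which is why the amount of tuning a method requires matters so much.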
Crossing the Reality Gap
In our experiments, we provide thorough benchmarking and evaluation of recent methods in the area, across three different tasks. In doing so we (1) highlight methods that we believe have been under-considered in the current literature, and show that, while significantly easier to implement and interpret, they perform just as well as, if not better than, more complex methods in our experiments, and (2) put to the test in the real world methods that have only been evaluated in simulated environments.
As such, we first consider a set of methods that are a combination of a training randomisation regime, or how the environments are randomised during training, and a transfer method (RL method and policy architecture). In our experiments we use TD3 to train the task policies, so each method differs in the policy architecture and/or the randomisation regime used.
Specifically, we consider the following randomisation regimes:
No Randomisation (NR): A baseline in which the simulators' baseline parameters are kept, with no additional randomisation.
Domain Randomisation (DR): This is the most commonly used technique when training networks to cross the Reality Gap. It consists of choosing a large set of physical and simulator parameters to randomise over in order to train the policies. These include physical parameters such as friction, observational shift parameters such as measurement noise, and unmodelled effect parameters such as action noise.
Random Force Injection (RFI): This method offers a much simpler way to inject noise into the simulation than DR, as it consists of simply injecting random forces into the different components of the simulator during execution. The only parameters to consider are then the scales of those random forces. To the best of our knowledge, it has previously only been considered as part of wider systems or as a baseline for other methods. We consider it on its own merits, and our experiments suggest it can offer a viable alternative to the more complex standard DR.
Random Force Injection with Observation Noise (RFI+): RFI injects noise into the dynamics of the simulation in the form of random forces, but does not randomise any component to account for the observation shift in the reality gap. RFI+ is a formulation that takes into account these differences by expanding RFI to include noise that affects the observations of the policy but not the underlying state of the world.
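To make the contrast between RFI and RFI+ concrete, here is a minimal sketch. The `sim_step` callable and the flat per-body force representation are illustrative assumptions, not our actual simulator interface:

```python
import random

def rfi_step(sim_step, state, action, force_scale=0.5, rng=random):
    """RFI (sketch): draw one random force per simulated body and pass
    the forces to the simulator step. The only tunable parameter of the
    randomisation regime is `force_scale`."""
    forces = [rng.uniform(-force_scale, force_scale) for _ in state]
    return sim_step(state, action, forces)

def rfi_plus_observation(state, obs_noise_std=0.01, rng=random):
    """RFI+ (sketch): additionally corrupt what the policy observes,
    without modifying the underlying simulator state."""
    return [s + rng.gauss(0.0, obs_noise_std) for s in state]
```

Under RFI+ the policy is trained on `rfi_plus_observation(state)` while the simulator continues to evolve from the true `state`, which is what separates dynamics noise from observation noise.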
For each of these randomisation methods, we consider the following policies:
Conservative Policy: A policy that receives a single observation at each timestep in order to infer the appropriate action. This policy does not have access to information that would allow it to infer the environment dynamics, which should lead it to take small, conservative actions, allowing it to recover from any large, unexpected state transition.
Adaptive Policy: A policy that receives the sequence of past states and actions as its input. Such a policy aims to use the history of states and actions in its input to create an internal representation of the current environment dynamics online, and adapt to these dynamics in real time.
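The structural difference between the two policy types lies entirely in what they are fed at each timestep. The sketch below abstracts the trained network behind a hypothetical `act_fn` callable; in our experiments the adaptive policy's history would be consumed by a recurrent network rather than passed as a list:

```python
from collections import deque

class ConservativePolicy:
    """Acts on the current observation only, so it cannot infer the
    environment dynamics and should learn cautious, recoverable actions."""
    def __init__(self, act_fn):
        self.act_fn = act_fn  # stand-in for the trained network

    def __call__(self, obs):
        return self.act_fn(obs)

class AdaptivePolicy:
    """Additionally feeds a window of past (observation, action) pairs,
    from which a recurrent network could build an internal representation
    of the current environment dynamics online."""
    def __init__(self, act_fn, history_len=10):
        self.act_fn = act_fn
        self.history = deque(maxlen=history_len)

    def __call__(self, obs):
        action = self.act_fn(obs, list(self.history))
        self.history.append((obs, action))
        return action
```

For the adaptive policy to pay off, the network must perform a kind of implicit meta-learning over this history, a point we return to in the results.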
Second, we implement and put to the test online system identification methods, that aim to build some representation of the environment dynamics online, and use that representation to condition universal policies, in order to help them adapt to the different environments. Particularly, we have identified two methods that essentially differ in their choice of dynamics representation: The Universal Policies with Online System Identification (UPOSI) method which aims to explicitly regress the simulator parameters defining the environment, and the Environment Probing Interaction Policies (EPI) method that aims to find a latent representation of the environment dynamics. Both of these methods have to the best of our knowledge only been evaluated in simulated environments, and here we compare them to the more standard approaches enumerated above in a real-world setting.
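Both methods share the same high-level control flow, differing only in what the identification module outputs. A minimal sketch, with every name illustrative rather than taken from either method's actual implementation:

```python
def universal_policy_step(policy, identify, history, obs):
    """Online system identification (sketch): `identify` maps recent
    interaction history to a dynamics representation -- explicit
    simulator parameters for UPOSI, a learned latent embedding for
    EPI -- and the universal policy is conditioned on it."""
    dynamics_repr = identify(history)
    return policy(obs, dynamics_repr)
```

The universal policy is trained across the randomised environments with the true (or learned) representation as an extra input, and at deployment the identification module must supply that input from real-world interaction alone.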
Our three evaluation environments can be seen in the Figure on the left, and consist of a Reaching Environment, a Pushing Environment and a Sliding Environment.
Reaching Environment: The aim in this environment is to reach a target located 2.5cm above the table. The target locations are sampled at each episode from within some pre-specified regions. For our real-world experiments, we use three goal locations (easy, intermediate and hard) to evaluate our policies.
Pushing Environment: The aim in this environment is to push a cuboid to a given target location. Both the start and goal positions are sampled within some pre-determined regions. In our real-world experiments, again three goal locations are used to evaluate the policies.
Sliding Environment: The aim of this environment is to use gravity to slide the same cuboid on a sliding panel from a pre-determined starting position to a pre-determined goal position. As opposed to reaching and pushing, the start and target positions in this task are not varied, in order to facilitate learning.
A summary of our results can be seen in the table above, and a more detailed breakdown can be found in the appendix of our main paper.
We can summarise our findings in the following:
Aggregate statistics show RFI as the best performer across the board, although looking at the full breakdown of the results there is no clear winner amongst the methods, with their ranking varying largely depending on the task/goal combination. This is significant, as at the very least we can conclude that DR performed no better in our experiments than RFI, despite requiring several days of tuning and the use of post-training adjustment, while RFI only required a few hours of tuning and no post-training re-adjustment of the distributions.
Overall RFI performed better than RFI+, which is surprising, as RFI+ conceptually accounts more thoroughly for the sim-to-real discrepancies. We conjecture that RFI+ may have benefitted from further tuning of the parameter ranges, and this highlights a trend that seems to form: the more parameters that are considered for environment randomisation, the more tuning of those parameters needs to be done for the performance to be satisfactory, while showing no evidence of being better than simpler alternatives in our experiments.
The adaptive policies performed worse than the conservative policies, which is contrary to what some previous works suggest. We conjecture that either the structure of our tasks (e.g. episode length) was not conducive to the emergence of implicit meta-learning in the recurrent networks (which is necessary for the adaptive networks to adapt to the environment dynamics online), or that further tuning efforts would have been required for this to be the case.
UPOSI performed very poorly, while EPI responded with different levels of success to the different tasks. Upon further experimentation we found that, although the universal policies do utilise the outputs of their system identification modules, those modules seem to have trouble making meaningful predictions in such complex tasks.