Crossing the Gap: A Deep Dive into Zero-Shot Sim-to-Real Transfer for Dynamics
Zero-shot sim-to-real transfer of tasks with complex dynamics is a highly challenging and unsolved problem. A number of solutions have been proposed in recent years, but we have found that many works do not present a thorough evaluation in the real world, or underplay the significant engineering effort and task-specific fine-tuning that is required to achieve the published results. In this paper, we dive deeper into the sim-to-real transfer challenge, investigate why this is such a difficult problem, and present objective evaluations of a number of transfer methods across a range of real-world tasks. Surprisingly, we found that a method which simply injects random forces into the simulation performs just as well as more complex methods, such as those which randomise the simulator’s dynamics parameters, or adapt a policy online using recurrent network architectures.
Understanding The Reality Gap
A policy naively trained in simulation ...
... does not transfer well to the real world
Naively attempting to transfer a Reinforcement Learning (RL) policy trained entirely in simulation to the real world often fails. The cause is the domain shift that originates from the differences between the two environments:
First, there are the differences in the sensor modalities of the two domains, which we refer to as the observation shift. Considering the graphical model representation of the problem (figure below), the observation shift occurs because of differences in the emission probability (i.e., from states to observations) between the simulator and the real world. Concrete examples include differences in the colours and textures rendered on images obtained from simulation, or noise present in real-world sensors.
Second, there are the differences between the dynamics of the two domains, which is the part our work focusses on. In the graphical model representation, this corresponds to differences between the transition probabilities of the two domains. Concretely, these arise from differences in the values of the physical parameters of simulation components (such as mass, friction coefficient, or joint damping), as well as inaccuracies in the modelling of physical processes (such as contacts, joint backlash, and time delays in the robot operation).
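The two shifts described above can be written compactly. The notation here is an assumption made for illustration and is not taken from the paper: s_t is the state, a_t the action and o_t the observation at time t, with "sim" and "real" subscripts marking the domain.

```latex
\begin{align}
\text{observation shift:} \quad & p_{\text{sim}}(o_t \mid s_t) \neq p_{\text{real}}(o_t \mid s_t) \\
\text{dynamics shift:} \quad & p_{\text{sim}}(s_{t+1} \mid s_t, a_t) \neq p_{\text{real}}(s_{t+1} \mid s_t, a_t)
\end{align}
```

The first line concerns the emission probability, the second the transition probability, which is the one this work focusses on.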
Setting Up the Simulators
In this and the following sections we will be considering the dynamics transfer problem only, unless we state otherwise explicitly.
It is first of all important to recognise that a significant proportion of current zero-shot methods that aim to cross the reality gap and achieve sim-to-real transfer are based on the same principle: inserting noise into the simulation while training the policies, in order to make them robust to the differences between the environments they will be deployed in.
Therein also lies the main challenge that accompanies these techniques: a simulator needs to be set up in such a way that sampling from distributions of pre-determined parameters creates a range of reasonable and stable behaviours that are useful in making the policies robust to the reality gap.
Defining those parameters and their distributions is not a trivial matter, and there is no consensus on the best way to do so. We explicitly describe here a step-by-step framework that we used in our experiments. This is largely applicable to techniques that are based on generating environments by randomising underlying parameter distributions, as is demonstrated by the wide variety of methods we are testing and benchmarking in our experiments. Overall, our methodology can be described by the following diagram:
Block Diagram of our Methodology and Evaluated Methods
The process of setting up the policy training environments starts by setting up a simulator, including a set of baseline parameters, that shows reasonable and stable behaviour.
Next, the distribution of the simulator parameters to randomise needs to be determined. This process involves some initial guess of the parameter distributions, which relies on the implementer's prior knowledge and expertise. To refine this guess, we visually compared the response of the simulator to some pre-determined policy, under different parameter samples, with the response to that same policy on real hardware.
Finally, there is a trial-and-error process where the task policies are trained in simulation, evaluated (in simulation, the real world, or both), and if the performance is not satisfactory the parameter distributions of the simulation are adjusted. Several works that attempt to automate this process could also be used at this stage.
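The trial-and-error stage can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the procedure used in the paper: the function name `tune_distributions`, the callbacks `train`, `evaluate` and `adjust`, and the scalar `threshold` are all hypothetical, and how the distributions are adjusted is left entirely to the caller.

```python
def tune_distributions(dists, train, evaluate, adjust, threshold, max_rounds=10):
    """Train, evaluate, and adjust parameter distributions until satisfactory.

    All arguments are hypothetical placeholders:
      dists     -- mapping from parameter name to its sampling range
      train     -- callable(dists) -> policy trained in simulation
      evaluate  -- callable(policy) -> scalar score (simulation, real world, or both)
      adjust    -- callable(dists, score) -> new distributions
      threshold -- score at which performance counts as satisfactory
    Returns (policy, dists, rounds_used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    for round_idx in range(max_rounds):
        policy = train(dists)
        score = evaluate(policy)
        if score >= threshold:
            return policy, dists, round_idx + 1
        # Performance not satisfactory: change the distributions and retrain.
        dists = adjust(dists, score)
    # Budget exhausted: return the last trained policy and the latest ranges.
    return policy, dists, max_rounds
```

In practice the `adjust` step is where the implementer's judgement (or an automated method) enters; the loop itself only fixes the order of operations.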
Crossing The Reality Gap
In our experiments, we provide thorough benchmarking and evaluation of recent methods in the area, across three different tasks. In doing so we (1) highlight methods that we believe have been under-considered in the current literature, and show that, while being significantly easier to implement and interpret, they perform just as well as, if not better than, more complex methods in our experiments, and (2) put to the test in the real world methods that have only been evaluated in simulated environments.
First, we consider a set of methods that are a combination of a training randomisation regime (how the environments are randomised during training) and a transfer method (RL method and policy architecture). Specifically, we consider the following randomisation regimes:
No Randomisation (NR): A baseline where the baseline parameters of the simulators are kept, with no additional randomisation.
Domain Randomisation (DR): This is the most commonly used technique when training networks to cross the reality gap. It consists of choosing a large set of physical and simulator parameters to randomise over in order to train the policies.
Random Force Injection (RFI): This method offers a much simpler way to inject noise into the simulation than DR, as it consists of simply applying random forces to the different components of the simulator during execution. The only parameters to consider are then the scales of those random forces. To the best of our knowledge, it has only been considered as part of wider systems or as a baseline for other methods. We consider it on its own merit, and our experiments suggest it can offer a viable alternative to the more complex standard DR.
Random Force Injection with Observation Noise (RFI+): RFI injects noise into the dynamics of the simulation in the form of random forces, but does not randomise any component to account for the observation shift in the reality gap. RFI+ is a formulation that takes into account these differences by expanding RFI to include noise that affects the observations of the policy but not the underlying state of the world.
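To make the difference between the four regimes concrete, here is a minimal sketch on a toy 1-D point mass. Everything specific in it is an assumption made for illustration: the toy dynamics, the baseline values, the sampling ranges, and the use of uniform forces and Gaussian observation noise are not taken from the paper.

```python
import random

# Hypothetical baseline parameters for a toy 1-D point mass.
BASE_PARAMS = {"mass": 1.0, "friction": 0.1}

def sample_params(regime, rng):
    """DR resamples physical parameters per episode; other regimes keep the baseline."""
    if regime == "DR":
        return {
            "mass": rng.uniform(0.5, 1.5),        # assumed range
            "friction": rng.uniform(0.05, 0.2),   # assumed range
        }
    return dict(BASE_PARAMS)

def step(pos, vel, action, params, regime, rng,
         dt=0.01, force_scale=0.5, obs_noise=0.01):
    """Advance the point mass by one step and return (pos, vel, observation)."""
    force = action - params["friction"] * vel
    if regime in ("RFI", "RFI+"):
        # Random force injection: perturbs the dynamics, so the true state changes.
        force += rng.uniform(-force_scale, force_scale)
    vel = vel + dt * force / params["mass"]
    pos = pos + dt * vel
    obs = (pos, vel)
    if regime == "RFI+":
        # Observation noise: changes what the policy sees, not the true state.
        obs = (pos + rng.gauss(0.0, obs_noise), vel + rng.gauss(0.0, obs_noise))
    return pos, vel, obs
```

Note how RFI exposes a single tunable quantity (`force_scale`), RFI+ adds one more (`obs_noise`), while DR needs one range per randomised parameter.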
For each of these randomisation methods, we consider the following policies:
Conservative Policy: A policy that receives a single observation at each timestep in order to infer the appropriate action. This policy does not have access to information that would allow it to infer the environment dynamics, which should lead it to take small, conservative actions, allowing it to recover from any large, unexpected state transition.
Adaptive Policy: A policy that receives the sequence of past states and actions as its input. Such a policy aims to use the history of states and actions in its input to create an internal representation of the current environment dynamics online, and adapt to these dynamics in real time.
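The difference between the two policy types lies in what they receive as input, which can be sketched as follows. This is an illustration only: the names `conservative_input` and `AdaptiveInput`, the fixed-length window and the zero padding are assumptions, and a recurrent network could equally consume the sequence one step at a time.

```python
from collections import deque

def conservative_input(obs):
    """A conservative policy sees only the current observation."""
    return list(obs)

class AdaptiveInput:
    """Keeps a sliding window of past (observation, action) pairs (hypothetical helper)."""

    def __init__(self, horizon, obs_dim, act_dim):
        pad = [0.0] * (obs_dim + act_dim)
        # Start with zero padding so the input shape is constant from step 0.
        self.window = deque([list(pad) for _ in range(horizon)], maxlen=horizon)

    def push(self, obs, action):
        """Record the latest observation and the action taken from it."""
        self.window.append(list(obs) + list(action))

    def features(self):
        """Return the history, oldest first, as the adaptive policy's input."""
        return [list(row) for row in self.window]
```

Because consecutive rows show how the state responded to each action, the history carries information about the environment dynamics that a single observation cannot.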
Second, we implement and put to the test online system identification methods that aim to build some representation of the environment dynamics online, and use that representation to condition universal policies, in order to help them adapt to the different environments. Specifically, we test the Universal Policies with Online System Identification (UPOSI) method, which aims to explicitly regress the simulator parameters defining the environment, and the Environment Probing Interaction Policies (EPI) method, which aims to find a latent representation of the environment dynamics. Both of these methods have, to the best of our knowledge, only been evaluated in simulated environments, and here we compare them to the more standard approaches listed above in a real-world setting.
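The idea of explicitly regressing parameters and conditioning a policy on them can be illustrated with a toy stand-in. This is not the UPOSI implementation: a closed-form least-squares fit of a single mass from F = m * a replaces the learned system identification module, and the names `estimate_mass` and `conditioned_input` are hypothetical.

```python
def estimate_mass(forces, accelerations):
    """Least-squares estimate of m in F = m * a from a short interaction history."""
    numerator = sum(f * a for f, a in zip(forces, accelerations))
    denominator = sum(a * a for a in accelerations)
    if denominator == 0.0:
        raise ValueError("need at least one non-zero acceleration")
    return numerator / denominator

def conditioned_input(obs, estimated_params):
    """Universal-policy input: the observation concatenated with the parameter estimate."""
    return list(obs) + list(estimated_params)

# Usage: identify the parameter online, then feed it to the policy with the observation.
mass_hat = estimate_mass([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
policy_input = conditioned_input([0.1, 0.2], [mass_hat])
```

If the estimate is poor, the policy is conditioned on misleading information, which is one way such a pipeline can underperform a policy that ignores the estimate altogether.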
Our three evaluation environments can be seen in the figure on the left.
Reaching Environment: The aim is to reach a goal located above the table target. For our real-world experiments we use three goal locations (easy, intermediate and hard) to evaluate our policies on.
Pushing Environment: The aim is to push a cuboid to a given goal location. Again, in our real-world experiments, three goal locations are used.
Sliding Environment: The aim is to use gravity to slide the same cuboid on a sliding panel from a starting position to a pre-determined goal position. As opposed to reaching and pushing, the target position in this task is not varied, in order to facilitate learning.
Illustrations of our Experimental Environments
A summary of our quantitative results can be seen in the table above, and we can summarise our findings as follows:
Aggregate statistics show RFI as the best-performing method across the board. Nonetheless, looking at the full breakdown of the results, we distinguish no clear winner amongst the methods, as their ranking varied depending on the task/goal combination. This is significant: at the very least, we can conclude that DR performed no better in our experiments than RFI, despite requiring several days of tuning and the use of post-training adjustment, while RFI required only a few hours of tuning and no post-training re-adjustment of the distributions.
Overall, RFI performed better than RFI+, which is surprising, as RFI+ conceptually accounts more thoroughly for the sim-to-real discrepancies. We conjecture that RFI+ may have benefitted from further tuning of the parameter ranges, and this highlights a trend that seems to be forming: the more parameters are considered for environment randomisation, the more tuning of those parameters is needed for the performance to be satisfactory, with no evidence in our experiments that the result is better than simpler alternatives.
The adaptive policies performed worse than the conservative policies, which is contrary to what some previous works suggest. We conjecture that either the structure of our tasks (e.g. episode length) was not favourable to the emergence of implicit meta-learning in the recurrent networks (which is necessary for the adaptive networks to be able to adapt to the environment dynamics online), or that further tuning efforts would have been required for this to be the case.
UPOSI performed very poorly, while EPI seemed to respond with different levels of success to the different tasks. Upon further experimentation we found that although the universal policies utilise their system identification modules to make predictions, those system identification modules seem to have trouble making meaningful predictions in such complex tasks.