Benchmarking Domain Randomisation for Visual Sim-to-Real Transfer
Domain randomisation is a very popular method for visual sim-to-real transfer in robotics, due to its simplicity and ability to achieve transfer without any real-world images at all. Nonetheless, a number of design choices must be made to achieve optimal transfer. In this paper, we perform a comprehensive benchmarking study on these different choices, with two key experiments evaluated on a real-world object pose estimation task. First, we study the rendering quality, and find that a small number of high-quality images is superior to a large number of low-quality images. Second, we study the type of randomisation, and find that both distractors and textures are important for generalisation to novel environments.
Deep learning for robotics
In recent years, deep learning has been successfully applied to a range of robotics applications, particularly those requiring visual observations for control. Deep learning, however, is considered to be data-hungry, and the reliance of deep learning on large labelled datasets presents a significant challenge, especially for robots where collecting real-world training examples comes with a high cost.
Several works in the literature have therefore leveraged the power of simulation engines to train deep neural networks with synthetic data, which, compared to real-world data, is fast and cheap to collect and label.
However, do these models work out of the box?
The simple answer is: no. Models trained entirely in simulation often fail in the real world, due to unmodelled differences between the two environments, a problem referred to as the reality gap.
One of the most promising solutions is sim-to-real transfer, where training is performed in simulation, and a controller is transferred directly to the real world. Of the many sim-to-real transfer methods for vision, domain randomisation is the most popular since it is simple to implement and can achieve zero-shot transfer without any real-world data. However, despite its popularity, there are no significant works that benchmark the different types of domain randomisation for visual sim-to-real transfer.
In this paper, we study a range of different design choices and empirically evaluate their effect on a real-world task. We divide the paper into two distinct experiments, which study two modes of design choices.
In the first experiment, we consider the case when the scene content is known in advance, and the primary role of sim-to-real is to model the effect of illumination and image noise on the observed image. Here, we study how the quality of the rendered images, in terms of the fidelity of the graphics pipeline, affects the sim-to-real performance.
In the second experiment, we consider the case when the scene content is not known in advance, and the role of sim-to-real is to achieve robustness to the unknown, such as illumination conditions and clutter. Here, we study how different types of randomisation, such as colours and textures, affect the sim-to-real performance.
We evaluate both experiments on a 6D object pose estimation task, using a manually-labelled real-world dataset. This not only evaluates sim-to-real for a pose estimation module within a wider control pipeline, but also serves as a proxy for end-to-end control methods which implicitly localise important objects within an image.
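As an illustration of how a 6D pose estimate can be compared against a manually-labelled pose, the sketch below computes a translation error and a rotation error. This is a common way of scoring pose estimates, not necessarily the metric used in the paper; the function names pose_errors and rot_z are hypothetical, and rotations are assumed to be given as 3x3 rotation matrices.

```python
import math


def pose_errors(R_pred, t_pred, R_true, t_true):
    """Return (translation error, rotation error in degrees) for two 6D poses.

    Assumption: R_* are 3x3 rotation matrices (nested lists), t_* are 3-vectors.
    """
    # Translation error: Euclidean distance between predicted and true positions.
    t_err = math.dist(t_pred, t_true)
    # trace(R_pred^T R_true) equals the element-wise product summed over all entries.
    trace = sum(R_pred[i][j] * R_true[i][j] for i in range(3) for j in range(3))
    # Angle of the relative rotation; clamp to guard against rounding error.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return t_err, math.degrees(math.acos(cos_angle))


def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z axis."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


# Example: prediction is rotated 30 degrees about z and offset by (3, 4, 0).
t_err, r_err = pose_errors(rot_z(30), [3.0, 4.0, 0.0], rot_z(0), [0.0, 0.0, 0.0])
```

Reporting the two errors separately keeps position and orientation mistakes distinguishable, which is useful when one of them dominates.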
Experiment #1 Rendering Quality
This experiment studies the impact of the simulator’s fidelity on the model’s overall performance in the real world. More precisely, we seek answers to the following three questions:
1) How critical is the quality of the simulator for achieving successful sim-to-real transfers?
2) What is the effect that each simulation parameter has on the overall sim-to-real transfer performance?
3) What is the optimal trade-off of low-quality and high-quality images given a fixed amount of rendering time?
To achieve this, we collected eight different training datasets that vary in terms of five different parameters. In the above figure, we visually show the difference between the eight quality levels.
In the following table, we show the results obtained upon testing these models in the real world, where we can see a very strong relationship between the quality of the renderer and the sim-to-real transfer performance.
However, we observed experimentally that rendering a single high-quality image (level 8) takes more than three times as long as rendering a low-quality one (level 1). Thus, we also proposed combining low-quality images with high-quality images, and found, as shown in the below figures, that for the same overall rendering time, it is more important to have a high percentage of high-quality images than a large number of low-quality images.
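The trade-off can be made concrete with a little arithmetic. The sketch below computes how many images of each quality fit into a fixed rendering budget for a chosen share of high-quality images. The per-image times (1 second for level 1, 3 seconds for level 8) are illustrative assumptions consistent with the "more than triple" observation, not measured values, and the function name images_for_budget is hypothetical.

```python
def images_for_budget(total_seconds, high_share, t_low=1.0, t_high=3.0):
    """Return (n_low, n_high) image counts that fit in a fixed rendering budget.

    high_share is the fraction of *images* (not of time) rendered at high quality.
    t_low and t_high are assumed per-image rendering times in seconds.
    """
    if not 0.0 <= high_share <= 1.0:
        raise ValueError("high_share must lie in [0, 1]")
    # Average cost of one image for this mix of quality levels.
    mean_cost = high_share * t_high + (1.0 - high_share) * t_low
    n_total = int(total_seconds // mean_cost)
    # Rounding the high-quality count down keeps the total within budget.
    n_high = int(n_total * high_share)
    return n_total - n_high, n_high


# Example: a 3000-second budget split half and half by image count.
n_low, n_high = images_for_budget(3000, 0.5)
```

Under these assumed timings, moving from an all-low-quality to an all-high-quality dataset shrinks the dataset to a third of its size, which is the cost the finding above is weighed against.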
Experiment #2 Randomisation Type
In this experiment, we examine the performance of models trained with high-quality synthetic images (level 8), when tested in varied and cluttered real-world environments. More precisely, we assessed the significance of different randomisation settings for the models’ transferability to the real world. Our focus is on answering the following three questions:
1) What elements are most important to randomise while collecting the training data?
2) How does the performance of the models change as we increase the training dataset size?
3) How robust are the trained models to never-seen backgrounds and distractors?
We concluded that randomising both factors (textures and distractors) is important for generalising to novel environments with never-seen distractors and backgrounds. We have also found, as expected, that there is a strong relationship between the training dataset size and the sim-to-real transfer performance.
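To illustrate what randomising textures and distractors can look like when generating training scenes, here is a minimal sketch of a per-image scene sampler. It is not the pipeline used in the paper: the texture names, distractor shapes, coordinate ranges and the names SceneConfig and sample_scene are all hypothetical, and a real setup would pass the sampled configuration to a renderer.

```python
import random
from dataclasses import dataclass

TEXTURES = ["wood", "marble", "checker", "noise"]  # hypothetical texture names
SHAPES = ["cube", "sphere", "cylinder"]            # hypothetical distractor shapes


@dataclass
class SceneConfig:
    background_texture: str
    distractors: list  # list of (shape, x, y) tuples


def sample_scene(rng, randomise_textures=True, randomise_distractors=True,
                 max_distractors=5):
    """Sample one scene configuration, with each randomisation type switchable."""
    # Texture randomisation: pick a random background, or fall back to a plain one.
    texture = rng.choice(TEXTURES) if randomise_textures else "plain"
    distractors = []
    if randomise_distractors:
        # Distractor randomisation: a random number of objects at random positions.
        for _ in range(rng.randint(0, max_distractors)):
            shape = rng.choice(SHAPES)
            distractors.append((shape, rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)))
    return SceneConfig(texture, distractors)


# Example: one scene with both randomisation types, one with neither.
rng = random.Random(0)
both = sample_scene(rng)
neither = sample_scene(rng, randomise_textures=False, randomise_distractors=False)
```

Keeping each randomisation type behind its own switch makes it straightforward to generate the separate datasets needed to compare their individual contributions.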
We observed that our models, when trained with both textures and distractors, are robust to changing environments and can generalise accurately to the real world. In the below figures, we show some successful examples in which the model precisely estimated the 6D pose of the target object, in spite of the novel backgrounds and distractors.
Despite the promising results, the model was not able to accurately estimate the pose of the target object when another distractor of the same colour was present in the environment, as shown in the examples below. Further, extremely cluttered environments and dark scenes posed a challenge to the model.
We believe that these conclusions can now be used by others in designing their own domain randomisation datasets, with a view towards achieving optimal sim-to-real performance for a given amount of dataset generation time.