Coarse-to-Fine for Sim-to-Real: Sub-Millimetre Precision Across Wide Task Spaces

Eugene Valassakis, Norman Di Palo, and Edward Johns

Published at IROS 2021



In this paper, we study the problem of zero-shot sim-to-real when the task requires both highly precise control with sub-millimetre error tolerance, and wide task space generalisation. Our framework involves a coarse-to-fine controller, where trajectories begin with classical motion planning using ICP-based pose estimation, and transition to a learned end-to-end controller which maps images to actions and is trained in simulation with domain randomisation. In this way, we achieve precise control whilst also generalising the controller across wide task spaces, and keeping the robustness of vision-based, end-to-end control. Real-world experiments on a range of different tasks show that, by exploiting the best of both worlds, our framework significantly outperforms purely motion planning methods, and purely learning-based methods. Furthermore, we answer a range of questions on best practices for precise sim-to-real transfer, such as how different image sensor modalities and image feature representations perform.


5-Minute Summary

Our Coarse-to-Fine Framework


Estimating the Pose of the Bottleneck through ICP

Coarse Controller

Our coarse controller starts at the robot's neutral position and uses Iterative Closest Point (ICP) pose estimation to plan and execute a path to a bottleneck pose, just above the object of interest.

This way, our framework achieves wide task space generalisation, without the need to train a complex end-to-end controller for the simpler parts of a task.
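As a rough illustration of the idea, the sketch below implements a minimal point-to-point ICP in NumPy and derives a bottleneck pose from the estimated object pose. It is a simplified stand-in under stated assumptions, not the pipeline used in the paper: the function names (`best_rigid_transform`, `icp`, `bottleneck_pose`), the brute-force nearest-neighbour search, the fixed iteration count, and the 5 cm offset along the object's z-axis are all hypothetical choices made for illustration.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src_i + t ~= dst_i (Kabsch algorithm)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(model, scene, iters=30):
    """Align model points (N, 3) to scene points (M, 3); returns a 4x4 transform."""
    T = np.eye(4)
    current = model.copy()
    for _ in range(iters):
        # Brute-force nearest scene point for every (transformed) model point.
        d2 = ((current[:, None, :] - scene[None, :, :]) ** 2).sum(axis=2)
        matched = scene[d2.argmin(axis=1)]
        R, t = best_rigid_transform(current, matched)
        current = current @ R.T + t
        step = np.eye(4)
        step[:3, :3], step[:3, 3] = R, t
        T = step @ T                              # accumulate the update
    return T

def bottleneck_pose(object_pose, height=0.05):
    """Hypothetical bottleneck: a fixed offset along the object's z-axis (metres)."""
    offset = np.eye(4)
    offset[2, 3] = height
    return object_pose @ offset
```

Calling `bottleneck_pose(icp(model_points, scene_points))` would then give a target for a motion planner. Like any ICP variant, this sketch only converges from a reasonable initial alignment.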


Coarse Trajectory to the Bottleneck Pose


Domain Randomisation in Simulation

Fine Controller

A highly precise, closed-loop, end-to-end controller that maps images to actions kicks in at the bottleneck pose to finish the task at the required precision. 

This controller is trained with domain randomisation (DR) in simulation, and then deployed to the real world without further training.
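To make the idea of domain randomisation concrete, here is a minimal sketch of per-episode parameter sampling. The parameter names and ranges in `RANGES` are hypothetical and are not taken from the paper; in a real setup most of them would be applied inside the simulator, which is not modelled here. Only the image-level effects (brightness and noise) are applied directly.

```python
import numpy as np

# Hypothetical randomisation ranges, for illustration only.
RANGES = {
    "light_intensity": (0.3, 1.5),     # brightness multiplier
    "camera_shift_m": (-0.01, 0.01),   # camera position perturbation (simulator-side)
    "object_hue": (0.0, 1.0),          # object colour (simulator-side)
    "image_noise_std": (0.0, 0.05),    # additive Gaussian pixel noise
}

def sample_domain(rng):
    """Draw one set of randomised parameters, uniformly within each range."""
    return {name: float(rng.uniform(lo, hi)) for name, (lo, hi) in RANGES.items()}

def randomise_image(image, params, rng):
    """Apply brightness scaling and Gaussian noise to an image with values in [0, 1]."""
    noise = rng.normal(0.0, params["image_noise_std"], image.shape)
    return np.clip(image * params["light_intensity"] + noise, 0.0, 1.0)
```

A training loop would typically call `sample_domain` once per episode, so that the policy never sees the same visual conditions twice and has to rely on task-relevant features.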


Transferring the Policy to the Real World






Sub-millimetre Precision and Multi-stage Control

Because network training focuses on a small region of space and uses end-effector commands, the network does not need to be globally aware of the task space, and can instead concentrate on achieving the task with high precision while overcoming the reality gap.
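The closed-loop, relative nature of such end-effector commands can be sketched as follows. This is a toy loop under assumptions of our own: the function name `run_fine_controller`, the per-step clipping limit, and the stopping rule (stop when the commanded displacement becomes negligible) are hypothetical, and `policy` and `observe` stand in for the learned network and the camera.

```python
import numpy as np

def run_fine_controller(policy, observe, start, max_steps=100,
                        max_delta=0.002, tol=1e-5):
    """Repeatedly observe, query the policy, and apply a small relative move.

    policy:  maps an observation to a 3D end-effector displacement (metres).
    observe: maps the current end-effector position to an observation.
    start:   position at the bottleneck pose, where the coarse phase ended.
    """
    position = np.asarray(start, dtype=float).copy()
    for _ in range(max_steps):
        # Clip each axis so that a single step never exceeds max_delta.
        delta = np.clip(policy(observe(position)), -max_delta, max_delta)
        if np.linalg.norm(delta) < tol:   # assumed stopping rule
            break
        position = position + delta
    return position
```

Because every command is a displacement relative to the current pose, the policy only ever needs to reason about the small region around the bottleneck, wherever the object happens to be in the workspace.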

Our tasks are designed to test the capabilities of our framework, both in achieving sub-millimetre precise control (top row on the left) and in executing complex, multi-stage policies (bottom row on the left). Our experiments show that our framework can satisfy these requirements while also achieving wide task space generalisation, and consistently outperforms simpler baselines.



Robustness and Stress Testing

To test the robustness of our method under particularly challenging conditions, we also stress-test our controllers.

Through our experiments, we show that our controllers are robust to strong lighting conditions, background distractors, and variable execution times.


Available Input Modalities in Simulation and the Real World

Input Modality and Image Representation Studies

On top of developing our framework, we also conduct a set of experiments studying interesting aspects of the sim-to-real problem.

More specifically, we investigate the following questions: (1) How effective are the different input modalities available on a typical assisted stereo depth camera (see left) for highly precise sim-to-real transfer? (2) Are network architectures based on keypoint image representations better suited to precise sim-to-real control, or are standard convolutional features just as effective?

Benchmarking Results

To answer these questions, we conduct a series of experiments of increasing difficulty (see right), and compare how different input modalities and network architectures perform on those tasks.

Through our experiments, we find that:

  • Keypoint representations induce a significant jump in performance compared to standard convolutional features.

  • RGB image inputs are superior and sufficient, while depth maps do not offer a satisfactory level of performance.
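For readers unfamiliar with keypoint representations, one common way to turn convolutional feature maps into keypoints is a spatial soft-argmax, sketched below in NumPy. Whether this matches the exact architecture used in the paper is an assumption on our part; the function name and the `temperature` parameter are illustrative.

```python
import numpy as np

def spatial_soft_argmax(features, temperature=1.0):
    """Convert feature maps (C, H, W) into C keypoints (x, y) in [-1, 1].

    Each channel is turned into a probability map with a softmax over all
    pixels; the keypoint is the expected pixel coordinate under that map.
    """
    c, h, w = features.shape
    flat = features.reshape(c, -1) / temperature
    flat = flat - flat.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)
    probs = probs.reshape(c, h, w)
    xs = np.linspace(-1.0, 1.0, w)                    # column coordinates
    ys = np.linspace(-1.0, 1.0, h)                    # row coordinates
    kx = (probs.sum(axis=1) * xs).sum(axis=1)         # marginal over rows
    ky = (probs.sum(axis=2) * ys).sum(axis=1)         # marginal over columns
    return np.stack([kx, ky], axis=1)
```

The result is a low-dimensional vector of image coordinates rather than a dense feature map, which is one intuition for why such representations might transfer well from simulation to real images.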


Square peg insertion with increasing levels of difficulty for a granular comparison between the methods tested in our benchmarking experiments

For more information, please read the paper, and watch the video.