
Key Idea: Using 3D generative models to augment a single real-world demonstration into a rich “imagined” dataset, we train omnidirectional robot policies that generalise to unseen initial states and outperform alternative augmentation baselines across real tasks.
Abstract
Recent 3D generative models, which are capable of generating full object shapes from just a few images, now open up new opportunities in robotics. In this work, we show that 3D generative models can be used to augment a dataset from a single real-world demonstration, after which an omnidirectional policy can be learned within this imagined dataset. We found that this enables a robot to perform a task when initialised from states very far from those observed during the demonstration, including starting from the opposite side of the object to the real-world demonstration, significantly reducing the number of demonstrations required for policy learning. Through several real-world experiments across tasks such as grasping objects, opening a drawer, and placing trash into a bin, we study these omnidirectional policies by investigating the effect of various design choices on policy behaviour, and we show superior performance to recent baselines which use alternative methods for data augmentation.

One-Shot Imitation Learning through 3D Generative Models
We utilise 3D generative models for data augmentation to enable one-shot imitation learning from a single demonstration, with end-to-end policies operating from a wrist-mounted RGB camera. Given a single real-world demonstration, a full 3D model of the object is generated by the generative model, which enables 3D augmentation with an effectively unlimited number of “imagined” demonstrations via novel view rendering. The augmented dataset is then used to train an “omnidirectional” policy, enabling tasks to be executed from any initial state, even if that state is significantly different from any in the real-world demonstration. Leveraging the generalisation and high-quality novel view rendering capabilities of 3D generative models, our method can be applied to multiple everyday tasks.
Pipeline Overview

The OP-Gen pipeline begins with a single demonstration, from which posed images are sampled and fed into EscherNet for novel view synthesis (a). The resulting multi-view images are used to construct a NeRF for efficient rendering (b). The extracted 3D mesh of the target object enables our anchored trajectory generation module to create novel trajectories (c). Then we render new observations via the pre-built NeRF and assign corresponding actions (d). These are aggregated into an augmented dataset (e), used to train a diffusion policy (f), which is then deployed in real-world rollouts (g).
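For reference, the sketch below outlines how these stages might fit together in code. It is a minimal sketch under assumed interfaces: the demonstration object, the component callables, and the NeRF's render method are illustrative placeholders, not the released implementation.

```python
from typing import Any, Callable, List, Tuple

def build_omnidirectional_policy(
    demo: Any,                          # single demo: posed wrist-camera images + trajectory (assumed)
    synthesise_views: Callable,         # (a) EscherNet-style novel view synthesis (assumed interface)
    fit_nerf: Callable,                 # (b) multi-view images -> NeRF with a .render(camera_pose=...) method
    extract_mesh: Callable,             # (b) NeRF -> 3D mesh of the target object
    generate_trajectories: Callable,    # (c) (mesh, demo trajectory) -> anchored novel trajectories
    train_diffusion_policy: Callable,   # (f) (observation, action) pairs -> diffusion policy
) -> Any:
    """Minimal sketch of the OP-Gen stages (a)-(g); all interfaces are assumptions."""
    # (a) Synthesise novel views of the target object from the demo's posed images.
    novel_views = synthesise_views(demo.posed_images)

    # (b) Fit a NeRF on real + generated views for efficient rendering; extract the object mesh.
    nerf = fit_nerf(list(demo.posed_images) + list(novel_views))
    mesh = extract_mesh(nerf)

    # (c) Generate novel EEF trajectories anchored to the demonstrated one.
    trajectories = generate_trajectories(mesh, demo.trajectory)

    # (d)-(e) Render wrist-camera observations along each new trajectory and pair
    # them with the corresponding actions to form the augmented dataset.
    dataset: List[Tuple[Any, Any]] = [
        (nerf.render(camera_pose=pose), action)
        for traj in trajectories
        for pose, action in traj
    ]

    # (f) Train a diffusion policy on the augmented "imagined" dataset;
    # (g) the returned policy is then deployed in closed-loop real-world rollouts.
    return train_diffusion_policy(dataset)
```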
3D Generation Gallery
We use the 3D generative model EscherNet to synthesise the target object. The first row displays all the input images provided to the 3D generation module, while the second row shows the resulting generated object. As illustrated, even with limited input views, the 3D generative model is capable of producing plausible novel renderings, enabling our OP-Gen to augment the dataset in an omnidirectional manner.
Robot Videos
Real-world robot rollouts. The end-effector (EEF) starts from an arbitrary pose and executes the omnidirectional policy in a closed-loop manner. Even with human disturbances and distractors, the policy completes the tasks. The live images captured by the wrist camera are shown at the top right of each video.
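For concreteness, a minimal sketch of such a closed-loop rollout is shown below. The camera, policy, and robot interfaces are assumed placeholders, not part of any released codebase.

```python
def rollout(policy, camera, robot, max_steps: int = 200) -> None:
    """Minimal closed-loop rollout sketch; all interfaces are assumptions."""
    for _ in range(max_steps):
        rgb = camera.capture()        # live wrist-mounted RGB observation
        action = policy.predict(rgb)  # next EEF action from the trained policy
        robot.apply_action(action)    # execute the (relative) EEF motion / gripper command
        if robot.task_done():         # e.g. object grasped, drawer opened, trash placed
            break
```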
Can OP-Gen solve a range of everyday tasks from a single demonstration, and how does it compare to baselines?
We evaluate our data augmentation scheme on six real-world tasks, conducting 20 rollouts per task for our method and each of four baselines. For each rollout, the EEF starts from a different pose sampled from the workspace. Of these, 10 initial poses are sampled within a ±45° fan-shaped region around the object, with the central axis aligned to the first demo camera view; these are categorised as Narrow initial poses, while the remaining poses, sampled outside this region, are categorised as Omni initial poses.
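As an illustration of this categorisation, the sketch below labels a sampled initial pose as Narrow or Omni from its azimuth about the object relative to the first demo camera view. The geometry and names are assumptions for illustration, not the exact evaluation code.

```python
import numpy as np

def classify_initial_pose(eef_xy, object_xy, demo_cam_xy, half_angle_deg: float = 45.0) -> str:
    """Return 'Narrow' if the initial EEF pose lies within a +/-45 deg fan
    centred on the first demo camera view (about the object), else 'Omni'."""
    v_pose = np.asarray(eef_xy, dtype=float) - np.asarray(object_xy, dtype=float)
    v_demo = np.asarray(demo_cam_xy, dtype=float) - np.asarray(object_xy, dtype=float)
    cos = np.dot(v_pose, v_demo) / (np.linalg.norm(v_pose) * np.linalg.norm(v_demo))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return "Narrow" if angle <= half_angle_deg else "Omni"
```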
We compare our OP-Gen with the following baselines:
(1) No Aug: the single demonstration without augmentation. This serves as the lower bound of our method.
(2) OP-PCD: we reconstruct a partial point cloud of the target object using RGB-D images from the single demonstration. This highlights the benefit of complete 3D generation over partial reconstruction.
(3) SPARTN: we use our implementation of SPARTN, which builds a partial NeRF from camera views along the demonstration trajectory. This method provides plausible data augmentation near the demonstration views, but no additional information about the unseen parts of the object.
(4) Upper Bound (UB): we replace our 3D generation module with a full-scan NeRF of the object. Note that since we are investigating one-shot imitation learning, the demonstration views alone are inadequate for full NeRF training; we therefore treat this method as an upper bound rather than a fair baseline.
To highlight the data efficiency of our method, we also record the data collection time for each of the baseline methods.

Across the real-world experiments, performance is consistently higher in the Narrow setting than in the Omni setting. The no-augmentation baseline fails entirely and OP-PCD and SPARTN offer only modest gains in success rate, whereas our method markedly outperforms the baselines, approaches the upper bound, and remains robust to large viewpoint changes. These results show that 3D generative models are a powerful tool for data augmentation, strengthening behavioural cloning policies and reducing reliance on extensive real-world data.
How critical is 3D generation quality for OP-Gen?
To quantitatively evaluate the impact of 3D generation and reconstruction quality, we measure the average SSIM between renderings and ground-truth (GT) images along the full NeRF scan trajectory for each method.
The figure on the right shows the relationship between task success rate and SSIM across tasks and baselines. While higher image fidelity improves policy performance, consistency across viewpoints is even more crucial for omnidirectional policy learning.
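A minimal sketch of this SSIM evaluation is given below, using scikit-image's structural_similarity on paired rendered and ground-truth RGB frames; how frames are paired along the scan trajectory is an assumption, not the released evaluation code.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mean_ssim(renderings, ground_truths) -> float:
    """Average SSIM over paired rendered / ground-truth uint8 RGB frames."""
    scores = [
        structural_similarity(render, gt, channel_axis=-1, data_range=255)
        for render, gt in zip(renderings, ground_truths)
    ]
    return float(np.mean(scores))
```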

To visually evaluate the impact of the 3D generation model on policy performance, we present qualitative comparisons of novel view renderings produced by different baselines.






Novel view renderings from each method: (a) Ground Truth, (b) Upper Bound, (c) OP-Gen, (d) SPARTN, (e) OP-PCD.