Abstract
Video
We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences that emulate the demonstrator's behaviour. To do so, we transform visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). Despite being trained on language, these models excel at translating tokenised visual keypoint observations into action trajectories, performing on par with or better than state-of-the-art techniques in the low-data regime. Rather than operating in the language domain, KAT leverages text-based Transformers to operate in vision and action domains for efficient general imitation learning, indicating promising new avenues for repurposing natural language models for embodied tasks.
Key Idea We repurpose text-pretrained Transformers (LLMs) as sequence-to-sequence imitation learning machines, mapping visual inputs to action outputs via our proposed Keypoint Action Tokens framework.
Recording a Demo
In this video, we illustrate how demonstrations are recorded. Observations are translated into visual Keypoint Tokens, and the trajectory of end-effector poses is recorded as a series of Action Tokens: triplets of 3D points that uniquely define the end-effector's 6D pose. Each demonstration is then added to the textual prompt of the Language Model.
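As a rough illustration of this representation, the sketch below shows one way a 6D end-effector pose could be turned into a triplet of 3D points and serialised as text, with the same serialisation applied to the visual keypoints. The function names, the offset distance, the units, and the integer rounding here are our own assumptions for illustration, not the exact implementation.

```python
import numpy as np

def pose_to_action_token(T, offset=5.0):
    """Represent a 6D end-effector pose as a triplet of 3D points, serialised as text.

    T: 4x4 homogeneous transform of the end-effector (translation units assumed
    to be centimetres in this sketch). The triplet is the gripper origin plus two
    points offset along the gripper's local x and y axes, which together fix
    both position and orientation uniquely.
    """
    origin = T[:3, 3]
    p_x = origin + offset * T[:3, 0]   # point along the gripper's local x axis
    p_y = origin + offset * T[:3, 1]   # point along the gripper's local y axis
    points = np.stack([origin, p_x, p_y])
    # Serialise as short integer strings so the LLM tokeniser sees compact numbers.
    return " ".join(str(int(round(c))) for c in points.flatten())

def keypoints_to_token(keypoints_3d):
    """Serialise a list of 3D visual keypoints into a Keypoint Token string."""
    return " ".join(str(int(round(c))) for p in keypoints_3d for c in p)
```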
Test-Time Inference
Here, we show the robot's behaviour during testing, after Keypoint Action Tokens have been recorded for ~10 demos and added to the LLM's textual prompt. The Keypoint Tokens for the image observed at test time are then appended to the prompt, after which the model predicts the Action Tokens autoregressively to emulate the behaviour of the demonstrations.
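To make this inference loop concrete, here is a minimal sketch of how the prompt could be assembled and a text-completion model queried. The prompt layout, the `Input:`/`Output:` labels, and the `llm_complete` callable are hypothetical stand-ins for whichever LLM interface is used; only the overall scheme (demonstrations in-context, test keypoints appended, actions generated autoregressively) comes from the description above.

```python
def build_prompt(demos, test_keypoint_token):
    """Assemble the in-context prompt from recorded demonstrations.

    demos: list of (keypoint_token, action_token) string pairs, one per demo.
    test_keypoint_token: serialised keypoints observed at test time.
    """
    lines = []
    for obs, act in demos:
        lines.append(f"Input: {obs}")
        lines.append(f"Output: {act}")
    # The test observation is appended last, leaving the output for the LLM to complete.
    lines.append(f"Input: {test_keypoint_token}")
    lines.append("Output:")
    return "\n".join(lines)

def predict_action_tokens(demos, test_keypoint_token, llm_complete):
    """Query a text-completion LLM for the action trajectory.

    `llm_complete` is a hypothetical prompt -> completion callable; the model
    continues the prompt autoregressively, emitting Action Tokens for the
    test observation.
    """
    prompt = build_prompt(demos, test_keypoint_token)
    completion = llm_complete(prompt)
    return completion.strip()

# Example usage (given some llm_complete implementation):
#   action_tokens = predict_action_tokens(recorded_demos, current_keypoints, llm_complete)
```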
Videos of Tasks
We now illustrate how KAT can solve a series of everyday tasks. Each task shown here was provided with 10 demonstrations, after which it can be solved with the objects in novel configurations. The first two videos also show that KAT is robust to visual distractors and background changes.
Quantitative Results
Please see our paper for full results, but the two graphs below summarise some of our experiments.
On the left, we show that Keypoint Action Tokens (KAT) outperforms Diffusion Policies, and is comparable to our own variant of Diffusion Policies that uses our proposed Keypoint Action Tokens to represent observations and actions (KeyAct-DP). An important conclusion is that, when the number of demonstrations is small, in-context learning with an LLM performs very well, whereas when the number of demonstrations is large, explicit training on that data, such as with Diffusion Policies, performs better. However, compared to Diffusion Policies trained on images (rather than keypoints), KAT performs better even with a large number of demonstrations.
On the right, we show that KAT's performance improves as the underlying LLM improves. This suggests that advances in LLMs will continue to translate into improvements in robotics "for free": KAT repurposes LLMs for imitation learning even though they were never explicitly trained on robotics data. We therefore expect KAT and similar methods to keep improving over the coming years.