=== ABSTRACT ===
Large Language Models (LLMs) have recently shown promise as high-level planners for robots when given access to a selection of low-level skills. However, it is often assumed that LLMs do not possess sufficient knowledge to be used for the low-level trajectories themselves. In this work, we address this assumption thoroughly, and investigate whether an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation skills, when given access to only object detection and segmentation vision models. We study how well a single task-agnostic prompt, without any in-context examples, motion primitives, or external trajectory optimisers, can perform across 26 real-world language-based tasks, such as "open the bottle cap" and "wipe the plate with the sponge", and we investigate which design choices in this prompt are the most effective. Our conclusions raise the assumed limit of LLMs for robotics, and we reveal for the first time that LLMs do indeed possess an understanding of low-level robot control sufficient for a range of common tasks, and that they can additionally detect failures and then re-plan trajectories accordingly.
=== PROBLEM FORMULATION ===
The core motivation of our work is to investigate whether LLMs can inherently guide robots by predicting a dense sequence of end-effector poses, with minimal dependence on specialised external models and components.
We therefore design a task-agnostic prompt to study the zero-shot control capabilities of LLMs, with the following assumptions:
>=> (1) no pre-existing motion primitives, policies or trajectory optimisers;
>=> (2) no in-context examples;
>=> (3) the LLM can query a pre-trained vision model to obtain information about the scene;
>=> (4) no additional pre-training or fine-tuning on robotics-specific data.
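To make "a dense sequence of end-effector poses" concrete, the following is a minimal sketch of one possible representation. The `Pose` fields (position plus a single yaw angle) and the `interpolate` helper are assumptions chosen for illustration; they are not taken from the prompt or from the LLM's outputs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pose:
    """Hypothetical end-effector pose: position in metres, yaw in radians."""
    x: float
    y: float
    z: float
    yaw: float


def interpolate(start: Pose, end: Pose, num_points: int) -> List[Pose]:
    """Return num_points poses spaced evenly on the straight line start -> end."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    poses = []
    for i in range(num_points):
        t = i / (num_points - 1)  # 0.0 at the start pose, 1.0 at the end pose
        poses.append(Pose(
            start.x + t * (end.x - start.x),
            start.y + t * (end.y - start.y),
            start.z + t * (end.z - start.z),
            start.yaw + t * (end.yaw - start.yaw),
        ))
    return poses


# Example: an 11-point horizontal move of 10 cm at a height of 20 cm.
trajectory = interpolate(Pose(0.0, 0.0, 0.2, 0.0), Pose(0.1, 0.0, 0.2, 0.0), 11)
```

Under assumption (1) above, a helper like this would have to be written by the LLM itself as part of its output rather than supplied as a primitive.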
=== TASK EXECUTION ===
Below are some examples of the robot executing trajectories generated by the LLM in the real world, with the full main prompt. Note that the robot executes the tasks from its point of view, and so left and right are swapped with respect to the point of view of the video. All videos are played at 1x speed.
pick the chip bag which is to the right of the can
place the apple in the bowl
move the lonely object to the others
shake the mustard bottle
knock over the left bottle
push the can towards the right
pick the fruit in the middle
draw a five-pointed star 10cm wide on the table with a pen
wipe the plate with the sponge
open the bottle cap
=== TASK SUCCESS DETECTION AND RE-PLANNING ===
Here, we show an example of the LLM recognising task failure and proposing a new trajectory. For the "pick up the bowl" task, the LLM initially attempts to grasp the bowl at its centroid. Given the poses of the bowl over the duration of the task execution, represented as numerical values, the LLM detects that the task was not successful and proposes a new sequence of end-effector poses. On its third attempt, after re-planning a second time, it successfully grasps the bowl. We therefore demonstrate that LLMs possess not only the ability to generate trajectories, but also to discern whether they represent successful or unsuccessful episodes, and to re-plan an alternative trajectory if necessary.
re-planning for "pick up the bowl"
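The detect-and-re-plan loop described above could be organised roughly as follows. This is only a sketch under stated assumptions: `format_positions`, `run_with_replanning`, and the `plan`, `execute` and `judge` callables are hypothetical names standing in for the LLM planning call, the robot execution with object tracking, and the LLM success-detection call.

```python
from typing import Callable, Optional, Sequence, Tuple

Position = Tuple[float, float, float]


def format_positions(object_name: str, positions: Sequence[Position]) -> str:
    """Render tracked object positions as plain numbers for a text prompt."""
    lines = [f"Positions of the {object_name} during execution (x, y, z in metres):"]
    for index, (x, y, z) in enumerate(positions):
        lines.append(f"t={index}: [{x:.3f}, {y:.3f}, {z:.3f}]")
    return "\n".join(lines)


def run_with_replanning(
    object_name: str,
    plan: Callable[[Optional[str]], object],          # hypothetical: LLM proposes a trajectory
    execute: Callable[[object], Sequence[Position]],  # hypothetical: robot runs it, tracker reports
    judge: Callable[[str], bool],                     # hypothetical: LLM decides success
    max_attempts: int = 3,
) -> Optional[int]:
    """Return the number of the first successful attempt, or None if all fail."""
    feedback = None  # the first plan is made without any failure information
    for attempt in range(1, max_attempts + 1):
        trajectory = plan(feedback)
        positions = execute(trajectory)
        summary = format_positions(object_name, positions)
        if judge(summary):
            return attempt
        feedback = summary  # pass the failed episode back for re-planning
    return None
```

In the "pick up the bowl" example, a loop of this shape would return on the third attempt.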
=== PIPELINE OVERVIEW ===
Our proposed pipeline is as follows:
>=> (1) the main prompt along with the task instruction is provided to the LLM;
>=> (2) the LLM generates high-level natural language reasoning steps before outputting Python code;
>=> (3) the code can interface with a pre-trained object detection model and execute the generated trajectories on the robot;
>=> (4) after task execution, an off-the-shelf object tracking model is used to obtain 3-D bounding boxes of the previously detected objects over the duration of the task, which are then provided to the LLM as numerical values to detect whether the task was executed successfully or not.
pipeline overview
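Steps (1) to (3) can be sketched as below. The prompt layout, the assumption that the LLM wraps its code in a fenced `python` block, and the function names `detect_object` and `execute_trajectory` are illustrative assumptions, not the interface used in this work. Note also that `exec` runs the generated code without any sandboxing, which a real system would need to consider.

```python
import re
from typing import Callable, Dict, List

FENCE = "`" * 3  # markdown code fence, assumed to delimit code in the LLM response
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def build_prompt(main_prompt: str, instruction: str) -> str:
    """Step 1: combine the task-agnostic main prompt with the task instruction."""
    return f"{main_prompt}\n\nTask: {instruction}"


def extract_code(response: str) -> List[str]:
    """Step 2: keep only the Python code, dropping the natural language reasoning."""
    return [block.strip() for block in CODE_BLOCK.findall(response)]


def run_generated_code(response: str, api: Dict[str, Callable]) -> None:
    """Step 3: run the generated code with access to the vision and robot functions."""
    namespace = dict(api)  # e.g. {"detect_object": ..., "execute_trajectory": ...}
    for block in extract_code(response):
        exec(block, namespace)  # unsandboxed; acceptable only in a sketch
```

Sharing one namespace across blocks lets variables defined in an earlier code block, such as a detected object position, be reused by later blocks in the same response.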
=== FAILURE CASES ===
We attribute the failure cases to the following five sources of error:
>=> (1) gripper pose prediction error;
>=> (2) task planning error;
>=> (3) trajectory generation function definition error;
>=> (4) object detection error;
>=> (5) camera calibration error.
Some examples of task execution failures can be seen below, with the corresponding source(s) of error in brackets.
knock over the left bottle (LLM failed to predict the correct grasp height, error 1; bounding box height not calculated correctly due to noisy camera data, error 5)
move the can to the bottom of the table (LLM failed to predict the correct grasp height, error 1)
move the banana near the pear (LLM failed to predict the correct grasp orientation, error 1; LLM failed to plan the correct open and close gripper steps, error 2)
move the lonely object to the others (LLM failed to query the correct object for detection, error 2; object detector failed to detect "lonely object", error 4)
wipe the table with the sponge, while avoiding the plate on the table (LLM failed to plan a step to lower the gripper to wipe the table, error 2)
draw a five-pointed star 10cm wide on the table with a pen (LLM failed to generate the correct function to draw the star, error 3)
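For context on error source (3), the sketch below shows what a correct trajectory generation function for the star-drawing task might look like. It is not the LLM's output; the function name, the choice of an upright pentagram, and the number of points per edge are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def star_waypoints(center: Point, width: float, points_per_edge: int = 20) -> List[Point]:
    """Dense 2-D waypoints tracing an upright five-pointed star of the given width."""
    # The widest points of an upright pentagram are the vertices at 18 and 162 degrees,
    # so width = 2 * radius * cos(18 degrees).
    radius = width / (2.0 * math.cos(math.radians(18.0)))
    vertices = [
        (center[0] + radius * math.cos(math.radians(90.0 + 72.0 * k)),
         center[1] + radius * math.sin(math.radians(90.0 + 72.0 * k)))
        for k in range(5)
    ]
    order = [0, 2, 4, 1, 3, 0]  # visiting every second vertex draws the star in one stroke
    path = []
    for a, b in zip(order, order[1:]):
        (x0, y0), (x1, y1) = vertices[a], vertices[b]
        for i in range(points_per_edge):
            t = i / points_per_edge
            path.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    path.append(vertices[0])  # close the shape at the starting vertex
    return path


# Example: a star 10 cm wide centred at the origin of the table plane.
path = star_waypoints((0.0, 0.0), 0.10)
```

A full trajectory would additionally need a pen-down height, an approach motion and a lift-off at the end, which are omitted here.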
=== PROMPTS AND SAMPLE LLM OUTPUTS ===
We present the prompts used for our investigation, as well as sample LLM outputs. Note that all the prompts used (for trajectory planning, success detection and task re-planning) are task-agnostic, and they can be populated automatically depending on the task.
For the ablation studies, the changes made to the main prompt can be viewed by clicking on the corresponding commit message. The same prompts are also shown below with highlighted text for easier visualisation.