Zhenyu Wu¹, Ziwei Wang²,³, Xiuwei Xu²,³, Jiwen Lu²,³, Haibin Yan¹*
¹School of Automation, Beijing University of Posts and Telecommunications, China
²Department of Automation, Tsinghua University, China
³Beijing National Research Center for Information Science and Technology, China
[Paper] [Code] [Demo] [Model]
Equipping embodied agents with commonsense is important for robots to successfully complete complex human instructions in general environments. Recent large language models (LLMs) embed rich semantic knowledge that an agent can use to generate plans for complex tasks, but they lack information about the physical world and often predict infeasible action sequences. In this paper, we propose a TAsk Planning Agent (TaPA) for grounded planning in embodied tasks under physical scene constraints, where the agent generates executable plans according to the objects that actually exist in the scene by aligning LLMs with visual perception models. Specifically, we first construct a multimodal dataset containing triplets of indoor scenes, instructions, and action plans, where we provide designed prompts and the names of objects in each scene for GPT-3.5 to generate a large number of instructions and corresponding planned action steps. The generated data is leveraged for grounded plan tuning of pre-trained LLMs. During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected at different reachable locations. Experimental results show that the plans generated by our TaPA framework achieve a higher success rate than those of LLaVA and GPT-3.5 by a sizable margin, which indicates the practicality of embodied task planning in general and complex environments.
Although multimodal VLMs have achieved impressive performance across a wide range of fields, embodied task planning remains challenging due to: 1) the lack of relevant datasets; and 2) the requirement of simultaneous scene understanding and reasoning. Considering the recent success of GPT models on high-level, human-like reasoning, we propose to represent embodied scenes with text and leverage ChatGPT/GPT-4 for data generation.
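To make this data-generation step concrete, below is a minimal Python sketch of prompting GPT-3.5 with a scene's object list to obtain an instruction and its planned action steps. The prompt wording, the `generate_triplet` helper, and the example object list are illustrative assumptions, not the released TaPA pipeline.

```python
# Minimal sketch of the dataset-generation step: prompt GPT-3.5 with the
# object list of an indoor scene and ask it to propose an instruction plus
# an executable action plan. Prompt wording and parsing are illustrative
# assumptions rather than the released TaPA code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "You are an indoor service robot. The room contains the following "
    "objects: {objects}. Propose one realistic human instruction and list "
    "the action steps needed to complete it, using only the listed objects."
)

def generate_triplet(object_list):
    """Return instruction-and-plan text for one scene's object list."""
    prompt = PROMPT_TEMPLATE.format(objects=", ".join(object_list))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Hypothetical object list extracted from one kitchen scene.
    scene_objects = ["fridge", "microwave", "apple", "knife", "counter top"]
    print(generate_triplet(scene_objects))
```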
Illustration of TaPA. We first collect multiple RGB images at different achievable standing points and views, and utilize an open-vocabulary detector to generate the list of objects existing in the scene. Given the human instruction and the predicted object list, our TaPA generates executable action plans for subsequent navigation or manipulation robots.
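To illustrate the perception step at inference time, the following sketch merges per-view detections into a single scene-level object list. Here `detect_objects` is a hypothetical placeholder for any open-vocabulary detector, and the score threshold and voting rule are assumed values, not settings from the paper.

```python
# Sketch of the inference-time perception step: run an open-vocabulary
# detector on RGB images captured at several standing points / view angles
# and merge the per-view detections into one deduplicated object list.
from collections import Counter
from typing import Iterable, List, Tuple

import numpy as np


def detect_objects(image: np.ndarray) -> List[Tuple[str, float]]:
    """Placeholder for an open-vocabulary detector returning (label, score)."""
    raise NotImplementedError("plug in your detector here")


def aggregate_object_list(images: Iterable[np.ndarray],
                          score_threshold: float = 0.3,
                          min_views: int = 1) -> List[str]:
    """Merge per-view detections into a scene-level object list."""
    votes = Counter()
    for image in images:
        labels_in_view = {label for label, score in detect_objects(image)
                          if score >= score_threshold}
        votes.update(labels_in_view)  # count each label at most once per view
    # Keep labels seen in at least `min_views` views to suppress spurious hits.
    return sorted(label for label, count in votes.items() if count >= min_views)
```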
We conduct extensive experiments on our generated multimodal dataset, in which the visual scenes come from the AI2-THOR simulator.
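As an example of how the multi-view images could be collected in AI2-THOR, the sketch below teleports the agent to a subset of reachable positions and rotates it to capture RGB frames. The scene name, position stride, and 90-degree rotations are assumptions for illustration, not the exact protocol of the paper.

```python
# Sketch of collecting multi-view RGB images in AI2-THOR: teleport the agent
# to sampled reachable positions and capture frames at four view angles.
from ai2thor.controller import Controller


def collect_multiview_images(scene="FloorPlan1", position_stride=10):
    controller = Controller(scene=scene)
    # Query the reachable standing points of the scene.
    event = controller.step(action="GetReachablePositions")
    positions = event.metadata["actionReturn"][::position_stride]

    frames = []
    for pos in positions:
        controller.step(action="Teleport", position=pos)
        for _ in range(4):                      # look around in 90-degree steps
            event = controller.step(action="RotateRight", degrees=90)
            frames.append(event.frame)          # RGB image as a numpy array
    controller.stop()
    return frames
```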
Comparison of different LLMs and LMMs on embodied task planning. Regarding the prompts of the baseline methods, LLaMA and LLaVA both employ the same prompt as TaPA during the training phase, while GPT-3.5 adopts the same prompt that TaPA uses for multimodal data generation.
The percentage of different failure cases in embodied task planning for different action-step generation methods. Our method clearly outperforms the compared LLMs on both metrics in the figure.
Results of different baselines parsing abstract human instructions. LLaMA and GPT-3.5 take the perception results (predicted object lists) as input, TaPA takes the series of surround-view images as input, and LLaVA takes only a single image. Object List denotes the objects labeled by AI2-THOR in that scene (ground truth).