Zhenyu Wu1
Ziwei Wang2
Xiuwei Xu3
Jiwen Lu3
Haibin Yan1†
1Beijing University of Posts and Telecommunications 2Carnegie Mellon University 3Tsinghua University
Paper (arXiv)
Code (Coming Soon)
Enabling embodied agents to complete complex human instructions given in natural language is crucial for autonomous systems in household services. Conventional methods can only accomplish human instructions in known environments where all interactive objects are provided to the embodied agent, and directly deploying existing approaches in unknown environments usually generates infeasible plans that manipulate non-existent objects. In contrast, we propose an embodied instruction following (EIF) method for complex tasks in unknown environments, where the agent efficiently explores the unknown environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework consisting of a high-level task planner and a low-level exploration controller, both built on multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to represent the known visual clues, aligning the goals of task planning and scene exploration with the human instruction. The task planner generates feasible step-by-step plans to accomplish the human goal according to the task completion progress and the known visual clues. The exploration controller predicts the optimal navigation or object interaction policy based on the generated step-wise plans and the known visual clues. The experimental results demonstrate that our method achieves a 45.09% success rate on 204 complex human instructions such as making breakfast and tidying rooms in large house-level scenes.
Overview of our approach. The scene feature map is constructed from real-time RGB-D images and serves as visual clues for the high-level planner and the low-level controller. The planner generates the step-wise plans, which the controller uses to predict specific actions. The optimal border between unknown and known regions is selected for scene exploration, and the scene feature map is updated with the visual clues observed during exploration.
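To make the loop in the overview concrete, below is a minimal sketch of how the scene feature map, high-level planner, and low-level controller could interact. All class names, method signatures, and the toy planning logic (SceneFeatureMap, TaskPlanner, ExplorationController, run_episode, the egg-finding example) are hypothetical illustrations, not the authors' released API; the real system queries multimodal large language models and a learned controller.

```python
# Hypothetical sketch of the hierarchical plan-and-explore loop described above.
from dataclasses import dataclass, field


@dataclass
class SceneFeatureMap:
    """Accumulates visual clues (here, observed object names) from RGB-D observations."""
    known_objects: set = field(default_factory=set)

    def update(self, observation: dict) -> None:
        # The real system fuses RGB-D features into a semantic map with
        # dynamic region attention; here we only record detected object names.
        self.known_objects |= set(observation.get("objects", []))


class TaskPlanner:
    """High-level planner: proposes the next step from the instruction and visual clues."""

    def next_step(self, instruction: str, scene: SceneFeatureMap, done_steps: list) -> str:
        # Placeholder for a multimodal LLM call conditioned on task progress
        # and the known visual clues; returns a feasible step or "done".
        if "egg" in scene.known_objects:
            return "pick up the egg" if "pick up the egg" not in done_steps else "done"
        return "explore to find the egg"


class ExplorationController:
    """Low-level controller: turns a step-wise plan into a navigation or interaction action."""

    def act(self, step: str, scene: SceneFeatureMap) -> str:
        if step.startswith("explore"):
            # Move toward a border (frontier) between known and unknown regions.
            return "navigate_to_frontier"
        return f"interact: {step}"


def run_episode(instruction: str, max_steps: int = 10) -> list:
    scene = SceneFeatureMap()
    planner, controller = TaskPlanner(), ExplorationController()
    done_steps, actions = [], []
    for _ in range(max_steps):
        # Simulated RGB-D observation; a real agent would receive sensor data here.
        scene.update({"objects": ["fridge"] if not actions else ["fridge", "egg"]})
        step = planner.next_step(instruction, scene, done_steps)
        if step == "done":
            break
        actions.append(controller.act(step, scene))
        done_steps.append(step)
    return actions


if __name__ == "__main__":
    print(run_episode("make breakfast with an egg"))
```

In this sketch the map is updated after every observation, so later planner queries see newly discovered objects, which is what aligns exploration with the instruction in the framework above.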
We demonstrate that the proposed hierarchical planning framework enables robots to perform long-horizon sequential tasks (e.g., operating a microwave oven). By generating step-by-step plans, the robot can track its execution progress from contextual information.
In unknown environments, our framework enables robots to actively interact with the environment to gather more visual information. The robot generates the exploration action of opening the fridge to look for eggs instead of navigating randomly.
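When no informative object is in view, exploration falls back on visiting the border between known and unknown regions mentioned in the overview. Below is a small, assumption-laden sketch of frontier selection on a 2D occupancy grid: the functions, grid encoding, and the nearest-frontier criterion are illustrative placeholders, whereas the paper's controller selects the border using the step-wise plan and visual clues rather than pure distance.

```python
# Hypothetical frontier selection on a 2D occupancy grid (not the authors' implementation).
import numpy as np

FREE, UNKNOWN, OCCUPIED = 0, -1, 1


def find_frontiers(grid: np.ndarray) -> np.ndarray:
    """Return (row, col) cells that are free and 4-adjacent to an unknown cell."""
    free = grid == FREE
    unknown = grid == UNKNOWN
    neighbor_unknown = np.zeros_like(unknown)
    neighbor_unknown[1:, :] |= unknown[:-1, :]
    neighbor_unknown[:-1, :] |= unknown[1:, :]
    neighbor_unknown[:, 1:] |= unknown[:, :-1]
    neighbor_unknown[:, :-1] |= unknown[:, 1:]
    return np.argwhere(free & neighbor_unknown)


def nearest_frontier(grid: np.ndarray, agent_pos: tuple) -> tuple | None:
    """Pick the closest frontier cell to the agent; None if the map is fully explored."""
    frontiers = find_frontiers(grid)
    if len(frontiers) == 0:
        return None
    dists = np.linalg.norm(frontiers - np.array(agent_pos), axis=1)
    return tuple(frontiers[np.argmin(dists)])


if __name__ == "__main__":
    grid = np.full((6, 6), UNKNOWN)
    grid[:3, :3] = FREE          # explored free space around the agent
    grid[2, 2] = OCCUPIED        # an obstacle inside the explored region
    print(nearest_frontier(grid, agent_pos=(0, 0)))
```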