Embodied Instruction Following in Unknown Environments

arXiv 2024


Zhenyu Wu1   Ziwei Wang2   Xiuwei Xu3   Jiwen Lu3   Haibin Yan1†

1Beijing University of Posts and Telecommunications  2Carnegie Mellon University  3Tsinghua University


Paper (arXiv)   |   Code (Coming Soon)

Abstract


Enabling embodied agents to complete complex human instructions expressed in natural language is crucial for autonomous systems in household services. Conventional methods can only accomplish human instructions in known environments where all interactive objects are provided to the embodied agent, and directly deploying these approaches in unknown environments usually generates infeasible plans that manipulate non-existing objects. In contrast, we propose an embodied instruction following (EIF) method for complex tasks in unknown environments, where the agent efficiently explores the unknown environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework consisting of a high-level task planner and a low-level exploration controller based on multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to represent the known visual clues, so that the goals of task planning and scene exploration are aligned with the human instruction. For the task planner, feasible step-by-step plans for accomplishing the human goal are generated according to the task completion process and the known visual clues. For the exploration controller, the optimal navigation or object interaction policy is predicted based on the generated step-wise plans and the known visual clues. Experimental results demonstrate that our method achieves a 45.09% success rate on 204 complex human instructions, such as making breakfast and tidying rooms, in large house-level scenes.


Approach


Overview of our approach. A scene feature map is constructed from real-time RGB-D images and serves as visual clues for the high-level planner and the low-level controller. The planner generates step-wise plans, which the controller leverages to predict specific actions. The optimal border between the unknown and known regions is selected for scene exploration, and the scene feature map is updated with the visual clues observed during exploration.
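
To make the loop concrete, below is a minimal Python sketch of the planner-controller interaction described above. SceneFeatureMap, query_planner, and query_controller are hypothetical stand-ins for the paper's MLLM-based components and map representation, not the authors' actual API; the placeholder policies exist only so the sketch runs end to end.

from dataclasses import dataclass, field


@dataclass
class SceneFeatureMap:
    """Semantic map accumulating visual clues from RGB-D observations (toy version)."""
    known_objects: set = field(default_factory=set)

    def update(self, observation: dict) -> None:
        # Fuse newly seen objects into the known-region representation.
        self.known_objects |= set(observation.get("objects", []))


def query_planner(instruction: str, scene: SceneFeatureMap, done: list) -> str:
    """High-level planner (an MLLM in the paper): emit the next step-wise plan
    conditioned on the instruction, the known visual clues, and the completed
    steps in `done` (the task completion process)."""
    # Placeholder policy: explore until a required object becomes a known clue.
    if "egg" not in scene.known_objects:
        return "explore: find egg"
    return "interact: pick up egg"


def query_controller(plan: str, scene: SceneFeatureMap) -> dict:
    """Low-level controller: map a step-wise plan to a navigation or
    interaction action and return a simulated observation."""
    if plan.startswith("explore"):
        # Move toward the selected border between known and unknown regions.
        return {"action": "navigate_to_frontier", "objects": ["fridge", "egg"]}
    return {"action": plan.split(": ")[1], "objects": []}


def run_episode(instruction: str, max_steps: int = 10) -> list:
    scene, done = SceneFeatureMap(), []
    for _ in range(max_steps):
        plan = query_planner(instruction, scene, done)
        observation = query_controller(plan, scene)
        scene.update(observation)   # refresh the visual clues after each action
        done.append(plan)
        if plan.startswith("interact"):
            break                    # goal-directed manipulation reached
    return done


print(run_episode("make breakfast with an egg"))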


Highlights


Long-sequence Task Planning

We demonstrate that the proposed hierarchical planning framework enables robots to perform long-horizon sequential tasks (e.g., operating a microwave oven). By generating step-by-step plans, the robot can track its execution progress from contextual information.
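
As an illustration, the sketch below shows one way completed steps can be fed back to the planner as context so it can infer how far the task has progressed. The prompt template, step names, and build_planner_prompt helper are our own assumptions for illustration, not the paper's exact prompting format.

def build_planner_prompt(instruction: str, completed_steps: list,
                         visual_clues: list) -> str:
    # Completed steps become context, so the planner can resume mid-task.
    lines = [
        f"Instruction: {instruction}",
        f"Visible objects: {', '.join(visual_clues)}",
        "Completed steps:",
        *[f"  {i + 1}. {s}" for i, s in enumerate(completed_steps)],
        "Next step:",
    ]
    return "\n".join(lines)


prompt = build_planner_prompt(
    "heat the egg in the microwave",
    ["pick up egg", "open microwave", "put egg in microwave"],
    ["microwave", "egg", "counter"],
)
print(prompt)  # a planner MLLM would infer, e.g., 'close microwave' next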

Active Interactive Exploration

In unknown environments, our framework enables robots to actively interact with the environment to acquire more visual information. For example, the robot generates the exploration action of opening a fridge to look for eggs instead of navigating randomly.
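
The toy sketch below illustrates this decision: when the target object is not yet visible, a container prior suggests opening a likely receptacle before falling back to frontier navigation. The CONTAINER_PRIOR table and the action names are illustrative assumptions, not the paper's learned exploration policy.

CONTAINER_PRIOR = {  # hypothetical prior over which container holds a target
    "egg": ["fridge"],
    "fork": ["drawer"],
}


def next_exploration_action(target: str, visible: set) -> str:
    if target in visible:
        return f"pick_up {target}"
    # Prefer interactively opening a visible container likely to hold the target.
    for container in CONTAINER_PRIOR.get(target, []):
        if container in visible:
            return f"open {container}"
    # Otherwise fall back to frontier-based navigation of the unknown region.
    return "navigate_to_frontier"


print(next_exploration_action("egg", {"fridge", "counter"}))  # -> open fridge
print(next_exploration_action("egg", {"sofa", "tv"}))         # -> navigate_to_frontier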

Bibtex


@article{wu2024embodied,
  title={Embodied Instruction Following in Unknown Environments},
  author={Wu, Zhenyu and Wang, Ziwei and Xu, Xiuwei and Lu, Jiwen and Yan, Haibin},
  journal={arXiv preprint arXiv:2406.11818},
  year={2024}
}


© Zhenyu Wu | Last update: June 17, 2024