Towards Intelligent and Generalisable Agents for Vision-and-Language Navigation
30 September 2025 1:00pm
The School of EECS is hosting the following HDR Progress Review 3 Confirmation Seminar:
Towards Intelligent and Generalisable Agents for Vision-and-Language Navigation
Speaker: Haodong Hong
Host: A/Prof Sen Wang
Abstract: Vision-and-Language Navigation (VLN) is a cornerstone of embodied AI, aiming to develop agents that can interpret natural language instructions and navigate complex environments. Despite notable progress, existing formulations often operate under simplified assumptions that diverge from real-world conditions, and current tasks face four key limitations: (1) reliance on abstract text-only instructions; (2) assumption of fixed, obstruction-free navigation graphs; (3) evaluation through one-time executions without continuous adaptation; and (4) neglect of earlier embodied stages such as exploration and representation construction. These limitations hinder practical deployment, where agents must achieve multimodal grounding, robustness to dynamic environments, adaptability in persistent contexts, and reliable mapping and planning after exploration.
This thesis addresses these issues through four directions. First, to mitigate the ambiguity of text-only instructions, I propose Vision-and-Language Navigation with Multi-modal Prompts (VLN-MP), which augments textual guidance with image-based prompts. VLN-MP maintains backward compatibility with text-only inputs while delivering consistent gains on benchmarks with multi-modal prompts, demonstrating the value of visual cues in strengthening grounding and reducing ambiguity. Second, to account for real-world obstacles, I introduce R2R-UNO, a dataset that uses inpainting models to incorporate diverse obstructions such as closed doors and blocked paths. To handle the resulting instruction–reality mismatches, I propose ObVLN, a method that combines curriculum training with virtual graph construction, achieving robust performance in both obstructed and unobstructed scenarios. Third, recognizing that practical agents often operate in persistent environments, I formulate the General Scene Adaptation (GSA) task for VLN, which emphasizes continuous adaptation to consistent layouts and diverse instruction styles. To support this setting, I construct GSA-R2R, an extended dataset with both in-distribution and out-of-distribution splits, and propose Graph-Retained DUET, a method that combines memory-based navigation graphs with environment-specific training, establishing state-of-the-art results. Fourth, to unify fragmented embodied AI tasks, I present ERNav, the first benchmark that integrates exploration, representation, and navigation into a single pipeline. ERNav challenges agents to explore large-scale buildings, build reliable representations, and reason over complex navigation instructions. Alongside it, I propose 3D-LangNav, a strong baseline with dual-sighted exploration and LLM-based reasoning that significantly surpasses existing methods and offers a novel perspective on VLN through 3D scene understanding.
Together, these contributions advance embodied navigation by addressing ambiguity in instruction grounding, robustness under environmental uncertainty, adaptability in persistent contexts, and scalability to unified embodied pipelines. I conclude by reflecting on remaining challenges, such as long-horizon reasoning, multimodal representation learning, and real-world deployment, and by outlining promising directions toward intelligent and generalisable navigation agents in real-world environments.
Bio: Mr. Haodong Hong is a PhD candidate in the Data Science group at the School of Electrical Engineering and Computer Science, The University of Queensland (UQ), Australia. He received his Bachelor’s degree in Electronic Engineering from Tsinghua University. His research focuses on multimodal learning, embodied agents, and vision-and-language navigation, under the supervision of Associate Professor Sen Wang and Associate Professor Jiajun Liu.
About Data Science Seminar
This seminar series is hosted by EECS Data Science.
Venue
Zoom: https://uqz.zoom.us/j/3575613004?omn=81339171254