Learning Open-World Embodied Agents: From Single-Task Generalization to Multi-Skill Composition
The School of EECS is hosting the following HDR Progress Review 1 Confirmation Seminar:
Speaker: Mr. Zhizhen Zhang
Host: Prof. Archie Chapman
Abstract: Vision-Language-Action (VLA) models hold promise for embodied intelligence, but current systems still face two key challenges: learning robust policies from limited robot data, and scaling from single-skill experts to unified multi-skill agents. This work addresses both problems from the perspectives of pretraining and model composition. First, we study how to better exploit human action videos for robotic vision-language pretraining. Existing methods often rely on rigid temporal assumptions that are ill-suited to noisy egocentric videos. We propose AcTOL, which improves representation learning by modeling the ordering and continuity of actions, leading to better generalization in downstream robotic policies. Second, we study how to merge multiple single-skill VLA models into a single policy. We identify the main causes of merging failure and propose MergeVLA, a merging-oriented architecture with sparse task adaptation, modular action experts, and test-time task routing. Together, these contributions aim to support more scalable, generalizable, and adaptable embodied agents.
Bio: Mr. Zhizhen Zhang is a PhD student in the School of Electrical Engineering and Computer Science at The University of Queensland (UQ), Australia. He received his Master’s degree from Tsinghua University, China. His research focuses on the intersection of robotics and computer vision. He is supervised by Dr. Yadan Luo and Professor Zi (Helen) Huang.
About Data Science Seminar
This seminar series is hosted by EECS Data Science.
Venue
In-person: 49-561
Zoom: https://uqz.zoom.us/j/7604082508