The School of EECS is hosting the following HDR Progress Review 3 Seminar:

Adaptive Reinforcement Learning for Sequential Decision-making

Speaker: Yi Zhang

Host: A/Prof Sen Wang

Abstract: Reinforcement learning (RL), originating from control theory, has emerged as a powerful technology for sequential decision-making, excelling in scenarios where cumulative reward maximization is essential. Its versatility has led to specialized adaptations for diverse applications. For instance, multi-agent reinforcement learning (MARL) has been developed to enable parallel learning and decentralized deployment in multi-agent systems. Furthermore, offline reinforcement learning has been investigated for scenarios where online exploration is prohibitively expensive, such as recommender systems, in order to learn policies purely from logged interaction data. While existing RL methodologies have achieved promising results, critical aspects of their adaptation to realistic settings remain underexplored, including the efficiency of MARL deployment and the influence of world models in offline RL.

This thesis addresses key dilemmas in integrating RL into realistic scenarios by tackling four pivotal challenges: (1) enhancing the generalization of conventional MARL training paradigms, (2) mitigating the impact of world model inaccuracies on policy learning, (3) dynamically adapting regularization in policy learning objective functions, and (4) accelerating RL training while ensuring fairness. To address the first challenge, I propose a cost-efficient federated MARL framework that decouples centralized training from decentralized execution. The framework leverages asynchronous critics on each client to estimate utility and a server-side aggregator that performs weighted merging of local updates based on global objectives, thereby overcoming non-i.i.d. data challenges (i.e., heterogeneous agent observation distributions) and significantly reducing gradient exchange. For the second challenge, I introduce a simple yet effective reward shaping method that refines world models in model-based offline RL. This method uses nearest-neighbor inference and a clustering-driven uncertainty penalty, obviating the need for costly model ensembles. Building on this, to address the third challenge, I develop a dynamically adapted uncertainty regularization method that iteratively improves the robust estimation of decision risk, leading to continuous reward refinement and further mitigating inaccuracies within the world models of model-based offline RL. Finally, leveraging the generalization capacity of large language models (LLMs), I propose an adversarial inverse RL method that accelerates policy learning by using LLM policy trajectories as expert demonstrations; through careful design of reward weighting in the objective function, the method also learns effectively from sub-optimal expert demonstrations. In summary, this thesis contributes to the practical adaptation of RL algorithms by proposing cost-efficient, effective, and dynamic methodologies. I conclude by discussing remaining challenges and outlining future directions for adapting RL to realistic applications, emphasizing insights into existing RL designs and the dilemmas they present.
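To give a flavour of the kind of uncertainty penalty mentioned for challenge (2), the following is a minimal illustrative sketch, not the speaker's actual method: it assumes offline dataset states can be clustered with k-means, and a world-model transition's reward is penalised by the distance of its state to the nearest centroid as a cheap proxy for epistemic uncertainty, avoiding model ensembles. All names and the specific penalty form (fit_state_clusters, shaped_reward, beta) are hypothetical.

```python
# Illustrative sketch only: a clustering-based uncertainty penalty for
# reward shaping in model-based offline RL. Not the method presented in
# the seminar; names and penalty form are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

def fit_state_clusters(offline_states: np.ndarray, n_clusters: int = 32) -> KMeans:
    """Cluster states from the offline dataset once, up front."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(offline_states)

def shaped_reward(model_reward: float, synthetic_state: np.ndarray,
                  clusters: KMeans, beta: float = 1.0) -> float:
    """Penalise the world model's predicted reward by the distance of the
    synthetic state to its nearest cluster centroid, used here as a cheap
    proxy for epistemic uncertainty (no model ensemble required)."""
    distances = np.linalg.norm(clusters.cluster_centers_ - synthetic_state, axis=1)
    uncertainty = distances.min()
    return model_reward - beta * uncertainty

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dataset_states = rng.normal(size=(1000, 8))   # states logged in the offline dataset
    clusters = fit_state_clusters(dataset_states)
    fake_state = rng.normal(size=8)               # state imagined by the world model
    print(shaped_reward(model_reward=1.0, synthetic_state=fake_state, clusters=clusters))
```

States far from the data distribution receive a larger penalty, so the learned policy is discouraged from exploiting regions where the world model is likely to be wrong.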

Bio: Yi Zhang is a PhD candidate in the School of Electrical Engineering and Computer Science at the University of Queensland. He earned his Bachelor's degree in Data Science and Big Data Technology from Peking University. His research focuses on reinforcement learning and large language models for recommender systems, under the supervision of Associate Professor Sen Wang, Associate Professor Jiajun Liu, and Dr Ruihong Qiu.


About Data Science Seminar

This seminar series is hosted by EECS Data Science.

Venue

Zoom: https://uqz.zoom.us/j/81181269569
Room: 78 - 632