The School of ITEE is hosting the following PhD candidature milestone 3 seminar:

Training Robust Deep Reinforcement Learning Policies in the Real World

Speaker: Peter Bohm
Host: Dr Archie Chapman

Abstract: Artificial intelligence and machine learning have significantly advanced many fields, notably through deep reinforcement learning (DRL). This technique, which merges deep learning and reinforcement learning, has excelled in complex games such as Go, Dota 2, and StarCraft II. DRL has also demonstrated impressive results in dexterous manipulation and in bipedal and quadrupedal locomotion, enabling robots to walk and run in diverse environments. However, DRL research often relies on expensive robotic platforms and high-fidelity simulations, and the difference between simulated and real-world environments, known as the sim-to-real gap, poses a persistent challenge. This gap can be addressed with domain randomization, though excessive randomization may result in low sample efficiency and extended training periods. Another approach is to train directly on hardware, but this can lead to increased wear and tear, safety issues, and slower data collection. Resolving these challenges, and transitioning from expensive specialty hardware to affordable commodity robots that can be efficiently trained and deployed at scale, is critical for the advancement of DRL and its broader adoption in practical robotic systems. This thesis aims to bridge the sim-to-real gap through contributions in the following four areas:

i) A novel non-blocking, asynchronous DRL training architecture tailored for non-linear, real-time dynamic systems, adept at managing variable sensing, communication, and actuation delays. The method diverges from traditional DRL training by decoupling the RL loop from the collection of transition tuples, allowing actions and observations to be streamed asynchronously. It mitigates delay effects and improves sample efficiency by supplying delay-length measurements to the training loop and by frequently retraining the DRL network. This makes it possible to tune the action step time and discover an optimal control frequency for a given system, while efficiently handling streamed observations that arrive with random delays independent of the action timing (a minimal code sketch of such a loop follows the list).

ii) Gated feature extraction to enhance DRL training on real-world robots. An untrained gated recurrent unit (GRU) encodes a condensed representation of the state-observation sequence prior to DRL training. Because observations are encoded cumulatively, this reduces dimensionality and removes the fixed-length input requirement. The RL network is then trained on the current step's raw observations concatenated with the GRU encoding of previous steps (see the GRU sketch after the list).

iii) Quantization of DRL models and on-board inference on MCU-class devices. This work demonstrates that using an untrained GRU minimizes the performance loss caused by post-training quantization; some quantized models even achieve higher test rewards than their full-precision counterparts. The quantized models also show greater resilience to changes in the observation space, such as the removal of certain observations (see the quantization sketch after the list).

iv) The design of low-cost robotic platforms engineered specifically for real-world experimentation. Whereas traditional robots rely on high-frequency control and precise measurements, with robustness only within a limited action space, robots designed for DRL experiments must be resilient over a broader range of potential actions to enable effective exploration during training.
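
To make contribution i) concrete, the following is a minimal sketch of a non-blocking control loop, with a simulated sensor and illustrative names; it is an assumed reconstruction of the idea, not the author's implementation. Observations stream in on a separate thread independently of action timing, the control loop never blocks on sensing, and the measured observation delay is passed to the policy alongside the observation.

# Minimal sketch of a non-blocking, asynchronous control loop (illustrative).
import queue
import random
import threading
import time

obs_queue = queue.Queue()  # observations stream in asynchronously

def sensor_stream(stop):
    # Simulated sensor: observations arrive with random, variable delays,
    # entirely independent of the action timing.
    while not stop.is_set():
        time.sleep(random.uniform(0.01, 0.05))
        obs_queue.put((time.monotonic(), [random.random()] * 4))

def policy(obs, delay):
    # Placeholder: a real system would run the trained DRL network here,
    # conditioned on both the observation and the measured delay.
    return [0.0]

def control_loop(action_step, steps):
    stop = threading.Event()
    threading.Thread(target=sensor_stream, args=(stop,), daemon=True).start()
    latest = (time.monotonic(), [0.0] * 4)
    for _ in range(steps):
        while True:  # drain without blocking; keep only the freshest observation
            try:
                latest = obs_queue.get_nowait()
            except queue.Empty:
                break
        sent_at, obs = latest
        delay = time.monotonic() - sent_at  # delay-length measurement
        action = policy(obs, delay)
        # The transition tuple (obs, delay, action, ...) would be queued here
        # for a separate, decoupled training process.
        time.sleep(action_step)  # action step time is tunable per system
    stop.set()

control_loop(action_step=0.02, steps=100)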
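
For contribution ii), here is a minimal PyTorch sketch, under assumed dimensions, of how a frozen, untrained GRU can cumulatively encode a variable-length observation history into a fixed-size vector that is concatenated with the current raw observation:

import torch
import torch.nn as nn

OBS_DIM, HID_DIM = 8, 32  # illustrative sizes, not the thesis configuration

# "Untrained" GRU: randomly initialised and frozen, never updated by RL.
gru = nn.GRU(input_size=OBS_DIM, hidden_size=HID_DIM, batch_first=True)
for p in gru.parameters():
    p.requires_grad = False

def encode_state(history, current):
    # history: (1, T, OBS_DIM) for any length T; current: (1, OBS_DIM).
    with torch.no_grad():
        _, h_n = gru(history)  # final hidden state summarises all past steps
    return torch.cat([current, h_n[-1]], dim=-1)  # (1, OBS_DIM + HID_DIM)

# Histories of different lengths map to the same fixed-size policy input,
# removing the fixed-length input requirement.
short = encode_state(torch.randn(1, 3, OBS_DIM), torch.randn(1, OBS_DIM))
long_ = encode_state(torch.randn(1, 50, OBS_DIM), torch.randn(1, OBS_DIM))
assert short.shape == long_.shape == (1, OBS_DIM + HID_DIM)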
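
For contribution iii), a generic full-integer post-training quantization pass in TensorFlow Lite (one common route to MCU-class deployment via TFLite Micro) might look like the sketch below; the network architecture and calibration data are purely illustrative, not the thesis models:

import numpy as np
import tensorflow as tf

# Illustrative stand-in for a small trained policy network.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(40,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4),
])

def representative_data():
    # Calibration samples; in practice, recorded real observations would be used.
    for _ in range(100):
        yield [np.random.rand(1, 40).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# The resulting int8 flatbuffer can be compiled into MCU firmware.
with open("policy_int8.tflite", "wb") as f:
    f.write(converter.convert())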

About the Data Science Seminar

This seminar series is hosted by EECS Data Science.

Venue

Zoom: https://uqz.zoom.us/j/81769658613
Room: 46-402