The School of EECS is hosting the following PhD Progress Review 3 Thesis Seminar:

Information-Preserving Efficient Vision Transformers

Speaker: Xuwei Xu
Host: Prof Xue Li

Abstract: In recent years, Vision Transformers (ViTs) have achieved remarkable success across various computer vision tasks. The strong feature representation capability of ViTs results from their two core components: the Multi-Head Self-Attention (MHSA) module and the Feed-Forward Network (FFN). However, these components are computationally expensive and memory-intensive. The high computation demand hinders the deployment of ViTs in many real-world scenarios where low latency is required yet computing resources are constrained, such as edge devices. As a result, balancing model efficiency and performance has become a critical research direction for ViTs. Some studies have observed redundancy in ViTs and attempted to reduce it to accelerate the model at inference time, typically by pruning image tokens spatially or reducing features channel-wise. Despite accelerating ViTs, these methods often suffer from increasingly severe performance degradation as the degree of pruning increases.

In this thesis, we focus on expediting ViTs while maintaining high performance. Specifically, we attribute the performance drop in existing methods to the information loss from token pruning or channel reduction, and subsequently put forward solutions that retain information while reducing the model size. The contributions of this thesis are: (i) to mitigate the unexpected information loss, a simple but effective token idling strategy is proposed to keep some tokens aside rather than recklessly pruning them from the network, thereby reducing the number of tokens participating in the computation while preserving all the information; (ii) to facilitate information compression and retention, a graph-based token propagation approach is designed to distribute the information of removed tokens to their neighbouring tokens that remain in the ViT network, eventually constructing a condensed network; (iii) to enrich the information, a channel idle module is embedded into lightweight ViTs, increasing the feature channels without a corresponding rise in computational cost; (iv) to further advance model compression for ViTs, a generic structural reparameterization technique is proposed for the FFN, which reparameterizes the linear projection weights and shortcuts to significantly boost speed and shrink memory consumption with no performance drop during inference. To conclude, we discuss open questions and future research directions for efficient ViT methods, aiming to promote the broad application of efficient ViTs in real-world scenarios.
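As background for contribution (iv), the short sketch below illustrates the general idea of structural reparameterization when a residual shortcut surrounds a linear projection. It is an illustrative example only, assuming a standard PyTorch linear layer, and is not the speaker's method or code.

# Illustrative sketch only: folding a residual shortcut into a linear projection.
# At inference, y = W x + x can be served by a single layer with weight (W + I),
# removing the shortcut without changing the output.
import torch
import torch.nn as nn

dim = 8
linear = nn.Linear(dim, dim, bias=True)   # training-time projection, with a shortcut around it

folded = nn.Linear(dim, dim, bias=True)   # inference-time, reparameterized projection
with torch.no_grad():
    folded.weight.copy_(linear.weight + torch.eye(dim))  # absorb the identity shortcut into the weights
    folded.bias.copy_(linear.bias)

x = torch.randn(2, dim)
assert torch.allclose(linear(x) + x, folded(x), atol=1e-6)  # outputs match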

Speaker bio: Xuwei Xu is a PhD candidate at the School of Electrical Engineering and Computer Science (EECS), The University of Queensland, under the supervision of Associate Professor Sen Wang and Dr Jiajun Liu. He received dual bachelor's degrees from the Australian National University and Shandong University. His research interests include Vision Transformers and efficient network design.


About Data Science Seminar

This seminar series is hosted by EECS Data Science.

Venue

Room 78-217 or via Zoom: https://uqz.zoom.us/j/3378112704