Investigating the role of metadata in evaluating the data quality of repurposed data
The School of EECS is hosting the following PhD Progress Review 3 Seminar:
Investigating the role of metadata in evaluating the data quality of repurposed data
Speaker: Hui Zhou
Host: Professor Shazia Sadiq
Abstract: Existing approaches for evaluating data quality were established for settings where user requirements regarding data use could be explicitly gathered. Currently, however, users often face new, unfamiliar, and repurposed datasets whose collection and creation they were not involved in. Furthermore, there is evidence that, despite various standardisation initiatives, supporting information or metadata for such datasets is provided in a variety of ways or is lacking altogether. Yet users need to evaluate the quality of such data to determine whether it is suitable for their intended purposes. The widespread adoption of repurposed datasets also increases the risks for users who must evaluate data quality without traditional context or reliable metadata. Consider a data analyst tasked with assessing customer churn risk using repurposed marketing datasets. Without clear metadata on how 'churn' is defined, or knowledge of potential data quality issues introduced by previous processing, the analyst may resort to assumptions, leading to misguided insights and wasted resources.
There is, however, limited understanding of the role of metadata in evaluating the quality of repurposed datasets. This PhD research therefore investigates that under-explored role, aiming to uncover current practices, challenges, and the most effective ways to leverage metadata in these scenarios. Through a multifaceted approach combining interviews, lab experiments with eye-tracking technology, and cued-retrospective think-aloud protocols, the study examines the interplay between data quality errors and metadata usage. The focus is on identifying specific patterns of metadata usage that correlate with successful evaluation of repurposed data, as well as highlighting common challenges users face.
The results of our study shed light on the critical role metadata plays in evaluating repurposed data, highlight the existence of relationships between data quality error type and metadata, and identify a number of metadata usage patterns relative to the task. This bears implications for the design of systems or tools related to data quality discovery and evaluation.
This thesis also delivers theoretical insights into the relationship between data quality and metadata within the context of repurposing, as well as practical guidelines for metadata creation and utilisation to maximise the value of previously collected datasets in new applications. By optimising metadata practices, this research contributes to more reliable data-driven decisions and greater efficiency in knowledge discovery processes.
Biography: Mr Hui Zhou is a PhD candidate from the School of EECS under the supervision of Prof. Shazia Sadiq, Prof. Marta Indulska and A/P Gianluca Demartini. He received his B.S. degree from UQ. His research interests are data quality management, metadata, and data repurposing.
About Data Science Seminar
This seminar series is hosted by EECS Data Science.