Automating Expertise in Format Identification: From Experts, Through Novices, to Machines

The School of EECS is hosting the following PhD Progress Review 3 seminar:

Automating Expertise in Format Identification: From Experts, Through Novices, to Machines

Speaker: Shaochen Yu
Host: Dr Joel Mackenzie

Abstract: Data quality is pivotal to the efficacy of any data analysis model. Improving data quality through data cleaning is a demanding and time-intensive process, and data scientists devote substantial effort to addressing a wide range of data quality issues. One such issue, format inconsistency, often arises from data integration. Addressing these inconsistencies is essential, since a unified format is a prerequisite for applying other standard data cleaning techniques across the entire dataset. However, the common solution, which involves writing regular expressions tailored to each data format, demands domain expertise and is predominantly carried out by experts. This thesis, ``Automating Expertise in Format Identification: From Experts, Through Novices, to Machines'', examines expert and novice behavior patterns in data quality tasks. It draws on crowdsourcing while aiming to automate the identification of data format inconsistencies in large structured datasets.
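As a rough illustration of the manual approach described above, the following minimal Python sketch shows one hand-written regular expression per format for a date column merged from several sources. The format names, patterns, and sample values are hypothetical and are not taken from the thesis.

```python
import re

# Hypothetical patterns: one hand-written regular expression per known format.
FORMATS = {
    "iso": re.compile(r"\d{4}-\d{2}-\d{2}"),             # e.g. 2023-07-14
    "slash": re.compile(r"\d{2}/\d{2}/\d{4}"),           # e.g. 14/07/2023
    "text": re.compile(r"\d{1,2} [A-Z][a-z]{2} \d{4}"),  # e.g. 14 Jul 2023
}

def identify_format(value: str) -> str:
    """Return the name of the first pattern matching the whole value."""
    for name, pattern in FORMATS.items():
        if pattern.fullmatch(value):
            return name
    return "unknown"

# Hypothetical column produced by integrating several sources.
column = ["2023-07-14", "14/07/2023", "14 Jul 2023", "July 14th"]
labels = [identify_format(value) for value in column]
```

Every new format requires another hand-written pattern, which is the expertise bottleneck the thesis aims to reduce.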

We begin by presenting a human-oriented data quality system, ``DataOps-4G'', which facilitates the exploration of five data quality issues and labels problematic data records in large structured datasets. Designed to support non-experts in data quality discovery, ``DataOps-4G'' eliminates the need for coding. This is achieved through its ``DataOps'', a feature comprising buttons that bundle the code for different functions, as well as its integrated regular expression generators. Our experiment, involving both experts and non-experts, demonstrated the effectiveness and efficiency of ``DataOps-4G'' compared to baseline approaches. An in-depth analysis revealed distinct behavior patterns and strategies between the two groups, which may inform the design of future data-driven systems.

Transitioning to a human-machine hybrid model, we introduce ``Data-Scanner-4C'', a crowdsourcing system targeting format inconsistencies in large-volume structured datasets. It employs both crowd workers and rule-based learning. Unlike ``DataOps-4G'', it focuses solely on format inconsistency. Crowd workers identify examples for the machine, which subsequently generates regular expressions. This model sidesteps the need for domain expertise, so data experts are no longer required. Relying on novice workers, ``Data-Scanner-4C'' outperforms both experts using ``DataOps-4G'' and existing automated regular expression generators.
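To give a sense of how examples can be turned into regular expressions, here is a minimal Python sketch that generalizes each example by its runs of character classes. This is a simple stand-in technique chosen for illustration, not the rule-based learning used in ``Data-Scanner-4C''; the function names and sample values are hypothetical, and ASCII input is assumed.

```python
import re
from itertools import groupby

def char_class(ch: str) -> str:
    """Map a character to a regex fragment (assumes ASCII input)."""
    if ch.isdigit():
        return r"\d"
    if ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)  # keep punctuation and spaces as literals

def generalize(example: str) -> str:
    """Collapse runs of the same character class into counted fragments."""
    parts = []
    for fragment, run in groupby(example, key=char_class):
        count = len(list(run))
        parts.append(fragment if count == 1 else f"{fragment}{{{count}}}")
    return "".join(parts)

def patterns_from_examples(examples):
    """Return the distinct patterns implied by the labeled examples."""
    return sorted({generalize(example) for example in examples})

# Hypothetical examples selected by crowd workers.
patterns = patterns_from_examples(["2023-07-14", "1999-01-02", "14/07/2023"])
```

Here the two ISO-style examples collapse into a single pattern, so three examples yield two candidate formats.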

Our work culminates with ``Format Tree'', a novel algorithm that offers an automated method to discern valid data formats and generate the corresponding regular expressions. While many counterparts claim automation, they often still rely on human-labeled data to function effectively. The distinctiveness of ``Format Tree'' lies in its genuine automation: unlike its peers, it can distinguish positive from negative data autonomously, without human intervention. Despite being an unsupervised learning algorithm, ``Format Tree'' outperforms baselines in both effectiveness and efficiency. It does not aim to marginalize humans but rather to repurpose their roles, enabling them to focus on result refinement and other high-level tasks.

In summary, our progression from ``DataOps-4G'' to ``Format Tree'' traces the increasing automation of format identification expertise. We began with tools that empowered both ends of the expertise spectrum and then combined crowdsourcing with machine assistance in ``Data-Scanner-4C''. Our development of ``Format Tree'' represents a notable advance towards higher automation in format identification. Rather than sidelining humans, we have redefined their roles, emphasizing a harmonious hybrid of human and machine intelligence. Our research in format analysis establishes a resilient foundation for future data analysis work. As the role of data in decision-making grows, our methodologies promise efficient and streamlined solutions for format identification. We anticipate that these contributions will not only enhance data cleaning but also catalyze innovative approaches in the broader realm of data analytics.

Bio: Mr Shaochen Yu is a current PhD student in computer science at the School of EECS, UQ. His research interests include data quality, format identification, and crowdsourcing.

About Data Science Seminar

This seminar series is hosted by EECS Data Science.

Venue

Room: 78-632 (General Purpose South)