  • Building data curation processes with crowd intelligence
    Data curation processes comprise a number of activities, such as transforming, filtering or de-duplicating data. These processes consume an excessive amount of time in data science projects, because datasets are often external, re-purposed and generally not ready for analytics. Overall, data curation processes are difficult to automate and require human input, which results in a lack of repeatability and in errors potentially propagating into analytical results. Our research explores a crowd intelligence-based approach to building robust data curation processes. We study how data workers engage with data curation activities, specifically data quality detection, and how to build a robust and effective data curation process by learning from the wisdom of the crowd. With the help of a purpose-designed data curation platform based on the IPython Notebook, we conducted a lab experiment with data workers and collected a multi-modal dataset that includes measures of task performance and behavioural data. Our findings identify avenues by which effective data curation processes can be built through crowd intelligence.
     
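    As a minimal, hedged sketch of the three curation activities named above (transforming, filtering and de-duplicating), the Python snippet below uses pandas; the dataset, column names and quality rule are invented for illustration.

      # Illustrative only: "readings.csv" is a hypothetical repurposed dataset.
      import pandas as pd

      df = pd.read_csv("readings.csv")

      # Transform: normalise a free-text column to a canonical form.
      df["city"] = df["city"].str.strip().str.title()

      # Filter: drop rows that fail a simple plausibility rule.
      df = df[df["temperature"].between(-60, 60)]

      # De-duplicate: keep one copy of each logical record.
      df = df.drop_duplicates(subset=["sensor_id", "timestamp"], keep="first")
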
  • Building crowdsourced data curation processes (2019–2022)
    The capacity to effectively utilise the increasing number of datasets available to organisations for timely decision making is diminishing, due to the onerous data preparation and curation tasks that must be performed before the data can be consumed by analytics platforms. This project aims to tackle the growing problem of data curation, especially for repurposed datasets, by tapping into crowd intelligence. The project will be a first attempt at using a novel process-oriented approach in micro-task crowdsourcing, and will create new knowledge to harness the full potential of crowdsourced data curation. Significant benefits are expected in the form of enhanced organisational capacity to accelerate the time-to-value of data analytics projects.
     
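    One simple way to learn from the wisdom of the crowd over micro-tasks is to aggregate workers' answers, for example by majority vote. The Python sketch below is illustrative only: the micro-tasks and worker labels are invented, and real aggregation schemes typically also weight workers by estimated reliability.

      from collections import Counter

      # Hypothetical data-quality micro-tasks, each mapped to the labels
      # returned by individual crowd workers.
      answers = {
          "row-17:postcode": ["error", "error", "ok"],
          "row-42:dob": ["ok", "ok", "ok"],
      }

      def majority(labels):
          """Return the most frequent label and its share of the votes."""
          label, votes = Counter(labels).most_common(1)[0]
          return label, votes / len(labels)

      for task, labels in answers.items():
          label, support = majority(labels)
          print(f"{task}: {label} (support {support:.0%})")
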
  • Managing Data with High Redundancy and Low Value Density (2017–2020)
    Machine-generated data, often found in sensor networks, GPS and RFID applications, vehicle on-board devices and medical monitoring devices, is the next generation of data we need to manage and process. In addition to its large volume and streaming nature, such data typically has a high level of redundancy and low value density. There is a need to develop a new breed of database management systems that can support stream query processing as well as manage historical data, in support of complex data analytics, data mining and data-driven decision making. In this project we advocate a novel database approach to data storage, cleaning, compression, hierarchical summarisation, indexing and query processing for machine-generated data.
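
    As one concrete illustration of exploiting high redundancy, the hedged Python sketch below run-length encodes a slowly changing sensor stream; the readings are invented, and a production system would combine such encoding with richer compression and summarisation.

      def run_length_encode(stream):
          """Collapse consecutive identical readings into [value, count] runs."""
          runs = []
          for value in stream:
              if runs and runs[-1][0] == value:
                  runs[-1][1] += 1  # extend the current run
              else:
                  runs.append([value, 1])  # start a new run
          return runs

      readings = [21.5, 21.5, 21.5, 21.6, 21.6, 21.5]
      print(run_length_encode(readings))  # [[21.5, 3], [21.6, 2], [21.5, 1]]
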
  • Data Enhancement, Integration and Access Services for Smarter, Collaborative and Adaptive Whole-of-Water-Cycle Management (2008–2011)
    An adaptive whole-of-water-cycle approach to water resources management requires rapid, seamless integration of the many independently developed data sources and models. Current approaches involve manual data mapping and tuning by domain experts, which is extremely tedious, time-consuming, inflexible and not scalable, and is increasingly a bottleneck as the complexity, size and scope of the data and models grow. The aim of this project is to improve the speed, rigour and adaptability of the decisions being made within South East Queensland by the partners in the Healthy Waterways Partnership, by focussing on services that will improve the quality, completeness, relevance and interpretability of the data used in the models underpinning these decisions.
     
  • Approaching the limits in Data Quality Management (2007–2009)
    Regardless of the intelligence and sophistication dedicated to the functionality of new software solutions, data quality remains a major factor in the successful deployment of IT systems. The importance of managing data quality has increased manifold in today's global information-sharing environments. This project will delve into the fundamental questions behind data consistency, integrity and constraint satisfaction by providing deep insights into the computational boundaries of data quality management. Consequently, innovative solutions are expected, offering the full scope of automatic data cleansing and unification and targeting generic data quality management requirements that span the control of business as well as scientific data.
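
    As a hedged illustration of one core consistency check, the Python sketch below detects violations of a functional dependency (postcode determines city); the rows and the dependency itself are invented for illustration.

      from collections import defaultdict

      rows = [
          {"postcode": "4000", "city": "Brisbane"},
          {"postcode": "4000", "city": "Brisbane"},
          {"postcode": "4000", "city": "Sydney"},  # violates postcode -> city
      ]

      def fd_violations(rows, lhs, rhs):
          """Group rows by the LHS attribute; report groups whose RHS values disagree."""
          groups = defaultdict(set)
          for row in rows:
              groups[row[lhs]].add(row[rhs])
          return {key: vals for key, vals in groups.items() if len(vals) > 1}

      print(fd_violations(rows, "postcode", "city"))  # {'4000': {'Brisbane', 'Sydney'}}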