A Systems Approach to Rule-Based Data Cleaning
thesis posted on 10.05.2019 by Amr H Ebaid
High-quality data is a vital asset for many businesses and applications. With flawed data costing billions of dollars every year, the need for data cleaning is unprecedented. Many data-cleaning approaches have been proposed in both academia and industry. However, there are no end-to-end frameworks for detecting and repairing errors with respect to a set of heterogeneous data-quality rules.
Several important challenges exist when envisioning an end-to-end data-cleaning system: (1) It should deal with heterogeneous types of data-quality rules and interleave their corresponding repairs. (2) It should be extensible with various data-repair algorithms to meet users' needs for effectiveness and efficiency. (3) It should support continuous data cleaning and adapt to inevitable data changes. (4) It should provide user-friendly, interpretable explanations for the detected errors and the chosen repairs.
This dissertation presents a systems approach to rule-based data cleaning that is generalized, extensible, continuous, and explaining. The proposed system distinguishes between a programming interface and a core to address the above challenges. The programming interface allows the user to specify various types of data-quality rules that uniformly define and explain what is wrong with the data and how to fix it. Treating all rules as black boxes, the core encapsulates various algorithms to holistically and continuously detect errors and repair data. The proposed system offers a simple interface to define data-quality rules, summarizes the data, highlights violations and fixes, and provides relevant auditing information to explain the errors and the repairs.
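The split between a programming interface and a core can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: every name here (`Rule`, `Violation`, `Fix`, `clean`, `max_passes`) is hypothetical, and the sketch assumes single-row rules only, whereas the system described above also has to handle heterogeneous rule types and interleaved repairs. Each rule supplies a `detect` function (what is wrong, with an explanation) and a `repair` function (how to fix it); the core calls both without inspecting them and keeps an audit log.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]

@dataclass
class Violation:
    rule_name: str
    row_index: int
    explanation: str  # human-readable reason the row is flagged

@dataclass
class Fix:
    row_index: int
    column: str
    new_value: object

@dataclass
class Rule:
    # Hypothetical interface: the core only ever calls these two functions.
    name: str
    detect: Callable[[int, Row], Optional[Violation]]
    repair: Callable[[Violation, Row], Optional[Fix]]

def clean(rows: List[Row], rules: List[Rule],
          max_passes: int = 5) -> List[Tuple[Violation, Optional[Fix]]]:
    """Apply every rule as a black box until a pass changes nothing
    (or max_passes is reached). Returns an audit log of
    (violation, fix) pairs; rows are repaired in place."""
    audit: List[Tuple[Violation, Optional[Fix]]] = []
    for _ in range(max_passes):
        changed = False
        for rule in rules:
            for i, row in enumerate(rows):
                violation = rule.detect(i, row)
                if violation is None:
                    continue
                fix = rule.repair(violation, row)
                audit.append((violation, fix))
                if fix is not None:
                    row[fix.column] = fix.new_value
                    changed = True
        if not changed:
            break
    return audit

# Illustrative rule (an assumption, not taken from the thesis):
# rows with zip code 47907 must have state "IN".
def detect_zip_state(i: int, row: Row) -> Optional[Violation]:
    if row.get("zip") == "47907" and row.get("state") != "IN":
        return Violation("zip_state", i, "zip 47907 implies state IN")
    return None

def repair_zip_state(violation: Violation, row: Row) -> Optional[Fix]:
    return Fix(violation.row_index, "state", "IN")

zip_state_rule = Rule("zip_state", detect_zip_state, repair_zip_state)

rows = [
    {"zip": "47907", "state": "IL"},
    {"zip": "47907", "state": "IN"},
]
audit = clean(rows, [zip_state_rule])
```

After the call, `rows` holds the repaired data and `audit` records which rule flagged which row, why, and what fix was applied. Repeating passes until nothing changes is one simple way to let a repair made for one rule be rechecked against the others; a real system would need a more careful strategy to guarantee termination and repair quality.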