A Systems Approach to Rule-Based Data Cleaning

2019-05-10T16:10:06Z (GMT) by Amr H Ebaid
<div>High-quality data is a vital asset for many businesses and applications. With flawed data costing billions of dollars every year, the need for data cleaning is unprecedented. Many data-cleaning approaches have been proposed in both academia and industry. However, there are no end-to-end frameworks for detecting and repairing errors with respect to a set of <i>heterogeneous</i> data-quality rules.</div><div><br></div><div>Several important challenges arise when envisioning an end-to-end data-cleaning system: (1) It should handle heterogeneous types of data-quality rules and interleave their corresponding repairs. (2) It should be extensible with various data-repair algorithms to meet users' needs for effectiveness and efficiency. (3) It must support continuous data cleaning and adapt to inevitable data changes. (4) It has to provide user-friendly, interpretable explanations for the detected errors and the chosen repairs.</div><div><br></div><div>This dissertation presents a systems approach to rule-based data cleaning that is <b>generalized</b>, <b>extensible</b>, <b>continuous</b>, and <b>explaining</b>. The proposed system distinguishes between a <i>programming interface</i> and a <i>core</i> to address the above challenges. The programming interface allows the user to specify various types of data-quality rules that uniformly define and explain what is wrong with the data and how to fix it. Treating all rules as black boxes, the core encapsulates various algorithms to holistically and continuously detect errors and repair data. The proposed system offers a simple interface to define data-quality rules, summarizes the data, highlights violations and fixes, and provides relevant auditing information to explain the errors and the repairs.</div>
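<div>The split between a programming interface and a core described above can be illustrated with a minimal sketch. It is not the dissertation's actual API: the names <code>Rule</code>, <code>Violation</code>, <code>Fix</code>, and <code>clean</code>, the two example rules, and the majority-vote repair are all assumptions made for illustration. Each rule only states what is wrong (<code>detect</code>) and how to fix it (<code>repair</code>); the core calls these as black boxes, interleaves the repairs until no rule changes the data, and records an audit trail.</div>

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Cell = Tuple[int, str]  # (row index, column name)


@dataclass
class Violation:
    rule: str
    cells: List[Cell]
    reason: str  # human-readable explanation of what is wrong


@dataclass
class Fix:
    cell: Cell
    value: str


@dataclass
class Rule:
    # Hypothetical programming interface: a rule is a name plus two callbacks.
    name: str
    detect: Callable[[List[Row]], List[Violation]]
    repair: Callable[[List[Row], Violation], List[Fix]]


def clean(table: List[Row], rules: List[Rule], max_rounds: int = 10) -> List[str]:
    """Hypothetical core: applies every rule as a black box until a fixpoint."""
    audit: List[str] = []
    for _ in range(max_rounds):
        changed = False
        for rule in rules:
            for v in rule.detect(table):
                for fix in rule.repair(table, v):
                    i, col = fix.cell
                    if table[i][col] != fix.value:
                        audit.append(
                            f"{v.rule}: row {i} {col} "
                            f"{table[i][col]!r} -> {fix.value!r} ({v.reason})"
                        )
                        table[i][col] = fix.value
                        changed = True
        if not changed:
            break
    return audit


# Example rule 1 (assumed): city values must not carry surrounding whitespace.
def detect_untrimmed(table: List[Row]) -> List[Violation]:
    return [
        Violation("trim_city", [(i, "city")], "surrounding whitespace")
        for i, row in enumerate(table)
        if row["city"] != row["city"].strip()
    ]


def repair_untrimmed(table: List[Row], v: Violation) -> List[Fix]:
    return [Fix((i, col), table[i][col].strip()) for i, col in v.cells]


# Example rule 2 (assumed): zip determines city; repair by majority vote.
def detect_zip_city(table: List[Row]) -> List[Violation]:
    groups = defaultdict(list)
    for i, row in enumerate(table):
        groups[row["zip"]].append(i)
    return [
        Violation(
            "zip_determines_city",
            [(i, "city") for i in idx],
            f"zip {z} maps to several cities",
        )
        for z, idx in groups.items()
        if len({table[i]["city"] for i in idx}) > 1
    ]


def repair_zip_city(table: List[Row], v: Violation) -> List[Fix]:
    majority = Counter(table[i][col] for i, col in v.cells).most_common(1)[0][0]
    return [Fix(cell, majority) for cell in v.cells]


RULES = [
    Rule("trim_city", detect_untrimmed, repair_untrimmed),
    Rule("zip_determines_city", detect_zip_city, repair_zip_city),
]

rows = [
    {"zip": "47906", "city": "West Lafayette"},
    {"zip": "47906", "city": "West Lafayette "},
    {"zip": "47906", "city": "Lafayette"},
]
audit = clean(rows, RULES)
```

<div>In this sketch the whitespace repair runs first, which turns "West Lafayette" into the majority value that the dependency rule then propagates; this is a small instance of heterogeneous rules whose repairs interact. The <code>audit</code> list stands in for the auditing information the abstract mentions, and a new rule type would only need to supply its own <code>detect</code> and <code>repair</code> callbacks.</div>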