Purdue University Graduate School

A Systems Approach to Rule-Based Data Cleaning

thesis
posted on 2019-05-10, 16:10 authored by Amr H Ebaid
High-quality data is a vital asset for many businesses and applications. With flawed data costing billions of dollars every year, the need for data cleaning is unprecedented. Many data-cleaning approaches have been proposed in both academia and industry. However, there are no end-to-end frameworks for detecting and repairing errors with respect to a set of heterogeneous data-quality rules.

An end-to-end data-cleaning system must meet several important challenges: (1) It should handle heterogeneous types of data-quality rules and interleave their corresponding repairs. (2) It should be extensible with various data-repair algorithms to meet users' needs for effectiveness and efficiency. (3) It must support continuous data cleaning and adapt to inevitable data changes. (4) It must provide user-friendly, interpretable explanations for the detected errors and the chosen repairs.

This dissertation presents a systems approach to rule-based data cleaning that is generalized, extensible, continuous, and explanatory. To address the above challenges, the proposed system separates a programming interface from a core. The programming interface allows the user to specify various types of data-quality rules that uniformly define and explain what is wrong with the data and how to fix it. Treating all rules as black boxes, the core encapsulates various algorithms to holistically and continuously detect errors and repair data. The proposed system offers a simple interface to define data-quality rules, summarizes the data, highlights violations and fixes, and provides relevant auditing information to explain the errors and the repairs.
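
The split between a rule interface and a rule-agnostic core can be pictured with a minimal Python sketch. It is not the dissertation's actual API: the names Rule, Violation, Fix, NotEmptyRule, and clean are all hypothetical, and the sketch covers only single-row rules, whereas the abstract describes holistic repair across heterogeneous rules. Each rule says what is wrong (detect) and how to fix it (repair). The core calls those two methods without knowing what the rule checks, and records every violation with its fix as audit information.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Dict[str, str]


@dataclass
class Violation:
    """What is wrong: which rule fired, on which row, and why (hypothetical type)."""
    rule_name: str
    row_index: int
    explanation: str


@dataclass
class Fix:
    """How to fix it: a single cell update (hypothetical type)."""
    row_index: int
    column: str
    new_value: str


class Rule(ABC):
    """Hypothetical programming interface: the core sees only these two methods."""
    name: str

    @abstractmethod
    def detect(self, index: int, row: Row) -> List[Violation]:
        ...

    @abstractmethod
    def repair(self, violation: Violation, row: Row) -> List[Fix]:
        ...


class NotEmptyRule(Rule):
    """Illustrative rule: a column must not be empty; repair fills in a default."""

    def __init__(self, column: str, default: str) -> None:
        self.column = column
        self.default = default
        self.name = f"not_empty({column})"

    def detect(self, index: int, row: Row) -> List[Violation]:
        if row.get(self.column, "") == "":
            return [Violation(self.name, index, f"'{self.column}' is empty")]
        return []

    def repair(self, violation: Violation, row: Row) -> List[Fix]:
        return [Fix(violation.row_index, self.column, self.default)]


def clean(rows: List[Row], rules: List[Rule],
          max_passes: int = 10) -> List[Tuple[Violation, Fix]]:
    """Simplified core: run every rule as a black box, apply its fixes in place,
    and repeat until a pass makes no change (or max_passes is reached).
    Returns an audit trail of (violation, fix) pairs."""
    audit: List[Tuple[Violation, Fix]] = []
    for _ in range(max_passes):
        changed = False
        for rule in rules:
            for i, row in enumerate(rows):
                for violation in rule.detect(i, row):
                    for fix in rule.repair(violation, row):
                        rows[fix.row_index][fix.column] = fix.new_value
                        audit.append((violation, fix))
                        changed = True
        if not changed:
            break
    return audit
```

Repeating passes until nothing changes is one simple way to let a repair made by one rule be re-checked by the others. The interleaving strategy the dissertation actually uses is not stated in the abstract.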

Degree Type

  • Doctor of Philosophy

Department

  • Computer Science

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Walid G. Aref

Advisor/Supervisor/Committee co-chair

Ahmed K. Elmagarmid

Additional Committee Member 2

Mourad Ouzzani

Additional Committee Member 3

Sunil Prabhakar

Additional Committee Member 4

Christopher W. Clifton

Additional Committee Member 5

Jennifer Neville
