Data Civilizer

A Tool to Find, Ingest, Clean, and Integrate Diverse Data Sets

Modern organizations face a massive number of heterogeneous data sets. It’s not uncommon for a large enterprise to report having 10,000 or more structured databases, not to mention millions of spreadsheets, text documents, and emails. Typically these databases are not organized according to a common schema or representation. As a result, data scientists in large organizations spend 90% or more of their time just trying to find the data they need and transform it into a common representation that allows them to perform the desired analysis.

Data Civilizer includes a number of key components designed to simplify this process, including:

    • Data discovery. Given some input request, this component crawls an organization’s data and returns those objects relevant to the request, employing a new graph-based approach for discovery and efficient data set indexing techniques.

    • Data stitching. This component puts relevant data together for user consumption. Doing so requires investigating several issues on how graph-based and query-driven data stitching can be accomplished.

    • Data cleaning. We are investigating new data cleaning approaches along several directions: composition, including an interactive dashboard, and record expansion for outlier detection.

    • Data transformations. Data often needs to be transformed in order to use a uniform representation. We have developed a new program-synthesis-based transformation engine.

    • Entity consolidation. Our efforts in this area have focused on scaling entity resolution to very large data sets and using program synthesis to discover entity resolution rules.

    • Human-in-the-loop processing. We are working on new techniques to use human effort more effectively throughout the data integration and cleaning process, prioritizing attention on the parts of the pipeline where human time can be most effective.
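To make the graph-based discovery idea in the first component more concrete, here is a minimal sketch in Python. It is not Data Civilizer's actual algorithm or API: it assumes columns can be linked when their value sets overlap (Jaccard similarity above a chosen threshold) and that relevant tables are those reachable through such links. The function names, the threshold, and the sample tables are all hypothetical.

```python
from collections import deque
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two value collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_linkage_graph(columns, threshold=0.5):
    """Link columns from different tables whose values overlap strongly.

    `columns` maps (table, column) -> iterable of values.
    The threshold is an illustrative assumption, not a value from the project.
    """
    graph = {name: set() for name in columns}
    for left, right in combinations(columns, 2):
        if left[0] == right[0]:
            continue  # only link columns across different tables
        if jaccard(columns[left], columns[right]) >= threshold:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def related_tables(graph, table):
    """Return tables reachable from `table` by following column links (BFS)."""
    # Collapse the column-level graph into table-level edges.
    table_edges = {}
    for (src_table, _), neighbours in graph.items():
        table_edges.setdefault(src_table, set()).update(t for t, _ in neighbours)
    seen = {table}
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for nxt in table_edges.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen - {table})


# Hypothetical sample data: four small tables, three of them joinable.
columns = {
    ("customers", "id"): [1, 2, 3, 4],
    ("orders", "customer_id"): [1, 2, 3],
    ("orders", "sku"): ["a", "b", "c"],
    ("products", "sku"): ["a", "b", "c", "d"],
    ("weather", "city"): ["Oslo", "Lima"],
}
graph = build_linkage_graph(columns)
print(related_tables(graph, "customers"))  # ['orders', 'products']
print(related_tables(graph, "weather"))    # []
```

The pairwise comparison here is quadratic in the number of columns, which is why a system operating over thousands of databases would need the kind of efficient indexing the component description mentions; the sketch leaves that part out.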

References

2020

  1. Dagger: A Data (not code) Debugger
    El Kindi Rezig, Lei Cao, Giovanni Simonini, Maxime Schoemans, and 4 more authors
    In 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings, 2020

2019

  1. Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics
    El Kindi Rezig, Lei Cao, Michael Stonebraker, Giovanni Simonini, and 5 more authors
    Proc. VLDB Endow., 2019
  2. Unsupervised String Transformation Learning for Entity Consolidation
    Dong Deng, Wenbo Tao, Ziawasch Abedjan, Ahmed K. Elmagarmid, and 6 more authors
    In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019, 2019

2018

  1. Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery
    Raul Castro Fernandez, Essam Mansour, Abdulhakim Ali Qahtan, Ahmed K. Elmagarmid, and 5 more authors
    In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018, 2018
  2. Building Data Civilizer Pipelines with an Advanced Workflow Engine
    Essam Mansour, Dong Deng, Raul Castro Fernandez, Abdulhakim Ali Qahtan, and 8 more authors
    In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018, 2018

2017

  1. The Data Civilizer System
    Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, and 6 more authors
    In 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings, 2017