Data Prep

Theories, algorithms, and systems

Modern organizations face a massive number of heterogeneous data sets. It’s not uncommon for a large enterprise to report having 10,000 or more structured databases, not to mention millions of spreadsheets, text documents, and emails. Typically these databases are not organized according to a common schema or representation. As a result, data scientists in large organizations spend 90% or more of their time just trying to find the data they need and transform it into a common representation that allows them to perform the desired analysis.

Data prep includes a number of key problems, including:

    • Data discovery. Given some input request, this component crawls an organization’s data and returns those objects relevant to the request, employing a new graph-based approach for discovery and efficient data set indexing techniques.

    • Data stitching. Relevant data must be put together for user consumption. This requires investigating several issues in how graph-based and query-driven data stitching can be accomplished.

    • Data cleaning. We are investigating new data cleaning approaches along several directions, including composition (with an interactive dashboard) and record expansion for outlier detection.

    • Data transformations. Data often needs to be transformed into a uniform representation. We have developed a new program-synthesis-based transformation engine.

    • Entity consolidation. Our efforts in this area have focused on scaling entity resolution to very large data sets and using program synthesis to discover entity resolution rules.

    • Human-in-the-loop processing. We are working on new techniques to use human effort more effectively throughout the data integration and cleaning process, prioritizing attention on the parts of the pipeline where human time can be most effective.
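To make the idea of graph-based data discovery more concrete, here is a minimal Python sketch. It is not the approach developed in this project, only an illustration under stated assumptions: columns are treated as graph nodes, two columns from different tables are linked when the Jaccard similarity of their value sets reaches a threshold, and a request for a table returns every table reachable over those links. The names `jaccard`, `build_column_graph`, `related_tables`, the threshold of 0.5, and the toy tables are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two value sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_column_graph(tables, threshold=0.5):
    """Link columns of different tables whose value sets overlap enough.

    `tables` maps table name -> {column name -> list of values}.
    Returns an adjacency dict keyed by (table, column) pairs.
    The similarity measure and threshold are illustrative assumptions.
    """
    columns = {
        (table, col): set(values)
        for table, cols in tables.items()
        for col, values in cols.items()
    }
    graph = {node: set() for node in columns}
    for u, v in combinations(columns, 2):
        if u[0] != v[0] and jaccard(columns[u], columns[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def related_tables(graph, table):
    """Return the tables reachable from `table` via column-overlap edges."""
    seen = {table}
    frontier = [table]
    while frontier:
        current = frontier.pop()
        for (owner, _col), neighbours in graph.items():
            if owner != current:
                continue
            for other, _other_col in neighbours:
                if other not in seen:
                    seen.add(other)
                    frontier.append(other)
    seen.discard(table)
    return sorted(seen)


# Toy data, invented for this example only.
tables = {
    "customers": {"id": [1, 2, 3, 4], "city": ["Doha", "Boston"]},
    "orders": {"customer_id": [1, 2, 3], "total": [10.0, 25.5, 7.0]},
    "shipments": {"order_total": [10.0, 25.5], "carrier": ["A", "B"]},
}
graph = build_column_graph(tables)
print(related_tables(graph, "customers"))
```

In this toy example, `customers` links to `orders` through overlapping identifiers, and `orders` links to `shipments` through overlapping totals, so both are returned. Comparing every pair of columns is quadratic in the number of columns, so a system working at enterprise scale would need indexing to avoid it, which is presumably one motivation for the efficient indexing techniques mentioned above.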

References

2023

  1. HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation
    Sibei Chen, Nan Tang, Ju Fan, Xuemi Yan, and 3 more authors
    Proc. ACM Manag. Data, 2023

2022

  1. Self-supervised and Interpretable Data Cleaning with Sequence Generative Adversarial Networks
    Jinfeng Peng, Derong Shen, Nan Tang, Tieying Liu, and 4 more authors
    Proc. VLDB Endow., 2022

2021

  1. Mis-categorized entities detection
    Shuang Hao, Nan Tang, Guoliang Li, Jianhua Feng, and 1 more author
    VLDB J., 2021

2020

  1. Pattern Functional Dependencies for Data Cleaning
    Abdulhakim Ali Qahtan, Nan Tang, Mourad Ouzzani, Yang Cao, and 1 more author
    Proc. VLDB Endow., 2020
  2. Data Curation with Deep Learning
    Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, and AnHai Doan
    In Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020
  3. CoClean: Collaborative Data Cleaning
    Mashaal Musleh, Mourad Ouzzani, Nan Tang, and AnHai Doan
    In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020

2018

  1. FAHES: A Robust Disguised Missing Values Detector
    Abdulhakim Ali Qahtan, Ahmed K. Elmagarmid, Raul Castro Fernandez, Mourad Ouzzani, and 1 more author
    In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018