Data acquisition

Discovering and selecting training data from data lakes

In many supervised ML projects, the main bottleneck, especially for ML practitioners, is the lack of sufficient labeled training data (the concern of data-centric ML), rather than the choice of ML models and how to optimize them (the concern of model-centric ML).

The process of obtaining more labeled data is known as data acquisition. It falls into two classes: human-in-the-loop and automatic. Human-in-the-loop data acquisition includes weak supervision, where users define labeling rules (e.g., data programming, as in Snorkel), as well as crowdsourcing and expert sourcing. Automatic data acquisition uses automatic methods to obtain more training data.
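
To make the rule-based flavor of weak supervision concrete, here is a minimal sketch in plain Python. It is an illustration only: the toy sentiment task, the three rules, and the names `weak_label` and `LABELING_FUNCTIONS` are hypothetical, and the simple majority vote stands in for the learned label models that systems such as Snorkel use. It does not call Snorkel's API.

```python
from collections import Counter

# Label values for a toy binary task (assumed for illustration).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


# Hypothetical user-defined rules: each one votes for a label or abstains.
def lf_contains_great(text):
    return POSITIVE if "great" in text.lower() else ABSTAIN


def lf_contains_terrible(text):
    return NEGATIVE if "terrible" in text.lower() else ABSTAIN


def lf_exclamation(text):
    return POSITIVE if text.endswith("!") else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_great, lf_contains_terrible, lf_exclamation]


def weak_label(text, lfs=LABELING_FUNCTIONS):
    """Combine rule votes by majority; abstain on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting rules with equal support
    return ranked[0][0]


# Unlabeled examples receive noisy labels without manual annotation.
unlabeled = ["This is great!", "terrible service", "it was fine"]
weak_labels = [weak_label(t) for t in unlabeled]
```

Examples on which every rule abstains, or on which the rules tie, stay unlabeled; in practice they would be dropped or routed to crowd or expert annotators.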

References

2022

  1. Selective Data Acquisition in the Wild for Model Charging
    Chengliang Chai, Jiabin Liu, Nan Tang, Guoliang Li, and 1 more author
    Proc. VLDB Endow., 2022

2021

  1. Automatic Data Acquisition for Deep Learning
    Jiabin Liu, Fu Zhu, Chengliang Chai, Yuyu Luo, and 1 more author
    Proc. VLDB Endow., 2021