Data acquisition
Discovering and selecting training data from data lakes
In many supervised ML projects, the main bottleneck is the lack of sufficient labeled training data (the concern of data-centric ML), rather than the choice of ML models and how to optimize them (the concern of model-centric ML). This is especially true for ML practitioners.
The process of obtaining more labeled data is known as data acquisition, which falls into two classes: human-in-the-loop and automatic. Human-in-the-loop data acquisition includes weak supervision, where users define labeling rules (e.g., Snorkel, data programming), as well as crowdsourcing and expert sourcing. Automatic data acquisition uses automatic methods to obtain more training data.
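To make the weak supervision idea concrete, here is a minimal sketch in plain Python of user-defined rules that label unlabeled text. It does not use Snorkel's API. The rule names, the spam/ham task, and the majority-vote combiner are all hypothetical choices for illustration; data programming systems typically learn a label model over the rules' outputs rather than taking a simple vote.

```python
from collections import Counter

# Hypothetical label values for an illustrative spam-detection task.
ABSTAIN = -1
HAM = 0
SPAM = 1


# Each rule ("labeling function") votes for a label or abstains.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text):
    return HAM if len(text.split()) <= 3 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def weak_label(text):
    """Combine rule votes by majority; abstain on no votes or a tie.

    Majority vote is a simple stand-in here, not the learned label model
    that data programming systems use.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting rules with equal support
    return ranked[0][0]


if __name__ == "__main__":
    for message in [
        "Claim your prize at http://example.com today",
        "hi there",
        "Let us meet for lunch tomorrow at noon",
    ]:
        print(weak_label(message), message)
```

Examples on which every rule abstains, or on which the rules conflict evenly, stay unlabeled in this sketch. Those are the cases where crowdsourcing, expert sourcing, or automatic acquisition would have to supply the labels.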
References
2022
- Selective Data Acquisition in the Wild for Model Charging. Proc. VLDB Endow., 2022
2021
- Automatic Data Acquisition for Deep Learning. Proc. VLDB Endow., 2021