Coreset selection

Selecting a subset of training data

Data-effective machine learning (ML) (a.k.a. data-centric AI) aims to obtain high-quality training data to unlock the value of AI, because dirty data can severely degrade the performance of ML models.

Data-efficient ML focuses on making the training process more efficient. A commonly used strategy is to select a core subset of the training data (a coreset) to represent the entire dataset, so that ML models trained on the coreset achieve performance similar to models trained on the entire dataset.
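To make the idea of a coreset concrete, the sketch below uses greedy k-center (farthest-point) selection, a simple generic heuristic that picks points so that every training example lies close to some selected one. It is an illustration under stated assumptions only, not the method of the papers listed under References; the function name `k_center_greedy` and the toy data are hypothetical.

```python
import math
import random


def k_center_greedy(points, k, seed=0):
    """Return indices of k points chosen by the greedy k-center heuristic.

    Start from a random point, then repeatedly add the point that is
    farthest from everything selected so far.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = random.Random(seed)
    selected = [rng.randrange(len(points))]
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = [math.dist(p, points[selected[0]]) for p in points]
    while len(selected) < k:
        # The farthest point is the one the current selection covers worst.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Toy data (assumed values): two well-separated clusters in 2-D.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
coreset_idx = k_center_greedy(data, k=2)
coreset = [data[i] for i in coreset_idx]
```

With k=2 on this toy data, the selection contains one point from each cluster, which is the sense in which a small subset can stand in for the whole dataset. A model would then be trained on `coreset` instead of `data`; whether accuracy is preserved depends on the data and the model.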

Naturally, users want both data-effective ML (to train better ML models) and data-efficient ML (to reduce training cost).

References

2023

  1. GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data
    Chengliang Chai, Jiabin Liu, Nan Tang, Ju Fan, and 4 more authors
    Proc. ACM Manag. Data, 2023
  2. Efficient Coreset Selection with Cluster-based Methods
    Chengliang Chai, Jiayi Wang, Nan Tang, Ye Yuan, and 3 more authors
    In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023

2022

  1. Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning
    Jiayi Wang, Chengliang Chai, Nan Tang, Jiabin Liu, and 1 more author
    Proc. VLDB Endow., 2022