Coreset selection
Selecting a subset of training data
Data-effective machine learning (ML), also known as data-centric AI, aims to obtain high-quality training data in order to unlock the value of AI, since dirty data is well known to severely degrade the performance of ML models.
Data-efficient ML focuses on making the training process itself cheaper. A commonly used strategy is to select a core subset of the training data (a coreset) that represents the entire dataset, so that ML models trained on the coreset achieve performance similar to models trained on the full dataset.
Naturally, users want both: data-effective ML to train better models, and data-efficient ML to reduce training cost.
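To make the idea of a coreset concrete, here is a minimal sketch of one generic selection heuristic, k-center greedy (farthest-point) selection. It is an illustration only and is not the method of any paper listed below; the function name `k_center_greedy`, its parameters, and the toy data are assumptions made for this example. The heuristic repeatedly adds the point that is farthest from everything selected so far, so the chosen subset spreads out over the dataset.

```python
import math
import random


def k_center_greedy(points, k, seed=0):
    """Return indices of k points chosen by farthest-point selection.

    Hypothetical helper for illustration; `points` is a list of
    equal-length numeric tuples, `k` is the desired coreset size.
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(points)")

    rng = random.Random(seed)
    first = rng.randrange(n)  # arbitrary starting point
    selected = [first]
    chosen = {first}

    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = [math.dist(p, points[first]) for p in points]

    while len(selected) < k:
        # Pick the not-yet-selected point farthest from the current coreset.
        nxt = max((i for i in range(n) if i not in chosen),
                  key=min_dist.__getitem__)
        selected.append(nxt)
        chosen.add(nxt)
        # Update nearest-selected distances with the new center.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy data: three tight groups of two points each (made up for the example).
data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.0, 0.1)]
coreset_idx = k_center_greedy(data, k=3)
coreset = [data[i] for i in coreset_idx]
```

On this toy input the three selected points fall in three different groups, which is the behavior a representative subset should have. A model would then be trained on `coreset` instead of `data`; whether accuracy is preserved depends on the data and the model, which is what dedicated coreset methods aim to guarantee.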
References
2023
- GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data. Proc. ACM Manag. Data, 2023
- Efficient Coreset Selection with Cluster-based Methods. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023
2022
- Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning. Proc. VLDB Endow., 2022