When Does Synthetic Data Help Imbalanced Learning?
Synthetic data are increasingly used to address data scarcity and imbalance in modern learning problems, especially when important outcomes are rare. In classification tasks with severe class imbalance, a common strategy is to generate synthetic minority samples and use them to augment the training set. Although this strategy is effective in practice, its statistical benefits remain poorly understood: synthetic data augmentation may reduce the effects of imbalance, but it may also introduce bias if the generated samples do not accurately reflect the target distribution.
In this talk, I will present a statistical answer to the question in the title. Specifically, I will describe a unified theoretical framework that clarifies when synthetic data improve prediction, when they may instead be harmful, and how the optimal amount of augmentation depends on generator quality and the underlying learning task. Our results show that the impact of synthetic data is determined by the interaction among class imbalance, synthetic bias, and the underlying prediction task. These insights also yield a practical, data-driven method for choosing the amount of synthetic augmentation. If time permits, I will also briefly discuss how to directly correct synthetic bias, how synthetic data can improve fairness and threshold-independent metrics such as AUROC and AUPRC, and applications to echocardiogram diagnosis.
Anru Zhang is a primary faculty member jointly appointed in the Department of Biostatistics & Bioinformatics and the Department of Computer Science at Duke University. He is also the Eugene Anson Stead, Jr. M.D. Associate Professor and serves as Associate Chair for Research in the Department of Biostatistics & Bioinformatics. His current research interests include generative models, biomedical data science, tensor learning, and high-dimensional statistics. He has received a Leo Breiman Junior Award, a COPSS Emerging Leader Award, an IMS Tweedie Award, an ASA Gottfried E. Noether Junior Award, an AMIA Data Science Outstanding Paper Award, and an NSF CAREER Award. Two of his PhD students have received the IMS Lawrence D. Brown Award. He currently serves as an associate editor for the Annals of Statistics, the Journal of the American Statistical Association (Theory & Methods and Applications & Case Studies), Statistica Sinica, ASA Discoveries, and Statistics and Its Interface. His research is currently supported by two NIH R01 grants (one as sole PI and one as MPI with clinical investigators).
Host: Ran Chen