Harnessing Synthetic Data from Generative AI for Statistical Inference

Xihong Lin, Professor of Biostatistics and Chair and Professor of Statistics at Harvard University

The integration of statistics and generative AI plays a pivotal role in accelerating trustworthy cross-domain scientific discovery. Recent advances in generative models have dramatically increased the availability and use of synthetic data across scientific domains. While these developments create exciting opportunities for empowering data analysis, they also raise fundamental statistical challenges regarding how synthetic data can be used in a valid, reliable, and principled manner. In this talk, we first discuss the current landscape of synthetic data generation using generative AI models such as transformer- and diffusion-based models. More importantly, we present a principled framework for incorporating synthetic data in downstream statistical analysis that ensures valid statistical inference even when generative AI models are misspecified. We show that the proposed synthetic-data-assisted methods, which integrate observed and synthetic data, are robust to misspecified black-box generative models and can improve statistical inferential power when the generative AI models are informative. We demonstrate the utility of these methods in the analysis of UK Biobank data by performing genome-wide association studies (GWAS) of proteomic data and whole-genome sequencing (WGS) analyses of brain imaging phenotypes, both characterized by substantial missingness (about 90%).

Host: Ran Chen

A reception will follow the talk in the Weidenbaum Suite, located in Seigle Hall 170.