Graduate Student Seminar: Maximum Likelihood Estimation for Entity Ranking Under Iterative Synthetic Data Augmentation

Yiqiao Jin, PhD Student in Statistics & Data Science at Washington University in St. Louis

We study top-$K$ ranking and inference under the Bradley--Terry--Luce (BTL) model using pairwise comparisons collected over sparse comparison graphs. Motivated by the emerging literature on \textit{model collapse} arising from recursive training on synthetic data, we analyze an iterative synthetic \textit{augmentation} workflow in which synthetic comparisons generated from fitted models are iteratively added to the original dataset. In this process, the proportion of real data may vanish as the number of iterations grows. For the resulting iterative maximum likelihood estimator (MLE), we derive its finite-sample optimal $\ell_2$ and $\ell_{\infty}$ statistical rates and establish its asymptotic normality under natural identifiability conditions. We further characterize regimes in which model collapse is avoided despite the diminishing fraction of real data. We validate our theoretical findings through large-scale numerical experiments and an application to the Arena Human Preference 140k dataset.
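The augmentation workflow described in the abstract can be illustrated with a small simulation. The sketch below is not the speaker's implementation: the function names (`fit_btl`, `sample_wins`, `iterative_augmentation`), the random comparison graph with an added path to keep it connected, the plain gradient-ascent solver, and all sizes are assumptions chosen for illustration. It fits the BTL MLE on real comparisons, repeatedly draws synthetic comparisons from the current fit, pools them with everything collected so far, and refits, tracking how the share of real data shrinks.

```python
import numpy as np


def fit_btl(wins, n_iter=2000):
    """Approximate BTL MLE by gradient ascent (illustrative solver choice).

    wins[i, j] is the number of times item i beat item j.
    Returns scores theta centered to sum to zero for identifiability.
    """
    total = wins + wins.T                      # comparisons per pair
    step = 1.0 / max(total.sum(axis=1).max(), 1.0)
    theta = np.zeros(wins.shape[0])
    for _ in range(n_iter):
        # p[i, j] = P(i beats j) under the current scores
        p = 1.0 / (1.0 + np.exp(theta[None, :] - theta[:, None]))
        grad = (wins - total * p).sum(axis=1)  # score of the log-likelihood
        theta = theta + step * grad
        theta -= theta.mean()
    return theta


def sample_wins(theta, edges, n_per_edge, rng):
    """Draw n_per_edge BTL comparisons on every edge of the graph."""
    wins = np.zeros((len(theta), len(theta)))
    for i, j in edges:
        p_ij = 1.0 / (1.0 + np.exp(theta[j] - theta[i]))
        w = rng.binomial(n_per_edge, p_ij)
        wins[i, j] += w
        wins[j, i] += n_per_edge - w
    return wins


def iterative_augmentation(real_wins, edges, n_rounds, n_synth, rng):
    """Refit after each round of synthetic data drawn from the latest fit."""
    pooled = real_wins.copy()
    theta_hat = fit_btl(pooled)
    for _ in range(n_rounds):
        pooled = pooled + sample_wins(theta_hat, edges, n_synth, rng)
        theta_hat = fit_btl(pooled)
    real_fraction = real_wins.sum() / pooled.sum()
    return theta_hat, real_fraction


# Illustrative setup: all values below are assumptions, not from the talk.
rng = np.random.default_rng(0)
n_items, top_k_size = 20, 5
theta_true = np.linspace(-1.0, 1.0, n_items)
# Sparse random graph; the path i -> i+1 is added so the graph is connected.
edges = [(i, j) for i in range(n_items) for j in range(i + 1, n_items)
         if j == i + 1 or rng.random() < 0.3]

real_wins = sample_wins(theta_true, edges, 20, rng)
theta_hat, real_fraction = iterative_augmentation(
    real_wins, edges, n_rounds=5, n_synth=20, rng=rng)

top_k = np.argsort(-theta_hat)[:top_k_size]   # estimated top-K items
linf_error = np.max(np.abs(theta_hat - theta_true))
print("real-data fraction:", real_fraction)
print("estimated top-K:", top_k, "l_inf error:", linf_error)
```

Increasing `n_rounds` drives `real_fraction` toward zero, which is the regime the abstract refers to; comparing `linf_error` across rounds gives a rough empirical feel for whether the estimate degrades, though the simulation says nothing about the rates established in the talk.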