Graduate Student Seminar: Maximum Likelihood Estimation for Entity Ranking Under Iterative Synthetic Data Augmentation
We study top-$K$ ranking and inference under the Bradley--Terry--Luce (BTL) model using pairwise comparisons collected over sparse comparison graphs. Motivated by the emerging literature on \textit{model collapse} arising from recursive training on synthetic data, we analyze an iterative synthetic \textit{augmentation} workflow in which synthetic comparisons generated from fitted models are iteratively added to the original dataset. In this process, the proportion of real data may vanish as the number of iterations grows. For the resulting iterative maximum likelihood estimator (MLE), we derive its finite-sample optimal $\ell_2$ and $\ell_{\infty}$ statistical rates and establish its asymptotic normality under natural identifiability conditions. We further characterize regimes in which model collapse is avoided despite the diminishing fraction of real data. We validate our theoretical findings through large-scale numerical experiments and an application to the Arena Human Preference 140k dataset.
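To make the augmentation workflow concrete, the following is a minimal sketch and not the procedure analyzed in this work. It assumes the standard BTL form $P(i \text{ beats } j) = e^{\theta_i}/(e^{\theta_i}+e^{\theta_j})$, fits the MLE by plain gradient ascent with a sum-to-zero constraint, and, as an illustrative assumption, draws a fixed number of synthetic comparisons per round on the same edges as the real comparison graph. The names `sample_comparisons`, `fit_btl_mle`, and `iterative_augmentation`, along with all parameter values, are hypothetical.

```python
import numpy as np

def sample_comparisons(theta, edges, n_per_edge, rng):
    """For each edge (i, j), draw how many of n_per_edge comparisons i wins under BTL."""
    i, j = edges[:, 0], edges[:, 1]
    p = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))  # P(i beats j)
    return rng.binomial(n_per_edge, p)

def fit_btl_mle(n, edges, wins, totals, n_iter=5000, tol=1e-9):
    """Gradient ascent on the BTL log-likelihood, centered so that sum(theta) = 0."""
    i, j = edges[:, 0], edges[:, 1]
    # Weighted degree bounds the curvature, so 1 / max degree is a safe step size.
    deg = (np.bincount(i, weights=totals, minlength=n)
           + np.bincount(j, weights=totals, minlength=n))
    step = 1.0 / deg.max()
    theta = np.zeros(n)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))
        resid = wins - totals * p  # observed minus expected wins of i on each edge
        grad = (np.bincount(i, weights=resid, minlength=n)
                - np.bincount(j, weights=resid, minlength=n))
        theta = theta + step * grad
        theta -= theta.mean()  # identifiability: scores are defined up to a shift
        if np.abs(grad).max() * step < tol:
            break
    return theta

def iterative_augmentation(n, edges, real_wins, real_totals, n_rounds, synth_per_edge, rng):
    """Refit after each round of adding synthetic comparisons to the pooled data.

    Assumption (not from the source): synthetic data reuse the real comparison
    graph and add synth_per_edge comparisons per edge in every round.
    """
    wins = real_wins.astype(float).copy()
    totals = real_totals.astype(float).copy()
    theta_hat = fit_btl_mle(n, edges, wins, totals)
    history = [theta_hat]
    fractions = [real_totals.sum() / totals.sum()]  # share of real data in the pool
    for _ in range(n_rounds):
        synth_wins = sample_comparisons(theta_hat, edges, synth_per_edge, rng)
        wins += synth_wins
        totals += synth_per_edge
        theta_hat = fit_btl_mle(n, edges, wins, totals)
        history.append(theta_hat)
        fractions.append(real_totals.sum() / totals.sum())
    return history, fractions

# Toy usage: 6 entities compared along a ring (a sparse, connected graph).
rng = np.random.default_rng(0)
n, n_rounds = 6, 5
edges = np.array([(k, (k + 1) % n) for k in range(n)], dtype=int)
theta_star = np.linspace(-1.0, 1.0, n)  # hypothetical ground-truth scores
real_totals = np.full(len(edges), 20)
real_wins = sample_comparisons(theta_star, edges, real_totals, rng)
history, fractions = iterative_augmentation(
    n, edges, real_wins, real_totals, n_rounds, synth_per_edge=20, rng=rng
)
print(np.round(fractions, 3))            # real-data share shrinks each round
print(np.argsort(-history[-1])[:3])      # estimated top-3 entities after the last round
```

In this sketch the real-data share after $t$ rounds is $L/(L + tm)$ for $L$ real and $m$ synthetic comparisons per edge, which illustrates how the proportion of real data can vanish as the iterations proceed.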