Statistics and Data Science Seminar: Some new results for streaming principal component analysis
Abstract: Streaming PCA, also known as Oja's algorithm, with roots going back to 1949, has attracted much attention in Statistics and Computer Science in the last decade. In this talk, I will discuss two of our works (jointly with Robert Lunde, Rachel Ward, and Syamantak Kumar) that address uncertainty estimation and Markovian data streams, which have been relatively underexplored in the past.
First, I will talk about the problem of quantifying uncertainty for the estimation error of the leading eigenvector using Oja's algorithm for streaming PCA, where the data are generated IID from some unknown distribution. Combining classical tools from the U-statistics literature with recent results on high-dimensional central limit theorems for quadratic forms of random vectors and concentration of matrix products, we establish a distributional approximation result for the error between the population eigenvector and the output of Oja's algorithm. We also propose an online multiplier bootstrap algorithm and establish conditions under which the bootstrap distribution is close to the corresponding sampling distribution with high probability.
Our second work looks at data streams generated from a Markov chain. While streaming PCA is typically analyzed under the IID data model, in many applications like distributed optimization, data points are sampled from a Markov chain and therefore are dependent. I will show how the data dependence leads to difficulties in the theoretical analysis and present our near-optimal finite sample convergence guarantees under standard assumptions on the underlying Markov chain.
Bio: Purnamrita Sarkar is an associate professor at the Department of Statistics and Data Science at the University of Texas at Austin. Prior to joining UT Austin, she received her Ph.D. in Machine Learning from Carnegie Mellon University and was a postdoctoral scholar at the Departments of Statistics and Electrical Engineering and Computer Science at UC Berkeley. Dr. Sarkar has worked on new methodology and statistical theory for networks, inference for streaming algorithms, and general clustering problems. Her work typically blends tools from theoretical statistics, optimization, machine learning, and random matrix theory. At UT Austin, she co-leads the NSF Tripods Phase 2 Institute and is affiliated with the NSF AI Institute for Foundations of Machine Learning.
Host: Debashis Mondal