Recent Advances in Topic Modeling 


Tracy Ke, Associate Professor of Statistics at Harvard University

Topic modeling is a widely used technique in text analysis. Classical models rely on an approximate low-rank factorization of the word count matrix. In the first part of this talk, we introduce Topic-SCORE, a spectral algorithm for estimating classical topic models. It is computationally faster than other popular topic modeling algorithms and attains a theoretically optimal rate.
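
As a rough illustration of the low-rank structure mentioned above (not Topic-SCORE itself), the sketch below simulates a classical topic model in which the expected word-frequency matrix factors as a topic matrix times a weight matrix. The function name, the symbols A, W, D0, and all sizes are assumptions chosen for the example and do not come from the talk.

```python
import numpy as np

def simulate_topic_model(p=50, n=200, K=3, doc_len=500, seed=0):
    """Simulate word counts from a K-topic model (illustrative sizes only)."""
    rng = np.random.default_rng(seed)
    # A: p x K topic matrix; each column is a distribution over p words.
    A = rng.dirichlet(np.ones(p), size=K).T
    # W: K x n weight matrix; each column holds one document's topic mix.
    W = rng.dirichlet(np.ones(K), size=n).T
    # D0: p x n expected word frequencies, of rank at most K by construction.
    D0 = A @ W
    # Observed counts: one multinomial draw of doc_len words per document.
    counts = np.column_stack(
        [rng.multinomial(doc_len, D0[:, j]) for j in range(n)]
    )
    return A, W, D0, counts

A, W, D0, counts = simulate_topic_model()

# Column-normalized observed frequencies: a noisy version of D0.
D = counts / counts.sum(axis=0)

# The noiseless matrix has rank K; the observed one is only approximately low-rank.
rank_D0 = np.linalg.matrix_rank(D0)
singular_values = np.linalg.svd(D, compute_uv=False)

print("rank of expected frequency matrix:", rank_D0)
print("leading singular values of observed matrix:", singular_values[:6])
```

In a simulation like this, the leading K singular values of the observed matrix are typically well separated from the rest, which is the kind of structure a spectral method can exploit.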

In the second part, we extend the classical topic model to capture the distribution of word embeddings from pre-trained large language models (LLMs), enabling the incorporation of word context. We propose a flexible algorithm that integrates traditional topic modeling with nonparametric estimation. We showcase the effectiveness of our methods using MADStat, a dataset comprising 83,000 paper abstracts from statistics-related journals.

Host: Robert Lunde