Undergraduate Honors Thesis Presentation: A Statistical Reading of 16th- and 17th-Century English Printed Sermons

Vedul Palavajjhala, Undergraduate Student at Washington University in St. Louis

In 1993, K. R. Clarke developed "analysis of similarities" (ANOSIM), a distance-based analog of the one-way ANOVA. ANOSIM has been widely adopted in ecology for its minimal statistical assumptions, yet it is rarely applied outside the life sciences. In this study, I apply ANOSIM to Early Modern English sermons, a subset of roughly 124 million words drawn from the 1.65-billion-word EarlyPrint (EP) corpus. EarlyPrint is a collaborative effort between Digital Humanities researchers at Northwestern University and WashU to create a comprehensive digital corpus of annotated English printed texts from 1473 to the early 1700s. Printed sermons provide a unique look into the weekly routines of churchgoers, which were shaped by the political, religious, and economic landscape of Early Modern England. Minimizing statistical assumptions is essential in text analysis, where we have little or no knowledge of the exact probability distribution from which our texts are sampled. Furthermore, because ANOSIM operates on a ranked distance matrix rather than on the distances themselves, it limits the influence of outliers in the data. My experiments with ANOSIM cover three kinds of comparison: between time periods in English history (pre-1609, 1610-1659, 1660-); between collection strategies, focusing on distinguishing texts collected by the prolific book collector George Thomason from texts collected by others; and between sermons preached at assizes and the other sermons in the corpus. First, I will establish the sources and limitations of the EarlyPrint corpus and explore document embedding approaches, including term frequency-inverse document frequency (tf-idf) and Doc2Vec, to empirically justify ANOSIM's robustness. I will then enrich this analysis with uniform manifold approximation and projection (UMAP) for visualization and with topic modeling using non-negative matrix factorization (NNMF) to add interpretability to ANOSIM's results.
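
The rank-based statistic described in the abstract can be illustrated with a short sketch. This is not the thesis code: the function names (`average_ranks`, `anosim_r`, `anosim_test`) and the four-point toy data are assumptions made for illustration. It computes Clarke's R, the difference between the mean rank of between-group distances and the mean rank of within-group distances, scaled by n(n-1)/4, and estimates a p-value by permuting the group labels. It assumes at least two groups and at least one within-group pair.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def anosim_r(dist, labels):
    """Clarke's R: (mean between-group rank - mean within-group rank) / (n(n-1)/4)."""
    n = len(labels)
    pairs = list(combinations(range(n), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for (i, j), r in zip(pairs, ranks) if labels[i] == labels[j]]
    between = [r for (i, j), r in zip(pairs, ranks) if labels[i] != labels[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (n * (n - 1) / 4)


def anosim_test(dist, labels, permutations=999, seed=0):
    """Return (R, p), where p comes from random permutations of the labels."""
    rng = random.Random(seed)
    observed = anosim_r(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    # Add one to both counts so the observed labeling is included.
    return observed, (hits + 1) / (permutations + 1)


# Toy data (hypothetical): four "documents" on a line, two per group.
points = [0.0, 1.0, 10.0, 11.0]
labels = ["a", "a", "b", "b"]
dist = [[abs(p - q) for q in points] for p in points]

r_stat, p_value = anosim_test(dist, labels, permutations=199)
# R is 1.0 here: every between-group distance outranks every within-group one.
print(r_stat, p_value)
```

Because only the ranks of the distances enter the statistic, replacing 11.0 with a much larger value in the toy data would leave R unchanged, which is the outlier resistance the abstract refers to.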

Advisor: Debashis Mondal