Graduate Student Seminar Series Presents: Integrating literary theory and data science: Assessing GPT-generated synthetic fiction for stylistic accuracy using Random Forests

Claudia Carroll, Postdoctoral Research Associate in TRIADS at Washington University in St. Louis

In this talk, I will describe a recent project of the AI Humanities Lab, in which we compare a corpus of 6,000 paragraph-length synthetic texts generated by GPT-4 “in the style of” ten canonical nineteenth-century English-language authors with authentic texts written by the same authors. Using a feature set informed by literary-theoretical research on authorial style, we built a Random Forest model that distinguishes the synthetic fiction from the authentic fiction with 96% accuracy, which suggests that GPT-4 fails to incorporate style effectively as a dimension of its writing.