Graduate Student Seminar: Localized Detection of Authenticity in Mixed Source Texts via Epidemic Change-point Perspective
Abstract: With the increasing popularity of large language models, concerns over content authenticity have led to the development of various watermarking schemes. These schemes make it possible to detect machine-generated text using an appropriate key, while remaining imperceptible to readers without such a key. The corresponding detection mechanisms usually take the form of statistical hypothesis tests for the existence of a watermark, and they have spurred extensive research in this direction. However, the finer-grained problem of identifying which segments of a mixed-source text are actually watermarked is much less explored; existing approaches lack either scalability or theoretical guarantees that are robust to paraphrasing and post-editing. In this work, we introduce a new perspective on such watermark segmentation problems through the lens of epidemic change-point analysis. By highlighting the similarities as well as the differences between these two problems, we motivate and propose WISER: a novel, computationally efficient watermark segmentation algorithm. Complementing various theoretical results on consistency, we also find through extensive numerical simulations that WISER outperforms state-of-the-art baseline methods in both computational speed and accuracy, across diverse watermarking schemes and large language models. This work also shows how insights from a classical statistical problem can lead to a theoretically valid and computationally efficient solution to a modern and pertinent problem.