Exploring the theoretical and practical aspects of watermarking techniques for detecting AI-generated content, including trade-offs, failure modes, and information-theoretic limits
The rise of Large Language Models (LLMs) and multimodal models has changed the nature of digital content, making the boundary between human and machine authorship increasingly porous. While this capability offers utility, it also introduces risks, such as the spread of misinformation, academic dishonesty, and a general erosion of trust in digital communication.
The central question is whether text generated by an LLM (or, in the case of multimodal models, any content in the form of text, image, or audio) can be reliably distinguished from that written by a human. Several papers have proposed watermarking techniques, a notable number of them appearing in ICLR presentations and conference proceedings.
In the sections that follow, we aim to give readers a robust theoretical and practical understanding of watermarking, emphasizing the trade-offs and failure modes of these techniques. We also aim to tie in information-theoretic limits: how much signal one can embed without degrading the text, and what detection error bounds are acceptable.
Detection approaches can be roughly bifurcated into two distinct methods: detection methods that operate post-hoc on finished text, and watermarking, which represents a more proactive approach to establishing provenance. Watermarking aims to embed an imperceptible statistical signal into text during the generation process, thereby creating a verifiable link between an output and its source. This signal is not a secret message in itself but rather a detectable pattern that identifies the text as machine-generated.
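To make the idea concrete, the sketch below illustrates one well-known family of such schemes, the "green-list" logit-bias watermark: the previous token seeds a pseudorandom split of the vocabulary, and tokens in the green half receive a small boost to their logits before sampling. The vocabulary size, key string, and the `GAMMA`/`DELTA` parameters are illustrative assumptions, not values taken from any particular paper discussed in this post.

```python
# Toy sketch of a "green list" logit-bias watermark (illustrative values only).
import hashlib
import numpy as np

VOCAB_SIZE = 50_000      # illustrative vocabulary size
GAMMA = 0.5              # fraction of the vocabulary marked "green"
DELTA = 2.0              # logit boost added to green tokens
SECRET_KEY = "demo-key"  # illustrative watermark key

def green_list(prev_token: int) -> np.ndarray:
    # Derive the green set deterministically from the key and the previous
    # token, so a detector holding the same key can recompute it later.
    digest = hashlib.sha256(f"{SECRET_KEY}|{prev_token}".encode()).hexdigest()
    rng = np.random.default_rng(int(digest, 16) % 2**32)
    perm = rng.permutation(VOCAB_SIZE)
    return perm[: int(GAMMA * VOCAB_SIZE)]

def watermarked_sample(logits: np.ndarray, prev_token: int,
                       rng: np.random.Generator = np.random.default_rng()) -> int:
    # Boost green-token logits, renormalize, and sample as usual. Over many
    # tokens the output carries a detectable excess of green tokens.
    biased = logits.copy()
    biased[green_list(prev_token)] += DELTA
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(rng.choice(VOCAB_SIZE, p=probs))
```

A detector holding the same key can recompute the green lists for a given passage and test whether the observed fraction of green tokens is implausibly high for human-written text.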
AI-generated writing has two notable characteristics that make it seem a little too perfect (and a little less human). These two characteristics are perplexity and burstiness.
Perplexity - In language modeling, perplexity quantifies a model’s uncertainty or “confusion” when predicting the next token in a sequence. Mathematically, it is the exponential of the average negative log-likelihood per token.
\[\text{Perplexity}(x_{1:T}) = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t})\right)\]A lower perplexity score indicates that the model is more confident in its predictions, as it is effectively choosing from a smaller set of likely next words.
The goal of LLM training is to minimize perplexity on a corpus of human text. As a result, models tend to sample high-probability tokens, and their generated text often has a lower perplexity score, when evaluated by a language model, than typical human-written text.
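Perplexity is straightforward to compute with an off-the-shelf causal language model. Below is a minimal sketch using Hugging Face `transformers` with GPT-2 as the scoring model; the choice of model and library is an illustrative assumption, not something prescribed above.

```python
# Sketch: scoring a passage's perplexity with an off-the-shelf causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Average negative log-likelihood per token; passing labels=input_ids
    # makes the model return the mean cross-entropy, which we exponentiate.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

print(perplexity("The cat sat on the mat."))              # typically low
print(perplexity("Cerulean walruses negotiate taxes."))   # typically higher
```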
Burstiness - While perplexity measures the average predictability of a text, burstiness measures its variance: how much the perplexity changes over the course of a document. One common way to quantify it is
\[B = \frac{\sigma - \mu}{\sigma + \mu}\]where \(B\) = burstiness, \(\sigma\) = the standard deviation of the per-segment (e.g., per-sentence) perplexities, and \(\mu\) = their mean. \(B\) ranges from \(-1\) (perfectly uniform) to \(+1\) (highly bursty).
Human writing is often characterized by “bursts” of high perplexity, where a writer uses a creative metaphor, a rare word, or an unconventional sentence structure. In contrast, LLM-generated text tends to maintain a more uniform level of perplexity, resulting in low burstiness.
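The sketch below computes this coefficient from a list of per-sentence perplexity scores (for instance, produced by the `perplexity()` helper sketched earlier). The \((\sigma - \mu)/(\sigma + \mu)\) form is one common operationalization, not necessarily the exact one used by any particular detector.

```python
# Sketch: burstiness coefficient over per-sentence perplexity scores.
import statistics

def burstiness(sentence_ppls: list[float]) -> float:
    # Compare the spread of per-sentence perplexities to their mean:
    # uniform machine text pushes B toward -1, uneven human text raises it.
    mu = statistics.mean(sentence_ppls)
    sigma = statistics.pstdev(sentence_ppls)
    return (sigma - mu) / (sigma + mu)

# Example: a flat perplexity profile vs. a "bursty" one.
print(burstiness([22.0, 24.0, 23.0, 21.0]))    # near -1: very uniform
print(burstiness([12.0, 95.0, 18.0, 240.0]))   # near 0: far burstier
```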
Beyond such raw statistics, it can also be observed that machine text exhibits noticeable stylistic patterns. Studies like