Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models

ICML 2024

1Harvard University, 2George Mason University, 3Sapienza University of Rome

Abstract

Overview of Research: Watermarking generative models consists of planting a statistical signal (watermark) in a model's output so that it can be later verified that the output was generated by the given model. A strong watermarking scheme satisfies the property that a computationally bounded attacker cannot erase the watermark without causing significant quality degradation. In this paper, we study the (im)possibility of strong watermarking schemes.

We prove that, under well-specified and natural assumptions, strong watermarking is impossible to achieve. This holds even in the private detection algorithm setting, where the watermark insertion and detection algorithms share a secret key, unknown to the attacker. To prove this result, we introduce a generic efficient watermark attack; the attacker is not required to know the private key of the scheme or even which scheme is used.

Our attack is based on two assumptions: (1) The attacker has access to a "quality oracle" that can evaluate whether a candidate output is a high-quality response to a prompt, and (2) The attacker has access to a "perturbation oracle" which can modify an output with a nontrivial probability of maintaining quality, and which induces an efficiently mixing random walk on high-quality outputs. We argue that both assumptions can be satisfied in practice by an attacker with weaker computational capabilities than the watermarked model itself, to which the attacker has only black-box access. Furthermore, our assumptions will likely only be easier to satisfy over time as models grow in capabilities and modalities. We demonstrate the feasibility of our attack by instantiating it to attack three existing watermarking schemes for large language models: Kirchenbauer et al. (2023), Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully removes the watermarks planted by all three schemes, with only minor quality degradation.

Attack Overview

Illustration of Attack: We consider the set of all possible outputs and, within it, the set of all high-quality outputs (with respect to the original prompt). For any quality-preserving watermarking scheme with a low false-positive rate, the set of watermarked outputs (green) will be a small subset of the set of high-quality outputs (orange). We then take a random walk on the set of high-quality outputs to arrive at a non-watermarked output (red): we generate candidate neighbors with the perturbation oracle and use the quality oracle to reject all low-quality candidates. The differences from the original watermarked text are highlighted.
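To make the procedure concrete, here is a minimal Python sketch of the random walk, assuming hypothetical `perturbation_oracle` and `quality_oracle` callables; the step budget is illustrative rather than the exact schedule used in our experiments.

```python
# Minimal sketch of the generic watermark-removal walk. The oracle callables
# and the step budget are illustrative assumptions, not the exact
# implementation from the paper.
def remove_watermark(prompt, watermarked_output,
                     perturbation_oracle, quality_oracle, n_steps=200):
    """Random walk on the set of high-quality outputs.

    perturbation_oracle(text) -> a randomly perturbed candidate.
    quality_oracle(prompt, candidate, reference) -> True if the candidate
        is still a high-quality response to the prompt.
    """
    current = watermarked_output
    for _ in range(n_steps):
        candidate = perturbation_oracle(current)
        # Reject moves that degrade quality; accept all others.
        if quality_oracle(prompt, candidate, watermarked_output):
            current = candidate
    return current
```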

Attack Results

For illustration purposes, we only show valid outputs that passed the evaluation of the quality oracle. For final evaluations, we use the OpenAI API to query GPT-4.[1]


Text Generation

We apply the KGW scheme (Kirchenbauer et al., 2023) to Llama-3-70B-Instruct. We implement our perturbation oracle with T5-XL-v1.1, which performs span replacement, and use a DeBERTa-v3-large reward model trained on OpenAssistant data as our quality oracle. Watermarked texts are highlighted with green spans, while perturbed texts use red spans. Pairwise comparison of the highlighted spans shows how watermarked text can be altered without significantly degrading quality with respect to the prompt.
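The snippet below is a simplified sketch of how these two oracles can be instantiated with off-the-shelf Hugging Face checkpoints; the checkpoint names, span length, and sampling parameters are illustrative assumptions rather than the exact settings of our experiments. In the attack loop, a candidate is accepted when its reward-model score stays close to that of the original watermarked response.

```python
# Simplified sketch of the text perturbation and quality oracles.
# Checkpoint names and hyperparameters are illustrative assumptions.
import random
import torch
from transformers import (AutoModelForSeq2SeqLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

t5_tok = AutoTokenizer.from_pretrained("google/t5-v1_1-xl")
t5 = AutoModelForSeq2SeqLM.from_pretrained("google/t5-v1_1-xl")
rm_tok = AutoTokenizer.from_pretrained(
    "OpenAssistant/reward-model-deberta-v3-large-v2")
rm = AutoModelForSequenceClassification.from_pretrained(
    "OpenAssistant/reward-model-deberta-v3-large-v2")

def perturbation_oracle(text, span_len=8):
    """Replace one random span of words with a T5 span-infilling sample."""
    words = text.split()
    start = random.randrange(max(1, len(words) - span_len))
    masked = " ".join(words[:start] + ["<extra_id_0>"] + words[start + span_len:])
    ids = t5_tok(masked, return_tensors="pt", truncation=True).input_ids
    out = t5.generate(ids, do_sample=True, top_p=0.95, max_new_tokens=40)
    fill = t5_tok.decode(out[0], skip_special_tokens=True)
    return " ".join(words[:start] + fill.split() + words[start + span_len:])

def quality_score(prompt, response):
    """Reward-model score of a (prompt, response) pair; higher is better."""
    inputs = rm_tok(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0].item()
```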

Prompt: What is fair use?

Watermarked text: Fair use is a provision in the U.S. copyright law that allows limited use of copyrighted materials without seeking permission from the copyright holder. The purpose of fair use is to promote creativity, innovation, and education by allowing certain uses of copyrighted works without infringing on the copyright holder's exclusive rights. Fair use applies to educational uses when: 1. The use is for a nonprofit educational purpose, such as teaching, research, or scholarship. 2. The use involves a limited or fair amount of the copyrighted material, such as excerpts or quotes. 3. The use does not harm the market for the copyrighted work or substitute for a purchase. Examples of educational uses that might qualify as fair use: * Quoting a few lines from a copyrighted poem to illustrate a concept in a literature class. * Screening a copyrighted film in class to facilitate analysis and discussion. * Reproducing a copyrighted image on a course webpage to support teaching. However, it's essential to evaluate each use on a case-by-case basis, considering factors such as: * The purpose of the use (non-profit educational, transformative, or commercial). * The nature of the copyrighted work (creative, factual, or educational). * The amount and substantiality of the portion used (small, reasonable, or excessive). * The impact of the use on the market for or value of the copyrighted original. If you're unsure whether your educational use of copyrighted material qualifies as fair use, consult with your institution's copyright officer or librarian for guidance. They can help you assess the use and recommend alternative solutions, such as seeking permission from copyright holders or using open educational resources. Remember that copyright laws and fair use guidelines vary across jurisdictions. If your educational institution is not in the United States, consult your local copyright laws and regulations for guidance on fair use.

Response after attack: Fair use is a provision in the U.S. copyright law that allows limited use of copyrighted materials without a specific license or permission from the copyright owner. The purpose of the fair use rule is in part to encourage scholarship and it originally provided an exemption intended to permit scholars and the general public to engage in acts that would otherwise be prohibited by the law: 1. The use is for the copyist's nonprofit educational purpose, such as teaching, research, or scholarship. 2. Educational copying uses only a fair portion of the original. That means, the portions used must reasonably be connected to the purpose of the educational use. 3. The educational copying will not harm the market for or of the original unless use of the educational copy prevents copy being required to make or purchase. Examples of educational uses that might qualify as fair use would include: Reproducing a few lines from a copyrighted poem to illustrate a concept in a literature class. Screening a copyrighted film in class to facilitate an academic discussion. Displaying copyrighted material on a course webpage to support teaching. However, the federal courts have found that each case must be considered on its own merits by looking at such factors as: the purpose of the use or its nature (educational, transformative, or instructional). Effect of the review. Modification copyrighted material used (conservative, or transformative) of the copyrighted material used (limited or excessive). Impact on the market for the original (inadvertent or substantial). If you're unsure whether your educational use of copyrighted material qualifies as fair use, consult your institution's copyright officer or librarian. They can help you determine whether fair use would support the use and recommend alternative solutions, such as seeking permission from copyright owners or using shared free, open- educational resources. Remember, copyright law and fair use also vary by country. If you're considering an educational use outside of the U.S., check the local copyright law to determine whether it qualifies as educational fair use.

After Attack: z-score: 6.91 → 1.61, GPT-4 judge: -0.25
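For context, the z-score above is the KGW detection statistic: the detector counts how many scored tokens fall in the secret "green list" and compares that count to the chance level. A minimal sketch of the computation, with the green-list fraction gamma as an assumed parameter:

```python
# Minimal sketch of the KGW green-list z-score (Kirchenbauer et al., 2023).
# gamma is the fraction of the vocabulary placed on the green list; the
# default value here is illustrative.
import math

def kgw_z_score(num_green, num_scored, gamma=0.25):
    """One-proportion z-test for an excess of green-list tokens."""
    expected = gamma * num_scored
    std = math.sqrt(num_scored * gamma * (1.0 - gamma))
    return (num_green - expected) / std
```

Watermarked text yields a large positive z-score (6.91 above); after the attack it drops to 1.61, below the detection thresholds typically used (around z = 4 in Kirchenbauer et al.).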

Text-to-image Generation

For removing watermarks from vision-language models, we consider Stable Signature applied to sdxl-turbo. We implement our perturbation oracle using stable-diffusion-2-base, which inpaints the image under a random square mask of size 0.02 times the image size, and our quality oracle as a reward model trained on Human Preference Score v2. The watermarked image is on the left, and perturbed images are on the right.
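Below is a condensed sketch of this image perturbation step using a dedicated diffusers inpainting checkpoint as a stand-in for the stable-diffusion-2-base setup described above; the mask here covers roughly 2% of the image area (one reading of the mask size), and all names and parameters are illustrative.

```python
# Condensed sketch of the image perturbation oracle: inpaint one small
# random square patch per step. The inpainting checkpoint and the
# mask-size interpretation are illustrative assumptions.
import random
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

def perturb_image(image: Image.Image, prompt: str, mask_frac: float = 0.02):
    """Repaint a random square region covering ~mask_frac of the image area."""
    w, h = image.size
    side = max(1, int((mask_frac * w * h) ** 0.5))
    x = random.randrange(0, max(1, w - side))
    y = random.randrange(0, max(1, h - side))
    mask = Image.new("L", (w, h), 0)
    mask.paste(255, (x, y, x + side, y + side))  # white = region to repaint
    return pipe(prompt=prompt, image=image, mask_image=mask).images[0]
```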

Prompt: photo of sailer moon

[Figure: Initial Image 1 and Initial Image 2]
After Attack: p-value: 7.0e-6 → 0.29, CLIP score: 33.04 → 34.14, GPT-4 judge: -0.25
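The p-value above comes from the bit-matching test used by decoder-based image watermarks such as Stable Signature: the detector decodes a k-bit message and asks how likely that many bits would match the secret key by chance. A hedged sketch, assuming the decoded bits and key bits are given:

```python
# Hedged sketch of a Stable Signature-style detection p-value: the binomial
# tail probability of matching this many key bits under the no-watermark null.
from scipy.stats import binom

def watermark_p_value(decoded_bits, key_bits):
    k = len(key_bits)
    matches = sum(int(a == b) for a, b in zip(decoded_bits, key_bits))
    return binom.sf(matches - 1, k, 0.5)  # P[Binomial(k, 1/2) >= matches]
```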

Animation

Using the same setting as above, we showcase more examples of our attack on vision-language models. There are 200 valid steps in total and each frame is 5 perturbations away from the previous one.

[Animations of the attack's random walk on example images]

Ethics and Disclaimer

We believe that investigating the possibilities and limitations of watermarking schemes at this stage can help provide a better understanding of the inherent tradeoffs, and give policymakers realistic expectations of what watermarking can and cannot provide.

While our techniques can be used to remove watermarks from existing schemes, they are not the most efficient way to do so, with the benefit being generality rather than efficiency. Moreover, our implementation is for text generation models, while currently widely deployed watermarks are for image generation models.

While it is possible to adapt our ideas to attack deployed image generative models, we do not provide a recipe for doing so in this paper. Thus, our work isn’t likely to be used by malicious actors. Rather, we see exposing fundamental weaknesses in the watermarking paradigm as a contribution to the ongoing discussion on how to mitigate the misuse of generative models. We hope our findings will be taken into account by organizations building generative models and policymakers regulating them.

BibTeX

@inproceedings{zhang2024watermarks,
  title={Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models},
  author={Hanlin Zhang and Benjamin L. Edelman and Danilo Francati and Daniele Venturi and Giuseppe Ateniese and Boaz Barak},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
}
[1] We use a GPT-4 judge with more fine-grained scoring than in our paper: 1 = the post-attack response is much better than the watermarked one, 0.5 = slightly better, 0 = of similar quality, -0.5 = slightly worse, -1 = much worse. We run two evaluations for each step and report the average to control for position bias.