The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses
TL;DR: This post examines when backdoor-based watermarks, adversarial defenses, or transferable attacks are possible in machine learning. We demonstrate that for all learning tasks, at least one of these schemes exists: a watermark, an adversarial defense, or a transferable attack—the latter tied to cryptography and able to bypass defenses while mimicking legitimate data. For a deeper dive, see our recent paper on arXiv.
Introduction
A nonprofit organization plans to open-source a classifier \(f\) but wants to be able to detect when the model is being used, by embedding a watermark directly into it. Alice is tasked with creating this watermark. Bob aims to make \(f\) adversarially robust, i.e., to ensure that it is hard to find queries that look unsuspicious but cause \(f\) to make mistakes. Both face challenges: Alice struggles to create a watermark that cannot be removed, and Bob’s defenses become increasingly complex. They then discover that their projects are connected. Alice’s idea was to plant a backdoor [1,2] in \(f\): she can craft queries containing a hidden trigger that activates the backdoor and causes \(f\) to misclassify, and by observing these misclassifications she can detect that \(f\) is being used. Bob’s approach involved smoothing \(f\) to enhance robustness, which inadvertently removes exactly such backdoors [2]. They realized that their challenges are two sides of the same coin: the impossibility of one task might guarantee the success of the other.
Definitions
In our protocols, Alice (A, verifier) and Bob (B, prover) engage in interactive communication, with distinct roles depending on the specific task. Each protocol is defined with respect to a learning task \(\mathcal{L} = (\mathcal{D}, h)\), an error parameter \(\varepsilon \in \left(0, \frac{1}{2}\right)\), and time bounds \(T_{A}\) and \(T_{B}\). A scheme is successful if the corresponding algorithm satisfies the desired properties with high probability, and we denote the set of such algorithms by \(\text{Scheme}(\mathcal{L}, \varepsilon, T_{A}, T_{B})\), where Scheme refers to Watermark, Defense, or TransfAttack.
Definition 1 (Watermark)
An algorithm $A_{\text{Watermark}}$, running in time $T_{A}$, implements a watermarking scheme for the learning task $\mathcal{L}$, with error parameter $\varepsilon>0$, if an interactive protocol in which $A_{\text{Watermark}}$ computes a classifier $f \colon \mathcal{X} \rightarrow \mathcal{Y}$ and a sequence of queries $\mathbf{x} \in \mathcal{X}^q$, and a prover B outputs $\mathbf{y} = B(f, \mathbf{x}) \in \mathcal{Y}^q$ satisfies the following properties:
- Correctness: $f$ has low error, i.e., $\text{err}(f) \leq \varepsilon$.
- Uniqueness: There exists a prover B, running in time bounded by $T_{A}$, that provides low-error answers, i.e., $\text{err}(\mathbf{x},\mathbf{y}) \leq 2 \varepsilon$.
Figure 1: Schematic overview of the interaction between Alice and Bob in Watermark.
- Unremovability: For every prover B running in time \(T_{B}\), it holds that \(\text{err}(\mathbf{x},\mathbf{y}) > 2\varepsilon\).
- Undetectability: For every prover B running in time \(T_{B}\), the advantage of B in distinguishing the queries \(\mathbf{x}\) generated by \(A_{\text{Watermark}}\) from random queries sampled from \(\mathcal{D}^q\) is small.
Note. Uniqueness requires that a defender who never used \(f\) and instead trained a model \(f_{\text{Scratch}}\) on their own must be accepted as a distinct model. This requirement mirrors real-world scenarios where independent models could have been trained within the given time constraint \(T_{A}\). Additionally, the property forces any successful Watermark to satisfy \(T_{B} < T_{A}\), i.e., Bob’s time must be strictly less than Alice’s: otherwise Bob could simply retrain from scratch and thereby remove the watermark.
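To make the interaction concrete, here is a minimal Python sketch of one run of the Watermark protocol and of the check Alice performs. It is purely illustrative: `alice_watermark`, `bob_prover`, and `err` are hypothetical stand-ins for \(A_{\text{Watermark}}\), a prover B, and the error \(\text{err}(\cdot,\cdot)\) measured against the ground truth \(h\).

```python
from typing import Any, Callable, List, Sequence, Tuple

Classifier = Callable[[Any], int]
# Alice: outputs a watermarked classifier f and a sequence of trigger queries x.
AliceWatermark = Callable[[], Tuple[Classifier, List[Any]]]
# Bob (prover): given f and the queries x, outputs answers y.
BobProver = Callable[[Classifier, Sequence[Any]], List[int]]

def watermark_detected(alice: AliceWatermark,
                       bob: BobProver,
                       err: Callable[[Sequence[Any], Sequence[int]], float],
                       eps: float) -> bool:
    """One run of the protocol. Alice flags Bob as using f when his answers on
    her queries have error above 2*eps (unremovability), whereas a prover that
    trained independently stays below 2*eps (uniqueness) and is not flagged."""
    f, x = alice()   # Alice computes the watermarked classifier and her queries
    y = bob(f, x)    # Bob answers the queries
    return err(x, y) > 2 * eps
```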
Definition 2 (Adversarial Defense)
An algorithm $B_{\text{Defense}}$, running in time $T_{B}$, implements an adversarial defense for the learning task $\mathcal{L}$ with error parameter $\varepsilon > 0$, if an interactive protocol in which $B_{\text{Defense}}$ computes a classifier $f \colon \mathcal{X} \rightarrow \mathcal{Y}$, a verifier A replies with $\mathbf{x} = A(f)$, where $\mathbf{x} \in \mathcal{X}^q$, and $B_{\text{Defense}}$ outputs $b = B_{\text{Defense}}(f, \mathbf{x}) \in \{0,1\}$, satisfies the following properties:
- Correctness: $f$ has low error, i.e., $\text{err}(f) \leq \varepsilon$.
- Completeness: If $\mathbf{x} \sim \mathcal{D}^q$, then $b = 0$.
- Soundness: For every A running in time $T_{A}$, we have $\text{err}(\mathbf{x},f(\mathbf{x})) \leq 7\varepsilon \ \text{ or } \ b = 1$.
Figure 2: Schematic overview of the interaction between Alice and Bob in Adversarial Defense.
The key requirement for a successful defense is the ability to detect when it is being tested. To bypass the defense, an attacker must provide samples that are both adversarial, causing the classifier to make mistakes, and indistinguishable from samples drawn from the data distribution $\mathcal{D}$.
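Analogously, here is a minimal sketch of the Defense interaction and its soundness condition; `bob_defense`, `alice_attacker`, and `err` are again hypothetical stand-ins, not code from the paper.

```python
from typing import Any, Callable, List, Sequence, Tuple

Classifier = Callable[[Any], int]
# Bob: a published classifier f together with a detector that, given queries x,
# outputs a bit b (1 = "I am being attacked").
BobDefense = Tuple[Classifier, Callable[[Sequence[Any]], int]]
# Alice (verifier/attacker): given f, computes possibly adversarial queries x.
AliceAttacker = Callable[[Classifier], List[Any]]

def soundness_holds(bob: BobDefense,
                    alice: AliceAttacker,
                    err: Callable[[Sequence[Any], Sequence[int]], float],
                    eps: float) -> bool:
    """Soundness for one run: either f answers Alice's queries with error at
    most 7*eps, or Bob raises the flag b = 1."""
    f, detect = bob
    x = alice(f)                   # Alice computes her queries after seeing f
    b = detect(x)                  # Bob inspects the queries
    y = [f(xi) for xi in x]        # ...and would answer them with f
    return err(x, y) <= 7 * eps or b == 1
```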
Definition 3 (Transferable Attack)
An algorithm $A_{\text{TransfAttack}}$, running in time $T_A$, implements a transferable attack for the learning task $\mathcal{L}$ with error parameter $\varepsilon > 0$, if an interactive protocol in which $A_{\text{TransfAttack}}$ computes $\mathbf{x} \in \mathcal{X}^q$ and $B$ outputs $\mathbf{y} = B(\mathbf{x}) \in \mathcal{Y}^q$ satisfies the following properties:
- Transferability: For every prover $B$ running in time $T_A$, we have $\text{err}(\mathbf{x},\mathbf{y}) > 2\varepsilon$.
- Undetectability: For every prover $B$ running in time $T_B$, the advantage of $B$ in distinguishing the queries $\mathbf{x}$ generated by $A_{\text{TransfAttack}}$ from random queries sampled from $\mathcal{D}^q$ is small.
Figure 3: Schematic overview of the interaction between Alice and Bob in Transferable Attack. Note that Bob does not receive $f$ here, but replies directly to the queries $\mathbf{x}$ from Alice.
Main Result
Theorem 1 (Main Theorem, informal). For every learning task \(\mathcal{L}\), every \(\varepsilon \in \big(0, \frac{1}{2}\big)\), and every \(T \in \mathbb{N}\) for which there exists a learner that runs in time \(T\) and, with high probability, outputs \(f\) satisfying \(\text{err}(f) \leq \varepsilon\), at least one of the following three schemes exists for \(\mathcal{L}\) (for appropriate settings of the error parameter and the time bounds \(T_A, T_B\)): a Watermark, an Adversarial Defense, or a Transferable Attack.
Proof Idea
Building on the story from the introduction, we can imagine Alice and Bob playing a zero-sum game, in which Alice picks from a set of Watermarking Algorithms (A) and Bob picks from a set of Defense Algorithms (B). In particular,
\[\begin{align*} \mathcal{G}(A,B) = \ &\mathbb{P}(\text{B detects tricky samples from A or has low error on them}) \\ &+ \mathbb{P}(\text{B has low error on samples from the distribution}). \end{align*}\]

From this game it is clear that Alice wants to minimize \(\mathcal{G}\) and Bob wants to maximize it. This is a finite zero-sum game, albeit with exponentially many actions, and hence it admits a Nash equilibrium. The equilibrium gives us a way to “obtain a defense from a failed watermark” and vice versa. As an example, consider the first case: suppose the value of the Nash equilibrium (which is unique for a zero-sum game), \(v^* = \mathcal{G}(A_{\texttt{Nash}},B_{\texttt{Nash}})\), is greater than some threshold \(\tau\), i.e., the probabilities appearing in \(\mathcal{G}\) are large. Then the following procedure gives us a Defense:
- Learn a low-error classifier \(f\).
- Send \(f\) to the party that is attacking the Defense.
- Receive queries \(\mathbf{x}\).
- Simulate \((\mathbf{y},b) = B_{\texttt{Nash}}(f,\mathbf{x})\), where \(b = 1\) if \(B_{\texttt{Nash}}\) thinks it is attacked.
- Reply with \(b' = 1\) if \(b = 1\); if \(b = 0\), reply with \(b' = 1\) when the fraction of queries on which \(f(\mathbf{x})\) and \(\mathbf{y}\) differ is high, and with \(b' = 0\) otherwise.
If \(B_{\texttt{Nash}}\) thinks it is being attacked, i.e., \(b=1\), the defense also returns \(b'=1\); this is the intuitive case. On the other hand, if \(b=0\), then the fact that \(B_{\texttt{Nash}}\) breaks unremovability implies that the error of \(\mathbf{y}\) is low. But it might be that \(f(\mathbf{x})\) has a high error even though \(\mathbf{y}\) has a low error! To detect this situation we check on what fraction of entries \(f(\mathbf{x})\) and \(\mathbf{y}\) differ,1 since a large fraction of disagreements reveals a high error of \(f(\mathbf{x})\).
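To make the reduction concrete, here is a minimal Python sketch of steps 3–5 above. It assumes a classifier `f` already learned in steps 1–2, a callable `B_nash` implementing the answering strategy from the Nash equilibrium (returning answers \(\mathbf{y}\) and a detection bit \(b\)), and a disagreement threshold `tau`; these names are illustrative and not from the paper.

```python
from typing import Any, Callable, List, Sequence, Tuple

Classifier = Callable[[Any], int]
# B_nash(f, queries) -> (answers y, bit b); b = 1 means "I think I am being attacked".
NashDefender = Callable[[Classifier, Sequence[Any]], Tuple[List[int], int]]

def nash_defense(f: Classifier,
                 queries: Sequence[Any],
                 B_nash: NashDefender,
                 tau: float = 0.5) -> int:
    """Given the published classifier f and the received queries, output
    b' = 1 iff we believe we are being attacked."""
    # Step 4: simulate the Nash answering strategy on the received queries.
    y, b = B_nash(f, queries)

    # Step 5: if B_nash already flags an attack, so do we.
    if b == 1:
        return 1

    # Otherwise y has low error (unremovability is broken), so a large
    # disagreement between f(x) and y reveals that f errs a lot on these queries.
    disagreement = sum(f(xi) != yi for xi, yi in zip(queries, y)) / len(queries)
    return 1 if disagreement > tau else 0
```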
Remark. Note that if there were multiple valid outputs for a given input then the approach of comparing \(f(\mathbf{x})\) and \(\mathbf{y}\) would not work as it would give no information! This problem would need to be taken into account in trying to generalize the Main Result to the generative (vs classification) case. The last section of this blogpost discusses this in more detail (cf. section Beyond Classification).
Figure 4. Overview of the taxonomy of learning tasks, illustrating the presence of Watermarks, Adversarial Defenses, and Transferable Attacks for learning tasks of bounded VC dimension. The axes represent the time bounds for the parties in the corresponding schemes. Blue regions depict positive results, red regions negative results, and gray regions parameter regimes that are not of interest. The blue regions correspond to the examples of Watermarks and Adversarial Defenses we prove in our paper. The curved line represents a potential application of the Main Result, which says that at least one of the three points should be blue.
Transferable Attacks are “equivalent” to Cryptography
In this section, we show that tasks with Transferable Attacks exist. To construct such examples, we use cryptographic tools. Importantly, the fact that we use cryptography is not coincidental: as a second result of this section, we show that every learning task with a Transferable Attack implies a certain cryptographic primitive. One can interpret this as saying that Transferable Attacks exist only for complex learning tasks, in the sense of computational complexity theory. Together, the two results justify why we can view Transferable Attacks and the existence of cryptography as “equivalent”.
Figure 5. The left part of the figure shows the Lines on Circle learning task \(\mathcal{L}^\circ\) with ground truth function \(h\). On the right, we define a cryptography-augmented learning task derived from \(\mathcal{L}^\circ\): under its distribution, a “clear” or an “encrypted” sample is observed with equal probability. Given their respective time budgets, both A and B are able to learn low-error classifiers \(h^A\) and \(h^B\), respectively, by training only on the clear samples. A computes a Transferable Attack by sending an encryption of a point close to the decision boundary of her classifier \(h^A\).
A Cryptography-based Task with a Transferable Attack
Our example of a Transferable Attack is such that A (the party computing \(\mathbf{x}\)) has less time than B (the party computing \(\mathbf{y}\)), specifically \(\approx 1/\varepsilon\) compared to \(\approx 1/\varepsilon^2\). Moreover, the underlying learning task is such that \(\approx \frac{1}{\varepsilon}\) time (and \(\approx \frac{1}{\varepsilon}\) samples) is enough, and \(\approx \frac{1}{\varepsilon}\) samples (and in particular that much time) is necessary, to learn a classifier of error \(\varepsilon\). The construction (illustrated in Figure 5) crucially uses a cryptographic tool called Fully Homomorphic Encryption.
Fully Homomorphic Encryption (FHE).
FHE [4] allows for computation on encrypted data without decrypting it. An FHE scheme lets one encrypt \(x\) via an efficient procedure \(e_x = \text{FHE.Enc}(x)\), so that later, for any algorithm \(C\), it is possible to run \(C\) on \(x\) homomorphically. More concretely, one can produce an encryption of the result of running \(C\) on \(x\), i.e., \(e_{C,x} := \text{FHE.Eval}(C,e_x)\). Finally, there is a procedure \(\text{FHE.Dec}\) that, given a secret key \(sk\), decrypts \(e_{C,x}\), i.e., \(y := \text{FHE.Dec}(sk,e_{C,x})\), where \(y\) is the result of running \(C\) on \(x\). Crucially, encryptions of any two messages are indistinguishable to all efficient adversaries.
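To fix the interface in code, here is a toy, completely insecure stand-in for FHE in Python. It only illustrates how Enc, Eval, and Dec fit together; the “encryption” below is a trivial placeholder, and a real construction would use an actual FHE scheme.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Ciphertext:
    payload: Any  # in a real scheme this would be an actual ciphertext

class ToyFHE:
    """Insecure placeholder mirroring the FHE interface used in the text."""

    def __init__(self) -> None:
        self.sk = object()  # stand-in for the secret key

    def enc(self, x: Any) -> Ciphertext:
        # e_x = FHE.Enc(x)
        return Ciphertext(payload=x)

    def eval(self, C: Callable[[Any], Any], e_x: Ciphertext) -> Ciphertext:
        # e_{C,x} = FHE.Eval(C, e_x): homomorphically run C on the underlying x.
        return Ciphertext(payload=C(e_x.payload))

    def dec(self, sk: Any, e: Ciphertext) -> Any:
        # y = FHE.Dec(sk, e): recover the result, given the secret key.
        assert sk is self.sk, "decryption requires the secret key"
        return e.payload
```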
Transferable Attack
The attack that A runs is defined as follows.
- Collect \(O(1/\varepsilon)\) samples from the distribution \(\mathcal{D}_{Clear}\).
- Learn a classifier \(h^A \in \mathcal{H}\) that is consistent with these samples.
- Sample a point \(x_{Bnd}\) uniformly at random from a region \(\approx \varepsilon\) close to the decision boundary of \(h^A\).
- With equal probability, set as an attack \(\mathbf{x}\) either \(\text{FHE.Enc}(x_{Bnd})\) or a uniformly random point from \(\mathcal{D}_{Clear} = U(\mathcal{X})\).2
Since the VC-dimension of the hypothesis class is 2, \(h^A\) has error at most \(\varepsilon\) with high probability.3 Because of this, \(x_{Bnd}\) is a uniformly random point from an arc containing the boundary of \(h\) (see Figure 5). The running time of B is upper-bounded by \(1/\varepsilon^2\), meaning it can only learn a classifier with error \(\gtrapprox 10\varepsilon^2\).4 Taking these two facts together, we expect B to misclassify \(\mathbf{x}\) with probability \(\approx \frac12 \cdot \frac{10\varepsilon^2}{\varepsilon} = 5 \varepsilon > 2\varepsilon\), where the factor \(\frac12\) accounts for the fact that we send an encrypted sample only half of the time. This implies transferability.
Note also that A itself is able to learn a low-error classifier on the full distribution \(\mathcal{D}\).
Note that \(\mathbf{x}\) is encrypted with the same probability as in the original distribution because we send \(\text{FHE.Enc}(x_{Bnd})\) and a uniformly random \(\mathbf{x} \sim \mathcal{D}_{Clear} = U(\mathcal{X})\) with equal probability. Crucially, \(\text{FHE.Enc}(x_{Bnd})\) is indistinguishable, for efficient adversaries, from \(\text{FHE.Enc}(x)\) for any other \(x \in \mathcal{X}\). This follows from the security of the FHE scheme. Consequently, undetectability holds.
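Putting the steps together, here is a minimal sketch of A’s side of the attack in Python, reusing the toy FHE interface above. The helpers `sample_clear`, `fit_consistent_classifier`, `sample_near_boundary`, and `sample_uniform_clear` are hypothetical and only named for illustration.

```python
import random
from typing import Any, Callable, Sequence

def transferable_attack(
    eps: float,
    fhe: Any,                                                   # ToyFHE-style interface from above
    sample_clear: Callable[[int], Sequence[Any]],               # draws labeled clear samples
    fit_consistent_classifier: Callable[[Sequence[Any]], Any],  # returns h_A consistent with them
    sample_near_boundary: Callable[[Any, float], Any],          # uniform point ~eps from h_A's boundary
    sample_uniform_clear: Callable[[], Any],                    # uniform point from U(X)
) -> Any:
    """Compute A's attack query x (here q = 1, as in the proof sketch)."""
    # Steps 1-2: collect O(1/eps) clear samples and fit a consistent classifier h_A.
    samples = sample_clear(int(10 / eps))
    h_A = fit_consistent_classifier(samples)

    # Step 3: pick a point close to h_A's decision boundary.
    x_bnd = sample_near_boundary(h_A, eps)

    # Step 4: with probability 1/2 send its encryption, otherwise a fresh clear
    # point, so the query matches the clear/encrypted mix of the distribution.
    if random.random() < 0.5:
        return fhe.enc(x_bnd)
    return sample_uniform_clear()
```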
Tasks with Transferable Attacks imply Cryptography
As mentioned before, the fact that we used cryptography in our construction is not coincidental. Indeed, we show next that a Transferable Attack for any task implies a cryptographic primitive.
EFID Pairs
In cryptography, an EFID pair [5] is a pair of distributions \(\mathcal{D}_0,\mathcal{D}_1\) that are Efficiently samplable, statistically Far, and computationally Indistinguishable. More formally, for two time bounds \(T, T'\) we call a pair of distributions \((\mathcal{D}_0,\mathcal{D}_1)\) a \((T,T')\)-EFID pair if (i) \(\mathcal{D}_0,\mathcal{D}_1\) are samplable in time \(T\), (ii) \(\mathcal{D}_0,\mathcal{D}_1\) are statistically far, and (iii) \(\mathcal{D}_0,\mathcal{D}_1\) are indistinguishable for algorithms running in time \(T'\).
Tasks with Transferable Attacks imply EFID pairs
We show that any task with a Transferable Attack implies the existence of a certain type of EFID pair. More formally, for every \(\varepsilon\), every pair of time bounds \(T \leq T'\), and every learning task \(\mathcal{L}\): if there exists a Transferable Attack A with parameters \(\varepsilon, T, T'\), then there exists a \((T,T')\)-EFID pair.
EFID-pair construction
Define \(\mathcal{D}_0 := \mathcal{D}^q\), where \(q\) is the number of samples A sends in the attack, and define \(\mathcal{D}_1\) to be the distribution of the queries \(\mathbf{x}\) output by A.
Firstly, \(\mathcal{D}_0\) and \(\mathcal{D}_1\) are samplable in time \(T\), as A runs in time \(T\). Secondly, \(\mathcal{D}_0\) and \(\mathcal{D}_1\) are indistinguishable for \(T'\)-bounded adversaries by the undetectability of A. Finally, the fact that \(\mathcal{D}_0\) and \(\mathcal{D}_1\) are statistically far follows from transferability. Indeed, a careful examination shows that the following procedure, on input \(\mathbf{x} \in \mathcal{X}^q\), is a distinguisher: (i) run the learner to obtain \(f\), (ii) set \(\mathbf{y} := f(\mathbf{x})\), and (iii) if \(\text{err}(\mathbf{x},\mathbf{y}) \leq 2\varepsilon\) return \(0\), otherwise return \(1\). On \(\mathcal{D}_0\) the learner’s correctness ensures \(\text{err}(\mathbf{x},\mathbf{y}) \leq 2\varepsilon\) with high probability, while on \(\mathcal{D}_1\) transferability forces \(\text{err}(\mathbf{x},\mathbf{y}) > 2\varepsilon\).
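A minimal sketch of this (not necessarily efficient) distinguisher, with hypothetical helpers `run_learner` for the learner of \(\mathcal{L}\) and `empirical_error` for \(\text{err}(\cdot,\cdot)\) measured against the ground truth \(h\):

```python
from typing import Any, Callable, Sequence

def efid_distinguisher(
    x: Sequence[Any],
    run_learner: Callable[[], Callable[[Any], int]],                    # returns a low-error f
    empirical_error: Callable[[Sequence[Any], Sequence[int]], float],   # err(x, y) w.r.t. h
    eps: float,
) -> int:
    """Output 0 if x looks like D_0 = D^q and 1 if it looks like the attack
    distribution D_1; this only needs to witness statistical distance."""
    f = run_learner()                # (i) run the learner to obtain f
    y = [f(xi) for xi in x]          # (ii) answer the queries with f
    # (iii) on D_0 the error is at most 2*eps w.h.p.; on D_1 transferability forces it above 2*eps.
    return 0 if empirical_error(x, y) <= 2 * eps else 1
```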
Beyond Classification
We conjecture that our results can be generalized to generative learning tasks. Instead of a ground truth function, one could consider a ground truth quality oracle \(Q\), which measures the quality of every input-output pair. This model introduces new phenomena not present in the case of classification, as hinted at in the discussion of the Main Result earlier. For example, the task of generation, i.e., producing a high-quality output \(y\) on input \(x\), is decoupled from the task of verification, i.e., evaluating the quality of \(y\) as an output for \(x\). By decoupled, we mean that there is no clear formal reduction from one task to the other. In contrast, for classification, where the space of possible outputs is small, the two tasks are equivalent. As discussed, this decoupling is the reason why the proof of the Main Result does not automatically transfer to the generative case (cf. the Proof Idea section).
This decoupling introduces new complexities, but it also suggests that considering new definitions may be beneficial. For example, because generation and verification are equivalent for classification tasks, we allowed neither A nor B access to \(h\), as it would trivialize the definitions. However, a modification of the definition of a Watermark in which B is given access to \(Q\) could be investigated in the generative case. Interestingly, such a setting was considered in [7], where access to \(Q\) was crucial for mounting a provable attack on all strong watermarks. Our example of a Transferable Attack can be seen as a task where generation is easy but verification is hard, which is the opposite of what [7] posits.
We hope that careful formalizations of the interaction and capabilities of all parties might give insights into not only the schemes considered in this work, but also problems like weak-to-strong generalization [3] or scalable oversight [6].
References
[1] Adi, Yossi, et al. “Turning your weakness into a strength: Watermarking deep neural networks by backdooring.” 27th USENIX Security Symposium (USENIX Security 18). 2018.
[2] Goldwasser, Shafi, et al. “Planting undetectable backdoors in machine learning models.” 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2022.
[3] Burns, Collin, et al. “Weak-to-strong generalization: Eliciting strong capabilities with weak supervision.” arXiv preprint arXiv:2312.09390 (2023).
[4] Gentry, Craig. “Fully homomorphic encryption using ideal lattices.” Proceedings of the 41st Annual ACM Symposium on Theory of Computing. 2009.
[5] Goldreich, Oded. “A note on computational indistinguishability.” Information Processing Letters 34.6 (1990): 277-281.
[6] Brown-Cohen, Jonah, Geoffrey Irving, and Georgios Piliouras. “Scalable AI safety via doubly-efficient debate.” arXiv preprint arXiv:2311.14125 (2023).
[7] Zhang, Hanlin, et al. “Watermarks in the sand: Impossibility of strong watermarking for generative models.” arXiv preprint arXiv:2311.04378 (2023).
Footnotes
1. Note that both \(f(\mathbf{x})\) and \(\mathbf{y}\) have \(q\) entries.
2. In this proof sketch, we have \(q = 1\), i.e., A sends only one \(x\) to B. This is not the case in the formal scheme.
3. A can also evaluate \(h^A\) homomorphically (i.e., run \(\text{FHE.Eval}\)) on \(\text{FHE.Enc}(x)\) to obtain \(\text{FHE.Enc}(y)\), achieving error \(\varepsilon\) on \(\mathcal{D}_{Enc}\) as well.
4. This is a relatively simple lower bound.