Poisoning deep learning algorithms

Up to this point, when we have talked about adversarial attacks on machine learning algorithms, it has been specifically in the context of an existing, fixed model. Early work in this area assumed a process where an attacker had access to test examples after capture (e.g., after a photograph was taken) but before they were sent to the model under attack, and tended to focus on changes to inputs that were imperceptible to humans but caused catastrophic inference failures (for more, see "What is adversarial machine learning?"). Concrete examples of this scenario include a user taking a photograph on their phone and applying an adversarial attack to it before uploading it to a public photo sharing website, or a user leaving a review online for a product or service and adding adversarial text triggers before clicking submit.

A couple of years after adversarial attacks were first described, we started seeing strategies for attack scenarios where the electronic version of the model input was no longer under the attacker's control. In this scenario, the attacker needs to modify the physical environment in a way that survives being captured as data by the target system, which rules out most "imperceptible" changes. Instead, these attacks focused on highly perceptible changes that were either small in their total area or otherwise appeared to be benign (for more, see "Fooling AI in real life with adversarial patches"). So: noticeable, but not noticed.

In both cases, these attacks have a significant disadvantage in the form of the model itself. Because the model is outside of the attacker's control, someone defending against an attack can retrain the model to be more robust to adversarial examples, or replace the model with a cascade architecture that has one or more explicit defense components (more on how these work in a future blog post).

The good news (from the point of view of the attacker!) is that modern deep learning models tend to be built from publicly available data, typically scraped from the web with little or no review or filtering. This means that in many cases, it's possible to change the behavior of a model just by uploading data to the right public websites. Changing the behavior of a model by injecting adversarial training data is called data poisoning, or a poisoning attack.

One interesting avenue that data poisoning opens up, compared to adversarial examples, is that an attacker can degrade the performance of a model on all later examples, not only specially crafted ones. This is called an availability attack. Another interesting idea is injecting training data that ties the misbehavior to a specific pattern of inputs, so that the model looks fine most of the time and only misbehaves on demand. These are called backdoor attacks.
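
To make the backdoor idea concrete, here is a minimal sketch (mine, not taken from any particular paper) of how an attacker might prepare poisoned training data: stamp a small, fixed trigger patch onto a fraction of the images and relabel them to a target class. The function names, patch location, size, value, and poisoning rate are all arbitrary illustrative choices.

```python
import numpy as np

def add_backdoor_trigger(image, trigger_value=1.0, size=3):
    """Stamp a small, fixed patch into the bottom-right corner of an image.

    `image` is assumed to be an (H, W, C) float array in [0, 1]; the patch
    location, size, and value are illustrative choices.
    """
    poisoned = image.copy()
    poisoned[-size:, -size:, :] = trigger_value
    return poisoned

def poison_dataset(images, labels, target_label, poison_fraction=0.05, seed=0):
    """Return a copy of the dataset with a fraction of backdoored examples.

    Each poisoned image gets the trigger patch and is relabeled to
    `target_label`, so a model trained on this data learns to associate the
    patch with that class while behaving normally on clean inputs.
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_fraction)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        images[i] = add_backdoor_trigger(images[i])
        labels[i] = target_label
    return images, labels
```

At inference time, any input carrying the same patch tends to be pushed toward the target class, while inputs without it are largely unaffected, which is what makes backdoors hard to spot with ordinary accuracy metrics.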

The "why" of this is pretty straight forward: deep learning models interpolate over training data, so if you perturb the training data you can perturb the model. In other words, garbage in means garbage out. How to actually do it turns out to be more complicated.

Let's make this a bit more explicit: the objective of the attack is to provide some set of modified inputs, X', that shift a model's internal, learned parameters from W to W', with the hope of changing that model's behavior in a specified way on clean data later: f(X, W') -> y'. This means that you, the attacker, need to understand how changes in W relate to changes in y, and how changes in X relate to changes in W, so that you can chain 𝚫X -> 𝚫W -> 𝚫y.
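
Written as an optimization problem (my notation, not the post's, though it matches the bilevel framing common in the poisoning literature, including the paper cited at the end): the attacker picks X' to maximize an attack objective A measured on clean data, subject to W' being whatever ordinary training on the poisoned dataset produces.

```latex
% Bilevel view of a poisoning attack (notation assumed, not the post's):
% the outer problem belongs to the attacker; the inner problem is ordinary
% training on the training set augmented with the poisoned points X'.
\begin{aligned}
X'^{\ast} &= \arg\max_{X'} \; A\bigl(X_{\text{clean}},\, W'(X')\bigr) \\
\text{s.t.}\quad W'(X') &= \arg\min_{W} \; L\bigl(X_{\text{train}} \cup X',\, W\bigr)
\end{aligned}
```

The hard part is the constraint: W' is only defined implicitly, as the result of a whole training run on data that includes X'.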

Under a set of very strong assumptions (e.g., convex loss, binary outcome, a stationary relationship between 𝚫X and 𝚫W) you can compute a closed-form solution for generating the attack. Unfortunately, this largely restricts you to logistic regression. There was a lot of research in the late 2000s about extending these attacks to support vector machines, which was awesome, but which we won't be talking about here. Instead, we're going to talk about applying this to deep learning models.

The problem formulation is still the same, except that now we are in a framework where we have direct access to the gradient of the loss with respect to W (and can get the gradient with respect to X very cheaply). At first glance, this might seem like we have solved the problem, but recall that we don't need 𝚫X -> 𝚫y, we need 𝚫X -> 𝚫W, which is not the same thing. It turns out there is a formula for computing it directly (again, under certain assumptions), but most models have very large sets of internal parameters (gigabytes of them!), and the runtime complexity of those approaches is cubic in the number of parameters, which makes them a non-starter for deep learning problems.
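
For the curious, that direct formula comes from implicitly differentiating the condition that training has converged. A sketch in my own notation, assuming the inner training loss L reaches a stationary point in W:

```latex
% Implicit differentiation of the optimality condition \nabla_W L(X', W') = 0
% (my notation; assumes training converges to a stationary point):
\frac{\partial W'}{\partial X'} = -\bigl(\nabla^{2}_{W} L\bigr)^{-1}\,\nabla_{X'}\nabla_{W} L
```

Chaining this with the gradient of the output gives the full 𝚫X -> 𝚫W -> 𝚫y path, but it requires inverting (or at least factorizing) the p × p Hessian of the training loss, which scales as roughly O(p^3) in the number of parameters p; that is the cubic cost referred to above.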

What we can do instead is estimate 𝚫X -> 𝚫W during the training process.[1] Concretely, we can take a perturbed training example, X', and put it through a few iterations of gradient descent. We can directly calculate the change in W before and after seeing X', and use this to estimate the effect of that perturbed example on both the weights and the resulting predictions from the model. We can then use that estimate to update our adversarial perturbations and proceed with the model training, stopping to do this estimate-gradient-and-update step for each perturbed sample (we can skip it for all the regular data points).
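
To make the unrolling idea concrete, here is a minimal PyTorch sketch. It is my own simplification, not the paper's exact back-gradient algorithm (which reverses the learning dynamics rather than storing the unrolled computation), and it uses a toy logistic-regression model with made-up data so the functional weight updates stay short; all names, sizes, and hyperparameters are illustrative.

```python
import torch

torch.manual_seed(0)
d = 10                                             # input dimension (illustrative)
X_clean = torch.randn(100, d)                      # stand-in clean training data
y_clean = (X_clean[:, 0] > 0).float()              # stand-in labels
X_val, y_val = torch.randn(50, d), torch.zeros(50) # clean points the attacker targets

x_poison = torch.randn(1, d, requires_grad=True)   # the perturbed example X'
y_poison = torch.ones(1)                           # its attacker-chosen label

def loss_fn(w, X, y):
    # Plain logistic regression so the weight updates can stay functional.
    logits = X @ w
    return torch.nn.functional.binary_cross_entropy_with_logits(logits, y)

for attack_step in range(20):
    # Unroll a few inner training steps, keeping each weight update in the
    # autograd graph so gradients can flow from the final weights back to x_poison.
    w = torch.zeros(d, requires_grad=True)
    for _ in range(5):
        X_batch = torch.cat([X_clean, x_poison])
        y_batch = torch.cat([y_clean, y_poison])
        inner_loss = loss_fn(w, X_batch, y_batch)
        (g,) = torch.autograd.grad(inner_loss, w, create_graph=True)
        w = w - 0.1 * g                            # functional update, stays differentiable

    # Attacker objective: maximize the trained model's loss on the clean
    # validation points (minimize its negative).
    attack_loss = -loss_fn(w, X_val, y_val)
    (g_poison,) = torch.autograd.grad(attack_loss, x_poison)
    with torch.no_grad():
        x_poison -= 0.5 * g_poison                 # update the poisoned example
```

The key trick is that the inner weight updates are ordinary differentiable tensor operations (note `create_graph=True`), so the gradient of the attacker's objective can flow back through a few steps of training to the poisoned input. The paper's back-gradient method makes this tractable at deep-learning scale by reversing the gradient-descent-with-momentum updates instead of storing the whole unrolled trajectory; this sketch just unrolls naively, which is fine for a toy model.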

Would you like to know more? You can read “Towards Poisoning of Deep Learning Algorithms with Back-gradient Optimization” (with actual equation formatting) on the arXiv.

This blog post first appeared on adversarial-designs.


  1. L. Muñoz-González et al., “Towards Poisoning of Deep Learning Algorithms with Back-gradient Optimization,” arXiv:1708.08689 [cs], Aug. 2017. Available: http://arxiv.org/abs/1708.08689