De-Anonymization

We keep hearing about Big Data as the latest magic solution for all society’s ills. The sensors that surround us collect ever more data on everything we do; companies use it to work out what we want and sell it to us. But how do we avoid a future where the secret police know everything?

We’re often told our privacy will be safe, because our data will be made anonymous. But Dorothy Denning and other computer scientists discovered in about 1980 that anonymization doesn’t work very well. Even if you write software that will only answer a query if the answer is based on the data of six or more people, there are plenty of ways to cheat it. Suppose university professors’ salaries are confidential, but statistical data are published, and suppose that one of the seven computer science professors is a woman. Then I just need to ask “What’s the average salary of the computer science professors?” and “What’s the average salary of the male computer science professors?” Both queries cover at least six people, yet seven times the first average minus six times the second gives me her salary exactly. And given access to a database of “anonymous” medical records, I can query the database before and after the person I’m investigating visits their doctor and look at what changed. There are many ways to draw inferences.
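To make the first of those tricks concrete, here’s a toy sketch in Python. The names, salary figures and the six-person query threshold are all invented for illustration; the point is just the arithmetic of the differencing attack.

```python
# Toy model of the salary example above. All names and figures are invented;
# the point is the arithmetic of the differencing attack.

professors = [
    {"name": "A", "sex": "F", "salary":  95_000},   # the one woman
    {"name": "B", "sex": "M", "salary": 101_000},
    {"name": "C", "sex": "M", "salary":  88_000},
    {"name": "D", "sex": "M", "salary": 120_000},
    {"name": "E", "sex": "M", "salary":  97_000},
    {"name": "F", "sex": "M", "salary": 105_000},
    {"name": "G", "sex": "M", "salary": 110_000},
]

MIN_QUERY_SET = 6   # the database refuses queries covering fewer than six people

def average_salary(predicate):
    """Answer an 'anonymous' statistical query, enforcing the size threshold."""
    rows = [p for p in professors if predicate(p)]
    if len(rows) < MIN_QUERY_SET:
        raise ValueError("query set too small")
    return sum(p["salary"] for p in rows) / len(rows)

# Both queries pass the threshold check...
avg_all  = average_salary(lambda p: True)               # 7 people
avg_male = average_salary(lambda p: p["sex"] == "M")    # 6 people

# ...yet 7 * (average of all) - 6 * (average of the men) is exactly
# the one woman's salary.
print(round(7 * avg_all - 6 * avg_male))   # 95000
```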

For about ten years now, we’ve had a decent theoretical model of this. Cynthia Dwork’s work on differential privacy established bounds on how many queries a database can safely answer, even if it’s allowed to add some noise and permitted a small probability of failure. In the best general case, the bound is of the order of N² where there are N attributes. So if your medical record has about a hundred pieces of information about you, then it’s impractical to build an anonymized medical record system that will answer more than about ten thousand queries before a smart interrogator will be able to learn something useful about someone. Common large-scale systems, which can handle more than that many queries an hour, simply cannot be made secure—except in a handful of special cases.
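For readers who want to see where a limit like that comes from, here is a minimal sketch of a differentially private query interface. The class, its parameter values and the budget bookkeeping are invented for illustration; the only standard ingredient is the Laplace mechanism, where a counting query of sensitivity one gets noise of scale 1/ε. Every answer spends part of a fixed privacy budget, and once the budget is gone the system has to stop answering.

```python
import numpy as np

class PrivateQueryInterface:
    """Minimal sketch of a differentially private query interface.

    Not a production design: real systems need careful composition
    accounting, floating-point-safe noise, and much else besides.
    """

    def __init__(self, rows, total_epsilon=1.0):
        self.rows = rows
        self.remaining = total_epsilon   # the total privacy budget

    def noisy_count(self, predicate, epsilon=0.05):
        # Once the budget is spent, the honest thing to do is refuse:
        # answering further queries would leak more than we promised.
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon

        true_count = sum(1 for r in self.rows if predicate(r))
        # A counting query changes by at most 1 if one person's record
        # changes, so its sensitivity is 1 and the Laplace noise scale
        # is sensitivity / epsilon = 1 / epsilon.
        return true_count + np.random.laplace(scale=1.0 / epsilon)

# With a total budget of 1.0 and 0.05 spent per answer, this interface
# can field only twenty queries before it has to go silent: a caricature
# of the query bound discussed above.
records = [{"age": 43, "diabetic": True}, {"age": 61, "diabetic": False}]
db = PrivateQueryInterface(records)
print(db.noisy_count(lambda r: r["diabetic"]))
```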

One such special case may be where your navigation app uses the locations of millions of cell phones to work out how fast the traffic is moving on each stretch of road. But even this is hard to do right. You have to let programmers link up successive sightings of each phone within each segment of road to get average speeds, but if they can link sightings between segments they might be able to reconstruct all the journeys that any phone user ever made. To use anonymization effectively—in the few cases where it can work—you need smart engineers who understand inference control, and who also have the incentive to do the job properly. Both understanding and incentive are usually lacking.
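As for what doing the job properly might look like here, one plausible measure is to derive a fresh pseudonym for each phone per road segment and time window, so that sightings link within a segment but not across segments. The sketch below assumes a keyed hash (HMAC) held only by the aggregation service; the identifiers and key handling are simplified for illustration, and this is one possible design rather than how any particular navigation service actually works.

```python
import hmac, hashlib

# Held only by the aggregation service, and rotated regularly (hypothetical key).
SECRET_KEY = b"rotate-me-regularly"

def segment_pseudonym(phone_id: str, segment_id: str, time_window: str) -> str:
    """Pseudonym that is stable within one road segment and time window,
    so successive sightings there can be linked to compute an average
    speed, but that cannot be linked across segments to rebuild whole
    journeys (unless you hold the key)."""
    msg = f"{phone_id}|{segment_id}|{time_window}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:16]

# Two sightings of the same phone on the same segment link up...
assert segment_pseudonym("447700900123", "A14-seg-031", "2024-05-01T08") == \
       segment_pseudonym("447700900123", "A14-seg-031", "2024-05-01T08")

# ...but the same phone on the next segment gets an unrelated pseudonym.
assert segment_pseudonym("447700900123", "A14-seg-032", "2024-05-01T08") != \
       segment_pseudonym("447700900123", "A14-seg-031", "2024-05-01T08")
```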

A series of high-profile data scandals has hammered home the surprising power of de-anonymization. And it’s getting more powerful all the time, as we get ever more social data and other contextual data online. Better machine-learning algorithms help too; they have recently been used, for example, to de-anonymize millions of mobile phone call data records by pattern-matching them against the public friendship graph visible in online social media. So where do we stand now?
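Before trying to answer that, here’s a toy flavour of the graph-matching trick, far simpler than the published attacks: fingerprint each pseudonymous node in the call graph by its degree and its neighbours’ degrees, then look that fingerprint up in the public friendship graph. The graphs and names below are invented, and real attacks use richer features and seed nodes to break the ties that a crude fingerprint leaves.

```python
# Toy illustration of de-anonymization by graph matching. The idea: a node's
# position in the "anonymous" call graph often matches its position in a
# public friendship graph closely enough to re-identify it.

public_graph = {            # named people and who they know
    "alice": {"bob", "carol", "dave", "erin"},
    "bob":   {"alice"},
    "carol": {"alice", "dave"},
    "dave":  {"alice", "carol", "erin"},
    "erin":  {"alice", "dave"},
}

call_graph = {              # the same structure, but under pseudonyms
    "u1": {"u2", "u3", "u4", "u5"},
    "u2": {"u1"},
    "u3": {"u1", "u4"},
    "u4": {"u1", "u3", "u5"},
    "u5": {"u1", "u4"},
}

def fingerprint(graph, node):
    """A node's degree plus the sorted multiset of its neighbours' degrees."""
    return (len(graph[node]),
            tuple(sorted(len(graph[n]) for n in graph[node])))

matches = {
    pseudo: [name for name in public_graph
             if fingerprint(public_graph, name) == fingerprint(call_graph, pseudo)]
    for pseudo in call_graph
}
# Even this crude fingerprint pins down three of the five pseudonyms uniquely:
# {'u1': ['alice'], 'u2': ['bob'], 'u3': ['carol', 'erin'],
#  'u4': ['dave'], 'u5': ['carol', 'erin']}
print(matches)
```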

In fact, it’s rather reminiscent of the climate change debate. Just as Big Oil lobbied for years to undermine the science of global warming, so also Big Data firms have a powerful incentive to pretend that anonymization works, or at least will work in the future. When people complain of some data grab, we’re told that research on differential privacy is throwing up lots of interesting results and our data will be protected better real soon now. (Never mind that differential privacy teaches exactly the reverse, namely that such protection is usually impossible.) And many of the people who earn their living from personal data follow suit. It’s an old problem: it’s hard to get anyone to understand anything if his job depends on not understanding it.

In any case, the world of advertising pushes towards ever more personalization. Knowing that people in Acacia Avenue are more likely to buy big cars, and that forty-three-year-olds are too, is of almost no value compared with knowing that the forty-three-year-old who lives in Acacia Avenue is looking to buy a new car right now. Knowing how much he’s able to spend opens the door to ever more price discrimination, which although unfair is both economically efficient and profitable. We know of no technological silver bullet, no way to engineer an equilibrium between surveillance and privacy; boundaries will have to be set by other means.