FEATURE | 28 August 2013

Scrubbed the right way

Impact

Flawed though it may be, anonymisation has a crucial role to play in a data-sharing society, and data controllers must take care to safeguard the process, says Barry Ryan.


As far as the law is concerned, anonymisation is a magic bullet. Personal data is regulated, anonymised data is not. According to recent studies, however, the magical properties of anonymisation have been overstated.

What does it mean to anonymise data? It is the process of removing, obscuring, aggregating or altering identifiers to prevent identification. The term identifier is often understood to mean a formal identifier such as name, address or ID number. In principle, however, any piece of information could be an identifier. This reality is reflected in the legal definition of personal data.
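To make those four operations concrete, here is a minimal sketch in Python. The field names, the postcode truncation and the age banding are illustrative assumptions for the example, not a prescribed recipe; real disclosure control depends on the whole dataset and the environment into which it is released.

```python
# Illustrative anonymisation of a single record. Each rule removes,
# obscures, aggregates or alters one identifier, trading detail for
# anonymity.

def anonymise(record: dict) -> dict:
    out = dict(record)
    out.pop("name", None)                    # remove a formal identifier
    out["postcode"] = out["postcode"][:4]    # obscure: truncate the postcode
    out["age"] = (out["age"] // 10) * 10     # aggregate: band into decades
    out["employer"] = "REDACTED"             # alter a quasi-identifier
    return out

print(anonymise({"name": "A. Smith", "postcode": "SW1A 1AA",
                 "age": 47, "employer": "Acme Ltd"}))
# -> {'postcode': 'SW1A', 'age': 40, 'employer': 'REDACTED'}
```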

Personal data is (and I am paraphrasing here) information relating to an identifiable individual, either on its own or in combination with any other information available. Whether a piece of information identifies an individual is therefore context specific. Researchers know this from experience. The more your client knows about a population (for example its own customers), the less you can give them without compromising the anonymity of respondents.

One feature of anonymisation is that it reduces data utility. Anonymised data is less useful than detailed, accurate, fine-grained data. This creates an incentive to keep deletions or alterations in a dataset to a minimum, but doing this reliably is increasingly problematic. This caused Paul Ohm to declare in Broken Promises of Privacy that data can be either useful or anonymous; it cannot be both.

Identifiable traits
Studies in this area have established a couple of issues that highlight the problem of reliable anonymisation. The first is that we are all different. In fact, we are all different on a surprisingly basic level. Simple demographics will uniquely identify individuals even in crowded urban environments. A 2000 analysis of US Census data concluded that 87% of the US population had a unique combination of gender, date of birth and five-digit ZIP (postal) code. In the UK, a full postcode on its own will isolate an average of 15 households.
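A quick way to see this in your own data is to count how many records are unique on a chosen set of quasi-identifiers, in the spirit of a k-anonymity check. The sketch below assumes simple dictionary records with gender, dob and zip fields; those names are stand-ins for illustration, not a standard.

```python
# Count the share of records that are unique (k = 1) on a combination
# of quasi-identifiers.
from collections import Counter

def uniqueness_rate(records, keys=("gender", "dob", "zip")):
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records
                 if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

rows = [
    {"gender": "F", "dob": "1980-03-14", "zip": "02139"},
    {"gender": "M", "dob": "1975-07-02", "zip": "02139"},
    {"gender": "F", "dob": "1980-03-14", "zip": "02139"},
]
print(f"{uniqueness_rate(rows):.0%} of records are unique")  # 33%
```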


The second issue relates to our environment, which – in a connected, technologically advanced society – is a data environment: an electronic information-based environment that intersects with our physical environment. There is more data available about us than ever before, and much of that information is unique. It relates to no other person.

January’s Science featured a paper on identifying personal genomes by matching elements of the Y-chromosome with publicly available genetic genealogy databases, which contained information on surname, age and US state of residence. March’s Nature published an article demonstrating that four time-and-location points are sufficient to uniquely identify 95% of individuals from mobile location data.

The Nature study also flags something new that has entered the data environment in recent years: behavioural data. We may share a birthday or a postcode with others, but no one shares our habits. For example, no one else has your morning commute. Behavioural data is an increasing part of the data environment, as we share films we view on Netflix, holiday and travel plans on Tripit and our runs and cycles on Strava.

Trade-offs
The apparent ease with which non-obvious identifiers can be linked to individuals has created a headache for many sectors, particularly online. In response, a new concept of pseudonymous data is being proposed to place a limit on the expanding scope of personal data. In return for limiting the processing of identifiers, data controllers would be rewarded with less onerous regulatory obligations.

Traditionally, pseudonymous data is personal data that has been processed to protect the identity of data subjects, but in a way that is designed to be reversed if required; the key needed to do so is held separately and securely. The technique is commonly used in research, where identifiers are replaced with a unique code that permits re-identification but does not disclose identities to unauthorised parties.
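As a sketch of that arrangement (the identifier, the code length and the in-memory key store are assumptions for illustration; a real key store would be access-controlled, audited and held apart from the released data):

```python
# Reversible pseudonymisation: replace an identifier with a random,
# meaningless code and keep the code-to-identity key separately.
import secrets

key_store: dict[str, str] = {}  # stands in for a secured key store

def pseudonymise(identifier: str) -> str:
    for code, known in key_store.items():  # reuse a code already issued
        if known == identifier:
            return code
    code = secrets.token_hex(8)
    key_store[code] = identifier
    return code

def reidentify(code: str) -> str:
    return key_store[code]  # only parties holding the key can do this

code = pseudonymise("respondent042@example.com")
assert pseudonymise("respondent042@example.com") == code  # stable mapping
print(reidentify(code))
```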

In the current data protection debate, pseudonymous data has been seized upon as a new name for indirectly identifiable data – typically cookies and IP addresses. The problem, of course, is that with access to the right information this data can identify individuals, and that information is not subject to disclosure controls; that is, there is no rigorous process ensuring that matching cannot take place without authorisation. This has raised eyebrows in legislative circles, and threatens to undermine common-sense exemptions for responsible data-handling practices.

Sharing is caring
If anonymity is seen as a false prospectus and pseudonymity an attempt to evade regulation, the scope of data protection law becomes unlimited. In a modern society defined by mobile communications and personalisation of services, everything relates to an individual who is somehow identifiable.

By creating a stark divide between regulated personal data and unregulated anonymous data, the law has created a false perception of what it means to be anonymous. Anonymity is a process, not an outcome. Anonymity does not mean that an individual will never be identified but that it will be very hard to do so.

Anonymisation is part of the disclosure control toolkit. To preserve its availability, there is a duty on parties engaged in anonymisation to do so carefully, with regard to the environment in which data are to be released. One option is to limit the scope of that environment, by providing secure access to datasets, subject to strict conditions, similar to the Research Passport system used for NHS data.

This is not an academic debate – resolving it holds the key to a successful data-driven society, if that is what we want. Data sharing is important because it encourages debate, promotes the efficient allocation of resources in the economy, and maximises transparency and accountability. Anonymisation is a very valuable tool, allowing sensitive data to be shared while preserving privacy. It is also the premise under which researchers have sought and obtained the voluntary participation of the public for the last 60 years.

Barry Ryan is director of the MRS Policy Unit


This article was first published in Impact, the new quarterly magazine from the Market Research Society.

Includes:

  • A special report on customer experience
  • Profiles of the Tate, SABMiller and Auto Trader, showing how they use data and insight to shape strategy and decision-making
  • How the UK government’s Nudge Unit is changing policy development
  • How hackathons can help data and analytics companies innovate

3 Comments

11 years ago

And don't forget, in this googlable world, just removing someone's name, username, twittername, etc., does not make the data anonymous. Just pop any verbatim into google and you can instantly find/harass/stalk/annoy/argue with the original author. Always be very careful about what you put in research reports because anonymous does not always mean unidentifiable.


11 years ago

I think Paul Ohm's assertion that "data can be either useful or anonymous; it cannot be both" is wrong. Market researchers have been delivering highly useful, de-identified statistical data to clients for years. I do agree that it is challenging today to anonymize large datasets of online information, as there are many sources that could be used to try to re-identify individuals.

The Federal Trade Commission in the U.S. recognizes the challenges and has a view on them. When companies want to use anonymization as a magic bullet, they must take reasonable steps to de-identify data. If they then disclose the de-identified data to another company, the receiving party should agree never to attempt to re-identify the data. If the receiving party makes good on its promise, and if both parties have adequate security controls in place to ensure that the de-identified data is never breached or made public, then the risk of re-identification is infinitesimally low.

I agree with Annie's point that comments made by individuals online in public forums are not anonymous because they are searchable. Of course, verbatim comments made in different contexts (e.g. online, telephone or F2F surveys) are not Googlable. It is still wise to review verbatim comments and filter or redact content, if necessary, before putting them in research reports.


11 years ago

There is a large area where individually identifiable information is used every day, without so much as a thought by the general public. That area is software development and testing by the very companies that one does business with. Some organizations have already moved to anonymizing data for testing and sharing with offshore partners, but many have not. Is there a need for an onshore or offshore software developer to know your bank account information, your medical history or your tax payment history? Here is an area where de-identification can be used to full advantage, eliminating accidental or malicious disclosures of personal data.
