
Anonymisation and data protection

How can I share research data without breaking the law?

The Data Protection Act 1998 (DPA) governs the processing of personal data relating to living individuals in the UK. The Act requires that personal data be handled fairly, proportionately, securely, and with justification when obtaining, using, holding and sharing personal information. LSE has guidance [PDF] on how to meet these principles in the context of research.

There are two additional things to remember about the Act in relation to research. First, it applies only to personal or sensitive personal data, not necessarily to all data gathered from a participant. Second, the Act contains exemptions from the specified-purpose and retention principles where personal data are processed for research.

Anonymised data that cannot be linked to a living individual is not subject to the Data Protection Act.

What counts as “anonymised” is measured by a “likely reasonably” test. The UK’s Information Commissioner’s Office states: "Anonymisation is the process of turning data into a form which does not identify individuals and where identification is not likely to take place." This means that if, on the balance of probabilities, third parties cross-referencing “anonymised” data with information or knowledge already available to the public cannot identify individuals, then the data is not personal data and falls outside the Act.

From 25 May 2018 the Data Protection Act will be replaced by the European Union General Data Protection Regulation.

How can I anonymise research data to protect participants?

  • Anonymisation covers both direct identifiers and indirect identifiers which, in combination, can identify an individual.
  • Plan and apply anonymisation early in your research and log changes so it is clear what is anonymised.
  • Anonymisation tools are available.
  • Consider alternative access options like controlled access environments and restrictive licences where sharing anonymised data is problematic.

Anonymisation usually applies to both direct and indirect identifiers. Direct identifiers, like a name, address, or telephone number, identify an individual on their own. Indirect identifiers can reveal an individual when pieced together, for example by cross-referencing occupation, employer, and location.

If data requires anonymising, it is critical to think early about how you will construct and implement a strategy to protect participants’ identities. Planning anonymisation before undertaking data collection both produces better informed consent and makes the anonymisation itself less resource intensive.

Given the strength of the DPA, it is worth questioning what data you plan to collect and why.

Knowing what data you wish to collect will help guide an anonymisation strategy consistent across your data set and produce ethically responsible, reusable data that does not contravene data protection laws. For example, administrative data like names and addresses may not have research value but constitute personal and sensitive information. Do they need to be collected? If so, can they be separated from the research data set and deleted early in the research process?

Remove direct identifiers or use meaningful pseudonyms and replacements for identifiers. Ideally, replacements should be expressive in the sense of preserving the character of the identifier while concealing the identity. For example, instead of “Birmingham” use “Major British metropolitan area” and instead of “Scott” use “Trevor”. This is preferable to replacing identifiers with “City” or “Name” or, worst of all, “deleted”.

If using pseudonyms is unworkable, could you apply restrictions on upper and lower ranges of variables? Can you remove a variable without compromising the re-use value of the data (in which case, ask whether you should even measure that variable)? Could you apply low-level aggregation of data, like moving to a larger spatial unit or transforming age from a continuous variable into a discrete categorical one? Can dates, times, or measurements be rounded?
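As a sketch, several of these techniques can be applied programmatically. The record, field names, and category boundaries below are illustrative assumptions, not prescriptions from any particular tool:

```python
from datetime import date

# Illustrative participant record; the field names and values are
# invented for this sketch.
record = {
    "name": "Scott",
    "city": "Birmingham",
    "age": 37,
    "interview_date": date(2017, 3, 14),
}

# Expressive replacements that preserve the character of the identifier.
pseudonyms = {"Scott": "Trevor"}
generalisations = {"Birmingham": "Major British metropolitan area"}

def anonymise(rec):
    out = dict(rec)
    out["name"] = pseudonyms.get(rec["name"], "Participant")
    out["city"] = generalisations.get(rec["city"], "UK urban area")
    # Aggregate a continuous variable into a discrete 10-year band.
    decade = (rec["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"
    # Round dates to the month to reduce identifiability.
    out["interview_date"] = rec["interview_date"].replace(day=1)
    return out

print(anonymise(record))
```

The same functions can then be applied uniformly across the whole data set, which helps keep the anonymisation strategy consistent.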

In any case, it is best practice to create a log of anonymisation undertaken and to flag anonymised identifiers so it is clear that something is anonymised.
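A minimal anonymisation log can be kept alongside the data. The CSV layout and the "[ANON]" flag below are one possible convention, assumed for illustration:

```python
import csv
import io

# Each entry records which field was changed, how, and why, so the
# anonymisation is auditable and flagged as such.
log_entries = [
    {"field": "name", "identifier_type": "direct",
     "action": "replaced with expressive pseudonym", "marker": "[ANON]"},
    {"field": "age", "identifier_type": "indirect",
     "action": "banded into 10-year categories", "marker": "[ANON]"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["field", "identifier_type", "action", "marker"]
)
writer.writeheader()
writer.writerows(log_entries)
print(buffer.getvalue())
```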

There are a number of open-source tools developed to help researchers anonymise research data.

Finally, can data be shared in a restricted environment? Most standard archive and data re-use agreements have a clause prohibiting third party attempts to either identify or re-contact participants. In other cases, controlled access environments and applying approved researcher status may be a way to responsibly share research data to a limited extent.

Of course, the principle of informed consent allows participants to waive their right to anonymity should they wish, provided that, in the researcher’s judgment, no harm will result and no other legal reason exists to prevent waiving anonymity. In oral history or elite interviews, the participant’s memories, perceptions and experiences are tied to their identity. Consequently, data from these approaches is not anonymised, even if it is subject to tighter access conditions or an embargo period.

The European Union General Data Protection Regulation and research

  • The GDPR is a European Union wide regulation on data protection, replacing existing data protection laws from May 2018.
  • The regulation doesn't substantively differ from existing UK law on data protection and research. 
  • The regulation includes tougher penalties for data protection breaches and stronger requirements for obtaining informed consent. 

From 25 May 2018 data protection in the UK will be governed by a new European Union wide regulation: the General Data Protection Regulation (GDPR) [PDF].

For researchers the regulation does not substantively differ from current UK law. It preserves exemptions for the reuse of research data but tightens the principle of informed consent and contains stronger penalties for data protection breaches.

The regulation reinforces the requirement on those collecting data to safeguard personal data. Safeguards include technical and organisational barriers to access, like encryption, authentication requirements and user licences, or applying anonymisation or pseudonymisation that would "no longer permit the identification of data subjects" (Article 89, 1). Significant fines can be imposed for breaches where data could have been better protected and was not.
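One common technical safeguard is keyed pseudonymisation, where identifiers are replaced with values that cannot be reversed without a separately held secret key. A minimal sketch using Python's standard library; the key value and truncation length are illustrative assumptions:

```python
import hmac
import hashlib

# The secret key must be stored separately from the data set (and
# destroyed if identification should become permanently impossible).
SECRET_KEY = b"store-this-key-separately"  # illustrative placeholder

def pseudonymise(identifier: str) -> str:
    """Replace an identifier with a stable, non-reversible pseudonym."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]

# The same input always maps to the same pseudonym, so records can
# still be linked across files without revealing the identity.
print(pseudonymise("Scott") == pseudonymise("Scott"))
print(pseudonymise("Scott") != pseudonymise("Sarah"))
```

Because the mapping is stable, linkage between data files is preserved; because it is keyed, a third party without the key cannot recompute it from the names alone.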

The publicised "right to be forgotten" is in the regulation (Article 17). However, that right does not apply where personal data is necessary for "archiving purposes in the public interest or for scientific, statistical and historical purposes", nor can it apply if the data is anonymised or pseudonymised to prevent identification of living individuals.

Article 9, 1 adds genetic data and biometric data to the list of personal data characteristics, with allowance for member states to "maintain or introduce further conditions, including limitations" in relation to these characteristics (Article 9, 4).

Informed consent is a strong theme throughout the regulation. It states lawful processing of personal data is based on consent (Article 6, 1), including a requirement to be able to demonstrate consent has been given (Article 7, 1) and that subjects have the right to withdraw consent (Article 7, 3). "Informed" means the subject is aware of who is collecting the data and why. The Regulation states: "Consent should not be regarded as freely given if the data subject has no genuine or free choice or is unable to refuse or withdraw consent without detriment." (recital 42)

The regulation does allow for exemptions on revealing personal data where explicit consent is allowed by law and has been given by the subject (Article 9, 2). The regulation also defines consent as being lawful from the age of 16, although EU states can lower that age to as low as 13 (Article 8, 1).

If you're planning a research project it's important to factor in the effects of the new regulation in your consent and data protection planning.

Privacy Impact Assessment

You may be asked by a data supplier, public body or private organisation to provide a Privacy Impact Assessment (PIA). A PIA is a process used to identify any potential privacy risks and describe actions to address them. LSE has a template [.docx] to help you write a Privacy Impact Assessment with support from IMT Information Security and Governance, Legal and Policy Information Rights teams.

Further reading

Finnish Social Science Data Archive Anonymization and Identifiers

Irish Qualitative Data Archive (2008) Anonymisation Guidelines [PDF]

Information Commissioner's Office (2012) Anonymisation: Managing Data Protection Risk Code of Practice [PDF]

JISC Legal (2014) Data Protection and Research Data: Questions and Answers

UK Anonymisation Network (2016) Anonymisation Decision-Making Framework