Statistical disclosure control (anonymization) - Practice Guide

A microdata dissemination challenge: Balancing data protection and data utility

Project status: Open
Sponsor(s): DFID Trust Fund No TF011722 administered by the World Bank, Development Data Group (WB-DECDG), and World Bank Knowledge for Change Program II: A microdata dissemination challenge: Balancing data protection and data utility. The Concept Note is available here: Concept Note

Thousands of surveys have been documented and catalogued through programs run by the World Bank and IHSN. Only a fraction of these datasets are openly accessible. Legal and ethical issues around data privacy and respondent identity protection still prevent many data producers, including the World Bank, from publicly releasing some of their microdata. A few agencies address the problem by applying data protection techniques to generate Public Use Files. These techniques can be very effective for generating “safe” datasets, but they may also result in significant information loss and create data of limited utility (e.g., when important variables are removed or aggregated to levels that make analysis difficult). In many cases, the use of alternative disclosure control methods would allow for a better balance between data protection and data utility. Some of these methods are complex, and the pool of experts with experience in applying them is still small. While there is substantial literature on the methods, agencies have little step-by-step practical documentation that links the methods to the tools available to implement them.

A demand exists for practical solutions and technical support for applying Statistical Disclosure Control (SDC), also known as “microdata anonymization”. This demand stems from the need or obligation of data producers to disseminate data while complying with privacy protection regulations. The provision of adequate solutions and technical support has the potential to “unlock” a large number of datasets.

Ensuring that a free, open-source solution is available to agencies (the package developed by the IHSN) was an important step forward, but not a sufficient one. There is still limited consolidated and reported knowledge on the impact of disclosure risk reduction methods on data utility. This limited access to knowledge, combined with a lack of experience in using the tools and methods, makes it difficult for many agencies to implement “optimal” solutions. By optimal we mean solutions that meet both their obligations towards privacy protection and their obligation to release data useful for policy monitoring and evaluation. This World Bank Project attempts to fill this critical gap by (i) documenting research conducted at the World Bank through a large-scale evaluation of anonymization techniques, and (ii) translating these results into practical guidelines.

Releasing data in a safe way is required to protect the integrity of the statistical system, by ensuring that agencies honor their commitment to respondents to protect their identity. Agencies rarely share with one another, in any substantial detail, their knowledge and experience of using SDC and of the processes for creating safe data. This makes it difficult for agencies new to the process to implement solutions. We consolidated knowledge from the literature as well as from our own experience to inform our discussion of the processes and methods presented in this guide. This guide focuses on the implementation of methods and uses the free R-based package sdcMicro for its examples. If you are interested in reading in detail about the theory behind the methods used, we suggest reading our accompanying guide: Statistical Disclosure Control for Microdata: Theory.