Statistical disclosure control (anonymization) - Software Development
To address the legal issue of data confidentiality, the IHSN is producing guidelines and tools on statistical disclosure control (SDC), a.k.a. microdata anonymization, in partnership with experts and practitioners. The objective is to provide practical solutions for the assessment and reduction of disclosure risk in microdata files, which will allow data producers to generate useful but safe-for-release anonymized microdata files in accordance with their own national legislation and dissemination policies. The IHSN is also commissioning case studies by the national statistics offices of the Philippines and Uruguay.
Project status: Open
Sponsor(s): World Bank Development Grant Facility, Grants No 4001010-06, 4001011-06, 4001012-06 administered by the PARIS21 Secretariat at OECD
Implemented by: PARIS21 and WB-DECDG, with IHSN consultants involved at different stages of the project
Type of output: Technical guidelines, case studies, and open-source software application
Software and user instructions are freely available; see our sections on Software - Statistical Disclosure Control and Guidelines on Microdata Anonymization.
Project description
In the last 20 years, many initiatives to develop knowledge and share expertise and resources in the field of SDC have flourished. Some of these initiatives are purely academic, others are led by national statistics offices (NSOs), and still others bring together several communities interested in SDC. Notably, much has been achieved in SDC by several European projects, starting with the 4th Framework SDC project (1996-1998) and continuing with the 5th Framework CASC project (2000-2003), the CENEX project (2006), and two ESSnet projects (2008-2013) on statistical disclosure control and remote access to microdata in a secure environment. These workstreams gave rise to Mu-Argus, which was for a long time the only software available for microdata protection. Despite these initiatives, only limited guidance and technical assistance on SDC has been made available to NSOs.
The prevalence of popular statistical analysis software in NSOs drove the IHSN to create well-documented specialized programs for Stata, SPSS, and SAS. This approach helped avoid long-term maintenance and support issues, since most users and their organizations are unwilling or unable to invest in training on new software. Moreover, the available specialized SDC software had not proved fully satisfactory, because of (1) concerns about the sustainability of its development and support; (2) poor documentation; (3) lack of user friendliness; and, most important, (4) issues of performance and relevance to large survey datasets.
With the support and involvement of various experts, the IHSN developed a collection of plug-ins written in C++ for optimal performance. The plug-ins were successfully tested with Stata 8, 9, and 10, with SPSS 16+, and at the command line on Windows and Linux. They were developed and optimized for the following anonymization techniques, which are extensively used and described in the literature:
- Risk measurement
  1. SUDA-DIS risk measurement
  2. Mu-Argus weighted sample risk measurement (individual and hierarchical)
  3. k-anonymity
  4. l-diversity
- Risk reduction
  1. Local recoding (based on maximum weighted matching algorithm)
  2. k-anonymity (using the Hilbert space filling curve)
  3. Numerical rank swapping
  4. Noise addition
  5. MDAV (fixed length micro aggregation algorithm)
  6. PRAM
  7. Sampling (implementing two sampling methods: systematic and balanced; sampling can be used to create subsamples with sampling probability depending on sensitive numeric variables or the risk measurement itself)
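To make the two simplest risk measures in the list above more concrete, here is a minimal Python sketch of k-anonymity and (distinct) l-diversity computed on a toy dataset. It is an illustration only and is not the IHSN plug-in code or the sdcMicro implementation. The function names, the variable names, and the example records are all hypothetical, and the sketch ignores sampling weights, which the weighted risk measures listed above take into account.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group record indices by their combination of quasi-identifier values."""
    classes = defaultdict(list)
    for idx, rec in enumerate(records):
        key = tuple(rec[q] for q in quasi_identifiers)
        classes[key].append(idx)
    return classes


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest equivalence class.

    A file is k-anonymous if every combination of quasi-identifier
    values is shared by at least k records.
    """
    classes = equivalence_classes(records, quasi_identifiers)
    return min(len(members) for members in classes.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values in any class.

    This is the simplest ("distinct") variant of l-diversity.
    """
    classes = equivalence_classes(records, quasi_identifiers)
    return min(
        len({records[i][sensitive] for i in members})
        for members in classes.values()
    )


# Hypothetical toy microdata (not taken from any real survey).
records = [
    {"region": "North", "age_group": "20-29", "sex": "F", "illness": "flu"},
    {"region": "North", "age_group": "20-29", "sex": "F", "illness": "asthma"},
    {"region": "North", "age_group": "30-39", "sex": "M", "illness": "flu"},
    {"region": "North", "age_group": "30-39", "sex": "M", "illness": "flu"},
    {"region": "South", "age_group": "20-29", "sex": "F", "illness": "diabetes"},
]
keys = ["region", "age_group", "sex"]

k = k_anonymity(records, keys)             # 1: the "South" record is unique
l = l_diversity(records, keys, "illness")  # 1: some classes hold one illness only

# Indices of records that violate 2-anonymity (classes with fewer than 2 members).
at_risk = [
    i
    for members in equivalence_classes(records, keys).values()
    if len(members) < 2
    for i in members
]
print(k, l, at_risk)
```

In this toy file, the single record from the "South" region is unique on its key variables, so it would be a candidate for risk reduction, for example by recoding or local suppression.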
In recent years, the statistical software environment R has become more comprehensive and more relevant for advanced statistical work in academic and official statistics circles. R is now the leading open-source statistical software, and with its increasing popularity it is becoming a standard programming language in its field. Assuming this trend continues, integrating the IHSN C++ plug-ins into R has several benefits, which include:
- The C++ code can be used within a free and open-source statistical software environment
- These new methods are provided within increasingly popular statistical software
- The integration of C++ code allows for fast computations in R
The R package sdcMicro, developed by Statistics Austria, is a well-known collection of microdata protection methods that is already in use in several national statistics offices. Over the last five years, sdcMicro has become one of the standard tools for microdata protection. The IHSN is supporting the further development of sdcMicro and has partnered with its developers to:
- Include in sdcMicro relevant methods available in the IHSN plug-ins
- Test sdcMicro on real datasets to calibrate its outputs and facilitate their interpretation
- Develop practical guidelines to support the use of a toolbox and help users navigate between methods and associated algorithms
sdcMicro already includes several popular methods for microdata anonymization; some of these methods can also be found in the IHSN C++ plug-ins. The overlapping methods have been tested and compared with their analogous implementations in sdcMicro. Three new methods (or improved implementations) have been included in sdcMicro: suda2 (i.e., finding minimal sample uniques), rank swap (i.e., numerical rank swapping), and mdav (i.e., micro-aggregation). Because the C++ code contained specific class structures and required multiple, sometimes different, header files at compile time, including these new methods in R proved to be a complex task. Figure 1 shows the computation time efficiency gains between the old and new implementations of rank swapping in sdcMicro, based on 100 runs on a 10-dimensional dataset with varying numbers of observations.
Figure 1: Computation time efficiency gains between the old and new implementations of the rank swapping algorithm
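For readers unfamiliar with numerical rank swapping, the following minimal Python sketch shows the basic idea for a single variable: values are ranked, and each value is swapped with a randomly chosen partner whose rank lies within a limited window. This is not the sdcMicro or IHSN C++ implementation. The function name `rank_swap`, the window parameter `p` (a percentage of the number of records), and the toy income data are assumptions made for illustration only.

```python
import random


def rank_swap(values, p=5.0, seed=None):
    """Swap each value with a partner at most p percent of n ranks away.

    Simplified single-variable sketch: records left without an unused
    partner inside their window keep their original value.
    """
    rng = random.Random(seed)
    n = len(values)
    # order[r] is the index of the record holding the r-th smallest value.
    order = sorted(range(n), key=lambda i: values[i])
    window = max(1, int(n * p / 100))
    swapped = list(values)
    used = [False] * n  # tracks ranks that have already been swapped
    for r in range(n):
        if used[r]:
            continue
        # Candidate partners: unused ranks just above r, within the window.
        candidates = [
            s for s in range(r + 1, min(n, r + window + 1)) if not used[s]
        ]
        if not candidates:
            continue
        s = rng.choice(candidates)
        i, j = order[r], order[s]
        swapped[i], swapped[j] = values[j], values[i]
        used[r] = used[s] = True
    return swapped


# Hypothetical toy data: 200 distinct income values.
income = random.Random(0).sample(range(1000, 100000), 200)
swapped = rank_swap(income, p=5.0, seed=1)

# Swapping only permutes values, so the marginal distribution is unchanged.
print(sorted(swapped) == sorted(income))
```

Because values are only exchanged between records, univariate statistics such as the mean and the quantiles are preserved exactly, while a smaller window keeps each record closer to its original value at the cost of weaker protection.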
Version 4.4.0 of the sdcMicro package is available on the Comprehensive R Archive Network (CRAN). The existing guidelines and the user guide for sdcMicro are being updated. A dedicated tutorial is being developed to show how to apply these concepts and algorithms to real datasets; it is being drafted with examples from the European Union SES dataset. The IHSN is promoting the adoption of sdcMicro and an associated guidelines toolbox for the creation of Public Use Files and Scientific Use Files. See our pages Software - Statistical Disclosure Control and Guidelines on Microdata Anonymization.