Reducing the disclosure risk
Statistical disclosure limitation methods can be classified into two categories:
- Methods based on data reduction. Such methods increase the number of individuals in the sample or population who share the same or similar identifying characteristics as the statistical unit under investigation. These procedures reduce the presence of unique or rare, and therefore recognizable, individuals.
- Methods based on data perturbation. Such methods achieve data protection in two ways. First, if the data are modified, re‑identification by means of record linkage or matching algorithms is more difficult and uncertain. Second, even when an intruder can re-identify a unit, he/she cannot be confident that the disclosed data are consistent with the original data.
An alternative solution consists of generating synthetic microdata.
Data reduction
Removing variables
The first obvious application of this method is the removal of direct identifiers from the data file. A variable should be removed when it is highly identifying and no other protection methods can be applied. A variable can also be removed when it is too sensitive for public use or irrelevant for analytical purpose. For example, information on race, religion, HIV, etc. may not be released in a public-use file, but they may be released in a licensed file.
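As a minimal sketch, assuming a pandas DataFrame with hypothetical column names, direct identifiers and overly sensitive variables can simply be dropped before creating the public-use file:

```python
import pandas as pd

# Hypothetical survey extract; column names are illustrative only.
df = pd.DataFrame({
    "name": ["A. Rossi", "B. Smith", "C. Diallo"],
    "national_id": ["X123", "Y456", "Z789"],
    "hiv_status": ["negative", "positive", "negative"],
    "age": [34, 51, 27],
    "income": [1200, 2300, 800],
})

# Drop direct identifiers and variables judged too sensitive for a public-use file.
public_file = df.drop(columns=["name", "national_id", "hiv_status"])
print(public_file)
```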
Removing records
Removing records can be used as an extreme measure of data protection when the unit is identifiable in spite of the application of other protection techniques. For example, in an enterprise survey dataset, a given enterprise may be the only one belonging to a specific industry. In this case, it may be preferable to remove this particular record rather than removing the variable "industry" from all records. Since it largely impacts the statistical properties of the released data, removing records must be avoided when possible.
When the records to be removed are selected according to a sampling design, the method is called sub-sampling; when the original file contains census data, it is simply called sampling.
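A sketch of both ideas, using an illustrative enterprise DataFrame: the record that is unique on "industry" is removed, and a sub-sample of the remaining records is drawn. Column names and the sampling fraction are assumptions for the example.

```python
import pandas as pd

# Illustrative enterprise file; column names are assumptions for the example.
firms = pd.DataFrame({
    "industry": ["textiles", "textiles", "mining", "retail", "retail", "retail"],
    "employees": [120, 85, 300, 15, 40, 22],
})

# Remove records that are unique on the identifying variable "industry".
counts = firms["industry"].value_counts()
protected = firms[firms["industry"].map(counts) > 1]

# Sub-sampling: release only a random subset of the remaining records.
released = protected.sample(frac=0.8, random_state=42)
print(released)
```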
Global recoding
Global recoding consists of aggregating the values observed in a variable into pre-defined classes (for example, recoding the age into five-year age groups, or the number of employees into three class sizes: small, medium, and large). This method applies to numerical variables, continuous or discrete. It affects all records in the data file.
When dealing with categorical variables (or categorized numerical variables), global recoding collapses similar or adjacent categories.
Consider, for example, the variable "marital status", which is often observed in the following categories: Single, Married, Separated, Divorced, Widowed. The sample frequency of the Separated category may be low, especially when cross-tabulated with other variables. The two adjacent categories, Separated and Divorced, can be joined into a single category called "Separated or Divorced". The frequency of combinations involving this new category would be higher than the frequencies of Separated and Divorced taken separately. Which categories can be combined depends on data utility as well as on statistical control of the resulting frequencies.
This method can also be applied to key variables, such as geographic codes, to reduce their identifying effect.
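The following sketch illustrates both forms of global recoding on hypothetical data: ages are recoded into five-year groups, and the adjacent categories "Separated" and "Divorced" are collapsed. Variable names and class boundaries are illustrative only.

```python
import pandas as pd

# Illustrative data; variable names and class boundaries are arbitrary.
people = pd.DataFrame({
    "age": [17, 23, 34, 41, 58, 72],
    "marital_status": ["Single", "Married", "Separated", "Divorced", "Widowed", "Married"],
})

# Recode the numerical variable "age" into five-year age groups.
people["age_group"] = pd.cut(people["age"], bins=list(range(0, 101, 5)), right=False)

# Collapse the adjacent, low-frequency categories "Separated" and "Divorced".
people["marital_status"] = people["marital_status"].replace(
    {"Separated": "Separated or Divorced", "Divorced": "Separated or Divorced"}
)
print(people[["age_group", "marital_status"]])
```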
Top and bottom coding
Top and bottom coding is a special case of global recoding that can be applied to numerical or ordinal categorical variables. The variables "Salary" and "Age" are two examples. The highest values of these variables are usually very rare and therefore identifiable. Top coding at certain thresholds introduces new categories such as "monthly salary higher than 6,000 dollars" or "age above 75", leaving the other observed values unchanged. The same reasoning applied to the smallest observed values defines bottom coding. When dealing with ordinal categorical variables, a top or bottom category is defined by aggregating the highest or lowest categories, respectively.
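A minimal sketch of top and bottom coding, using the thresholds mentioned above (6,000 dollars and age 75, plus an illustrative bottom threshold of 15). Because the recoded values become labels, the released variables are stored as strings.

```python
import pandas as pd

# Illustrative data; thresholds follow the examples in the text.
df = pd.DataFrame({"age": [12, 34, 51, 78, 83], "salary": [900, 2500, 6400, 7200, 1500]})

# Released variables are stored as strings because recoded values become labels.
df["age_released"] = df["age"].astype(str)
df.loc[df["age"] > 75, "age_released"] = "above 75"      # top coding
df.loc[df["age"] < 15, "age_released"] = "below 15"      # bottom coding

df["salary_released"] = df["salary"].astype(str)
df.loc[df["salary"] > 6000, "salary_released"] = "higher than 6,000"
print(df[["age_released", "salary_released"]])
```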
Local suppression
Local suppression consists of replacing the observed value of one or more variables in a certain record with a missing value. It is particularly suitable for categorical key variables, especially when combinations of values on such variables are at stake. In this case, local suppression replaces one observed value in a risky combination with a missing value. The method reduces the information content of rare combinations, resulting in an increase in the frequency count of records containing the same (modified) combination. For example, suppose the combination "Marital status=Widow; Age=17" is a population unique. If the information on Age is suppressed, the combination "Marital status=Widow; Age=missing" will no longer be identifying. Alternatively, the information on Marital status could be suppressed instead.
A criterion is therefore necessary to decide which variable in a risky combination should be locally suppressed. The primary criterion is to minimize the number of local suppressions. For example, consider the values of key variables "Sex=Female; Marital status=Widow; Age=17; Occupation=Student" observed in a unit. Both the combinations "Marital status=Widow; Age=17" and "Sex=Female; Marital status=Widow; Occupation=Student" characterize the unit and may be population unique, i.e., combinations at risk. To minimize the number of local suppressions, one can choose to replace the value of "Marital status" with a missing value, so that both combinations are protected with a single suppression; if the two combinations were treated independently, two suppressions would be required. Another criterion can be defined according to a measure of information loss (for example, the value minimizing an entropy indicator might be selected for local suppression). Moreover, suppression weights can be assigned to the key variables to drive the suppression towards less important variables.
Local suppression also requires a selection criterion for the records. Several rules defining a record at risk were presented earlier; local suppression could be applied only to such risky records, i.e., records that contain combinations at risk.
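The example above can be sketched as follows, using an illustrative DataFrame: suppressing "Marital status" in the risky record protects both combinations with a single missing value.

```python
import numpy as np
import pandas as pd

# Illustrative records; the first one contains the risky combinations discussed above.
records = pd.DataFrame({
    "sex": ["Female", "Male", "Female"],
    "marital_status": ["Widow", "Married", "Single"],
    "age": [17, 45, 17],
    "occupation": ["Student", "Farmer", "Student"],
})

# Suppress "marital_status" in the record at risk: a single missing value
# protects both risky combinations at the same time.
at_risk = (records["marital_status"] == "Widow") & (records["age"] == 17)
records.loc[at_risk, "marital_status"] = np.nan
print(records)
```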
Data perturbation
Micro-aggregation
Micro-aggregation is a perturbation technique first proposed by Eurostat as a statistical disclosure control method for numerical variables. The idea is to replace an observed value with the average computed over a small group of units (a small aggregate, or micro-aggregate) that includes the investigated unit. The units belonging to the same group are represented in the released file by the same value. Each group contains a minimum predefined number k of units; the minimum accepted value of k is 3. For a given k, the problem consists of determining the partition of the whole set of units into groups of at least k units (a k-partition) that minimizes the information loss, usually expressed as a loss of variability. The groups are therefore constructed according to a criterion of maximum similarity between units. The micro-aggregation mechanism achieves data protection by ensuring that there are at least k units with the same value in the data file.
When micro-aggregation is independently applied to a set of variables, the method is called individual ranking. When all variables are averaged at the same time for each group, the method is called multivariate micro‑aggregation.
The easiest way to group records before aggregating them is to sort the units according to a similarity criterion and to aggregate consecutive units into fixed-size groups; an adjustment of the group size may be required for the first or last group. For univariate micro-aggregation, the sorting criterion may be the variable itself.
For multivariate micro-aggregation, the similarity criterion can be based on one of the observed variables or, to increase the effectiveness of the method, on a combination of variables. For example, the first principal component or the sum of z-scores across the set of variables can serve as sorting criteria for fixed-size micro-aggregation.
Multivariate micro-aggregation is considered much more protective than individual ranking because the method guarantees that at least k units in the file are identical (all variables are averaged at the same time), but the information loss is higher.
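A sketch of fixed-size univariate micro-aggregation with k = 3: the values are sorted, consecutive values are grouped, each value is replaced by its group mean, and the last group absorbs any leftover units (the size adjustment mentioned above). The data and the choice of k are illustrative.

```python
import numpy as np

def microaggregate(values, k=3):
    """Fixed-size univariate micro-aggregation: sort the values, group
    consecutive values in groups of k, and replace each value with the mean
    of its group; the last group absorbs any leftover units."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)            # sorting criterion: the variable itself
    n = len(values)
    out = np.empty(n)
    starts = list(range(0, n - n % k, k)) or [0]
    for i, start in enumerate(starts):
        end = start + k if i < len(starts) - 1 else n   # size adjustment for last group
        idx = order[start:end]
        out[idx] = values[idx].mean()
    return out

incomes = [800, 950, 1000, 1200, 2300, 2500, 2600, 9000]
print(microaggregate(incomes, k=3))
```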
Data swapping
Data swapping was initially proposed as a perturbation technique for categorical microdata, intended to protect tabulations produced from the perturbed microdata file. It consists of altering a proportion of the records in a file by swapping the values of a subset of variables between selected pairs of records (swap pairs).
The level of data protection depends on the amount of perturbation induced in the data. A criterion must be applied to determine which variables are to be swapped and which records are involved (the swapping rate). For categorical data, swapping is frequently applied to records that are sample unique or sample rare, as these records usually present higher risks of re-identification.
Finding data swaps that provide adequate protection while preserving the exact statistics of the original database is impractical; even when the univariate moments are maintained, data swapping generally distorts the relationships between variables too much.
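A toy sketch of data swapping on a hypothetical file: the values of "region" are swapped between one randomly selected pair of records, i.e. a swapping rate of one third in this six-record example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative file; "region" is the variable selected for swapping.
df = pd.DataFrame({
    "region": ["North", "South", "East", "West", "North", "South"],
    "income": [1200, 800, 2500, 900, 1500, 700],
})

# Swap the values of "region" between randomly selected pairs of records.
n_pairs = 1  # with 6 records, one swap pair corresponds to a swapping rate of 1/3
pairs = rng.choice(len(df), size=(n_pairs, 2), replace=False)
for i, j in pairs:
    df.loc[i, "region"], df.loc[j, "region"] = df.loc[j, "region"], df.loc[i, "region"]
print(df)
```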
Post-randomization (PRAM)
As a statistical disclosure control technique, PRAM induces uncertainty in the values of some variables by exchanging them according to a probabilistic mechanism. PRAM can therefore be considered as a randomized version of data swapping. As with data swapping, data protection is achieved because an intruder cannot be confident whether a certain released value is true, and therefore matching the record with external identifiers could lead to mismatch or attribute misclassification. The method has been introduced for categorical variables, but it can be generalized to numerical variables as well.
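A minimal PRAM sketch for a categorical variable: each observed category is replaced by a category drawn from the corresponding row of a transition matrix. The matrix below is an arbitrary, diagonally dominant example, so most values stay unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

categories = ["Single", "Married", "Divorced"]
# Transition matrix: row i gives the probabilities of releasing each category
# when the true category is categories[i]; the values are arbitrary but
# diagonally dominant, so most observations remain unchanged.
P = np.array([
    [0.85, 0.10, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.10, 0.85],
])

observed = ["Single", "Married", "Married", "Divorced", "Single"]
released = [str(rng.choice(categories, p=P[categories.index(v)])) for v in observed]
print(released)
```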
Adding noise
Adding noise consists of adding a random value ε, with zero mean and predefined variance σ², to all values in the variable to be protected. Generally, methods based on adding noise are not considered very effective in terms of data protection.
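A minimal sketch of additive noise, where the standard deviation of the noise is set, purely for illustration, to 10% of the standard deviation of the variable:

```python
import numpy as np

rng = np.random.default_rng(2)

income = np.array([800.0, 1200.0, 2500.0, 900.0, 4000.0])

# Add zero-mean Gaussian noise; here sigma is set to 10% of the standard
# deviation of the variable, an arbitrary choice for illustration.
sigma = 0.1 * income.std()
noisy_income = income + rng.normal(loc=0.0, scale=sigma, size=income.shape)
print(noisy_income)
```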
Resampling
Resampling is a protection method for numerical microdata. It consists of drawing with replacement t samples of n values from the original data, sorting each sample, and releasing, for each rank, the average of the t sorted values. The level of data protection guaranteed by this procedure is generally considered quite low.
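A sketch of this resampling scheme, under the rank-averaging interpretation described above; the value of t and the data are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

def resample_mask(values, t=10):
    """Draw t samples of size n with replacement, sort each sample, and
    average the t sorted samples rank by rank; the i-th ranked original
    value is replaced by the i-th rank average."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    sorted_samples = np.sort(rng.choice(values, size=(t, n), replace=True), axis=1)
    rank_means = sorted_samples.mean(axis=0)
    masked = np.empty(n)
    masked[np.argsort(values)] = rank_means
    return masked

income = [800, 950, 1200, 2300, 2500, 9000]
print(resample_mask(income))
```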
Synthetic microdata
Synthetic microdata are an alternative approach to data protection, produced by using data simulation algorithms. The rationale for this approach is that synthetic data pose no problems with regard to statistical disclosure control because they do not contain real data, yet they preserve certain statistical properties of the original data. Rubin initially proposed generating synthetic data through multiple imputation, while Fienberg proposed using bootstrap methods. Additional approaches have since been suggested, such as Latin hypercube sampling and model-based simulation from the data distribution.
Generally, users are not comfortable with synthetic data because they cannot be confident in the results of their statistical analyses. This approach, however, can help produce "test microdata sets": synthetic files released to allow users to test their statistical procedures before accessing the "true" microdata in a data enclave.
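As a deliberately simplified illustration of the simulation idea, and not of any of the specific methods cited above, the sketch below fits a multivariate normal distribution to two numeric variables and draws synthetic records from it:

```python
import numpy as np

rng = np.random.default_rng(4)

# Original (confidential) numeric microdata: columns are age and income.
original = np.array([
    [25, 1200], [37, 2100], [52, 3400], [41, 2800], [29, 1500], [63, 4000],
], dtype=float)

# Fit a multivariate normal distribution to the observed mean and covariance
# and draw synthetic records from it: the released file contains no original
# record but preserves the first- and second-order moments in expectation.
mean = original.mean(axis=0)
cov = np.cov(original, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=len(original))
print(synthetic.round(1))
```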