Anonymization principles
Anonymizing a micro dataset consists of removing or modifying its identifying variables. "Typically an identifying variable is one that describes a characteristic of a person that is observable, that is registered (identification numbers, etc.), or generally, that can be known to other persons." (µ Argus Manual)
Identifying variables include:
- Direct identifiers, which are variables such as names, addresses, or identity card numbers. They permit direct identification of a respondent but are not needed for statistical or research purposes, and should therefore be removed from the published dataset.
- Indirect identifiers, which are characteristics that may be shared by several respondents, and whose combination could lead to the re-identification of one of them. For example, the combination of variables such as district of residence, age, sex, and profession would be identifying if only one individual of that particular sex, age, and profession lived in that particular district. Such variables are needed for statistical purposes, and should therefore not be removed from the published data files.
Anonymizing the data involves determining which variables are potential identifiers (a decision based on personal judgment) and reducing the specificity of these variables until the risk of re-identification reaches an acceptable level. The challenge is to maximize protection against re-identification while minimizing the resulting information loss.
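The effect of combining indirect identifiers, and of reducing their specificity, can be illustrated with a small sketch. The records, the variable choice, and the helper names (`count_unique`, `recode_age`) are hypothetical and chosen only for illustration; the sketch counts how many records carry a combination of district, age, sex, and profession that no other record shares, before and after exact age is recoded into ten-year bands.

```python
from collections import Counter

# Hypothetical toy records: (district, age, sex, profession).
records = [
    ("North", 34, "F", "teacher"),
    ("North", 36, "F", "teacher"),
    ("North", 35, "M", "farmer"),
    ("South", 52, "M", "farmer"),
    ("South", 58, "M", "farmer"),
    ("South", 41, "F", "nurse"),
]


def count_unique(rows):
    """Count records whose combination of key values occurs exactly once."""
    freq = Counter(rows)
    return sum(1 for row in rows if freq[row] == 1)


def recode_age(row, width=10):
    """Replace exact age with a band of the given width, e.g. 34 -> '30-39'."""
    district, age, sex, profession = row
    low = (age // width) * width
    return (district, f"{low}-{low + width - 1}", sex, profession)


unique_before = count_unique(records)           # every record is unique on exact age
recoded = [recode_age(row) for row in records]
unique_after = count_unique(recoded)            # fewer unique records after banding

print(unique_before, unique_after)
```

In this toy file all six records are unique on the exact key, and two remain unique after age is banded. The price is information loss: exact ages are no longer available for analysis. Uniqueness in a sample file is only a rough indicator, since a record that is unique in the sample may not be unique in the population.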
Disclosure risk depends not only on the presence of identifying variables in the dataset, but also on:
- The existence of an intruder, which in turn depends on the potential benefit this intruder would reap from re-identification. For some types of data such as business data, the intruder's motivation can be high. For other types, like household surveys in developing countries, the motivation would typically be much lower as there is little to gain in re-identifying respondents.
- The cost of re-identification. The higher the cost, the lower the net benefit to an intruder, and therefore the lower the likelihood of an attempt.
To account for these parameters, a disclosure scenario must be defined as a first step in the anonymization process. Scenarios can be classified into two categories:
- Nosy neighbor scenarios. These scenarios assume that the intruder has enough information to recognize one or more units, and that this information stems from the intruder's personal knowledge. In other words, the intruder belongs to the circle of acquaintances of a statistical unit.
- External archive scenarios. Such scenarios are based on the assumption that the intruder can link records belonging to the distributed dataset to records from another available dataset or register, which contains direct identifiers. The intruder does so by using identifying variables available in both datasets as merging keys (i.e., data matching). Conservative assumptions are often made to define a worst-case scenario.
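An external archive scenario can be sketched as a simple record linkage. Everything in the example is hypothetical: the released file, the external register, the names, and the choice of district, age, and sex as merging keys. The sketch treats a register entry as re-identified only when its key combination matches exactly one released record, which is one possible worst-case reading that assumes both files are free of errors.

```python
from collections import defaultdict

# Hypothetical released file: direct identifiers already removed.
released = [
    {"district": "North", "age": 34, "sex": "F", "income": 1200},
    {"district": "North", "age": 35, "sex": "M", "income": 800},
    {"district": "South", "age": 52, "sex": "M", "income": 950},
    {"district": "South", "age": 52, "sex": "M", "income": 400},
]

# Hypothetical external register that still contains direct identifiers.
register = [
    {"name": "A. Example", "district": "North", "age": 34, "sex": "F"},
    {"name": "B. Example", "district": "South", "age": 52, "sex": "M"},
    {"name": "C. Example", "district": "East", "age": 20, "sex": "F"},
]

KEYS = ("district", "age", "sex")  # identifying variables present in both files


def link(released_rows, register_rows, keys):
    """Return {name: released record} for register entries with exactly one match."""
    index = defaultdict(list)
    for rec in released_rows:
        index[tuple(rec[k] for k in keys)].append(rec)

    matches = {}
    for person in register_rows:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match is a re-identification
            matches[person["name"]] = candidates[0]
    return matches


matches = link(released, register, KEYS)
print(matches)
```

Here only the first register entry is linked to a single released record, which exposes that record's income. The second entry matches two released records, so the intruder cannot tell them apart, and the third has no match at all.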
When producing microdata files, one should always keep the user perspective in mind. It is fundamental that the released file meets the researcher's requirements. Both information content and the choice of protection methods must focus as much as possible on the user's needs. Knowing what statistical analysis the user wants to perform helps in deciding the anonymization strategy.