Access options
Microdata files prepared for dissemination almost always differ from those used internally by staff of data-producing agencies. Preparing raw microdata files for dissemination involves processes that may adjust the content of records, the number of records, or both. The content of records is edited by suppressing information from direct and indirect identifiers to protect the anonymity of respondents. Suppressing information does not necessarily mean removing variables, however: in some cases, recoding variables into less detailed categories to make them less informative is sufficient. Sometimes protecting anonymity also requires reducing the number of records contained in a disseminated microdata file, especially in the case of population census data. Processes to safeguard respondents’ identity are referred to collectively as Statistical Disclosure Control (SDC) or anonymization. Microdata files collected for official statistics should be disseminated if respondent confidentiality can be protected adequately. When establishing a dissemination policy, three types of files should be considered: public use files, licensed files, and files accessible only in data enclaves. These files differ in their level of accessibility to users and the extent to which they are anonymized.
“No individual (…) may claim entitlement to obtain or access identifiable data (…) by virtue of his or her employment. Access to identifiable data is not determined solely by employment status, organizational affiliation, or financial commitment. More important are the need for the identifiable data, the use to which the data will be put, and the requestor’s role and responsibility with respect to the data collection activity. Since any access to identifiable data poses risk, access to such data will be carefully evaluated and tracked after access is granted.” [National Center for Health Statistics (NCHS). 2002. “Policy on Micro-data Dissemination”.]
Public Use Files (PUF)
PUFs are available to anyone who agrees to comply with a set of simple conditions that determine what can be done with the data (e.g., data cannot be sold). In some cases PUFs are disseminated without conditions and are often available on-line. These data are made easily accessible because the risk of identifying individual respondents is considered minimal. Minimizing the risk of disclosure involves eliminating all content that can identify respondents directly—for example, names, addresses, and telephone numbers. In addition, this requires purging relevant indirect identifiers from the microdata file. These vary across survey designs, but commonly suppressed indirect identifiers include geographical information below the sub-national level at which the sample is representative. Occasionally, certain records may be suppressed from PUFs, as well as variables characterized by extremely skewed distribution or outliers. In lieu of deleting entire records or variables from microdata files, however, alternative SDC methods can minimize the risk of disclosure while maximizing information content. Such methods include top-and-bottom coding, local suppression, or data perturbation techniques.
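As a rough illustration of two of the methods named above, the following Python sketch applies top-and-bottom coding to numeric variables and a simplified, single-variable form of local suppression to a categorical one. The function names, variable names, thresholds, and sample records are all hypothetical and chosen only for illustration. In practice an agency would set thresholds after a disclosure risk assessment and would normally rely on dedicated SDC software, which typically targets rare combinations of variables rather than one variable at a time.

```python
from collections import Counter
from typing import Any, Dict, List

Record = Dict[str, Any]


def top_bottom_code(records: List[Record], var: str, lower: float, upper: float) -> List[Record]:
    """Return copies of the records with `var` clamped into [lower, upper],
    so that extreme values can no longer single out a respondent."""
    return [{**r, var: min(max(r[var], lower), upper)} for r in records]


def locally_suppress(records: List[Record], var: str, min_count: int) -> List[Record]:
    """Return copies of the records with `var` set to None (missing) wherever
    its value occurs in fewer than `min_count` records of the file."""
    counts = Counter(r[var] for r in records)
    return [
        {**r, var: (r[var] if counts[r[var]] >= min_count else None)}
        for r in records
    ]


# Hypothetical records; the bounds and the minimum count are assumptions.
sample = [
    {"age": 34, "income": 1200, "occupation": "farmer"},
    {"age": 97, "income": 900, "occupation": "farmer"},
    {"age": 41, "income": 250000, "occupation": "judge"},
]

protected = top_bottom_code(sample, "age", lower=15, upper=85)
protected = top_bottom_code(protected, "income", lower=0, upper=10000)
protected = locally_suppress(protected, "occupation", min_count=2)

for row in protected:
    print(row)
```

Both functions return new records rather than modifying the input, so the raw file stays intact and only the derived file is disseminated.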
PUFs are typically generated from household surveys and from census data files (in the case of a census, from a subset of records rather than the entire file). While it is technically possible to create PUFs for business surveys, this presents a particular set of challenges that will be addressed separately.
PUFs should be as informative as possible. As stated by the US National Center for Health Statistics (NCHS) in 2002, “the objective is to make microdata available as widely and in the most detailed form possible, subject only to limits imposed by resources, data quality, technology, and the need to protect confidentiality.”
Conditions for Accessing and Using PUFs
1. Data and other material provided by the NSO will not be redistributed or sold to other individuals, institutions, or organizations without the NSO’s written agreement.
2. Data will be used for statistical and scientific research purposes only. They will be employed solely for reporting aggregated information, including modeling, and not for investigating specific individuals or organizations.
3. No attempt will be made to re-identify respondents, and there will be no use of the identity of any person or establishment discovered inadvertently. Any such discovery will be reported immediately to the NSO.
4. No attempt will be made to produce links between datasets provided by the NSO or between NSO data and other datasets that could identify individuals or organizations.
5. Any books, articles, conference papers, theses, dissertations, reports, or other publications employing data obtained from the NSO will cite the source, in accordance with the citation requirement provided with the dataset.
6. An electronic copy of all publications based on the requested data will be sent to the NSO.
7. The original collector of the data, the NSO, and relevant funding agencies bear no responsibility for the data’s use or interpretation or inferences based upon it.
Note: Items 3 and 6 in the list require that users be provided with an easy way to communicate with the data provider. It is good practice to provide a contact number, an email address, and possibly an on-line “feedback provision” system.
Licensed Files
Licensed Files—also called Research Files—are distinct from PUFs: their dissemination is restricted to users who have received access authorization after submitting a documented application and signing an agreement governing the data’s use. While licensed files are usually anonymized to minimize the risk of identifying individuals when used in isolation, they may contain potentially identifiable data if linked with other data files.
Direct identifiers such as respondents’ names must be removed from a licensed dataset. The data files may, however, still contain indirect variables that could identify respondents by matching them to other data files such as voter lists, land registers, or school records.
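One common way to gauge this kind of matching risk is to count how many records share each combination of indirect identifiers: a record whose combination is rare within the file is easier to match against an external list. The Python sketch below is a minimal illustration under stated assumptions. The function name, the variable names, the sample records, and the threshold `k` are hypothetical, and rarity within the file is only a proxy for the real risk, which also depends on what external files exist.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence

Record = Dict[str, Any]


def risky_records(records: List[Record], keys: Sequence[str], k: int) -> List[int]:
    """Return the positions of records whose combination of values on `keys`
    is shared by fewer than `k` records in the file."""
    combos = [tuple(r[key] for key in keys) for r in records]
    counts = Counter(combos)
    return [i for i, combo in enumerate(combos) if counts[combo] < k]


# Hypothetical licensed file with names already removed.
licensed = [
    {"district": "A", "age_band": "30-39", "sex": "F"},
    {"district": "A", "age_band": "30-39", "sex": "F"},
    {"district": "B", "age_band": "80+", "sex": "M"},
]

# Assumed indirect identifiers and an assumed threshold of k = 2.
flagged = risky_records(licensed, keys=("district", "age_band", "sex"), k=2)
print(flagged)
```

Records flagged in this way are candidates for further recoding or suppression, or an argument for releasing the file only under license rather than as a PUF.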
When disseminating licensed files, it is recommended that the data producer and external bona fide users (trustworthy users with a legitimate need to access the data) sign an agreement governing access to and use of such microdata files. Occasionally, licensing agreements are entered into only with users affiliated with an appropriate sponsoring institution, such as a research center, university, or development partner.
It is further recommended that, before entering into a data access and use agreement, the data producer ask potential users to complete an application form demonstrating the need to use a licensed file instead of the PUF version, if available, for a stated statistical or research purpose. Templates for licensed file application forms and agreements are provided in Chapter 6, which discusses the conditions under which access to microdata files should be given.
Conditions for Accessing and Using Licensed Data Files
Note: Items 1 to 8 are similar to the conditions for use of public use files. Items 9 and 10 would have to be adapted in the case of a blanket agreement.
1. Data and other material provided by the NSO will not be redistributed or sold to other individuals, institutions or organizations without the NSO’s written permission.
2. Data will be used for statistical and scientific research purposes only. They will be employed solely for reporting aggregated information, including modeling, and not for investigating specific individuals or organizations.
3. No attempt will be made to re-identify respondents, and there will be no use of the identity of any person or establishment discovered inadvertently. Any such discovery will be reported immediately to the NSO.
4. No attempt will be made to produce links between datasets provided by the NSO or between NSO data and other datasets that could identify individuals or organizations.
5. Any books, articles, conference papers, theses, dissertations, reports, or other publications employing data obtained from the NSO will cite the source, in accordance with the citation requirement provided with the dataset.
6. An electronic copy of all publications based on the requested data will be sent to the NSO.
7. The NSO and the relevant funding agencies bear no responsibility for the data’s use or for interpretation or inferences based upon it.
8. An electronic copy of all publications based on the requested data will be sent to the NSO.
9. The researcher’s organization, principal, and other researchers involved in using the data must be identified. The principal researcher must sign the license on behalf of the organization. If the principal is not authorized to sign on behalf of the receiving organization, a suitable representative must be identified.
10. The intended use of the data, including a list of expected outputs and the organization’s dissemination policy, must be identified.
Note: Conditions 9 and 10 may be waived for educational institutions.
Agreement between [providing agency] and [receiving agency] regarding the deposit and use of microdata
A. This agreement relates to the following microdatasets:
1. _______________________________________________________
2. _______________________________________________________
3. _______________________________________________________
4. _______________________________________________________
5. _______________________________________________________
B. Terms of the agreement:
As the owner of the copyright in the materials listed in section A, or as duly authorized by the owner of the copyright in the materials, the representative of [providing agency] grants the [receiving agency] permission for the datasets listed in section A to be used by [receiving agency] employees, subject to the following conditions:
1. Microdata (including subsets of the datasets) and copyrighted materials provided by the [providing agency] will not be redistributed or sold to other individuals, institutions, or organizations without the [providing agency]’s written permission. Non-copyrighted materials that do not contain microdata (such as survey questionnaires, manuals, codebooks, or data dictionaries) may be distributed without further authorization. The ownership of all materials provided by the [providing agency] remains with the [providing agency].
2. Data will be used for statistical and scientific research purposes only. They will be employed solely for reporting aggregated information, including modeling, and not for investigating specific individuals or organizations.
3. No attempt will be made to re-identify respondents, and there will be no use of the identity of any person or establishment discovered inadvertently. Any such discovery will be reported immediately to the [providing agency].
4. No attempt will be made to produce links between datasets provided by the [providing agency] or between [providing agency] data and other datasets that could identify individuals or organizations.
5. Any books, articles, conference papers, theses, dissertations, reports, or other publications employing data obtained from the [providing agency] will cite the source, in accordance with the citation requirement provided with the dataset.
6. An electronic copy of all publications based on the requested data will be sent to the [providing agency].
7. The [providing agency] and the relevant funding agencies bear no responsibility for the data’s use or for interpretation or inferences based upon it.
8. An electronic copy of all publications based on the requested data will be sent to the [providing agency].
9. Data will be stored in a secure environment, with adequate access restrictions. The [providing agency] may at any time request information on the storage and dissemination facilities at the [receiving agency].
10. The [receiving agency] will provide an annual report on uses and users of the listed microdatasets to the [providing agency], with information on the number of researchers who have accessed each dataset and the output of this research.
11. This access is granted for a period of [provide information on this period, or state that the agreement is open ended].
C. Communications
The [receiving agency] will appoint a contact person who will act as the sole focal point for this agreement. Should the contact person be replaced, the [receiving agency] will immediately communicate the name and contact details of the new contact person to the [providing agency]. Communications for administrative and procedural purposes may be made by email, fax, or postal mail as follows:
Communications made by [providing agency] to [receiving agency] will be directed to:
Name of contact person:
Title of contact person:
Address of the receiving agency:
Email:
Tel:
Fax:
Communications made by [receiving agency] to [providing agency] will be directed to:
Name of contact person:
Title of contact person:
Address of the providing agency:
Email:
Tel:
Fax:
D. Signatories
The following signatories have read and agree with the Agreement as presented above:
Representative of the [providing agency]
Name ____________________________________________________
Signature _______________________________ Date ______________
Representative of the [receiving agency]
Name ____________________________________________________
Signature _______________________________ Date ______________
Files Accessible in a Data Enclave
Some files may be offered to users under strict conditions in a data enclave. This is a facility equipped with computers that are not linked to the Internet or an external network and from which no information can be copied via USB ports, CD/DVD drives, or other removable media. Data enclaves contain data that are particularly sensitive or allow direct or easy identification of respondents. Examples include complete population census datasets, enterprise surveys, and certain health-related datasets containing highly confidential information.
Users interested in accessing a data enclave may not have access to the full dataset, but only to the particular data subset they require. They will be asked to complete an application form demonstrating a legitimate need to access these data to fulfill a stated statistical or research purpose (see Chapter 6 for an example). Outputs generated must be scrutinized via a full disclosure review before release.
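One check that often forms part of such a disclosure review is a minimum cell count rule for frequency tables: any non-empty cell based on very few respondents is flagged before the output leaves the enclave. The Python sketch below illustrates the idea only. The function name, the table, and the threshold of 5 are assumptions rather than rules stated in this document, and a full review would cover other kinds of output (such as model results, extreme values, and graphs) as well.

```python
from typing import Dict, Hashable, List


def small_cells(table: Dict[Hashable, int], min_count: int = 5) -> List[Hashable]:
    """Return the labels of non-empty cells whose count is below `min_count`."""
    return [label for label, n in table.items() if 0 < n < min_count]


# Hypothetical frequency table produced by a researcher in the enclave.
output_table = {
    ("District A", "employed"): 412,
    ("District A", "unemployed"): 37,
    ("District B", "employed"): 3,
    ("District B", "unemployed"): 0,
}

needs_review = small_cells(output_table, min_count=5)
print(needs_review)
```

Cells flagged in this way would be merged with neighbouring categories or suppressed by reviewing staff before the table is released.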
Operating a data enclave may be expensive. It requires special premises and computer equipment, and staff with the skills and time to review outputs before their removal from the data enclave to ensure there is no risk of disclosure. Such staff must be familiar with data analysis and able to review the request process and manage file servers.
Because of the substantial operating costs and technical skills required, some statistical agencies or other official data producers opt to collaborate with academic institutions or research centers to establish and manage data enclaves. Examples of data enclaves with informative websites include the Michigan Census Research Data Center (MCRDC), a joint project of the US Census Bureau and the University of Michigan (www.isr.umich.edu/src/mcrdc/); the National Opinion Research Center (NORC) at the University of Chicago (www.norc.org/DataEnclave); the Research Data Centres (RDC) program of Statistics Canada (www.statcan.gc.ca/rdc-cdr/index-eng.htm); and the US NCHS Research Data Center (http://www.cdc.gov/nchs).