Scope of the documentation

What information should be provided?

The following information is drawn from Good Practices in Data Documentation, UK Data Archive, University of Essex. See also the IHSN "Quick Reference Guide for Data Archivists."

Ideally, a dataset is accompanied by three primary types of documentation: explanatory material, contextual information, and cataloguing material.

1. Explanatory material

This is the minimum material that should be created and preserved to ensure the long-term viability and functionality of a dataset and a full understanding of its contents.

  • Information about the data collection methods
  • This section describes the data collection process, whether it is a survey, collection of administrative information, or transcription of a document source. It should describe the instruments used and methods employed, and how they were developed. If applicable, details of the sampling design and sampling frames should be included. It is also useful to include information on any monitoring process undertaken during the data collection as well as details of quality controls.

  • Information about the structure of the dataset
  • Key to this information is a detailed document that describes the structure of the dataset and the relationships between individual files or records within the study. For example, it should include the key variables required to uniquely identify subjects across files, the number of cases and variables in each file, and the number of files in the dataset. For relational models, a diagram should be constructed showing the structure of, and relations between, datasets, records, and elements.
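
As an illustration only, the structural facts listed above (cases and variables per file, and whether a key variable uniquely identifies records) could be generated directly from the data. The sketch below assumes a hypothetical two-file study; the file names, the variables hh_id and person_id, and the helper describe_file are invented for the example and do not come from the source guidance.

```python
from collections import Counter


def describe_file(name, rows, key):
    """Summarise one file: case count, variable count, and whether `key` is unique."""
    variables = sorted({var for row in rows for var in row})
    key_counts = Counter(row[key] for row in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    return {
        "file": name,
        "cases": len(rows),
        "variables": len(variables),
        "key": key,
        "key_is_unique": not duplicates,
        "duplicate_keys": duplicates,
    }


# Hypothetical two-file study: households and the persons within them.
households = [
    {"hh_id": 1, "region": "North"},
    {"hh_id": 2, "region": "South"},
]
persons = [
    {"person_id": 11, "hh_id": 1, "age": 34},
    {"person_id": 12, "hh_id": 1, "age": 7},
    {"person_id": 21, "hh_id": 2, "age": 52},
]

summary = [
    describe_file("households", households, key="hh_id"),
    describe_file("persons", persons, key="person_id"),
]

# Every person should link to a household that exists in the household file.
household_ids = {row["hh_id"] for row in households}
orphans = [row["person_id"] for row in persons if row["hh_id"] not in household_ids]

for entry in summary:
    print(entry)
print("persons without a matching household:", orphans)
```

A summary of this kind can be pasted into the structure document, and a non-empty list of duplicate keys or unmatched records is itself worth documenting.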

  • Technical information
  • This information relates to the technical framework and should include the computer system used to generate the files; the software packages with which the files were created; the medium on which the data was stored; and a complete list of all data files present in the dataset.

  • Variables and values, coding and classification schemes
  • The documentation should contain a full list describing all variables or fields in the dataset, including a complete explanation and full details about the coding and classifications used for the information allocated to those fields. It is especially important to have blank and missing fields explained and accounted for. It is helpful to identify variables to which standard coding classifications apply, and to record the version of the classification scheme used, preferably with a bibliographic reference to that code.
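
One possible way to keep such a list machine-readable is a small codebook structure in which substantive codes and missing-value codes are recorded separately. The following is a minimal sketch, not a prescribed format: the variable marital_status, its codes, and the helper decode are hypothetical.

```python
# Hypothetical codebook entry; the variable name, labels, and codes are
# illustrative assumptions, not taken from any real classification.
CODEBOOK = {
    "marital_status": {
        "label": "Marital status of respondent",
        "values": {
            1: "Never married",
            2: "Married",
            3: "Widowed",
            4: "Divorced or separated",
        },
        # Blank and missing codes are documented explicitly, not left implicit.
        "missing": {
            None: "Blank (question not asked)",
            -8: "Not applicable",
            -9: "Refused",
        },
        # Name and version of a standard classification, if one applies.
        "classification": None,
    },
}


def decode(variable, code):
    """Return (label, is_missing) for a code, or raise if the code is undocumented."""
    entry = CODEBOOK[variable]
    if code in entry["values"]:
        return entry["values"][code], False
    if code in entry["missing"]:
        return entry["missing"][code], True
    raise ValueError(f"undocumented code {code!r} for variable {variable!r}")


print(decode("marital_status", 2))
print(decode("marital_status", -9))
```

Raising an error on an undocumented code is a deliberate choice in this sketch: it surfaces gaps in the documentation instead of silently passing unknown values through.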

  • Information about derived variables
  • Many data producers derive new variables from original data. This may be as simple as grouping raw age data (in years) into age bands appropriate for the survey, or it may be much more complex and require the use of sophisticated algorithms. When grouped or derived variables are created, it is important that the logic for the grouping or derivation is clear. Simple grouping, such as for age, can be included within the data dictionary. More complex derivations require other means of recording the information. The best method of describing these is by using flowcharts or accurate Boolean statements. Sufficient supporting information should be provided to allow an easy link between the core variables used and the resultant variables. In addition, computer algorithms used to create the derivations should be saved together with information on the software.
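
To illustrate the simple case, the sketch below derives an age group from age in completed years and produces the matching Boolean statements from the same definition, so that the code and the documentation cannot drift apart. The band boundaries, the variable name age_group, and both helper functions are assumptions made for the example.

```python
# Hypothetical bands: (lower bound, upper bound or None if open-ended, label).
# Both bounds are inclusive; ages are assumed to be whole completed years.
AGE_BANDS = [
    (0, 14, "0-14"),
    (15, 24, "15-24"),
    (25, 64, "25-64"),
    (65, None, "65+"),
]


def age_group(age):
    """Derive the grouped variable; missing or negative ages stay missing (None)."""
    if age is None or age < 0:
        return None
    for lower, upper, label in AGE_BANDS:
        if age >= lower and (upper is None or age <= upper):
            return label
    return None


def describe_bands():
    """Return the derivation as Boolean statements suitable for the documentation."""
    lines = []
    for lower, upper, label in AGE_BANDS:
        condition = f"age >= {lower}" if upper is None else f"{lower} <= age <= {upper}"
        lines.append(f"age_group = '{label}' if {condition}")
    return lines


for line in describe_bands():
    print(line)
print(age_group(34))
```

Saving a script like this alongside the dataset, with a note of the software version used to run it, covers the recommendation to preserve the algorithm itself.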

  • Weighting and grossing
  • Weighting and grossing variables must be fully documented, with explanations of the construction of the variables and clear indications of the circumstances in which they should be used. The latter is particularly important when different weights are applied for different purposes.
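
A short worked example can make the intended use of a weight unambiguous. The sketch below assumes a single grossing weight that states how many population units each sampled case represents; the records, the variable names income and weight, and the two functions are hypothetical.

```python
# Hypothetical sample of three cases. "weight" is assumed to be a grossing
# weight: the number of population units that each sampled case represents.
records = [
    {"income": 100.0, "weight": 50.0},
    {"income": 200.0, "weight": 150.0},
    {"income": 400.0, "weight": 50.0},
]


def grossed_total(rows, value, weight):
    """Estimate the population total by scaling each case by its weight."""
    return sum(row[value] * row[weight] for row in rows)


def weighted_mean(rows, value, weight):
    """Estimate the population mean as the grossed total over the sum of weights."""
    total_weight = sum(row[weight] for row in rows)
    return grossed_total(rows, value, weight) / total_weight


unweighted_mean = sum(row["income"] for row in records) / len(records)

print("grossed total:", grossed_total(records, "income", "weight"))
print("weighted mean:", weighted_mean(records, "income", "weight"))
print("unweighted mean:", round(unweighted_mean, 2))
```

The gap between the weighted and unweighted means in this toy sample shows why the documentation should state which weight applies to which kind of estimate.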

  • Data source
  • Details about the source from which the data is derived should be included. For example, when the data source consists of responses to survey questionnaires, each question should be carefully recorded in the documentation. Ideally, the text will include a reference to the generated variable(s). It is also useful to explain the conditions under which a question would be asked, including, if possible, the cases to which it applies and, ideally, a summary of response statistics.

  • Confidentiality and anonymization
  • It is important to determine whether the data contains any confidential information on individuals, households, organizations, or institutions. If so, this should be recorded together with any agreement, for example with survey respondents, on how the data may be used. Issues of confidentiality may restrict the analyses to be undertaken or results to be published, particularly if the data is to be made available for secondary use. If the data was anonymized to prevent identification, it is wise to record the anonymization procedure and its impact on the data, as such modification may restrict subsequent analysis.

2. Contextual information

This provides users with material about the context in which the data was collected and how it was put to use. This information adds richness and depth to the documentation, and enables the secondary user to fully understand the background and processes behind the data collection exercise. It also creates a vital historical record for future researchers.

  • Description of the originating project
  • Details should be provided about the history of the project or the process that produced the dataset, including information on the intellectual and substantive framework. For example, the description could cover topics such as:

    • Why the data collection was necessary;
    • Objectives of the project;
    • Who or what was being studied;
    • Geographic and temporal coverage;
    • Publications or policy developments to which the project contributed or to which it was a response; and
    • Other relevant information.
  • Provenance of the dataset
  • This information relates to aspects such as the history of the data collection process, changes and developments that occurred in the data and methodology, or any adjustments made. The following can be provided as well:

    • Details of data errors;
    • Problems encountered in the process of data collection, entry, checking, and cleaning;
    • Conversion to a different software or operating system;
    • Bibliographic references to reports or publications related to the study; and
    • Other useful information on the life cycle of the dataset.
  • Serial and time-series datasets, new editions
  • For repeated cross-section, panel, or time-series datasets, it is helpful to obtain additional information describing changes in question text, variable labeling, sampling procedures, or other aspects of the study.

3. Cataloguing material

This material serves two purposes. First, it is a bibliographic record of the dataset. This allows the dataset to be properly acknowledged and cited in publications, and the material becomes a formal record for preservation purposes. Second, it is the basic instrument for resource discovery: it allows the dataset to be uniquely identified within the collection and helps secondary users judge whether a study is useful to them.