Metadata standards and models
A set of metadata standards and models has been developed to facilitate the communication of data between organizations and software systems and to improve the quality of the statistical documentation provided to data users. These metadata standards provide a structured framework for organizing and disseminating information on the content and structure of statistical data.
XML metadata standards:
- Data Documentation Initiative (DDI) metadata standard, developed specifically for the documentation and cataloguing of microdata
- Dublin Core Metadata Initiative (DCMI)
- Statistical Data and Metadata Exchange (SDMX) standard, developed for the documentation and sharing of time series data; it is not directly usable for survey microdata, but is relevant for derived indicators
Other metadata standards and metadata models (or "frameworks"):
- ISO 11179
- Generic Statistical Business Process Model (GSBPM)
- Generic Statistical Information Model (GSIM)
The XML Language
eXtensible Markup Language (XML) was developed as a common tool for structuring information to be shared on the Web and between software systems. XML is a way of tagging text for meaning rather than appearance; that is, XML organizes the content of a text by tagging it with meaningful information. Although XML "tags" are conceptually the same as the "fields" of a database, XML documents, unlike database files, are regular text files that can be viewed and edited using any standard text editor. An XML file can be searched and queried much like a regular database using tools such as XPath or XQuery, and edited using XForms. (A web-based tutorial on these tools can be found at http://www.w3schools.com/xml.) Just as the content of a database can be converted into a report, XML documents can be read and transformed by other software applications into user-friendly formats such as spreadsheets, PDF files, or Web pages.
The following example shows how textual information about a survey could be presented:

The Multiple Indicator Cluster Survey 2005 (MICS) was conducted by the National Statistics Office (NSO) and funded by the United Nations Children Fund (UNICEF). Data were collected from January to March 2005 in Popstan, with national coverage. The sample comprised 5,000 households, stratified in two stages, and the response rate was 98 percent.

The same information converted into XML using DDI tags would look like this:
<titl>Multiple Indicator Cluster Survey 2005</titl>
<altTitl>MICS</altTitl>
<AuthEnty>National Statistics Office (NSO)</AuthEnty>
<fundAg abbr="UNICEF">United Nations Children Fund</fundAg>
<collDate date="2005-01" event="start"/>
<collDate date="2005-03" event="end"/>
<nation>Popstan</nation>
<geogCover>National</geogCover>
<sampProc>5,000 households, stratified two stages</sampProc>
<respRate>98 percent</respRate>
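Because the tags carry meaning, such a file can be queried directly. For instance, an XPath expression along the following lines (a sketch; the exact invocation depends on the tool used to evaluate it) would return the start date of data collection, 2005-01, from the fragment above:

//collDate[@event='start']/@date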
The use of tags is particularly powerful when a user community agrees on a common set of tags (such as the DDI or DCMI standards). Adoption of a common set of XML tags offers major advantages for documenting microdata, including: creation of a comprehensive "checklist" of useful metadata elements; the ability to assess a file's contents by determining whether particular tags are, or are not, present in that file; creation of a dataset catalog that can be queried for key metadata elements; and the ability to transform the file into more user-friendly formats. XML files can be converted into HTML, PDF, or other documents using XSL Transformations (XSLT), or exchanged across networks or the Internet using web services or SOAP. An XSL Transformation can, for example, render the earlier XML file as an HTML web page, as sketched below.
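A minimal XSLT stylesheet along the following lines could produce such a page (a sketch only: it assumes the DDI fragment above has been wrapped in a single root element, and it renders just a few of the tags):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Render selected DDI elements as a simple HTML page -->
  <xsl:template match="/">
    <html>
      <body>
        <h1><xsl:value-of select="//titl"/></h1>
        <p>Country: <xsl:value-of select="//nation"/>
           (coverage: <xsl:value-of select="//geogCover"/>)</p>
        <p>Response rate: <xsl:value-of select="//respRate"/></p>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>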
Data Documentation Initiative (DDI)
Traditionally, data producers wrote text-based codebooks. To take full advantage of web technology, most standards are now defined in XML. The DDI is a standard dedicated to microdata documentation; it enables even the most complex microdata files to be documented in a way that is simultaneously flexible and rigorous, and it provides a straightforward means of recording and communicating all the salient characteristics of a microdataset.
The DDI Alliance maintains two versions of the DDI specification: DDI Codebook (which is used and recommended by the IHSN) and DDI Lifecycle (a more complex version of the specification). DDI Codebook is a major transformation of the once-familiar electronic “codebook”; it retains the same set of capabilities but greatly increases the scope and rigor of the information contained therein.
The DDI metadata specification originated in the Inter-university Consortium for Political and Social Research (ICPSR), a membership-based organization with more than 500 member colleges and universities worldwide. It is now the project of an alliance of institutions in North America and Europe. Member institutions comprise many of the world’s largest data producers and data archives.
The DDI specification addresses the types of data resulting from surveys, censuses, administrative records, experiments, direct observation, and other systematic methodologies for generating empirical measurements. For example, the units of analysis could be individual persons, households, families, business establishments, transactions, countries, or other subjects of scientific interest. Similarly, observations may consist of measurements at a single point in time in a single setting, such as a sample of people in one country during one week. Or they may comprise repeated observations in multiple settings, including longitudinal and repeated cross-sectional data from many countries, as well as time series of aggregated data. The DDI specification also provides for full descriptions of the study’s methodology (e.g., mode of data collection, applicable sampling methods, universe, geographical areas of study, responsible organization and persons, and so on).
Structure
The DDI specification permits all aspects of a survey to be described in detail: methodology, responsibilities, files, and variables. It provides a structured and comprehensive list of hundreds of elements and attributes that may be used to document a dataset, although it is unlikely that any one study would use all of them. Some elements, such as “Title,” are mandatory and must be unique. Others are optional and repeatable, such as “Authoring Entity/Primary Investigator,” which identifies the person(s) and/or organization(s) responsible for the survey. DDI Codebook (version 2.n) elements are organized in five sections, summarized below; a skeleton of the corresponding XML structure follows the summaries.
Section 1.0: Document Description
A study (e.g., a survey or census) is not always documented and disseminated by the same agency as that which produced the data. It is therefore important to provide information (i.e., metadata) not only on the study itself, but also on the documentation process. The Document Description consists of an overview, the “metadata about metadata,” describing the DDI-compliant XML document.
Section 2.0: Study Description
The Study Description is an overview of the study and includes information on how the study should be cited; who collected, compiled, and distributed the data; a summary (abstract) of the data content; details of data collection methods and processing; and so on.
Section 3.0: Data File Description
This section describes each data file’s content, record and variable counts, version, producer, and so on.
Section 4.0: Variable Description
This section presents details of each variable, including literal question text, universe, variable and value labels, derivation and imputation methods, and so on.
Section 5.0: Other Material
This section allows for descriptions of other material related to the study. These can include documents such as questionnaires, coding information, technical and analytical reports, and interviewers’ manuals; data processing and analytical programs; photos; or maps.
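In XML terms, these five sections correspond to the top-level elements of a DDI Codebook document, roughly as sketched below (the element names are those of DDI Codebook 2.5; the variable shown under dataDscr is an illustrative placeholder, not taken from a real study):

<codeBook xmlns="ddi:codebook:2_5">
  <docDscr>   <!-- Section 1.0: metadata about the DDI document itself -->
  </docDscr>
  <stdyDscr>  <!-- Section 2.0: overview of the study -->
  </stdyDscr>
  <fileDscr>  <!-- Section 3.0: one element per data file -->
  </fileDscr>
  <dataDscr>  <!-- Section 4.0: variable-level documentation -->
    <var name="hhsize">
      <labl>Household size</labl>
      <qstn><qstnLit>How many persons live in this household?</qstnLit></qstn>
    </var>
  </dataDscr>
  <otherMat>  <!-- Section 5.0: related materials (questionnaires, reports, programs, etc.) -->
  </otherMat>
</codeBook>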
Dublin Core Metadata Specification (DCMI)
The following content is derived from the DCMI website (http://dublincore.org).
The Dublin Core Metadata Element Set (ISO standard 15836), also known as the Dublin Core metadata standard, is a simple set of elements for describing digital resources. This standard is particularly useful for describing resources related to microdata, such as questionnaires, reports, manuals, and data processing scripts and programs. The initiative was founded in 1995 by the Online Computer Library Center (OCLC) and the National Center for Supercomputing Applications (NCSA) at a workshop in Dublin, Ohio. Over the years, it has become the most widely used standard for describing digital resources on the Web and was approved as an ISO standard in 2003. The standard is maintained and further developed by the DCMI, an international organization dedicated to the promotion of interoperable metadata standards.
A major reason behind the success of the Dublin Core metadata standard is its simplicity. From the outset, it has been the goal of the designers to keep the element set as small and simple as possible to allow the standard to be used by non-specialists. The standard also makes it easy and inexpensive to create simple descriptive records for information resources, while providing for effective retrieval of those resources on the Web or in any similar networked environment.
In its simplest form, the Dublin Core consists of the following 15 metadata elements, all of which are optional and repeatable: Title; Relation; Rights; Subject; Coverage; Date; Description; Creator; Format; Type; Publisher; Identifier; Source; Contributor; Language.
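For example, a questionnaire distributed as a PDF file could be described with Dublin Core elements in XML along the following lines (a sketch: the dc: prefix is bound to the standard DCMI namespace, and the values shown are purely illustrative):

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Household Questionnaire, Multiple Indicator Cluster Survey 2005</dc:title>
  <dc:creator>National Statistics Office (NSO)</dc:creator>
  <dc:type>Text</dc:type>
  <dc:format>application/pdf</dc:format>
  <dc:language>en</dc:language>
  <dc:relation>Multiple Indicator Cluster Survey 2005 dataset</dc:relation>
</metadata>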
ISO 11179 - Information Technology - Metadata registries (MDR)
The International Standard ISO/IEC 11179-1 was developed by the Joint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 32, Data management services. "ISO/IEC 11179 describes the standardizing and registering of data elements to make data understandable and shareable. Data element standardization and registration as described in ISO/IEC 11179 allow the creation of a shared data environment in much less time and with much less effort than it takes for conventional data management methodologies." (Source: ISO-IEC 1999, available at http://metadata-stds.org/11179-1/ISO-IEC_11179-1_1999_IS_E.pdf)
Statistical Data and Metadata Exchange (SDMX)
Focusing on time series and indicators, SDMX is the result of a joint effort among the Bank for International Settlements (BIS), European Central Bank (ECB), EUROSTAT, International Monetary Fund (IMF), Organization for Economic Cooperation and Development (OECD), United Nations (UN), and World Bank (WB) to create an XML specification supporting the exchange of aggregate data and metadata. SDMX provides three types of statistical metadata standards: standards for data formats, standards for metadata, and a registry-based architecture for implementing these standards and exchanging data between systems.
One of the requirements for SDMX was coordination with other metadata specifications such as the DDI. DDI metadata, which emphasize archival metadata and microdata rather than aggregate data, are exchangeable in an equivalent SDMX metadata format. This ensures interoperability of metadata across namespaces.
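For a flavor of the data format, a single annual observation in a time series might be represented in the SDMX-ML generic data format along these lines (an excerpt only, as a hedged sketch: the message header and namespace declarations are omitted, and the dimension names and values are purely illustrative):

<generic:Series>
  <generic:SeriesKey>
    <generic:Value id="REF_AREA" value="POPSTAN"/>
    <generic:Value id="INDICATOR" value="NET_ENROLMENT_RATE"/>
  </generic:SeriesKey>
  <generic:Obs>
    <generic:ObsDimension value="2005"/>
    <generic:ObsValue value="98.0"/>
  </generic:Obs>
</generic:Series>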
Generic Statistical Business Process Model (GSBPM)
The GSBPM describes statistical processes, such as the implementation of a survey, in nine phases, each divided into sub-processes:
- Specify the data needs
- Design
- Build
- Collect (includes data entry)
- Process (includes data editing)
- Analyze
- Disseminate
- Archive
- Evaluate
In addition to these nine phases, the GSBPM includes two overarching components: Quality Management and Metadata Management.
Generic Statistical Information Model (GSIM)
GSIM is a reference framework of internationally accepted definitions, attributes, and relationships that describe the pieces of information (information objects) used in the production of official statistics. This framework enables generic descriptions of the definition, management, and use of data and metadata throughout the statistical production process.
GSIM provides a common language to describe information that supports the entire statistical production process, from the identification of user needs through the dissemination of statistical products.
GSIM is aligned with relevant data management and exchange standards, such as DDI and SDMX, but is not directly tied to them, or to any specific technology.
GSIM is not software, nor an information technology (IT) standard. It is a strategic approach and a new way of thinking, designed to bring together statisticians, methodologists, and IT specialists to modernize and streamline the production of official statistics.
The previous information was extracted from the GSIM brochure, available at the GSIM website.