
Metadata and documentation

What do I need to do to make sense of data?

  • Documentation and metadata tell the story of the data, providing context and meaning for users, including how the data was originally created.
  • Documentation is contextual material generated in the course of the research to aid reuse.
  • Metadata is structured information describing the data to aid discovery.
  • Funders and journal publishers require that research data be accompanied by metadata explaining what data exists, why, when and how it was generated, and how to access it.
  • A repository will have metadata fields that meet a particular recognised standard for researchers to complete when depositing research data.

Generating documentation is an inevitable and necessary aspect of research. Good documentation, however, should tell the story of the data: the who, what, where, when and why. Material must be provided that explains how the data was created, what it means, how it was measured and structured, and what alterations and manipulations were undertaken to clean and analyse it, so that the data can be independently understood without having to ask questions of the original data creator.

Creating and sustaining comprehensive documentation is important because it transfers knowledge to other potential users, enabling researchers to discover, understand, and properly cite data. Good documentation also helps integrate people joining a research team, and aids the original data creators themselves. During initial data collection and examination, researchers may know the meanings, methodologies, and manipulations involved, but over time it is human nature to forget or misremember aspects of creation and manipulation. That is why documenting data during collection, processing and analysis is crucial: it provides provenance and context to the data and enables comprehension and reuse in the long term.

Documentation

Documentation may include:

  • Reasons the data was collected. The aims and objectives of the project, often outlined in funding proposals or end of award reports.
  • A "clean" copy of the confidentiality and consent agreement used.
  • Data collection methods and procedures. Definition of the universe of analysis and sample framework, notes on instruments used to collect and analyse data, plus information on the conditions of data collection.
  • Data collection tools. A copy of the questionnaire(s), prompts, and/or interview schedule(s) used in the research.
  • Database schemas and data structure. Variable labels and descriptions, an outline of relationships within the dataset.
  • Coding schemas. Definition of coding conventions used – including information on missing data, categories, classifications, acronyms and annotations.
  • Data modifications. Specification of any weighting used, identification of derived variables and the syntax used to create them, output files, and subsequent modifications to the original data.
  • Quality control measures. Details on activities undertaken to verify and clean the data, an outline of formatting applied to the data, an explanation of file naming conventions, and if needed, a statement on known problems with the data.
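
As an illustration of the "database schemas" and "coding schemas" items above, a codebook can be kept in a machine-readable form alongside the data. The following is a minimal sketch only: the variable names, labels, category codes and missing-value codes are all hypothetical and do not come from any particular study or standard.

```python
# Hypothetical codebook for an illustrative survey.
# Every variable name, label and code below is an assumption for the example.
CODEBOOK = {
    "age": {
        "label": "Age of respondent in years",
        "type": "integer",
        "missing_codes": {-9: "Refused", -8: "Don't know"},
    },
    "employ": {
        "label": "Employment status",
        "type": "categorical",
        "categories": {1: "Employed", 2: "Unemployed", 3: "Retired"},
        "missing_codes": {-9: "Refused"},
    },
}


def describe(variable, value):
    """Return a human-readable meaning for a raw coded value."""
    entry = CODEBOOK[variable]
    missing = entry.get("missing_codes", {})
    if value in missing:
        return f"{entry['label']}: missing ({missing[value]})"
    categories = entry.get("categories")
    if categories is not None:
        # Flag codes that appear in the data but not in the documentation.
        return f"{entry['label']}: {categories.get(value, 'undocumented code')}"
    return f"{entry['label']}: {value}"


print(describe("employ", 2))
print(describe("age", -9))
```

Keeping the codebook in one structure like this means a later user can decode a value, or spot an undocumented code, without asking the original data creator.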

Metadata

Metadata is fundamentally data about data. However, unlike documentation, metadata follows formally agreed standards, often with controlled fields and vocabularies, to facilitate data preservation, discovery for sharing, and data citation. Some fields may be mandatory (title, Principal Investigator), others recommended (language, contact information), and some optional (ownership, retention period).
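
The distinction between mandatory, recommended and optional fields can be sketched as a simple completeness check. This is an illustrative sketch, not the behaviour of any real repository: the field identifiers are hypothetical spellings of the examples given above, and the function name `check_record` is an assumption.

```python
# Field tiers taken from the examples in the text; the identifiers
# themselves are hypothetical, not from any specific metadata standard.
MANDATORY = ("title", "principal_investigator")
RECOMMENDED = ("language", "contact_information")
OPTIONAL = ("ownership", "retention_period")


def check_record(record):
    """Report missing mandatory and recommended fields, plus unknown ones."""

    def is_blank(field_name):
        value = record.get(field_name)
        return value is None or not str(value).strip()

    known = set(MANDATORY) | set(RECOMMENDED) | set(OPTIONAL)
    return {
        "errors": [f for f in MANDATORY if is_blank(f)],      # must be fixed
        "warnings": [f for f in RECOMMENDED if is_blank(f)],  # should be fixed
        "unknown": sorted(set(record) - known),               # not in the scheme
    }


# Hypothetical deposit record with one mandatory field missing.
record = {"title": "Household survey", "language": "en", "colour": "blue"}
print(check_record(record))
```

A missing mandatory field would typically block a deposit, whereas a missing recommended field would only prompt a reminder; optional fields are never reported as missing.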

Metadata exists at both the object level (a file, a variable) and the collection level (a set of files, a dataset). It can also be grouped by the aspect of the data it describes: for example, descriptive, access, technical, and methodological metadata.

There are various metadata standards, often tailored to particular types of need (archiving, librarianship) or disciplines. Metadata schemes are designed for computers to read, but human readability is not excluded. The primary metadata standard for social, behavioural, and economic research is the Data Documentation Initiative (DDI), designed to describe all stages of research in the social sciences by providing definitions (semantics) for every element of the data, from conceptualization through collection, processing, dissemination, analysis, and archiving to eventual reuse.

Funders expect research to be accompanied by standardised, structured metadata that provides information on what research data exists, why, when and how it was generated, and how to access it. Although the fields, structure, and vocabulary are provided by the repository or archive accepting your data, the content is provided by the researcher.

Members of Research Councils UK adopted a common principle that "sufficient metadata should be recorded and made openly available to enable other researchers to understand the research and re-use potential of the data". The ESRC is more specific, demanding that structured metadata be provided for reusers, covering purpose, origin, time references, geographic location, creator(s), access conditions and terms of use (p.4).
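
To make the elements listed above concrete, the sketch below holds them in a structured record and serialises it to JSON. It is an assumption-laden illustration, not an ESRC or repository schema: the class name `DatasetMetadata`, the field identifiers and every value in the example record are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DatasetMetadata:
    """One field per element named in the text; identifiers are hypothetical."""
    purpose: str
    origin: str
    time_references: str
    geographic_location: str
    creators: list
    access_conditions: str
    terms_of_use: str


def to_json(meta):
    """Serialise the record so that software and people can both read it."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)


# Entirely made-up example values.
record = DatasetMetadata(
    purpose="Study of commuting patterns",
    origin="Face-to-face household interviews",
    time_references="2019-01 to 2019-06",
    geographic_location="Example region",
    creators=["A. Researcher", "B. Researcher"],
    access_conditions="Registered users only",
    terms_of_use="Non-commercial research use",
)

print(to_json(record))
```

In practice the repository or archive supplies the actual fields and vocabulary; the researcher's task is to fill in content of this kind.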

Good metadata can help others discover, comprehend, and evaluate data across time and distance without having to access the data itself. Good researcher-generated metadata enhances a collection, but its absence destroys it.