
Metadata

DATA ABOUT DATA

Kashif Rabbani

Technische Universität Berlin, Berlin, Germany

Abstract. This document provides a basic review of metadata. The review
starts with the motivation for and definition of metadata. Defining
metadata requires some foundational knowledge about maps, along with
historical background. The metadata literature is then extended to a
big-picture view focusing on three core features reflected by metadata.
After covering the literature, the author explains a few important
day-to-day metadata terminologies, a typology of metadata standards, and
the main metadata types used in the current state of the art. Finally, the
author concludes by explaining domain-specific metadata standards in five
particular domains, the use of metadata, and a take-away message. Most of
the concepts explained in this report are based on [3][4].

Keywords: Metadata · Metadata Standards · Metadata Typology ·
Metadata Terminologies · Metadata Types · Metadata Domains ·
Use of Metadata
Table of Contents

Metadata: Data About Data
Kashif Rabbani
1 Introduction
   1.1 Driving Towards Metadata
   1.2 Defining Metadata
   1.3 Metadata Terminologies
2 Metadata Standards
3 Metadata Types
   3.1 Descriptive Metadata
      Dublin Core Analysis
   3.2 Administrative Metadata
      Rights Metadata
      Preservation Metadata
      Technical Metadata
   3.3 Structural Metadata
   3.4 Provenance Metadata
   3.5 Meta-Metadata
4 Domain-Specific Metadata
   4.1 Music Industry
   4.2 Education
   4.3 Transcripts
   4.4 Publishing
   4.5 Geospatial Metadata
      Geospatial Metadata Standards
      Geospatial Metadata Tools
5 Use of Metadata
   5.1 Data Exhaust
   5.2 Paradata
6 Conclusion

1 Introduction

Metadata is a term widely used in data science nowadays, yet it is often
misunderstood. Philip Bagley [1] coined the term metadata in November 1968.
The concept itself, however, dates back thousands of years to the earliest
libraries.
The first catalog, created for the Library of Alexandria around 245 BC, was
called the Pinakes (Πίνακες in Ancient Greek). It was invented to solve the
critical problem of quickly finding a book of interest. As an analogy, it
worked much like the VHS tapes of the past, which had to be scrolled
through. The attributes used in these catalogs were the same as those used
in today’s libraries, e.g. title, genre, and author.
The second invention in the development of library catalogs was the codex,
also called the shelf list (the book). The third and most revolutionary
invention was the card catalog, invented at the time of the French
Revolution. The card catalog atomized the shelf list in two dimensions:
1) records for individual items and 2) headers/categories shared by the
data items. Breaking data into records (individual items) and categories
shared by those items essentially yields a spreadsheet. This atomization in
two dimensions later led to the invention of databases.

1.1 Driving Towards Metadata

Let us build some basis for a technical definition of metadata. According
to Alfred Korzybski (a Polish-American scholar recognized as the founder of
general semantics),
The map is not the territory.
We encounter different types of maps in daily life, for example road maps
(such as Google Maps), topographical maps, and nautical charts. Each type
serves a specific purpose, and they are generally not interchangeable. What
these maps have in common is that they simplify the richness and complexity
of the physical world into the details one needs in a specific situation.
In that sense, maps serve as a language that reduces the complexity of
daily life. For example, we do not need topographic information
(information about the shape and features of land surfaces) when planning a
road trip; we only require weather and traffic/road information. Thus the
map is a separate, simpler object than the territory. Hence we conclude
that metadata is a map: it is a way to simplify the complexity of an
object.
When metadata performs its task well, its existence fades into the
background. As an elementary example, every piece of information we recall
while retracing our steps to find lost house keys is metadata.

1.2 Defining Metadata


A short and very well-known definition of metadata is “data about data.”
However, the definition of data differs from person to person, and the term
“about” is itself not very clear. Therefore, we need to elaborate on this
definition technically. Jeffrey Pomerantz, in his book Metadata (MIT Press)
[4], gives the following definition:
Metadata is a statement about a potentially informative object.
– Jeffrey Pomerantz, Metadata, MIT Press [4]
Information objects have three features that are reflected through
metadata: the content, context, and structure of the information object.
What does the object contain? What are the “W” aspects (such as who, when,
where, and why) of the object’s creation? And what is the structure of the
object? Metadata answers these questions at any level of aggregation
(single item, list, database).

1.3 Metadata Terminologies


The information object is known as the resource. To make a statement, we
need a resource to say something about. What we say about the potentially
informative object is known as the description. A metadata schema defines a
set of rules that state what kinds of statements may be made about a
resource and how to make them. The first metadata schema designed to
describe any resource was Dublin Core; we discuss it in detail in Section
3.1. An element in a metadata schema is a category of statement about the
resource, and a piece of data assigned to an element is called a value. The
most used term is the metadata record, which is a set of statements about a
single resource. Two types of vocabularies are often used in metadata
standards: an uncontrolled vocabulary is an unbounded set of terms that may
supply the value of an element, while a controlled vocabulary is an
organized, finite set. Figure 1 represents the flow from resource to
description.

Fig. 1: Metadata Terminologies
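
The terminology above can be made concrete with a small sketch. It is only
an illustration under stated assumptions: the schema, its three element
names, the genre vocabulary, and the resource identifier "book-001" are all
hypothetical and do not come from any published standard.

```python
from dataclasses import dataclass, field

# Hypothetical schema: element name -> controlled vocabulary (a finite set),
# or None when the element accepts any value (uncontrolled vocabulary).
SCHEMA = {
    "title": None,
    "creator": None,
    "genre": {"fiction", "non-fiction", "poetry"},
}


@dataclass
class MetadataRecord:
    """A set of statements (element, value) about a single resource."""
    resource: str
    statements: list = field(default_factory=list)

    def add(self, element, value):
        # The schema decides which statements may be made about the resource.
        if element not in SCHEMA:
            raise ValueError(f"element {element!r} is not defined by the schema")
        vocabulary = SCHEMA[element]
        if vocabulary is not None and value not in vocabulary:
            raise ValueError(f"{value!r} is not an allowed value for {element!r}")
        self.statements.append((element, value))


record = MetadataRecord(resource="book-001")
record.add("title", "Metadata")
record.add("creator", "Jeffrey Pomerantz")
record.add("genre", "non-fiction")
```

Here the record is the description of the resource, each tuple pairs an
element with a value, and a value outside the controlled vocabulary (for
example a genre of "cookbook") is rejected with a ValueError.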

2 Metadata Standards
There are hundreds of metadata standards available for different domains.
This report does not aim to overwhelm readers with all of them. Instead, a
typology of metadata standards is presented to illustrate their importance
and existence.


Standards are like toothbrushes, everyone agrees that they’re a good idea, but
nobody wants to use anyone else’s.
– Attributed to Murtha Baca, Getty Research Institute

Data Structure Standards are based on metadata element sets and schemas.
These standards are containers of data about the information object, e.g.
the Dublin Core Metadata Element Set (DCMES), MARC, EAD, CDWA, VRA, etc.
Data Value Standards are based on controlled vocabularies. Such standards
provide the terms/values used to populate data structure standards or sets
of metadata elements, e.g. LCSH, the LC Thesaurus for Graphic Materials
(TGM), ULAN, TGN, ICONCLASS, etc.
Data Content Standards comprise cataloging rules and codes. These standards
form the guidelines for the format and syntax of the data values used to
populate the metadata elements, e.g. the Anglo-American Cataloguing Rules
(AACR), RDA, ISBD, DACS.
Data Format and Technical Interchange Standards are in machine-readable
form. These standards are the manifestation of a specific data structure
standard, encoded for machine processing. There is a long list of examples,
but the most common are MARC21, MARCXML, EAD XML DTD, METS, MODS, CDWA Lite
XML schema, Simple Dublin Core XML schema, Qualified Dublin Core XML
schema, and VRA Core 4.0 XML schema.

3 Metadata Types

Perhaps the most famous and widely used type of metadata is descriptive
metadata. However, it is not the only type. Different communities perceive
metadata from different angles and thus come up with new types of metadata
or metadata standards. We discuss eight different types of metadata in
detail below.

3.1 Descriptive Metadata

The very first metadata standard, known as Dublin Core, is categorized as
descriptive metadata. In November 1993, the National Center for
Supercomputing Applications (NCSA) released the first web application to
display both images and text simultaneously. It was a major step for the
World Wide Web (WWW), and within two years, by early 1995, data transfer
protocols such as HTTP, FTP, and Telnet had taken over the market. In March
1995, the Online Computer Library Center (OCLC), based in Dublin, Ohio, and
NCSA convened an invitation-only workshop. The main agenda was to discuss
“metadata for the web.” Web search was still in its infancy at that time:
Google did not yet exist, and Yahoo had only just started as a hand-curated
directory. The goal of the workshop was to reach consensus on a core set of
metadata elements to describe web and network resources. The point of
discussion was the importance of descriptive metadata for the success of
web search tools. Fifteen elements, known as the Dublin Core elements, were
introduced; they are shown in Figure 2. These elements can be extended for
other metadata standards. Each element is a category of statement about a
resource, e.g. the Creator element names the entity primarily responsible
for making the potentially informative object.
Let us write metadata about Apple Inc.: Title: Apple Inc.; Creator: Steve
Jobs; Creator: Steve Wozniak; Creator: Ronald Wayne; Date: April 1976;
Description: Personal computer manufacturer.

Fig. 2: Dublin Core Metadata Elements
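
As a rough sketch, the Apple Inc. statements above could be serialized as
Dublin Core elements in XML using Python's standard library. The namespace
URI is the one published for the Dublin Core element set; the wrapper
element name "metadata" and the date encoding "1976-04" (year and month)
are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

# Statements from the Apple Inc. example; an element may repeat (three creators).
statements = [
    ("title", "Apple Inc."),
    ("creator", "Steve Jobs"),
    ("creator", "Steve Wozniak"),
    ("creator", "Ronald Wayne"),
    ("date", "1976-04"),
    ("description", "Personal computer manufacturer"),
]

# "metadata" is a hypothetical wrapper element chosen for this sketch.
root = ET.Element("metadata")
for element, value in statements:
    child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
    child.text = value

xml_text = ET.tostring(root, encoding="unicode")

# Reading the record back: collect every value of the repeated creator element.
creators = [e.text for e in root.findall(f"{{{DC_NS}}}creator")]
```

Each (element, value) pair becomes one XML element, so the same record can
be read back by any tool that understands the Dublin Core namespace.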

Dublin Core Analysis: The Dublin Core elements were defined as a core, so
it is essential to analyze how successful that core has been. An audience
expects a core to be adaptable and extensible without much cost, where cost
includes money, time, and risk. Figure 3 shows household consumption over
the last century to give an idea of rates of adoption. To analyze the
success of Dublin Core, we should recall the objective of the 1995 OCLC
workshop in Dublin: it was about the importance of metadata for the success
of web search engines. Successful search engines like Google and Yahoo,
however, came into existence using full-text search approaches that take
advantage of network structure and other web features. Hence Dublin Core
seems to have fallen short of its purpose here, because search engines did
not build their foundations on Dublin Core metadata. Should we declare
Dublin Core a failure? No: the first initiative to implement the RDF data
model came about because of Dublin Core. The most famous projects using RDF
data models are the Digital Public Library of America, Europeana, and
DBpedia.
DBpedia aims to extract structured information from the Wikipedia project.
This information is stored in the form of RDF and is available on the World
Wide Web. It allows Wikipedia resources to be queried semantically for
details about their relationships and properties, and it links to other RDF
ontologies. It is also known as one of the best efforts in decentralized
Linked Data.
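
RDF expresses each statement as a subject-predicate-object triple. The
following minimal sketch shows how such triples can be queried by pattern.
It uses plain Python tuples rather than an RDF library, and the identifiers
are shortened, hypothetical names, not actual DBpedia URIs.

```python
# Illustrative triples only: (subject, predicate, object).
triples = [
    ("Mona_Lisa", "type", "Painting"),
    ("Mona_Lisa", "creator", "Leonardo_da_Vinci"),
    ("Leonardo_da_Vinci", "type", "Person"),
    ("Leonardo_da_Vinci", "birthPlace", "Vinci"),
]


def match(pattern, data):
    """Return the triples that match a (subject, predicate, object) pattern.

    None acts as a wildcard, similar in spirit to a variable in a query.
    """
    return [
        t for t in data
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


# Everything stated about one resource.
about_leonardo = match(("Leonardo_da_Vinci", None, None), triples)

# Every resource whose creator is that resource.
created_by_leonardo = [
    s for s, _, _ in match((None, "creator", "Leonardo_da_Vinci"), triples)
]
```

Because every statement has the same three-part shape, resources can be
linked to each other simply by reusing an identifier as the object of one
triple and the subject of another.
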
Europeana was started to preserve European cultural heritage in digital
form. The famous Mona Lisa painting by Leonardo da Vinci is one example of
an item found in Europeana. More than 3,000 institutions have contributed
to Europeana, which lets users explore Europe’s cultural and scientific
heritage.

Fig. 3: Household consumption over the last century [4]

The nice thing about standards is that there are so many of them to choose
from. – Admiral Grace Hopper

3.2 Administrative Metadata

Administrative metadata provides information about the complete lifecycle
of a resource. This information is used in the administration of the
resource. Administrative metadata is itself a broad umbrella covering three
main types of metadata: preservation, rights, and technical metadata.
Managing a resource requires every small piece of information to be stored
and analyzed in a way that is both useful and extensible. We discuss the
three types of administrative metadata in the following subsections.

Rights Metadata. This type provides information about the access control
rules and regulations of a resource. Digital resources most often suffer
from copyright issues, so a schema is needed to capture data about the
rights of a resource; recall the “rights” element of Dublin Core. The
Dublin Core standard was extended with three more rights-related elements:
1) Access Rights: policies and rights governing who may access the
resource; 2) Rights Holder: an individual or an organization that holds the
rights; 3) License: a legal document granting permission to use the
resource.

Preservation Metadata. Ensuring the continued existence of a resource
throughout the lifecycle of a process requires supporting information,
which is provided by preservation metadata. The Preservation Metadata:
Implementation Strategies (PREMIS) schema is the most fully developed
metadata schema supporting the preservation of resources. In other words,
preservation metadata is the information used by a repository (a collection
of electronic resources) to support the process of digital preservation.
For example, if the process description is to store a specific type of
medicine in an environment with 25 percent relative humidity, its PREMIS
diagram maps to the architecture shown in Figure 4.

Fig. 4: PREMIS Component Diagram [4]

Technical Metadata. Technical metadata addresses the system-level technical
details of how a resource functions. The most common example is digital
photography. Modern smartphones and digital cameras automatically generate
rich metadata records and embed them in each captured photograph (image
file). The Exchangeable image file format, also termed “Exif,” is one of
the best-known metadata schemas and is used by most modern digital devices.
Figure 5 shows the Exif metadata schema as used by a Canon EOS camera.

Fig. 5: Exif Metadata Schema [4]
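
Inside an image file, Exif stores each value under a numeric tag rather than
a readable name. The sketch below shows how such raw tags might be decoded
into a technical metadata record. The tag numbers are standard Exif/TIFF
tags; the raw values are invented for illustration and are not read from a
real photograph or taken from Figure 5.

```python
# Standard Exif/TIFF tag numbers and their names.
EXIF_TAG_NAMES = {
    0x010F: "Make",
    0x0110: "Model",
    0x9003: "DateTimeOriginal",
    0x829A: "ExposureTime",
    0x829D: "FNumber",
    0x8827: "ISOSpeedRatings",
}

# Hypothetical raw values, keyed by tag number as an image library might
# return them. Rational values are (numerator, denominator) pairs.
raw_exif = {
    0x010F: "Canon",
    0x0110: "Canon EOS",
    0x9003: "2019:01:15 10:30:00",
    0x829A: (1, 250),   # exposure of 1/250 s
    0x829D: (28, 10),   # aperture of f/2.8
    0x8827: 400,
}


def decode(raw):
    """Replace numeric tag ids with names; unknown tags keep a hex label."""
    return {
        EXIF_TAG_NAMES.get(tag, f"Unknown(0x{tag:04X})"): value
        for tag, value in raw.items()
    }


technical = decode(raw_exif)
numerator, denominator = technical["ExposureTime"]
exposure_seconds = numerator / denominator
```

None of these values has to be typed in by the photographer: the camera
records them automatically at capture time, which is what makes technical
metadata so abundant.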

3.3 Structural Metadata


We are used to watching digital videos, and structural metadata plays a
significant role behind the digital curtain. MPEG-21 is an ISO/IEC standard
for multimedia. It provides an open framework for applications to
incorporate multimedia files. At the heart of MPEG-21 is the digital item:
a structured digital object. For example, a movie includes videos, audio
tracks, and images organized in a specific way. In this case, the movie is
the resource, and structural metadata is responsible for capturing
information about its organization. Besides the video items themselves,
MPEG-21 provides information such as the correct playlist order.
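
The role of structural metadata can be sketched with a toy digital item.
This is not MPEG-21 syntax: the dictionary layout, the file names, and the
"order" field are assumptions chosen only to show how information about
organization is kept separate from the content itself.

```python
# Hypothetical digital item: a movie whose parts are listed out of order.
# "order" is the structural metadata recording how the parts fit together.
movie = {
    "title": "Example Movie",
    "parts": [
        {"file": "credits.mp4", "kind": "video", "order": 3},
        {"file": "intro.mp4", "kind": "video", "order": 1},
        {"file": "feature.mp4", "kind": "video", "order": 2},
        {"file": "soundtrack.aac", "kind": "audio", "order": 1},
    ],
}


def playlist(item, kind="video"):
    """Return the files of one kind in their intended playback order."""
    parts = [p for p in item["parts"] if p["kind"] == kind]
    return [p["file"] for p in sorted(parts, key=lambda p: p["order"])]


video_order = playlist(movie)
audio_order = playlist(movie, kind="audio")
```

The files themselves carry no hint of their sequence; only the structural
metadata lets a player reconstruct the movie correctly.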

3.4 Provenance Metadata


It is impossible to track the end-to-end history of a resource without
information about its related entities. Provenance metadata provides a
mechanism to track data about these entities and their relationships with
the resource. It is a way to determine the position of a resource in a
social network and to provide its context. For example, a wiki stores every
edit made to any of its pages. This lets wiki users go through the
historical timeline of a page, along with information about the editors (at
least their IP addresses) and their comments.
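
A wiki-style edit history can be sketched as a list of provenance
statements. The field names, editors, and timestamps below are hypothetical
and do not reflect the data model of any particular wiki software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """One provenance statement: who changed the page, when, and why."""
    timestamp: str  # ISO 8601 in UTC, so string order equals time order
    editor: str     # user name, or IP address for anonymous edits
    comment: str


# Hypothetical edit history for a single page, stored in arbitrary order.
history = [
    Revision("2019-01-12T09:30:00Z", "192.0.2.10", "fix typo"),
    Revision("2019-01-10T14:05:00Z", "alice", "create page"),
    Revision("2019-01-15T18:45:00Z", "bob", "add references"),
]

# Reconstruct the page's timeline and the sequence of people who shaped it.
timeline = sorted(history, key=lambda r: r.timestamp)
editors = [r.editor for r in timeline]
```

Sorting by timestamp recovers the end-to-end history, and the editor field
links the resource to the entities that acted on it.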

3.5 Meta-Metadata
The Metadata Encoding and Transmission Standard (METS) started in the early
2000s as a result of an enormous increase in data from digital resources
held by libraries, museums, archives, and cultural heritage institutions.
This growth led to an exponential increase in metadata schemas and
standards for such resources. Popular repositories that came into existence
include arXiv.org, Fedora, EPrints, and DSpace; a few of these are still
maintained and well known. The proliferation made it hard to reproduce the
content and functionality of data across systems. To solve this problem,
METS provides a standard structure for metadata about resources and enables
data exchange among different repositories.
METS creates documents for metadata records. A METS document is a mechanism
for recording the relationships that exist between a digital library object
and its pieces of content. A METS document has seven parts: the header,
descriptive metadata, administrative metadata, file section, structural
map, structural links, and behavior.
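
The overall shape of a METS document can be sketched as an XML skeleton. The
namespace URI and the seven section element names follow the METS schema;
the object identifier is hypothetical, and the empty placeholders are not a
schema-valid document, since real sections require identifiers and content.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

# The seven top-level sections, in the order they appear in a METS document.
SECTIONS = [
    "metsHdr",      # header: metadata about the METS document itself
    "dmdSec",       # descriptive metadata
    "amdSec",       # administrative metadata
    "fileSec",      # file section: the content files
    "structMap",    # structural map
    "structLink",   # structural links
    "behaviorSec",  # behavior
]


def mets_skeleton(object_id):
    """Build an empty METS document with one placeholder per section."""
    root = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": object_id})
    for name in SECTIONS:
        ET.SubElement(root, f"{{{METS_NS}}}{name}")
    return root


doc = mets_skeleton("example-object-1")  # hypothetical identifier
section_names = [child.tag.split("}")[1] for child in doc]
```

The header describes the METS document itself, which is why METS is
discussed here as metadata about metadata: descriptive, administrative, and
structural records are wrapped in one exchangeable container.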

4 Domain-Specific Metadata
Metadata is everywhere; some of the most visible areas are healthcare, the
environment, geospatial data, education, the music industry, and the
automobile industry. We discuss five domains in detail below.

4.1 Music Industry


We all love music. The music industry focuses on releasing new and unique
types of music using the latest research tools and technology. Pandora¹, a
popular online music service, makes extensive use of metadata. Descriptive
metadata is currently an active area of development in the classical music
industry. The Music Genome Project is the heart of the Pandora service. It
consists of around 450 features that describe a piece of music; these
features are the elements of a metadata schema. Pandora has hired a team of
musicians who are responsible for listening to every song licensed by
Pandora and mapping its characteristics onto the features of the Music
Genome Project. Some of the features are key, tempo, beats per minute, and
the gender of the vocalist. Evolution in the music industry is driven
mostly by genre and technology, and metadata is the best way to keep track
of this evolution.

¹ https://www.pandora.com/

4.2 Education
Education is a broad field, and plenty of learning resources are available
online to support learning pathways. Metadata comes into the picture when
we need to standardize learning objects. In 2002, the Institute of
Electrical and Electronics Engineers (IEEE) announced the Learning Object
Metadata (LOM) standard for describing learning objects.
Another aspect associated with learning is teaching. Learning objects
support both teaching and learning around a single learning objective.
Because most learning resources are digital, it is easy to standardize
their distribution under one body. LOM defines a set of categories, each
containing a specific set of elements. As a result of this initiative, many
education systems adopted LOM, e.g. the learning management systems (LMS)
used in K-12². LOM categories include the Educational category, which
comprises elements such as TypicalAgeRange and TypicalLearningTime, and the
Rights category, which comprises the Copyright element.
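
The category-and-element layout of LOM can be sketched as a nested record.
The category and element names follow those mentioned above; the General
category with its title, and all of the values, are assumptions made for
this illustration rather than a complete or validated LOM instance.

```python
# Hypothetical learning object: category -> element -> value.
learning_object = {
    "general": {"title": "Introduction to Metadata"},
    "educational": {
        "typicalAgeRange": "18-",          # learners aged 18 and older
        "typicalLearningTime": "PT1H30M",  # ISO 8601 duration: 1 h 30 min
    },
    "rights": {"copyright": "yes"},
}


def get_element(record, category, element):
    """Look up one element inside one category; None if it is absent."""
    return record.get(category, {}).get(element)


learning_time = get_element(learning_object, "educational", "typicalLearningTime")
age_range = get_element(learning_object, "educational", "typicalAgeRange")
missing = get_element(learning_object, "technical", "format")
```

Grouping elements into categories lets a learning management system read
only the categories it needs and ignore the rest.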

4.3 Transcripts
The heading does not convey much about this domain, so we need to dig into
its essence. Educational institutions issue transcripts, degrees, and
certificates to students, but not all institutions in a state are linked
with each other. A reliable way to verify transcripts without physical mail
had become a necessity. Parchment³ is a company that uses metadata to
develop schemas representing degree programs and students’ courses in a
well-structured way. This area has recently been standardized in higher
education. Parchment facilitates the verification of transcripts across
institutions and companies by enabling easy import and export of students’
transcripts and credentials.

4.4 Publishing
Publishing and descriptive metadata have been interrelated for many
decades. Traditionally, publication metadata consisted only of publisher
details, publication date, ISBN, etc. With the arrival of e-books and
self-publishing platforms online, however, it has gained wider attention.
Amazon Kindle Direct Publishing and Lulu are among the modern
self-publishing platforms. It has been observed that the quality and
richness of the metadata attached to these publications is critical to
whether readers discover a title or not.
² https://en.wikipedia.org/wiki/K%E2%80%9312
³ https://www.parchment.com/

4.5 Geospatial Metadata

Maps nowadays are in everyone’s pocket. Most businesses also use maps to
visualize different aspects of their projects. Geospatial resources make
extensive use of metadata. Geospatial metadata describes maps, Geographic
Information System (GIS) files, imagery, and other location-based
resources. The metadata is part of the dataset and provides context for the
data.
Metadata contains information about the data’s origin, custodianship,
copyright, and reuse. It is now widely used in spatial data communities for
sharing and transferring information. Geographic metadata makes users aware
of the limitations, suitability, indexing, and restrictions of geographic
data.

Geospatial Metadata Standards. The first ISO geographic metadata standard,
ISO 19115, was released in 2003. It was later endorsed by the Federal
Geographic Data Committee (FGDC) [2] in 2010. ANZLIC⁴ (the Spatial
Information Council) provides geospatial metadata standards for Australia
and New Zealand.

Geospatial Metadata Tools. There are a few well-known Geographic
Information System (GIS) tools, e.g. ArcGIS, pycsw, and OSGeo. ArcGIS⁵ is
the most famous and is widely adopted in industry because it enables users
to create and use maps, compile geographic data, and share and manage
geographic information. pycsw⁶ and OSGeo⁷ are used as frameworks to manage
and create geospatial data.

5 Use of Metadata

Because metadata does its job well in the background, we need to look at it
from another perspective. Our phone call details, online purchases, money
transactions, and web surfing also create a lot of metadata. We are often
unaware that the same metadata that can be used to detect fraud can also be
used against us. Similarly, the decisions taken by large companies and
institutions such as Amazon, banks, armies, and agencies influence people’s
lives, sometimes abruptly.
General Michael Hayden, former director of the NSA and the CIA, once stated
“We kill people based on metadata” in a panel debate at Johns Hopkins
University in 2014. The metadata used by the NSA and the CIA is about
individuals and their networks. These agencies collect phone call data
directly from carrier companies, analyze it in combination with metadata
obtained from other sources, and take decisions about individuals. This
process, from utilizing metadata to making a decision, is illustrated in
Figure 6.

⁴ http://www.anzlic.gov.au/
⁵ https://www.esri.com/en-us/arcgis/about-arcgis/overview
⁶ http://pycsw.org/
⁷ https://www.osgeo.org/projects/mapbender/

Fig. 6: Metadata use by Authorities

5.1 Data Exhaust


With the advent of smart technologies and apps, we produce a great deal of
data in our day-to-day activities. This data can be collected and used by
the authorities (as explained earlier) or by providers, e.g. Amazon. Data
exhaust is a by-product of other processes: unlike metadata, which is
created deliberately, it is produced incidentally as a result of other
activities. Figure 7 illustrates an example of data exhaust. A famous
e-commerce company once sent a flyer for baby-care items to a customer,
which revealed the pregnancy of the minor customer to her parents before
she had informed them. This decision was based entirely on analyzing the
customer’s purchase pattern.

Fig. 7: Data Exhaust Example

5.2 Paradata
This term is mostly used for metadata about learning resources, which
include education and research. In the context of education, paradata is
about educational resources. In the context of research methodology, it is
mostly used to create metadata records and schemas for the large datasets
used in extensive experiments, which are sometimes confidential; for
example, metadata records about the origin of a dataset and the timeline of
its collection and use.

6 Conclusion

We have discussed a few aspects of metadata in this report; we cannot
capture all the details given the breadth of the topic. Metadata is
thriving in every domain. It has good and bad aspects depending on how it
is used. The author regards metadata as a parasite and a matter of
perspective: we never know whether it is there, whether it is harmful or
not, what its types are, or what its usage and characteristics are.

References
1. History of Information (1968), http://www.historyofinformation.com/detail.php?entryid=4241
2. FGDC (2010), https://www.fgdc.gov/metadata
3. LePage, A.: Introduction to Metadata. Edited by Murtha Baca. Getty Research Institute (2009), http://www.getty.edu/research/conducting_research/standards/intrometadata
4. Pomerantz, J.: Metadata. MIT Press (2015)
