
Metadata for Digital Libraries:

A Functional Approach

Cornell Digital Imaging Workshop


October 21, 1998

Sandra Payette
Digital Library Research Group
Cornell University
payette@cs.cornell.edu
Metadata
CREATOR: Plato
TITLE: The Republic

Metadata is structured data about data that imposes
order on a disordered information universe.

Image File    Storage
Image 1       cdrom 1
Image 2       cdrom 1
Image 3       cdrom 2

Access Control List
Many Types of Metadata

• Descriptive
• Structural
• Terms and conditions
• Administrative
• Content ratings
• Provenance
• Relationship
Basic Functions We Must Support

• Resource Discovery
• Access and Use
• Preservation and Administration
Resource Discovery:

Focus on Descriptive Metadata


Metadata for Resource Discovery

• Catalogs
– OPAC / MARC Records
• Indexes
– Structured descriptive records (e.g., Dublin Core)
– Abstracts
– Full-text surrogates (e.g., via OCR)
Challenges

• Impracticality of large-scale traditional
cataloging
– time consuming, labor intensive, special skills
– limited coverage - only “selected” items
• Problems with resource discovery
– full-text indexing ineffective (false hits, irrelevancy,
overload)
– full-text approaches not useful for non-textual data
(e.g., audio, video, executable programs)
One Solution:
Simple Descriptive Surrogates

• Easy to create
• Applicable across domains
• Applicable to different genres of objects
• Allows interoperability among robots,
indexers, and search clients
Dublin Core Element Set

• Good baseline descriptive record
• Can exist alongside other specialized metadata
• Common ground for discovery across disparate
resources
• No specialized skills required
• Flexibility through qualifiers

Source: http://www.purl.org/Metadata/dublin_core/
Dublin Core: 15 Elements
• Title: name given to the work by the author
• Author or Creator: person(s) responsible for the intellectual content
• Subject and Keywords: the topic of the work, keywords, or formal
classification schemes
• Description: textual description of the content (abstract, prose
describing an image, etc.)
• Publisher: the organization making the work available in its present form
• Other Contributor: person(s) other than the author who have made
significant contributions to the intellectual content
• Date: the date the work was made available
• Resource Type: category of the resource
• Format: data representation of the resource
• Resource Identifier: unique identification string (e.g. URL, URN, ISBN...)
• Source: object from which this object is derived (if applicable)
• Language: language of the intellectual content of the object
• Relation: relationship of the object to other objects or collections
• Coverage: spatial locations and temporal duration characteristics
• Rights Management: a pointer to a copyright notice, a rights management
statement, or a rights server
Dublin Core in HTML META Tags

<html>
<head>
<title>Cornell Digital Library Research Group</title>
<META name="DC.subject" content="digital library research">
<META name="DC.subject" content="networked object description">
<META name="DC.publisher" content="Cornell University">
<META name="DC.creator" content="Lagoze, Carl, lagoze@cs.cornell.edu.">
<META name="DC.creator" content="Payette, Sandra, payette@cs.cornell.edu.">
<META name="DC.title" content="Cornell Digital Library Research Group">
<META name="DC.date" content="1998-05-15">
<META name="DC.form" scheme="IMT" content="text/html">
<META name="DC.language" scheme="ISO639" content="en">
<META name="DC.identifier" scheme="URL"
content="http://www2.cs.cornell.edu/NCSTRL/CDLRG/cdlrg.htm">
</head>
<IMG SRC="/mydir/mysubdir/mypicture.gif" WIDTH=208 HEIGHT=216>
</html>

Source: http://www.w3.org/TR/REC-html40/
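
A harvester or indexer could read these embedded elements back out of a page. The following is a minimal Python sketch, assuming the elements follow the name="DC.element" convention shown above; the class and function names are illustrative and not part of any Dublin Core tool.

```python
from html.parser import HTMLParser


class DublinCoreMetaParser(HTMLParser):
    """Collect <META name="DC.*" content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        # Maps element name to a list of values, since elements may repeat.
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.lower().startswith("dc."):
            element = name[3:].lower()
            self.elements.setdefault(element, []).append(attr.get("content") or "")


def extract_dublin_core(html_text):
    """Return a dict such as {"subject": [...], "publisher": [...]}."""
    parser = DublinCoreMetaParser()
    parser.feed(html_text)
    return parser.elements


page = """<html><head>
<title>Cornell Digital Library Research Group</title>
<META name="DC.subject" content="digital library research">
<META name="DC.subject" content="networked object description">
<META name="DC.publisher" content="Cornell University">
</head></html>"""

record = extract_dublin_core(page)
print(record["subject"])
```

Repeated elements such as DC.subject are kept as lists, so no value is lost when an element occurs more than once.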
Warwick Framework

• Developed by Dublin Core community


• Broader framework to accommodate diverse
metadata schemes
• Encourages community-specific definition and
administration of metadata
• Modularity supports interoperability among:
– content providers
– catalogers and indexers
– automated resource discovery systems
Warwick Framework Container

[Figure: a container holding several packages. Simple package: a typed
metadata set. Packages shown: Dublin Core; Other Descriptive; Reference to
MARC, which points by URI to a package holding the MARC record.]
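
One way to picture the container idea is as a small data structure. This is a rough Python sketch under stated assumptions: the class names (MetadataSet, IndirectPackage, Container) and the example URI are hypothetical and do not come from the Warwick Framework specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class MetadataSet:
    """A simple package: a typed metadata set carried inside the container."""
    scheme: str
    fields: Dict[str, str]


@dataclass
class IndirectPackage:
    """A package that only references metadata stored elsewhere, by URI."""
    scheme: str
    uri: str


@dataclass
class Container:
    """Holds packages of different metadata schemes side by side."""
    packages: List[Union[MetadataSet, IndirectPackage]] = field(default_factory=list)

    def schemes(self):
        return [p.scheme for p in self.packages]

    def find(self, scheme):
        return [p for p in self.packages if p.scheme == scheme]


container = Container([
    MetadataSet("Dublin Core", {"creator": "Plato", "title": "The Republic"}),
    # Hypothetical URI, used only for illustration.
    IndirectPackage("MARC", "http://example.org/marc/record1"),
])
print(container.schemes())
```

Because each package carries its own scheme label, a client can pick out the packages it understands and ignore the rest.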
WWW Infrastructure
Evolving in this Direction

• Dublin Core submitted to IETF as RFC


– ftp://ftp.isi.edu/in-notes/rfc2413.txt
• Resource Description Framework (RDF)
– http://www.w3.org/RDF/
• Extensible Markup Language (XML)
– http://www.w3.org/XML/
Resource Description Framework
(RDF)

• Influenced by the Warwick Framework,


among others
• Enables interoperability between applications
that exchange metadata
• Mix and match of metadata elements from
different schemas
• An application of XML (transfer syntax)
A Simple RDF Model

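[Figure: directed graph. The resource www2.cs.cornell.edu/CDLRG/doc1 has
arcs labeled DC:Creator, DC:Publisher, and QCSchema:Rating. The rating arc
points to www.xxx.org/rate, which has the properties MyRating (value A) and
YourRating (value B).]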
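
The graph can be read as a set of (resource, property, value) statements. The sketch below is an illustration in plain Python, not an RDF library; the creator and publisher values are the ones used in this deck's XML example, and the helper name properties_of is an assumption made here.

```python
from collections import defaultdict

DOC = "www2.cs.cornell.edu/CDLRG/doc1"
RATING = "www.xxx.org/rate"

# Each tuple is one arc in the graph: (resource, property, value).
triples = [
    (DOC, "DC:Creator", "Sandy Payette"),
    (DOC, "DC:Publisher", "Cornell DLRG"),
    (DOC, "QCSchema:Rating", RATING),  # the value is itself a resource
    (RATING, "MyRating", "A"),
    (RATING, "YourRating", "B"),
]


def properties_of(resource, statements):
    """Group all property values asserted about one resource."""
    result = defaultdict(list)
    for subject, prop, value in statements:
        if subject == resource:
            result[prop].append(value)
    return dict(result)


doc_properties = properties_of(DOC, triples)
rating_properties = properties_of(RATING, triples)
print(doc_properties)
print(rating_properties)
```

Properties from different schemas (DC, QCSchema) sit side by side on the same resource, which is the mix-and-match behavior described above.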
RDF Expressed in XML

<?xml:namespace name=
"http://www.purl.org/Metadata/dublin_core/" as="DC">
<?xml:namespace name=
"http://www.w3.org/Schemas/RDF/" as="RDF">

<RDF:Serialization>
<RDF:Assertions
href="http://www2.cs.cornell.edu/CDLRG/doc1">
<DC:Creator>Sandy Payette</DC:Creator>
<DC:Publisher>Cornell DLRG </DC:Publisher>
</RDF:Assertions>
</RDF:Serialization>
[Callout on the DC namespace: Dublin Core Element Set]
RDF: Why is it important?

• Market demand for metadata deployment


• Software infrastructure will be ubiquitous (e.g. free in
browsers, servers, proxies, editors, etc.)
• RDF is a general-purpose framework that provides
structured, human-readable and machine-understandable
metadata for the web
• Allows stakeholder communities to independently
develop, maintain, and reuse vocabularies
Access and Use

Focus on Structural Metadata


Structural Metadata
• What is it? Data that...
– Defines structure within documents
– Aggregates images into meaningful entities
– Correlates document components to image files
– Organizes a collection of objects
• Where is it?
– ASCII text files in directories
– Relational databases
– Embedded in documents or surrogates (e.g. SGML)
First... A Data Model

[Figure: data model diagram. Components: Front (0:1), Table of Contents
(0:1), Chapter (1:N), and Index (0:1); each component is linked to Page
with cardinality 1:N.]

Data models mirror natural attributes and
relationships of real-world objects
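
The cardinalities in the data model can be written down as types. This is a minimal Python sketch, assuming a top-level Book entity (the root entity and all class and field names are assumptions made for illustration, as are the image file names).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    seqno: int
    image_file: str


@dataclass
class Component:
    kind: str  # "front", "contents", "chapter", or "index"
    pages: List[Page] = field(default_factory=list)


@dataclass
class Book:
    chapters: List[Component]                # 1:N
    front: Optional[Component] = None        # 0:1
    contents: Optional[Component] = None     # 0:1
    index: Optional[Component] = None        # 0:1

    def components(self):
        """Components that are present, in reading order."""
        ordered = [self.front, self.contents, *self.chapters, self.index]
        return [c for c in ordered if c is not None]

    def validate(self):
        """Check the 1:N constraints from the diagram."""
        if not self.chapters:
            raise ValueError("a book needs at least one chapter (1:N)")
        for component in self.components():
            if not component.pages:
                raise ValueError(f"{component.kind} needs at least one page (1:N)")

    def pages(self):
        return [p for c in self.components() for p in c.pages]


book = Book(
    front=Component("front", [Page(1, "img001.tif")]),
    chapters=[Component("chapter", [Page(2, "img002.tif"), Page(3, "img003.tif")])],
)
book.validate()
print([p.image_file for p in book.pages()])
```

Optional fields model the 0:1 relationships, and validate() enforces the 1:N ones, so a structurally incomplete document is caught before it is delivered.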
“Binding” Document Images with SGML

<!DOCTYPE EBIND PUBLIC "-//UC Berkeley//DTD ebind.dtd (ElectronicBinding (Ebind))//EN" [
<!ENTITY % birch PUBLIC "-//UC Berkeley//ENTITIES
Birch-tree fairy book (Page Images)//EN">
%birch;]>
<ebind type="book">
<front>
<page><image entityref="birch001" seqno="1" nativeno="i"></page>
<page><image entityref="birch002" seqno="2" nativeno="ii"></page>
<page><image entityref="birch003" seqno="3" nativeno="iii"></page>
<page><image entityref="birch004" seqno="4" nativeno="iv"></page>
<div0 type="titlepage">
<page><image entityref="birch005" seqno="5" nativeno="v"></page>
<page><image entityref="birch006" seqno="6" nativeno="vi"></page>
</div0>
<div0 type="introduction">
<head>Introductory note</head>
<page><image entityref="birch007" seqno="7" nativeno="vii"></page>
</div0>

Source: http://sunsite.berkeley.edu/Ebind/
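
A page-turning application needs the mapping from page numbers to image entities that this markup records. The sketch below pulls it out with a regular expression; that is a shortcut assumed here for brevity, and a production system would more likely use an SGML parser together with the Ebind DTD. The function name page_index is illustrative.

```python
import re

# Matches <image entityref="..." seqno="..." nativeno="..."> as written above.
IMAGE_TAG = re.compile(
    r'<image\s+entityref="([^"]+)"\s+seqno="(\d+)"\s+nativeno="([^"]*)"'
)


def page_index(ebind_text):
    """Return (sequence number, printed page number, image entity) tuples."""
    pages = []
    for entity, seqno, nativeno in IMAGE_TAG.findall(ebind_text):
        pages.append((int(seqno), nativeno, entity))
    return sorted(pages)


sample = """
<page><image entityref="birch001" seqno="1" nativeno="i"></page>
<page><image entityref="birch002" seqno="2" nativeno="ii"></page>
<page><image entityref="birch007" seqno="7" nativeno="vii"></page>
"""

index = page_index(sample)
for seqno, nativeno, entity in index:
    print(seqno, nativeno, entity)
```

The sequence number gives the scan order, while the native number preserves the page number printed in the original book.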
Finding Aids in SGML

• Encoded Archival Description (EAD)


– SGML markup of descriptive access tools
(inventories, registers, indexes, and guides)
– provides more detail about a collection than a
typical catalog record
– facilitates access - “drill down” into collection
– potential international standard
– maintained jointly by Library of Congress and
Society of American Archivists (SAA)

Source: http://www.loc.gov/rr/ead/eadhome.html
Preservation and Administration

Focus on Administrative Metadata
and Persistent Identifiers
Administrative Metadata

• Information for managing images… over time


– relocation
– migration (new formats)
– copyright tracking
– archiving of objects and services
• Where is it?
– File headers (to help prevent orphaned images)
– External databases (e.g., relational db)
– Separate files stored with images
Create a Preservation Audit Trail

Image File Attributes:
• formats
• versions
• compression

Image Attributes:
• resolution
• bit depth
• orientation

Process Data:
• creation date/time
• equipment used

Rights Management Data:
• expiration dates
• copyright info
• source statements
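
An audit trail entry could be kept as a structured record stored alongside each image. The following is a minimal Python sketch; the field names and every sample value are hypothetical and chosen only to mirror the attribute groups listed above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ImageAuditRecord:
    # Image file attributes
    file_format: str
    version: str
    compression: str
    # Image attributes
    resolution_dpi: int
    bit_depth: int
    orientation: str
    # Process data
    created: str
    equipment: str
    # Rights management data
    rights_expire: str
    copyright_info: str
    source_statement: str


# All values below are made up for illustration.
record = ImageAuditRecord(
    file_format="TIFF",
    version="6.0",
    compression="none",
    resolution_dpi=600,
    bit_depth=8,
    orientation="portrait",
    created="1998-10-21T09:30:00",
    equipment="flatbed scanner",
    rights_expire="2008-12-31",
    copyright_info="Copyright held by the owning institution",
    source_statement="Scanned from the original print volume",
)

# Serialize to text that can be saved as a separate file next to the image.
sidecar = json.dumps(asdict(record), indent=2)
print(sidecar)
```

Keeping the record in a plain-text file stored with the image is one of the storage options named on the Administrative Metadata slide; the same fields could equally live in a file header or a relational database.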
Persistent Identifiers

• Globally unique names


• Persistent … names are permanent, lasting
• Used in resolution services to locate the
object (locations change over time).

Unique Identifier: cnri.dlib/april97-payette
    Naming Authority: cnri.dlib
    Item Name: april97-payette

URL: http://www.somewebserver.org/somedirectory/somefile
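
The separation between a permanent name and a changeable location can be sketched as a lookup table. This Python example is only an illustration of the idea and does not reflect how any real resolution system is implemented; the Resolver class and the second URL are hypothetical.

```python
def split_identifier(identifier):
    """Split 'naming-authority/item-name' at the first slash."""
    naming_authority, sep, item_name = identifier.partition("/")
    if not sep or not naming_authority or not item_name:
        raise ValueError(f"not a valid identifier: {identifier!r}")
    return naming_authority, item_name


class Resolver:
    """Maps persistent identifiers to their current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        split_identifier(identifier)  # reject malformed names
        self._locations[identifier] = url

    def resolve(self, identifier):
        return self._locations[identifier]


resolver = Resolver()
name = "cnri.dlib/april97-payette"
resolver.register(name, "http://www.somewebserver.org/somedirectory/somefile")

# The object moves; only the resolver entry changes, not the name.
resolver.register(name, "http://www.somewebserver.org/newdirectory/somefile")

print(split_identifier(name))
print(resolver.resolve(name))
```

Citations and links that use the identifier keep working after the move, because only the resolver's record of the location was updated.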
Identifiers: Current Initiatives

• IETF Uniform Resource Names (URN)


– specification of URN framework
– requirements for resolution systems
– syntax definition
• Existing Systems
– CNRI’s Handle System
– OCLC PURLs
– DOI Initiative
Further reading
• IFLA: A Good List -
http://www.nlc-bnc.ca/ifla/II/metadata.htm
• Lynch et al.: CNI Resource Discovery White Paper -
http://www.cni.org/projects/nidr/nidr.html
• Lagoze: Resource Discovery in the Digital Age -
http://www.dlib.org/dlib/june97/06lagoze.html
• Payette: Persistent Identifiers, RLG DigiNews -
http://www.rlg.org/preserv/diginews/diginews22.html
• W3C: Metadata Overview -
http://www.w3.org/Metadata
