HTML documents
• often generated by applications
• consumed by humans only
• easy access: across platforms, across organizations
No application interoperability:
• HTML not understood by applications
• screen scraping brittle
• Database technology: client-server
• still vendor specific
Paradigm Shift on the Web
Semistructured Data
Origins:
Integration of heterogeneous sources
Data sources with non-rigid structure
• Biological data
• Web data
Object Exchange Model (OEM)
[Figure: OEM graph. The complex object &o1 has outgoing edges labeled paper, book, and paper, leading to objects &o12, &o24, and &o29. Further edges are labeled references, author, title, year, http, publisher, page, firstname, lastname, first, and last. The leaves are atomic objects such as "Serge", "Abiteboul", "Victor", "Vianu", 1997, 122, and 133.]
Database Management Systems, R. Ramakrishnan 6
Syntax for Semistructured Data
Bib: &o1 { paper: &o12 { … },
           book: &o24 { … },
           paper: &o29
             { author: &o52 "Abiteboul",
               author: &o96 { firstname: &o243 "Victor",
                              lastname: &o206 "Vianu" },
               title: &o93 "Regular path queries with constraints",
               references: &o12,
               references: &o24,
               pages: &o25 { first: &o64 122, last: &o92 133 }
             }
         }
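A minimal Python sketch of the OEM syntax above, offered as one possible encoding rather than a standard one. The class name `OemObject` and the method `children` are hypothetical. A complex object is modeled as a list of (label, object) pairs so that a label such as author can repeat; the contents of &o12 and &o24 are elided in the source and are left empty here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Atomic = Union[str, int]


@dataclass
class OemObject:
    """One OEM object: an object id plus an atomic or complex value."""
    oid: str
    # A complex value is a list of (label, object) pairs; labels may repeat.
    value: Union[Atomic, List[Tuple[str, "OemObject"]]] = field(default_factory=list)

    def children(self, label: str) -> List["OemObject"]:
        """Return all sub-objects reachable by an edge with this label."""
        if not isinstance(self.value, list):
            return []  # atomic objects have no outgoing edges
        return [child for (name, child) in self.value if name == label]


# Contents of &o12 and &o24 are elided ("…") in the source, so they stay empty.
o12 = OemObject("&o12")
o24 = OemObject("&o24")

o29 = OemObject("&o29", [
    ("author", OemObject("&o52", "Abiteboul")),
    ("author", OemObject("&o96", [
        ("firstname", OemObject("&o243", "Victor")),
        ("lastname", OemObject("&o206", "Vianu")),
    ])),
    ("title", OemObject("&o93", "Regular path queries with constraints")),
    ("references", o12),   # shared objects: the structure is a graph, not a tree
    ("references", o24),
    ("pages", OemObject("&o25", [
        ("first", OemObject("&o64", 122)),
        ("last", OemObject("&o92", 133)),
    ])),
])

bib = OemObject("&o1", [("paper", o12), ("book", o24), ("paper", o29)])
```

Following a label is then a list lookup: `o29.children("author")` returns both author objects, one atomic and one complex.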
Characteristics of Semistructured Data
Missing or additional attributes
Multiple attributes
Different types in different objects
Heterogeneous collections
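The characteristics listed above can be made concrete with a small Python sketch. The two records are illustrative only: the first follows the paper object from the OEM example, the second is a hypothetical record invented to show a missing and an additional attribute. The helper `label_types` is likewise an assumption, not part of any library.

```python
from collections import defaultdict

# Each record maps a label to a list of values, so a label can occur many times.
papers = [
    {   # author appears twice, once as a string and once as a nested object
        "author": ["Abiteboul", {"firstname": "Victor", "lastname": "Vianu"}],
        "title": ["Regular path queries with constraints"],
        "pages": [{"first": 122, "last": 133}],
    },
    {   # hypothetical record: no pages, but an extra year attribute
        "author": ["Hull"],
        "title": ["(hypothetical title)"],
        "year": [1997],
    },
]


def label_types(objects):
    """Collect, per label, the set of value types seen across all objects."""
    seen = defaultdict(set)
    for obj in objects:
        for label, values in obj.items():
            for value in values:
                seen[label].add(type(value).__name__)
    return dict(seen)


summary = label_types(papers)
```

Here `summary["author"]` contains both `str` and `dict`, which is the "different types in different objects" case; a relational schema would have to fix one type per column.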
Comparison with Relational Data
[Figure: a relation shown as a tree with one row node per tuple]
XML
HTML
XML
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>
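Unlike HTML, this markup can be consumed directly by an application. A short sketch using Python's standard `xml.etree.ElementTree` module follows; the elided part of the bibliography ("…") is simply left out of the string, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The book entry from the example; the elided entries ("…") are omitted.
doc = """
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>
"""

root = ET.fromstring(doc)

# Element names carry the meaning, so no screen scraping is needed.
authors = [a.text.strip() for a in root.iter("author")]
year = int(root.find("book/year").text)
publisher = root.find("book/publisher").text.strip()
```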
Document Type Descriptors
Shortcomings of DTDs
XML Schema
In XML format
Element names and types associated locally
Includes primitive data types (integers, strings, dates, etc.)
Supports value-based constraints (integers > 100)
User-definable structured types
Inheritance (extension or restriction)
Foreign keys
Element-type reference constraints
Sample XML Schema
<schema version="1.0" xmlns="http://www.w3.org/1999/XMLSchema">
  <element name="author" type="string" />
  <element name="date" type="date" />
  <element name="abstract">
    <type>
      …
    </type>
  </element>
  <element name="paper">
    <type>
      <attribute name="keywords" type="string" />
      <element ref="author" minOccurs="0" maxOccurs="*" />
      <element ref="date" />
      <element ref="abstract" minOccurs="0" maxOccurs="1" />
      <element ref="body" />
    </type>
  </element>
</schema>
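The occurrence constraints on paper can be checked by hand, as in the following Python sketch. It is not a schema validator: it ignores element order and data types, and it assumes that an omitted minOccurs or maxOccurs defaults to 1 and that "*" means unbounded. The table `PAPER_CHILDREN`, the function `occurrence_errors`, and both sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# (child name, minOccurs, maxOccurs) for <paper>; None stands for "*".
# Omitted bounds are assumed to default to 1.
PAPER_CHILDREN = [
    ("author", 0, None),
    ("date", 1, 1),
    ("abstract", 0, 1),
    ("body", 1, 1),
]


def occurrence_errors(paper):
    """Return the names of children whose count violates its bounds."""
    errors = []
    for name, low, high in PAPER_CHILDREN:
        count = len(paper.findall(name))
        if count < low or (high is not None and count > high):
            errors.append(name)
    return errors


ok = ET.fromstring(
    "<paper keywords='xml'><author>A</author><author>B</author>"
    "<date>2001-01-01</date><body>text</body></paper>"
)
bad = ET.fromstring("<paper><abstract/><abstract/><body/></paper>")
```

The first document satisfies every bound; the second is missing date and has one abstract too many.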
Issues:
• Distinguish between attributes and sub-elements?
• Should we conserve order?
XML Terminology
XML-Query Data Model
Describes XML data as a tree
Node ::= DocNode |
ElemNode |
ValueNode |
AttrNode |
NSNode |
PINode |
CommentNode |
InfoItemNode |
RefNode
http://www.w3.org/TR/query-datamodel/2/2001
elemNode : (QNameValue,
            {AttrNode},
            [ElemNode | ValueNode])
           → ElemNode
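The elemNode signature reads as a constructor taking a name, a set of attribute nodes, and an ordered list of child nodes. A minimal Python sketch under that reading follows; the class and function names are illustrative, only three of the node kinds are modeled, and the qualified name is simplified to a plain string.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class ValueNode:
    value: Union[str, bool, float]


@dataclass(frozen=True)
class AttrNode:
    name: str            # simplified: a plain string instead of a QNameValue
    value: ValueNode


@dataclass(frozen=True)
class ElemNode:
    name: str
    attributes: FrozenSet[AttrNode]                        # {AttrNode}: unordered set
    children: Tuple[Union["ElemNode", ValueNode], ...]     # [...]: ordered list


def elem_node(name, attributes, children):
    """Mirror of elemNode: (name, set of attributes, list of children) -> ElemNode."""
    return ElemNode(name, frozenset(attributes), tuple(children))


title = elem_node("title", [], [ValueNode("Foundations…")])
price = AttrNode("price", ValueNode("55"))
currency = AttrNode("currency", ValueNode("USD"))

book = elem_node("book", [price, currency], [title])
same_book = elem_node("book", [currency, price], [title])
```

Because attributes form a set and children form a list, `book == same_book` holds even though the attributes were given in a different order, while reordering children would produce a different node.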
XML Query Data Model
Attribute node:
<book price = "55" currency = "USD">
  <title> Foundations … </title>
  <author> Abiteboul </author>
  <author> Hull </author>
  <author> Vianu </author>
  <year> 1995 </year>
</book>
price2 = attrNode(price, string10)
string10 = valueNode(…) /* next */
currency3 = attrNode(currency, string11)
string11 = valueNode(…)
Value node:
ValueNode = StringValue |
BoolValue |
FloatValue …
XML Query Data Model
Example:
<book price = "55" currency = "USD">
  <title> Foundations … </title>
  <author> Abiteboul </author>
  <author> Hull </author>
  <author> Vianu </author>
  <year> 1995 </year>
</book>
price2 = attrNode(price, string10)
string10 = valueNode(stringValue("55"))
currency3 = attrNode(currency, string11)
string11 = valueNode(stringValue("USD"))
title4 = elemNode(title, string9)
string9 = valueNode(stringValue("Foundations…"))
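The same construction can be automated. The sketch below walks a parsed document and emits nested tuples named after the constructors in the example. The function `to_nodes` and the tuple encoding are assumptions made for illustration; text that follows a child element is ignored, and only a shortened version of the book element is used.

```python
import xml.etree.ElementTree as ET


def string_value_node(text):
    """valueNode(stringValue(text)) as a nested tuple."""
    return ("valueNode", ("stringValue", text))


def to_nodes(elem):
    """Translate an ElementTree element into elemNode/attrNode/valueNode tuples."""
    # Attributes form an unordered set of attrNode tuples.
    attrs = frozenset(
        ("attrNode", name, string_value_node(value))
        for name, value in elem.attrib.items()
    )
    # Children form an ordered list: leading text first, then sub-elements.
    children = []
    text = (elem.text or "").strip()
    if text:
        children.append(string_value_node(text))
    for child in elem:
        children.append(to_nodes(child))
    return ("elemNode", elem.tag, attrs, children)


# Shortened version of the book element from the example.
book = ET.fromstring(
    '<book price="55" currency="USD">'
    "<title>Foundations…</title><year>1995</year>"
    "</book>"
)
node = to_nodes(book)
```

In `node`, the price attribute appears as `("attrNode", "price", ("valueNode", ("stringValue", "55")))`, matching price2 and string10 in the example.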