
XML Basics

Wednesday May 12, 1999
SD99
Copyright 1999 Elliotte Rusty Harold
elharo@metalab.unc.edu
http://metalab.unc.edu/xml/slides/

What is XML?
Extensible Markup Language
A syntax for documents
A Meta-Markup Language
A Structural and Semantic language, not a formatting language

Not just for Web pages

XML is a Meta Markup Language


Not like HTML, troff, LaTeX
Make up the tags you need as you need them

The tags you create can be documented in a Document Type Definition (DTD)
A meta syntax for domain-specific markup languages like MusicML, MathML, and CML

XML describes structure and semantics, not formatting


XML documents form a tree
Element and attribute names reflect the kind of the element

Formatting can be added with a style sheet

A Song Description in HTML


<dt>Hot Cop
<dd> by Jacques Morali, Henri Belolo, and Victor Willis
<ul>
<li>Producer: Jacques Morali
<li>Publisher: PolyGram Records
<li>Length: 6:20
<li>Written: 1978
<li>Artist: Village People
</ul>

A Song Description in XML


<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
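Because the element names carry the meaning, a program can pull out the title or the composers by name instead of guessing from layout. The following is a minimal sketch in Python using the standard library's ElementTree module; the choice of Python and the variable names are assumptions for illustration, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# The song description from the slide above.
SONG_XML = """<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>"""

# Parsing yields a tree whose root is the SONG element.
song = ET.fromstring(SONG_XML)

# Look up children by element name rather than by position.
title = song.findtext("TITLE")
composers = [c.text for c in song.findall("COMPOSER")]

# Iterating over an element visits its direct children in document order.
child_names = [child.tag for child in song]

print(title, composers, child_names)
```

The same lookups against the HTML version would have to rely on text such as "Producer:" inside list items.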

Style Sheets provide formatting


SONG {display: block}
TITLE {display: block; font-family: Helvetica, serif; font-size: 20pt; font-weight: bold}
COMPOSER {display: block; font-family: Times, Times New Roman, serif; font-size: 14pt; font-style: italic}
ARTIST {display: block; font-family: Times, Times New Roman, serif; font-size: 14pt; font-weight: bold; font-style: italic}
PUBLISHER {display: block; font-size: 14pt; font-family: Times, Times New Roman, serif}
LENGTH {display: block; font-family: Times, Times New Roman, serif; font-size: 14pt}
YEAR {display: block; font-family: Times, Times New Roman, serif; font-size: 14pt}

Attaching style sheets to documents


Processing Instruction
<?xml-stylesheet type="text/css" href="song.css"?>

Converter Program

What is XML used for?


Domain-Specific Markup Languages
Self-Describing Data
Interchange of Data Among Applications
Structured and Integrated Data

Domain-Specific Markup Languages


Non proprietary format
Don't pay for what you don't use

Self-Describing Data
Much data is lost due to format problems
XML is very simple

XML is self-describing
XML is well documented

<PERSON ID="p1100" SEX="M">
  <NAME>
    <GIVEN>Judson</GIVEN>
    <SURNAME>McDaniel</SURNAME>
  </NAME>
  <BIRTH>
    <DATE>21 Feb 1834</DATE>
  </BIRTH>
  <DEATH>
    <DATE>9 Dec 1905</DATE>
  </DEATH>
</PERSON>

Interchange of Data Among Applications


E-commerce
Syndication

Structured and Integrated Data


Can specify relationships between elements
Can assemble data from multiple sources

XML Applications
A specific markup language that uses the XML meta-syntax is called an XML application
Different XML applications have their own more constricted syntaxes and vocabularies within the broader XML syntax
Further syntax can be layered on top of this; e.g. data typing through DCDs or other schemas

Example XML Applications


Web Pages
Mathematical Equations
Music Notation
Vector Graphics
Metadata
and more

Mathematical Markup Language

Channel Definition Format


<?xml version="1.0"?>
<CHANNEL HREF="http://metalab.unc.edu/xml/index.html">
  <TITLE>Cafe con Leche</TITLE>
  <ITEM HREF="http://metalab.unc.edu/xml/books.html">
    <TITLE>Books about XML</TITLE>
  </ITEM>
  <ITEM HREF="http://metalab.unc.edu/xml/tradeshows.html">
    <TITLE>Trade shows and conferences about XML</TITLE>
  </ITEM>
  <ITEM HREF="http://metalab.unc.edu/xml/lists.htm">
    <TITLE>Mailing Lists dedicated to XML</TITLE>
  </ITEM>
</CHANNEL>

Classic Literature
The Complete Plays of Shakespeare
The Bible
The Koran
The Book of Mormon

Vector Graphics
Vector Markup Language (VML)
Internet Explorer 5.0
Microsoft Office 2000

Scalable Vector Graphics (SVG)

The Resource Description Framework (RDF)


Meta-data
Dublin Core
Better Web searching

An Example of RDF
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/DC/">
  <rdf:Description about="http://metalab.unc.edu/xml/">
    <dc:CREATOR>Elliotte Rusty Harold</dc:CREATOR>
    <dc:TITLE>Cafe con Leche</dc:TITLE>
  </rdf:Description>
</rdf:RDF>

XML for XML


XSL: The Extensible Stylesheet Language
DCD: The Document Content Description Schema Language
XLL: The Extensible Linking Language

XSL: The Extensible Stylesheet Language


XSL Transformations
XSL Formatting Objects

DCD: The Document Content Description Schema Language


Data Typing in XML is Weak

<MONTH>9</MONTH>
<DCD>
  <ElementDef Type="MONTH" Model="Data" Datatype="i1" Min="1" Max="12" />
</DCD>
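Without a schema layer such as DCD, the parser hands the content of MONTH to the program as plain text, so any typing has to be done by the application. The sketch below shows one way that check might look in Python; `read_month` is a hypothetical helper, and the 1 to 12 range simply mirrors the Min and Max values in the DCD example.

```python
import xml.etree.ElementTree as ET


def read_month(xml_text):
    """Return the MONTH content as an int in 1..12, or raise ValueError."""
    element = ET.fromstring(xml_text)
    if element.tag != "MONTH":
        raise ValueError("expected a MONTH element")
    # The parser only gives us a string; int() raises ValueError
    # if the text is not a whole number.
    value = int((element.text or "").strip())
    if not 1 <= value <= 12:
        raise ValueError(f"month out of range: {value}")
    return value


print(read_month("<MONTH>9</MONTH>"))
```

A document containing `<MONTH>13</MONTH>` or `<MONTH>September</MONTH>` is still well-formed XML; only this application-level check rejects it.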

XLL: The Extensible Linking Language


Any element can be a link
Links can be bi-directional
Links can be separated from the documents they connect

<footnote xlink:form="simple" href="footnote7.xml">7</footnote>

File Formats, In-house applications, and other behind the scenes uses
Microsoft Office 2000
Federal Express Web API
Netscape What's Related

Hello XML
<?xml version="1.0" standalone="yes"?>
<FOO>
Hello XML!
</FOO>

Plain ASCII or UTF-8 text
.xml is standard file extension
Any standard text editor will work

The XML Declaration


<?xml version="1.0" standalone="yes"?>

version attribute
required
always has the value 1.0

standalone attribute
yes
no

encoding attribute
UTF-8
ISO-8859-1

etc.

The FOO element


<FOO> Hello XML! </FOO>

Start tag <FOO>
Contents "Hello XML!"

End tag </FOO>

greeting.xml
<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING>

Style sheets
Separate from the XML document
Different Languages
Cascading Style Sheets Level 1 (CSS1)
Internet Explorer 5.0
Mozilla 5.0

Cascading Style Sheets Level 2 (CSS2)


Internet Explorer 5 (partial)
Mozilla 5.0 (partial)

Extensible Stylesheet Language (XSL)


Internet Explorer 5.0 (older draft, buggy)
LotusXSL, XT, Other non-browser converters

Document Style Semantics and Specification Language (DSSSL)

xml-stylesheet
Style sheets are attached via an xml-stylesheet processing instruction in the prolog
<?xml version="1.0" standalone="yes"?>
<?xml-stylesheet type="text/css" href="greeting.css"?>
<GREETING>Hello XML!</GREETING>
type attribute has the value text/css or text/xsl
href attribute is a URL to the stylesheet, possibly relative
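A processing instruction sits in the prolog, outside the root element, so a program that wants the style sheet has to look at the document's top-level nodes. The sketch below uses Python's standard `xml.dom.minidom` for this; `stylesheet_instructions` is a hypothetical helper name, and it returns the raw instruction data rather than parsing the type and href pseudo-attributes.

```python
from xml.dom import minidom

DOC = """<?xml version="1.0" standalone="yes"?>
<?xml-stylesheet type="text/css" href="greeting.css"?>
<GREETING>Hello XML!</GREETING>"""


def stylesheet_instructions(xml_text):
    """Return the data of every xml-stylesheet processing instruction in the prolog."""
    doc = minidom.parseString(xml_text)
    return [
        node.data
        for node in doc.childNodes  # top-level nodes: PIs, comments, root element
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        and node.target == "xml-stylesheet"
    ]


result = stylesheet_instructions(DOC)
print(result)
```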

Can also use non-browser converters like

greeting.css
GREETING {display: block; font-size: 24pt; font-weight: bold}

A larger example: Baseball statistics


Examine the data
Design a vocabulary for the data
Write a style sheet

Sample statistics
http://cbs.sportsline.com/u/baseball/mlb/stats.htm

Organizing the Data


XML documents are trees.
XML elements contain other elements as well as text

Within these limits there's more than one way to organize the data
Hierarchically
Relationally
Objects

What is the Root Element


The League?
The Season?
A custom Document element?

The Root Element


Choose SEASON for the root element

Everything else will be a descendant of SEASON


This is not the only possible choice
<?xml version="1.0"?>
<SEASON>
</SEASON>

What are the Immediate Children of The root?


Leagues?
Teams?
Players?
Games?

Child Elements
<?xml version="1.0"?>
<SEASON>
  <YEAR>
    1998
  </YEAR>
</SEASON>

White space in XML is not especially significant


<?xml version="1.0"?> <SEASON><YEAR>1998</YEAR></SEASON>
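"Not especially significant" deserves one caveat: a parser still reports the white space inside an element to the application, and it is the application that usually chooses to ignore it. The following Python sketch illustrates this with the two SEASON documents; the indentation used in the spaced version and the helper name `year_of` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SPACED = """<?xml version="1.0"?>
<SEASON>
  <YEAR>
    1998
  </YEAR>
</SEASON>"""

COMPACT = '<?xml version="1.0"?><SEASON><YEAR>1998</YEAR></SEASON>'


def year_of(xml_text):
    """Return (raw text, trimmed text) of the YEAR child of SEASON."""
    raw = ET.fromstring(xml_text).findtext("YEAR")
    return raw, raw.strip()


raw_spaced, year_spaced = year_of(SPACED)
raw_compact, year_compact = year_of(COMPACT)

# The raw strings differ, but once trimmed both documents carry the same year.
print(repr(raw_spaced), repr(raw_compact), year_spaced == year_compact)
```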

Leagues
Major league baseball is divided into two leagues
Each league has
a name
three divisions

Divisions
Each division has
name
4-6 teams

Teams
Each team has
Name
City

Players

Player Data
Each player has
First name
Last name

Position
Statistics

Player Batting Statistics


G    Games Played
GS   Games Started
AB   At Bats
R    Runs
H    Hits
2B   Doubles
3B   Triples
HR   Home Runs
RBI  Runs Batted In
SB   Stolen Bases
CS   Caught Stealing
SH   Sacrifice Hits
SF   Sacrifice Flies
Err  Errors
PB   Pitcher Balked
BB   Base on Balls (Walks)
SO   Strike Outs
HBP  Hit By Pitch

What does a player look like


Long names vs. short names

The Complete 1998 Major League


Long version

A Style Sheet
1998shortstats.xml
baseballstats.css
<?xml-stylesheet type="text/css" href="baseballstats.css"?>
styled1998shortstats.xml

Cascading Style Sheets


Partially supported by Mozilla and IE 5.0
Full W3C Recommendation

The Default Rule


Not every element needs a rule
The root element should be at least display: block
SEASON { font-size: 14pt; background-color: white; color: black; display: block}

A style rule for the YEAR element


Make it look like a title
YEAR { display: block; font-size: 32pt; font-weight: bold; text-align: center}

Style Rules for Division and League Names


LEAGUE_NAME { display: block; text-align: center; font-size: 28pt; font-weight: bold}

DIVISION_NAME { display: block; text-align: center; font-size: 24pt; font-weight: bold}

Alternate Style Rules for Division and League Names


LEAGUE_NAME, DIVISION_NAME { display: block; text-align: center; font-weight: bold}
LEAGUE_NAME {font-size: 28pt }
DIVISION_NAME {font-size: 24pt }

Team name and Team city must be one title

Style Rules for Teams

Must be inline elements


Previous and following must be block elements
TEAM_CITY { font-size: 20pt; font-weight: bold; font-style: italic}

TEAM_NAME { font-size: 20pt; font-weight: bold; font-style: italic}


TEAM, PLAYER {display: block}

Style Rules for Players


TEAM {display: table}
TEAM_CITY {display: table-caption}
TEAM_NAME {display: table-caption}
PLAYER {display: table-row}
SURNAME, GIVEN_NAME, POSITION, GAMES, GAMES_STARTED, AT_BATS, RUNS, HITS, DOUBLES, TRIPLES, HOME_RUNS, RBI, STEALS, CAUGHT_STEALING, SACRIFICE_HITS, SACRIFICE_FLIES, ERRORS, WALKS, STRUCK_OUT, HIT_BY_PITCH {display: table-cell}

Finished Style Sheet


SEASON {font-size: 14pt; background-color: white; color: black; display: block}
YEAR {display: block; font-size: 32pt; font-weight: bold; text-align: center}
LEAGUE_NAME {display: block; text-align: center; font-size: 28pt; font-weight: bold}
DIVISION_NAME {display: block; text-align: center; font-size: 24pt; font-weight: bold}
TEAM_CITY {font-size: 20pt; font-weight: bold; font-style: italic}
TEAM_NAME {font-size: 20pt; font-weight: bold; font-style: italic}
TEAM {display: block}
PLAYER {display: block}

Possible Extensions
There should be captions like "RBI" or "At Bats".
Derived numbers like batting averages are not included.
The titles are short. E.g. "1998" instead of "1998 Major League Baseball".
The document is so long it's hard to read. Something similar to IE5's collapsible outline view would be nice.
Pitcher stats should be separated from batter stats.
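CSS Level 1 cannot compute values, so a derived number such as a batting average has to come from a program or a transformation. As a rough sketch of the program route, the Python below computes hits divided by at bats from a PLAYER element; the element names follow the style sheet above, while the player and the numbers are made up for illustration and `batting_average` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# A made-up player; only the element names come from the slides.
PLAYER_XML = """<PLAYER>
  <GIVEN_NAME>Example</GIVEN_NAME>
  <SURNAME>Player</SURNAME>
  <AT_BATS>400</AT_BATS>
  <HITS>100</HITS>
</PLAYER>"""


def batting_average(player):
    """Return hits / at bats rounded to three places, or None with no at bats."""
    at_bats = int(player.findtext("AT_BATS", default="0"))
    hits = int(player.findtext("HITS", default="0"))
    if at_bats == 0:
        return None  # pitchers and bench players may have no at bats
    return round(hits / at_bats, 3)


average = batting_average(ET.fromstring(PLAYER_XML))
print(average)
```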

Possible Solutions
CSS Level 2
XSL
XSL + JavaScript

Well-formedness Rules

Open and close all tags
Empty tags end with />
There is a unique root element
Elements may not overlap
Attribute values are quoted
< and & are only used to start tags and entities
Only the five predefined entity references are used

Open and close all tags

Empty tags end with />


<BR/>, <HR/>, and <IMG/> instead of <BR>, <HR>, and <IMG>
Web browsers deal inconsistently with these
Can use <BR></BR> <HR></HR> <IMG></IMG> instead

There is a unique root element


One element completely contains all other elements of the document
In HTML files this is the HTML element

XML Declaration is not an element

<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING>

Elements may not overlap


If an element contains a start tag for an element, it must also contain the corresponding end tag
Empty elements may appear anywhere
Every non-root element has a parent element
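An XML parser must reject a document that breaks these rules instead of trying to repair it, which makes well-formedness easy to test. The sketch below wraps Python's standard ElementTree parser in a hypothetical `is_well_formed` helper; the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Properly nested: I closes before B does.
nested = "<P><B>bold <I>both</I></B> plain</P>"

# Overlapping: B is closed while I is still open.
overlapping = "<P><B>bold <I>both</B> italic</I></P>"

# Two top-level elements, so there is no unique root.
two_roots = "<GREETING>Hello</GREETING><GREETING>XML!</GREETING>"

print(is_well_formed(nested), is_well_formed(overlapping), is_well_formed(two_roots))
```

A browser would typically display the overlapping HTML equivalent without complaint; an XML parser stops at the first error.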

Attribute values are quoted


Good:
<A HREF="http://metalab.unc.edu/xml/">

Bad:
<A HREF=http://metalab.unc.edu/xml/>

< and & are only used to start tags and entities
Good:
<H1>O'Reilly &amp; Associates</H1>
Bad:
<H1> O'Reilly & Associates</H1>
Good:
<CODE>for (int i = 0; i &lt;= args.length; i++ ) { </CODE>

Bad:
<CODE>for (int i = 0; i <= args.length;
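When markup is generated by a program, the escaping is best left to a library routine. The following sketch uses `escape` from Python's standard `xml.sax.saxutils` module, which replaces `&`, `<` and `>`; building the element by string concatenation is only for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

text = "O'Reilly & Associates"
code = "for (int i = 0; i <= args.length; i++ ) { "

# escape() turns & into &amp;, < into &lt; and > into &gt;
heading = "<H1>" + escape(text) + "</H1>"
snippet = "<CODE>" + escape(code) + "</CODE>"

# Parsing the escaped markup gives back the original characters.
round_trip = ET.fromstring(heading).text

print(heading)
print(snippet)
```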

Only the five predefined entity references are used


Good:
&amp; &lt; &gt; &quot; &apos;

Bad:
&copy; &reg; &tm; &alpha; &eacute; &nbsp;

etc.
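The difference is easy to observe with a parser: the five predefined references resolve without any DTD, while an HTML-style reference such as `&copy;` is an error unless a DTD declares it. This is a small sketch with Python's standard ElementTree; `text_of` is a hypothetical helper, and the numeric character reference in the last line is one common workaround rather than something the slides prescribe.

```python
import xml.etree.ElementTree as ET


def text_of(xml_text):
    """Return the root element's text, or None if the document is rejected."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None


predefined = text_of("<P>&amp; &lt; &gt; &quot; &apos;</P>")
undeclared = text_of("<P>&copy; 1999</P>")

# A numeric character reference needs no declaration.
numeric = text_of("<P>&#169; 1999</P>")

print(predefined, undeclared, numeric)
```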

DTDs and Validity


A Document Type Definition describes the elements and attributes that may appear in a document
Validation compares a particular document against a DTD
Well-formedness is a prerequisite for validity

What is a DTD?
a list of the elements, tags, attributes, and entities contained in a document, and their relationship to each other
internal vs. external DTDs

The importance of validation


Ensures that data is correct before feeding it into a program
Ensures that a format is followed

Establishes what must be supported


Not all documents need to be valid; sometimes well-formed is enough

A DTD for greeting.xml


greeting.xml:
<?xml version="1.0"?>
<GREETING>
Hello XML!
</GREETING>

greeting.dtd:
<!ELEMENT GREETING (#PCDATA)>

Document Type Declarations


<?xml version="1.0"?>
<!DOCTYPE GREETING SYSTEM "greeting.dtd">
<GREETING>
Hello XML!
</GREETING>

specifies the root element


gives a URL for the DTD

Invalid Documents

Valid:
<GREETING> various random text but no markup </GREETING>

Invalid: anything else including
<GREETING>
  <sometag>various random text</sometag>
  <someEmptyTag/>
</GREETING>
or
<GREETING>
  <GREETING>various random text</GREETING>

Validating Tools
Command line programs like XJParse
Online validators
http://www.stg.brown.edu/service/xmlvalid/
http://www.cogsci.ed.ac.uk/%7Erichard/xml-check.html

Browsers

Element Declarations
Each element must be declared in a <!ELEMENT> declaration.
A <!ELEMENT> declaration gives the name and content model of the element
The content model uses a simple regular expression-like grammar to precisely specify what is and isn't allowed in an element

Content Specifications
ANY

#PCDATA
Sequences

Choices
Mixed Content

Modifiers
Empty

ANY
<!ELEMENT SEASON ANY>

A SEASON can contain any child element and/or raw text (parsed character data)

#PCDATA
<!ELEMENT YEAR (#PCDATA)>

Parsed Character Data; i.e. raw text, no markup

#PCDATA
Valid:
<YEAR>1999</YEAR>
<YEAR>99</YEAR>
<YEAR>1999 C.E.</YEAR>
<YEAR> The year of our Lord one thousand, nine hundred, and ninety-nine </YEAR>

Invalid:
<YEAR>
  <MONTH>January</MONTH>
  <MONTH>February</MONTH>
  <MONTH>March</MONTH>
  <MONTH>April</MONTH>
  <MONTH>May</MONTH>
  <MONTH>June</MONTH>
  <MONTH>July</MONTH>
  <MONTH>August</MONTH>
  <MONTH>September</MONTH>
  <MONTH>October</MONTH>
  <MONTH>November</MONTH>
  <MONTH>December</MONTH>
</YEAR>

Child Elements
To declare that a LEAGUE element must have a LEAGUE_NAME child:
<!ELEMENT LEAGUE (LEAGUE_NAME)>
<!ELEMENT LEAGUE_NAME (#PCDATA)>

Sequences
Separate multiple required child elements with commas; e.g.
<!ELEMENT SEASON (YEAR, LEAGUE, LEAGUE)>
<!ELEMENT LEAGUE (LEAGUE_NAME, DIVISION, DIVISION, DIVISION)>

One or More Children +


<!ELEMENT DIVISION_NAME (#PCDATA)>
<!ELEMENT DIVISION (DIVISION_NAME, TEAM+)>

Zero or More Children *


<!ELEMENT TEAM (TEAM_CITY, TEAM_NAME, PLAYER*)>
<!ELEMENT TEAM_CITY (#PCDATA)>
<!ELEMENT TEAM_NAME (#PCDATA)>
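To see why the content model grammar is described as regular expression-like, the DIVISION and TEAM models above can be hand-translated into ordinary regular expressions over the sequence of child element names. The Python sketch below does only that; it is an illustration of the idea, not a validating parser, and the `CONTENT_MODELS` table and `children_match_model` helper are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models: each child name is followed by one space,
# so "TEAM " cannot accidentally match the start of "TEAM_NAME ".
CONTENT_MODELS = {
    "DIVISION": re.compile(r"(DIVISION_NAME )(TEAM )+"),           # (DIVISION_NAME, TEAM+)
    "TEAM": re.compile(r"(TEAM_CITY )(TEAM_NAME )(PLAYER )*"),     # (TEAM_CITY, TEAM_NAME, PLAYER*)
}


def children_match_model(element):
    """Check an element's direct children against its content model."""
    model = CONTENT_MODELS[element.tag]
    child_names = "".join(child.tag + " " for child in element)
    return model.fullmatch(child_names) is not None


good_division = ET.fromstring(
    "<DIVISION><DIVISION_NAME>East</DIVISION_NAME><TEAM/><TEAM/></DIVISION>"
)
empty_division = ET.fromstring(
    "<DIVISION><DIVISION_NAME>East</DIVISION_NAME></DIVISION>"
)

print(children_match_model(good_division), children_match_model(empty_division))
```

The second division fails because `+` requires at least one TEAM, whereas a TEAM with no PLAYER children would pass because `*` allows zero.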

Zero or One Children ?

<!ELEMENT PLAYER (GIVEN_NAME, SURNAME, POSITION, GAMES, GAMES_STARTED,
  AT_BATS?, RUNS?, HITS?, DOUBLES?, TRIPLES?, HOME_RUNS?, RBI?, STEALS?,
  CAUGHT_STEALING?, SACRIFICE_HITS?, SACRIFICE_FLIES?, ERRORS?, WALKS?,
  STRUCK_OUT?, HIT_BY_PITCH?, WINS?, LOSSES?, SAVES?, COMPLETE_GAMES?,
  SHUT_OUTS?, ERA?, INNINGS?, EARNED_RUNS?, HIT_BATTER?, WILD_PITCHES?,
  BALK?, WALKED_BATTER?, STRUCK_OUT_BATTER?)>

Finished DTD

Choices
<!ELEMENT PAYMENT (CASH | CREDIT_CARD)>
<!ELEMENT PAYMENT (CASH | CREDIT_CARD | CHECK)>

Grouping With Parentheses


Parentheses combine several elements into a single element.

A parenthesized element can be nested inside other parentheses in place of a single element.
The parenthesized element can be suffixed with a plus sign, an asterisk, or a question mark.
<!ELEMENT dl (dt, dd)*>
<!ELEMENT ARTICLE (TITLE, (P | PHOTO | GRAPH | SIDEBAR | PULLQUOTE | SUBHEAD)*, BYLINE?)>

Mixed Content
Both #PCDATA and child elements in a choice
<!ELEMENT TEAM (#PCDATA | TEAM_CITY | TEAM_NAME | PLAYER)*>

#PCDATA must come first
#PCDATA cannot be used in a sequence

Empty elements
<!ELEMENT BR EMPTY>
<!ELEMENT IMG EMPTY>
<!ELEMENT HR EMPTY>

Internal DTDs
<?xml version="1.0"?>
<!DOCTYPE GREETING [
  <!ELEMENT GREETING (#PCDATA)>
]>
<GREETING>
Hello XML!
</GREETING>

Internal DTD Subsets


<?xml version="1.0"?>
<!DOCTYPE GREETING SYSTEM "greeting.dtd" [
  <!ELEMENT GREETING (#PCDATA)>
]>
<GREETING>
Hello XML!
</GREETING>

Internal declarations override external declarations

Programming with XML


Java works best
C, Perl, Python etc. can also be used
Unicode support is the biggest issue

SAX, the Simple API for XML


Event based
Programs can plug in different parsers
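In an event-based API the parser calls back into the program as it meets each start tag, run of text, and end tag, and the program keeps only what it needs. The sketch below uses Python's standard `xml.sax` binding of SAX to collect the TITLE elements from a shortened copy of the Channel Definition Format example; the `TitleCollector` class is a hypothetical handler written for this illustration.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every TITLE element as the parser reports events."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "TITLE":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "TITLE":
            self.titles.append("".join(self._buffer))
            self._in_title = False


CHANNEL_XML = b"""<?xml version="1.0"?>
<CHANNEL HREF="http://metalab.unc.edu/xml/index.html">
  <TITLE>Cafe con Leche</TITLE>
  <ITEM HREF="http://metalab.unc.edu/xml/books.html">
    <TITLE>Books about XML</TITLE>
  </ITEM>
</CHANNEL>"""

handler = TitleCollector()
xml.sax.parseString(CHANNEL_XML, handler)
print(handler.titles)
```

Because the handler never builds a tree, this style suits documents too large to hold in memory.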

The Document Object Model (DOM)
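Where SAX reports events, the DOM reads the whole document into a tree of node objects that a program can navigate and modify. The following is a small sketch using Python's standard `xml.dom.minidom`, a lightweight DOM implementation; the LANG attribute is made up to show a modification and does not appear in the slides.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0" standalone="yes"?><GREETING>Hello XML!</GREETING>'
)

# The document node has one root element, which holds a single text node.
root = doc.documentElement
root_name = root.tagName
greeting_text = root.firstChild.data

# The tree can be changed in memory and written back out as markup.
root.setAttribute("LANG", "en")  # hypothetical attribute, for illustration only
serialized = root.toxml()

print(root_name, greeting_text, serialized)
```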

To Learn More: Books


XML: Extensible Markup Language
IDG Books 1998 ISBN 0-76453-199-9

The XML Bible


IDG Books 1999 ISBN 0-76453-236-7

Questions?
