Prof. Dr. Dr. Lars Schmidt-Thieme, Institute for Computer Science, University of Freiburg, Germany, http://www.informatik.uni-freiburg.de/cgnm
Course on XML and Semantic Web Technologies, summer term 2005
name                      codes           examples
ASCII                     0–127           65 A
ISO-8859-1, ISO-LATIN-1   0–255           0–127 as ASCII, 196 Ä
ISO-8859-7                0–255           0–127 as ASCII, 225 α
Unicode                   0–(2^32 − 1)    0–255 as ISO-8859-1
Unicode is organized in 256 groups × 256 planes × 256 rows × 256 cells.
Plane 0 (codes 0–65535) is called the Basic Multilingual Plane (BMP).
Non-ISO-8859-1 characters are mapped to higher codes, e.g., α to 945.
Unicode
Unicode / Scripts
Unicode / Symbols
[Figure: characters → (coded character set) → character codes (natural numbers) → (character encoding scheme) → byte sequences]
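The two mapping steps can be tried out with Python's built-ins. This is only a sketch: the helper name `codes_and_bytes` and the sample characters are assumptions made for illustration, not part of the course material.

```python
def codes_and_bytes(ch, encoding):
    """Return (character code, byte sequence) for a single character.

    ord() plays the role of the coded character set (character -> number),
    str.encode() plays the role of the encoding scheme (number -> bytes).
    """
    return ord(ch), ch.encode(encoding)


if __name__ == "__main__":
    # Hypothetical sample: Latin A and Greek alpha in three encoding schemes.
    for ch in "Aα":
        for encoding in ("iso-8859-7", "utf-8", "utf-16-be"):
            code, data = codes_and_bytes(ch, encoding)
            print(ch, code, encoding, data.hex())
```

The same character code (945 for α) ends up as different byte sequences depending on the encoding scheme chosen.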
bit sequence                                            bytes   free bits        character codes
0.......                                                1       7                0x00–0x7f
110..... 10......                                       2       5 + 6 = 11       0x80–0x7ff
1110.... 10...... 10......                              3       4 + 2·6 = 16     0x800–0xffff
11110... 10...... 10...... 10......                     4       3 + 3·6 = 21     0x10000–0x1fffff
111110.. 10...... 10...... 10...... 10......            5       2 + 4·6 = 26     0x200000–0x3ffffff
1111110. 10...... 10...... 10...... 10...... 10......   6       1 + 5·6 = 31     0x4000000–0x7fffffff
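A minimal sketch of an encoder that follows these bit patterns, written only to illustrate the table. The function name `utf8_encode` is an assumption. Note that the current UTF-8 definition (RFC 3629) only permits sequences of up to 4 bytes and codes up to 0x10ffff; the 5- and 6-byte rows belong to the older, wider scheme shown above.

```python
def utf8_encode(code: int) -> bytes:
    """Encode one character code using the lead/continuation byte patterns."""
    if code < 0:
        raise ValueError("character codes are non-negative")
    if code < 0x80:
        return bytes([code])  # 0....... : one byte, 7 free bits
    # (number of bytes, lead-byte prefix, largest code that fits)
    layouts = [
        (2, 0b11000000, 0x7FF),
        (3, 0b11100000, 0xFFFF),
        (4, 0b11110000, 0x1FFFFF),
        (5, 0b11111000, 0x3FFFFFF),
        (6, 0b11111100, 0x7FFFFFFF),
    ]
    for n, prefix, limit in layouts:
        if code <= limit:
            tail = []
            for _ in range(n - 1):
                # each continuation byte 10...... carries the lowest 6 bits
                tail.append(0b10000000 | (code & 0b00111111))
                code >>= 6
            # the remaining high bits go into the lead byte
            return bytes([prefix | code] + tail[::-1])
    raise ValueError("code needs more than 31 bits")


if __name__ == "__main__":
    for c in (0x41, 0xC4, 945, 0x20AC, 0x10000):
        print(hex(c), utf8_encode(c).hex())
```

For codes that Python itself accepts, the result can be compared with `chr(code).encode("utf-8")`.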
Hierarchical URIs
Fragment identifiers
Fragment identifiers are used to identify parts of the resource identified by a URI.
Example:
http://www.informatik.uni-freiburg.de/xml/books.html#R03
<html>
<body>
<ul>
<li><a name="EE04">Rainer Eckstein, Silke Eckstein:
<em>XML und Datenmodellierung</em>, 2004.</a></li>
<li><a name="R03">Erik T. Ray:
<em>Learning XML</em>, 2003.</a></li>
</ul>
</body>
</html>
Figure 7: HTML document at http://www.informatik.uni-freiburg.de/xml/books.html.
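As a small illustration, the fragment identifier can be split off the example URI with Python's standard library. The wrapper name `split_fragment` is an assumption of this sketch.

```python
from urllib.parse import urldefrag


def split_fragment(uri):
    """Return (URI without fragment, fragment identifier)."""
    resource, fragment = urldefrag(uri)
    return resource, fragment


if __name__ == "__main__":
    uri = "http://www.informatik.uni-freiburg.de/xml/books.html#R03"
    print(split_fragment(uri))
```

The first component identifies the HTML document, the second one the anchor named R03 inside it.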
) [ ? query ]
Figure 8: A Base URI is the context for resolving relative URIs [RFC 2396].
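Resolution against a base URI can be sketched with `urllib.parse.urljoin`. The base URI is taken from the earlier example; the relative references (`slides.html`, `../index.html`) are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

# Base URI: the context in which relative references are interpreted.
BASE = "http://www.informatik.uni-freiburg.de/xml/books.html"


def resolve(relative, base=BASE):
    """Resolve a relative URI reference against the base URI."""
    return urljoin(base, relative)


if __name__ == "__main__":
    for ref in ("#R03", "slides.html", "../index.html", "/cgnm"):
        print(ref, "->", resolve(ref))
```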
URI schemes
URI schemes are managed by the Internet Assigned Numbers Authority (IANA).
Scheme Name   Description                          Reference   Type
ftp           File Transfer Protocol               RFC 1738    server-based
http          Hypertext Transfer Protocol          RFC 2616    server-based
mailto        Electronic mail address              RFC 2368    opaque
file          Host-specific file names             RFC 1738    server-based
pop           Post Office Protocol v3              RFC 2384    server-based
dav           dav                                  RFC 2518    server-based
tel           telephone                            RFC 2806    opaque
https         Hypertext Transfer Protocol Secure   RFC 2818    server-based
urn           Uniform Resource Names               RFC 2141    opaque
..            ..                                   ..          ..
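The difference between the two types shows up when a URI is split into its components: a server-based URI carries an authority part (`//host`), an opaque one does not. The sketch below uses that as a rough heuristic. It is not a faithful implementation of the RFC 2396 grammar (a `file:///...` URI with an empty host would be misclassified, for instance), and the mail address is a made-up example.

```python
from urllib.parse import urlsplit


def describe(uri):
    """Return (scheme, 'server-based' or 'opaque') using a simple heuristic."""
    parts = urlsplit(uri)
    # Heuristic: an authority component (netloc) indicates a server-based URI.
    kind = "server-based" if parts.netloc else "opaque"
    return parts.scheme, kind


if __name__ == "__main__":
    for uri in (
        "http://www.informatik.uni-freiburg.de/xml/books.html",
        "mailto:someone@example.org",  # hypothetical address
        "urn:isbn:0-395-36341-1",
    ):
        print(uri, describe(uri))
```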
[Figure: Uniform Resource Identifier (URI), comprising Uniform Resource Locator (URL), Uniform Resource Name (URN), and Uniform Resource Characteristics (URC)]
urn:isbn:0-395-36341-1
urn:newsml:reuters.com:20000206:IIMFFH05643_2000-02-06_17-54-01_L0615658
A book or a news item (identified by a URN) may be retrieved from different
locations (URLs).
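A URN has the general shape `urn:<namespace identifier>:<namespace-specific string>` (RFC 2141). The following sketch splits the two example URNs accordingly; the function name `parse_urn` is an assumption, and no namespace-specific validation (such as an ISBN check digit) is attempted.

```python
def parse_urn(urn):
    """Split a URN into (namespace identifier, namespace-specific string)."""
    # Split at the first two colons only; the rest may contain further colons.
    prefix, nid, nss = urn.split(":", 2)
    if prefix.lower() != "urn":
        raise ValueError("not a URN: " + urn)
    # The namespace identifier is case-insensitive, so normalize it.
    return nid.lower(), nss


if __name__ == "__main__":
    print(parse_urn("urn:isbn:0-395-36341-1"))
    print(parse_urn(
        "urn:newsml:reuters.com:20000206:IIMFFH05643_2000-02-06_17-54-01_L0615658"))
```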
URN Namespaces
Namespace   Value   Reference
ietf        1       RFC 2648
pin         2       RFC 3043
issn        3       RFC 3044
oid         4       RFC 3061
newsml      5       RFC 3085
oasis       6       RFC 3121
xmlorg      7       RFC 3120
publicid    8       RFC 3151
isbn        9       RFC 3187
nbn         10      RFC 3188
web3d       11      RFC 3541
mpeg        12      RFC 3614
mace        13      RFC 3613
fipa        14      RFC 3616
swift       15      RFC 3615
..          ..      ..
hexDigit
Every XML document consists of a prolog and a single element, called the root element.
document := prolog element ( Comment | PI | S )*
<?xml version="1.1"?>
<page/>
Figure 10: A minimal XML document with root element "page".
S ?>
Names
- start with a Unicode letter or _
  (: is also allowed, but is used for namespaces),
- may contain Unicode letters, Unicode digits, -, ., or ·.
A well-formed document requires
- that the start and end tags of each element match,
- that no attribute occurs twice in the same tag.
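These conditions can be checked mechanically by handing a document to an XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the helper name `is_wellformed` and the shortened sample documents are assumptions. The samples carry no XML declaration, so the parser treats them as XML 1.0.

```python
import xml.etree.ElementTree as ET


def is_wellformed(text):
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    samples = {
        "single root": "<page/>",
        "two roots": "<book/><book/>",
        "mis-nested": "<author><fn>Erik T.</fn><sn>Ray</author></sn>",
        "duplicate attribute": '<book author="a" author="b"/>',
    }
    for label, doc in samples.items():
        print(label, is_wellformed(doc))
```

The parser only reports whether it accepts the input; the exception message (not shown) says which rule was violated.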
<?xml version="1.1"?>
<book>
<author><fn>Rainer</fn><sn>Eckstein</sn></author>
<author><fn>Silke</fn><sn>Eckstein</sn></author>
<title>XML und Datenmodellierung</title>
<year>2004</year>
</book>
<book>
<author><fn>Erik T.</fn><sn>Ray</sn></author>
<title>Learning XML</title>
<year edition="2">2003</year>
</book>
Figure 11: More than one root element.
<?xml version="1.1"?>
<book>
<author><fn>Erik T.</fn><sn>Ray</author></sn>
<title>Learning XML</title>
<year edition="2">2003</year>
</book>
Figure 12: Mis-nested elements.
<?xml version="1.1"?>
<book author="Rainer Eckstein" author="Silke Eckstein">
<title>XML und Datenmodellierung</title>
<year>2004</year>
</book>
Figure 13: Same attribute occurs twice.
Element content
content := CharData? ( ( element | Reference | CDSect | PI | Comment ) CharData? )*
Character data
<?xml version="1.1"?>
<abstract>
x^2 = y has no real solution for y < 0.
But there are solutions for y = 0 & for y > 0.
</abstract>
Figure 14: Forbidden characters in character data.
<?xml version="1.1"?>
<abstract>
x^2 = y has no real solution for y &lt; 0.
But there are solutions for y = 0 &amp; for y &gt; 0.
</abstract>
Figure 15: Using references in character data.
References
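Predefined entity references: &lt; for <, &gt; for >, &amp; for &, &apos; for ', &quot; for ".
All other entities known from HTML (such as &auml;) are not predefined in XML.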
Custom entities can be defined in the document type declaration.
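When character data is generated by a program, the replacement of < and & by references is usually left to a library function. A minimal sketch with `xml.sax.saxutils` from Python's standard library; the wrapper names are assumptions.

```python
from xml.sax.saxutils import escape, unescape


def to_character_data(text):
    """Replace &, < and > by the corresponding predefined entity references."""
    return escape(text)


def from_character_data(data):
    """Inverse operation: turn the three references back into characters."""
    return unescape(data)


if __name__ == "__main__":
    raw = "But there are solutions for y = 0 & for y > 0."
    print(to_character_data(raw))
```

Note that `escape` handles only &, < and > by default; quotation marks matter in attribute values, not in character data.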
CDATA sections
CDATA sections are typically used for longer text containing < or &.
CDATA sections are flat, i.e., there is no possibility to structure them with elements
(as < or & are interpreted literally).
<?xml version="1.1"?>
<abstract>
x^2 = y has no real solution for y &#x3c; 0.
But there are solutions for y = 0 &#x26; for y &#x3e; 0.
</abstract>
Figure 16: Using numeric character references.
<?xml version="1.1"?>
<abstract><![CDATA[
x^2 = y has no real solution for y < 0.
But there are solutions for y = 0 & for y > 0.
]]></abstract>
Figure 17: Using a CDATA-section.
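Entity references, numeric character references and CDATA sections are three notations for the same character data. The sketch below parses one shortened variant of each and collects the resulting text; the variable and function names are assumptions, and the XML declaration is left out so that the parser treats the input as XML 1.0.

```python
import xml.etree.ElementTree as ET

VARIANTS = {
    "entity references": "<abstract>y &lt; 0 &amp; y &gt; 0</abstract>",
    "numeric references": "<abstract>y &#x3c; 0 &#x26; y &#x3e; 0</abstract>",
    "CDATA section": "<abstract><![CDATA[y < 0 & y > 0]]></abstract>",
}


def parsed_texts(variants=VARIANTS):
    """Parse every variant and return the text content the application sees."""
    return {name: ET.fromstring(doc).text for name, doc in variants.items()}


if __name__ == "__main__":
    for name, text in parsed_texts().items():
        print(name, "->", text)
```

After parsing, an application can no longer tell which of the three notations was used.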
Attribute values
<?xml version="1.1"?>
<book abstract="Discusses meaning of "wellformed"">
<author>John Doe</author>
<title>About wellformedness</title>
</book>
Figure 18: Literal usage of attribute delimiter.
<?xml version="1.1"?>
<book abstract='Discusses meaning of "wellformed"'>
<author>John Doe</author>
<title>About wellformedness</title>
</book>
Figure 19: Using different attribute delimiters.
<?xml version="1.1"?>
<book abstract="Discusses meaning of &quot;wellformed&quot;">
<author>John Doe</author>
<title>About wellformedness</title>
</book>
Figure 20: Using references in attribute values.
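Both remedies (switching the delimiter, or using references) are implemented by `xml.sax.saxutils.quoteattr`, which picks whichever fits the value. A small sketch; the helper name `book_tag` is an assumption.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def book_tag(abstract):
    """Build an empty <book> element with a safely quoted abstract attribute."""
    # quoteattr returns the value including its delimiters.
    return "<book abstract=%s/>" % quoteattr(abstract)


if __name__ == "__main__":
    value = 'Discusses meaning of "wellformed"'
    tag = book_tag(value)
    print(tag)
    # Round trip: the parser recovers the original attribute value.
    print(ET.fromstring(tag).get("abstract") == value)
```

If the value contains only double quotes, the function switches to single-quote delimiters; if it contains both kinds, it keeps double-quote delimiters and writes the inner double quotes as `&quot;`.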
Comments
<?xml version="1.1"?>
<!-- list is not complete yet ! -->
<books>
<!-- yet to be ordered -->
<book>
<author><fn>Rainer</fn><sn>Eckstein</sn></author>
<author><fn>Silke</fn><sn>Eckstein</sn></author>
<title>XML und Datenmodellierung</title>
<year><!-- look up year of publication --></year>
</book>
</books>
<?xml version="1.1"?>
<book>
<author><fn>Rainer</fn><sn>Eckstein</sn></author>
<author><fn>Silke</fn><sn>Eckstein</sn></author>
<title>XML und Datenmodellierung</title>
<year <!-- edition="1" -->>2004</year>
</book>
Figure 22: Comments in tags are not allowed.
<?xml version="1.1"?>
<books>
<!-- 2004 ---------------------------------------- -->
<book>
<author><fn>Rainer</fn><sn>Eckstein</sn></author>
<author><fn>Silke</fn><sn>Eckstein</sn></author>
<title>XML und Datenmodellierung</title>
<year>2004</year>
</book>
</books>
Figure 23: -- is not allowed in comments.
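A program that writes comments can guard against this error before emitting them. The sketch below checks the two textual conditions of the XML comment syntax (no `--` inside, no `-` directly before the closing `-->`); the function names are assumptions.

```python
def comment_text_ok(text):
    """True if text may appear between <!-- and --> in a well-formed document."""
    return "--" not in text and not text.endswith("-")


def make_comment(text):
    """Wrap text in comment delimiters, refusing text that would break them."""
    if not comment_text_ok(text):
        raise ValueError("comment text must not contain '--' or end with '-'")
    return "<!--" + text + "-->"


if __name__ == "__main__":
    print(make_comment(" yet to be ordered "))
    print(comment_text_ok(" 2004 ---------------------------------------- "))
```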
Processing Instructions
PI := <? Name ( S Char* )? ?>
Character encoding schemata are specified by the name they are registered with
at IANA (http://www.iana.org/assignments/character-sets), e.g.,
US-ASCII
ISO-8859-1
ISO-10646-UCS-2 or csUnicode (UCS2)
ISO-10646-UCS-4 or csUCS4 (UCS4)
UTF-8
UTF-16
...
<?xml version="1.1"?>
<page>
Grüß Gott !
</page>
Figure 24: Non-wellformed document (assumed to be ISO-8859-1 coded).
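Why such a document is rejected can be reproduced with a parser: without an encoding declaration the bytes are read as UTF-8, and ISO-8859-1 bytes for non-ASCII letters are not valid UTF-8. The sketch below is an illustration under assumptions: the sample text with umlaut and sharp s, the helper name `parses`, and the use of version 1.0 in the added declaration are all chosen for this example.

```python
import xml.etree.ElementTree as ET

# Sample document body, stored as ISO-8859-1 bytes (assumed sample text).
LATIN1_BODY = "<page>Grüß Gott !</page>".encode("iso-8859-1")

# Same bytes, but preceded by a declaration that names the encoding.
DECLARED = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + LATIN1_BODY


def parses(data):
    """Return True if the parser accepts the byte string."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print("without declaration:", parses(LATIN1_BODY))
    print("with declaration:   ", parses(DECLARED))
    print(ET.fromstring(DECLARED).text)
```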
Language Attribute
meaning                           source
German                            ISO
German, Swiss variant             ISO
German, German variant            ISO
English                           ISO
US English                        ISO
British English                   ISO
Klingon                           ISO
German, traditional orthography   IANA
German, orthography of 1996       IANA
..                                ..
<?xml version="1.1"?>
<page>
<p xml:lang="de">Guten <s>Morgen</s>!</p>
<p xml:lang="en">Good <s>morning</s>!</p>
<table>
<tr><td>USD</td><td>0</td><td>1</td><td>...</td></tr>
<tr><td>EUR</td><td>0</td><td>0.839818</td><td>...</td></tr>
</table>
</page>
Figure 26: Language attribute.
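An application can read the language attribute like any other attribute. With Python's `xml.etree.ElementTree`, the reserved `xml:` prefix is expanded to the namespace name `http://www.w3.org/XML/1998/namespace`. The sketch below uses a shortened copy of the document; the function name `languages` is an assumption.

```python
import xml.etree.ElementTree as ET

# ElementTree writes qualified attribute names as {namespace}local-name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = """<page>
<p xml:lang="de">Guten <s>Morgen</s>!</p>
<p xml:lang="en">Good <s>morning</s>!</p>
</page>"""


def languages(xml_text):
    """Return (language code, text content) for every <p> element."""
    root = ET.fromstring(xml_text)
    return [(p.get(XML_LANG), "".join(p.itertext())) for p in root.iter("p")]


if __name__ == "__main__":
    for lang, text in languages(DOC):
        print(lang, text)
```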