
White Paper on XML Basics

XML Basics

Document Control

Change Record

Date        Author                                        Version   Change Reference
04-Apr-09   Anoosha Burlakanti, Preethi Phani Mummaleti   1.0.0     Initial Document



Reviewers

Name                  Position
Krishna Mohan Adavi

Distribution

Copy No.   Name             Location
1          Library Master   Project Library
2                           Project Manager
3
4

Note To Holders:
If you receive an electronic copy of this document and print it out, please write your name
on the equivalent of the cover page, for document control purposes.
If you receive a hard copy of this document, please write your name on the front cover, for
document control purposes.

Table of Contents:

1. Introduction to XML
2. What is XML?
3. Why XML?
4. Difference between XML and HTML
5. Characteristics of XML
6. XML Tree
7. XML Syntax Rules
8. XML Elements
9. XML Attributes
10. Well-formed XML versus Valid XML
11. XML Schema
12. References

Extensible Markup Language


1. Introduction to XML:
Markup languages evolved from early, private company and government forms into Standard Generalized
Markup Language (SGML), Hypertext Markup Language (HTML), and eventually into XML. SGML can seem
complex, and HTML (which was really just an element set) was just not powerful enough to identify
information. XML is designed as an easy-to-use and easy-to-extend markup language.
The Extensible Markup Language (XML) came into existence as a result of an attempt to facilitate the
sharing of information (data) across different information systems working on different technology
platforms, via the internet. XML is a simplified subset of Standard Generalized Markup Language
(SGML). It is a markup language much like HTML, but it was designed to carry data, not to display data.
XML tags are not predefined; you must define your own tags. XML is designed to be self-descriptive, and it
can use a DTD (Document Type Definition) to formally describe the data.
XML is a complement to HTML. It is important to understand that XML is not a replacement for HTML. In
the future development of the Web, it is most likely that XML will be used to structure and describe the
Web data, while HTML will be used to format and display the same data.
XML was designed to transport and store data. HTML was designed to display data.

2. What is XML?
XML is a software- and hardware-independent tool for carrying information.

XML stands for Extensible Markup Language.
XML is a markup language much like HTML.
XML was designed to carry data, not to display data.
XML tags are not predefined. You must define your own tags.
XML is designed to be self-descriptive.
XML is a W3C Recommendation.

XML (Extensible Markup Language) is a general-purpose specification for creating custom markup
languages. It is classified as an extensible language, because it allows the user to define the mark-up
elements. You can create content and mark it up with delimiting tags, making each word, phrase, or chunk
into identifiable, sortable information.
XML is recommended by the World Wide Web Consortium (W3C). It is a fee-free open standard. The
recommendation specifies lexical grammar and parsing requirements.

3. Why XML?
XML was created so that richly structured documents could be used over the web. The only viable
alternatives, HTML and SGML, are not practical for this purpose.
HTML comes bound with a set of semantics and does not provide arbitrary structure.
SGML provides arbitrary structure, but is too difficult to implement just for a web browser. Full SGML
systems solve large, complex problems that justify their expense. Viewing structured documents sent over
the web rarely carries such justification.
This is not to say that XML can be expected to completely replace SGML. While XML was designed to
deliver structured content over the web, some of the very features it omits to make this practical make
SGML a more satisfactory solution for the creation and long-term storage of complex documents. In many
organizations, filtering SGML to XML will be the standard procedure for web delivery.
XML was designed for ease of use with Standard Generalized Markup Language (SGML). Its goal is to enable
SGML to be served, received, and processed beyond what is now possible with HTML.

4. Difference between XML and HTML


XML is not a replacement for HTML.

XML was designed to transport and store data, with focus on what data is.
HTML was designed to display data, with focus on how data looks.
HTML is about displaying information, while XML is about carrying information.

5. Characteristics of XML
1) XML was created to structure, store, and transport information.
The following example is a note to Anoosha from Preethi, stored as XML:
<note>
  <to>Anoosha</to>
  <from>Preethi</from>
  <heading>Urgent</heading>
  <body>Please call me!</body>
</note>
The above example is self-descriptive. It has sender and receiver information, and it also has a heading
and a message body.
This XML document does not do anything. It is just pure information wrapped in tags.
2) XML is Just Plain Text
XML is nothing special. It is just plain text. Software that can handle plain text can also handle XML.
However, XML-aware applications can handle the XML tags specially. The functional meaning of the tags
depends on the nature of the application.
3) With XML You Can Invent Your Own Tags
The tags in the example above (like <to> and <from>) are not defined in any XML standard. These tags
are "invented" by the author of the XML document.
That is because the XML language has no predefined tags.
The tags used in HTML (and the structure of HTML) are predefined. HTML documents can only use tags
defined in the HTML standard (like <p>, <h1>, etc.).
XML allows authors to define their own tags and their own document structure.
4) XML is Not a Replacement for HTML
It is important to understand that XML is not a replacement for HTML but a complement to HTML. In most
web applications, XML is used to transport data, while HTML is used to format and display the data.
5) XML is a W3C Recommendation
XML became a W3C Recommendation on 10 February 1998.
6) XML is Everywhere
XML is now as important for the Web as HTML was to the foundation of the Web. It is one of the most
common tools for data transmission between all sorts of applications, and is becoming more and
more popular in the area of storing and describing information.

6. XML Tree

XML documents form a tree structure that starts at "the root" and branches to "the leaves".

An Example XML Document

Consider the example below:
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
  <to>Anoosha</to>
  <from>Preethi</from>
  <heading>Urgent</heading>
  <body>Please call me!</body>
</note>
The first line is the XML declaration. It defines the XML version (1.0) and the encoding used (ISO-8859-1
= Latin-1/West European character set).
The next line describes the root element of the document:
<note>
The next 4 lines describe 4 child elements of the root (to, from, heading, and body):
  <to>Anoosha</to>
  <from>Preethi</from>
  <heading>Urgent</heading>
  <body>Please call me!</body>
And finally the last line defines the end of the root element:
</note>
XML documents must contain a root element. This element is "the parent" of all other elements.
The elements in an XML document form a document tree. The tree starts at the root and branches to the
lowest level of the tree.
All elements can have sub elements (child elements):
<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>
The terms parent, child, and sibling are used to describe the relationships between elements. Parent
elements have children. Children on the same level are called siblings (brothers or sisters).
All elements can have text content and attributes (just like in HTML).
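The parent, child, and sibling relationships described above can be explored from a program. The following is a minimal sketch in Python, assuming the standard-library xml.etree.ElementTree module as the parser (this paper does not prescribe any particular parser); the helper name child_tags is purely illustrative.

```python
import xml.etree.ElementTree as ET

# The note document from the example above (XML declaration omitted for brevity).
NOTE_XML = """<note>
  <to>Anoosha</to>
  <from>Preethi</from>
  <heading>Urgent</heading>
  <body>Please call me!</body>
</note>"""


def child_tags(parent):
    """Return the tag names of the direct children of an element.

    The returned elements are siblings of each other.
    """
    return [child.tag for child in parent]


root = ET.fromstring(NOTE_XML)   # the root element is the parent of all others
print(root.tag)                  # note
print(child_tags(root))          # ['to', 'from', 'heading', 'body']
print(root.find("to").text)      # Anoosha
```

Iterating over an element yields only its direct children, which is why the root of the note reports exactly four siblings.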

Example:

[Figure: tree diagram of one book from the XML below]

<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>
The root element in the example is <bookstore>. All <book> elements in the document are contained
within <bookstore>.
Each <book> element has 4 children: <title>, <author>, <year>, and <price>.
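To show how a program might walk such a tree, here is a short sketch in Python using the standard-library xml.etree.ElementTree module. It is an illustration under assumptions, not part of the original example: the function name summarize_books is hypothetical, and only two of the three books are included to keep the sample short.

```python
import xml.etree.ElementTree as ET

BOOKSTORE_XML = """<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>"""


def summarize_books(xml_text):
    """Return (category, title, price) for every <book> under the root."""
    root = ET.fromstring(xml_text)
    summary = []
    for book in root.findall("book"):
        category = book.get("category")        # attribute on <book>
        title = book.findtext("title")         # text content of a child element
        price = float(book.findtext("price"))  # text stays a string until converted
        summary.append((category, title, price))
    return summary


for row in summarize_books(BOOKSTORE_XML):
    print(row)
```

Attributes are read with get(), while child element text is read with findtext(); all values arrive as strings, so numeric fields such as the price have to be converted by the application.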

7. XML Syntax Rules


The syntax rules of XML are very simple and logical. The rules are easy to learn, and easy to use.
1) All XML Elements Must Have a Closing Tag
In HTML, you will often see elements that don't have a closing tag:
<p>This is a paragraph
<p>This is another paragraph
In XML all elements must have a closing tag:

<p>This is a paragraph</p>
<p>This is another paragraph</p>
Note: You might have noticed from the previous example that the XML declaration did not have a closing
tag. This is not an error. The declaration is not an element of the XML document, so it has no closing tag.
2) XML Tags are Case Sensitive
XML elements are defined using XML tags.
XML tags are case sensitive. With XML, the tag <Letter> is different from the tag <letter>.
Opening and closing tags must be written with the same case:
<Message>This is incorrect</message>
<message>This is correct</message>
Note: "Opening and closing tags" are often referred to as "Start and end tags". Use whatever you prefer.
It is exactly the same thing.
3) XML Elements Must Be Properly Nested
In XML, all elements must be properly nested within each other:
<b><i>This text is bold and italic</b></i> (this is wrong)
<b><i>This text is bold and italic</i></b> (this is correct)

4) XML Documents Must Have a Root Element
XML documents must contain one element that is the parent of all other elements. This element is called
the root element.
<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>
5) XML Attribute Values Must Be Quoted
XML elements can have attributes in name/value pairs just like in HTML.
In XML the attribute value must always be quoted. Consider the two XML documents below.
<note date=12/11/2007>
  <to>Anoosha</to>
  <from>Preethi</from>
</note>
(This is wrong because the date attribute in the note element is not quoted.)

<note date="12/11/2007">
  <to>Anoosha</to>
  <from>Preethi</from>
</note>
(This is correct.)

6) Entity References
Some characters have a special meaning in XML.
If you place a character like "<" inside an XML element, it will generate an error because the parser
interprets it as the start of a new element. For example, this will generate an XML error:
<emp_sal>if salary < 1000 then</emp_sal>
To avoid this error, replace the "<" character with an entity reference:
<emp_sal>if salary &lt; 1000 then</emp_sal>
There are 5 predefined entity references in XML:

&lt;      <    less than
&gt;      >    greater than
&amp;     &    ampersand
&apos;    '    apostrophe
&quot;    "    quotation mark

7) Comments in XML
The syntax for writing comments in XML is similar to that of HTML.
<!-- This is a comment -->
8) White-space is Preserved in XML
HTML collapses multiple white-space characters into one single white-space. With XML, the white-space in a
document is not truncated.
9) XML Stores New Line as LF
In Windows applications, a new line is normally stored as a pair of characters: carriage return (CR) and
line feed (LF). The character pair bears some resemblance to the typewriter actions of setting a new line.
In UNIX applications, a new line is normally stored as a LF character. Older Macintosh applications use
only a CR character to store a new line. XML stores a new line as LF: an XML parser passes every line
break to the application as a single LF character.
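Rules 6, 8, and 9 above can be observed with a few lines of code. The following sketch uses Python with the standard-library modules xml.sax.saxutils and xml.etree.ElementTree; the choice of Python and of these modules is an assumption made for illustration, and the <msg> element is a made-up name.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "if salary < 1000 then"

# Rule 6: escape() replaces &, < and > with their entity references.
xml_text = "<emp_sal>" + escape(raw_text) + "</emp_sal>"
print(xml_text)                   # <emp_sal>if salary &lt; 1000 then</emp_sal>

# The parser turns the entity reference back into the original character.
element = ET.fromstring(xml_text)
print(element.text == raw_text)   # True

# Rule 8: white space inside an element is preserved, not collapsed.
spaced = ET.fromstring("<msg>Hello     world</msg>")
print(repr(spaced.text))          # 'Hello     world'

# Rule 9: a Windows-style CR LF pair is reported by the parser as a single LF.
windows = ET.fromstring("<msg>line one\r\nline two</msg>")
print(repr(windows.text))         # 'line one\nline two'
```

Escaping is only needed when XML is assembled by hand as text, as done here; libraries that serialize an element tree normally perform it automatically.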

8. XML Elements
An XML element is everything from (including) the element's start tag to (including) the element's end
tag.
An element can contain other elements, simple text or a mixture of both. Elements can also have
attributes.
Elements are the most common form of markup. Delimited by angle brackets, most elements identify the
nature of the content they surround. Some elements may be empty, in which case they have no content.
If an element is not empty, it begins with a start-tag, <element>, and ends with an end-tag, </element>.
<note date="12/11/2007">
  <to>Anoosha</to>
  <from>Preethi</from>
  <heading>Urgent</heading>
  <body>Please call me!</body>
</note>
In the example above, <note> has element content because it contains other elements. <to>, <from>,
<heading>, and <body> have text content because they contain text.
In the example above only <note> has an attribute (date="12/11/2007").
XML Naming Rules
The following are the naming rules that one must follow for elements:
1. Names can contain letters, numbers, and other characters
2. Names cannot start with a number or punctuation character
3. Names cannot start with the letters xml (or XML, or Xml, etc)
4. Names cannot contain spaces
Any name can be used; no words are reserved.
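The naming rules above can be tried out with a small helper. This is a sketch only, written in Python with the standard-library xml.etree.ElementTree module: the function is_usable_element_name is a hypothetical name, it checks the "xml" prefix and the space rule explicitly, and it leaves the remaining character rules to the parser by asking it to parse an empty element with the candidate name.

```python
import xml.etree.ElementTree as ET


def is_usable_element_name(name):
    """Check a candidate element name against the naming rules listed above."""
    # Rule 3: names starting with xml (in any letter case) are not allowed.
    if name.lower().startswith("xml"):
        return False
    # Rule 4: names cannot contain spaces (or other white space).
    if any(ch.isspace() for ch in name):
        return False
    # Rules 1 and 2: let the parser decide whether <name/> is acceptable.
    try:
        element = ET.fromstring("<" + name + "/>")
    except ET.ParseError:
        return False
    return element.tag == name


for candidate in ["note", "emp_sal", "1note", "-note", "xmlNote", "my note"]:
    print(candidate, "->", is_usable_element_name(candidate))
```

Delegating to the parser avoids re-implementing the full character tables of the XML specification, at the cost of rejecting names with a colon, which the parser treats as an undeclared namespace prefix.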

9. XML Attributes
Attributes provide additional information about elements.
Attributes are name-value pairs that occur inside start-tags after the element name. For example,
<div class="preface"> is a div element with the attribute class having the value preface. In XML, all
attribute values must be quoted.
XML Elements vs. Attributes
[Figure: anatomy of an XML file]

XML attributes are normally used to describe XML elements. Consider these two examples:
<note date="12/11/2007">
  <to>Anoosha</to>
  <from>Preethi</from>
  <heading>Urgent</heading>
  <body>Please call me!</body>
</note>

<note>
  <date>12/11/2007</date>
  <to>Anoosha</to>
  <from>Preethi</from>
  <heading>Urgent</heading>
  <body>Please call me!</body>
</note>
In the first example date is an attribute. In the second example date is an element. Both examples provide
the same information.
There are no rules about when to use attributes and when to use elements. In HTML attributes may be
handy, but in XML one should try to avoid them, as long as the same information can be expressed using
elements.
The reasons for preferring elements over attributes are:
Attributes cannot contain multiple values (elements can)
Attributes are not expandable (for future changes)
Attributes cannot describe structures (like child elements can)
Attributes are more difficult to manipulate by program code
Attribute values are not easy to test against a DTD
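The difference between the two note examples can be seen from the reading side. The sketch below is in Python with the standard-library xml.etree.ElementTree module; the function name note_date and the expanded <day>/<month>/<year> structure are assumptions introduced for illustration (including the day/month order), not part of the original examples.

```python
import xml.etree.ElementTree as ET

AS_ATTRIBUTE = '<note date="12/11/2007"><to>Anoosha</to><from>Preethi</from></note>'
AS_ELEMENT = "<note><date>12/11/2007</date><to>Anoosha</to><from>Preethi</from></note>"


def note_date(xml_text):
    """Return the note's date whether it is stored as an attribute or an element."""
    note = ET.fromstring(xml_text)
    if "date" in note.attrib:
        return note.get("date")     # attribute: always a single string
    return note.findtext("date")    # element: text content of <date>


print(note_date(AS_ATTRIBUTE))      # 12/11/2007
print(note_date(AS_ELEMENT))        # 12/11/2007

# An element can later be expanded into a structure; an attribute cannot.
# The day/month/year split (and its order) is a hypothetical extension.
EXPANDED = "<note><date><day>12</day><month>11</month><year>2007</year></date></note>"
date = ET.fromstring(EXPANDED).find("date")
parts = {child.tag: child.text for child in date}
print(parts)                        # {'day': '12', 'month': '11', 'year': '2007'}
```

Both forms carry the same information, but only the element form can grow child elements without changing how existing readers locate the date.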

10. Well-formed XML versus Valid XML


XML with correct syntax is "Well Formed" XML.
XML validated against a DTD is "Valid" XML.
Well-formed XML is XML that follows all the rules of XML: proper element naming, nesting, attribute
naming, and so on.
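A well-formedness check can be sketched by simply asking a parser to read the text. The example below uses Python and the standard-library xml.etree.ElementTree module, which is an assumption for illustration; is_well_formed is a hypothetical helper name, and the samples are the correct and incorrect snippets from the syntax rules in section 7.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    "missing closing tag": "<p>This is a paragraph",
    "mismatched case": "<Message>This is incorrect</message>",
    "bad nesting": "<b><i>This text is bold and italic</b></i>",
    "unquoted attribute": "<note date=12/11/2007><to>Anoosha</to></note>",
    "correct": '<note date="12/11/2007"><to>Anoosha</to></note>',
}

for label, text in samples.items():
    print(label, "->", is_well_formed(text))
```

This only tests well-formedness. The parser used here does not validate a document against a DTD or schema, so a True result says nothing about validity.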
Validation is checking your document's structure against rules for your elements and how you defined
child elements for each parent element. You define these rules in a Document Type Definition (DTD) or in
a schema. This validation requires you to create your DTD or schema, and then reference the DTD or
schema file within your XML files.
A "Valid" XML document is a "Well Formed" XML document, which also conforms to the rules of a
Document Type Definition (DTD):
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE note SYSTEM "Note.dtd">
<note>
<to>Anoosha</to>
<from>Preethi</from>
<heading>Urgent</heading>
<body>Please call me!</body>
</note>
The DOCTYPE declaration in the example above is a reference to an external DTD file. The content of the
file is shown below:
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
When the same declarations are placed inside the XML document instead of a separate file, they are
wrapped in <!DOCTYPE note [ ... ]>.

XML DTD

DTD stands for Document Type Definition.
Parties that exchange data in the form of XML documents need to agree on the structure of the document.
These parties create their own markup languages based on XML.

The purpose of a DTD is to define the structure of an XML document. It defines the structure with a list of
legal elements.
An XML DTD allows computers to check that each component of a document occurs in a valid place within
the document. For example, it allows computers to check that users do not accidentally enter a third-level
heading without first having a second-level heading.
The DTD can be internal or external. An internal DTD is one where the XML document has the
DTD inline, whereas an external DTD is one where the document instance is separated from the formal
definition of elements.
Consider the example below:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE note SYSTEM "Note.dtd">
<note>
  <to>Anoosha</to>
  <from>Preethi</from>
  <heading>Urgent</heading>
  <body>Please call me!</body>
</note>
The DOCTYPE declaration in the example above is a reference to an external DTD file. The content of the
file is shown below:
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
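To make concrete what the note DTD checks, the following sketch hand-codes the same structural rules in Python. It is not a DTD validator: the standard-library xml.etree.ElementTree parser used here does not validate against DTDs, so the function conforms_to_note_model (a hypothetical name) simply mirrors the content model note (to,from,heading,body) and the text-only (#PCDATA) children.

```python
import xml.etree.ElementTree as ET

# Mirrors the DTD content model: note (to,from,heading,body)
EXPECTED_CHILDREN = ["to", "from", "heading", "body"]


def conforms_to_note_model(xml_text):
    """Hand-written stand-in for the structural checks the note DTD expresses."""
    root = ET.fromstring(xml_text)
    if root.tag != "note":
        return False
    # The children must appear exactly once each, in the declared order.
    if [child.tag for child in root] != EXPECTED_CHILDREN:
        return False
    # (#PCDATA) means text only: the children must not contain elements.
    return all(len(child) == 0 for child in root)


valid = ("<note><to>Anoosha</to><from>Preethi</from>"
         "<heading>Urgent</heading><body>Please call me!</body></note>")
wrong_order = ("<note><from>Preethi</from><to>Anoosha</to>"
               "<heading>Urgent</heading><body>Please call me!</body></note>")
missing_body = "<note><to>Anoosha</to><from>Preethi</from><heading>Urgent</heading></note>"

print(conforms_to_note_model(valid))         # True
print(conforms_to_note_model(wrong_order))   # False
print(conforms_to_note_model(missing_body))  # False
```

A real validating parser derives these checks from the DTD itself, so the rules do not have to be repeated in application code as they are in this sketch.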

11. XML Schema


W3C supports an XML-based alternative to DTD, called XML Schema. An XML Schema describes the
structure of an XML document.
An XML Schema:
defines elements that can appear in a document
defines attributes that can appear in a document
defines which elements are child elements
defines the order of child elements
defines the number of child elements
defines whether an element is empty or can include text
defines data types for elements and attributes
defines default and fixed values for elements and attributes

XML Schemas are much more powerful than DTDs, and an XML Schema is itself written in XML:
<xs:element name="note">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="to" type="xs:string"/>
      <xs:element name="from" type="xs:string"/>
      <xs:element name="heading" type="xs:string"/>
      <xs:element name="body" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
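Because an XML Schema is itself XML, it can be read with the same tools as any other XML document. The sketch below, in Python with the standard-library xml.etree.ElementTree module, lists the child elements that the schema declares for note. It does not validate anything, and two things are assumptions added for the sake of a complete document: the enclosing xs:schema element with the W3C XML Schema namespace, which the fragment above omits, and the helper name declared_children.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"   # W3C XML Schema namespace

# The fragment from the text, wrapped in an xs:schema element (added here).
SCHEMA_XML = f"""<xs:schema xmlns:xs="{XS}">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="from" type="xs:string"/>
        <xs:element name="heading" type="xs:string"/>
        <xs:element name="body" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def declared_children(schema_text, element_name):
    """Return (name, type) pairs declared in the xs:sequence of one element."""
    schema = ET.fromstring(schema_text)
    namespaces = {"xs": XS}
    path = (f"xs:element[@name='{element_name}']"
            "/xs:complexType/xs:sequence/xs:element")
    return [(e.get("name"), e.get("type")) for e in schema.findall(path, namespaces)]


for name, type_name in declared_children(SCHEMA_XML, "note"):
    print(name, type_name)
```

The order of the returned pairs follows the xs:sequence, which is the same order the schema requires of the child elements in a note document.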

12. References:
www.w3schools.com
www.wikipedia.com
www.ibm.com
