Training module for XML and DTD

Prepared by Santanu Nayak, 15/07/2011

XML (Extensible Markup Language) has emerged as the leading standard for data interchange between applications and between organizations. In this XML training class, attendees learn the core fundamentals of XML and its related technologies, such as XSLT and DTD.

Prior knowledge of HTML and/or relational databases is helpful but not necessary.

To learn how XML and its related technologies, such as DTD and XSLT, function.
To master the core syntax of XML and DTD.
To learn the fundamentals of XSL.

1. Introduction
    1.1 What is XML?
    1.2 The Difference Between XML and HTML

2. Writing XML
    2.1 XML Tree
    2.2 XML Syntax Rules
    2.3 Rules for writing XML
    2.4 Elements, attributes, and values
        2.4.1 Element
        2.4.2 Attribute
        2.4.3 Value
    2.5 Declaring the XML version
        2.5.1 Version declaration
        2.5.2 Encoding declarations
    2.6 Creating the root element
    2.7 Nesting Element
    2.8 Writing comments
    2.9 Writing symbols and special character

3. DTD (Document Type Definition) fundamentals
    3.1 What is DTD and what is the role of DTDs?
    3.2 Internal DTD
    3.3 External DTD
    3.4 Type of declarations in DTD
        3.4.1 Element declarations
        3.4.2 Attribute List Declarations
        3.4.3 Entity declarations
    3.5 PCDATA
    3.6 CDATA
    3.7 Special Character (Entities)

4. Relation between XML and DTD and Validation
    4.1 Validation (Parsing)
        4.1.1 Well Formed XML
        4.1.2 Valid XML Documents

1. Introduction
1.1 What is XML?
XML was designed to transport and store data, whereas HTML was designed to display data.

XML stands for Extensible Markup Language.
XML is a markup language much like HTML.
XML was designed to carry data, not to display data.
XML tags are not predefined. You must define your own tags.
XML is designed to be self-descriptive.
XML is a W3C Recommendation.

1.2 The Difference Between XML and HTML
It is important to understand that XML is not a replacement for HTML. In most web applications, XML is used to transport data, while HTML is used to format and display the data. The best description of XML is this: XML is a software- and hardware-independent tool for carrying information.

XML and HTML were designed with different goals:
XML was designed to transport and store data, with a focus on what the data is.
HTML was designed to display data, with a focus on how the data looks.

HTML is about displaying information, while XML is about carrying information.

2. Writing XML
2.1 XML Tree
XML documents form a tree structure that starts at "the root" and branches to "the leaves".

XML documents must contain a root element. This element is "the parent" of all other elements. The elements in an XML document form a document tree. The tree starts at the root and branches to the lowest level of the tree. All elements can have sub elements (child elements):

    <root>
        <child>
            <subchild>.....</subchild>
        </child>
    </root>

The terms parent, child, and sibling are used to describe the relationships between elements. Parent elements have children. Children on the same level are called siblings (brothers or sisters). All elements can have text content and attributes (just like in HTML).
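The parent/child structure described above can be explored programmatically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name walk and the sample text inside <subchild> are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same shape as the <root>/<child>/<subchild> example above.
DOC = "<root><child><subchild>text</subchild></child></root>"

def walk(element, depth=0):
    """Return (depth, tag) pairs for an element and all of its descendants."""
    rows = [(depth, element.tag)]
    for child in element:  # iterating over an element yields its direct children
        rows.extend(walk(child, depth + 1))
    return rows

root = ET.fromstring(DOC)
tree_rows = walk(root)
for depth, tag in tree_rows:
    print("  " * depth + tag)  # indent each tag by its level in the tree
```

The indentation of the printed tags mirrors the nesting of the elements in the document.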

2.2 XML Syntax Rules
The syntax rules of XML are very simple and logical. The rules are easy to learn, and easy to use.

1. All XML Elements Must Have a Closing Tag (in HTML, some elements do not have to have a closing tag)
Example:
    <p>This is a paragraph</p>
    <p>This is another paragraph</p>

2. XML Tags are Case Sensitive
Example:
    Wrong:   <Message>This is incorrect</message>
    Correct: <message>This is correct</message>

3. XML Elements Must be Properly Nested
Example:
    Wrong:   <b><i>This text is bold and italic</b></i>
    Correct: <b><i>This text is bold and italic</i></b>

4. XML Documents Must Have a Root Element
Example:
    <root>
        <child>
            <subchild>.....</subchild>
        </child>
    </root>

5. XML Attribute Values Must be Quoted
Example:
    Wrong:
    <note date=12/11/2007>
        <to>Tove</to>
        <from>Jani</from>
    </note>
    Correct:
    <note date="12/11/2007">
        <to>Tove</to>
        <from>Jani</from>
    </note>

6. Entity References
Some characters have a special meaning in XML. A "<" inside element content is read as the start of a new tag, so it must be written as an entity reference.
Example:
    Wrong:   <message>if salary < 1000 then</message>
    Correct: <message>if salary &lt; 1000 then</message>

7. Comments in XML
The syntax for writing comments in XML is similar to that of HTML.
    <!-- This is a comment -->

2.3 Rules for writing XML
There are nine basic rules for building good XML:
1. All XML must have a root element.
2. All tags must be closed.
3. All tags must be properly nested.
4. Tag names have strict limits.
5. Tag names are case sensitive.
6. Tag names cannot contain spaces.
7. Attribute values must appear within quotes ("").
8. White space is preserved.
9. HTML tags should be avoided (optional).
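Several of these rules can be checked mechanically, because a conforming parser refuses any document that breaks them. The sketch below assumes Python's standard xml.etree.ElementTree module; the function name is_well_formed and the sample labels are illustrative only.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it raises ParseError."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Snippets taken from the syntax rules discussed in this module.
samples = {
    "properly nested": "<b><i>This text is bold and italic</i></b>",
    "badly nested": "<b><i>This text is bold and italic</b></i>",
    "case mismatch": "<Message>This is incorrect</message>",
    "unquoted attribute": "<note date=12/11/2007></note>",
    "raw less-than": "<message>if salary < 1000 then</message>",
    "escaped less-than": "<message>if salary &lt; 1000 then</message>",
}

results = {label: is_well_formed(xml) for label, xml in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well formed' if ok else 'rejected'}")
```

Only the "properly nested" and "escaped less-than" samples are accepted; the other four are rejected.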

2.4 Elements, attributes, and values
2.4.1 Element
"Element" is the generic name for a pair of tags together with everything between them. An opening tag looks like <element>, while a closing tag has a slash placed before the element's name: </element>. All information that belongs to an element must be contained between its opening and closing tags.

An element can contain:
other elements
text
attributes
or a mix of all of the above

Example:
    <bookstore>
        <book category="CHILDREN">
            <title>Harry Potter</title>
            <author>J K. Rowling</author>
            <year>2005</year>
            <price>29.99</price>
        </book>
        <book category="WEB">
            <title>Learning XML</title>
            <author>Erik T. Ray</author>
            <year>2003</year>
            <price>39.95</price>
        </book>
    </bookstore>

In the example above, <bookstore> and <book> have element content, because they contain other elements. <book> also has an attribute (category="CHILDREN"). <title>, <author>, <year>, and <price> have text content because they contain text.

XML elements must follow these naming rules:
Names can contain letters, numbers, and other characters.
Names cannot start with a number or punctuation character.
Names cannot start with the letters xml (or XML, or Xml, etc).
Names cannot contain spaces.

Apart from these rules, any name can be used; no words are reserved.
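The naming rules above can be approximated with a regular expression. This is a deliberately simplified sketch: it accepts only ASCII letters, digits, underscore, hyphen and period, whereas real XML names allow many more Unicode characters. The function name is_acceptable_name is hypothetical.

```python
import re

# Simplified pattern: first character is a letter or underscore,
# the rest may be letters, digits, underscore, period or hyphen.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

def is_acceptable_name(name):
    """Apply the four naming rules listed above (ASCII-only approximation)."""
    if name.lower().startswith("xml"):  # names starting with xml are not allowed
        return False
    return NAME_RE.fullmatch(name) is not None

checks = {n: is_acceptable_name(n)
          for n in ["book", "price-usd", "2nd", "xmlData", "first name"]}
print(checks)
```

Here "book" and "price-usd" pass, while "2nd" (starts with a number), "xmlData" (starts with xml) and "first name" (contains a space) fail.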

2.4.2 Attribute
Attributes are used to specify additional information about the element. It may help to think of attributes as a means of specializing generic elements to fit your needs. An attribute for an element appears within the opening tag. If an attribute may take one of several values, the value must be specified. For example, if a tag had a color attribute then the value would be: red, blue, green, etc.

The syntax for including an attribute in an element is:
    <element attributeName="value">

In this example we use a made-up XML element named "friend" that has an optional attribute age.
Example:
    <friend age="23">Samantha</friend>
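To show how an application sees the attribute and the text of the friend element, here is a small sketch using Python's standard xml.etree.ElementTree module. The attribute name city and the fallback string "unknown" are invented for the illustration.

```python
import xml.etree.ElementTree as ET

friend = ET.fromstring('<friend age="23">Samantha</friend>')

name = friend.text                   # text content between the tags
age = friend.get("age")              # attribute values are always strings
city = friend.get("city", "unknown") # fallback for an attribute that is absent

print(name, age, city)
```

Note that the attribute value comes back as the string "23", not as a number; converting it is left to the application.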

2.4.3 Value
Attributes may have a default value or a fixed value specified. A default value is automatically assigned to the attribute when no other value is specified.
Example:
    <note day="14" month="07" year="2011" to="Crest" from="Springer" heading="Book" body="Happy weekend!"></note>

2.5 Declaring the XML version
2.5.1 Version declaration
The version declaration looks like a processing instruction and is information for the application. XML documents start with an XML version declaration (XML declaration), which specifies the version of XML being used. Version 1.0 is the one in near-universal use; XML 1.1 also exists but is rarely needed. Although the XML declaration is optional in XML 1.0, the W3C specification recommends including it. It looks like the following:
    <?xml version="1.0"?>

2.5.2 Encoding declarations
Encoding declarations inform the processor which character encoding the document uses (e.g. UTF-8, a Unicode encoding whose first 128 characters are identical to ASCII). All XML parsers must support the UTF-8 and UTF-16 encodings of Unicode. However, XML parsers may support a larger set. The encoding is given inside the XML declaration, after the version:
    <?xml version="1.0" encoding="UTF-8"?>
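The effect of the encoding declaration is easiest to see with a non-ASCII character. In the sketch below (Python's standard xml.etree.ElementTree module assumed; the element name and the word "Café" are invented), the same letter é is stored as one byte in ISO-8859-1 and as two bytes in UTF-8, and the parser uses the declaration to decode each correctly.

```python
import xml.etree.ElementTree as ET

# The letter é is the single byte E9 in ISO-8859-1 ...
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Caf\xe9</name>'
# ... and the two bytes C3 A9 in UTF-8.
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?><name>Caf\xc3\xa9</name>'

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

print(latin1_text == utf8_text)  # both decode to the same four characters
```

If the declared encoding did not match the bytes actually stored, the text would be decoded wrongly or rejected, which is why the declaration must describe the file as it was really saved.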

2.6 Creating the root element
The goal of this step is to create a root element, which contains all other elements that you create. It is the most important element, as it contains the rest of the document and becomes synonymous with your document type. It cannot be repeated. Markup documents, whether HTML, XML or SGML, employ a root element, which contains all other elements. The root element usually describes the focus or function of the document. The HTML element in HTML is a good root element because it reveals the name of the markup language.
Example:
    <?xml version="1.0" standalone="yes"?>
    <HELP>
    ...
    </HELP>


2.7 Nesting Element
To think of nesting in plain English, follow this rule: elements opened first must be closed last. That means that the root element, the first element in an XML document, must also be closed last. Nested elements, ones that occur in the middle of the document, must be closed before those that came before them.

When an element appears within another element, the inner element is said to be "nested". If an element is nested within another element, then it is surrounded, protected, or encapsulated by the outer element. Nesting also keeps order in an XML document. Much like parentheses in a math problem, elements must be closed in the reverse of the order in which they were opened. This means that an element nested inside another element must end before the outer element does.

Below are two example XML documents (A and B). Document A has a small problem: <name> is opened inside <number> but closed after it. Document B is properly nested.
Example A (wrong):
    <phonebook>
        <number>
            <name>
        </number>
            </name>
    </phonebook>
Example B (correct):
    <phonebook>
        <number>
            <name>
            </name>
        </number>
    </phonebook>

2.8 Writing comments
Comments in XML are nearly identical to comments in HTML. Comments help you understand code you wrote years before, and help another developer review it. A comment tag has two parts, the part starting the comment and the part ending it.
1. Start the comment with <!--
2. Write whatever comment you would like; just make sure you don't nest comments within other comments.
3. Close the comment with -->

Tips:
Comments cannot come at the very top of your document. In XML, only the XML declaration can come first: <?xml version="1.0"?>
Comments may not be nested one inside another. You must close your first comment before you open a second.
Comments cannot occur within tags, e.g. <tag <!-- comment --> ></tag>.
Never use two dashes (--) anywhere but at the beginning and end of your comments.
Anything in comments is effectively invisible to the XML parser, so if you comment out part of a document, be very careful that what remains is still valid and well-formed.
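The tips above can be observed with a parser. The following sketch assumes Python's standard xml.etree.ElementTree module, which by default drops comments when building the tree; the helper name parses and the phone number are made up.

```python
import xml.etree.ElementTree as ET

DOC = "<phonebook><!-- internal note --><number>555-0100</number></phonebook>"
root = ET.fromstring(DOC)
# The comment does not show up among the children of <phonebook>.
child_tags = [child.tag for child in root]

def parses(text):
    """Return True if the text is accepted by the parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

double_dash_ok = parses("<a><!-- bad -- comment --></a>")      # "--" inside a comment
comment_in_tag_ok = parses("<tag <!-- comment --> ></tag>")    # comment inside a tag

print(child_tags, double_dash_ok, comment_in_tag_ok)
```

Only the <number> element is reported as a child, and both misuse cases are rejected by the parser.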

2.9 Writing symbols and special character
Sometimes you need characters in your data that have a special meaning in XML. For example, suppose you want to have an element like this in your XML file:
    <amount> Balance < Investment </amount>

Here "<" is a reserved character that normally marks the start of a tag, so the parser reports an error. To handle such situations, replace the reserved characters with the entity references below, which are substituted automatically while the XML file is parsed.

    Character    Reference
    &            &amp;
    <            &lt;
    >            &gt;
    "            &quot;
    '            &apos;

So the element above must be written in this form to be well formed:
    <amount> Balance &lt; Investment </amount>
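When XML is generated by a program, the substitution is usually done by a library routine rather than by hand. A minimal sketch, assuming Python's standard xml.sax.saxutils and xml.etree.ElementTree modules; the sample text is invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "Balance < Investment & Savings"

# escape() replaces &, < and > with their entity references.
escaped = escape(raw)
document = "<amount>" + escaped + "</amount>"

# The parser turns the references back into the original characters.
parsed_text = ET.fromstring(document).text
restored = unescape(escaped)

print(escaped)
print(parsed_text)
```

The escaped form is what is stored in the file; the application reading the document gets the original characters back.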

Character References:

The list above covers the predefined entities. You can also insert any character by giving its Unicode value in a character reference. For example:
    <amount> Balance &#8220; Investment </amount>

A character reference like &#8220; contains a hash mark (#) followed by a number. The number is the Unicode value for a single character, such as 65 for the letter A, 8220 for the left curly quote, or 8221 for the right curly quote. A hexadecimal value can be used by adding an x after the hash mark, as in &#x41; for the letter A.
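The mapping between character references and Unicode values can be verified with a short sketch (Python's standard xml.etree.ElementTree module assumed; the sample element content is invented). The curly quotes are the Unicode characters U+201C and U+201D, that is, decimal 8220 and 8221.

```python
import xml.etree.ElementTree as ET

doc = "<amount>&#65; &#8220;quoted&#8221; &#x41;</amount>"
text = ET.fromstring(doc).text

# ord() gives the Unicode value of each character after parsing.
codes = [ord(ch) for ch in text if ch in "A\u201c\u201d"]

print(text)
print(codes)
```

Decimal (&#65;) and hexadecimal (&#x41;) references to the same value produce the same character.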

3. DTD (Document Type Definition) fundamentals
3.1 What is DTD and what is the role of DTDs?
A DTD is a set of rules that defines what tags appear in an XML document, what attributes the tags may have, and what relationships the tags have with each other. When an XML document is processed, it is compared against the DTD to be sure it is structured correctly and all tags are used in the proper manner. This comparison process is called validation and it is performed by a tool called a parser.

The purpose of a DTD is to define the legal building blocks of an XML document. It defines the document structure with a list of legal elements. A DTD can be declared inline in your XML document, or as an external reference.

3.2 Internal DTD
An internal DTD (markup declarations) is inserted within the doctype declaration. A DTD inserted this way applies only to that specific document. This might be the approach to take for a small number of tags used in a single document, as in this example:

    <?xml version="1.0"?>
    <!DOCTYPE film [
        <!ENTITY COM "Comedy">
        <!ENTITY SF "Science Fiction">
        <!ELEMENT film (title,genre,year)+>
        <!ELEMENT title (#PCDATA)>
        <!ATTLIST title
            xml:lang NMTOKEN "EN"
            id ID #IMPLIED>
        <!ELEMENT genre (#PCDATA)>
        <!ELEMENT year (#PCDATA)>
    ]>
    <film>
        <title id="t1">Tootsie</title>
        <genre>&COM;</genre>
        <year>1982</year>
        <title id="t2">Jurassic Park</title>
        <genre>&SF;</genre>
        <year>1993</year>
    </film>

The content model (title,genre,year)+ allows the group of three children to repeat, which is what the document needs for two films. Values of type ID must be valid XML names, so they cannot start with a digit.

3.3 External DTD
DTDs can be very complex and creating a DTD requires a certain amount of work. DTDs are stored as plain text files with the extension '.dtd'. In the following example we assume that the declarations of the internal DTD were saved as a separate file (under the name film.dtd), which is therefore now referred to as an external definition (external DTD):

    <?xml version="1.0"?>
    <!DOCTYPE film SYSTEM "film.dtd">
    <film>
        <title id="t1">Tootsie</title>
        <genre>&COM;</genre>
        <year>1982</year>
        <title id="t2">Jurassic Park</title>
        <genre>&SF;</genre>
        <year>1993</year>
    </film>

3.4 Type of declarations in DTD
There are four kinds of markup declarations in XML within the DTD:
element declarations
attribute list declarations
entity declarations
notation declarations

3.4.1 Element declarations
Element declarations identify the names of elements and the nature of their content. As in HTML, elements are the basic building blocks of XML. Element type declarations constrain which element types can appear as children of the element.

In the DTD, XML elements are declared with an element declaration. An element declaration has the following syntax:
    <!ELEMENT element-name (element-content)>

Empty elements are declared with the keyword EMPTY, written without parentheses:
    <!ELEMENT element-name EMPTY>

Elements that contain only text are declared with #PCDATA inside parentheses:
    <!ELEMENT element-name (#PCDATA)>
Elements that may contain any content are declared with the keyword ANY, written without parentheses:
    <!ELEMENT element-name ANY>
example:
    <!ELEMENT note (#PCDATA)>

#PCDATA means that the element contains parsed character data, that is, text that is going to be parsed by a parser. There is no #CDATA content model for elements; CDATA is an attribute type, and unparsed text inside an element is written as a CDATA section. The keyword ANY declares an element with any content. If an element that allows text also contains child elements, these elements must also be declared.

Elements with one or more children are defined with the names of the child elements inside the parentheses:
    <!ELEMENT element-name (child-element-name)>
or
    <!ELEMENT element-name (child-element-name,child-element-name,.....)>
example:
    <!ELEMENT note (to,from,heading,body)>

When children are declared in a sequence separated by commas, the children must appear in the same sequence in the document. In a full declaration, the children must also be declared, and the children can also have children. The full declaration of the note document will be:
    <!ELEMENT note (to,from,heading,body)>
    <!ELEMENT to (#PCDATA)>
    <!ELEMENT from (#PCDATA)>
    <!ELEMENT heading (#PCDATA)>
    <!ELEMENT body (#PCDATA)>

If the DTD is to be included in your XML source file, it should be wrapped in a DOCTYPE definition with the following syntax:
    <!DOCTYPE root-element [element-declarations]>
example:
    <?xml version="1.0"?>
    <!DOCTYPE note [
        <!ELEMENT note (to,from,heading,body)>
        <!ELEMENT to (#PCDATA)>
        <!ELEMENT from (#PCDATA)>
        <!ELEMENT heading (#PCDATA)>
        <!ELEMENT body (#PCDATA)>
    ]>
    <note>
        <to>Tove</to>
        <from>Jani</from>
        <heading>Reminder</heading>
        <body>Don't forget me this weekend</body>
    </note>

Exactly one occurrence:
    <!ELEMENT element-name (child-name)>
example:
    <!ELEMENT note (message)>
The example declaration above declares that the child element message must occur exactly once inside the note element.

One or more occurrences:
    <!ELEMENT element-name (child-name+)>
example:
    <!ELEMENT note (message+)>
The + sign in the example above declares that the child element message must occur one or more times inside the note element.

Zero or more occurrences:
    <!ELEMENT element-name (child-name*)>
example:
    <!ELEMENT note (message*)>
The * sign in the example above declares that the child element message can occur zero or more times inside the note element.

Zero or one occurrence:
    <!ELEMENT element-name (child-name?)>
example:
    <!ELEMENT note (message?)>
The ? sign in the example above declares that the child element message can occur zero or one times inside the note element.

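Combined example:
    <!ELEMENT note (to+,from,header,message*)>
The example above declares that the element note must contain at least one to child element, exactly one from child element, exactly one header, and zero or more message elements, in that order.

#PCDATA cannot be added to such a sequence. An element that mixes text with child elements must be declared in the mixed-content form, which does not constrain the order or number of the children:
    <!ELEMENT note (#PCDATA|to|from|header|message)*>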
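The occurrence indicators +, * and ? behave like the same symbols in regular expressions. The sketch below illustrates this by hand-translating the element-only content model (to+,from,header,message*) into a regular expression over the names of the child elements. This is not a DTD validator, only an illustration; the names NOTE_MODEL and children_match are hypothetical, and Python's standard re and xml.etree.ElementTree modules are assumed.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of (to+,from,header,message*).
# Each child name is followed by one space so that names cannot run together.
NOTE_MODEL = re.compile(r"(to )+from header (message )*")

def children_match(xml_text):
    """Check whether the root's children follow the hand-translated model."""
    root = ET.fromstring(xml_text)
    sequence = "".join(child.tag + " " for child in root)
    return NOTE_MODEL.fullmatch(sequence) is not None

ok_minimal = children_match("<note><to/><from/><header/></note>")
ok_repeated = children_match(
    "<note><to/><to/><from/><header/><message/><message/></note>")
wrong_order = children_match("<note><from/><to/><header/></note>")
missing_to = children_match("<note><from/><header/></note>")

print(ok_minimal, ok_repeated, wrong_order, missing_to)
```

The first two documents satisfy the model; the third has its children in the wrong order and the fourth lacks the required to element. A real validating parser performs this kind of check for every declared element.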

3.4.2 Attribute List Declarations
In the DTD, XML element attributes are declared with an ATTLIST declaration. An attribute declaration has the following syntax:
    <!ATTLIST element-name attribute-name attribute-type default-value>

As you can see from the syntax above, the ATTLIST declaration defines the element which can have the attribute, the name of the attribute, the type of the attribute, and the default attribute value.

The attribute-type can have the following values:

    Value             Explanation
    CDATA             The value is character data
    (eval|eval|..)    The value must be an enumerated value
    ID                The value is a unique id
    IDREF             The value is the id of another element
    IDREFS            The value is a list of other ids
    NMTOKEN           The value is a valid XML name
    NMTOKENS          The value is a list of valid XML names
    ENTITY            The value is an entity
    ENTITIES          The value is a list of entities
    NOTATION          The value is a name of a notation
    xml:              The value is predefined

The default-value can have the following values:

    Value             Explanation
    "value"           The attribute has a default value
    #REQUIRED         The attribute value must be included in the element
    #IMPLIED          The attribute does not have to be included
    #FIXED "value"    The attribute value is fixed

DTD example:
    <!ELEMENT square EMPTY>
    <!ATTLIST square width CDATA "0">
XML example:
    <square width="100"></square>
In the above example the element square is defined to be an empty element with the attribute width of type CDATA. The width attribute has a default value of 0.
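A parser that reads the DTD fills in the default value when the attribute is left out. The sketch below assumes Python's standard xml.etree.ElementTree module, whose underlying parser does not validate but does read declarations in the internal DTD subset; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

DTD = """<!DOCTYPE square [
  <!ELEMENT square EMPTY>
  <!ATTLIST square width CDATA "0">
]>
"""

# The attribute is omitted, so the parser supplies the declared default.
default_width = ET.fromstring(DTD + "<square/>").get("width")

# The attribute is given, so the document's own value is used.
explicit_width = ET.fromstring(DTD + '<square width="100"/>').get("width")

print(default_width, explicit_width)
```

The application cannot tell from the value alone whether it came from the document or from the DTD.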

Default attribute value
Syntax:
    <!ATTLIST element-name attribute-name CDATA "default-value">
DTD example:
    <!ATTLIST payment type CDATA "check">
XML example:
    <payment type="check" />
Specifying a default value for an attribute ensures that the attribute will get a value even if the author of the XML document didn't include it.

Implied attribute
Syntax:
    <!ATTLIST element-name attribute-name attribute-type #IMPLIED>
DTD example:
    <!ATTLIST contact fax CDATA #IMPLIED>
XML example:
    <contact fax="555-667788" />
Use an implied attribute if you don't want to force the author to include an attribute and you don't have an option for a default value either.

Required attribute
Syntax:
    <!ATTLIST element-name attribute-name attribute-type #REQUIRED>
DTD example:
    <!ATTLIST person number CDATA #REQUIRED>
XML example:
    <person number="5677" />
Use a required attribute if you don't have an option for a default value, but still want to force the attribute to be present.

Fixed attribute value
Syntax:
    <!ATTLIST element-name attribute-name attribute-type #FIXED "value">
DTD example:
    <!ATTLIST sender company CDATA #FIXED "Microsoft">
XML example:
    <sender company="Microsoft" />
Use a fixed attribute value when you want an attribute to have a fixed value without allowing the author to change it. If an author includes another value, a validating XML parser will return an error.

Enumerated attribute values
Syntax:
    <!ATTLIST element-name attribute-name (eval|eval|..) default-value>
DTD example:
    <!ATTLIST payment type (check|cash) "cash">
XML example:
    <payment type="check" />
or
    <payment type="cash" />
Use enumerated attribute values when you want the attribute values to be one of a fixed set of legal values.

Note!
No attribute name may appear more than once in the same start-tag or empty-element tag.
The attribute must have been declared; the value must be of the type declared for it.
No external entity references: attribute values cannot contain direct or indirect entity references to external entities.
The replacement text of any entity referred to directly or indirectly in an attribute value (other than "&lt;") must not contain a <.

3.4.3 Entity declarations
XML documents can be made of information drawn from different files. These pieces of information are called entities. It might be easier to think of entities as macros, or as aliases for longer pieces of content. A single entity name can take the place of a whole lot of text. Entity references cut down on the amount of typing you have to do, because any time you need that text you simply use the alias name and the processor expands the contents of the alias for you.

Entities allow you to refer to other data and pages as shortcuts, so that repeating the same information in a document or DTD is not necessary. Entity declarations allow you to associate a name with some other fragment of content. That fragment can be a chunk of regular text, a chunk of the document type declaration, or a reference to an external file containing either text or binary data. Entities are declared in the DTD, similar to elements and attributes.

Syntax:
    <!ENTITY entity-name "entity-value">
DTD example:
    <!ENTITY writer "Jan Egil Refsnes.">
    <!ENTITY copyright "Copyright XML101.">
XML example:
    <author>&writer;&copyright;</author>
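The expansion of the writer and copyright entities can be seen by parsing a small document that declares them in an internal DTD. This is a sketch assuming Python's standard xml.etree.ElementTree module; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE author [
  <!ENTITY writer "Jan Egil Refsnes.">
  <!ENTITY copyright "Copyright XML101.">
]>
<author>&writer;&copyright;</author>"""

# Both references are replaced by the declared entity values.
expanded = ET.fromstring(DOC).text

# Without the declarations the same reference is an error.
try:
    ET.fromstring("<author>&writer;</author>")
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False

print(expanded)
```

The second part shows why entities must be declared before they are referenced: an undeclared reference makes the document unparseable.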

Syntax:
    <!ENTITY entity-name SYSTEM "URI/URL">
DTD example:
    <!ENTITY writer SYSTEM "">
    <!ENTITY copyright SYSTEM "">
XML example:
    <author>&writer;&copyright;</author>

Entities may be either parsed or unparsed. A parsed entity's contents are referred to as its replacement text; this text is considered an integral part of the document. An unparsed entity is a resource whose contents may or may not be text, and if text, may not be XML. Each unparsed entity has an associated notation, identified by name. Beyond a requirement that an XML processor make the identifiers for the entity and notation available to the application, XML places no constraints on the contents of unparsed entities. Parsed entities are invoked by name using entity references; unparsed entities by name, given in the value of ENTITY or ENTITIES attributes.

General entities (or simply entities) are entities for use within the document content. Parameter entities are parsed entities for use within the DTD. These two types of entities use different forms of reference and are recognized in different contexts. Furthermore, they occupy different namespaces; a parameter entity and a general entity with the same name are two distinct entities. The Name identifies the entity in an entity reference or, in the case of an unparsed entity, in the value of an ENTITY or ENTITIES attribute. If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor may issue a warning if entities are declared multiple times.

Example (general) entity declaration:
    <!DOCTYPE videocollection [
        <!ENTITY R "Romance">
        <!ENTITY WAR "War">
        <!ENTITY COM "Comedy">
        <!ENTITY SF "Science Fiction">
        <!ENTITY ACT "Action">
    ]>

A (general) entity reference refers to the content of a named entity. References to parsed general entities use ampersand (&) and semicolon (;) as delimiters. The entities declared above are used (referred to) in an XML document like this:
    <videocollection>
        <title id="1">Tootsie</title>
        <genre>&COM;</genre>
        <year>1982</year>
        <title id="2">Jurassic Park</title>
        <genre>&SF;</genre>
        <year>1993</year>
        <title id="3">Mission Impossible</title>
        <genre>&ACT;</genre>
        <year>1996</year>
    </videocollection>
As in HTML, the name of the entity is preceded by an ampersand (&) and followed by a semicolon (;).

A parameter entity declaration is used for shortcuts within the DTD. Syntax:
    <!ENTITY % NAME "text that you want to be represented by the entity">
Example parameter entity declaration:
    <!ENTITY % pub "&#xc9;ditions Gallimard" >
    <!ENTITY rights "All rights reserved" >
    <!ENTITY book "La Peste: Albert Camus, &#xA9; 1947 %pub;. &rights;">

Parameter-entity references use percent-sign (%) and semicolon (;) as delimiters. In the declaration of book above, %pub; is a parameter entity reference.

The replacement text for the entity "book" is:
    La Peste: Albert Camus, © 1947 Éditions Gallimard. &rights;
The character references and %pub; are expanded when the declaration is read, while the general entity reference &rights; is left as it is until book is referenced in the document.

XML expands the power of entities in a big way. There are three kinds of entities: internal, external, and predefined.

If the entity definition is an EntityValue, the defined entity is called an internal entity. There is no separate physical storage object, and the content of the entity is given in the declaration. Internal entities are defined in the DTD so they can be used throughout the rest of the document. Internal entities allow you to define shortcuts for frequently typed text or text that is expected to change, such as the revision status of a document. They help avoid misspellings and retyping of the same information. An internal entity is a parsed entity.

If, for instance, a phrase such as "Science Fiction" occurs frequently in a document, the following could be put in the DTD to avoid typing the whole phrase each time.
Example of an internal entity declaration:
    <!ENTITY SF "Science Fiction">
Whenever the full term needs to be used in the document, it is sufficient to type &SF;

Internal entities can include references to other internal entities, but it is an error for them to be recursive.

If the entity is not internal, it is an external entity. External entity references are used for replacement text that is really long. The information is then kept in another file. External entities allow an XML document to refer to an external file. External entities contain either text or binary data. If they contain text, the content of the external file is inserted at the point of reference and parsed as part of the referring document. Binary data is not parsed and may only be referenced in an attribute. Binary data is used to reference figures and other non-XML content in the document.

The entity declarations in this example refer to content that is stored in separate files. The content is placed into the XML file by using the entities, rather than by cutting and pasting the contents of separate files together. You can specify an entity that has text defined external to the document by using the SYSTEM keyword, such as:
    <!ENTITY LIagreement SYSTEM "">
    <!ENTITY LOGO SYSTEM "" NDATA GIF87A>

For the LIagreement entity, the XML processor will parse the content of the referenced file as if its content had been typed at the location of the entity reference. The LOGO entity is also an external entity, but its content is binary. The LOGO entity can only be used as the value of an ENTITY (or ENTITIES) attribute (on a graphic element, perhaps). The XML processor will pass this information along to an application, but it does not attempt to process the content of /standard/logo.gif.

There are five predefined XML entities, most of which should be well known to HTML coders:

    &lt;      produces the left angle bracket        <
    &gt;      produces the right angle bracket       >
    &amp;     produces the ampersand                 &
    &apos;    produces a single quote character      '
    &quot;    produces a double quote character      "

You can also use entity references within attribute values. For example, consider the following:
    <INVOICE CLIENT = "&IBM;" product = "&product_id_8762;" quantity ="5">

You may not reference an external entity from within an attribute value. The referenced text may not contain the < character because it would cause a parsing error in the element when replaced.

Note: there may not be any whitespace embedded in an entity reference. & SF; or &SF ; will cause errors. Entities MUST be declared in an XML document before they are referenced.

3.5 PCDATA
PCDATA means parsed character data. Think of character data as the text found between the start tag and the end tag of an XML element. PCDATA is text that will be parsed by a parser. Tags inside the text will be treated as markup and entities will be expanded.

3.6 CDATA
CDATA also means character data. CDATA is text that will NOT be parsed by a parser. Tags inside the text will NOT be treated as markup and entities will not be expanded.
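The difference shows up when text contains characters such as < and &. Inside an element, unparsed text is written as a CDATA section, which starts with <![CDATA[ and ends with ]]>. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the element name script and its content are made up.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, < and & are taken literally.
cdata_doc = "<script><![CDATA[if (a < b && b < c) return 1;]]></script>"
cdata_text = ET.fromstring(cdata_doc).text

# As ordinary parsed character data, the same characters must be escaped.
pcdata_doc = "<script>if (a &lt; b &amp;&amp; b &lt; c) return 1;</script>"
pcdata_text = ET.fromstring(pcdata_doc).text

print(cdata_text)
print(cdata_text == pcdata_text)
```

Both documents deliver the same text to the application; the CDATA section only changes how the text has to be written in the file.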

3.7 Special Character (Entities)
Entities are variables used to define common text. Entity references are references to entities. Most of you will know the HTML entity reference "&nbsp;", which is used to insert an extra space in an HTML document. Entities are expanded when a document is parsed by an XML parser. The following entities are predefined in XML:

    Entity Reference    Character
    &lt;                <
    &gt;                >
    &amp;               &
    &quot;              "
    &apos;              '

4. Relation between XML and DTD and Validation
The purpose of having a DTD is this: when your XML document is processed, it is compared to its associated DTD to be sure it is structured correctly and all tags are used in the proper manner. This comparison process is called validation and is performed by a tool called a parser. Remember, you don't need to have a DTD to create an XML document; you only need a DTD for a valid XML document.

4.1 Validation (Parsing)
An XML document written with correct syntax is called "Well Formed" XML. An XML document that has also been validated against a DTD is called "Valid" XML. We should always aim to generate valid XML.

4.1.1 Well Formed XML
A "Well Formed" XML document has correct XML syntax. The syntax rules were described in the previous chapters:
XML documents must have a root element
XML elements must have a closing tag
XML tags are case sensitive
XML elements must be properly nested
XML attribute values must be quoted

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <note>
        <to>Tove</to>
        <from>Jani</from>
        <heading>Reminder</heading>
        <body>Don't forget me this weekend!</body>
    </note>

4.1.2 Valid XML Documents
A "Valid" XML document is a "Well Formed" XML document which also conforms to the rules of a Document Type Definition (DTD):

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE note SYSTEM "Note.dtd">
    <note>
        <to>Tove</to>
        <from>Jani</from>
        <heading>Reminder</heading>
        <body>Don't forget me this weekend!</body>
    </note>
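The difference between well formed and valid can be demonstrated with a non-validating parser. The sketch below assumes Python's standard xml.etree.ElementTree module, which checks well-formedness only and does not validate against a DTD. The document declares that note must contain to, from, heading and body, yet contains only to; a validating parser would reject it, which the standard library cannot show.

```python
import xml.etree.ElementTree as ET

# Well formed, but NOT valid: from, heading and body are missing.
NOTE_DOC = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note><to>Tove</to></note>"""

# A non-validating parser accepts the document because its syntax is correct.
root = ET.fromstring(NOTE_DOC)
children = [child.tag for child in root]

print(root.tag, children)
```

Because the parser reports no error here, an application that needs valid documents has to use a validating parser or check the structure itself.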