
The Purpose of XML Schema

According to the World Wide Web Consortium (W3C), which approved XML Schema as an official recommendation in 2001, "XML Schemas express shared vocabularies and allow machines to carry out rules made by people. They provide a means for defining the structure, content, and semantics of XML documents." XML Schema was born out of a need to provide a more powerful and flexible alternative to the standard DTD (Document Type Definition), a language for expressing SGML and XML content models. Though many DTDs are still in use today in legacy document frameworks and industry standards, and are often used in tandem with XSDs, XML Schema offers a lengthy list of advantages for defining XML documents.

XML Schema is an XML-based language used to create XML-based languages and data models. An XML schema defines element and attribute names for a class of XML documents. The schema also specifies the structure that those documents must adhere to and the type of content that each element can hold. XML Schema provides a much richer set of structures, types and constraints for describing data and is therefore expected to soon become the most common method for defining and validating highly structured XML documents.

XML documents that attempt to adhere to an XML schema are said to be instances of that schema. If they correctly adhere to the schema, then they are valid instances. This is not the same as being well formed. A well-formed XML document follows all the syntax rules of XML, but it does not necessarily adhere to any particular schema. So, an XML document can be well formed without being valid, but it cannot be valid unless it is well formed.

The Power of XML Schema


DTDs are similar to XML schemas in that they are used to create classes of XML documents. DTDs were around long before the advent of XML. They were originally created to define languages based on SGML, the parent of XML. Although DTDs are still common, XML Schema is a much more powerful language. As a means of understanding the power of XML Schema, let's look at the limitations of DTD.

1. DTDs do not have built-in datatypes.
2. DTDs do not support user-derived datatypes.
3. DTDs allow only limited control over cardinality (the number of occurrences of an element within its parent).
4. DTDs do not support Namespaces or any simple way of reusing or importing other schemas.
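To make the first and third limitations concrete, here is a minimal sketch of two declarations that have no direct DTD equivalent. The element names (Order, OrderDate, Item) are hypothetical and are not taken from the course files.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical element names, used only for illustration -->
  <xs:element name="Order">
    <xs:complexType>
      <xs:sequence>
        <!-- Built-in datatype: the content must be a valid calendar date -->
        <xs:element name="OrderDate" type="xs:date"/>
        <!-- Explicit cardinality: at least 1 and at most 10 Item elements -->
        <xs:element name="Item" type="xs:string" minOccurs="1" maxOccurs="10"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

In a DTD, OrderDate could only be declared as character data, and the occurrence indicators ?, * and + offer no direct way to state a range such as 1 to 10.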

A First Look
An XML schema describes the structure of an XML instance document by defining what each element must or may contain. An element is limited by its type. For example, an element of complex type can contain child elements and attributes, whereas a simple-type element can only contain text. The diagram below gives a first look at the types of XML Schema elements.

Schema authors can define their own types or use the built-in types. The following is a high-level overview of Schema types.

1. Elements can be of simple type or complex type.
2. Simple-type elements can only contain text. They cannot have child elements or attributes.
3. All the built-in types are simple types (e.g., xs:string).
4. Schema authors can derive simple types by restricting another simple type. For example, an email type could be derived by limiting a string to a specific pattern.
5. Simple types can be atomic (e.g., strings and integers) or non-atomic (e.g., lists).
6. Complex-type elements can contain child elements and attributes as well as text.
7. By default, complex-type elements have complex content, meaning that they have child elements.
8. Complex-type elements can be limited to having simple content, meaning they only contain text. They are different from simple-type elements in that they have attributes.
9. Complex types can be limited to having no content, meaning they are empty, but they may have attributes.
10. Complex types may have mixed content, a combination of text and child elements.
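The overview above can be sketched as a handful of type definitions. All type, element, and attribute names here (EmailType, ScoreList, PriceType, ImageType, ParagraphType, and so on) are hypothetical, and the email pattern is deliberately loose; it is an illustration, not a recommended email check.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Item 4: a simple type derived by restricting xs:string with a pattern -->
  <xs:simpleType name="EmailType">
    <xs:restriction base="xs:string">
      <xs:pattern value="[^@\s]+@[^@\s]+"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Item 5: a non-atomic simple type, a whitespace-separated list of integers -->
  <xs:simpleType name="ScoreList">
    <xs:list itemType="xs:integer"/>
  </xs:simpleType>

  <!-- Item 8: a complex type with simple content (text plus an attribute) -->
  <xs:complexType name="PriceType">
    <xs:simpleContent>
      <xs:extension base="xs:decimal">
        <xs:attribute name="currency" type="xs:string"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <!-- Item 9: a complex type with no content; it is empty but has an attribute -->
  <xs:complexType name="ImageType">
    <xs:attribute name="src" type="xs:anyURI"/>
  </xs:complexType>

  <!-- Item 10: a complex type with mixed content (text interleaved with child elements) -->
  <xs:complexType name="ParagraphType" mixed="true">
    <xs:sequence>
      <xs:element name="Emphasis" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

</xs:schema>
```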

A Simple XML Schema


Let's take a look at a simple XML schema, which is made up of one complex type element with two child simple type elements.

Code Sample: SchemaBasics/Demos/Author.xsd


<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Author">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="FirstName" type="xs:string" />
        <xs:element name="LastName" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Code Explanation
As you can see, an XML schema is an XML document and must follow all the syntax rules of any other XML document; that is, it must be well formed. XML schemas also have to follow the rules defined in the "Schema of schemas," which defines, among other things, the structure of an XML schema and the element and attribute names it may use. The document element of XML schemas is xs:schema. It takes the attribute xmlns:xs with the value of http://www.w3.org/2001/XMLSchema, indicating that the document should follow the rules of XML Schema. This will be clearer after you learn about namespaces.

In this XML schema, we see an xs:element element within the xs:schema element. xs:element is used to define an element. In this case it defines the element Author as a complex-type element, which contains a sequence of two elements: FirstName and LastName, both of which are of the simple type, string.

Validating an XML Instance Document


In the last section, you saw an example of a simple XML schema, which defined the structure of an Author element. The code sample below shows a valid XML instance of this XML schema.

Code Sample: SchemaBasics/Demos/MarkTwain.xml


<?xml version="1.0"?>
<Author xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="Author.xsd">
  <FirstName>Mark</FirstName>
  <LastName>Twain</LastName>
</Author>

Code Explanation
This is a simple XML document. Its document element is Author, which contains two child elements: FirstName and LastName, just as the associated XML schema requires. The xmlns:xsi attribute of the document element indicates that this XML document is an instance of an XML schema. The document is tied to a specific XML schema with the xsi:noNamespaceSchemaLocation attribute. There are many ways to validate the XML instance. If you are using an XML authoring tool, it is very likely able to perform the validation for you.
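For contrast, the following sketch is well formed but would not be valid against Author.xsd, because the required LastName element is missing. This instance is hypothetical and is not one of the course demo files.

```xml
<?xml version="1.0"?>
<!-- Well formed, but not a valid instance of Author.xsd: LastName is missing -->
<Author xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="Author.xsd">
  <FirstName>Mark</FirstName>
</Author>
```

A validating parser should reject this document, although the exact wording of the error depends on the tool.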

The Anatomy of an XML Schema


An XML schema is composed of a number of declarations and definitions that restrict data input by applying a variety of rules.

Document type declaration
XML Schema is XML, and therefore conforms to the syntax specified in the W3C XML recommendations. This means that XML Schema can be parsed by a standard XML parser, can be accessed programmatically for integration testing and other validation purposes, and is extensible (as demonstrated by standards such as XBRL). The XML document type declaration is not required by XML Schema, though it is inferred by the root element, <schema>.

Namespace declaration
Namespaces provide a context for element and attribute names used within an XML document, allowing architects to build and extend upon XML vocabularies using URIs to ensure the creation of unique data tags.

A namespace declaration is not required in XML Schema; namespaces can instead be declared inline on elements and attributes in the XML instance. For example:
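A minimal sketch of an inline declaration, assuming a hypothetical namespace URI (http://www.example.com/books) and hypothetical element names:

```xml
<?xml version="1.0"?>
<!-- The xmlns:bk attribute declares the namespace directly on the element that uses it -->
<bk:Book xmlns:bk="http://www.example.com/books">
  <bk:Title>The Adventures of Tom Sawyer</bk:Title>
</bk:Book>
```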

Namespaces can play a vital role in any large data integration project or data exchange scenario, where item names can often come into conflict.

Expanded names
Namespaces in XML dictates that once namespaces are declared, they are enforced through the use of expanded names. An expanded name is simply a namespace name (defined in the namespace declaration) combined with a local name to denote an item definition that is unique to the declared namespace. In the example below, <xsd:annotation> tells us that the definition of annotation applies specifically and uniquely to the XML Schema (xsd) vocabulary.
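A short sketch of an annotation written with the xsd prefix, assuming the prefix is bound to the XML Schema namespace as shown; the documentation text is hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- xsd:annotation expands to the XML Schema namespace name plus the local name "annotation" -->
  <xsd:annotation>
    <xsd:documentation xml:lang="en">
      Hypothetical schema used only to illustrate expanded names.
    </xsd:documentation>
  </xsd:annotation>
</xsd:schema>
```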

Type definitions
Type definitions (complexType and simpleType) enable developers to build modular data structures and reuse individual content models without rewriting code every time they need to employ the same data syntax. In the example, line 11 defines a complexType "USAddress", which uses a familiar data structure. This structure is then reused in lines 13 and 14 to describe our shipping and billing addresses, which may contain different content, but will still adhere to the same syntax rules.
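The reuse described above can be sketched as follows. This is modeled loosely on the purchase-order example in the W3C Primer; the child element names are assumptions, and the line numbers cited in the text refer to the original listing, not to this sketch.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- One named type definition describes the address structure -->
  <xsd:complexType name="USAddress">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="street" type="xsd:string"/>
      <xsd:element name="city" type="xsd:string"/>
      <xsd:element name="state" type="xsd:string"/>
      <xsd:element name="zip" type="xsd:decimal"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- The same type is reused for both the shipping and the billing address -->
  <xsd:element name="purchaseOrder">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="shipTo" type="USAddress"/>
        <xsd:element name="billTo" type="USAddress"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```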

Element / Attribute declarations
Element and attribute declarations simply define the names that will be used for tags within the XML instance. Both of these can be further defined with a variety of different constraints including id, type, substitutionGroup, max/minOccurs, etc.

Sequence definition
The sequence element defines the order in which child elements are required to appear in the corresponding XML document.
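As a minimal sketch of these declarations working together, the hypothetical Book element below declares three child elements in a fixed order and one required attribute. All names are assumptions chosen for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Book">
    <xs:complexType>
      <!-- The sequence fixes the order: Title, then Author(s), then an optional Subtitle -->
      <xs:sequence>
        <xs:element name="Title" type="xs:string"/>
        <xs:element name="Author" type="xs:string" maxOccurs="unbounded"/>
        <xs:element name="Subtitle" type="xs:string" minOccurs="0"/>
      </xs:sequence>
      <!-- Attribute declarations follow the content model inside the complex type -->
      <xs:attribute name="isbn" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

An instance that places Author before Title, or omits the isbn attribute, would not validate against this sketch.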

XML Schema Applications


XML Schema provides the separation of document structure and semantics that makes XML such a powerful language for a wide variety of uses.

Data Validation
XML Schemas define the structure of elements and attributes within an XML document, and offer a great deal of flexibility in designing and customizing content models for any kind of documentation requirements. XML parsers use XML Schema to validate the following aspects of XML instances: document structure (or syntax), datatypes, and inclusion of required elements and attributes.

This enables application designers to automate the control of user input in any of the many situations where XML is used including Web forms, publishing systems, databases and other backend storage mechanisms, data integration applications, Web services, etc.

Content Model Definition


XML Schema provides a powerful solution for modeling content based on a wide range of variables. With extended support for both primitive and derived datatype definitions as well as user-defined types, XML Schema gives an enormous amount of flexibility to data modeling, applying programming concepts like inheritance and subclassing to data syntax. This gives content architects the ability to build extended models based on abstract components, and streamlines processes in large-scale documentation projects.

Data Exchange / Integration


XML Schema enables developers to define extensible document structures for XML data. Because XML Schema documents are based on XML syntax, they are programmatically accessible to developers and can add an enormous amount of flexibility to system architectures. XML Schemas can be stored along with other XML documents in XML architectures and data stores and manipulated, referenced, and styled using a growing number of XML companion tools like XPath, XQuery, XInclude/XPointer, and XSL/XSLT.

For example, used in conjunction with other XML technologies, such as XSLT and XML-enabled databases, global elements defined in XSDs can be processed consistently and uploaded to the appropriate database structure or even simultaneously output to HTML, RTF, PDF, and other formats using a methodology called single source publishing.

The data-oriented datatypes provided in XML Schema 1.1, in addition to the document-oriented datatypes in the previous version of the recommendation, facilitate complex document exchange and data integration scenarios, giving it exposure to the B2B and e-commerce architectures that traditionally employ other data formats such as EDI (electronic data interchange).

In addition, XML Schema's support for namespaces enables XML documents to contain unique identifiers, and therefore incorporate more than one commonly used vocabulary at a time. A namespace declaration, or binding, is generally declared in an XML document via an IRI (Internationalized Resource Identifier), and is expressed by applying a prefix to relevant elements and attributes. Namespaces provide enormous opportunities for data exchange and integration, enabling entire XML frameworks to coexist within the same architecture. This is an extremely valuable asset for a global economy, where mergers and acquisitions, supply chain requirements, and industry standards often dictate heterogeneous data constructs.

Industry XML Standards


Industry XML standards aim to streamline and provide a basis for industry-wide data integration. Implementing a common XML vocabulary enables business partners to seamlessly exchange data across different systems and architectures. XML Schema provides a flexible and extremely portable method for defining these standards and has been used across an ever-growing number of industries including retail, telecommunications, financial services, human resources, healthcare, insurance, e-learning, and printing and publishing. Global data integration based on XML documentation, exchange, and infrastructure standards seems, however, to be a long way off. Compliance is usually voluntary, and there are often several different industry-specific standards to choose from. In addition, many of these specifications are still evolving, making business and technology decisions increasingly difficult.

Despite these hurdles, the ability to create a flexible and extensible architecture provided by XML Schema and other XML technologies enables early adopters and forward-thinking companies to easily adapt to changing industry mandates with resources such as XSLT, XPath, XQuery, and XML-enabled databases.

XML Schema Specifications


The XML Schema recommendation consists of three parts:

XML Schema Part 0: Primer Second Edition provides a very useful quick start guide for schema developers. It is designed to be used in tandem with the more definitive descriptions in Parts 1 and 2, and assumes a basic knowledge of XML and namespaces.

XML Schema 1.1 Part 1: Structures defines the nature and general make up of the XML Schema recommendation, and provides detailed information about schema construction and application with sections including Conceptual Framework, Schema Component Details, and Schemas and Schema-validity Assessment. This section of the specification depends on and refers directly to other W3C publications: XML Information Set, XML Namespaces, and XPath, as well as XML Schema: Datatypes.

XML Schema 1.1 Part 2: Datatypes describes and defines the strong datatyping capabilities of the XML Schema recommendation and is included as a separate document to enable it to be used as an independent entity and therefore portable to other XML tools and technologies. Datatyping allows schema designers to constrain the input of end-users through the application of recognized abstract concepts such as string, Boolean, integer, etc.

XML Schema restriction


Restrictions are used to define acceptable values for XML elements or attributes. Restrictions on XML elements are called facets.

Restrictions on Values
The following example defines an element called "age" with a restriction. The value of age cannot be lower than 0 or greater than 120:

<xs:element name="age">
  <xs:simpleType>
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="0"/>
      <xs:maxInclusive value="120"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

Restrictions on a Set of Values


To limit the content of an XML element to a set of acceptable values, we would use the enumeration constraint. The example below defines an element called "car" with a restriction. The only acceptable values are: Audi, Golf, BMW:

<xs:element name="car">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:enumeration value="Audi"/>
      <xs:enumeration value="Golf"/>
      <xs:enumeration value="BMW"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

The example above could also have been written like this:

<xs:element name="car" type="carType"/>

<xs:simpleType name="carType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="Audi"/>
    <xs:enumeration value="Golf"/>
    <xs:enumeration value="BMW"/>
  </xs:restriction>
</xs:simpleType>

Note: In this case the type "carType" can be used by other elements because it is not a part of the "car" element.
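As a sketch of that reuse, the schema below applies carType to a second element. The rentalCar element is hypothetical and is added only to show the named type being shared.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Named type, defined once at the top level -->
  <xs:simpleType name="carType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Audi"/>
      <xs:enumeration value="Golf"/>
      <xs:enumeration value="BMW"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Both elements accept exactly the same three values -->
  <xs:element name="car" type="carType"/>
  <xs:element name="rentalCar" type="carType"/>

</xs:schema>
```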

Restrictions on a Series of Values


To limit the content of an XML element to a defined series of numbers or letters, we would use the pattern constraint. The example below defines an element called "letter" with a restriction. The only acceptable value is ONE of the LOWERCASE letters from a to z:

<xs:element name="letter">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="[a-z]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

The next example defines an element called "initials" with a restriction. The only acceptable value is THREE of the UPPERCASE letters from a to z:

<xs:element name="initials">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="[A-Z][A-Z][A-Z]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

The next example also defines an element called "initials" with a restriction. The only acceptable value is THREE of the LOWERCASE OR UPPERCASE letters from a to z:

<xs:element name="initials">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="[a-zA-Z][a-zA-Z][a-zA-Z]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

The next example defines an element called "choice" with a restriction. The only acceptable value is ONE of the following letters: x, y, OR z:

<xs:element name="choice">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="[xyz]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>


The next example defines an element called "prodid" with a restriction. The only acceptable value is FIVE digits in a sequence, and each digit must be in a range from 0 to 9:

<xs:element name="prodid">
  <xs:simpleType>
    <xs:restriction base="xs:integer">
      <xs:pattern value="[0-9][0-9][0-9][0-9][0-9]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

Other Restrictions on a Series of Values


The example below defines an element called "letter" with a restriction. The acceptable value is zero or more occurrences of lowercase letters from a to z:

<xs:element name="letter">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="([a-z])*"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

The next example also defines an element called "letter" with a restriction. The acceptable value is one or more pairs of letters, each pair consisting of a lower case letter followed by an upper case letter. For example, "sToP" will be validated by this pattern, but not "Stop" or "STOP" or "stop":

<xs:element name="letter">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="([a-z][A-Z])+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

The next example defines an element called "gender" with a restriction. The only acceptable value is male OR female:

<xs:element name="gender">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="male|female"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

The next example defines an element called "password" with a restriction. There must be exactly eight characters in a row and those characters must be lowercase or uppercase letters from a to z, or a number from 0 to 9:

<xs:element name="password">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="[a-zA-Z0-9]{8}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

Restrictions on Whitespace Characters


To specify how whitespace characters should be handled, we would use the whiteSpace constraint. This example defines an element called "address" with a restriction. The whiteSpace constraint is set to "preserve", which means that the XML processor WILL NOT remove any white space characters:

<xs:element name="address">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:whiteSpace value="preserve"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

This example also defines an element called "address" with a restriction. The whiteSpace constraint is set to "replace", which means that the XML processor WILL REPLACE all white space characters (line feeds, tabs, spaces, and carriage returns) with spaces:

<xs:element name="address">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:whiteSpace value="replace"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

This example also defines an element called "address" with a restriction. The whiteSpace constraint is set to "collapse", which means that the XML processor WILL COLLAPSE white space: line feeds, tabs, spaces, and carriage returns are replaced with spaces, leading and trailing spaces are removed, and multiple spaces are reduced to a single space:

<xs:element name="address">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:whiteSpace value="collapse"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

Restrictions on Length
To limit the length of a value in an element, we would use the length, maxLength, and minLength constraints. This example defines an element called "password" with a restriction. The value must be exactly eight characters:

<xs:element name="password">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:length value="8"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

This example defines another element called "password" with a restriction. The value must be at least five and at most eight characters long:

<xs:element name="password">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:minLength value="5"/>
      <xs:maxLength value="8"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

Restrictions for Datatypes


Constraint       Description
enumeration      Defines a list of acceptable values
fractionDigits   Specifies the maximum number of decimal places allowed. Must be equal to or greater than zero
length           Specifies the exact number of characters or list items allowed. Must be equal to or greater than zero
maxExclusive     Specifies the upper bounds for numeric values (the value must be less than this value)
maxInclusive     Specifies the upper bounds for numeric values (the value must be less than or equal to this value)
maxLength        Specifies the maximum number of characters or list items allowed. Must be equal to or greater than zero
minExclusive     Specifies the lower bounds for numeric values (the value must be greater than this value)
minInclusive     Specifies the lower bounds for numeric values (the value must be greater than or equal to this value)
minLength        Specifies the minimum number of characters or list items allowed. Must be equal to or greater than zero
pattern          Defines the exact sequence of characters that are acceptable
totalDigits      Specifies the maximum number of digits allowed. Must be greater than zero
whiteSpace       Specifies how white space (line feeds, tabs, spaces, and carriage returns) is handled
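Several constraints in the table are not illustrated elsewhere in this section. The sketch below combines a few of them on two hypothetical elements, price and discount; the names and limits are assumptions chosen only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- price: greater than zero, at most six digits overall, at most two after the decimal point -->
  <xs:element name="price">
    <xs:simpleType>
      <xs:restriction base="xs:decimal">
        <xs:totalDigits value="6"/>
        <xs:fractionDigits value="2"/>
        <xs:minExclusive value="0"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

  <!-- discount: from 0 (allowed) up to, but not including, 1 -->
  <xs:element name="discount">
    <xs:simpleType>
      <xs:restriction base="xs:decimal">
        <xs:minInclusive value="0"/>
        <xs:maxExclusive value="1"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

</xs:schema>
```

Under these assumptions, a price of 1234.56 would be accepted, while 0 and 12345.678 would be rejected.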