
XML & XML with Informatica

XML
My best description of XML is this: XML is a cross-platform, software- and hardware-
independent tool for transmitting information.
XML Is Used to Exchange Data
With XML, data can be exchanged between incompatible systems.
In the real world, computer systems and databases contain data in incompatible formats.
One of the most time-consuming challenges for developers has been to exchange data
between such systems over the Internet.
Converting the data to XML can greatly reduce this complexity and create data that can be
read by many different types of applications.
XML, DTD, and XML Schema
Extensible Markup Language (XML) is a markup language generally regarded as the universal
format for structured documents and data on the Web. Like HTML, XML contains element
tags and attributes that define data. Unlike HTML, XML element tags and attributes are
not based on a predefined, static set of elements and attributes. Every XML file can have a
different set of tags and attributes. Document Type Definition (DTD) files and XML
schema files define the elements and attributes that can be used and the structure within
which they fit in an XML file.
DTD and XML schema files specify the structure and content of XML files in different
ways. A DTD file defines the names of elements, the number of times they occur, and
how they fit together. The XML schema file provides the same information plus the data
types of the elements.
DTD
The purpose of a DTD is to define the legal building blocks of an XML document. It defines
the document structure with a list of legal elements. A DTD can be declared inline in your
XML document, or as an external reference.
The DTD file contains only metadata. It contains the description of the structure and the
definition of the elements and attributes that can be found in the associated XML file. It
does not contain any data.
A sample DTD looks like this:
<!ELEMENT employees (companyname, employee)>
<!ELEMENT companyname (id, name)>
<!ELEMENT employee (emp+)>
<!ELEMENT emp (id, info)>
<!ELEMENT info (name, age, sex, job, sal)>

<!ELEMENT created-date (format, timestamp)>

<!ELEMENT id (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT format (#PCDATA)>
<!ELEMENT timestamp (#PCDATA)>

Example:
<employees>
  <companyname>
    <id>01</id>
    <name>Wipro Technologies</name>
  </companyname>
  <employee>
    <emp>
      <id>91000</id>
      <info>
        <name>Dileep</name>
        <age>25</age>
        <sex>Male</sex>
        <job>Project Engineer</job>
        <sal>20000</sal>
      </info>
    </emp>
  </employee>
</employees>
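
As a minimal sketch of how an application might read the sample document above, the following Python snippet uses the standard-library xml.etree.ElementTree parser. The names SAMPLE and read_employees are illustrative assumptions, not part of the original material, and this parser checks only that the document is well formed; it does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

# The sample document from the text.
SAMPLE = """\
<employees>
  <companyname>
    <id>01</id>
    <name>Wipro Technologies</name>
  </companyname>
  <employee>
    <emp>
      <id>91000</id>
      <info>
        <name>Dileep</name>
        <age>25</age>
        <sex>Male</sex>
        <job>Project Engineer</job>
        <sal>20000</sal>
      </info>
    </emp>
  </employee>
</employees>
"""


def read_employees(xml_text):
    """Return (company name, list of employee dicts) from an <employees> document."""
    root = ET.fromstring(xml_text)
    company = root.findtext("companyname/name")
    employees = []
    for emp in root.findall("employee/emp"):
        record = {"id": emp.findtext("id")}
        # Copy every child of <info> (name, age, sex, job, sal) as text.
        for field in emp.find("info"):
            record[field.tag] = field.text
        employees.append(record)
    return company, employees


company, employees = read_employees(SAMPLE)
print(company)
print(employees)
```

All values come back as strings, because neither XML itself nor a DTD carries data types.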
XML Schema
The XML schema file, like the DTD file, contains only metadata. In addition to the
definition and structure of elements and attributes, an XML schema contains a description
of the type of elements and attributes found in the associated XML file.
A sample XML Schema file looks like this:
<xs:element name="ECR">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="ECR_object"/>
      <xs:element ref="ECN_object" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
<xs:element name="ECR_object">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="number" type="xs:string"/>
      <xs:element name="summary" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
<xs:element name="ECN_object">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="number" type="xs:string"/>
      <xs:element name="summary" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
Example:
<ECR>
  <ECR_object>
    <number>00996</number>
    <summary>Testing</summary>
  </ECR_object>
  <ECN_object>
    <number>00896</number>
    <summary>Test</summary>
  </ECN_object>
</ECR>
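
To illustrate the point that a schema file carries data types as metadata, here is a small Python sketch that lists the typed elements declared in a schema. It assumes the fragment is wrapped in an xs:schema root bound to the standard XML Schema namespace (the wrapper is added here so that the fragment parses on its own), and the function name typed_elements is hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"  # standard XML Schema namespace

# Part of the sample schema, wrapped in an <xs:schema> root (an assumption
# made so that the fragment is a complete, parseable document).
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ECR_object">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="number" type="xs:string"/>
        <xs:element name="summary" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def typed_elements(schema_text):
    """Return {element name: declared type} for every xs:element with a type."""
    root = ET.fromstring(schema_text)
    result = {}
    for el in root.iter(f"{{{XS}}}element"):
        if "type" in el.attrib:
            result[el.get("name")] = el.get("type")
    return result


types = typed_elements(SCHEMA)
print(types)
```

The standard library only reads the schema as an ordinary XML document here; validating data against a schema needs a dedicated validator.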

Cardinality in XML:
Declaring exactly one occurrence of an element (only once):
<!ELEMENT companyname (id, name)> (DTD)
<xs:element name="number" type="xs:string"/> (schema file; minOccurs and maxOccurs both
default to 1)
Declaring one or more occurrences of an element:
<!ELEMENT employee (emp+)> (DTD)
<xs:element name="number" type="xs:string" minOccurs="1" maxOccurs="unbounded"/> (schema file)
or, to allow at most n occurrences, where n stands for a positive integer:
<xs:element name="number" type="xs:string" minOccurs="1" maxOccurs="n"/> (schema file)
Declaring zero or more occurrences of an element:
<!ELEMENT employee (emp*)> (DTD)
<xs:element name="number" type="xs:string" minOccurs="0" maxOccurs="unbounded"/> (schema file)
or, to allow at most n occurrences, where n stands for a positive integer:
<xs:element name="number" type="xs:string" minOccurs="0" maxOccurs="n"/> (schema file)
Declaring zero or one occurrence of an element:
<!ELEMENT employee (emp?)> (DTD)
<xs:element name="number" type="xs:string" minOccurs="0" maxOccurs="1"/> (schema file)
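
The occurrence rules above can be sketched as a simple count check in Python. This is only an illustration of what the cardinality indicators mean, not a DTD or schema validator; the function name occurs_ok and the convention of passing None for "unbounded" are assumptions.

```python
import xml.etree.ElementTree as ET


def occurs_ok(parent, child_tag, min_occurs=1, max_occurs=1):
    """Check that `parent` has between min_occurs and max_occurs direct
    children named `child_tag`. Pass max_occurs=None for 'unbounded'."""
    count = len(parent.findall(child_tag))
    if count < min_occurs:
        return False
    return max_occurs is None or count <= max_occurs


two_emps = ET.fromstring("<employee><emp/><emp/></employee>")
no_emps = ET.fromstring("<employee/>")

print(occurs_ok(two_emps, "emp", 1, None))  # emp+ : one or more
print(occurs_ok(no_emps, "emp", 1, None))   # emp+ on an empty parent
print(occurs_ok(no_emps, "emp", 0, None))   # emp* : zero or more
print(occurs_ok(two_emps, "emp", 0, 1))     # emp? : zero or one
```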

XML Entity References


An entity reference is a group of characters used in text as a substitute for a single
specific character that is also a markup delimiter in XML. Using the entity reference
prevents a literal character from being mistaken for a markup delimiter. For example, if an
attribute must contain a left angle bracket (<), you can substitute the entity reference
"&lt;". Entity references always begin with an ampersand (&) and end with a semicolon (;).
You can also substitute a numeric (decimal) or hexadecimal character reference. The entities
predefined in XML are identified in the following table.
Character   Entity reference   Numeric reference   Hexadecimal reference
&           &amp;              &#38;               &#x26;
<           &lt;               &#60;               &#x3C;
>           &gt;               &#62;               &#x3E;
"           &quot;             &#34;               &#x22;
'           &apos;             &#39;               &#x27;
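
As a short sketch of how the predefined entities are applied in practice, the Python standard library provides xml.sax.saxutils.escape and unescape. The sample text and the QUOTES mapping below are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# escape() always handles &, < and >; quotes need an explicit mapping.
QUOTES = {'"': "&quot;", "'": "&apos;"}

text = 'if salary < 1000 & name = "Dileep"'
escaped = escape(text, QUOTES)
print(escaped)

# unescape() reverses &amp;, &lt; and &gt;; pass the inverse mapping for quotes.
restored = unescape(escaped, {"&quot;": '"', "&apos;": "'"})
print(restored == text)

# The entity, numeric and hexadecimal forms all denote the same character.
same = ET.fromstring("<a>&lt; &#60; &#x3C;</a>").text
print(same)
```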

Character data:
Character data in XML can be either PCDATA or CDATA.
PCDATA
PCDATA means parsed character data: the XML parser parses all text inside the element's
tags. If such text contains a literal "<" or "&", the parser reports an error, because it
reads "<" as the start of a new element and "&" as the start of an entity reference. You
cannot write something like "if salary < 1000 then"; it will raise an error. To avoid this,
replace the "<" character with an entity reference, like this: "if salary &lt; 1000 then".
CDATA
CDATA means character data: the XML parser does not parse the text inside a CDATA section,
which starts with "<![CDATA[" and ends with "]]>". If the text contains a lot of "<" or "&"
characters, as program code often does, it can be placed in a CDATA section.
Only the characters "<" and "&" are strictly illegal in XML character data. Apostrophes,
quotation marks, and greater-than signs are legal, but it is a good habit to replace them.
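
The difference between escaped text, a CDATA section, and unescaped text can be seen with a small Python sketch using the standard-library parser. The element name rule and the helper is_well_formed are hypothetical names chosen for the illustration.

```python
import xml.etree.ElementTree as ET

escaped = "<rule>if salary &lt; 1000 then</rule>"
cdata = "<rule><![CDATA[if salary < 1000 then]]></rule>"
broken = "<rule>if salary < 1000 then</rule>"


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it raises an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Both legal forms yield the same character data once parsed.
print(ET.fromstring(escaped).text)
print(ET.fromstring(cdata).text)

# The raw "<" is rejected by the parser.
print(is_well_formed(broken))
```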
Metadata from XML, DTD, and XML Schema Files
PowerMart and PowerCenter can create metadata for a source or target definition from
XML, DTD, or XML schema files. XML files provide both data and metadata, while DTD and
XML schema files provide only metadata.
The Designer requires a lot of memory and resources to parse very large XML files and
extract metadata for source or target definitions. To ensure that the Designer creates an
XML source or target definition quickly and efficiently, Informatica recommends that you
import source or target definitions only from XML files that are no larger than 100K or
from DTD or XML schema files. If you want to import from a very large XML file that has
no DTD or XML schema file, decrease the size of the XML file by deleting duplicate data
elements. You do not need all of your data to import an XML source or target definition. You
need only enough data to accurately show the hierarchy of your XML file and enable the
Designer to create a source or target definition.
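
One way to shrink a large XML file before import, along the lines described above, is to drop repeated sibling elements while keeping the hierarchy. The following Python sketch is an assumption about how that trimming could be scripted; the function name trim_repeats and the default of keeping two occurrences per tag are illustrative choices, not an Informatica feature.

```python
import xml.etree.ElementTree as ET


def trim_repeats(elem, keep=2):
    """Recursively keep at most `keep` children per tag name under each element."""
    seen = {}
    for child in list(elem):  # iterate over a copy because children are removed
        seen[child.tag] = seen.get(child.tag, 0) + 1
        if seen[child.tag] > keep:
            elem.remove(child)
        else:
            trim_repeats(child, keep)
    return elem


big = ET.fromstring(
    "<employee>"
    "<emp><id>1</id></emp><emp><id>2</id></emp><emp><id>3</id></emp>"
    "</employee>"
)
small = trim_repeats(big)
trimmed_text = ET.tostring(small, encoding="unicode")
print(trimmed_text)
```

Keeping two occurrences rather than one is a cautious choice: a single occurrence could hide the fact that an element repeats when the hierarchy is inferred from data alone. Optional children that appear only in later occurrences would still be lost, so the trimmed file should be checked against the full structure.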
Target from XML:
You can create an XML target definition from an XML, DTD, or XML schema file. You can
also create an XML target definition from an XML source definition or from one or more
relational source definitions.
Rules for a Valid Group
An XML group is valid when it follows these rules:

Any element or attribute in an XML file can be included in a group.


A group cannot contain two elements with a many-to-many relationship.
Column names in the groups are unique within a source or target definition.
Group names are unique within a source or target definition.

The Designer validates any group you create or modify. When you try to create a group that
does not follow these constraints, the Designer returns an error message and does not
create the group.

Note: If the target definition consists of only one group, then it does not require a primary
key or a foreign key.
Normalized Groups
A normalized group is a valid group that contains only one multiple-occurring element. In
most cases, XML sources contain more than one multiple-occurring element and convert to
more than one normalized group.
The following rules apply to normalized groups:

A normalized group must be a valid group.


A normalized group cannot contain more than one multiple-occurring element.

Denormalized Groups
A denormalized group has more than one multiple-occurring element. The multiple-occurring
elements can have a one-to-many relationship, but not a many-to-many relationship. All the
elements in a denormalized group belong to the same parent chain.

Source definitions can have denormalized groups, but target definitions cannot have
denormalized groups.
Denormalized groups, like denormalized relational tables, generate duplicate data. They can
also generate null data. Make sure you filter out any unwanted duplicate or null data before
passing data to the target.
The following rules apply to denormalized groups:

A denormalized group must be a valid group.

A denormalized group can contain more than one multiple-occurring element.

Multiple-occurring elements in a denormalized group must have a one-to-many
relationship.

Denormalized groups can exist in a source definition, but not in a target definition.
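
The contrast between the two kinds of group can be pictured with a Python sketch that flattens one small XML hierarchy both ways. The store/product/sale document and the function names are hypothetical; they only illustrate why a denormalized group repeats parent data, and they do not reproduce how the Designer builds groups.

```python
import xml.etree.ElementTree as ET

# Hypothetical hierarchy with two multiple-occurring elements in a
# one-to-many relationship: product (many per store), sale (many per product).
DOC = ET.fromstring(
    "<store><name>Main</name>"
    "<product><pname>Pen</pname><sale>10</sale><sale>12</sale></product>"
    "<product><pname>Ink</pname><sale>7</sale></product>"
    "</store>"
)


def denormalized_rows(store):
    """One row per <sale>; the store and product values repeat on every row."""
    rows = []
    for product in store.findall("product"):
        for sale in product.findall("sale"):
            rows.append((store.findtext("name"), product.findtext("pname"), sale.text))
    return rows


def normalized_groups(store):
    """Two groups, each built around a single multiple-occurring element."""
    products = [(p.findtext("pname"),) for p in store.findall("product")]
    sales = [(s.text,) for p in store.findall("product") for s in p.findall("sale")]
    return products, sales


print(denormalized_rows(DOC))
print(normalized_groups(DOC))
```

In the denormalized output "Main" and "Pen" appear more than once, which is the duplicate data mentioned above. The normalized groups avoid the repetition but need keys to record which sale belongs to which product.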
Group Keys and Relationships
The relationship between elements in the XML hierarchy translates into a combination of
primary and foreign keys that define the relationship between XML groups. If you define a
key in the XML hierarchy, the Designer uses it as a primary key in a group. The Designer
handles group keys and relationships differently for sources and targets.
In a source definition, a group does not have to be related to any other group. A
denormalized group can be independent of any other group. Therefore, groups in a source
definition do not require primary or foreign keys. However, if a group is related to another
group based on the XML hierarchy, and you do not designate any column as a key for the
group, the Designer creates a column called the Generated Primary Key to hold a key for
the group.

In a target definition, each group must be related to one other group. Therefore, each
group needs at least one key to establish its relationship with another group. If you do not
designate any column as a key for a group, the Designer creates a column called Group Link
Key to hold a key for the group.
When you run a session with a mapping that contains an XML source, the Informatica
Server generates the values for the generated primary key columns in the source definition.
When you run a session with a mapping that contains an XML target, you need to pass the
values to the group link columns in the target groups from the data in the pipeline.
Group keys and relationships follow these rules:

Any element or attribute can be marked as a key.

A group can have only one primary key.

A group can be related to only one other group, and therefore can have only one
foreign key.

A column cannot be marked as both a primary key and a foreign key.

A key column can be a column that points to an element in the hierarchy or a column
created by the Designer. A group can have a combination of the two types of key columns.

A source group does not require a key.

A target group requires at least one key.

The target root group requires a primary key. It does not require a foreign key.

A target leaf group requires a foreign key. It does not require a primary key.

A foreign key always refers to a primary key in another group. Self-referencing
keys are not allowed.

A foreign key column created by the Designer always refers to a primary key column
created by the Designer.
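
As a rough sketch of how generated keys can relate two groups, the Python snippet below assigns a sequential primary key to each parent row and copies it into the child rows as a foreign key. The sample document and the column names PK_product and FK_product are hypothetical, and the numbering scheme is an assumption for illustration rather than the way the Informatica Server generates key values.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical hierarchy: each product has one or more sales.
DOC = ET.fromstring(
    "<store>"
    "<product><pname>Pen</pname><sale>10</sale><sale>12</sale></product>"
    "<product><pname>Ink</pname><sale>7</sale></product>"
    "</store>"
)


def build_groups(store):
    """Return (product_rows, sale_rows) linked by a generated key."""
    next_key = itertools.count(1)
    product_rows, sale_rows = [], []
    for product in store.findall("product"):
        pk = next(next_key)  # generated primary key for the parent group
        product_rows.append({"PK_product": pk, "pname": product.findtext("pname")})
        for sale in product.findall("sale"):
            # The foreign key refers to the primary key of the parent group.
            sale_rows.append({"FK_product": pk, "sale": sale.text})
    return product_rows, sale_rows


product_rows, sale_rows = build_groups(DOC)
print(product_rows)
print(sale_rows)
```

The sale group is a leaf, so it carries only a foreign key; the product group carries the primary key that the foreign key points to.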
Code Pages
XML files contain an encoding declaration that indicates the code page used in the file. The
most commonly used code pages in XML are UTF-8 and UTF-16. All XML parsers support
these two code pages. For information on the XML character encoding specification, go to
the W3C website at http://www.w3c.org.
PowerCenter and PowerMart support the same set of code pages for XML files that they
support for relational databases and other flat files. You can use any code page supported
by both Informatica and the XML specification. For a list of code pages that Informatica
supports, see Code Pages in the Installation and Configuration Guide. Informatica does not
support any user-defined code page.
For XML source definitions, PowerCenter and PowerMart use the repository code page.
When you import a source definition from an XML file, the Designer displays the code page
declared in the file for verification only. It does not use the code page declared in the XML
file.

For XML target definitions, PowerCenter and PowerMart use the code page declared in the
XML file. If Informatica does not support the declared code page, the Designer returns an
error. You cannot import the target definition.
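
Before importing a file, it can be useful to see which code page it declares. The Python sketch below reads the encoding declaration from the start of an XML file's bytes. The function name declared_encoding is hypothetical, and the approach assumes an ASCII-compatible encoding, so it will not find the declaration in a UTF-16 file.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of a file.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_encoding(data, default="UTF-8"):
    """Return the encoding named in the XML declaration, or `default` if none."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]  # skip a UTF-8 byte order mark
    match = _DECL.match(data)
    return match.group(1).decode("ascii") if match else default


latin = declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')
plain = declared_encoding(b"<a/>")
print(latin)
print(plain)
```

Falling back to UTF-8 when no declaration is present follows the XML rule that an undeclared document without a byte order mark is read as UTF-8.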
XML Writer:
Verify that the XML environment is set up correctly: the environment variables are set
properly, the .dll files (on Windows) or the shared libraries (on UNIX) are in the correct
location, and the supporting .dat files are present.
How XML Sources and Targets Look in Informatica
XML Source:
Each group in an XML definition is analogous to a relational table, and the Designer treats
each group within the XML Source Qualifier as a separate source of data.
In a mapping, the ports of one group in an XML Source Qualifier can be part of more than
one data flow. However, the ports of more than one group in the same XML Source Qualifier
cannot link to one transformation or be part of the same data flow. This is the biggest
drawback with XML sources. If you need to use data from two different XML source
definitions, you can link a group from each source qualifier and join the data in a Joiner
transformation. You can also use the same source definition more than once in a mapping.
Connect each source definition to a different XML Source Qualifier and join the groups in a
Joiner transformation. The following figure shows how we can join two XML groups in the
same mapping using a Joiner transformation.

If we need to load data from several groups into the same target, it is generally better,
depending on the granularity of the data, to split the work into two or three mappings that
each load the target.
When we create a session to extract data from an XML source, we need to configure source
properties, such as the source file location, in the session properties. Define the XML
source properties in the Properties settings on the Sources tab.

XML Target:
The following figure shows how an XML target looks in Informatica Designer.

When you configure a session to load data to an XML target, you define properties on the
Targets tab and the Transformations tab of the session properties. You can configure the
following properties for XML targets:

Output file options. You can configure the directory and file name to which the
Informatica Server writes the target file.
Code page. You can define the code page declared in the XML target file. Use the Set File
Properties button to define the code page.
Duplicate Group Row Handling. You can configure how the Informatica Server handles
duplicate rows.
DTD/Schema Reference. You can specify a DTD or an XML schema file name for the XML
target.
Points to take care of when using XML as a source or target:

The code page used in the XML, DTD, or XML schema file must be valid and supported by
Informatica. Take care when creating the file that its actual encoding matches the declared
one. For example, a file declared as UTF-8 must actually be saved as UTF-8, not ANSI.

If a DTD or XML schema file is associated with the source or target, the XML data file
must match it exactly.

If the XML source holds a large volume of data, or a large volume must be loaded to the
XML target, divide it into smaller modules according to the business requirement.
Informatica will not be able to read or write very large XML files.

If the source or target DTD or XML schema file changes, always re-import the source or
target definition.

Always make sure that the data type and size of the imported XML metadata are correct and
match the requirement. By default, the import assigns only the number and string data
types, with a size of 10.

Whenever we join two groups in a Joiner transformation, make sure that the smaller group
is selected as the master group.

If XML is the target, always make sure that the data sent to the target matches the
cardinality defined in the target DTD, XML schema, or XML file.

If XML is the source, decide whether the groups in the source should be normalized or
denormalized based on the requirement. If the groups are normalized, make sure that each
one contains only one multiple-occurring element.

An XML target can never be denormalized.
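
The advice above about choosing the smaller group as the master can be pictured with a simple hash join in Python. This is only an analogy for why a small master set is cheaper to hold in memory; it is an assumption for illustration, not a description of how the Joiner transformation is implemented. The function name join_groups and the sample rows are hypothetical.

```python
def join_groups(master, detail, key):
    """Inner-join two lists of row dicts on `key`, caching the master rows."""
    cache = {}
    for row in master:  # the smaller set is held in memory
        cache.setdefault(row[key], []).append(row)
    joined = []
    for row in detail:  # the larger set is streamed past the cache
        for match in cache.get(row[key], []):
            joined.append({**match, **row})
    return joined


master = [{"dept_id": 1, "dept": "HR"}, {"dept_id": 2, "dept": "IT"}]
detail = [
    {"dept_id": 2, "emp": "A"},
    {"dept_id": 1, "emp": "B"},
    {"dept_id": 3, "emp": "C"},
]
joined = join_groups(master, detail, "dept_id")
print(joined)
```

Only the master rows are cached, so the memory needed grows with the size of the master set, which is why the smaller group is the better choice for it.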