Eight Challenges in Data Integration


Published by: Charteris Plc on Oct 19, 2008
Copyright:Attribution Non-commercial


EIGHT CHALLENGES IN DATA INTEGRATION

Author: Robert P Worden

20 November 2002

©2002 Charteris plc

CONTENTS
1. LIMITATIONS OF DATA TRANSLATION TOOLS
2. PROBLEMS OF MANAGEMENT AND COST
2.1 The N-Squared Problem
2.2 The N-fold Maintenance Problem
2.3 The Double Knowledge Problem
3. PROBLEMS OF INTRINSIC TRANSLATION CAPABILITY
3.1 The Nesting Problem
3.2 The De-Normalisation Problem
3.3 The Uncommitted Language Problem
3.4 The Data Grouping Problem
3.5 Bringing It Together
4. A WORKING SOLUTION TO THE EIGHT KEY PROBLEMS

20 November 2002 Draft 01


1. LIMITATIONS OF DATA TRANSLATION TOOLS

The problem of application integration has been around for a long time. It is a very expensive problem, and most would say it has not been satisfactorily solved. Application integration is a large and ever-increasing fraction of IT budgets – and is at the root of many project failures.

For more than ten years, there have been specialist toolsets available for Enterprise Application Integration (EAI). These tools offer a range of facilities - for business process orchestration, transaction management, security, package ‘adaptors’ and so on. They all have facilities to translate data between different applications and databases. As XML has become the preferred choice of inter-application ‘glue’ – both within and across organisations, with or without the label of ‘web services’ – EAI tools have been extended to handle XML. There has also emerged a set of specialist XML translation tools.

Most of these translation tools work in the same way. In a design phase, you define equivalences between data items in two different data sources. For this, you use a process of field-to-field mapping. Then at run time, the mappings are used to make the data translations automatically. There are many of these field-to-field data mapping tools on the market. For instance:

♦ WebSphere Data Interchange from IBM
♦ BizTalk Mapper from Microsoft
♦ Tibco Message Broker
♦ Mercator Integration Broker
♦ E-Biz Integrator from Sybase
♦ GoXML from XML Global
♦ Embarcadero D/T Designer
♦ Data Mirror Transformation Server

Field-to-field mapping works as follows: The design tool automatically captures the structure of some data source (for a database, its relational schema; for an XML source, the nesting of elements from its XML schema or DTD). This structure is displayed as a tree diagram. The nodes and leaves of the tree are ‘fields’ (e.g. columns in a database, or XML attributes) which hold the smallest items of data.

You display the tree structures of two different data sources side by side. You drag-and-drop to tell the tool: ‘this field in source A is equivalent to that field in source B’ – drawing a line across, to denote a mapping between the two fields. Maybe you put a box on the line to define some data translation (e.g. between different representations of dates) – using a palette of pre-defined translation functions, or adding custom functions.

Once you have defined the mappings, run-time data translation is done automatically by the tool, or by code generated from the tool. Therefore you save the cost of hand-coding data translations.

Data mapping would appear to be a big cost-saver. Based on that promise, over the years many people have bought these mapping-style translation tools. However, these tools have not yet had a big impact on the practice of data integration; their use has remained localised. We know of no major IT user which regularly uses a single data integration tool for all its application integration needs. The dominant method of data integration is still hand-coding. Why is this?

This note describes how field-to-field mapping fails to tackle eight of the most important challenges of data integration. If these problems are not tackled properly, they come back to bite you, and force you back to hand-coding. That is possibly why the data mapping products have not achieved widespread use – in spite of having been around for many years.

Three key challenges of data integration are problems of management and cost:


1. The N-squared problem: If you have N different systems or XML languages to translate between, you might have to define as many as N(N-1) sets of mappings to do all the required translations. When N is large – say 20 different systems – even a small fraction of this number makes a very large number of mappings to define, which is very costly.

2. The N-fold Maintenance Problem: Any system or XML language will evolve through successive versions – with typically a new version every few months. You need mappings from any one system to many of the other (N-1) systems. Whenever one system or XML language changes to a new version, all its mappings to any other systems may need to be re-done. This is a big ongoing cost.

3. The Double Knowledge Problem: To build accurate mappings between two complex data sources, you need to have deep knowledge of both data sources at the same time. This combination of knowledge is very rare, and very hard to find in one person. Mapping errors arise from lack of knowledge of one of the two systems.

A further five are problems of basic translation capability:

4. The Nesting Problem: this is an XML-specific problem, and is now very prevalent. Some XML languages are deeply nested, representing associations between objects (or relations, in database terminology) by their nesting. Other languages are more shallow, representing the same associations by shared values of fields. Using field-to-field mapping, you cannot make accurate translations between a nested and a shallow XML language.

5. The De-Normalisation Problem: When a relational database is de-normalised, it represents several related objects by one row of a table. When an XML language is de-normalised, it represents several related objects in one XML element. Both of these are very common. Both lead to duplication of data. Accurate translation between normalised and de-normalised forms – or even between different kinds of de-normalisation – often cannot be done with field-to-field mapping.

6. The Uncommitted Language Problem: XML languages and relational databases are often designed to be open-ended, to convey new types of information without any change of schema. This is done by making individual messages or records hold their own metadata. Translating from an open-ended language to a closed language, or vice versa, requires extensions to the basic field-to-field mapping approach.

7. The Data Grouping Problem: In some XML message formats, data items are grouped according to the values of some properties. Translating from a grouped format to an un-grouped format is hard or impossible for simple mapping-based translators.

8. Bringing it all together: Problems like those above do not occur in isolation. In any real integration project, you will encounter most of them, in combination. Even if field-to-field mapping can, with some effort, tackle one of these problems on its own, how does it cope when they all occur together?

You can check out these problems of translation capability for yourself. Using the data mapping tool of your choice, you can try out the specific examples which follow, to see how many of them it can translate. We would be interested to hear your results.

The problems (1) – (8) above are not rare or esoteric. In real systems, they occur all the time, and in any large integration project you will probably run into all of them. With field-to-field mapping products, this leaves you in a very awkward position – with an automated tool which does half the job you want it to. You then have to try to understand how the automated tool works (for instance, to understand the code it generates) in order to patch it up to do the whole job. Patching generated code is a configuration nightmare, and often cannot be done. That, we believe, is why field-to-field mapping tools have not been a great success in their ten-year history.
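To make the mechanism concrete, the run-time side of field-to-field mapping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`cust_name`, `dob`, and so on), the date formats, and the `translate_record` helper are all hypothetical and are not taken from any of the products listed above.

```python
from datetime import datetime

# Hypothetical mapping table: source field -> (target field, optional conversion).
# Each entry corresponds to one line drawn between two fields in a mapping tool;
# the conversion plays the role of the "box on the line".
FIELD_MAP = {
    "cust_name": ("customerName", None),
    "dob": (
        "dateOfBirth",
        lambda s: datetime.strptime(s, "%d/%m/%Y").strftime("%Y-%m-%d"),
    ),
}

def translate_record(record, field_map):
    """Copy each mapped source field to its target field, converting if required."""
    out = {}
    for src_field, (dst_field, convert) in field_map.items():
        if src_field in record:
            value = record[src_field]
            out[dst_field] = convert(value) if convert else value
    return out

translated = translate_record({"cust_name": "Smith", "dob": "20/11/1960"}, FIELD_MAP)
```

Note that the table can only say that two fields carry the same property. It has no place to record how either source represents a relationship between records, which is where the translation problems listed above begin.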


In recent months, web services have been heralded as the solution to the problem of application integration – which will dramatically simplify the problem of integrating disparate systems. It is important to realise that the standards which underlie web services – SOAP, WSDL, XML Schema, and so on – do almost nothing to solve the problems of data incompatibility between systems. To integrate two complex systems via web services, you still need to solve the same problems of data integration that you had to solve before. And the most common solution on offer is still field-to-field mapping. The reader may have guessed that we would not be listing these eight key problems of data integration if we did not think there was any solution. In the last section of this paper, we briefly describe the model-centred approach of the Charteris Integration Toolkit, and how it offers practical, working solutions to all of the key challenges (1) – (8). On request from Charteris, you can demonstrate these solutions for yourself.


2. PROBLEMS OF MANAGEMENT AND COST

2.1 The N-Squared Problem

Even medium-sized companies typically have hundreds of different IT systems, which have grown up over many years, and have pressing needs to integrate many of them. They also increasingly need to integrate with their business partners’ systems. If you have N systems to integrate, the maximum number of interfaces required (and thus, the maximum number of sets of mappings you will need to make) is N(N-1) – approximately N squared. Even with N = 60, this would be a prohibitive number of mappings, and would swallow the IT budget many times over.

This upper limit of N(N-1) never occurs in practice. Experience suggests that some number between 10*N and 20*N is more realistic, for medium-large companies. When it is first built, a new system will typically interface to anywhere between 5 and 15 existing systems. As time goes by, and other newer systems in turn are interfaced to it, its interface count grows – leading to a range of 10-20 interfaces per system in large companies.

With, for instance, 60 systems and 15 interfaces per system, this leads to 900 separate interfaces – again, a prohibitive number of interfaces to build, if individual system-to-system mappings must be made for each interface. Typically, because of these costs, the interfaces are just not built, and companies live with a fragmented IT architecture – with the heavy business costs that entails.

Some products have tackled the N-squared problem by having a single central translation hub, which translates in two steps via a central representation. Integration hubs have their uses, but relying on a single hub for all translation has not, in practice, proved popular. The performance costs of two-step translation are more than a factor of two; the hub is a bottleneck and single point of failure. But most important, the political implications of a single hub are unpopular in many organisations. Many parts of the business resent the existence of a hub controlled by someone else, and simply bypass it.
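The figures quoted in this section are easy to check. The short Python sketch below reproduces them; the function names are illustrative, and the per-system counting convention simply follows the text (interfaces per system multiplied by N).

```python
def max_interfaces(n):
    """Upper bound: every system mapped to every other system, in both directions."""
    return n * (n - 1)

def typical_interfaces(n, per_system):
    """Rule-of-thumb count used in the text: interfaces per system times N."""
    return n * per_system

n = 60
upper = max_interfaces(n)               # 60 * 59
typical = typical_interfaces(n, 15)     # 60 * 15
low, high = typical_interfaces(n, 10), typical_interfaces(n, 20)
```

For 60 systems the upper bound is 3540 sets of mappings, and the rule of thumb gives between 600 and 1200, with 900 at 15 interfaces per system.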

2.2 The N-fold Maintenance Problem

If data integration were just a one-off problem, then given a strong business case, a heroic one-off effort might be made to solve it. But it is not. Technology changes; business requirements change. Systems change; and ‘standard’ e-commerce languages change. Typical IT systems and XML languages undergo ceaseless evolution, with major new versions being released at intervals of a few months. Every time one of these systems or languages changes, all of its interfaces to other systems may also need to change. If these are defined by system-to-system mappings, that makes a lot of re-mapping work. On the rule of thumb above (10-20 interfaces per system), this means revising and updating 10-20 sets of mappings per system, every few months – a massive maintenance workload. The result is that necessary system changes are often just not made, because of the prohibitive interface maintenance effort. Systems fall behind business requirements, and legacy systems are locked in place by their interfaces – again, with heavy costs to the business.

2.3 The Double Knowledge Problem

It is not always evident, from the simple examples seen in marketing material, how complex it is to map just one data source onto another. Typical relational databases have hundreds or thousands of tables – and therefore have many thousands of fields or columns. Widely-used XML languages may have hundreds or thousands of distinct elements. Therefore just one set of mappings between two data sources may involve hundreds or thousands of field-to-field mappings. To make these mappings accurately between any pair of data sources, you need to understand both sources at a deep level. You need to understand their physical structure in detail – because these are physical mappings. More important, you need to understand the detailed semantics of both data sources – because it is wrong to map two fields onto one another if they have different meanings. In a large organisation, it is usually possible to find a person who has this deep knowledge of the physical structure and semantics for just one database, API or XML language. It is practically
impossible to find anyone who has such deep knowledge of two sources at the same time. So it is not feasible to ask any one person to make the mappings for any pair of data sources – you need to find two scarce, knowledgeable people, and task them to make the mappings together. We have found this is rarely possible, in practical management terms. A person who is really knowledgeable about one system or database is a valuable and scarce resource, and cannot be spared to make repeated mappings of that system to other systems. More typically, the mappings are made by people who have knowledge of only one side of the mapping, and guess the other side from available documentation. This leads inevitably to expensive errors.


3. PROBLEMS OF INTRINSIC TRANSLATION CAPABILITY

We will illustrate the key data translation problems by examples in XML – although for many of them, the same problem can occur in any other message format, or between relational databases. For each problem, we will show two samples of XML which convey the same information. The requirement is to translate accurately between the two, in either direction.

3.1 The Nesting Problem

Consider the following two fragments of XML:
<message>
    <driver name = 'Smith' age = '42' >
        <car make = 'Ford' reg = 'KFL942' />
        <car make = 'VW' reg = 'PEZ288' />
    </driver>
    <driver name = 'Jones' age = '27' >
        <car make = 'Fiat' reg = 'BCC100' />
    </driver>
</message>

and:
<message>
    <drivers>
        <driver name = 'Smith' age = '42' />
        <driver name = 'Jones' age = '27' />
    </drivers>
    <cars>
        <car make = 'Ford' reg = 'KFL942' driver = 'Smith' />
        <car make = 'VW' reg = 'PEZ288' driver = 'Smith' />
        <car make = 'Fiat' reg = 'BCC100' driver = 'Jones' />
    </cars>
</message>

These two messages convey exactly the same information, about a group of drivers and their cars. Each driver has a name and an age, and can drive several cars. Each car has a make and a registration number, and is driven by only one driver.

In the first message, the fact that a driver drives a car is denoted by nesting of the <car> element inside the appropriate <driver> element. In the second message, the same fact is denoted not by nesting of elements, but by each <car> element having a ‘driver’ attribute, which matches the ‘name’ attribute in some <driver> element.

Because the two messages convey the same information, it should be possible to translate from one to the other. It is our experience that field-to-field mapping tools cannot do this translation accurately in both directions. Typically, when going from the ‘flat’ to the nested form, they lack the capability to group just the correct inner elements (and no others) in any outer element. Furthermore, for mapping tools which generate code, it seems to be very hard to ‘patch up’ the generated code to do the right job. Mapping tools therefore fail to do a basic and important translation task.

Denoting an association such as ‘driver drives car’ by nesting of elements is very common in XML (XML is good for nesting). But it is far from universal. Many associations are denoted in XML languages by some kind of shared value. So this translation problem is a very common one. In any large XML integration or interoperability project, you will probably have to translate between a nested and a non-nested representation of an association. Most mapping tools cannot do it.
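For comparison, here is what the translation looks like when it is hand-coded. The sketch below uses Python's standard `xml.etree.ElementTree` module and handles both directions. It assumes, as the example does, that driver names are unique and can serve as the shared key; the function names are illustrative.

```python
import xml.etree.ElementTree as ET

FLAT = """<message>
  <drivers>
    <driver name="Smith" age="42"/>
    <driver name="Jones" age="27"/>
  </drivers>
  <cars>
    <car make="Ford" reg="KFL942" driver="Smith"/>
    <car make="VW" reg="PEZ288" driver="Smith"/>
    <car make="Fiat" reg="BCC100" driver="Jones"/>
  </cars>
</message>"""

def flat_to_nested(flat_root):
    """Nest each <car> under the <driver> whose name matches its 'driver' attribute."""
    nested = ET.Element("message")
    by_name = {}
    for d in flat_root.findall("./drivers/driver"):
        by_name[d.get("name")] = ET.SubElement(nested, "driver", dict(d.attrib))
    for c in flat_root.findall("./cars/car"):
        # The association is now carried by nesting, so the key attribute is dropped.
        attrs = {k: v for k, v in c.attrib.items() if k != "driver"}
        # Raises KeyError if a car refers to a driver that is not listed.
        ET.SubElement(by_name[c.get("driver")], "car", attrs)
    return nested

def nested_to_flat(nested_root):
    """Lift each nested <car> out and record its driver as a shared value."""
    flat = ET.Element("message")
    drivers = ET.SubElement(flat, "drivers")
    cars = ET.SubElement(flat, "cars")
    for d in nested_root.findall("./driver"):
        ET.SubElement(drivers, "driver", dict(d.attrib))
        for c in d.findall("./car"):
            ET.SubElement(cars, "car", {**c.attrib, "driver": d.get("name")})
    return flat

nested = flat_to_nested(ET.fromstring(FLAT))
round_trip = nested_to_flat(nested)
```

The key step in `flat_to_nested` is the lookup that places each car under exactly the right driver. That step depends on knowing how each format represents the ‘driver drives car’ association, which a table of field equivalences does not record.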

3.2 The De-Normalisation Problem

This is illustrated by two fragments of XML:
<products>
    <product name = 'widget' mfr = 'Acme Inc' mfState = 'NY' />
    <product name = 'trunnion' mfr = 'Acme Inc' mfState = 'NY' />
    <product name = 'plunger' mfr = 'Acme Inc' mfState = 'NY' />
    <product name = 'valve' mfr = 'Perfecto' mfState = 'CA' />
</products>

and:
<message>
    <products>
        <product name = 'widget' mfCode = 'ac' />
        <product name = 'trunnion' mfCode = 'ac' />
        <product name = 'plunger' mfCode = 'ac' />
        <product name = 'valve' mfCode = 'pf' />
    </products>
    <manufacturers>
        <mfr name = 'Acme Inc' code = 'ac' mfState = 'NY' />
        <mfr name = 'Perfecto' code = 'pf' mfState = 'CA' />
    </manufacturers>
</message>

The first fragment is de-normalised, in that information about two different kinds of entity (products and manufacturers) is held in the same <product> element. Because the same manufacturer may make many products, some information about the manufacturer (for instance, the state in which it is located) is duplicated across the elements for many different products. The second fragment is normalised, so that information about manufacturers is stored separately from product information, and is not duplicated. ‘Manufacturer state’ is stored only once.

De-normalisation occurs very widely in databases and XML message formats; and normalised forms are equally common. So it is a very common requirement to translate between normalised and de-normalised forms. Field-to-field mapping techniques have great difficulty in doing this, if they can do it at all. They do not have any natural way to express the re-groupings of data which are required; so they either cannot make the required translation at all, or at best they do it only by using tortuous procedural constructs, which would be much better done in a high-level language.

The difficulties of field-to-field mapping in nested and de-normalised data translations have a common origin. Both of these problems are about associations between objects of different kinds – associations such as ‘person owns car’ or ‘manufacturer makes product’. (Associations are the ‘relations’ of Entity-Relation Diagrams, and are fundamental to all data models.) Field-to-field mappings allow you to talk about properties of things – to say ‘these two fields both represent the same property’ – but they do not allow you to say how associations are represented.

Using a field-to-field mapping tool, there is no way to say ‘this is how the data source represents this association’. So it is not surprising that field-to-field mapping tools are so bad at translating associations. They can only do so when both data sources represent the association in the same way – for instance, by shared values of properties. When one end of a translation represents an association in some different way – by nesting or de-normalisation – field-to-field mapping breaks down.

Associations are the backbone of data, which hold it together. Without associations, a database or an XML message would be just a collection of disconnected facts, and would be of little use. The inability of field-to-field mapping tools to translate association information is a very serious defect.
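The re-grouping needed for one direction of the product example (normalised to de-normalised) is a join on the manufacturer code. A hand-coded sketch in Python, using the standard `xml.etree.ElementTree` module, might look like this; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

NORMALISED = """<message>
  <products>
    <product name="widget" mfCode="ac"/>
    <product name="trunnion" mfCode="ac"/>
    <product name="plunger" mfCode="ac"/>
    <product name="valve" mfCode="pf"/>
  </products>
  <manufacturers>
    <mfr name="Acme Inc" code="ac" mfState="NY"/>
    <mfr name="Perfecto" code="pf" mfState="CA"/>
  </manufacturers>
</message>"""

def denormalise(message):
    """Join each product to its manufacturer and copy the manufacturer's details in."""
    # Index manufacturers by the shared value that represents the association.
    mfrs = {m.get("code"): m for m in message.findall("./manufacturers/mfr")}
    products = ET.Element("products")
    for p in message.findall("./products/product"):
        m = mfrs[p.get("mfCode")]  # follow 'manufacturer makes product'
        ET.SubElement(products, "product", {
            "name": p.get("name"),
            "mfr": m.get("name"),
            "mfState": m.get("mfState"),
        })
    return products

flat_products = denormalise(ET.fromstring(NORMALISED))
rows = [(p.get("name"), p.get("mfr"), p.get("mfState")) for p in flat_products]
```

The reverse direction is harder to hand-code: the de-normalised fragment carries no manufacturer codes, so a translator has to de-duplicate manufacturers and obtain or generate a code for each one.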

3.3 The Uncommitted Language Problem

Consider these two fragments of XML:
<people>
    <person name = 'Smith' age = '30' />
    <person name = 'Jones' age = '25' />
</people>

and:
<people>
    <person>
        <prop pName = 'name'>Smith</prop>
        <prop pName = 'age'>30</prop>
    </person>
    <person>
        <prop pName = 'name'>Jones</prop>
        <prop pName = 'age'>25</prop>
    </person>
</people>

The first XML language can be extended, but only by extending its schema. If you want to represent some other property of a person (such as their gender) you will have to extend the definition of the XML language in its schema, to add a new attribute ‘gender’. The second fragment of XML is designed to be extensible without having to extend its schema. You can record genders (or any other property of people) by adding an element <prop pName = 'gender'>male</prop>, which does not require any extension of the schema. This is a more uncommitted XML language.

Translation between committed and uncommitted languages presents a further challenge for field-to-field mapping translators. Since, in the second language, the content of the element ‘prop’ can represent essentially anything, depending on its ‘pName’ attribute, there is no simple way to map it onto an element or attribute of a committed language. Some mapping products have introduced a conditional mapping construct to address this problem, but others cannot do it.

This ‘uncommitted’ style of XML language design is used quite commonly (for instance, it is used in OAGIS XML messages), and it is used to determine entity classes and associations, as well as properties. It is essential to be able to translate freely between committed and uncommitted structures.
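A hand-coded translation between the two styles has to treat the value of ‘pName’ as if it were an attribute name. The Python sketch below, using the standard `xml.etree.ElementTree` module, shows both directions. The `known` parameter is an assumption introduced here to stand in for the committed language's schema.

```python
import xml.etree.ElementTree as ET

def committed_to_uncommitted(people):
    """Turn each attribute of <person> into a <prop pName='...'> child element."""
    out = ET.Element("people")
    for person in people.findall("person"):
        new_person = ET.SubElement(out, "person")
        for key, value in person.attrib.items():
            prop = ET.SubElement(new_person, "prop", {"pName": key})
            prop.text = value
    return out

def uncommitted_to_committed(people, known=("name", "age")):
    """Turn <prop> children back into attributes, keeping only schema-defined ones."""
    out = ET.Element("people")
    for person in people.findall("person"):
        attrs = {
            p.get("pName"): (p.text or "")
            for p in person.findall("prop")
            if p.get("pName") in known  # a committed schema cannot hold other properties
        }
        ET.SubElement(out, "person", attrs)
    return out

committed = ET.fromstring(
    '<people><person name="Smith" age="30"/><person name="Jones" age="25"/></people>'
)
uncommitted = committed_to_uncommitted(committed)
back = uncommitted_to_committed(uncommitted)
```

Going from the uncommitted to the committed form, any property outside `known` is dropped in this sketch. A real translation would need an explicit decision about such properties.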

3.4 The Data Grouping Problem

This is illustrated by the following two fragments of XML:
<college>
    <student name = 'Carter' age = '20' year = '2' />
    <student name = 'LeBrun' age = '19' year = '1' />
    <student name = 'Schmidt' age = '18' year = '1' />
</college>


and:
<college>
    <students year = '1' >
        <student name = 'Schmidt' age = '18' />
        <student name = 'LeBrun' age = '19' />
    </students>
    <students year = '2' >
        <student name = 'Carter' age = '20' />
    </students>
</college>

In both examples, every student is in a year 1..4. In the second example only, the students in any year have been grouped together. Such grouping is very common, for instance in XML intended for transformation to HTML and display as a report. Translating between the grouped and ungrouped forms requires a structural transformation based on meaning. It has been our experience that field-to-field mapping products generally cannot make the required transformation.
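The structural transformation involved can be sketched by hand in Python with the standard `xml.etree.ElementTree` module. The sketch takes the ungrouped fragment as input, groups students by their ‘year’ attribute, and then reverses the process. The function names are illustrative, and ordering groups by year number is a choice made here, not something the formats require.

```python
import xml.etree.ElementTree as ET

UNGROUPED = """<college>
  <student name="Carter" age="20" year="2"/>
  <student name="LeBrun" age="19" year="1"/>
  <student name="Schmidt" age="18" year="1"/>
</college>"""

def group_by_year(college):
    """Collect students into one <students year='...'> element per distinct year."""
    grouped = ET.Element("college")
    groups = {}
    # sorted() is stable, so students keep their original order within a year.
    for s in sorted(college.findall("student"), key=lambda s: int(s.get("year"))):
        year = s.get("year")
        if year not in groups:
            groups[year] = ET.SubElement(grouped, "students", {"year": year})
        attrs = {k: v for k, v in s.attrib.items() if k != "year"}
        ET.SubElement(groups[year], "student", attrs)
    return grouped

def ungroup(college):
    """Flatten the groups, copying each group's year back onto its students."""
    flat = ET.Element("college")
    for g in college.findall("students"):
        for s in g.findall("student"):
            ET.SubElement(flat, "student", {**s.attrib, "year": g.get("year")})
    return flat

grouped = group_by_year(ET.fromstring(UNGROUPED))
flat_again = ungroup(grouped)
```

Here the ‘year’ value moves from a property of each student to a property of the enclosing group, so the translation cannot be expressed as a one-to-one correspondence between fields at the same level.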

3.5 Bringing It Together

These are not rare or isolated problems. They occur all the time in practical data translations, for XML, databases and other APIs. One of these problems is not likely to occur as above, in isolation – it will occur together with some of the others, and with others we have not mentioned here (such as merging and splitting of fields; superclasses and subclasses; and duplicated representation of objects).

As we have seen, each problem on its own presents serious difficulties for the field-to-field mapping approach. These difficulties arise first, because the approach does not recognise the importance of associations in data models, and second, because it is ill-equipped to make complex structural transformations of data.

Vendors of mapping products have partially recognised this difficulty. As well as the basic mapping functionality, some products offer procedural constructs (such as ‘IF’ constructs and iterators) which can be built into the mappings to produce more complex behaviour. Sometimes this behaviour can be made (with difficulty) to give the required structural transformations, and solve one of the four translation problems. More often it cannot. Even if it can, there are serious difficulties:

1. The mixture of procedural and non-procedural mapping constructs is very hard to understand and debug.

2. If a problem can only be overcome by introducing a complex procedural construct, then the claimed benefit (of avoiding procedural hand-coding) has disappeared. You would do better to do the procedural coding in a high-level language which is designed for the purpose.

3. If procedural constructs are used to solve two or more overlapping problems, their interactions with the automatic functionality and with each other are extremely complex. You are not likely to produce a solution which anybody can understand or maintain.

If the mapping tool works by code generation, then the alternative (of hand-tweaking the generated code) is even worse to contemplate.

The result is that field-to-field mapping tools are not capable of tackling industrial-strength data translation challenges, where the problems described above are widespread.


4. A WORKING SOLUTION TO THE EIGHT KEY PROBLEMS

The Charteris Integration Toolkit is described in more detail elsewhere. Here we give only the briefest description, to describe how it meets the eight challenges above.

The Charteris XML Integration toolkit does not work by field-to-field mapping. Before making any mappings, you first define a business object model of the domain. This can be done directly in one of the tools in the toolkit, or in UML using a CASE tool such as Rational Rose. This object model is technology-independent. Creating it requires business knowledge, not technical knowledge.

Then, you do not map the different XML languages or databases onto each other; you map them each onto the business object model. The mapping tool captures XML schemas and relational schemas automatically, and then provides graphical facilities to make and review their mappings onto the object model. A set of mappings defines how a database or XML language conveys the information in the object model.

These mappings are exported in an XML format and are used by other tools in the Charteris toolset. Given the mappings for any two languages, a tool in the toolset can translate messages directly from one language to the other. This avoids the inefficiencies of a two-step translation, and does not require any translation hub.

This approach solves the 8 problems of data integration, as follows:

1. The N-Squared Problem: Because each language or database is mapped only onto one business object model, and not onto any other languages, the cost of making all the mappings grows only in proportion to N, not N².

2. The N-Fold Maintenance Problem: When any language or database changes to a new version, you only need to change one set of mappings onto the business model. You do not need to change any mappings onto other systems.

3. The Double Knowledge Problem: To make the mappings for one language or system, you need to understand that language or system in depth, and you need to understand the business object model. The latter is business knowledge, which you need in any case to understand the semantics of the language or system. Deep knowledge of two systems is not required.

4. The Nesting Problem: The Charteris toolkit translates accurately and automatically between nested and ‘flat’ XML languages. This is because the mappings onto the object model define properly how each language represents the associations in the object model. Therefore association information is translated accurately.

5. The De-Normalisation Problem: The Charteris toolkit translates accurately and automatically between different de-normalised forms, both from databases and in XML. Again, this is because the mappings onto the object model define properly how each language represents the associations in the object model, which have been used to de-normalise the data.

6. The Uncommitted Language Problem: Mapping constructs in the toolkit describe uncommitted languages in a simple and natural manner, allowing accurate automatic translation.

7. The Grouping Problem: Translations are accurately made between grouped and un-grouped forms of the same data.

8. Bringing it all together: The automatic translator is designed to handle all these problems simultaneously, in one translation. Because you do not have to do ‘specials’ to solve these problems, you do not have to worry about how the specials interact; and the translator gets it right.

You do not have to take our word for it, that the Charteris toolkit really does solve these problems of data integration. On request to robert.worden@charteris.com, we will send you an evaluation pack, with demonstration solutions to the five problems of intrinsic translation capability, and the means for you to construct your own more complex tests.
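The cost argument behind mapping each language onto one shared model can be illustrated with a small Python sketch. It is not the Charteris toolkit and does not use its mapping format: the adapter functions, the `ADAPTERS` registry and the dictionary-based model are all assumptions made for illustration. The sketch also builds the model in memory at run time, which is simpler than, and different from, the direct translation described above.

```python
import xml.etree.ElementTree as ET

# Assumed shared model: {driver_name: {"age": str, "cars": [(make, reg), ...]}}

def read_nested(root):
    """Nested language -> model: the association is read from element nesting."""
    return {
        d.get("name"): {
            "age": d.get("age"),
            "cars": [(c.get("make"), c.get("reg")) for c in d.findall("car")],
        }
        for d in root.findall("driver")
    }

def read_flat(root):
    """Flat language -> model: the association is read from the shared 'driver' value."""
    model = {d.get("name"): {"age": d.get("age"), "cars": []}
             for d in root.findall("drivers/driver")}
    for c in root.findall("cars/car"):
        model[c.get("driver")]["cars"].append((c.get("make"), c.get("reg")))
    return model

def write_nested(model):
    root = ET.Element("message")
    for name, info in model.items():
        d = ET.SubElement(root, "driver", {"name": name, "age": info["age"]})
        for make, reg in info["cars"]:
            ET.SubElement(d, "car", {"make": make, "reg": reg})
    return root

def write_flat(model):
    root = ET.Element("message")
    drivers = ET.SubElement(root, "drivers")
    cars = ET.SubElement(root, "cars")
    for name, info in model.items():
        ET.SubElement(drivers, "driver", {"name": name, "age": info["age"]})
        for make, reg in info["cars"]:
            ET.SubElement(cars, "car", {"make": make, "reg": reg, "driver": name})
    return root

# One entry per language: N entries support all N(N-1) translation directions.
ADAPTERS = {"nested": (read_nested, write_nested), "flat": (read_flat, write_flat)}

def translate(root, source, target):
    read, _ = ADAPTERS[source]
    _, write = ADAPTERS[target]
    return write(read(root))

nested_in = ET.fromstring(
    '<message><driver name="Smith" age="42"><car make="Ford" reg="KFL942"/></driver></message>'
)
flat_out = translate(nested_in, "nested", "flat")
```

Adding a further language means writing one more pair of adapters against the model, and a new version of one language means revising only that language's pair.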
