
TDWI Research

TDWI Checklist Report, May 2010

Top Ten Best Practices for Data Integration

By Philip Russom

Sponsored by SAS

tdwi.org
TABLE OF CONTENTS

Foreword
Number One
  Data integration is a family of diverse but related techniques.
Number Two
  Data integration practices reach across both analytics and operations.
Number Three
  Data integration is an autonomous data management discipline.
Number Four
  Data integration is the repurposing of data via transformation.
Number Five
  Data integration is a value-adding process.
Number Six
  Data integration is a green technology that makes data management more sustainable.
Number Seven
  A data integration solution should have architecture.
Number Eight
  A data integration solution should be the product of collaboration.
Number Nine
  Data integration must be coordinated with other data management disciplines.
Number Ten
  Data integration should be governed, but also contribute to governance.
About Our Sponsor
About the Author
About TDWI Research

1201 Monster Road SW
Suite 250
Renton, WA 98057

T 425.277.9126
F 425.687.2842
E info@tdwi.org
www.tdwi.org

© 2010 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or part are prohibited except by written permission. E-mail requests or feedback to info@tdwi.org. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

TDWI Research   tdwi.org

TDWI Checklist Report: Top Ten Best Practices for Data Integration

Foreword

The data management discipline known as data integration (DI) has undergone an impressive expansion over the last decade. Today it has reached a critical mass of multiple techniques used in diverse applications and business contexts. Vendor products have achieved maturity; users have grown their DI teams to epic proportions; competency centers regularly staff DI work; and DI as a discipline has earned its autonomy from related practices like data warehousing and database administration.

Given all this change, it’s not surprising that people in the field might not be up to speed on the current incarnation of DI. Even DI specialists and the colleagues who depend on them sometimes forget the new techniques, diversity, independence, collaboration, and governance typical of modern DI practices. Many suffer misconceptions and out-of-date mindsets that need adjustment.

The 10 practices described in this TDWI Checklist Report paint a modern landscape of current DI practices. They also bust a few DI myths that are still too common. Mainly, however, this report raises the bar on DI, showing how sophisticated and powerful a DI solution can be, at least when DI is driven by modern best practices using up-to-date tools.

If you let it all soak in, this Checklist Report will redefine DI for you and your peers. And it will help you set higher goals and aspirations for DI work and its outcome. The practices listed here can be the guidelines that help you achieve more modern, high-value, diverse, independent, well-designed, far-reaching, green, collaborative, and well-governed uses of DI tools and techniques.

Number One
Data integration is a family of diverse but related techniques.

Data integration is a family of techniques, most commonly including ETL (extract, transform, and load), data federation, database replication, data synchronization, sorting, and changed data capture. All these techniques require support for a wide range of interfaces, so the resulting DI solution can access databases, applications, and files to extract or load data. Solutions based on these techniques may be hand-coded, based on a vendor’s tool, or a mix of both.

Despite the diversity of techniques, all DI solutions take certain actions. For example, a DI solution (regardless of which technique it’s based on) collects data from one or more sources, transforms and integrates this disparate data into a common data model, and loads the integrated data into a target database, application, or file. In its simple forms, DI merely extracts data from one source and copies it into a target. When done well, DI adds value to data by improving its content (which may require an additional data quality solution) or by creating data structures that wouldn’t exist without DI (which is key to data warehousing).

Upon hearing the term data integration, people may think only of the technique they encounter most. For example, some data warehouse professionals believe that DI is synonymous with ETL, simply because ETL is the preferred form of DI (but not the only form) for data warehousing. Likewise, database administrators may think of replication and data synchronization, which are common in their work. And DI specialists who perform business-to-business data exchange may think of flat files communicated over file transfer protocol or electronic data interchange. In truth, all these techniques, and others, fit under the broad DI umbrella.
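
The collect, transform, and load sequence described under Number One can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a recommended implementation: the two sources, their field names, the common model (id, name, country), and the in-memory SQLite target are all hypothetical and do not come from the report.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export from a CRM system.
CRM_CSV = "cust_id,full_name,country\n1,Ada Lovelace,uk\n2,Grace Hopper,us\n"

# Hypothetical source 2: records from a billing application.
BILLING_ROWS = [
    {"customer": 3, "name": "  Alan Turing ", "country_code": "UK"},
]


def extract():
    """Collect raw records from both sources."""
    crm = list(csv.DictReader(io.StringIO(CRM_CSV)))
    return crm, BILLING_ROWS


def transform(crm, billing):
    """Map both source schemas onto one common model: (id, name, country)."""
    common = [
        (int(r["cust_id"]), r["full_name"].strip(), r["country"].upper())
        for r in crm
    ]
    common += [
        (int(r["customer"]), r["name"].strip(), r["country_code"].upper())
        for r in billing
    ]
    return common


def load(rows):
    """Load the integrated rows into a target database (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def run_pipeline():
    """Run extract, transform, and load in sequence; return the target connection."""
    crm, billing = extract()
    return load(transform(crm, billing))
```

Calling `run_pipeline()` returns a connection whose `customer` table holds records from both sources in one schema. A vendor tool or another DI technique would replace the hand-coded steps, but the three actions stay the same.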


Number Two
Data integration practices reach across both analytics and operations.

Data integration techniques are practiced in support of a variety of business initiatives and technology implementations. Hence, in many ways, DI best practices are defined by their associated initiatives and implementations. Figure 1 summarizes these initiatives and implementations in a visual taxonomy that reveals the three broad practice areas of DI:

Analytic data integration (AnDI) is where one or more DI techniques are applied in the context of business intelligence (BI) or data warehousing. Any DI technique can be used here, but common types include ETL loading a data warehouse and data federation refreshing report data. AnDI work is usually performed and maintained by the DI specialists or other members of a BI/data warehousing team.

Operational data integration (OpDI) involves the access and integration of data among operational applications and databases, whether within one organization or across multiple ones. There are three groups of project types within the practice of OpDI, namely various forms of data migration, data synchronization, and business-to-business data exchange. Operational DI work is usually performed by data specialists on teams for database administration or applications development.

Hybrid data integration (HyDI) practices fall in the middle ground between AnDI and OpDI. Hybrid DI includes master data management (MDM) and similar practices like customer data integration and product information management. These practices often mix analytic and operational functionality, and so are hybrids. According to TDWI survey data, most MDM and similar solutions are homegrown, built atop vendor tools for DI.

The three practice areas of DI span analytics and operations, plus the overlap between them. Across these practices, any DI technique and tool type may be used and all practices assume core skills for databases, data models, interfaces, and transformations.

Business Initiatives and Technology Implementations    Data Integration Practices
Data warehousing (DW)                                   Analytic data integration (AnDI)
Business intelligence (BI)                              Analytic data integration (AnDI)
Master data management (MDM)                            Hybrid data integration (HyDI)
Customer data integration (CDI)                         Hybrid data integration (HyDI)
Product information management (PIM)                    Hybrid data integration (HyDI)
Data migration                                          Operational data integration (OpDI)
Data synchronization                                    Operational data integration (OpDI)
Business-to-business (B2B) data exchange                Operational data integration (OpDI)

Figure 1. Data integration practice areas

Number Three
Data integration is an autonomous data management discipline.

Data integration’s autonomy is a relatively new (and still evolving) development. After all, DI has a long history of being staffed and managed by larger, related data management teams. For example, in some old-fashioned organizations, DI (especially the ETL technique) is still considered a subset of data warehousing or database administration. Luckily, DI can still be practiced successfully when subsumed by a larger team. But some organizations are moving toward independent teams of DI specialists who perform a wide range of DI work, whether analytic, operational, or hybridized.

Staffing aside, a number of trends are establishing the autonomy of DI as a data management discipline:

Data integration has reached critical mass. Today it consists of multiple techniques that are regularly applied to multiple applications and business contexts.

Vendor DI products have achieved maturity. Most are scalable platforms that support a rich and growing set of features and functions.

All DI practices are growing. According to TDWI survey data, OpDI is growing a bit faster than AnDI, but both are growing. Likewise, DI projects are becoming more varied, numerous, and sizable.

OpDI proves DI’s independence from DW. Likewise, AnDI proves DI’s independence from database administration. Data integration techniques are not as tied to specific practice areas and applications as they used to be.

Competency centers are the epitome of autonomous DI. Hundreds have sprung up in the last decade as shared-services organizations for staffing all DI work, not just AnDI for DW.

Given the growing amount and breadth of DI work, DI specialists and the people who depend on them need to rethink how they organize, staff, train, tool, and coordinate DI work. This is a time of great change for DI, and now’s the time to plan for DI’s future.


Number Four
Data integration is the repurposing of data via transformation.

True DI is about transforming data, as the T in ETL reminds us. The transformation can be simple, as when a federated table join changes source schema into a common data model so the tables can be merged. A transformation may also be complex, as when a legacy data set is completely remodeled during its migration to a modern database platform. Hence, DI is defined primarily by how it transforms data. But the access, copy, and transfer of data are secondary, as are the details of an individual DI solution, such as interface types and their speed or frequency of operation (based on data latency requirements).

Database replication is a possible exception to this rule. For example, replication is the preferred technology for database high availability. For a failing production database to fail over to a replica, the replica must be an identical, non-transformed copy. Even so, replication is a pliable technology that can also be configured to transform data lightly, as it does when synchronizing data among heterogeneous tables.

Let’s take a moment to consider why data must be transformed. It’s not just the technicalities of transforming data from one data model and shoehorning it into another. Equally important is the fact that the source and target IT systems involved in this kind of data transfer serve different business purposes. For example, the reporting and analysis purposes of BI are very different from the operational purposes of the systems from which data warehouse data came. As another example, consider that customer data is usually transformed as it is synchronized across multiple customer-facing applications, because these applications automate diverse business functions including fulfillment, financials, and customer service.

In summary, transforming data is a technical task that supports a business goal, namely repurposing data for a business use that differs from the one for which the data originated. When defining DI, stay focused on the value proposition seen in transforming and repurposing data; avoid definitions that stress the secondary access, copy, and transfer of data.

Number Five
Data integration is a value-adding process.

Think about how a manufacturing process consumes material in various states of rawness or completeness, processes the material to make it suited to a new purpose, and combines processed material into a product that’s more valuable than the original material. The data transformation and repurposing mentioned in the previous Checklist item have an effect similar to manufacturing, in that something truly new (and usually more valuable) results.

The value-adding process is evident in many use cases of DI:

• Consider the calculated values, aggregates, and dimensions found in the average data warehouse and similar databases (e.g., customer data hubs, operational data stores). These databases (all generated via DI) manage high-value data that doesn’t exist elsewhere, not even in the operational applications and other IT systems that serve as a source of raw material for DI.

• The complete view of a customer resulting from data sync or the quick multi-system snapshot produced by data federation are likewise unique data sets that offer a higher value than the data in its original, disparate state.

• Migrating a legacy database to a modern database platform offers ample opportunity for adding value. Add value by addressing common legacy problems, such as arcane hierarchies, poorly documented data dictionaries, multi-value fields, and various data quality issues.

• Data quality functions add value to data. For this reason, it’s common now for a DI tool to invoke a data quality tool to further raise the quality of data while it’s being integrated. Likewise, look for opportunities to improve data quality in every data integration job, routine, and data flow.

• Beyond data quality, a good DI program will also improve metadata and master data, plus related database features such as data dictionaries and metadata repositories.

Data integration specialists should always raise the bar by looking for ways to add further value to data as they integrate and repurpose it, whether the data integration solution is analytic, operational, or a hybrid.


Number Six
Data integration is a green technology that makes data management more sustainable.

Recent climate changes and the rising cost of electricity have led many people to revisit the sustainability of data centers. In response, corporations are reducing power consumption and the physical footprint of data centers and server rooms by consolidating redundant data and virtualizing hardware servers. Data integration tools and techniques are instrumental in the consolidations that make IT more sustainable.

The system consolidations typical of a green data center require data integration. That’s also true of similar project types such as the migration, collocation, or upgrade of applications and databases. Although a few application and database consolidations may be executed by simply copying data from source A to target B, the vast majority require that data be transformed and cleansed to better fit the target. That demands tools and techniques for DI.

Data integration is preferred over server virtualization, at least in the data tier. Virtualization is most often applied to application servers, which can be collocated and configured in a straightforward manner to share common memory space and other hardware resources. Data servers are a different matter entirely, because all are designed to seize every scrap of hardware resource. For this reason, the virtualization of multiple data servers is unlikely to yield desirable results. Therefore, in the data tier of the green data center, database consolidations and collocations (which are best accomplished with DI techniques) are preferred over true virtualization.

Data integration can make data management practices more sustainable. As enterprises automate more business processes with software, numerous online systems collect data at unprecedented and increasing rates. Driven by regulatory and legal requirements, some organizations retain almost all data, regardless of its age or usefulness. These two practices result in massive data volumes that are retained indefinitely, which in turn consumes resources such as server and storage hardware, administrative personnel, data center space, and electricity.

The data migration techniques of OpDI can consolidate and collocate redundant databases, thereby reducing the number of servers, plus the budgets and resources they consume. Furthermore, DI techniques such as data federation and data services can assemble data sets on the fly, as they are needed, without spawning new, permanent databases that burn up server resources.

Number Seven
A data integration solution should have architecture.

If you don’t fully embrace the existence of DI architecture, you can’t address how architecture affects DI’s scalability, staffing, cost, and ability to support real-time capability, MDM, services, and interoperability with other tools. All of these are worth addressing.

Recognize that DI architecture exists. Although it overlaps with data warehousing architecture and interacts with the entire BI technology stack, DI architecture is an autonomous structure required for an autonomous practice. After all, other types of IT solutions have architecture.

Adopt hub-and-spoke architecture for most DI implementations. The hub reduces the number of interfaces and provides a pattern that everyone can understand and be productive with. Hub-and-spoke architecture is also conducive to other worthy goals, such as reuse, productivity, collaboration, and consistent development standards. But there are many variations of hub-and-spoke structure, so you need to actively design an architecture for your DI solutions.

Don’t be dogmatic about hub-and-spoke architecture. Otherwise, you’ll heap a heavy workload on the hub. To accommodate large data volumes or complex transformational processing, distribute the workload beyond the hub.

Embrace services. A DI service extends existing hub-and-spoke architectures with new interfaces, so DI hubs can embed functions into a wide range of traditional and composite application architectures.
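
The statement under Number Seven that a hub reduces the number of interfaces can be illustrated with simple arithmetic. The sketch below rests on a worst-case assumption that is not from the report: every system exchanges data with every other system, with one bidirectional interface per pair in the point-to-point case and one interface per system in the hub case. The function names are hypothetical.

```python
def point_to_point_interfaces(n_systems: int) -> int:
    """Interfaces needed if every pair of systems is connected directly."""
    return n_systems * (n_systems - 1) // 2


def hub_and_spoke_interfaces(n_systems: int) -> int:
    """Interfaces needed if every system connects only to a central hub."""
    return n_systems


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, point_to_point_interfaces(n), hub_and_spoke_interfaces(n))
```

Under this assumption, 10 systems need 45 direct interfaces but only 10 spokes, and the gap widens quadratically as systems are added. Real environments are rarely fully connected, so the actual saving depends on how many pairs of systems truly exchange data.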


Number Eight
A data integration solution should be the product of collaboration.

TDWI Research defines collaborative data integration as a collection of user best practices and software tool functions that foster collaboration among the growing number of technical and business people involved in DI projects and initiatives. In a recent TDWI survey, two-thirds of organizations surveyed reported that collaboration is required for DI.

Recognize that DI has collaborative requirements. The greater the number of DI specialists and people who work closely with them, the greater the need for collaboration around DI. Head count aside, the need is also intensified by the geographic dispersion of team members, as well as new requirements for regulatory compliance and data governance.

Embrace collaboration for its benefits. The leading benefits of collaborative DI are its support for governance, business visibility into integration projects, reuse in development, and increased options for IT management. Collectively, these benefits enable all team members to see the big picture, instead of just their own individual project pieces.

Determine an appropriate scope for collaboration. Each DI project has its own combination of business and technology people who are stakeholders, as summarized in Figure 2. So you may need to define the scope of each collaborative DI project separately.

Support DI’s collaboration with organizational structures. These structures can be technology focused (such as data management groups), business driven (data stewardship and governance), or a hybrid of the two (BI teams and competency centers).

[Figure: COLLABORATION at the center, surrounded by data integration specialists, other data management specialists, hybrid business/technology colleagues, and various business people.]

Figure 2. People involved in collaborative DI

Number Nine
Data integration must be coordinated with other data management disciplines.

In most organizations today, data and other information are managed in isolated silos by independent teams using various data management tools for DI, data quality, data governance and stewardship, metadata and MDM, database administration, data architecture, and so on. In response to this situation, some organizations are adopting enterprise data management (EDM), a best practice for coordinating diverse data management disciplines, so that data is managed according to enterprisewide goals that promote technical efficiencies and support strategic, data-oriented business goals.

In many ways, EDM is similar to collaborative DI, except that EDM involves several data management disciplines, not just DI. Furthermore, EDM demands far greater guidance from business management, so that all data management work is aligned to support strategic, data-driven business objectives, including fully informed operational excellence and BI, plus related goals in governance and compliance. The challenge of EDM is to balance its two important goals: uniting multiple data management practices and aligning them with business goals that depend on data for success.

Data integration plays several critical roles in EDM:

Core practices. The six most common core practices in EDM are BI/DW, data quality, MDM, data governance, DI, and enterprise data architecture. These are considered core because they result in high-profile solutions, which makes them a greater priority than other data management practices. Note that DI is a prominent core practice that must be coordinated with other data management practices.

Supporting practices. Common supporting practices for EDM include metadata management, data stewardship, data modeling, data profiling, data federation, and data glossaries. These aren’t as high-profile or sexy as the core practices, but they still provide essential functionality, without which data management wouldn’t be possible. Data integration depends heavily on all supporting practices, and good DI tools include functions for most of these practices.


Infrastructure. Enterprise data management doesn’t have its own autonomous infrastructure. Instead, EDM depends on the servers and connectivity of the data management tools it’s unifying, plus shared enterprise infrastructure for integration middleware, various types of buses, and services. Note that DI tools, servers, and interfaces contribute substantially to EDM’s composite infrastructure. [1]

[1] For more information about coordinating data management tools and techniques, see the TDWI Checklist Report on enterprise data management, available at tdwi.org/research.

Number Ten
Data integration should be governed, but also contribute to governance.

When executed broadly, data governance (DG) influences almost all data management practices, including DI, quality, warehousing, standards, administration, architecture, and so on. Data governance typically requires that adjustments be made in these practices, in support of the data usage policies developed by the DG board.

Data integration must be governed. Data governance boards regularly decide to control DI solutions, simply because much of an enterprise’s data flows through DI’s tools, servers, interfaces, and infrastructure. After all, most DG policies limit data access and tighten controls on data usage. This is especially true when DG is driven by issues in regulatory compliance, data security, and data privacy.

Data integration teams can broaden their reach via DG. Ironically, a DG board may also loosen its control of data. For example, most DG boards provide procedures through which technical personnel (say, DI specialists) can request access to data owned by another organization. If the request is granted, DG helps DI cast an ever-widening net for data to support project types that are analytic (feeding a data warehouse), operational (consolidating database instances), or cross-business (sharing data with partners). Data governance can both limit these projects to assure compliance and liberate them to reach more data sources and targets. A data governance program must strike a pragmatic balance between these competing goals.

Data governance can assist with DI standards. Data governance typically focuses on compliance issues from a business perspective, yet it can be stretched to become a collaborative mechanism for cross-team technology issues. For example, the review process of DG can handle technical proposals for DI development standards, preferred interfaces, data exchange standards, and so on.

Data integration infrastructure helps automate DG processes. Data governance is mostly about people and processes. Yet DG needs help from software automation, if it’s to scale up to govern numerous business initiatives and technical implementations. Mature DI tools today have a few functions that lend themselves to governance activities, such as metadata management (to inventory enterprise data assets), data profiling (to gauge the condition of such assets), and data monitoring (to track access and usage of enterprise data).
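
The data profiling function mentioned under Number Ten (gauging the condition of data assets) can be pictured with a small sketch. This is a minimal illustration, not how any particular DI tool works: the `profile` function, the chosen metrics (null fraction and distinct count), and the sample records are all hypothetical.

```python
def profile(records):
    """Return, per column, the row count, null fraction, and distinct non-null values.

    A value counts as null here if it is None or an empty string.
    """
    columns = sorted({key for rec in records for key in rec})
    total = len(records)
    report = {}
    for col in columns:
        values = [rec.get(col) for rec in records]
        non_null = [v for v in values if v not in (None, "")]
        report[col] = {
            "rows": total,
            "null_fraction": (total - len(non_null)) / total if total else 0.0,
            "distinct": len(set(non_null)),
        }
    return report


# Hypothetical sample of customer records with some missing values.
SAMPLE = [
    {"id": 1, "email": "a@example.com", "country": "UK"},
    {"id": 2, "email": "", "country": "UK"},
    {"id": 3, "email": None, "country": "US"},
    {"id": 4, "email": "d@example.com", "country": None},
]

if __name__ == "__main__":
    for column, stats in profile(SAMPLE).items():
        print(column, stats)
```

A governance board could use figures like these to decide which data sets need stewardship attention before they are shared more widely.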


About Our Sponsor

SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market. Through innovative solutions delivered within an integrated framework, SAS offers unparalleled data management capabilities, including data integration and data quality, is the market leader in predictive analytics and provides sophisticated reporting. SAS helps customers at more than 45,000 sites improve performance and deliver value by making better decisions faster. Since 1976 SAS has been giving customers around the world THE POWER TO KNOW®.

www.sas.com

About the Author

Philip Russom is the senior manager of TDWI Research at The Data Warehousing Institute (TDWI), where he oversees many of TDWI’s research-oriented publications, services, and events. He’s been an industry analyst at Forrester Research, Giga Information Group, and Hurwitz Group, where he researched, wrote, spoke, and consulted about BI issues. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at prussom@tdwi.org.

About TDWI Research

TDWI Research provides research and advice for BI professionals worldwide. TDWI Research focuses exclusively on BI/DW issues and teams up with industry practitioners to deliver both broad and deep understanding of the business and technical issues surrounding the deployment of business intelligence and data warehousing solutions. TDWI Research offers reports, commentary, and inquiry services via a worldwide Membership program and provides custom research, benchmarking, and strategic planning services to user and vendor organizations.

About the TDWI Checklist Report Series

TDWI Checklist Reports provide an overview of success factors for specific projects in business intelligence, data warehousing, or related data management disciplines. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.
